typeof(missing)
Missing
NA
s and Missing ValuesIn this tutorial, we will learn how to handle missing values in DataFrame
s, commonly called as NA
s in R. If you come from R, there will be a lot of examples and comparisons between R and Julia.
Missing
typeR has several missing types: NA
, NA_integer_
, NA_real_
, NA_character_
and NA_complex_
. Julia, on the contrary, has only one: Missing
:
typeof(missing)
Missing
You can see that the type is Missing
with title case and the instance of it is missing
with lowercase. Most of the time you’ll be wanting to use the instantiation of a Missing
type: missing
.
missing
values propagates through several operations, specially mathematical ones.
For example, addition, subtraction, multiplication and division:
missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing
They also propagate through equality and comparison operators:
missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing
Be cautious with comparisons and testing equalities in the presence of missing values.
For equality testing use instead the ismissing()
function.
missing
s in JuliaWe will focus on 3 ways of handling missing data:
missing
s with ismissing()
and dropmissing()
missing
’s values with coalesce()
missing
s with skipmissing()
First, let’s import the packages with the using
statement:
using CSV
using PharmaDatasets
using DataFramesMeta
We’ll use a dataset that has a lot of missing values as strings with a dot ("."
).
Notice that the column type that has missing
values will display a ?
after their inferred type in the subheader of the rendered DataFrame
. For example, Float64?
type actually means Union{Missing, Float64}
type, in other words, it can be either a Float64
or a Missing
.
Remember that in Julia there are no special missing
values for different types.
Hover over the column headers in the DataFrame
below to see the underlying types.
= CSV.read(
df dataset("pumas_tutorials/iv_sd_final_target_format", String),
DataFrame;= ".",
missingstring
)first(df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
Missing
s with dropmissing()
and ismissing()
Most of the time we want to remove missing values from our data.
Removing missing
s can be done with two options.
The first is the dropmissing()
function from DataFrames.jl
. Like all DataFrames.jl
functions, it accepts a DataFrame
as the first argument followed by which columns you’ll want to remove the missing data from.
By default, if you do not specify a column it will remove any observation (row) that has missing
values:
dropmissing(df)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 |
Since all rows had at least one missing
value across multiple columns, we get back an empty DataFrame
.
Now, let’s restrict the columns that we want to remove the missing
values from. You can either pass a single column as a Symbol
or several columns as a vector of Symbol
s:
= dropmissing(df, :AMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
= dropmissing(df, [:AMT, :CMT])
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
You can use any of the selectors that you’ve learned in the previous tutorial on DataFramesMeta.jl
to select which columns you’ll want to drop the missing
values.
For instance, you can use Cols
, All
, Between
, Not
or a regular expression with the string literal r"[regex]"
.
The second option is the ismissing()
function applied to a filtering (subsetting) routine. ismissing(x)
is a boolean function that returns true
if x
has the type Missing
, and false otherwise.
You can do whatever pleases you:
ismissing()
to just keep the missing
values!ismissing()
(the negation) to keep anything but the missing
valuesMissing values comparisons in Julia are the same as in R. If you compare a missing
with any value it will be a missing
:
> 1 == missing
juliamissing
This is the same in R:
> 1 == NA
1] NA [
This is why it is important to use the ismissing()
function instead of the ==
for comparisons.
= @rsubset df ismissing(:AMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
= @rsubset df !ismissing(:AMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
You can also pick a different filtering macro:
@subset
with a vectorized ismissing()
@rsubset
with a simple ismissing()
= @subset df ismissing.(:AMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
= @subset df @byrow ismissing(:AMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
You can chain together several conditions with the AND &&
or the OR ||
operators inside the @[r]subset
macro (watch out for precedence by using appropriate parentheses ()
):
@rsubset df (ismissing(:col1) && ismissing(:col2)) || ismissing(:col3)
Alternatively, if everything is AND:
@rsubset df begin
ismissing(:col1)
ismissing(:col2)
ismissing(:col3)
end
Missing
’s values with coalesce()
A common data wrangling pattern is to replace or fill missing values.
Like R (and SQL), Julia has the coalesce()
function. We often use it in a vectorized way over an array to fill all missing
values with a specific value.
For example, here’s a vector containing two missing
values that we apply a broadcasted coalesce()
function to replace all missing
s into 0
s:
coalesce.([missing, "some_value", missing], 0)
3-element Vector{Any}:
0
"some_value"
0
We could do the same with one or more columns in a DataFrame
, but coalescing it to 2
instead of 0
:
= @rtransform df :CMT = coalesce(:CMT, 2)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
Remember that the coalesce()
function is only defined on single elements.
If you want to apply it on a vector you’ll need to broadcast it, i.e. coalesce.(v)
where v
is a vector.
If we want to apply it to a DataFrame
column, we need to use the “row” macros, i.e. @rtransform[!]
or @rselect[!]
; or we could use the @byrow
macro in a “non-row” macro, i.e. @transform[!]
or @select[!]
.
Missing
s with skipmissing()
Suppose that you’ll want to compute a sum where one or more elements are missing
values. Same as in R, this wouldn’t work in Julia:
sum([1, 2, missing])
missing
In order to have functions like sum
to ignore missing
values in their arguments, you’ll need to use the skipmissing()
function.
skipmissing()
technically returns an iterator that skips over the missing
values. This is handy for all collections, such as matrices and vectors, that have missing
values.
The previous example can be computed with:
sum(skipmissing([1, 2, missing]))
3
We can use skipmissing()
easily in DataFrame
s. For example, suppose you’ll want to compute a value that is the sum of two or more columns:
= @rtransform df :SOME_SUM = sum(skipmissing([:EVID, :CMT]))
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT | SOME_SUM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | Int64 | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 | 2 |
2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
Missing
s with @passmissing
If you want to “propagate” missing
values in DataFramesMeta.jl
transformations with @[r]transform[!]
you’ll need to use the macro @passmissing
. It will not let transformations that would fail in the presence of missing
values but instead will return missing
.
For example, consider the following transformation on the :CMT
column trying to parse it as a Bool
:
= @rtransform df :CMT_Bool = convert(Bool, :CMT)
my_df first(my_df, 5)
MethodError: MethodError(convert, (Bool, missing), 0x0000000000007b33)
MethodError: Cannot `convert` an object of type Missing to an object of type Bool
Closest candidates are:
convert(::Type{T}, !Matched::T) where T
@ Base Base.jl:84
convert(::Type{T}, !Matched::AbstractChar) where T<:Number
@ Base char.jl:185
convert(::Type{T}, !Matched::Unitful.Level) where T<:Real
@ Unitful ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GYzMo/src/logarithm.jl:22
...
Stacktrace:
[1] (::var"#5#6")(228::Missing)
@ Main.Notebook ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/parsing.jl:303
[2] iterate
@ ./generator.jl:47 [inlined]
[3] collect_to!(dest::Vector{Bool}, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, offs::Int64, st::Tuple{Base.OneTo{Int64}, Int64})
@ Base ./array.jl:892
[4] collect_to_with_first!
@ ./array.jl:870 [inlined]
[5] _collect(c::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, ::Base.EltypeUnknown, isz::Base.HasShape{1})
@ Base ./array.jl:864
[6] collect_similar
@ ./array.jl:763 [inlined]
[7] map
@ ./abstractarray.jl:3285 [inlined]
[8] (::ByRow{var"#5#6"})(cols::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}})
@ Tables ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Tables/8p03y/src/utils.jl:229
[9] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any})
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:562
[10] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
[11] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
[12] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
[13] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
[14] select
@ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303 [inlined]
[15] transform(df::DataFrame, args::Any)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383
[16] top-level scope
@ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/macros.jl:1599
It fails because Julia cannot convert a Missing
into a Bool
. But if we add the @passmissing
macro, we do not get an error and missing
values get propagated forward in the transformation:
= @rtransform df @passmissing :CMT_Bool = convert(Bool, :CMT)
my_df first(my_df, 5)
Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT | CMT_Bool |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | Bool? | |
1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 | true |
2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |