typeof(missing)Missing
NAs and Missing ValuesIn this tutorial, we will learn how to handle missing values in DataFrames, commonly called as NAs in R. If you come from R, there will be a lot of examples and comparisons between R and Julia.
Missing typeR has several missing types: NA, NA_integer_, NA_real_, NA_character_ and NA_complex_. Julia, on the contrary, has only one: Missing:
typeof(missing)Missing
You can see that the type is Missing with title case and the instance of it is missing with lowercase. Most of the time you’ll be wanting to use the instantiation of a Missing type: missing.
missing values propagates through several operations, specially mathematical ones.
For example, addition, subtraction, multiplication and division:
missing + 1missing
missing - 1missing
missing * 1missing
missing / 1missing
They also propagate through equality and comparison operators:
missing == 1missing
missing == missingmissing
missing > 1missing
missing > missingmissing
Be cautious with comparisons and testing equalities in the presence of missing values.
For equality testing use instead the ismissing() function.
missings in JuliaWe will focus on 3 ways of handling missing data:
missings with ismissing() and dropmissing()missing’s values with coalesce()missings with skipmissing()First, let’s import the packages with the using statement:
using CSV
using PharmaDatasets
using DataFramesMetaWe’ll use a dataset that has a lot of missing values as strings with a dot (".").
Notice that the column type that has missing values will display a ? after their inferred type in the subheader of the rendered DataFrame. For example, Float64? type actually means Union{Missing, Float64} type, in other words, it can be either a Float64 or a Missing.
Remember that in Julia there are no special missing values for different types.
Hover over the column headers in the DataFrame below to see the underlying types.
df = CSV.read(
dataset("pumas_tutorials/iv_sd_final_target_format", String),
DataFrame;
missingstring = ".",
)
first(df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
| 2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
Missings with dropmissing() and ismissing()Most of the time we want to remove missing values from our data.
Removing missings can be done with two options.
The first is the dropmissing() function from DataFrames.jl. Like all DataFrames.jl functions, it accepts a DataFrame as the first argument followed by which columns you’ll want to remove the missing data from.
By default, if you do not specify a column it will remove any observation (row) that has missing values:
dropmissing(df)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 |
Since all rows had at least one missing value across multiple columns, we get back an empty DataFrame.
Now, let’s restrict the columns that we want to remove the missing values from. You can either pass a single column as a Symbol or several columns as a vector of Symbols:
my_df = dropmissing(df, :AMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
| 2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
| 3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
| 4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
| 5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
my_df = dropmissing(df, [:AMT, :CMT])
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
| 2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
| 3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
| 4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
| 5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
You can use any of the selectors that you’ve learned in the previous tutorial on DataFramesMeta.jl to select which columns you’ll want to drop the missing values.
For instance, you can use Cols, All, Between, Not or a regular expression with the string literal r"[regex]".
The second option is the ismissing() function applied to a filtering (subsetting) routine. ismissing(x) is a boolean function that returns true if x has the type Missing, and false otherwise.
You can do whatever pleases you:
ismissing() to just keep the missing values!ismissing() (the negation) to keep anything but the missing valuesMissing values comparisons in Julia are the same as in R. If you compare a missing with any value it will be a missing:
julia> 1 == missing
missingThis is the same in R:
> 1 == NA
[1] NAThis is why it is important to use the ismissing() function instead of the == for comparisons.
my_df = @rsubset df ismissing(:AMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
my_df = @rsubset df !ismissing(:AMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
| 2 | 2 | 0.0 | missing | 100 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | 1 |
| 3 | 3 | 0.0 | missing | 100 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | 1 |
| 4 | 4 | 0.0 | missing | 100 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | 1 |
| 5 | 5 | 0.0 | missing | 100 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 1 | 1 |
You can also pick a different filtering macro:
@subset with a vectorized ismissing()@rsubset with a simple ismissing()my_df = @subset df ismissing.(:AMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
my_df = @subset df @byrow ismissing(:AMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | |
| 1 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 2 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 3 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 4 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
| 5 | 1 | 1.5 | 13.146 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing |
You can chain together several conditions with the AND && or the OR || operators inside the @[r]subset macro (watch out for precedence by using appropriate parentheses ()):
@rsubset df (ismissing(:col1) && ismissing(:col2)) || ismissing(:col3)Alternatively, if everything is AND:
@rsubset df begin
ismissing(:col1)
ismissing(:col2)
ismissing(:col3)
endMissing’s values with coalesce()A common data wrangling pattern is to replace or fill missing values.
Like R (and SQL), Julia has the coalesce() function. We often use it in a vectorized way over an array to fill all missing values with a specific value.
For example, here’s a vector containing two missing values that we apply a broadcasted coalesce() function to replace all missings into 0s:
coalesce.([missing, "some_value", missing], 0)3-element Vector{Any}:
0
"some_value"
0
We could do the same with one or more columns in a DataFrame, but coalescing it to 2 instead of 0:
my_df = @rtransform df :CMT = coalesce(:CMT, 2)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64 | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 |
| 2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
| 3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
| 4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
| 5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | 2 |
Remember that the coalesce() function is only defined on single elements.
If you want to apply it on a vector you’ll need to broadcast it, i.e. coalesce.(v) where v is a vector.
If we want to apply it to a DataFrame column, we need to use the “row” macros, i.e. @rtransform[!] or @rselect[!]; or we could use the @byrow macro in a “non-row” macro, i.e. @transform[!] or @select[!].
Missings with skipmissing()Suppose that you’ll want to compute a sum where one or more elements are missing values. Same as in R, this wouldn’t work in Julia:
sum([1, 2, missing])missing
In order to have functions like sum to ignore missing values in their arguments, you’ll need to use the skipmissing() function.
skipmissing() technically returns an iterator that skips over the missing values. This is handy for all collections, such as matrices and vectors, that have missing values.
The previous example can be computed with:
sum(skipmissing([1, 2, missing]))3
We can use skipmissing() easily in DataFrames. For example, suppose you’ll want to compute a value that is the sum of two or more columns:
my_df = @rtransform df :SOME_SUM = sum(skipmissing([:EVID, :CMT]))
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT | SOME_SUM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | Int64 | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 | 2 |
| 2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
| 3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
| 4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
| 5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | 0 |
Missings with @passmissingIf you want to “propagate” missing values in DataFramesMeta.jl transformations with @[r]transform[!] you’ll need to use the macro @passmissing. It will not let transformations that would fail in the presence of missing values but instead will return missing.
For example, consider the following transformation on the :CMT column trying to parse it as a Bool:
my_df = @rtransform df :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)MethodError: Cannot `convert` an object of type Missing to an object of type Bool The function `convert` exists, but no method is defined for this combination of argument types. Closest candidates are: convert(::Type{T}, ::Unitful.Quantity) where T<:Real @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/conversion.jl:167 convert(::Type{T}, ::Unitful.Level) where T<:Real @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/logarithm.jl:46 convert(::Type{T}, ::Unitful.Gain) where T<:Real @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/logarithm.jl:87 ... Stacktrace: [1] (::var"#5#6")(233::Missing) @ Main.Notebook ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/parsing.jl:303 [2] iterate @ ./generator.jl:48 [inlined] [3] collect_to! @ ./array.jl:849 [inlined] [4] collect_to_with_first!(dest::Vector{Bool}, v1::Bool, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, st::Tuple{Base.OneTo{Int64}, Int64}) @ Base ./array.jl:827 [5] _collect(c::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, ::Base.EltypeUnknown, isz::Base.HasShape{1}) @ Base ./array.jl:821 [6] collect_similar @ ./array.jl:720 [inlined] [7] map @ ./abstractarray.jl:3371 [inlined] [8] (::ByRow{var"#5#6"})(cols::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}) @ Tables ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Tables/cRTb7/src/utils.jl:229 [9] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any}) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:562 [10] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805 [11] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783 [12] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703 [13] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303 [14] select @ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303 [inlined] [15] transform(df::DataFrame, args::Any) @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383 [16] top-level scope @ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/macros.jl:1599
It fails because Julia cannot convert a Missing into a Bool. But if we add the @passmissing macro, we do not get an error and missing values get propagated forward in the transformation:
my_df = @rtransform df @passmissing :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)| Row | ID | TIME_AFTER_DOSING | CONC | AMT | AGE | WEIGHT | SCR | ISMALE | eGFR | EVID | CMT | CMT_Bool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Int64 | Float64 | Float64? | Int64? | Float64 | Float64 | Float64 | Int64 | Float64 | Int64 | Int64? | Bool? | |
| 1 | 1 | 0.0 | missing | 100 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 1 | 1 | true |
| 2 | 1 | 0.25 | 13.026 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
| 3 | 1 | 0.5 | 14.984 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
| 4 | 1 | 0.75 | 14.16 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |
| 5 | 1 | 1.0 | 19.316 | missing | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | missing | missing |