Handling `NA`s and Missing Values

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

In this tutorial, we will learn how to handle missing values in DataFrames, commonly called as NAs in R. If you come from R, there will be a lot of examples and comparisons between R and Julia.

1 ❓ `Missing` type

R has several missing types: NA, NA_integer_, NA_real_, NA_character_ and NA_complex_. Julia, on the contrary, has only one: Missing:

typeof(missing)

Missing

Tip

You can see that the type is Missing with title case and the instance of it is missing with lowercase. Most of the time you’ll be wanting to use the instantiation of a Missing type: missing.

missing values propagates through several operations, specially mathematical ones.

For example, addition, subtraction, multiplication and division:

missing + 1

missing

missing - 1

missing

missing * 1

missing

missing / 1

missing

They also propagate through equality and comparison operators:

missing == 1

missing

missing == missing

missing

missing > 1

missing

missing > missing

missing

Caution

Be cautious with comparisons and testing equalities in the presence of missing values.

For equality testing use instead the ismissing() function.

2 👷 Handling `missing`s in Julia

We will focus on 3 ways of handling missing data:

Filtering missings with ismissing() and dropmissing()
Filling or Replacing missing’s values with coalesce()
Skipping missings with skipmissing()

First, let’s import the packages with the using statement:

using CSV
using PharmaDatasets
using DataFramesMeta

We’ll use a dataset that has a lot of missing values as strings with a dot (".").

Notice that the column type that has missing values will display a ? after their inferred type in the subheader of the rendered DataFrame. For example, Float64? type actually means Union{Missing, Float64} type, in other words, it can be either a Float64 or a Missing.

Remember that in Julia there are no special missing values for different types.

Tip

Hover over the column headers in the DataFrame below to see the underlying types.

df = CSV.read(
    dataset("pumas_tutorials/iv_sd_final_target_format", String),
    DataFrame;
    missingstring = ".",
)
first(df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1
2	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing
3	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing
4	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing
5	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing

2.1 Filtering `Missing`s with `dropmissing()` and `ismissing()`

Most of the time we want to remove missing values from our data.

Removing missings can be done with two options.

The first is the dropmissing() function from DataFrames.jl. Like all DataFrames.jl functions, it accepts a DataFrame as the first argument followed by which columns you’ll want to remove the missing data from.

By default, if you do not specify a column it will remove any observation (row) that has missing values:

dropmissing(df)

0×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64	Int64	Float64	Float64	Float64	Int64	Float64	Int64	Int64

Since all rows had at least one missing value across multiple columns, we get back an empty DataFrame.

Now, let’s restrict the columns that we want to remove the missing values from. You can either pass a single column as a Symbol or several columns as a vector of Symbols:

my_df = dropmissing(df, :AMT)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1
2	2	0.0	missing	100	32.765	74.838	0.8846	1	126.0	1	1
3	3	0.0	missing	100	35.974	37.303	1.1004	1	48.981	1	1
4	4	0.0	missing	100	38.206	32.969	1.1972	1	38.934	1	1
5	5	0.0	missing	100	33.559	47.139	1.5924	0	37.198	1	1

my_df = dropmissing(df, [:AMT, :CMT])
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64	Float64	Float64	Float64	Int64	Float64	Int64	Int64
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1
2	2	0.0	missing	100	32.765	74.838	0.8846	1	126.0	1	1
3	3	0.0	missing	100	35.974	37.303	1.1004	1	48.981	1	1
4	4	0.0	missing	100	38.206	32.969	1.1972	1	38.934	1	1
5	5	0.0	missing	100	33.559	47.139	1.5924	0	37.198	1	1

Note

You can use any of the selectors that you’ve learned in the previous tutorial on DataFramesMeta.jl to select which columns you’ll want to drop the missing values.

For instance, you can use Cols, All, Between, Not or a regular expression with the string literal r"[regex]".

The second option is the ismissing() function applied to a filtering (subsetting) routine. ismissing(x) is a boolean function that returns true if x has the type Missing, and false otherwise.

You can do whatever pleases you:

ismissing() to just keep the missing values
the opposite, !ismissing() (the negation) to keep anything but the missing values

Tip

Missing values comparisons in Julia are the same as in R. If you compare a missing with any value it will be a missing:

julia> 1 == missing
missing

This is the same in R:

> 1 == NA
[1] NA

This is why it is important to use the ismissing() function instead of the == for comparisons.

my_df = @rsubset df ismissing(:AMT)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing
2	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing
3	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing
4	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing
5	1	1.5	13.146	missing	34.823	38.212	1.1129	0	42.635	0	missing

my_df = @rsubset df !ismissing(:AMT)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1
2	2	0.0	missing	100	32.765	74.838	0.8846	1	126.0	1	1
3	3	0.0	missing	100	35.974	37.303	1.1004	1	48.981	1	1
4	4	0.0	missing	100	38.206	32.969	1.1972	1	38.934	1	1
5	5	0.0	missing	100	33.559	47.139	1.5924	0	37.198	1	1

You can also pick a different filtering macro:

@subset with a vectorized ismissing()
@rsubset with a simple ismissing()

my_df = @subset df ismissing.(:AMT)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing
2	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing
3	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing
4	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing
5	1	1.5	13.146	missing	34.823	38.212	1.1129	0	42.635	0	missing

my_df = @subset df @byrow ismissing(:AMT)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?
1	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing
2	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing
3	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing
4	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing
5	1	1.5	13.146	missing	34.823	38.212	1.1129	0	42.635	0	missing

Tip

You can chain together several conditions with the AND && or the OR || operators inside the @[r]subset macro (watch out for precedence by using appropriate parentheses ()):

@rsubset df (ismissing(:col1) && ismissing(:col2)) || ismissing(:col3)

Alternatively, if everything is AND:

@rsubset df begin
    ismissing(:col1)
    ismissing(:col2)
    ismissing(:col3)
end

2.2 Filling `Missing`’s values with `coalesce()`

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the coalesce() function. We often use it in a vectorized way over an array to fill all missing values with a specific value.

For example, here’s a vector containing two missing values that we apply a broadcasted coalesce() function to replace all missings into 0s:

coalesce.([missing, "some_value", missing], 0)

3-element Vector{Any}:
 0
  "some_value"
 0

We could do the same with one or more columns in a DataFrame, but coalescing it to 2 instead of 0:

my_df = @rtransform df :CMT = coalesce(:CMT, 2)
first(my_df, 5)

5×11 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1
2	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	2
3	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	2
4	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	2
5	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	2

Caution

Remember that the coalesce() function is only defined on single elements.

If you want to apply it on a vector you’ll need to broadcast it, i.e. coalesce.(v) where v is a vector.

If we want to apply it to a DataFrame column, we need to use the “row” macros, i.e. @rtransform[!] or @rselect[!]; or we could use the @byrow macro in a “non-row” macro, i.e. @transform[!] or @select[!].

2.3 Removing `Missing`s with `skipmissing()`

Suppose that you’ll want to compute a sum where one or more elements are missing values. Same as in R, this wouldn’t work in Julia:

sum([1, 2, missing])

missing

In order to have functions like sum to ignore missing values in their arguments, you’ll need to use the skipmissing() function.

skipmissing() technically returns an iterator that skips over the missing values. This is handy for all collections, such as matrices and vectors, that have missing values.

The previous example can be computed with:

sum(skipmissing([1, 2, missing]))

We can use skipmissing() easily in DataFrames. For example, suppose you’ll want to compute a value that is the sum of two or more columns:

my_df = @rtransform df :SOME_SUM = sum(skipmissing([:EVID, :CMT]))
first(my_df, 5)

5×12 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT	SOME_SUM
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?	Int64
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1	2
2	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing	0
3	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing	0
4	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing	0
5	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing	0

2.4 Propagate `Missing`s with `@passmissing`

If you want to “propagate” missing values in DataFramesMeta.jl transformations with @[r]transform[!] you’ll need to use the macro @passmissing. It will not let transformations that would fail in the presence of missing values but instead will return missing.

For example, consider the following transformation on the :CMT column trying to parse it as a Bool:

my_df = @rtransform df :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)

MethodError: Cannot `convert` an object of type Missing to an object of type Bool
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  convert(::Type{T}, ::Unitful.Quantity) where T<:Real
   @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/conversion.jl:167
  convert(::Type{T}, ::Unitful.Level) where T<:Real
   @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/logarithm.jl:46
  convert(::Type{T}, ::Unitful.Gain) where T<:Real
   @ Unitful ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Unitful/GMTL8/src/logarithm.jl:87
  ...

Stacktrace:
  [1] (::var"#5#6")(233::Missing)
    @ Main.Notebook ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/parsing.jl:303
  [2] iterate
    @ ./generator.jl:48 [inlined]
  [3] collect_to!
    @ ./array.jl:849 [inlined]
  [4] collect_to_with_first!(dest::Vector{Bool}, v1::Bool, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, st::Tuple{Base.OneTo{Int64}, Int64})
    @ Base ./array.jl:827
  [5] _collect(c::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, itr::Base.Generator{SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}}, var"#5#6"}, ::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:821
  [6] collect_similar
    @ ./array.jl:720 [inlined]
  [7] map
    @ ./abstractarray.jl:3371 [inlined]
  [8] (::ByRow{var"#5#6"})(cols::SentinelArrays.SentinelVector{Int64, Int64, Missing, Vector{Int64}})
    @ Tables ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Tables/cRTb7/src/utils.jl:229
  [9] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any})
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:562
 [10] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
 [11] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
 [12] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
 [13] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
 [14] select
    @ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303 [inlined]
 [15] transform(df::DataFrame, args::Any)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383
 [16] top-level scope
    @ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/macros.jl:1599

It fails because Julia cannot convert a Missing into a Bool. But if we add the @passmissing macro, we do not get an error and missing values get propagated forward in the transformation:

my_df = @rtransform df @passmissing :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)

5×12 DataFrame

Row	ID	TIME_AFTER_DOSING	CONC	AMT	AGE	WEIGHT	SCR	ISMALE	eGFR	EVID	CMT	CMT_Bool
	Int64	Float64	Float64?	Int64?	Float64	Float64	Float64	Int64	Float64	Int64	Int64?	Bool?
1	1	0.0	missing	100	34.823	38.212	1.1129	0	42.635	1	1	true
2	1	0.25	13.026	missing	34.823	38.212	1.1129	0	42.635	0	missing	missing
3	1	0.5	14.984	missing	34.823	38.212	1.1129	0	42.635	0	missing	missing
4	1	0.75	14.16	missing	34.823	38.212	1.1129	0	42.635	0	missing	missing
5	1	1.0	19.316	missing	34.823	38.212	1.1129	0	42.635	0	missing	missing

Reuse

CC BY-SA 4.0

1 ❓ Missing type

2 👷 Handling missings in Julia

2.1 Filtering Missings with dropmissing() and ismissing()

2.2 Filling Missing’s values with coalesce()

2.3 Removing Missings with skipmissing()

2.4 Propagate Missings with @passmissing

Reuse

1 ❓ `Missing` type

2 👷 Handling `missing`s in Julia

2.1 Filtering `Missing`s with `dropmissing()` and `ismissing()`

2.2 Filling `Missing`’s values with `coalesce()`

2.3 Removing `Missing`s with `skipmissing()`

2.4 Propagate `Missing`s with `@passmissing`