Handling NAs and Missing Values

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

In this tutorial, we will learn how to handle missing values in DataFrames, commonly called as NAs in R. If you come from R, there will be a lot of examples and comparisons between R and Julia.

1Missing type

R has several missing types: NA, NA_integer_, NA_real_, NA_character_ and NA_complex_. Julia, on the contrary, has only one: Missing:

typeof(missing)
Missing
Tip

You can see that the type is Missing with title case and the instance of it is missing with lowercase. Most of the time you’ll be wanting to use the instantiation of a Missing type: missing.

missing values propagates through several operations, specially mathematical ones.

For example, addition, subtraction, multiplication and division:

missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing

They also propagate through equality and comparison operators:

missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing
Caution

Be cautious with comparisons and testing equalities in the presence of missing values.

For equality testing use instead the ismissing() function.

2 👷 Handling missings in Julia

We will focus on 3 ways of handling missing data:

  1. Filtering missings with ismissing() and dropmissing()
  2. Filling or Replacing missing’s values with coalesce()
  3. Skipping missings with skipmissing()

First, let’s import the packages with the using statement:

using CSV
using PharmaDatasets
using DataFramesMeta

We’ll use a dataset that has a lot of missing values as strings with a dot (".").

Notice that the column type that has missing values will display a ? after their inferred type in the subheader of the rendered DataFrame. For example, Float64? type actually means Union{Missing, Float64} type, in other words, it can be either a Float64 or a Missing.

Remember that in Julia there are no special missing values for different types.

Tip

Hover over the column headers in the DataFrame below to see the underlying types.

df = CSV.read(
    dataset("pumas_tutorials/iv_sd_final_target_format", String),
    DataFrame;
    missingstring = ".",
)
first(df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing

2.1 Filtering Missings with dropmissing() and ismissing()

Most of the time we want to remove missing values from our data.

Removing missings can be done with two options.

The first is the dropmissing() function from DataFrames.jl. Like all DataFrames.jl functions, it accepts a DataFrame as the first argument followed by which columns you’ll want to remove the missing data from.

By default, if you do not specify a column it will remove any observation (row) that has missing values:

dropmissing(df)
0×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64 Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64

Since all rows had at least one missing value across multiple columns, we get back an empty DataFrame.

Now, let’s restrict the columns that we want to remove the missing values from. You can either pass a single column as a Symbol or several columns as a vector of Symbols:

my_df = dropmissing(df, :AMT)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1
my_df = dropmissing(df, [:AMT, :CMT])
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1
Note

You can use any of the selectors that you’ve learned in the previous tutorial on DataFramesMeta.jl to select which columns you’ll want to drop the missing values.

For instance, you can use Cols, All, Between, Not or a regular expression with the string literal r"[regex]".

The second option is the ismissing() function applied to a filtering (subsetting) routine. ismissing(x) is a boolean function that returns true if x has the type Missing, and false otherwise.

You can do whatever pleases you:

  1. ismissing() to just keep the missing values
  2. the opposite, !ismissing() (the negation) to keep anything but the missing values
Tip

Missing values comparisons in Julia are the same as in R. If you compare a missing with any value it will be a missing:

julia> 1 == missing
missing

This is the same in R:

> 1 == NA
[1] NA

This is why it is important to use the ismissing() function instead of the == for comparisons.

my_df = @rsubset df ismissing(:AMT)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
my_df = @rsubset df !ismissing(:AMT)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1

You can also pick a different filtering macro:

  1. @subset with a vectorized ismissing()
  2. @rsubset with a simple ismissing()
my_df = @subset df ismissing.(:AMT)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
my_df = @subset df @byrow ismissing(:AMT)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
Tip

You can chain together several conditions with the AND && or the OR || operators inside the @[r]subset macro (watch out for precedence by using appropriate parentheses ()):

@rsubset df (ismissing(:col1) && ismissing(:col2)) || ismissing(:col3)

Alternatively, if everything is AND:

@rsubset df begin
    ismissing(:col1)
    ismissing(:col2)
    ismissing(:col3)
end

2.2 Filling Missing’s values with coalesce()

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the coalesce() function. We often use it in a vectorized way over an array to fill all missing values with a specific value.

For example, here’s a vector containing two missing values that we apply a broadcasted coalesce() function to replace all missings into 0s:

coalesce.([missing, "some_value", missing], 0)
3-element Vector{Any}:
 0
  "some_value"
 0

We could do the same with one or more columns in a DataFrame, but coalescing it to 2 instead of 0:

my_df = @rtransform df :CMT = coalesce(:CMT, 2)
first(my_df, 5)
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 2
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 2
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 2
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 2
Caution

Remember that the coalesce() function is only defined on single elements.

If you want to apply it on a vector you’ll need to broadcast it, i.e. coalesce.(v) where v is a vector.

If we want to apply it to a DataFrame column, we need to use the “row” macros, i.e. @rtransform[!] or @rselect[!]; or we could use the @byrow macro in a “non-row” macro, i.e. @transform[!] or @select[!].

2.3 Removing Missings with skipmissing()

Suppose that you’ll want to compute a sum where one or more elements are missing values. Same as in R, this wouldn’t work in Julia:

sum([1, 2, missing])
missing

In order to have functions like sum to ignore missing values in their arguments, you’ll need to use the skipmissing() function.

skipmissing() technically returns an iterator that skips over the missing values. This is handy for all collections, such as matrices and vectors, that have missing values.

The previous example can be computed with:

sum(skipmissing([1, 2, missing]))
3

We can use skipmissing() easily in DataFrames. For example, suppose you’ll want to compute a value that is the sum of two or more columns:

my_df = @rtransform df :SOME_SUM = sum(skipmissing([:EVID, :CMT]))
first(my_df, 5)
5×12 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT SOME_SUM
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64? Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1 2
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing 0

2.4 Propagate Missings with @passmissing

If you want to “propagate” missing values in DataFramesMeta.jl transformations with @[r]transform[!] you’ll need to use the macro @passmissing. It will not let transformations that would fail in the presence of missing values but instead will return missing.

For example, consider the following transformation on the :CMT column trying to parse it as a Bool:

my_df = @rtransform df :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)
MethodError: MethodError(convert, (Bool, missing), 0x000000000000834d)

It fails because Julia cannot convert a Missing into a Bool. But if we add the @passmissing macro, we do not get an error and missing values get propagated forward in the transformation:

my_df = @rtransform df @passmissing :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)
5×12 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT CMT_Bool
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64? Bool?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1 true
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing missing