# Handling `NA`s and Missing Values

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

In this tutorial, we will learn how to handle missing values in `DataFrame`s, commonly called as `NA`s in R. If you come from R, there will be a lot of examples and comparisons between R and Julia.

## 1 ❓ `Missing` type

R has several missing types: `NA`, `NA_integer_`, `NA_real_`, `NA_character_` and `NA_complex_`. Julia, on the contrary, has only one: `Missing`:

``typeof(missing)``
``Missing``
Tip

You can see that the type is `Missing` with title case and the instance of it is `missing` with lowercase. Most of the time you’ll be wanting to use the instantiation of a `Missing` type: `missing`.

`missing` values propagates through several operations, specially mathematical ones.

For example, addition, subtraction, multiplication and division:

``missing + 1``
``missing``
``missing - 1``
``missing``
``missing * 1``
``missing``
``missing / 1``
``missing``

They also propagate through equality and comparison operators:

``missing == 1``
``missing``
``missing == missing``
``missing``
``missing > 1``
``missing``
``missing > missing``
``missing``
Caution

Be cautious with comparisons and testing equalities in the presence of missing values.

For equality testing use instead the `ismissing()` function.

## 2 👷 Handling `missing`s in Julia

We will focus on 3 ways of handling missing data:

1. Filtering `missing`s with `ismissing()` and `dropmissing()`
2. Filling or Replacing `missing`’s values with `coalesce()`
3. Skipping `missing`s with `skipmissing()`

First, let’s import the packages with the `using` statement:

``````using CSV
using DataFramesMeta``````

We’ll use a dataset that has a lot of missing values as strings with a dot (`"."`).

Notice that the column type that has `missing` values will display a `?` after their inferred type in the subheader of the rendered `DataFrame`. For example, `Float64?` type actually means `Union{Missing, Float64}` type, in other words, it can be either a `Float64` or a `Missing`.

Remember that in Julia there are no special `missing` values for different types.

Tip

Hover over the column headers in the `DataFrame` below to see the underlying types.

``````df = CSV.read(
dataset("pumas_tutorials/iv_sd_final_target_format", String),
DataFrame;
missingstring = ".",
)
first(df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing

### 2.1 Filtering `Missing`s with `dropmissing()` and `ismissing()`

Most of the time we want to remove missing values from our data.

Removing `missing`s can be done with two options.

The first is the `dropmissing()` function from `DataFrames.jl`. Like all `DataFrames.jl` functions, it accepts a `DataFrame` as the first argument followed by which columns you’ll want to remove the missing data from.

By default, if you do not specify a column it will remove any observation (row) that has `missing` values:

``dropmissing(df)``
0×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64 Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64

Since all rows had at least one `missing` value across multiple columns, we get back an empty `DataFrame`.

Now, let’s restrict the columns that we want to remove the `missing` values from. You can either pass a single column as a `Symbol` or several columns as a vector of `Symbol`s:

``````my_df = dropmissing(df, :AMT)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1
``````my_df = dropmissing(df, [:AMT, :CMT])
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64 Float64 Float64 Float64 Int64 Float64 Int64 Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1
Note

You can use any of the selectors that you’ve learned in the previous tutorial on `DataFramesMeta.jl` to select which columns you’ll want to drop the `missing` values.

For instance, you can use `Cols`, `All`, `Between`, `Not` or a regular expression with the string literal `r"[regex]"`.

The second option is the `ismissing()` function applied to a filtering (subsetting) routine. `ismissing(x)` is a boolean function that returns `true` if `x` has the type `Missing`, and false otherwise.

You can do whatever pleases you:

1. `ismissing()` to just keep the `missing` values
2. the opposite, `!ismissing()` (the negation) to keep anything but the `missing` values
Tip

Missing values comparisons in Julia are the same as in R. If you compare a `missing` with any value it will be a `missing`:

``````julia> 1 == missing
missing``````

This is the same in R:

``````> 1 == NA
[1] NA``````

This is why it is important to use the `ismissing()` function instead of the `==` for comparisons.

``````my_df = @rsubset df ismissing(:AMT)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
``````my_df = @rsubset df !ismissing(:AMT)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 2 0.0 missing 100 32.765 74.838 0.8846 1 126.0 1 1
3 3 0.0 missing 100 35.974 37.303 1.1004 1 48.981 1 1
4 4 0.0 missing 100 38.206 32.969 1.1972 1 38.934 1 1
5 5 0.0 missing 100 33.559 47.139 1.5924 0 37.198 1 1

You can also pick a different filtering macro:

1. `@subset` with a vectorized `ismissing()`
2. `@rsubset` with a simple `ismissing()`
``````my_df = @subset df ismissing.(:AMT)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
``````my_df = @subset df @byrow ismissing(:AMT)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64?
1 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing
2 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing
3 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing
4 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing
5 1 1.5 13.146 missing 34.823 38.212 1.1129 0 42.635 0 missing
Tip

You can chain together several conditions with the AND `&&` or the OR `||` operators inside the `@[r]subset` macro (watch out for precedence by using appropriate parentheses `()`):

``@rsubset df (ismissing(:col1) && ismissing(:col2)) || ismissing(:col3)``

Alternatively, if everything is AND:

``````@rsubset df begin
ismissing(:col1)
ismissing(:col2)
ismissing(:col3)
end``````

### 2.2 Filling `Missing`’s values with `coalesce()`

A common data wrangling pattern is to replace or fill missing values.

Like R (and SQL), Julia has the `coalesce()` function. We often use it in a vectorized way over an array to fill all `missing` values with a specific value.

For example, here’s a vector containing two `missing` values that we apply a broadcasted `coalesce()` function to replace all `missing`s into `0`s:

``coalesce.([missing, "some_value", missing], 0)``
``````3-element Vector{Any}:
0
"some_value"
0``````

We could do the same with one or more columns in a `DataFrame`, but coalescing it to `2` instead of `0`:

``````my_df = @rtransform df :CMT = coalesce(:CMT, 2)
first(my_df, 5)``````
5×11 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 2
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 2
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 2
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 2
Caution

Remember that the `coalesce()` function is only defined on single elements.

If you want to apply it on a vector you’ll need to broadcast it, i.e. `coalesce.(v)` where `v` is a vector.

If we want to apply it to a `DataFrame` column, we need to use the “row” macros, i.e. `@rtransform[!]` or `@rselect[!]`; or we could use the `@byrow` macro in a “non-row” macro, i.e. `@transform[!]` or `@select[!]`.

### 2.3 Removing `Missing`s with `skipmissing()`

Suppose that you’ll want to compute a sum where one or more elements are `missing` values. Same as in R, this wouldn’t work in Julia:

``sum([1, 2, missing])``
``missing``

In order to have functions like `sum` to ignore `missing` values in their arguments, you’ll need to use the `skipmissing()` function.

`skipmissing()` technically returns an iterator that skips over the `missing` values. This is handy for all collections, such as matrices and vectors, that have `missing` values.

The previous example can be computed with:

``sum(skipmissing([1, 2, missing]))``
``3``

We can use `skipmissing()` easily in `DataFrame`s. For example, suppose you’ll want to compute a value that is the sum of two or more columns:

``````my_df = @rtransform df :SOME_SUM = sum(skipmissing([:EVID, :CMT]))
first(my_df, 5)``````
5×12 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT SOME_SUM
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64? Int64
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1 2
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing 0
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing 0

### 2.4 Propagate `Missing`s with `@passmissing`

If you want to “propagate” `missing` values in `DataFramesMeta.jl` transformations with `@[r]transform[!]` you’ll need to use the macro `@passmissing`. It will not let transformations that would fail in the presence of `missing` values but instead will return `missing`.

For example, consider the following transformation on the `:CMT` column trying to parse it as a `Bool`:

``````my_df = @rtransform df :CMT_Bool=convert(Bool, :CMT)
first(my_df, 5)``````
``MethodError: MethodError(convert, (Bool, missing), 0x000000000000832c)``

It fails because Julia cannot convert a `Missing` into a `Bool`. But if we add the `@passmissing` macro, we do not get an error and `missing` values get propagated forward in the transformation:

``````my_df = @rtransform df @passmissing :CMT_Bool = convert(Bool, :CMT)
first(my_df, 5)``````
5×12 DataFrame
Row ID TIME_AFTER_DOSING CONC AMT AGE WEIGHT SCR ISMALE eGFR EVID CMT CMT_Bool
Int64 Float64 Float64? Int64? Float64 Float64 Float64 Int64 Float64 Int64 Int64? Bool?
1 1 0.0 missing 100 34.823 38.212 1.1129 0 42.635 1 1 true
2 1 0.25 13.026 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
3 1 0.5 14.984 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
4 1 0.75 14.16 missing 34.823 38.212 1.1129 0 42.635 0 missing missing
5 1 1.0 19.316 missing 34.823 38.212 1.1129 0 42.635 0 missing missing