Handling Factors and Categorical Data with CategoricalArrays.jl
We begin by setting a seed with seed!() to make this notebook reproducible:
using Random: seed!
seed!(123)
In this tutorial, we tackle qualitative data, also known as categorical data or, for users coming from R and the tidyverse, factors.
The main package in the Julia ecosystem for this kind of data is CategoricalArrays.jl, whose main entry point is the categorical() function.
By default, categorical()
works on vectors, but we can use it inside the @[r]transform[!]
macros from DataFramesMeta.jl
to transform any DataFrame
column.
To start, let’s import our df
dataset from the previous tutorials:
using PharmaDatasets
using DataFrames
using DataFramesMeta
df = dataset("demographics_1")
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
1 📊 categorical() function
The categorical()
function from CategoricalArrays.jl
takes as an argument a vector and returns a vector with a qualitative representation of the underlying data. In our case it will be the :ISMALE
column from the df
DataFrame
, which should be represented as a factor/category.
First, we need to import the CategoricalArrays.jl
package with the using
statement:
using CategoricalArrays
Now we can use any of the @[r]select
/@[r]transform
macros from DataFramesMeta.jl
to apply the categorical()
function to the :ISMALE
column and save it as a new column called :SEX
:
@transform! df :SEX = categorical(:ISMALE)
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX |
---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 |
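As a minimal sketch outside of a DataFrame, the same call can be tried on a plain vector. The ismale vector below is a made-up stand-in for the :ISMALE column, not the tutorial's data:

```julia
using CategoricalArrays

# Toy stand-in for the :ISMALE column (hypothetical values)
ismale = [0, 1, 1, 1, 0]

# Wrap the integers in a categorical vector
sex = categorical(ismale)

# The distinct categories found in the data
sex_levels = levels(sex)

# Categorical vectors are unordered unless we ask otherwise
sex_is_ordered = isordered(sex)
```

Inside a DataFrame, the @transform! call above does the same thing to a whole column.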
2 🔪 cut() function
We can also create CategoricalArrays with the cut() function, which takes a numerical array, splits it into intervals (either at given break points or into a given number of quantile-based groups) and turns those intervals into categories/factors.
For example, let’s “cut” the :eGFR column into "low" and "high" based on whether the value is lower or higher than the median:
@transform! df :eGRF_cat = cut(:eGFR, 2; labels = ["low", "high"])
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | low |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | high |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | low |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | low |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 | low |
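The two ways of calling cut() can be compared on a small vector. This is only a sketch: the egfr values and the 60.0 break point are invented for illustration and are not taken from the dataset:

```julia
using CategoricalArrays

# Hypothetical eGFR-like values; the median lies between 42.6 and 49.0
egfr = [38.9, 42.6, 49.0, 126.0]

# Two quantile-based groups: the split point is the median
by_median = cut(egfr, 2; labels = ["low", "high"])

# Explicit break points: the intervals are [0, 60) and [60, 200)
by_breaks = cut(egfr, [0.0, 60.0, 200.0]; labels = ["below 60", "60 or above"])

# cut() returns an ordered categorical array, so comparisons work directly
first_is_lower = by_median[1] < by_median[4]
```

With explicit breaks, every value must fall inside one of the intervals, which is why the toy breaks span 0 to 200.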
2.1 CategoricalValue
As you can see, the underlying data was transformed into CategoricalValue objects, whose type holds two types:
- An Int64: this is the underlying data type of the original data from the :ISMALE column.
- A UInt32: this is the optimized data type used to represent the categorical/factor data from the :ISMALE column.
Under the hood, CategoricalArrays.jl creates a sort of dictionary that keeps track of the original values and the new representation. By default, the optimized data type is a UInt32. In most cases this works fine and gives some performance gains, since a UInt32 takes 4 bytes whereas an Int64 takes 8. Imagine if we had large strings instead of Int64s: those gains would be much higher.
We can ask CategoricalArrays.jl
to compress the data, if possible, to something smaller than UInt32
s. This is done with the compress
keyword argument. CategoricalArrays.jl
will then scan all the unique values of the target categorical vector to define its size (cardinality) to choose an appropriate UInt
size.
For example, our :ISMALE column, which has only 2 unique elements, fits easily in a UInt8 (\(2^8 = 256\) possible values):
@transform! df :SEX = categorical(:SEX; compress = true)
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | low |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | high |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | low |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | low |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 | low |
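Since the table output looks the same before and after compression, a quick way to see the difference is to check the reference type, which is the third type parameter of CategoricalArray. A minimal sketch on a toy vector:

```julia
using CategoricalArrays

sex = categorical([0, 1, 1, 1, 0])

# By default the reference type is UInt32
default_is_uint32 = sex isa CategoricalArray{Int, 1, UInt32}

# compress = true picks the smallest reference type that fits the levels
small = categorical([0, 1, 1, 1, 0]; compress = true)
small_is_uint8 = small isa CategoricalArray{Int, 1, UInt8}

# compress() and decompress() convert an existing categorical array
roundtrip = decompress(compress(sex))
```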
A CategoricalArray created with ordered = false (the default) only supports the equality (==) and inequality (!=) operators for comparisons. If you need to order categories/factors and compare them with the greater-than (>, >=) or less-than (<, <=) operators, you’ll need ordered factors/categories, created with the ordered = true argument.
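The difference between unordered and ordered values can be sketched as follows. The dose vector is hypothetical, and the try/catch only records that the unordered comparison fails:

```julia
using CategoricalArrays

dose = categorical(["low", "high"]; levels = ["low", "high"])

# Equality works on unordered values
same = dose[1] == dose[2]

# `<` on unordered values throws, so we record whether that happened
threw = try
    dose[1] < dose[2]
    false
catch
    true
end

# Switching the array to ordered enables <, <=, > and >=
ordered!(dose, true)
low_below_high = dose[1] < dose[2]
```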
3 🚧 Recoding factors/categories
One of the most common operations on factors and categories is the dreaded recoding. If you come from R and the tidyverse, you will find this much easier to do in Julia.
We can recode CategoricalArrays with the recode()/recode!() functions, which accept a CategoricalArray as the first argument, followed by Pairs of old_value => new_value.
Suppose we would like to change the values in the :SEX
column to "male"
or "female"
:
@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male")
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low |
recode() and recode!() take an optional second positional argument, the default value. Any element that does not match an old_value in one of the Pairs is replaced by this default. It is similar to the TRUE ~ new_value clause that you’ve probably used in the dplyr::case_when() function.
For example, we could have written the previous recode() call as:
@transform! df :SEX = recode(:SEX, "male", 0 => "female")
first(df, 5)
This will work because anything that does not match 0
will be recoded as "male"
.
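Both forms can be checked side by side on a toy vector (the values are made up for illustration):

```julia
using CategoricalArrays

sex = categorical([0, 1, 1, 0])

# One explicit pair for every level
explicit = recode(sex, 0 => "female", 1 => "male")

# Same result with a default: everything that is not 0 becomes "male"
with_default = recode(sex, "male", 0 => "female")
```

The default form is convenient when only a few levels need their own label.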
4 👉 ordered factors/categories
Often we need to represent ordered factors/categories. Suppose we have a vector of strings with 3 categories: "high", "medium" and "low". We’ll probably want them ordered so that we can make comparisons, and this is accomplished with the ordered keyword argument.
Let’s first generate a random :INTENSITY column that can take those 3 values (just to have a good example):
@transform! df :INTENSITY = rand(["low", "medium", "high"], nrow(df))
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | String | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | high |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | low |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | high |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | medium |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
Now we call categorical()
on the :INTENSITY
column with the ordered = true
as a keyword argument:
@transform! df :INTENSITY = categorical(:INTENSITY; ordered = true)
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | high |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | low |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | high |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | medium |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
5 ℹ levels – custom levels
By default, categorical(:column; ordered = true) sorts the levels whenever the values can be sorted, so for strings the levels run from lowest to highest in alphabetical order. Since "high" sorts before "low" and "medium", it is encoded as the lowest value. This is not what we want, since the resulting comparison does not behave as we intend:
# "high" > "low" (should be true)
df[1, :INTENSITY] > df[2, :INTENSITY]
false
To remedy this, we can use the levels
keyword argument in the categorical()
function. levels
accepts a vector of strings and will interpret them as lowest to highest:
@transform! df :INTENSITY =
categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"])
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | high |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | low |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | high |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | medium |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
Now our comparison behaves as expected:
# "high" > "low" (should be true)
df[1, :INTENSITY] > df[2, :INTENSITY]
true
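The effect of the levels keyword can be checked on a small made-up vector. The sketch below compares the level order that categorical() picks for strings when levels is omitted (sorted order, as far as the package documentation describes it) with an explicit order:

```julia
using CategoricalArrays

intensity = ["medium", "high", "low", "high"]

# Without `levels`, the levels of a string vector come out sorted
default_order = categorical(intensity; ordered = true)
default_levels = levels(default_order)

# With `levels`, we state the order from lowest to highest ourselves
custom_order = categorical(intensity; ordered = true,
                           levels = ["low", "medium", "high"])

# Element 2 is "high" and element 1 is "medium"
default_cmp = default_order[2] > default_order[1]
custom_cmp = custom_order[2] > custom_order[1]
```

Only custom_cmp reflects the intended meaning of "high" being above "medium".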
You can also see what levels a CategoricalArray
has with the levels()
function:
levels(df.INTENSITY)
3-element Vector{String}:
"low"
"medium"
"high"
levels(df.SEX)
2-element Vector{String}:
"female"
"male"
You can also change the levels of a CategoricalArray in place using the levels!() function. What we accomplished above with categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"]) could also be done, on the already ordered column, with:
levels!(df.INTENSITY, ["low", "medium", "high"])
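When starting from an unordered array, levels!() alone only changes the level order, so a call to ordered!() is also needed before comparisons work. A minimal sketch on a toy vector:

```julia
using CategoricalArrays

intensity = categorical(["high", "low", "medium"])

# Reorder the levels in place, then mark the array as ordered
levels!(intensity, ["low", "medium", "high"])
ordered!(intensity, true)

# Element 1 is "high" and element 3 is "medium"
high_above_medium = intensity[1] > intensity[3]
```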
6 🔢 Converting factors/categories to numerical values
Handling factors in Julia is easy, but there are also times when we need to transform them back into numerical values. This is often needed when we pass a DataFrame with one or more CategoricalArrays into a model.
This is easily accomplished with the levelcode()
function. For example, let’s convert our :INTENSITY
column into a numerical column called :INTENSITY_NUM
:
@rtransform! df :INTENSITY_NUM = levelcode(:INTENSITY)
first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY | INTENSITY_NUM |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… | Int64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | high | 3 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | low | 1 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | high | 3 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | medium | 2 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high | 3 |
And it works! Our :INTENSITY_NUM
column has the values 1
, 2
and 3
, respectively for the "low"
, "medium"
and "high"
levels.
Note that the levelcode()
function acts on a CategoricalValue
and not on a CategoricalArray
. In other words, it must be applied to every element of the CategoricalArray
and not to the array itself.
This can be done with the @rtransform[!]
/@rselect[!]
macros or using the @byrow
macro.
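Outside of DataFramesMeta.jl, the same element-wise application can be written with Julia's dot (broadcast) syntax. A small sketch with invented values:

```julia
using CategoricalArrays

intensity = categorical(["high", "low", "high", "medium"];
                        ordered = true, levels = ["low", "medium", "high"])

# levelcode() works on a single CategoricalValue...
first_code = levelcode(intensity[1])

# ...so the dot syntax applies it to every element of the array
codes = levelcode.(intensity)
```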
7 🎁 Extra Operations with Factors and Categories
This section of the tutorial is geared towards users coming from R’s tidyverse who are used to the extensive set of special-purpose functions in forcats.
First, notice that these are the exported functions from CategoricalArrays.jl
:
- categorical(A): Construct a categorical array with values from A
- compress(A): Return a copy of categorical array A using the smallest possible reference type
- cut(x): Cut a numeric array into intervals and return an ordered CategoricalArray
- decompress(A): Return a copy of categorical array A using the default reference type
- isordered(A): Test whether entries in A can be compared using <, > and similar operators
- ordered!(A, ordered): Set whether entries in A can be compared using <, > and similar operators
- recode(a[, default], pairs...): Return a copy of a after replacing one or more values
- recode!(a[, default], pairs...): Replace one or more values in a in-place
- unwrap(x): Return the value contained in categorical value x; if x is Missing return missing
- levelcode(x): Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x)
CategoricalArrays.jl does not have as many functions as forcats because, with Julia’s multiple dispatch, rich ecosystem of packages and tightly integrated standard library, we do not need them. We can use other Julia functions that were either extended to work on CategoricalArrays and CategoricalValues or that work right out of the box without any customization.
@by with length() replaces fct_count()
The first one is fct_count(), which is neatly replaced by the @by macro together with the length() function:
@by df :SEX :count = length(:SEX)
Row | SEX | count |
---|---|---|
Cat… | Int64 | |
1 | female | 53 |
2 | male | 47 |
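For a bare vector without a DataFrame, a similar count can be sketched with a comprehension over the levels. The sex vector is made up, and building a Dict is just one possible choice of container:

```julia
using CategoricalArrays

sex = categorical(["female", "male", "male", "female", "female"])

# Count how many elements fall in each level
counts = Dict(unwrap(l) => count(==(l), sex) for l in levels(sex))
```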
in replaces fct_match()
To test whether a CategoricalArray has a certain level, just check whether that level is in the levels returned by the levels() function:
"female" in levels(df.SEX)
true
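The level check, and an element-wise match against a level, can be sketched on a toy vector as:

```julia
using CategoricalArrays

sex = categorical(["female", "male", "male"])

# Is the level present at all?
has_female = "female" in levels(sex)
has_other = "other" in levels(sex)

# Which elements are equal to a given level?
is_female = sex .== "female"
```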
vcat() replaces fct_c()
Julia already has a vertical concatenation function for vectors and other collections, so why create a new one? It is much easier and more natural to extend vcat().
This is exactly what CategoricalArrays.jl
has done. Let’s first create two categorical
vectors:
cat1 = categorical(["a", "b"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
cat2 = categorical(["c", "d"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"c"
"d"
Now we just vcat()
them together:
vcat(cat1, cat2)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
reverse() replaces fct_rev()
The same argument as for vcat() replacing fct_c() applies here. The reverse() function already works on the vector returned by levels():
levels(df.SEX)
2-element Vector{String}:
"female"
"male"
reverse(levels(df.SEX))
2-element Vector{String}:
"male"
"female"
Then, just do an in-place levels!() on your CategoricalArray:
levels!(df.SEX, reverse(levels(df.SEX)))
100-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
"female"
"male"
"male"
"male"
"female"
"female"
"female"
"female"
"female"
"male"
⋮
"male"
"female"
"female"
"male"
"male"
"male"
"female"
"male"
"female"
Just to confirm that it worked correctly:
levels(df.SEX) # 👍
2-element Vector{String}:
"male"
"female"
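Reversing the levels changes their order, and therefore the level codes, but leaves the data values untouched. A minimal sketch on a made-up vector:

```julia
using CategoricalArrays

sex = categorical(["female", "male", "male"])

# Codes under the original level order "female", "male"
before = levelcode.(sex)

# Reverse the level order in place
levels!(sex, reverse(levels(sex)))

# Codes change, values do not
after = levelcode.(sex)
values_after = unwrap.(sex)
```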
recode() replaces fct_other()
recode() with the optional positional argument default replaces the fct_other() function:
big_cat = categorical(["a", "b", "c", "d"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
Suppose you only want to retain "a"
and "d"
categories and the rest will be coded as "other"
:
recode(big_cat, "other", "a" => "a", "d" => "d")
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"other"
"other"
"d"
shuffle() replaces fct_shuffle()
This is a nice replacement. Just set new levels in place with the levels!() function, passing shuffle() of the original levels as the new levels:
using Random: shuffle
shuffle(levels(big_cat))
4-element Vector{String}:
"c"
"b"
"d"
"a"
levels!(big_cat, shuffle(levels(big_cat)))
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
Just to confirm that it worked correctly:
levels(big_cat) # 👍
4-element Vector{String}:
"a"
"b"
"c"
"d"
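Because each call to shuffle() draws a new random order, passing an explicitly seeded generator makes the result repeatable. The sketch below assumes a MersenneTwister with an arbitrary seed of 42, and checks only that the set of levels and the data are preserved:

```julia
using CategoricalArrays
using Random: MersenneTwister, shuffle

big_cat = categorical(["a", "b", "c", "d"])

# A seeded generator makes the shuffled order repeatable
rng = MersenneTwister(42)
levels!(big_cat, shuffle(rng, levels(big_cat)))

# Same four levels, possibly in a new order; the data itself is untouched
new_levels = levels(big_cat)
values_after = unwrap.(big_cat)
```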