Handling Factors and Categorical Data with CategoricalArrays.jl
We begin by setting a seed to make this notebook reproducible:
using Random: seed!
seed!(123)
In this tutorial, we tackle qualitative data, also known as categorical data or, for users coming from R and the tidyverse, factors.
The main package in the Julia ecosystem that deals with this kind of data is CategoricalArrays.jl, whose central function is categorical().
By default, categorical() works on vectors, but we can use it inside the @[r]transform[!] macros from DataFramesMeta.jl to transform any DataFrame column.
To start, let’s import our df dataset from the previous tutorials:
using PharmaDatasets
using DataFrames
using DataFramesMeta
df = dataset("demographics_1")
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
1 📊 categorical() function
The categorical() function from CategoricalArrays.jl takes as an argument a vector and returns a vector with a qualitative representation of the underlying data. In our case it will be the :ISMALE column from the df DataFrame, which should be represented as a factor/category.
First, we need to import the CategoricalArrays.jl package with the using statement:
using CategoricalArrays
Now we can use any of the @[r]select/@[r]transform macros from DataFramesMeta.jl to apply the categorical() function to the :ISMALE column and save it as a new column called :SEX:
@transform! df :SEX = categorical(:ISMALE)
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX |
|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 |
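To see what categorical() does outside a DataFrame, here is a minimal sketch on a plain vector. The names `ismale` and `sex` are illustrative only and are not part of the dataset.

```julia
using CategoricalArrays

# Hypothetical 0/1 indicator values, mirroring the first rows of :ISMALE
ismale = [0, 1, 1, 1, 0]

# Wrap the integers in a categorical vector
sex = categorical(ismale)

levels(sex)     # the distinct categories found in the data
isordered(sex)  # unordered unless `ordered = true` is passed
```

The result still displays as 0s and 1s, but each element is now a categorical value rather than a plain integer.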
2 🔪 cut() function
We can also create CategoricalArrays with the cut() function, which takes a numerical array, breaks it into intervals at certain values, and turns those intervals into categories/factors.
For example, let’s “cut” the column :eGFR into "low" and "high" based on whether the value is lower or higher than the median value:
@transform! df :eGRF_cat = cut(:eGFR, 2; labels = ["low", "high"])
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | low |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | low |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 | low |
2.1 CategoricalValue
As you can see, the underlying data was transformed into a CategoricalValue object that holds two types:
- An Int64: this is the underlying data type of the original data from the :ISMALE column.
- A UInt32: this is the optimized data type used to represent the categorical/factor data from the :ISMALE column.
Under the hood, CategoricalArrays.jl creates a sort of dictionary that keeps track of the original values and the new representation. By default, the optimized data type will always be a UInt32. In most cases this works fine: we get some performance gains because a UInt32 takes less memory than an Int64. Imagine if we had large strings instead of Int64s: those gains would be much higher.
We can ask CategoricalArrays.jl to compress the data, if possible, to something smaller than UInt32s. This is done with the compress keyword argument. CategoricalArrays.jl will then scan all the unique values of the target categorical vector to define its size (cardinality) to choose an appropriate UInt size.
For example, our :ISMALE column, which has only 2 unique values, fits easily in a UInt8, which has \(2^8 = 256\) possible values:
@transform! df :SEX = categorical(:SEX; compress = true)
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 0 | low |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1 | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 1 | low |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 1 | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 0 | low |
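The table looks the same after compression because only the internal reference type changes. A small sketch, using hypothetical vectors `x` and `xc`, makes the difference visible through the array's type parameters:

```julia
using CategoricalArrays

x = categorical([0, 1, 1, 0])  # default reference type
xc = compress(x)               # smallest reference type that fits the levels

x isa CategoricalArray{Int, 1, UInt32}   # default: UInt32 references
xc isa CategoricalArray{Int, 1, UInt8}   # compressed: UInt8 references
xc == x                                  # same values, smaller storage
```

Calling `categorical(x; compress = true)` should have the same effect as calling `compress()` afterwards.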
CategoricalArrays created with ordered = false (the default) only support the equality (==) and inequality (!=) operators for comparisons. If you need to order categories/factors and compare them with the greater-than (>, >=) or less-than (<, <=) operators, you’ll need ordered factors/categories, created with the ordered = true argument.
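As a rough illustration of this restriction, the sketch below compares elements of an unordered and an ordered vector. The names `unord`, `ord` and `threw` are made up for the example.

```julia
using CategoricalArrays

unord = categorical(["a", "b"])
ord = categorical(["a", "b"]; ordered = true)

# `<` on unordered values raises an error instead of returning a result
threw = try
    unord[1] < unord[2]
    false
catch err
    err isa ArgumentError
end

unord[1] == unord[1]  # equality is always allowed
ord[1] < ord[2]       # order comparisons need `ordered = true`
```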
3 🚧 Recoding factors/categories
One of the most common operations using factors and categories is the dreadful recoding. If you come from R and tidyverse, this is much easier to do in Julia.
We can recode CategoricalArrays with the recode()/recode!() function which accepts as first argument a CategoricalArray, followed by Pairs of old_value => new_value.
Suppose we would like to change the values in the :SEX column to "male" or "female":
@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male")
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low |
recode() and recode!() take an optional second positional argument: the default value. Any element that does not match an old_value in one of the Pairs is recoded to this default. It is similar to the TRUE ~ new_value that you’ve probably used in the dplyr::case_when() function.
For example, instead of the previous recode() call (applied while :SEX still held the values 0 and 1), we could have written:
@transform! df :SEX = recode(:SEX, "male", 0 => "female")
first(df, 5)
This will work because anything that does not match 0 will be recoded as "male".
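The two styles of recoding can be compared side by side on a small vector. This is a sketch with hypothetical names (`sex`, `explicit`, `with_default`), not the tutorial's DataFrame:

```julia
using CategoricalArrays

sex = categorical([0, 1, 1, 0])

# Every old value is listed explicitly
explicit = recode(sex, 0 => "female", 1 => "male")

# Only 0 is listed; everything else falls back to the default "male"
with_default = recode(sex, "male", 0 => "female")

explicit == with_default  # both approaches give the same values
```

The default form is convenient when many values collapse into one category, but it will also silently absorb unexpected values, so the explicit form is safer when the set of inputs is not fully known.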
4 👉 ordered factors/categories
Often we need to represent ordered factors/categories. Suppose we have a vector of strings with 3 categories: "high", "medium" and "low". We’ll probably want them ordered so that we can make comparisons, and this is accomplished with the ordered keyword argument.
Let’s first generate a random :INTENSITY which can take those 3 values (just to have a good example):
@transform! df :INTENSITY = rand(["low", "medium", "high"], nrow(df))
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | String |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | medium |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | medium |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
Now we call categorical() on the :INTENSITY column with the ordered = true as a keyword argument:
@transform! df :INTENSITY = categorical(:INTENSITY; ordered = true)
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | medium |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | medium |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
5 ℹ levels – custom levels
By default, categorical(:column; ordered = true) sorts the levels when they can be sorted (alphabetically for strings) and otherwise keeps them in order of appearance. For our :INTENSITY column this gives "high" < "low" < "medium", so "high" is encoded as the lowest value. This is not what we want, since the resulting comparison is not the behavior we intend:
# "medium" > "high" (should be false)
df[1, :INTENSITY] > df[2, :INTENSITY]
true
To remedy this, we can use the levels keyword argument in the categorical() function. levels accepts a vector of strings and will interpret them as lowest to highest:
@transform! df :INTENSITY =
categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"])
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | medium |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | high |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | medium |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | low |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high |
Now our comparison behaves as expected:
# "medium" > "high" (should be false)
df[1, :INTENSITY] > df[2, :INTENSITY]
false
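The effect of the levels keyword can be reproduced on a three-element vector. The sketch below uses hypothetical names (`x`, `default_order`, `custom_order`) and string data, for which the default level order is alphabetical:

```julia
using CategoricalArrays

x = ["medium", "high", "low"]

# Without `levels`, string levels are sorted alphabetically
default_order = categorical(x; ordered = true)
levels(default_order)                  # "high", "low", "medium"
default_order[1] > default_order[2]    # "medium" > "high" is true here

# With explicit `levels`, the order is lowest to highest as written
custom_order = categorical(x; ordered = true, levels = ["low", "medium", "high"])
custom_order[1] > custom_order[2]      # "medium" > "high" is now false
```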
You can also see what levels a CategoricalArray has with the levels() function:
levels(df.INTENSITY)
3-element Vector{String}:
"low"
"medium"
"high"
levels(df.SEX)
2-element Vector{String}:
"female"
"male"
You can also change in-place the levels of a CategoricalArray using the levels!() function. What we accomplished above with the categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"]) function could be done with:
levels!(df.INTENSITY, ["low", "medium", "high"])
6 🔢 Converting factors/categories to numerical values
Handling factors in Julia is easy, but there are times when we need to transform them back to numerical values. This is often needed when we pass a DataFrame with one or more CategoricalArrays into a model.
This is easily accomplished with the levelcode() function. For example, let’s convert our :INTENSITY column into a numerical column called :INTENSITY_NUM:
@rtransform! df :INTENSITY_NUM = levelcode(:INTENSITY)
first(df, 5)
| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | SEX | eGRF_cat | INTENSITY | INTENSITY_NUM |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Cat… | Cat… | Cat… | Int64 |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | female | low | medium | 2 |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | male | high | high | 3 |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | male | low | medium | 2 |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | male | low | low | 1 |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | female | low | high | 3 |
And it works! Our :INTENSITY_NUM column has the values 1, 2 and 3, respectively for the "low", "medium" and "high" levels.
Note that the levelcode() function acts on a CategoricalValue and not on a CategoricalArray. In other words, it must be applied to every element of the CategoricalArray and not to the array itself.
This can be done with the @rtransform[!]/@rselect[!] macros or using the @byrow macro.
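Outside of DataFramesMeta.jl, the same element-wise application can be written with Julia's broadcasting dot syntax. A minimal sketch, with hypothetical names `intensity` and `codes`:

```julia
using CategoricalArrays

intensity = categorical(
    ["medium", "high", "low"];
    ordered = true,
    levels = ["low", "medium", "high"],
)

# The dot applies levelcode() to each element of the array
codes = levelcode.(intensity)
```

Each code is the position of the element's level in `levels(intensity)`, so the codes depend on the level order and not on the order of the data.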
7 🎁 Extra Operations with Factors and Categories
This section of the tutorial is geared towards users coming from R’s tidyverse who are used to the extensive ad-hoc functions of forcats.
First, notice that these are the exported functions from CategoricalArrays.jl:
- categorical(A) - Construct a categorical array with values from A
- compress(A) - Return a copy of categorical array A using the smallest possible reference type
- cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray
- decompress(A) - Return a copy of categorical array A using the default reference type
- isordered(A) - Test whether entries in A can be compared using <, > and similar operators
- ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators
- recode(a[, default], pairs...) - Return a copy of a after replacing one or more values
- recode!(a[, default], pairs...) - Replace one or more values in a in-place
- unwrap(x) - Return the value contained in categorical value x; if x is Missing return missing
- levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x)
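A few of these functions have not appeared earlier in the tutorial. The sketch below exercises unwrap(), ordered!() and decompress() on a hypothetical vector `v`; it is meant as a quick orientation rather than a complete tour.

```julia
using CategoricalArrays

v = categorical(["a", "b", "a"])

# unwrap() returns the plain value stored inside a categorical value
raw = unwrap(v[1])

# ordered!() switches ordered comparisons on (or off) in place
was_ordered = isordered(v)
ordered!(v, true)
now_ordered = isordered(v)

# decompress() restores the default reference type after compress()
roundtrip = decompress(compress(v))
```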
We don’t have as many functions as forcats because, with Julia’s multiple dispatch, rich ecosystem of packages and tightly integrated standard library, we do not need them. We can use other Julia functions that were either extended to work on CategoricalArrays and CategoricalValues or that work right out of the box without any customization.
@by replaces fct_count()
The first one is fct_count(), which is elegantly replaced by the @by macro with the length() function:
@by df :SEX :count = length(:SEX)
| Row | SEX | count |
|---|---|---|
| | Cat… | Int64 |
| 1 | female | 53 |
| 2 | male | 47 |
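When the data is a standalone vector rather than a DataFrame column, a similar count can be sketched with Base functions only. The vector `sex` and the name `counts` are hypothetical:

```julia
using CategoricalArrays

sex = categorical(["female", "male", "male", "female", "female"])

# One `level => count` pair per level, in level order
counts = [string(l) => count(==(l), sex) for l in levels(sex)]
```

This relies on `==` working between a categorical value and a level, so no unwrapping is needed before counting.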
in replaces fct_match()
To test whether a CategoricalArray has a certain level, check whether that level is in the vector returned by levels():
"female" in levels(df.SEX)
true
vcat() replaces fct_c()
Julia already has a vertical concatenation function for vectors and other collections. So why create a new one? It is much easier and more natural to extend vcat().
This is exactly what CategoricalArrays.jl has done. Let’s first create two categorical vectors:
cat1 = categorical(["a", "b"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
cat2 = categorical(["c", "d"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"c"
"d"
Now we just vcat() them together:
vcat(cat1, cat2)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
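It can be worth checking that the concatenated result is still categorical and that its levels are the union of the inputs' levels. A short sketch, where `combined` is a name introduced only for this example:

```julia
using CategoricalArrays

cat1 = categorical(["a", "b"])
cat2 = categorical(["c", "d"])

combined = vcat(cat1, cat2)

combined isa CategoricalVector  # the result stays categorical
levels(combined)                # levels from both inputs
```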
reverse() replaces fct_rev()
The same argument applies here as for vcat() replacing fct_c(). Base’s reverse() function works directly on the vector returned by levels():
levels(df.SEX)
2-element Vector{String}:
"female"
"male"
reverse(levels(df.SEX))
2-element Vector{String}:
"male"
"female"
Then, just do an in-place levels!() on your CategoricalArray:
levels!(df.SEX, reverse(levels(df.SEX)))
100-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
"female"
"male"
"male"
"male"
"female"
"female"
"female"
"female"
"female"
"male"
⋮
"male"
"female"
"female"
"male"
"male"
"male"
"female"
"male"
"female"
Just to confirm that it worked correctly:
levels(df.SEX) # 👍
2-element Vector{String}:
"male"
"female"
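Note that reversing the levels changes only their order, not the data itself. The sketch below shows this on a hypothetical three-element vector `v`:

```julia
using CategoricalArrays

v = categorical(["female", "male", "male"])

# Reverse the level order in place
levels!(v, reverse(levels(v)))

levels(v)  # "male" now comes first
v          # the elements themselves are unchanged
```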
recode() replaces fct_other()
recode() with the optional positional argument default replaces the fct_other() function:
big_cat = categorical(["a", "b", "c", "d"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
Suppose you only want to retain "a" and "d" categories and the rest will be coded as "other":
recode(big_cat, "other", "a" => "a", "d" => "d")
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"other"
"other"
"d"
shuffle() replaces fct_shuffle()
This is a nice replacement. Just set new levels in-place with the levels!() function, passing shuffle() of the original levels as the new levels:
using Random: shuffle
shuffle(levels(big_cat))
4-element Vector{String}:
"c"
"d"
"b"
"a"
levels!(big_cat, shuffle(levels(big_cat)))
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"b"
"c"
"d"
Just to confirm that it worked correctly:
levels(big_cat) # 👍
4-element Vector{String}:
"b"
"d"
"a"
"c"
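As with reversing, shuffling permutes the level order and leaves the data untouched. A sketch with a hypothetical vector `letters`; the resulting order depends on the random number generator, so no particular permutation is assumed:

```julia
using CategoricalArrays
using Random: shuffle, seed!

seed!(123)  # only for repeatability of this sketch

letters = categorical(["a", "b", "c", "d"])

# Replace the levels with a random permutation of themselves
levels!(letters, shuffle(levels(letters)))

levels(letters)  # same four levels, in some shuffled order
letters          # the elements are unchanged
```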