Handling Factors and Categorical Data with CategoricalArrays.jl

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

Note

We begin by setting a seed! to make this notebook reproducible:

using Random: seed!
seed!(123)

In this tutorial, we tackle qualitative data, also known as categorical data or, for users coming from R and the tidyverse, factors.

The main package in the Julia ecosystem which deals with this kind of data is CategoricalArrays.jl with the categorical() function.

By default, categorical() works on vectors, but we can use it inside the @[r]transform[!] macros from DataFramesMeta.jl to transform any DataFrame column.

To start, let’s import our df dataset from the previous tutorials:

using PharmaDatasets
using DataFrames
using DataFramesMeta
df = dataset("demographics_1")
first(df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198

1 📊 categorical() function

The categorical() function from CategoricalArrays.jl takes as an argument a vector and returns a vector with a qualitative representation of the underlying data. In our case it will be the :ISMALE column from the df DataFrame, which should be represented as a factor/category.

First, we need to import the CategoricalArrays.jl package with the using statement:

using CategoricalArrays

Now we can use any of the @[r]select/@[r]transform macros from DataFramesMeta.jl to apply the categorical() function to the :ISMALE column and save it as a new column called :SEX:

@transform! df :SEX = categorical(:ISMALE)
first(df, 5)
5×7 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX
Int64 Float64 Float64 Float64 Int64 Float64 Cat…
1 1 34.823 38.212 1.1129 0 42.635 0
2 2 32.765 74.838 0.8846 1 126.0 1
3 3 35.974 37.303 1.1004 1 48.981 1
4 4 38.206 32.969 1.1972 1 38.934 1
5 5 33.559 47.139 1.5924 0 37.198 0
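Since categorical() works on any vector, the same step can be sketched without a DataFrame. The snippet below is a minimal illustration on a hypothetical toy vector (ismale is made up here to mirror the first five rows above) and only assumes that CategoricalArrays.jl is installed:

```julia
using CategoricalArrays

# Hypothetical toy data mirroring the first five :ISMALE values shown above
ismale = [0, 1, 1, 1, 0]

# Wrap the integers in a categorical vector
sex = categorical(ismale)

# The distinct values become the levels, sorted in ascending order
sex_levels = levels(sex)

# Elements still compare equal to the original integers
same_values = sex == ismale
```

Nothing about the values changes; only the storage and the attached set of levels do.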

2 🔪 cut() function

We can also create CategoricalArrays with the cut() function, which takes a numerical array, breaks it into intervals at certain values, and turns those intervals into categories/factors.

For example, let’s “cut” the column :eGFR into "low" and "high" based on whether the value is lower or higher than the median value:

@transform! df :eGRF_cat = cut(:eGFR, 2; labels = ["low", "high"])
first(df, 5)
5×8 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat…
1 1 34.823 38.212 1.1129 0 42.635 0 low
2 2 32.765 74.838 0.8846 1 126.0 1 high
3 3 35.974 37.303 1.1004 1 48.981 1 low
4 4 38.206 32.969 1.1972 1 38.934 1 low
5 5 33.559 47.139 1.5924 0 37.198 0 low
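Besides a number of quantile groups, cut() also accepts explicit break points. The sketch below uses the five :eGFR values shown above with a hypothetical threshold of 60 (chosen only for illustration, not the median used in the tutorial) to show how the intervals map to labels:

```julia
using CategoricalArrays

# First five :eGFR values from the table above
egfr = [42.635, 126.0, 48.981, 38.934, 37.198]

# Hypothetical break points: [0, 60) becomes "low", [60, 200) becomes "high"
egfr_cat = cut(egfr, [0.0, 60.0, 200.0]; labels = ["low", "high"])

# cut() returns an ordered categorical array, so its values can be compared
is_ordered = isordered(egfr_cat)
low_before_high = egfr_cat[1] < egfr_cat[2]
```

With explicit breaks, every value must fall inside the outermost break points, which is why the range here is deliberately wide.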

2.1 CategoricalValue

As you can see, the underlying data was transformed into a CategoricalValue object that holds two types:

  1. An Int64: this is the underlying data type from the original data from the :ISMALE column.
  2. A UInt32: this is the optimized data type to represent the categorical/factor data from the :ISMALE column.

Under the hood, CategoricalArrays.jl creates a sort of dictionary that keeps track of the original values and the new representation. By default, the optimized data type will always be a UInt32. In most cases this works fine, and we get some performance gains since storing UInt32s takes less memory than storing Int64s. Imagine if we had large strings instead of Int64s: those gains would be much higher.

We can ask CategoricalArrays.jl to compress the data, if possible, to something smaller than UInt32s. This is done with the compress keyword argument. CategoricalArrays.jl will then scan all the unique values of the target categorical vector to determine its size (cardinality) and choose an appropriate UInt size.

For example, our :ISMALE column, which has only 2 unique elements, can easily be represented with a UInt8, which has room for \(2^8 = 256\) distinct values:

@transform! df :SEX = categorical(:SEX; compress = true)
first(df, 5)
5×8 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat…
1 1 34.823 38.212 1.1129 0 42.635 0 low
2 2 32.765 74.838 0.8846 1 126.0 1 high
3 3 35.974 37.303 1.1004 1 48.981 1 low
4 4 38.206 32.969 1.1972 1 38.934 1 low
5 5 33.559 47.139 1.5924 0 37.198 0 low
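The table display does not reveal the reference type, so here is a small sketch on a hypothetical toy vector that checks it through the array's type parameters. It assumes only CategoricalArrays.jl:

```julia
using CategoricalArrays

# Hypothetical toy data with only two distinct values
ismale = [0, 1, 1, 1, 0]

# Default: level references are stored as UInt32
default_refs = categorical(ismale)

# compress = true picks the smallest unsigned integer type that fits the levels
small_refs = categorical(ismale; compress = true)

# compress() does the same for an existing categorical array
also_small = compress(default_refs)

# The third type parameter of CategoricalArray is the reference type
uses_uint32 = default_refs isa CategoricalArray{Int, 1, UInt32}
uses_uint8 = small_refs isa CategoricalArray{Int, 1, UInt8}
```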
Note

CategoricalArrays with the ordered = false (default) argument only support the equality (==) and inequality (!=) operators for comparisons. If you need to order categories/factors and compare them using the greater-than (>, >=) or less-than (<, <=) operators, you’ll need ordered factors/categories, created with the ordered = true argument.
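A minimal sketch of this difference, using a hypothetical toy vector:

```julia
using CategoricalArrays

# Unordered by default
v = categorical(["low", "high", "low"])

# Equality works on unordered values
equal_ok = v[1] == v[3]

# Order comparisons on unordered values throw an error
threw = try
    v[1] < v[2]
    false
catch
    true
end

# Marking the array as ordered enables <, <=, > and >=
ordered!(v, true)

# Without custom levels, string levels are sorted alphabetically,
# so "high" counts as lower than "low" here
high_lt_low = v[2] < v[1]
```

The last comparison shows why ordered factors usually need explicitly specified levels as well.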

3 🚧 Recoding factors/categories

One of the most common operations using factors and categories is the dreadful recoding. If you come from R and tidyverse, this is much easier to do in Julia.

We can recode CategoricalArrays with the recode()/recode!() function which accepts as first argument a CategoricalArray, followed by Pairs of old_value => new_value.

Suppose we would like to change the values in the :SEX column to "male" or "female":

@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male")
first(df, 5)
5×8 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat…
1 1 34.823 38.212 1.1129 0 42.635 female low
2 2 32.765 74.838 0.8846 1 126.0 male high
3 3 35.974 37.303 1.1004 1 48.981 male low
4 4 38.206 32.969 1.1972 1 38.934 male low
5 5 33.559 47.139 1.5924 0 37.198 female low
Tip

recode() and recode!() take an optional second positional argument, the default value. Any value that does not match one of the old_values in the Pairs is replaced with this default. It is similar to the TRUE ~ new_value that you’ve probably used in the dplyr::case_when() function.

For example, we could have used this in the previous recode() example:

@transform! df :SEX = recode(:SEX, "male", 0 => "female")
first(df, 5)

This will work because anything that does not match 0 will be recoded as "male".
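To see that both forms agree, here is a small sketch on a hypothetical toy vector that still holds the original 0/1 codes:

```julia
using CategoricalArrays

# Hypothetical toy data: 0/1 codes as in the original :ISMALE column
sex = categorical([0, 1, 1, 1, 0])

# Explicit pairs: every old value is listed
explicit = recode(sex, 0 => "female", 1 => "male")

# Default as second positional argument: anything not matched becomes "male"
with_default = recode(sex, "male", 0 => "female")
```

Both calls return a new array and leave sex untouched; recode!() is the in-place variant.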

4 👉 ordered factors/categories

Often we need to represent ordered factors/categories. Suppose we have a categorical vector of strings with three possible values: "high", "medium" and "low". We’ll probably want to have them ordered to make comparisons, and this is accomplished with the ordered keyword argument.

Let’s first generate a random :INTENSITY which can take those 3 values (just to have a good example):

@transform! df :INTENSITY = rand(["low", "medium", "high"], nrow(df))
first(df, 5)
5×9 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat INTENSITY
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat… String
1 1 34.823 38.212 1.1129 0 42.635 female low medium
2 2 32.765 74.838 0.8846 1 126.0 male high medium
3 3 35.974 37.303 1.1004 1 48.981 male low high
4 4 38.206 32.969 1.1972 1 38.934 male low low
5 5 33.559 47.139 1.5924 0 37.198 female low medium

Now we call categorical() on the :INTENSITY column with the ordered = true as a keyword argument:

@transform! df :INTENSITY = categorical(:INTENSITY; ordered = true)
first(df, 5)
5×9 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat INTENSITY
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat… Cat…
1 1 34.823 38.212 1.1129 0 42.635 female low medium
2 2 32.765 74.838 0.8846 1 126.0 male high medium
3 3 35.974 37.303 1.1004 1 48.981 male low high
4 4 38.206 32.969 1.1972 1 38.934 male low low
5 5 33.559 47.139 1.5924 0 37.198 female low medium

5 levels – custom levels

By default, categorical(:column; ordered = true) sorts the levels in ascending order whenever the values can be sorted, which for strings means alphabetical order. So "high" comes first and is encoded as the lowest value, below "low" and "medium". This is not what we want, since the resulting comparison does not behave as intended:

# "high" > "medium" (should be true)
df[3, :INTENSITY] > df[1, :INTENSITY]
false

To remedy this, we can use the levels keyword argument in the categorical() function. levels accepts a vector of strings and will interpret them as lowest to highest:

@transform! df :INTENSITY =
    categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"])
first(df, 5)
5×9 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat INTENSITY
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat… Cat…
1 1 34.823 38.212 1.1129 0 42.635 female low medium
2 2 32.765 74.838 0.8846 1 126.0 male high medium
3 3 35.974 37.303 1.1004 1 48.981 male low high
4 4 38.206 32.969 1.1972 1 38.934 male low low
5 5 33.559 47.139 1.5924 0 37.198 female low medium

Now our comparison behaves as expected:

# "high" > "medium" (should be true)
df[3, :INTENSITY] > df[1, :INTENSITY]
true
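The effect of the levels keyword can also be checked on a plain vector. The sketch below reuses the five :INTENSITY values shown in the table above; the variable names are made up for illustration:

```julia
using CategoricalArrays

# The five :INTENSITY values from the table above
intensity = ["medium", "medium", "high", "low", "medium"]

# Without levels, string levels are sorted alphabetically
default_order = categorical(intensity; ordered = true)
default_levels = levels(default_order)
default_cmp = default_order[3] > default_order[1]   # "high" > "medium"

# With explicit levels, the given order runs from lowest to highest
custom_order = categorical(intensity; ordered = true, levels = ["low", "medium", "high"])
custom_cmp = custom_order[3] > custom_order[1]      # "high" > "medium"
```

Here default_cmp is false because "high" sorts before "medium" alphabetically, while custom_cmp is true.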

You can also see what levels a CategoricalArray has with the levels() function:

levels(df.INTENSITY)
3-element Vector{String}:
 "low"
 "medium"
 "high"
levels(df.SEX)
2-element Vector{String}:
 "female"
 "male"
Tip

You can also change in-place the levels of a CategoricalArray using the levels!() function. What we accomplished above with the categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"]) function could be done with:

levels!(df.INTENSITY, ["low", "medium", "high"])
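As a quick check on a hypothetical toy vector, levels!() reorders the levels in place and leaves the data itself unchanged:

```julia
using CategoricalArrays

# Hypothetical toy data, ordered with the default (alphabetical) levels
x = categorical(["medium", "high", "low"]; ordered = true)

# Reorder the levels in place, from lowest to highest
levels!(x, ["low", "medium", "high"])

# "high" is now greater than "medium"
high_gt_medium = x[2] > x[1]
```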

6 🔢 Converting factors/categories to numerical values

Handling factors in Julia is easy, but there are also times when we need to transform them back to numerical values. This is often necessary when we pass a DataFrame with one or more CategoricalArrays into a model.

This is easily accomplished with the levelcode() function. For example, let’s convert our :INTENSITY column into a numerical column called :INTENSITY_NUM:

@rtransform! df :INTENSITY_NUM = levelcode(:INTENSITY)
first(df, 5)
5×10 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR SEX eGRF_cat INTENSITY INTENSITY_NUM
Int64 Float64 Float64 Float64 Int64 Float64 Cat… Cat… Cat… Int64
1 1 34.823 38.212 1.1129 0 42.635 female low medium 2
2 2 32.765 74.838 0.8846 1 126.0 male high medium 2
3 3 35.974 37.303 1.1004 1 48.981 male low high 3
4 4 38.206 32.969 1.1972 1 38.934 male low low 1
5 5 33.559 47.139 1.5924 0 37.198 female low medium 2

And it works! Our :INTENSITY_NUM column has the values 1, 2 and 3, respectively for the "low", "medium" and "high" levels.

Caution

Note that the levelcode() function acts on a CategoricalValue and not on a CategoricalArray. In other words, it must be applied to every element of the CategoricalArray and not to the array itself.

This can be done with the @rtransform[!]/@rselect[!] macros or using the @byrow macro.
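Outside of a DataFrame, the same element-wise application can be written with Julia's broadcasting dot syntax. A minimal sketch, again using the five :INTENSITY values shown above:

```julia
using CategoricalArrays

intensity = categorical(
    ["medium", "medium", "high", "low", "medium"];
    ordered = true,
    levels = ["low", "medium", "high"],
)

# levelcode() works on a single CategoricalValue ...
first_code = levelcode(intensity[1])

# ... so broadcast it with a dot to cover the whole array
codes = levelcode.(intensity)
```

The resulting codes match the :INTENSITY_NUM column above: 1 for "low", 2 for "medium" and 3 for "high".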

7 🎁 Extra Operations with Factors and Categories

This section of the tutorial is geared towards users coming from R’s tidyverse who are used to the extensive set of special-purpose functions in forcats.

First, notice that these are the exported functions from CategoricalArrays.jl:

  • categorical(A) - Construct a categorical array with values from A
  • compress(A) - Return a copy of categorical array A using the smallest possible reference type
  • cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray
  • decompress(A) - Return a copy of categorical array A using the default reference type
  • isordered(A) - Test whether entries in A can be compared using <, > and similar operators
  • ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators
  • recode(a[, default], pairs...) - Return a copy of a after replacing one or more values
  • recode!(a[, default], pairs...) - Replace one or more values in a in-place
  • unwrap(x) - Return the value contained in categorical value x; if x is Missing return missing
  • levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x)
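A few of the functions in this list have not appeared so far. The sketch below tries them on a hypothetical toy vector; the variable names are made up for illustration:

```julia
using CategoricalArrays

# Hypothetical toy data
v = categorical(["b", "a", "b"])

# unwrap() returns the plain value behind a CategoricalValue
raw = unwrap(v[1])

# Arrays are unordered by default; ordered!() switches that in place
ordered_before = isordered(v)
ordered!(v, true)
ordered_after = isordered(v)

# compress() and decompress() return copies with a different reference type
small = compress(v)
back = decompress(small)
```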

We don’t have as many functions as forcats because, with Julia’s multiple dispatch, rich ecosystem of packages and tightly integrated standard library, we do not need them. We can use other Julia functions that were either extended to work on CategoricalArrays and CategoricalValues or work right out of the box without any customization.

@by replaces fct_count()

The first one is fct_count(), which is perfectly and elegantly replaced by the macro @by with the length() function:

@by df :SEX :count = length(:SEX)
2×2 DataFrame
Row SEX count
Cat… Int64
1 female 53
2 male 47
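When the categorical vector does not live in a DataFrame, a similar count can be sketched with base Julia alone. This is an alternative illustration on hypothetical toy data, not the approach used in the tutorial:

```julia
using CategoricalArrays

# Hypothetical toy data
sex = categorical(["female", "male", "male", "male", "female"])

# One (level => count) pair per level, in level order
counts = [string(l) => count(==(l), sex) for l in levels(sex)]
```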

in replaces fct_match()

To test whether a CategoricalArray has a certain level, just check whether that level is in the levels returned by the function levels():

"female" in levels(df.SEX)
true

vcat() replaces fct_c()

Julia already has a vertical concatenation function for vectors and other collections. So why create a new one? It is much easier and more natural to extend vcat().

This is exactly what CategoricalArrays.jl has done. Let’s first create two categorical vectors:

cat1 = categorical(["a", "b"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
cat2 = categorical(["c", "d"])
2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "c"
 "d"

Now we just vcat() them together:

vcat(cat1, cat2)
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"
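When the two vectors share some levels, vcat() merges them instead of duplicating them. A small sketch with hypothetical toy vectors:

```julia
using CategoricalArrays

# Hypothetical toy data: "b" is a level of both vectors
left = categorical(["a", "b"])
right = categorical(["b", "c"])

# The result is still categorical and its levels are the merged set
combined = vcat(left, right)
combined_levels = levels(combined)
```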

reverse() replaces fct_rev()

The same argument as for vcat() and fct_c() applies here. The reverse() function works out of the box on the vector returned by levels():

levels(df.SEX)
2-element Vector{String}:
 "female"
 "male"
reverse(levels(df.SEX))
2-element Vector{String}:
 "male"
 "female"

Then, just do an in-place levels!() on your CategoricalArray:

levels!(df.SEX, reverse(levels(df.SEX)))
100-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "female"
 "male"
 "male"
 "male"
 "female"
 "female"
 "female"
 "female"
 "female"
 "male"
 ⋮
 "male"
 "female"
 "female"
 "male"
 "male"
 "male"
 "female"
 "male"
 "female"

Just to confirm that it worked correctly:

levels(df.SEX) # 👍
2-element Vector{String}:
 "male"
 "female"

recode() replaces fct_other()

recode() with the optional positional argument default replaces the fct_other() function:

big_cat = categorical(["a", "b", "c", "d"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

Suppose you only want to retain "a" and "d" categories and the rest will be coded as "other":

recode(big_cat, "other", "a" => "a", "d" => "d")
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "other"
 "other"
 "d"

shuffle() replaces fct_shuffle()

This is a nice replacement. Just set new levels in place with the levels!() function, passing shuffle() applied to the original levels as the new levels:

using Random: shuffle
shuffle(levels(big_cat))
4-element Vector{String}:
 "b"
 "d"
 "c"
 "a"
levels!(big_cat, shuffle(levels(big_cat)))
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

Just to confirm that it worked correctly:

levels(big_cat) # 👍
4-element Vector{String}:
 "c"
 "a"
 "b"
 "d"