Handling Factors and Categorical Data with `CategoricalArrays.jl`

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

Note

We begin by setting a random seed with seed!() to make this notebook reproducible:

using Random: seed!
seed!(123)

In this tutorial, we tackle qualitative data, also known as categorical data or, for users coming from R and the tidyverse, factors.

The main package in the Julia ecosystem which deals with this kind of data is CategoricalArrays.jl with the categorical() function.

By default, categorical() works on vectors, but we can use it inside the @[r]transform[!] macros from DataFramesMeta.jl to transform any DataFrame column.
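Before moving to a DataFrame, here is a minimal sketch of categorical() on a plain vector. The treatment labels are made up for illustration and are not part of the tutorial dataset.

```julia
using CategoricalArrays

# Hypothetical values, used only for illustration
v = ["placebo", "drug", "drug", "placebo", "drug"]

# categorical() wraps the vector in a CategoricalArray
c = categorical(v)

# The distinct values become the levels (strings are sorted alphabetically)
levels(c)

# Each element is a CategoricalValue, but it still compares equal to the original string
c[1] == "placebo"
```

The same call is what runs on a whole column when categorical() is used inside the DataFramesMeta.jl macros.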

To start, let’s import our df dataset from the previous tutorials:

using PharmaDatasets
using DataFrames
using DataFramesMeta

df = dataset("demographics_1")
first(df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

1 📊 `categorical()` function

The categorical() function from CategoricalArrays.jl takes as an argument a vector and returns a vector with a qualitative representation of the underlying data. In our case it will be the :ISMALE column from the df DataFrame, which should be represented as a factor/category.

First, we need to import the CategoricalArrays.jl package with the using statement:

using CategoricalArrays

Now we can use any of the @[r]select/@[r]transform macros from DataFramesMeta.jl to apply the categorical() function to the :ISMALE column and save it as a new column called :SEX:

@transform! df :SEX = categorical(:ISMALE)
first(df, 5)

5×7 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…
1	1	34.823	38.212	1.1129	0	42.635	0
2	2	32.765	74.838	0.8846	1	126.0	1
3	3	35.974	37.303	1.1004	1	48.981	1
4	4	38.206	32.969	1.1972	1	38.934	1
5	5	33.559	47.139	1.5924	0	37.198	0

2 🔪 `cut()` function

We can also create CategoricalArrays with the cut() function, which takes a numerical array, breaks it into intervals at certain values and turns those intervals into categories/factors.

For example, let’s “cut” the column :eGFR into "low" and "high" depending on whether the value is lower or higher than the median:

@transform! df :eGRF_cat = cut(:eGFR, 2; labels = ["low", "high"])
first(df, 5)

5×8 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…
1	1	34.823	38.212	1.1129	0	42.635	0	low
2	2	32.765	74.838	0.8846	1	126.0	1	high
3	3	35.974	37.303	1.1004	1	48.981	1	low
4	4	38.206	32.969	1.1972	1	38.934	1	low
5	5	33.559	47.139	1.5924	0	37.198	0	low
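To see what cut() does in isolation, the following sketch applies it to a small vector. The numbers are made up for illustration; only the call pattern mirrors the one used on :eGFR.

```julia
using CategoricalArrays

# Hypothetical numeric values, used only for illustration
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

# Two quantile-based groups, labelled from the lowest to the highest interval
groups = cut(x, 2; labels = ["low", "high"])

levels(groups)            # the labels, lowest interval first
isordered(groups)         # cut() returns an ordered CategoricalArray
groups[1] < groups[end]   # the smallest value falls in a lower group than the largest
```

Because the result is ordered, the groups can be compared with < and > without any extra setup.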

2.1 `CategoricalValue`

As you can see, the underlying data was transformed into a CategoricalValue object that holds two types:

An Int64: this is the underlying data type from the original data from the :ISMALE column.
A UInt32: this is the optimized data type to represent the categorical/factor data from the :ISMALE column.

Under the hood, CategoricalArrays.jl creates a sort of dictionary that keeps track of the original values and the new representation. By default, the optimized data type will always be a UInt32. In most cases this works fine: we get some performance gains because storing UInt32s takes less memory than storing Int64s. Imagine if we had large strings instead of Int64s: those gains would be much higher.

We can ask CategoricalArrays.jl to compress the data, if possible, to something smaller than UInt32s. This is done with the compress keyword argument. CategoricalArrays.jl will then scan all the unique values of the target categorical vector to define its size (cardinality) to choose an appropriate UInt size.

For example, our :ISMALE column, which has only 2 unique elements, can easily be represented as a UInt8 (\(2^8 = 256\) possible values):

@transform! df :SEX = categorical(:SEX; compress = true)
first(df, 5)

5×8 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…
1	1	34.823	38.212	1.1129	0	42.635	0	low
2	2	32.765	74.838	0.8846	1	126.0	1	high
3	3	35.974	37.303	1.1004	1	48.981	1	low
4	4	38.206	32.969	1.1972	1	38.934	1	low
5	5	33.559	47.139	1.5924	0	37.198	0	low
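The DataFrame printout does not show the reference type, so here is a small sketch that checks it directly on a standalone vector. It relies on the reference type being the third type parameter, as in the printed type CategoricalArray{String,1,UInt32} that appears in the outputs of this tutorial. The values are made up.

```julia
using CategoricalArrays

# Hypothetical 0/1 values, used only for illustration
sex = categorical([0, 1, 1, 1, 0])

# The default reference type is UInt32 (third type parameter)
sex isa CategoricalArray{Int, 1, UInt32}

# compress() returns a copy that uses the smallest reference type that fits
small = compress(sex)
small isa CategoricalArray{Int, 1, UInt8}

# The stored values are unchanged
small == sex
```

Passing compress = true to categorical() should give the same result as calling compress() afterwards.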

Note

A CategoricalArray created with ordered = false (the default) only supports the equality (==) and inequality (!=) operators for comparisons. If you need to order categories/factors and compare them with the greater-than (>, >=) or less-than (<, <=) operators, you’ll need ordered factors/categories, created with the ordered = true argument.
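A short sketch of this difference, using a made-up severity vector. The error is caught generically here, so the example does not depend on the exact exception type.

```julia
using CategoricalArrays

# Hypothetical values; unordered by default
severity = categorical(["mild", "severe", "mild"])

severity[1] == severity[3]   # equality works on unordered values
severity[1] != severity[2]   # so does inequality

# An order comparison on unordered values throws an error
result = try
    severity[1] < severity[2]
catch err
    err
end
result isa Exception

# After marking the array as ordered, < follows the level order
ordered!(severity, true)
severity[1] < severity[2]    # "mild" < "severe"
```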

3 🚧 Recoding factors/categories

One of the most common operations using factors and categories is the dreadful recoding. If you come from R and tidyverse, this is much easier to do in Julia.

We can recode CategoricalArrays with the recode()/recode!() function which accepts as first argument a CategoricalArray, followed by Pairs of old_value => new_value.

Suppose we would like to change the values in the :SEX column to "male" or "female":

@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male")
first(df, 5)

5×8 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…
1	1	34.823	38.212	1.1129	0	42.635	female	low
2	2	32.765	74.838	0.8846	1	126.0	male	high
3	3	35.974	37.303	1.1004	1	48.981	male	low
4	4	38.206	32.969	1.1972	1	38.934	male	low
5	5	33.559	47.139	1.5924	0	37.198	female	low

Tip

recode() and recode!() take an optional second argument as the default value. Any value that does not match an old_value in one of the Pairs is replaced with this default. It is similar to the TRUE ~ new_value that you’ve probably used in the dplyr::case_when() function.

For example, we could have used this in place of the previous recode() call, while :SEX still held the values 0 and 1:

@transform! df :SEX = recode(:SEX, "male", 0 => "female")
first(df, 5)

This will work because anything that does not match 0 will be recoded as "male".
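As a quick check on a standalone vector (the 0/1 values are made up), the explicit form and the form with a default give the same result:

```julia
using CategoricalArrays

# Hypothetical 0/1 values, used only for illustration
ismale = categorical([0, 1, 1, 0])

# Every value listed explicitly
explicit = recode(ismale, 0 => "female", 1 => "male")

# Only 0 listed; everything else falls back to the default "male"
with_default = recode(ismale, "male", 0 => "female")

explicit == ["female", "male", "male", "female"]
with_default == ["female", "male", "male", "female"]
```

The default form is convenient, but it also silently absorbs unexpected values, so the explicit form is safer when the data may contain codes you did not anticipate.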

4 👉 `ordered` factors/categories

Often we need to represent ordered factors/categories. Suppose we have a vector of strings with three categories: "high", "medium" and "low". We’ll probably want them ordered so that we can make comparisons, and this is accomplished with the ordered keyword argument.

Let’s first generate a random :INTENSITY which can take those 3 values (just to have a good example):

@transform! df :INTENSITY = rand(["low", "medium", "high"], nrow(df))
first(df, 5)

5×9 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat	INTENSITY
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…	String
1	1	34.823	38.212	1.1129	0	42.635	female	low	medium
2	2	32.765	74.838	0.8846	1	126.0	male	high	medium
3	3	35.974	37.303	1.1004	1	48.981	male	low	high
4	4	38.206	32.969	1.1972	1	38.934	male	low	low
5	5	33.559	47.139	1.5924	0	37.198	female	low	medium

Now we call categorical() on the :INTENSITY column with ordered = true as a keyword argument:

@transform! df :INTENSITY = categorical(:INTENSITY; ordered = true)
first(df, 5)

5×9 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat	INTENSITY
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…	Cat…
1	1	34.823	38.212	1.1129	0	42.635	female	low	medium
2	2	32.765	74.838	0.8846	1	126.0	male	high	medium
3	3	35.974	37.303	1.1004	1	48.981	male	low	high
4	4	38.206	32.969	1.1972	1	38.934	male	low	low
5	5	33.559	47.139	1.5924	0	37.198	female	low	medium

5 ℹ `levels` – custom levels

By default, categorical(:column; ordered = true) sorts the levels in ascending order whenever the values can be sorted, so strings end up in alphabetical order: "high" < "low" < "medium". This is not what we want, since "high" is then encoded as the lowest value and the resulting comparison does not behave as intended:

# "high" > "medium" (should be true)
df[3, :INTENSITY] > df[1, :INTENSITY]

false

To remedy this, we can use the levels keyword argument in the categorical() function. levels accepts a vector of strings and will interpret them as lowest to highest:

@transform! df :INTENSITY =
    categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"])
first(df, 5)

5×9 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat	INTENSITY
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…	Cat…
1	1	34.823	38.212	1.1129	0	42.635	female	low	medium
2	2	32.765	74.838	0.8846	1	126.0	male	high	medium
3	3	35.974	37.303	1.1004	1	48.981	male	low	high
4	4	38.206	32.969	1.1972	1	38.934	male	low	low
5	5	33.559	47.139	1.5924	0	37.198	female	low	medium

Now our comparison behaves as expected:

# "high" > "medium" (should be true)
df[3, :INTENSITY] > df[1, :INTENSITY]

true
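The two situations can be compared side by side on a standalone vector. This sketch uses the same five values as the first rows of :INTENSITY shown in the tables, but does not touch df.

```julia
using CategoricalArrays

intensity = ["medium", "medium", "high", "low", "medium"]

# Without `levels`, string levels are sorted alphabetically
default_order = categorical(intensity; ordered = true)
levels(default_order)                 # "high", "low", "medium"
default_order[3] > default_order[1]   # "high" > "medium" is false here

# With explicit levels, the order is the one we pass in
custom_order = categorical(intensity; ordered = true, levels = ["low", "medium", "high"])
levels(custom_order)                  # "low", "medium", "high"
custom_order[3] > custom_order[1]     # "high" > "medium" is now true
```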

You can also see what levels a CategoricalArray has with the levels() function:

levels(df.INTENSITY)

3-element Vector{String}:
 "low"
 "medium"
 "high"

levels(df.SEX)

2-element Vector{String}:
 "female"
 "male"

Tip

You can also change the levels of a CategoricalArray in place using the levels!() function. What we accomplished above with the levels keyword argument in categorical(:INTENSITY; ordered = true, levels = ["low", "medium", "high"]) could be done with:

levels!(df.INTENSITY, ["low", "medium", "high"])
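Note that levels!() only changes the level order. If the array was created without ordered = true, ordered!() is also needed before < and > can be used. A minimal sketch on a made-up vector:

```julia
using CategoricalArrays

# Hypothetical values; unordered, with alphabetical levels by default
intensity = categorical(["medium", "high", "low"])

levels!(intensity, ["low", "medium", "high"])   # reorder the levels in place
ordered!(intensity, true)                       # allow <, > comparisons

levels(intensity)
isordered(intensity)
intensity[2] > intensity[1]                     # "high" > "medium"
```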

6 🔢 Converting factors/categories to numerical values

Handling factors in Julia is easy, but there are also times when we need to transform them back to numerical values. This is often needed when we pass a DataFrame with one or more CategoricalArrays into a model.

This is easily accomplished with the levelcode() function. For example, let’s convert our :INTENSITY column into a numerical column called :INTENSITY_NUM:

@rtransform! df :INTENSITY_NUM = levelcode(:INTENSITY)
first(df, 5)

5×10 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	SEX	eGRF_cat	INTENSITY	INTENSITY_NUM
	Int64	Float64	Float64	Float64	Int64	Float64	Cat…	Cat…	Cat…	Int64
1	1	34.823	38.212	1.1129	0	42.635	female	low	medium	2
2	2	32.765	74.838	0.8846	1	126.0	male	high	medium	2
3	3	35.974	37.303	1.1004	1	48.981	male	low	high	3
4	4	38.206	32.969	1.1972	1	38.934	male	low	low	1
5	5	33.559	47.139	1.5924	0	37.198	female	low	medium	2

And it works! Our :INTENSITY_NUM column has the values 1, 2 and 3, respectively for the "low", "medium" and "high" levels.

Caution

Note that the levelcode() function acts on a CategoricalValue and not on a CategoricalArray. In other words, it must be applied to every element of the CategoricalArray and not to the array itself.

This can be done with the @rtransform[!]/@rselect[!] macros or using the @byrow macro.
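Outside of a DataFrame, the same element-wise application can be written with Julia’s broadcasting dot. A small sketch on a made-up vector:

```julia
using CategoricalArrays

# Hypothetical values, used only for illustration
intensity = categorical(["medium", "medium", "high", "low"];
                        ordered = true, levels = ["low", "medium", "high"])

# levelcode() on a single CategoricalValue
levelcode(intensity[3])        # "high" is the third level

# Broadcasting applies it to every element of the array
codes = levelcode.(intensity)
```

This is what @rtransform! does for us row by row in the DataFrame example.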

7 🎁 Extra Operations with Factors and Categories

This section of the tutorial is geared towards users coming from R’s tidyverse who are used to the extensive ad-hoc functions of forcats.

First, notice that these are the exported functions from CategoricalArrays.jl:

categorical(A) - Construct a categorical array with values from A
compress(A) - Return a copy of categorical array A using the smallest possible reference type
cut(x) - Cut a numeric array into intervals and return an ordered CategoricalArray
decompress(A) - Return a copy of categorical array A using the default reference type
isordered(A) - Test whether entries in A can be compared using <, > and similar operators
ordered!(A, ordered) - Set whether entries in A can be compared using <, > and similar operators
recode(a[, default], pairs...) - Return a copy of a after replacing one or more values
recode!(a[, default], pairs...) - Replace one or more values in a in-place
unwrap(x) - Return the value contained in categorical value x; if x is Missing return missing
levelcode(x) - Return the code of categorical value x, i.e. its index in the set of possible values returned by levels(x)
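Of the functions listed above, unwrap() is the one not used elsewhere in this tutorial. A brief sketch on a made-up vector shows how it recovers the original value from a CategoricalValue:

```julia
using CategoricalArrays

# Hypothetical values, used only for illustration
sex = categorical(["female", "male", "male"])

value = sex[1]              # a CategoricalValue
value isa CategoricalValue

raw = unwrap(value)         # the plain String it wraps
raw isa String

# Broadcasting unwrap gives back plain strings for the whole array
plain = unwrap.(sex)
```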

We don’t have as many functions as forcats because, with Julia’s multiple dispatch, rich ecosystem of packages and tightly integrated standard library, we do not need them. We can use other Julia functions that were either extended to work on CategoricalArrays and CategoricalValues or that work right out of the box without any customization.

@by with length() replaces fct_count()

The first one is fct_count(), which is neatly replaced by the @by macro together with the length() function:

@by df :SEX :count = length(:SEX)

2×2 DataFrame

Row	SEX	count
	Cat…	Int64
1	female	53
2	male	47
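If the categorical vector is not part of a DataFrame, a similar count can be sketched with base Julia’s count() and a comprehension over the levels. The values below are made up.

```julia
using CategoricalArrays

# Hypothetical values, used only for illustration
sex = categorical(["female", "male", "male", "female", "female"])

# One `level => count` pair per level, in level order
counts = [level => count(==(level), sex) for level in levels(sex)]
```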

in with levels() replaces fct_match()

To test whether a CategoricalArray has a certain level, check whether that level is in the vector returned by the levels() function:

"female" in levels(df.SEX)

true
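One caveat worth sketching: a level can be declared without any element actually using it, so testing the levels is not the same as testing the data. The vector below is made up for illustration.

```julia
using CategoricalArrays

# "c" is declared as a level, but no element has that value
x = categorical(["a", "b"]; levels = ["a", "b", "c"])

in_levels = "c" in levels(x)   # true: "c" is a declared level
in_data = any(==("c"), x)      # false: no element equals "c"
```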

vcat() replaces fct_c()

Julia already has a vertical concatenation function for vectors and other collections, so why create a new one? It is much easier and more natural to extend vcat().

This is exactly what CategoricalArrays.jl has done. Let’s first create two categorical vectors:

cat1 = categorical(["a", "b"])

2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"

cat2 = categorical(["c", "d"])

2-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "c"
 "d"

Now we just vcat() them together:

vcat(cat1, cat2)

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

reverse() replaces fct_rev()

The same argument as for replacing fct_c() with vcat() applies here. The reverse() function works directly on the vector returned by levels():

levels(df.SEX)

2-element Vector{String}:
 "female"
 "male"

reverse(levels(df.SEX))

2-element Vector{String}:
 "male"
 "female"

Then, just do an in-place levels!() on your CategoricalArray:

levels!(df.SEX, reverse(levels(df.SEX)))

100-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "female"
 "male"
 "male"
 "male"
 "female"
 "female"
 "female"
 "female"
 "female"
 "male"
 ⋮
 "male"
 "female"
 "female"
 "male"
 "male"
 "male"
 "female"
 "male"
 "female"

Just to confirm that it worked correctly:

levels(df.SEX) # 👍

2-element Vector{String}:
 "male"
 "female"

recode() replaces fct_other()

recode() with the optional positional argument default replaces the fct_other() function:

big_cat = categorical(["a", "b", "c", "d"])

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

Suppose you only want to retain "a" and "d" categories and the rest will be coded as "other":

recode(big_cat, "other", "a" => "a", "d" => "d")

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "other"
 "other"
 "d"

shuffle() replaces fct_shuffle()

This is a nice replacement. Just set new levels in place with the levels!() function, passing shuffle() of the original levels as the new levels:

using Random: shuffle

shuffle(levels(big_cat))

4-element Vector{String}:
 "b"
 "d"
 "c"
 "a"

levels!(big_cat, shuffle(levels(big_cat)))

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

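Just to confirm that it worked correctly (the order differs from the shuffle() output above because each call draws a new random permutation):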

levels(big_cat) # 👍

4-element Vector{String}:
 "c"
 "a"
 "b"
 "d"

Reuse

CC BY-SA 4.0
