Plotting Statistical Visualizations with AlgebraOfGraphics.jl

Authors

Jose Storopoli

Juan Oneto

In this tutorial, we’ll explore different statistical transformations that we can apply to our visualizations with AoG.jl.

These can be added to any AoG.jl layer with the * operator.

We will also cover some geometries that are commonly used with the statistical visualization functions: Contour and Heatmap.

Statistical transformations are paired with defaults for visual(), so you don’t need to specify a visual() yourself. You just add the data() layer along with a mapping() layer and finalize with a statistical transformation layer.

We’ll cover 6 statistical transformation functions: histogram(), density(), frequency(), expectation(), linear() and smooth().

1 📋 stat_*() - AoG.jl Table

The table below maps ggplot2’s geom_*() and stat_*() functions to their AoG.jl statistical visualization counterparts:

ggplot2 | AoG.jl
geom_bar() or stat_count() | frequency()
stat_summary(fun = "mean") | expectation()
geom_histogram() or stat_bin() | histogram()
geom_density() or stat_density() | density()
geom_smooth() or stat_smooth() | smooth()
geom_smooth(method = "lm") or stat_smooth(method = "lm") | linear()

2 📈 Statistical Transformations with AoG.jl

To start, let us load the data wrangling libraries and the DataFrame we’ve used previously:

using PharmaDatasets
using DataFramesMeta

df = dataset("demographics_1")
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198

We will also convert some columns to categorical arrays with CategoricalArrays.jl:

using CategoricalArrays
@transform! df :SEX = categorical(:ISMALE);
@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male");
@transform! df :WEIGHT_cat = cut(:WEIGHT, 2; labels = ["light", "heavy"])

And now load AoG.jl along with CairoMakie.jl:

using CairoMakie
using AlgebraOfGraphics

2.1 histogram()

The first statistical transformation is the histogram() function, which performs a binning operation on the data and outputs a histogram.

For example, if we want to visualize a histogram of the column :AGE, we can do so easily:

data(df) * mapping(:AGE) * histogram() |> draw

We can also add faceting to our histogram by specifying either a layout, row or col as keyword arguments inside mapping():

data(df) * mapping(:AGE; layout = :SEX) * histogram() |> draw

If you want to customize other aspects of the histogram’s underlying bar plot, you can add a visual() layer without any plotting type inside, passing only the desired customizations as keyword arguments:

data(df) * mapping(:AGE; layout = :SEX) * histogram() * visual(; color = :blue) |> draw

The number of bins to use is determined automatically. But if you want, you can customize with the keyword argument bins inside histogram():

data(df) * mapping(:AGE) * histogram(; bins = 20) |> draw

histogram() also has a normalization keyword argument which lets you specify a normalization scheme. There are 4 possible normalization schemes and you can specify them using the corresponding Symbols:

  1. :none: the default, no normalization applied, i.e. raw count.
  2. :pdf: normalize by sum of weights and bin sizes. The resulting histogram behaves as a probability density function (PDF): the areas of the bars (height times bin width) sum to 1.
  3. :density: normalize by bin sizes only. The resulting histogram represents the count density of the input; its total area equals the total count (or sum of weights), not 1.
  4. :probability: normalize by sum of weights only. Each bar is the fraction of probability mass in that bin, so the bar heights sum to 1, but the total area generally does not.

Our advice is to use either :none (the default) for a raw histogram or :pdf for a relative histogram.

data(df) * mapping(:AGE) * histogram(; normalization = :pdf) |> draw

Tip

Note that the default y-axis label changes with normalization. In the default case, :none, the y-label is count and in the :pdf case pdf.
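To make the four normalization schemes concrete, here is a minimal sketch that computes each of them by hand in plain Julia. The data and bin edges are made-up toy values (not taken from df), and this is not how AoG.jl computes its histograms internally; it only illustrates the arithmetic behind each scheme.

```julia
# Toy observations and three bins of unequal width (assumed values).
x = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 7.0]
edges = [0.0, 2.0, 4.0, 8.0]

# Count the observations falling in each half-open bin [lo, hi).
lows, highs = edges[1:end-1], edges[2:end]
counts = [count(v -> lo <= v < hi, x) for (lo, hi) in zip(lows, highs)]
widths = diff(edges)

h_none        = counts                                # raw counts
h_probability = counts ./ sum(counts)                 # heights sum to 1
h_density     = counts ./ widths                      # area equals the total count
h_pdf         = counts ./ (sum(counts) .* widths)     # area equals 1

println("sum of :probability heights = ", sum(h_probability))
println("area under :pdf             = ", sum(h_pdf .* widths))
println("area under :density         = ", sum(h_density .* widths))
```

With unequal bin widths the difference between "heights sum to 1" (:probability) and "area is 1" (:pdf) becomes visible; with equal-width bins the two differ only by a constant factor.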

You can also specify a two-axes histogram() transformation, i.e. a 2-D histogram plot.

This is similar to geom_bin_2d() in ggplot2.

For example, let’s add the column :eGFR to take a look at the relationship between age and kidney function:

data(df) * mapping(:AGE, :eGFR) * histogram() |> draw

Notice that by default a 2-D histogram uses the visual(Heatmap) under the hood:

data(df) * mapping(:AGE, :eGFR) * histogram() * visual(Heatmap) |> draw

We can also change the colormap keyword argument inside visual(). Let’s make a black and white printer-friendly visualization with colormap = Reverse(:greys):

data(df) * mapping(:AGE, :eGFR) * histogram() * visual(; colormap = Reverse(:greys)) |> draw

2.2 density()

Our second statistical function is density(), which computes a kernel density estimate (KDE) of the underlying data.
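As a rough illustration of what a kernel density estimate is, the sketch below builds a Gaussian KDE by hand in plain Julia. The sample values, the helper names gauss and kde, and the bandwidth rule are all assumptions made for this example; AoG.jl relies on its own machinery and defaults, which may differ.

```julia
using Statistics

# Standard normal density, used here as the smoothing kernel.
gauss(u) = exp(-u^2 / 2) / sqrt(2π)

# KDE at a point t: the average of kernels centred on each observation,
# rescaled by the bandwidth h so that the estimate integrates to 1.
kde(t, xs, h) = mean(gauss((t - xi) / h) for xi in xs) / h

xs = [30.0, 32.5, 33.0, 35.0, 38.0]          # made-up ages
h = 1.06 * std(xs) * length(xs)^(-1 / 5)     # normal-reference rule of thumb (assumed choice)

# Evaluate on a grid that extends well past the data on both sides.
grid = range(minimum(xs) - 5h, maximum(xs) + 5h; length = 2001)
dens = [kde(t, xs, h) for t in grid]

# A Riemann sum over the grid should be close to 1.
area = sum(dens) * step(grid)
println("approximate area under the KDE = ", area)
```

A smaller bandwidth h gives a wigglier curve that follows individual observations; a larger one gives a smoother curve.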

Using the same example column :AGE as before, we just change the function to density():

Caution

Both AoG.jl and CairoMakie.jl export a function named density(). So if we just blindly use density() we get an error:

density()
WARNING: both AlgebraOfGraphics and CairoMakie export "density"; uses of it in module Notebook must be qualified
UndefVarError: UndefVarError(:density)

Thus, we need to qualify which density() function we want with Package.function naming. This is why we need to call AlgebraOfGraphics.density().

Or we can also use an import statement:

import AlgebraOfGraphics: density

to tell Julia which density() function to use.

data(df) * mapping(:AGE) * AlgebraOfGraphics.density() |> draw
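The name clash described in the caution above is a general Julia behavior, not something specific to AoG.jl. Here is a small self-contained sketch that reproduces it with two hypothetical modules, PkgA and PkgB (names invented for this example), and shows both ways of resolving it.

```julia
# Two hypothetical modules that both export a function called `density`.
module PkgA
export density
density() = "PkgA.density"
end

module PkgB
export density
density() = "PkgB.density"
end

using .PkgA, .PkgB

# Option 1: qualify the call with the module name.
qualified = PkgA.density()

# Option 2: import the name explicitly, so the bare call is unambiguous.
# This must happen before the bare name `density` is used anywhere.
import .PkgA: density
bare = density()

println(qualified, " / ", bare)
```

Qualifying every call is more verbose but keeps it obvious at the call site which package is being used, which is the style this tutorial follows.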

As before we can customize our plot with keyword arguments using a visual() layer without any plotting type inside:

data(df) * mapping(:AGE) * AlgebraOfGraphics.density() * visual(; color = (:blue, 0.75)) |>
draw

Tip

Makie.jl lets us specify colors either as Symbols, e.g. :blue; or as a tuple of length 2 where the first element is a Symbol representing the desired color and the second element is a number between 0 and 1 representing the desired opacity (i.e. alpha, where 0 is fully transparent and 1 is fully opaque).

Since AoG.jl uses Makie.jl as a backend we can use Makie.jl’s multiple dispatch on the color argument.

For a complete list of named colors that Makie.jl has access to, see here.

Also as before, we can perform faceting by specifying either a layout, row or col as keyword arguments inside mapping():

data(df) * mapping(:AGE, col = :SEX) * AlgebraOfGraphics.density() |> draw

Similar to histogram(), you can also specify a two-axes density() transformation, i.e. a 2-D density plot.

Let’s use the same example as before with :AGE and :eGFR to take a look at the relationship between age and kidney function:

data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() |> draw

Note that, just like histogram(), density() in 2-D visualizations by default uses visual(Heatmap) under the hood:

data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Heatmap) |> draw

You can swap Heatmap for Contour to make a contour plot instead of a heatmap plot:

data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Contour) |> draw

As before, you can also specify a custom colormap inside visual():

data(df) *
mapping(:AGE, :eGFR) *
AlgebraOfGraphics.density() *
visual(Contour; colormap = Reverse(:greys)) |> draw

2.2.1 3-D Visualizations

density() has some nice extra features. You can also use it for 3-D visualizations by setting type = Axis3 in the axis NamedTuple passed to draw(), together with the Surface plotting type inside visual():

draw(
    data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Surface);
    axis = (; type = Axis3),
)

This also works for faceting. Let’s add :SEX as layout inside mapping():

draw(
    data(df) *
    mapping(:AGE, :eGFR; layout = :SEX) *
    AlgebraOfGraphics.density() *
    visual(Surface);
    axis = (; type = Axis3),
)

Note

We will cover many more customizations in Tutorial 4 - Customization of AlgebraOfGraphics.jl Plots. Don’t forget to check it out.

2.3 frequency()

The third statistical function we will cover is the frequency() function which computes a raw frequency table of the arguments.

Tip

frequency() does not take any arguments.

The simplest example could be just computing the frequency of the column :SEX:

data(df) * mapping(:SEX) * frequency() |> draw
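Conceptually, the bar heights in a frequency plot are just counts of each distinct value. The following plain-Julia sketch shows that computation on made-up values; the helper name frequency_table is hypothetical and is not part of AoG.jl.

```julia
# Count how many times each distinct value occurs.
function frequency_table(values)
    counts = Dict{eltype(values),Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

sex = ["female", "male", "male", "female", "male"]   # made-up values
tbl = frequency_table(sex)

for (level, n) in sort(collect(tbl))
    println(level, " => ", n)
end
```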

As before we can add faceting with either a layout, row or col as keyword arguments inside mapping(); and customize with keyword arguments inside an empty visual() layer:

data(df) * mapping(:SEX; layout = :WEIGHT_cat) * frequency() * visual(; color = :blue) |>
draw

A nice plot to have up your sleeve is a frequency() plot using stacked bars.

This can be done by specifying the keyword arguments color and stack to a desired column inside mapping():

data(df) * mapping(:SEX; color = :WEIGHT_cat, stack = :WEIGHT_cat) * frequency() |> draw

frequency() can also be paired with a 2-D visualization in order to have a heatmap plot of raw counts.

For example, the previous stacked bar frequency plot can be done as a 2-D heatmap frequency plot:

data(df) * mapping(:SEX, :WEIGHT_cat) * frequency() |> draw

Also, it supports a custom colormap:

data(df) *
mapping(:SEX, :WEIGHT_cat) *
frequency() *
visual(; colormap = Reverse(:greys)) |> draw

2.4 expectation()

Our fourth statistical visualization function is expectation(), named after the mathematical term for the “mean” of a random variable, which is used extensively in fields like probability.

expectation() computes the expected value, i.e. the mean, of the second argument conditioned on the values of the first argument inside mapping(). In other words, the mean of the y column within each group defined by the x column.

Here is an example with only the columns :SEX and :AGE:

data(df) * mapping(:SEX, :AGE) * expectation() |> draw
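The quantity behind each bar is a plain group mean. As a minimal sketch with made-up values (not the contents of df), the same numbers can be computed by hand like this:

```julia
using Statistics

# Made-up grouping variable (x) and numeric variable (y).
sex = ["female", "male", "male", "female", "male"]
age = [34.0, 32.0, 36.0, 38.0, 34.0]

# Mean of `age` within each level of `sex`.
group_means = Dict(g => mean(age[sex .== g]) for g in unique(sex))

for (g, m) in sort(collect(group_means))
    println(g, " => ", m)
end
```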

As before we can add faceting with either a layout, row or col as keyword arguments inside mapping(); and customize with keyword arguments inside an empty visual() layer:

data(df) *
mapping(:SEX, :AGE; layout = :WEIGHT_cat) *
expectation() *
visual(; color = :blue) |> draw

Another nice plot to have up your sleeve is a grouped expected value bar plot.

This is accomplished by adding the keyword arguments color and dodge inside mapping():

data(df) *
mapping(:SEX, :AGE; color = :WEIGHT_cat, dodge = :WEIGHT_cat) *
expectation() |> draw

Tip

expectation() does not take any arguments.

2.5 linear()

Our fifth statistical visualization function is linear() which draws a linear trend line between two variables. This is similar to geom_smooth(method = "lm") in ggplot2. It computes a linear fit using the following formula: y ~ 1 + x.

Let’s see an example using :AGE vs :eGFR:

data(df) * mapping(:AGE, :eGFR) * linear() |> draw
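The formula y ~ 1 + x corresponds to an ordinary least squares fit with an intercept and a single slope. The sketch below solves that problem directly in plain Julia on made-up numbers; it illustrates the model being fitted and is not a description of AoG.jl’s internals.

```julia
using LinearAlgebra

# Made-up data that roughly follows a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Design matrix for y ~ 1 + x: a column of ones (intercept) and x (slope).
X = hcat(ones(length(x)), x)

# The backslash operator returns the least squares solution.
intercept, slope = X \ y

# Fitted values along the trend line.
fitted = intercept .+ slope .* x
println("intercept = ", intercept, ", slope = ", slope)
```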

Caution

If we add visual(Scatter) with the * operator to see the data points along with our linear trend line, we do not get what we intended. This is because AoG.jl has two algebraic operations: multiplication with *, which merges specifications into a single layer, and addition with +, which superimposes layers. To overlay the scatter plot on the trend line you need the + operator, not the * operator. We’ll discuss AoG.jl’s algebraic operations in more depth in Tutorial 5 - Grammar of Graphics with AlgebraOfGraphics.jl. Don’t forget to check it out.

data(df) * mapping(:AGE, :eGFR) * (linear() + visual(Scatter)) |> draw

Of course we can also use customizations inside mapping() or even inside visual():

data(df) *
mapping(:AGE, :eGFR; color = :SEX) *
(linear() + visual(Scatter; marker = '+', markersize = 20)) |> draw

2.6 smooth()

Our final statistical visualization is smooth(). It adds a smooth trend curve to our data. It fills the same need as geom_smooth() in ggplot2.

Here is the previous example, but now using smooth():

data(df) * mapping(:AGE, :eGFR) * smooth() |> draw

By default it uses a LOESS (Locally Estimated Scatterplot Smoothing) of degree 2. You can change the degree of the local polynomial to make the curve more or less flexible:

data(df) * mapping(:AGE, :eGFR) * smooth(; degree = 5) |> draw

data(df) * mapping(:AGE, :eGFR) * smooth(; degree = 1) |> draw
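To give some intuition for what LOESS does, here is a simplified sketch of a single local fit in plain Julia. It uses a local polynomial of degree 1 for brevity (the default mentioned above is degree 2), tricube weights, and a hypothetical span argument for the fraction of points in each window. The function local_linear is invented for this example and is not the routine that smooth() calls.

```julia
using LinearAlgebra

# Tricube kernel from classic LOESS; it is zero for |u| >= 1.
tricube(u) = abs(u) < 1 ? (1 - abs(u)^3)^3 : 0.0

# Fitted value at x0 from a weighted straight-line fit around x0.
function local_linear(x, y, x0; span = 0.75)
    n = length(x)
    k = max(2, ceil(Int, span * n))      # number of points in the local window
    d = abs.(x .- x0)
    h = sort(d)[k]                       # window half-width: k-th smallest distance
    w = tricube.(d ./ h)                 # nearby points get larger weights
    X = hcat(ones(n), x .- x0)           # centred design matrix
    coef = (X' * (w .* X)) \ (X' * (w .* y))
    return coef[1]                       # intercept = fitted value at x0
end

# Made-up data: a curve that no single straight line can follow.
x = collect(1.0:10.0)
y = sqrt.(x)

smoothed = [local_linear(x, y, x0) for x0 in x]
println(round.(smoothed; digits = 3))
```

Repeating the local fit at every x value traces out the smooth curve; a higher polynomial degree or a smaller span lets the curve bend more.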

Like the previous statistical functions, it can be combined with mapping() keyword arguments and visual() customizations:

data(df) *
mapping(:AGE, :eGFR; color = :SEX) *
(smooth() + visual(Scatter; marker = '+', markersize = 20)) |> draw