# Plotting Statistical Visualizations with `AlgebraOfGraphics.jl`

Authors

Jose Storopoli

Juan Oneto

In this tutorial, we’ll explore different statistical transformations that we can apply to our visualizations with `AoG.jl`.

These can be added to any `AoG.jl` layer with the `*` operator.

We will also cover some geometries that are commonly used with the statistical visualization functions: `Contour` and `Heatmap`.

Statistical transformations are paired with defaults for `visual()`, so you don’t need to specify a `visual()` yourself. You just add the `data()` layer along with a `mapping()` layer and finalize with a statistical transformation layer.

We’ll cover 6 statistical transformation functions:

• `histogram()`
• `density()`
• `frequency()`
• `expectation()`
• `linear()`
• `smooth()`

## 1 📋 `stat_*()` - `AoG.jl` Table

The table below maps `ggplot2`’s `geom_*()` and `stat_*()` functions to the corresponding `AoG.jl` statistical visualization functions:

| `ggplot2` | `AoG.jl` |
|-----------|----------|
| `geom_bar()` or `stat_count()` | `frequency()` |
| `stat_summary(fun = "mean")` | `expectation()` |
| `geom_histogram()` or `stat_bin()` | `histogram()` |
| `geom_density()` or `stat_density()` | `density()` |
| `geom_smooth()` or `stat_smooth()` | `smooth()` |
| `geom_smooth(method = "lm")` or `stat_smooth(method = "lm")` | `linear()` |

## 2 📈 Statistical Transformations with `AoG.jl`

To start, let us load the data wrangling libraries and the `DataFrame` we’ve used previously:

``````using PharmaDatasets
using DataFramesMeta

df = dataset("demographics_1")
first(df, 5)``````
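5×6 DataFrame

| Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
|-----|----|-----|--------|-----|--------|------|
| | Int64 | Float64 | Float64 | Float64 | Int64 | Float64 |
| 1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
| 2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
| 3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
| 4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
| 5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |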

We will also convert some columns to `CategoricalArray`s:

``````using CategoricalArrays
@transform! df :SEX = categorical(:ISMALE);
@transform! df :SEX = recode(:SEX, 0 => "female", 1 => "male");
@transform! df :WEIGHT_cat = cut(:WEIGHT, 2; labels = ["light", "heavy"])``````

And now load `AoG.jl` along with `CairoMakie.jl`:

``````using CairoMakie
using AlgebraOfGraphics``````

### 2.1 `histogram()`

The first statistical transformation is the `histogram()` function, which performs a binning operation on the data and outputs a histogram.

For example, if we want to visualize a histogram of the column `:AGE`, we can do so easily:

``data(df) * mapping(:AGE) * histogram() |> draw``

We can also add faceting to our histogram by specifying either a `layout`, `row` or `col` as keyword arguments inside `mapping()`:

``data(df) * mapping(:AGE; layout = :SEX) * histogram() |> draw``

If you want to customize other aspects of the histogram’s underlying bar plot, you can add a `visual()` layer without any plotting type inside, passing only the desired customizations as keyword arguments:

``data(df) * mapping(:AGE; layout = :SEX) * histogram() * visual(; color = :blue) |> draw``

The number of bins is determined automatically, but you can set it yourself with the keyword argument `bins` inside `histogram()`:

``data(df) * mapping(:AGE) * histogram(; bins = 20) |> draw``

`histogram()` also has a `normalization` keyword argument which lets you specify a normalization scheme. There are 4 possible normalization schemes and you can specify them using the corresponding `Symbol`s:

1. `:none`: the default, no normalization applied, i.e. raw count.
2. `:pdf`: normalize by sum of weights and bin sizes. The resulting histogram behaves as a probability density function (PDF): the bin heights multiplied by their bin widths sum to 1.
3. `:density`: normalize by bin sizes only. The resulting histogram represents the count density of the input; its area equals the total count rather than 1.
4. `:probability`: normalize by sum of weights only. Each bin height is the fraction of probability mass in that bin, so the heights sum to 1, but the area under the histogram is generally not 1.

Our advice is to use either `:none` (the default) for a raw histogram or `:pdf` for a relative histogram.

``data(df) * mapping(:AGE) * histogram(; normalization = :pdf) |> draw``
Tip

Note that the default y-axis label changes with `normalization`. In the default case, `:none`, the y-label is `count` and in the `:pdf` case `pdf`.
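To make the four normalization schemes concrete, here is a small hand computation in plain Julia. It is only a sketch: the sample values, the bin edges and the left-closed binning rule are assumptions chosen for illustration, and the code does not call `AoG.jl` at all.

```julia
# Toy data and hand-picked bin edges (both are illustrative assumptions).
x = [1.0, 1.5, 2.0, 2.5, 2.6, 3.5, 4.0, 5.5]
edges = [1.0, 2.0, 3.0, 6.0]      # three bins: [1, 2), [2, 3), [3, 6)
widths = diff(edges)              # bin widths: 1, 1 and 3

# Raw counts per left-closed bin, i.e. what `:none` reports.
counts = [count(v -> lo <= v < hi, x) for (lo, hi) in zip(edges[1:end-1], edges[2:end])]
n = sum(counts)

probability = counts ./ n             # divide by the total count only
density = counts ./ widths            # divide by the bin widths only
pdf_heights = counts ./ (n .* widths) # divide by both

println("counts      = ", counts)
println("probability = ", probability, " (heights sum to ", sum(probability), ")")
println("density     = ", density)
println("pdf         = ", pdf_heights, " (area = ", sum(pdf_heights .* widths), ")")
```

With unequal bin widths, as here, the difference between "heights sum to 1" (`:probability`) and "area is 1" (`:pdf`) becomes visible.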

You can also specify a two-axes `histogram()` transformation, i.e. a 2-D histogram plot.

This is similar to `geom_bin_2d()` in `ggplot2`, which draws the binned counts as tiles.

For example, let’s add the column `:eGFR` to take a look at the relationship between age and kidney function:

``data(df) * mapping(:AGE, :eGFR) * histogram() |> draw``

Notice that by default a 2-D histogram uses `visual(Heatmap)` under the hood:

``data(df) * mapping(:AGE, :eGFR) * histogram() * visual(Heatmap) |> draw``

We can also change the `colormap` keyword argument inside `visual()`. Let’s make a black and white printer-friendly visualization with `colormap = Reverse(:greys)`:

``data(df) * mapping(:AGE, :eGFR) * histogram() * visual(; colormap = Reverse(:greys)) |> draw``

### 2.2 `density()`

Our second statistical function is `density()`, which computes a kernel density estimate (KDE) of the underlying data.
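As a rough idea of what a KDE computes, the sketch below builds one by hand in plain Julia. The Gaussian kernel, the fixed bandwidth `h`, the helper names `gaussian` and `kde_at`, and the sample values are all assumptions made for illustration; `AoG.jl` relies on its own implementation and bandwidth choice.

```julia
# Standard normal density, used here as the kernel (an assumed choice).
gaussian(u) = exp(-u^2 / 2) / sqrt(2π)

# KDE at point t: average of kernels centred on each observation,
# rescaled by the bandwidth h.
kde_at(t, xs, h) = sum(gaussian((t - xi) / h) for xi in xs) / (length(xs) * h)

ages = [31.0, 33.5, 34.8, 35.9, 38.2, 41.0]  # made-up values, not the tutorial data
h = 2.0                                       # hand-picked bandwidth
grid = range(20.0, 50.0; length = 301)
estimate = [kde_at(t, ages, h) for t in grid]

# Riemann-sum check: a density estimate should integrate to roughly 1.
area = sum(estimate) * step(grid)
println("area under the KDE ≈ ", round(area; digits = 3))
```

A larger `h` gives a smoother, flatter curve, while a smaller `h` follows the individual observations more closely.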

Using the same example column `:AGE` as before, we just change the function to `density()`:

Caution

Both `AoG.jl` and `CairoMakie.jl` export a function named `density()`. So if we just blindly use `density()`, we get an error:

``density()``
``WARNING: both AlgebraOfGraphics and CairoMakie export "density"; uses of it in module Notebook must be qualified``
``UndefVarError: UndefVarError(:density)``

Thus, we need to qualify which `density()` function we want with `Package.function` naming. This is why we need to call `AlgebraOfGraphics.density()`.

Or we can also use an `import` statement:

``import AlgebraOfGraphics: density``

to tell Julia which `density()` function to use.
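The same situation can be reproduced without any plotting package. In this sketch the two modules `PlotsLikeA` and `PlotsLikeB` are hypothetical stand-ins for two packages that export the same name; only the qualified calls are shown, since those work regardless of what has been exported.

```julia
# Two stand-in modules (hypothetical names) that both export `density`.
module PlotsLikeA
export density
density() = "density from PlotsLikeA"
end

module PlotsLikeB
export density
density() = "density from PlotsLikeB"
end

using .PlotsLikeA, .PlotsLikeB

# Qualified names are always unambiguous, whatever has been exported.
println(PlotsLikeA.density())
println(PlotsLikeB.density())
```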

``data(df) * mapping(:AGE) * AlgebraOfGraphics.density() |> draw``

As before we can customize our plot with keyword arguments using a `visual()` layer without any plotting type inside:

``````data(df) * mapping(:AGE) * AlgebraOfGraphics.density() * visual(; color = (:blue, 0.75)) |>
draw``````
Tip

`Makie.jl` lets us specify `color`s either as `Symbol`s, e.g. `:blue`; or as a tuple of length 2 where the first element is a `Symbol` representing the desired color and the second element is a number between 0 and 1 (e.g. a `Float64`) representing the desired opacity (i.e. alpha).

Since `AoG.jl` uses `Makie.jl` as a backend we can use `Makie.jl`’s multiple dispatch on the `color` argument.

For a complete list of named colors that `Makie.jl` has access to, see here.

Also as before, we can perform faceting by specifying either a `layout`, `row` or `col` as keyword arguments inside `mapping()`:

``data(df) * mapping(:AGE, col = :SEX) * AlgebraOfGraphics.density() |> draw``

Similar to `histogram()`, you can also specify a two-axes `density()` transformation, i.e. a 2-D density plot.

Let’s use the same example as before with `:AGE` and `:eGFR` to take a look at the relationship between age and kidney function:

``data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() |> draw``

Note that, just like `histogram()`, `density()` in 2-D visualizations by default uses `visual(Heatmap)` under the hood:

``data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Heatmap) |> draw``

You can swap `Heatmap` for `Contour` to make a contour plot instead of a heatmap:

``data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Contour) |> draw``

As before, you can also specify a custom `colormap` inside `visual()`:

``````data(df) *
mapping(:AGE, :eGFR) *
AlgebraOfGraphics.density() *
visual(Contour; colormap = Reverse(:greys)) |> draw``````

#### 2.2.1 3-D Visualizations

`density()` has some nice extra features. You can also use it for 3-D visualizations: use the `Surface` plotting type inside `visual()` and pass `Axis3` as `type` inside the `axis` `NamedTuple` keyword argument of the `draw()` function:

``````draw(
data(df) * mapping(:AGE, :eGFR) * AlgebraOfGraphics.density() * visual(Surface);
axis = (; type = Axis3),
)``````

This also works for faceting. Let’s add `:SEX` as `layout` inside `mapping()`:

``````draw(
data(df) *
mapping(:AGE, :eGFR; layout = :SEX) *
AlgebraOfGraphics.density() *
visual(Surface);
axis = (; type = Axis3),
)``````
Note

We will cover many more customizations in Tutorial 4 - Customization of AlgebraOfGraphics.jl Plots. Don’t forget to check it out.

### 2.3 `frequency()`

The third statistical function we will cover is the `frequency()` function which computes a raw frequency table of the arguments.

Tip

`frequency()` does not take any arguments.

The simplest example could be just computing the frequency of the column `:SEX`:

``data(df) * mapping(:SEX) * frequency() |> draw``
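Conceptually, the height of each bar is just a tally of how often each level occurs. A minimal sketch of that tally in plain Julia, using made-up values rather than the tutorial data:

```julia
sex = ["female", "male", "male", "male", "female", "male"]  # made-up values

# Tally how often each distinct value occurs.
counts = Dict{String,Int}()
for s in sex
    counts[s] = get(counts, s, 0) + 1
end

# Print the levels in alphabetical order.
for level in sort(collect(keys(counts)))
    println(level, " => ", counts[level])
end
```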

As before we can add faceting with either a `layout`, `row` or `col` as keyword arguments inside `mapping()`; and customize with keyword arguments inside an empty `visual()` layer:

``````data(df) * mapping(:SEX; layout = :WEIGHT_cat) * frequency() * visual(; color = :blue) |>
draw``````

A nice plot to have up your sleeve is a `frequency()` plot using stacked bars.

This can be done by specifying the keyword arguments `color` and `stack` to a desired column inside `mapping()`:

``data(df) * mapping(:SEX; color = :WEIGHT_cat, stack = :WEIGHT_cat) * frequency() |> draw``

`frequency()` can also be paired with a 2-D visualization in order to have a heatmap plot of raw counts.

For example, the previous stacked bar frequency plot can be done as a 2-D heatmap frequency plot:

``data(df) * mapping(:SEX, :WEIGHT_cat) * frequency() |> draw``

Also, it supports a custom `colormap`:

``````data(df) *
mapping(:SEX, :WEIGHT_cat) *
frequency() *
visual(; colormap = Reverse(:greys)) |> draw``````

### 2.4 `expectation()`

Our fourth statistical visualization function is `expectation()`. “Expectation” is the mathematical term for the “mean” of a random variable and is used extensively in fields like probability.

`expectation` computes the expected value, i.e. “mean” of a random variable, of the second argument conditioned on the values of the first argument inside `mapping()`. In other words, the mean of the `y` column conditioned on the `x` column.

Here is an example with only the columns `:SEX` and `:AGE`:

``data(df) * mapping(:SEX, :AGE) * expectation() |> draw``
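The quantity being plotted is a group-wise mean: for each distinct `x` value, average the `y` values that share it. A small sketch in plain Julia, with illustrative numbers that are not taken from the dataset:

```julia
using Statistics  # standard library, provides `mean`

sex = ["female", "male", "male", "male", "female"]
age = [34.8, 32.8, 36.0, 38.2, 33.6]   # illustrative numbers only

# For each distinct x value, average the y values that share it.
group_means = Dict(g => mean(age[sex .== g]) for g in unique(sex))

for g in sort(collect(keys(group_means)))
    println(g, " => ", round(group_means[g]; digits = 2))
end
```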

As before we can add faceting with either a `layout`, `row` or `col` as keyword arguments inside `mapping()`; and customize with keyword arguments inside an empty `visual()` layer:

``````data(df) *
mapping(:SEX, :AGE; layout = :WEIGHT_cat) *
expectation() *
visual(; color = :blue) |> draw``````

Another nice plot to have up your sleeve is a grouped expected value bar plot.

This is accomplished by adding the keyword arguments `color` and `dodge` inside `mapping()`:

``````data(df) *
mapping(:SEX, :AGE; color = :WEIGHT_cat, dodge = :WEIGHT_cat) *
expectation() |> draw``````
Tip

`expectation()` does not take any arguments.

### 2.5 `linear()`

Our fifth statistical visualization function is `linear()` which draws a linear trend line between two variables. This is similar to `geom_smooth(method = "lm")` in `ggplot2`. It computes a linear fit using the following formula: `y ~ 1 + x`.

Let’s see an example using `:AGE` vs `:eGFR`:

``data(df) * mapping(:AGE, :eGFR) * linear() |> draw``
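For intuition, the formula `y ~ 1 + x` corresponds to an ordinary least-squares fit with an intercept and a slope. The sketch below solves such a fit by hand in plain Julia; the numbers are made up, and it only illustrates the model, not how `AoG.jl` computes it internally.

```julia
using LinearAlgebra  # standard library; least squares via the `\` operator

x = [30.0, 32.0, 34.0, 36.0, 38.0]
y = [100.0, 95.0, 91.0, 84.0, 80.0]   # made-up values

# Design matrix for y ~ 1 + x: a column of ones (intercept) and x (slope).
X = [ones(length(x)) x]

# For a tall matrix, `X \ y` returns the least-squares solution.
intercept, slope = X \ y

println("intercept = ", round(intercept; digits = 2))
println("slope     = ", round(slope; digits = 2))
```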
Caution

If we also add a `visual(Scatter)` with the `*` operator to see the data points along with our linear trend line, we get something that is not our original intention, because `*` combines `visual(Scatter)` with `linear()` into a single layer instead of adding a second one. `AoG.jl` has two algebraic operations: addition with `+` and multiplication with `*`. To superimpose layers you need to use `+`, not `*`. We’ll discuss `AoG.jl`’s algebraic operations in more depth in Tutorial 5 - Grammar of Graphics with `AlgebraOfGraphics.jl`. Don’t forget to check it out.

``data(df) * mapping(:AGE, :eGFR) * (linear() + visual(Scatter)) |> draw``
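Because `*` distributes over `+`, the parenthesized form is shorthand for adding two complete layers. The sketch below spells out both forms on a tiny made-up table (`tbl` and its values are assumptions for illustration); both specifications should describe the same two-layer plot.

```julia
using AlgebraOfGraphics, CairoMakie

# Tiny made-up table; any Tables.jl-compatible source works with `data()`.
tbl = (; x = [1.0, 2.0, 3.0, 4.0, 5.0], y = [1.2, 1.9, 3.2, 3.8, 5.1])

base = data(tbl) * mapping(:x, :y)

# Factored form: the shared data and mapping are written once.
factored = base * (linear() + visual(Scatter))

# Expanded form: two full layers added together.
expanded = base * linear() + base * visual(Scatter)

draw(factored)
draw(expanded)
```

The factored form is usually preferable, since the shared `data()` and `mapping()` only need to be changed in one place.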

Of course we can also use customizations inside `mapping()` or even inside `visual()`:

``````data(df) *
mapping(:AGE, :eGFR; color = :SEX) *
(linear() + visual(Scatter; marker = '+', markersize = 20)) |> draw``````

### 2.6 `smooth()`

Our final statistical visualization is `smooth()`. It adds a smooth trend curve to our data. It fills the same need as `geom_smooth()` in `ggplot2`.

Here is the previous example, but now using `smooth()`:

``data(df) * mapping(:AGE, :eGFR) * smooth() |> draw``

By default it uses a LOESS (Locally Estimated Scatterplot Smoothing) of degree 2. You can change the degree to make it less or more “smooth”:

``data(df) * mapping(:AGE, :eGFR) * smooth(; degree = 5) |> draw``
``data(df) * mapping(:AGE, :eGFR) * smooth(; degree = 1) |> draw``

Like all of the previous statistical functions, it supports `mapping()` customizations and can be superimposed on other layers:

``````data(df) *
mapping(:AGE, :eGFR; color = :SEX) *
(smooth() + visual(Scatter; marker = '+', markersize = 20)) |> draw``````