Empirical CDF plots using AlgebraOfGraphics.jl

Author

Patrick Kofod Mogensen

In this tutorial, we will cover how to produce an empirical CDF plot in AlgebraOfGraphics.jl. Empirical CDFs (ECDFs) are useful when looking at the distribution of a covariate, a predicted response, weighted residuals, empirical Bayes estimates, and more.

To begin, as in other AlgebraOfGraphics tutorials, let’s load the data wrangling libraries and the DataFrame we’ve used previously:

using PharmaDatasets
using DataFrames

df = dataset("demographics_1")
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198

Finally, load AlgebraOfGraphics.jl and CairoMakie.jl:

using AlgebraOfGraphics
using CairoMakie

1 An Empirical CDF Plot

An empirical cumulative distribution function plot (empirical CDF plot or ECDF plot) presents an estimate of the function that gives the probability that a random draw from the distribution of interest is less than or equal to some value. This means that if we find a point p on the first axis and read off the function value, it tells us the estimated probability that a random draw is less than or equal to p. Let us consider an example:

spec1 = data(df) * mapping(:WEIGHT) * visual(ECDFPlot; label = "Data")
draw(
    spec1;
    axis = (xlabel = "Weight", ylabel = "F(Weight)"),
    legend = (position = :top, titleposition = :left),
)

Looking at the plot, we see that the ECDF reaches 0.5 at around 45 kg. This means that roughly 50% of the entries in the data set are weights at or below 45 kg and roughly 50% are above.

The plot is constructed by setting the ECDF to 0 at negative infinity and then stepping through the sorted values of the data. Every time a new value is reached, the ECDF is incremented by 1/n, where n is the total number of data points. If there are m observations with the exact same value, the increment is m/n instead. In the plot we see a region from just below 80 kg to around 90 kg where the function is completely flat. This means that no observations were made in that interval.
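The construction described above can be sketched in a few lines of plain Julia. This is only an illustration of the idea, not how ECDFPlot is implemented internally: the helper names `ecdf_steps` and `ecdf_at` are hypothetical, and the `weights` vector contains made-up numbers rather than values from the data set.

```julia
# Minimal sketch of the ECDF construction (Base Julia only).
# `ecdf_steps` and `ecdf_at` are hypothetical helpers for illustration.
function ecdf_steps(x::AbstractVector{<:Real})
    n = length(x)
    xs = sort(x)
    vals = unique(xs)   # distinct observed values, in ascending order
    # At each distinct value v the ECDF equals (number of observations <= v) / n,
    # so m tied observations produce a single jump of m/n.
    heights = [searchsortedlast(xs, v) / n for v in vals]
    return vals, heights
end

# Evaluate the ECDF at an arbitrary point p: the share of observations <= p.
ecdf_at(x, p) = count(<=(p), x) / length(x)

weights = [50.0, 42.0, 50.0, 61.0, 38.0]   # made-up numbers
vals, heights = ecdf_steps(weights)
# vals    == [38.0, 42.0, 50.0, 61.0]
# heights == [0.2, 0.4, 0.8, 1.0]   (the tie at 50.0 gives a jump of 2/5)

ecdf_at(weights, 45.0)   # 0.4: two of the five values are <= 45
```

Between two consecutive distinct values the function stays at the lower step height, which is why intervals without observations show up as flat regions in the plot.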

1.1 Normalize to Compare to Standard Normal

Our distribution of weight can be evaluated in many different ways depending on the end goal. Suppose we want to study whether the weight distribution is approximately Normal. We could overlay the CDF of a theoretical Normal distribution that has the same mean and standard deviation as the data. The more usual approach is to compare the normalized values with the standard Normal distribution. To normalize means to subtract the mean from every observation and divide the result by the standard deviation. This creates a “normalized” dataset that has mean 0 and standard deviation 1 by construction. A standard Normal distribution is a Normal distribution that also has mean 0 and standard deviation 1. If the ECDF of the normalized data closely matches the theoretical CDF of a standard Normal distribution, we judge our original data to be approximately Normal.

It is also possible to read quantiles of interest by starting from the second axis and finding the appropriate values on the first axis. Of course, other plots may be of interest in this case, such as empirical quantile function plots or QQ plots. Let us look at an example:

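using Statistics, Distributions
# Normalize: subtract the mean and divide by the standard deviation
df.WEIGHT_norm .= (df.WEIGHT .- mean(df.WEIGHT)) ./ std(df.WEIGHT)
# Evaluate the standard Normal CDF on a grid spanning the normalized data
input = range(extrema(df.WEIGHT_norm)...; length = 300)
stdnormal = (x = input, y = cdf.(Normal(), input))
# Layer the ECDF of the normalized data with the theoretical CDF
spec1 =
    data(df) * mapping(:WEIGHT_norm) * visual(ECDFPlot; label = "Normalized Data") +
    data(stdnormal) *
    mapping(:x, :y) *
    visual(Lines; color = :red, label = "Standard Normal")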
draw(
    spec1;
    axis = (xlabel = "Normalized Weight", ylabel = "F(Normalized Weight)"),
    legend = (position = :top, titleposition = :left),
)

The two curves overlap reasonably well, though you could argue that, relative to a standard Normal distribution, the data contain somewhat too few low weights and too many moderately high weights.
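As a small sanity check of the normalization step and of the idea of reading quantiles from the second axis, the following sketch uses only the Statistics standard library. The helper `normalize_z` is a hypothetical name and the vector `x` holds illustrative values, not the full data set.

```julia
using Statistics

# Hypothetical helper: normalize a vector with its own mean and standard deviation.
normalize_z(x) = (x .- mean(x)) ./ std(x)

x = [38.2, 74.8, 37.3, 33.0, 47.1]   # illustrative values only
z = normalize_z(x)

# By construction, the normalized values have mean 0 and standard deviation 1
# (up to floating-point rounding).
@show isapprox(mean(z), 0; atol = 1e-12)
@show isapprox(std(z), 1)

# Reading a quantile numerically instead of from the plot: start from 0.5 on the
# second axis and find the matching value on the first axis. `quantile`
# interpolates between order statistics by default, so it can differ slightly
# from the location of the corresponding ECDF step.
q50 = quantile(x, 0.5)
```

Numerical checks like these complement the visual comparison; they do not replace it, since matching the mean and standard deviation says nothing about the shape of the distribution.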

2 Conclusion

That’s it for our tutorial on constructing empirical CDF plots using AlgebraOfGraphics.jl.