Deep Learning for Covariate Modelling in NLME Models

Authors

Lucas Pereira

Mohamed Tarek

Anthony Blaom

1 Introduction

The first unit of the AI for Drug Development course presented the fundamentals of machine learning (ML). With the current tutorial, we move on to the second unit: “Neural Networks and SciML”. Deep learning is the subfield of ML focused on neural networks, which we will explore in this tutorial, alongside regularization, training approaches and other topics. The tutorial assumes familiarity with the concepts from the previous tutorial, and is organized as follows:

  • Read in and pre-process the data
  • Perform some exploratory data analysis
  • Represent covariates in a form suitable for MLJ models (inputs)
  • Fit a PK model to obtain empirical Bayes estimates (EBEs) (outputs)
  • Perform ordinary linear regression to relate outputs to inputs
  • Compare with LASSO and Ridge regression
  • Discuss neural network building blocks
  • Try a basic neural network model
  • Add early stopping and learning rate shrinkage
  • Apply some explainable AI tools and derive feature importance rankings
  • Repeat our analyses using a synthetic version of the data

Furthermore, as in the previous tutorial, a study in pharmacometrics will be used as a source of context, data and discussions. This time, “Pharmacokinetics of Quinidine in Male Patients” (Verme et al. 1992) was chosen. The authors analyzed data from 139 adult hospitalized men receiving oral therapy of quinidine, an antiarrhythmic drug. In previous works, quinidine had shown considerable intersubject variability in pharmacokinetic (PK) parameters, prompting the authors to search for predictive covariates, especially for clearance. We will use the resulting data to train several linear models and a neural network mapping baseline covariates to empirical Bayes estimates.

While we will ultimately fail to extract predictive value from the covariates for the PK parameters, the same analysis is useful in the modeling of other drugs, and it might succeed in the case of a larger quinidine clinical trial. A proof of concept is included in this case study by repeating our analysis on a synthetic data set.

2 Setup

Moving on to environment setup: if necessary, edit the following line so that LOCAL is the directory containing this tutorial:

LOCAL = @__DIR__
using Pkg
Pkg.activate(dirname(LOCAL))

Depending on how this tutorial is run, the packages may need to be installed first (only required once).

Pkg.add(
    [
        Pkg.PackageSpec(name = "CSV", version = v"0.10.16"),
        Pkg.PackageSpec(name = "PharmaDatasets", version = v"0.13.1"),
        Pkg.PackageSpec(name = "DataFrames", version = v"1.8.1"),
        Pkg.PackageSpec(name = "CategoricalArrays", version = v"1.1.0"),
        Pkg.PackageSpec(name = "DataFramesMeta", version = v"0.15.6"),
        Pkg.PackageSpec(name = "Flux", version = v"0.16.9"),
        Pkg.PackageSpec(name = "MLJFlux", version = v"0.6.7"),
        Pkg.PackageSpec(name = "ShapML", version = v"0.3.2"),
        Pkg.PackageSpec(name = "Optimisers", version = v"0.4.7"),
        Pkg.PackageSpec(name = "MLJMultivariateStatsInterface", version = v"0.5.4"),
    ];
    preserve = Pkg.PRESERVE_ALL,
)

Next we load the libraries we need. A seed for the random number generator is also set for consistency across runs.

using Pumas,
    CSV,
    DataFrames,
    DataFramesMeta,
    LinearAlgebra,
    MLJTuning,
    MLJIteration,
    StatisticalMeasures
using Random, MLJFlux, CategoricalArrays, ShapML, MLJBase, MLJModels, IterationControl
import Flux
import Flux: Chain, Dense, relu, σ
import MLJFlux: @builder
import DataFrames: select
import Pumas: fit
using Optimisers: Adam
const SEED = 378
Random.seed!(SEED)

Now the models are loaded.

# Load model types
MultitargetNeuralNetworkRegressor =
    MLJModels.@load MultitargetNeuralNetworkRegressor pkg = MLJFlux
LinearRegressor = MLJModels.@load MultitargetLinearRegressor pkg = MultivariateStats
ContinuousEncoder = MLJModels.@load ContinuousEncoder pkg = MLJTransforms
Standardizer = MLJModels.@load Standardizer pkg = MLJTransforms

3 Data processing

As usual in ML projects, we begin by getting the data. The dataset has a great impact on the model’s performance, and should be fully understood, in keeping with industry practices such as the US Food and Drug Administration’s “Risk-Based Credibility Assessment Framework” (FDA 2025). The data can be downloaded from Kaggle, an online platform and community for sharing datasets and trained models. Alternatively, the data may be obtained from the RDatasets.jl package, which makes many datasets from the R programming language available in Julia. Right from the start, we take the opportunity for a bit of processing by removing the first column, which just counts lines. Also, this dataset represents missing values with the string “NA”, so we indicate that in the call to CSV.read().

datafile = joinpath(LOCAL, "Quinidine.csv")
df = CSV.read(datafile, DataFrame; missingstring = ["NA"])[:, 2:end]
describe(df)[:, Not(7)] |> println;
14×6 DataFrame
 Row │ variable    mean     min       median  max     nmissing
     │ Symbol      Union…   Any       Union…  Any     Int64
─────┼─────────────────────────────────────────────────────────
   1 │ Subject     73.9266  1         74.0    223            0
   2 │ time        373.252  0.0       60.0    8095.5         0
   3 │ conc        2.45429  0.4       2.3     9.4         1110
   4 │ dose        225.302  83        201.0   603          443
   5 │ interval    7.10843  4         6.0     12          1222
   6 │ Age         66.7322  42        66.0    92             0
   7 │ Height      69.1931  60        69.0    79             0
   8 │ Weight      79.6608  41        78.0    119            0
   9 │ Race                 Black             Latin          0
  10 │ Smoke                no                yes            0
  11 │ Ethanol              current           none           0
  12 │ Heart                Moderate          Severe         0
  13 │ Creatinine           < 50              >= 50          0
  14 │ glyco       1.28168  0.39      1.23    3.16           0

3.1 Recoding

As can be seen by the dataset summary printed previously, some preprocessing is still necessary. First, some columns include categories as strings, which will be encoded as integers. Also, two new columns are necessary for future steps, namely the compartment (cmt) to which the drug is administered and evid indicating dosing events.

@chain df begin
    # Encode string categories as integers
    @transform! :Smoke = recode(:Smoke, "no" => 0, "yes" => 1)
    @transform! :Heart = recode(:Heart, "No/Mild" => 1, "Moderate" => 2, "Severe" => 3)
    @transform! :Creatinine = recode(:Creatinine, ">= 50" => 0, "< 50" => 1)
    @transform! :Race = recode(:Race, "Caucasian" => 1, "Latin" => 2, "Black" => 3)
    @transform! :Ethanol = recode(:Ethanol, "none" => 1, "former" => 2, "current" => 3)
    # Fields for pumas_read()
    @transform! :cmt = 1
    @rtransform! :evid = ismissing(:dose) ? 0 : 1
end
describe(df)[:, Not(7)] |> println;
16×6 DataFrame
 Row │ variable    mean        min    median   max      nmissing
     │ Symbol      Float64     Real   Float64  Real     Int64
─────┼───────────────────────────────────────────────────────────
   1 │ Subject      73.9266     1       74.0    223            0
   2 │ time        373.252      0.0     60.0   8095.5          0
   3 │ conc          2.45429    0.4      2.3      9.4       1110
   4 │ dose        225.302     83      201.0    603          443
   5 │ interval      7.10843    4        6.0     12         1222
   6 │ Age          66.7322    42       66.0     92            0
   7 │ Height       69.1931    60       69.0     79            0
   8 │ Weight       79.6608    41       78.0    119            0
   9 │ Race          1.42284    1        1.0      3            0
  10 │ Smoke         0.303875   0        0.0      1            0
  11 │ Ethanol       1.45615    1        1.0      3            0
  12 │ Heart         1.93202    1        2.0      3            0
  13 │ Creatinine    0.28416    0        0.0      1            0
  14 │ glyco         1.28168    0.39     1.23     3.16         0
  15 │ cmt           1.0        1        1.0      1            0
  16 │ evid          0.698844   0        1.0      1            0

3.2 Exploration

Since familiarity with the data we are using can be useful for future troubleshooting, we’ll do a bit of exploration. In the code below, the dataset is first sorted by subject IDs. Then the number of unique values per column is determined and listed.

sort!(df, :Subject)
uniqs =
    LittleDict([Symbol(name) => unique(col) for (name, col) in zip(names(df), eachcol(df))])

# Display results:
(; variable = collect(keys(uniqs)), n_unique_values = length.(values(uniqs))) |>
DataFrame |>
println
16×2 DataFrame
 Row │ variable    n_unique_values
     │ Symbol      Int64
─────┼─────────────────────────────
   1 │ Subject                 136
   2 │ time                    728
   3 │ conc                     57
   4 │ dose                     16
   5 │ interval                  5
   6 │ Age                      38
   7 │ Height                   20
   8 │ Weight                   58
   9 │ Race                      3
  10 │ Smoke                     2
  11 │ Ethanol                   3
  12 │ Heart                     3
  13 │ Creatinine                2
  14 │ glyco                   145
  15 │ cmt                       1
  16 │ evid                      2

And to get a better idea of the distributions of variables, Figure 1 includes the histograms of all columns.

Figure 1: Histograms of variables.
using CairoMakie
CairoMakie.activate!()
fig = Figure()
for (title, index) in zip(names(df), CartesianIndices((4, 4)))
    ax = Axis(fig[index.I...]; title)
    hist!(ax, df[!, title] |> skipmissing |> collect)
end
save(joinpath(LOCAL, "histograms.svg"), fig)

We can also check how each patient in the dataset looks. In the following block, columns associated with pharmacokinetics are set aside, and printed by subject.

pk_df = select(df, ["Subject", "time", "dose", "conc", "evid"])
pd_subdfs = groupby(pk_df, :Subject) # Group by subject

# Print rows from random subjects
for subdf in sample(collect(pd_subdfs), 3; replace = false)
    println(subdf)
end
3×5 SubDataFrame
 Row │ Subject  time     dose     conc       evid
     │ Int64    Float64  Int64?   Float64?   Int64
─────┼─────────────────────────────────────────────
   1 │      16     0.0       249  missing        1
   2 │      16  8047.0       249  missing        1
   3 │      16  8047.67  missing        1.6      0
15×5 SubDataFrame
 Row │ Subject  time     dose     conc       evid
     │ Int64    Float64  Int64?   Float64?   Int64
─────┼─────────────────────────────────────────────
   1 │     139     0.0       301  missing        1
   2 │     139     6.0       201  missing        1
   3 │     139    36.0       301  missing        1
   4 │     139    42.0       301  missing        1
   5 │     139    48.0       301  missing        1
   6 │     139    54.0       301  missing        1
   7 │     139    60.0       301  missing        1
   8 │     139    66.0       301  missing        1
   9 │     139    73.5   missing        3.3      0
  10 │     139   114.0       301  missing        1
  11 │     139   126.0       301  missing        1
  12 │     139   140.0   missing  missing        0
  13 │     139   144.28  missing        3.2      0
  14 │     139   160.0       301  missing        1
  15 │     139   169.55  missing        2.1      0
6×5 SubDataFrame
 Row │ Subject  time     dose     conc       evid
     │ Int64    Float64  Int64?   Float64?   Int64
─────┼─────────────────────────────────────────────
   1 │      22      0.0      125  missing        1
   2 │      22      6.0      125  missing        1
   3 │      22     12.0      125  missing        1
   4 │      22     17.0      125  missing        1
   5 │      22     23.0      125  missing        1
   6 │      22     25.0  missing        1.7      0

The same is done to the covariate columns in the code below to check how they look.

covariates = [:Smoke, :Heart, :Creatinine, :Age, :Race, :Height, :glyco, :Ethanol, :Weight]

# New DataFrame with covariates and subject column
covariate_df = select(df, [:Subject, covariates...])
covariate_subdfs = groupby(covariate_df, :Subject) # Group by subjects

# Print rows from random subjects
for subgroup in sample(collect(covariate_subdfs), 3; replace = false)
    show(subgroup; allcols = true)
end
12×10 SubDataFrame
 Row │ Subject  Smoke  Heart  Creatinine  Age    Race   Height  glyco    Ethanol  Weight
     │ Int64    Int64  Int64  Int64       Int64  Int64  Int64   Float64  Int64    Int64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │      45      1      2           0     66      1      68     0.72        1      55
   2 │      45      1      2           0     66      1      68     0.72        1      55
   3 │      45      1      2           0     66      1      68     0.78        1      55
   4 │      45      1      2           0     66      1      68     0.78        1      55
   5 │      45      1      2           0     66      1      68     0.78        1      55
   6 │      45      1      2           0     66      1      68     0.78        1      55
   7 │      45      1      2           0     66      1      68     0.78        1      55
   8 │      45      1      2           0     66      1      68     0.78        1      55
   9 │      45      1      2           0     66      1      68     0.78        1      55
  10 │      45      1      2           0     66      1      68     0.86        1      45
  11 │      45      1      2           0     66      1      68     0.86        1      45
  12 │      45      1      2           0     66      1      68     0.86        1      45
8×10 SubDataFrame
 Row │ Subject  Smoke  Heart  Creatinine  Age    Race   Height  glyco    Ethanol  Weight
     │ Int64    Int64  Int64  Int64       Int64  Int64  Int64   Float64  Int64    Int64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │      80      1      1           0     69      3      77     1.77        2      77
   2 │      80      1      1           0     69      3      77     1.77        2      77
   3 │      80      1      1           0     69      3      77     1.77        2      77
   4 │      80      1      1           0     69      3      77     1.77        2      77
   5 │      80      1      1           0     69      3      77     1.77        2      77
   6 │      80      1      1           0     69      3      77     1.69        2      73
   7 │      80      1      1           0     69      3      77     1.69        2      73
   8 │      80      1      1           0     69      3      77     1.69        2      73
9×10 SubDataFrame
 Row │ Subject  Smoke  Heart  Creatinine  Age    Race   Height  glyco    Ethanol  Weight
     │ Int64    Int64  Int64  Int64       Int64  Int64  Int64   Float64  Int64    Int64
─────┼───────────────────────────────────────────────────────────────────────────────────
   1 │      46      0      3           0     73      2      66     1.09        2      91
   2 │      46      0      3           0     73      2      66     1.09        2      91
   3 │      46      0      3           0     73      2      66     1.09        2      91
   4 │      46      0      3           0     73      2      66     1.12        2      91
   5 │      46      0      3           0     73      2      66     1.12        2      90
   6 │      46      0      3           0     73      2      66     1.12        2      90
   7 │      46      0      3           0     73      2      66     1.12        2      90
   8 │      46      0      3           1     73      2      66     0.93        2      87
   9 │      46      0      3           1     73      2      66     0.93        2      87
Note

What does “…” mean in Julia code?

By now you may have noticed here and in other tutorials the appearance of ellipsis, “…” in Julia code. In Julia one uses ellipses in function definitions to handle an unknown number of arguments, as shown in this example:

# function *definition*:
f(numbers...) = numbers

f(6, 4, 1)
(6, 4, 1)

This is called slurping: we say the numbers 6, 4, 1 are slurped into the tuple (6, 4, 1).

We can use ellipsis when calling a function to reverse this process:

triple = (6, 4, 1)

# function *call*:
f(triple...)
(6, 4, 1)

This is called splatting: we say the triple (6, 4, 1) is splatted into the multiple arguments 6, 4, 1 that f is defined to handle. Many built-in Julia functions can handle an unspecified number of arguments. Here are a couple of examples:

min(10, 3, 4, 6)
3
vcat(1:3, 4:6, 7:10)
10-element Vector{Int64}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
min(triple...)
1

Semantically, the vector constructor [...] also amounts to a variable-argument function, so we can do this:

[7, 8, 9, triple...]
6-element Vector{Int64}:
 7
 8
 9
 6
 4
 1

For more on slurping and splatting refer to the Julia FAQ.

3.3 Input setup

Now, let’s actually prepare the data for the models. We begin by getting the number of subjects and creating a placeholder DataFrame with the first row of each subject and the columns listed in the variable covariates. Then, using the covariate DataFrame grouped by subjects (covariate_subdfs), we fill the values in the placeholder (raw_inputs).

nsubjects = length(uniqs[:Subject]) # Number of subjects
raw_inputs = deepcopy(covariate_df)[1:nsubjects, Not(1)] # Placeholder

# Populate `raw_inputs` DataFrame
for (i, subgroup) in enumerate(covariate_subdfs)
    raw_inputs[i, :] = deepcopy(subgroup[1, Not(1)])
end

Next, similar to the previous tutorial, a few common data preprocessing steps are applied. First, as required by MLJ.jl, the columns are coerced to the appropriate scientific data types. Specifically, they are all coerced into Continuous, except the column Race, which is one-hot encoded, since it is categorical and without order, i.e. nominal. The drop_last = true keyword argument for ContinuousEncoder was used here, resulting in an encoding different from the usual one: only the binary columns Race_1 and Race_2 are created, and Race_3 is implicitly encoded by both Race_1 and Race_2 being 0.
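To make the drop_last behavior concrete, here is a minimal hand-rolled sketch of this encoding for a three-level nominal variable; the data and variable names are illustrative, not taken from the Quinidine dataset.

```julia
# Sketch (illustrative data): "drop-last" one-hot encoding of a 3-level
# nominal variable. The last level is represented implicitly by both
# indicator columns being zero.
race = [1, 2, 3, 1, 3]            # e.g. 1 = Caucasian, 2 = Latin, 3 = Black
race_1 = Float64.(race .== 1)     # indicator column for level 1
race_2 = Float64.(race .== 2)     # indicator column for level 2
# Level-3 rows are exactly those with both indicators equal to zero:
level_3 = (race_1 .== 0) .& (race_2 .== 0)
```

With drop_last = false we would also get a third indicator column, but it would be redundant: the three indicators always sum to one.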

Some models will require us to standardize our inputs. Inspecting the earlier data summaries, we can see the inputs vary on very different scales. For example, glyco (\(\alpha_1\)-acid glycoprotein) is in the [0.39, 3.16] mg/100 dL range, while weight is in the [41, 119] kg range. However, following good data hygiene practices, we will implement standardization by replacing our models with a pipeline that includes the standardization, together with a wrapper to handle standardization of the outputs. See the with_standardization(model) function defined in Section 4.2.
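Per column, the standardization amounts to subtracting the mean and dividing by the standard deviation. A quick sketch, with illustrative numbers rather than rows from the Quinidine data:

```julia
using Statistics

# Sketch of per-column standardization (z-scoring); illustrative values only.
weight = [41.0, 60.0, 78.0, 95.0, 119.0]
z = (weight .- mean(weight)) ./ std(weight)
# After standardization the column has mean ≈ 0 and standard deviation ≈ 1.
```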

# Type-coercion pipeline:
coerce_to_ordered_factor = input -> coerce(input, Symbol("Race") => OrderedFactor)
coerce_to_continuous = ContinuousEncoder(; one_hot_ordered_factors = true, drop_last = true)
pipe = coerce_to_ordered_factor |> coerce_to_continuous

# Apply transformations in pipeline
mach = machine(pipe, raw_inputs)
fit!(mach, verbosity = 0)
inputs = MLJBase.transform(mach, raw_inputs)

# Check results
schema(inputs)
┌────────────┬────────────┬─────────┐
│ names      │ scitypes   │ types   │
├────────────┼────────────┼─────────┤
│ Smoke      │ Continuous │ Float64 │
│ Heart      │ Continuous │ Float64 │
│ Creatinine │ Continuous │ Float64 │
│ Age        │ Continuous │ Float64 │
│ Race__1    │ Continuous │ Float64 │
│ Race__2    │ Continuous │ Float64 │
│ Height     │ Continuous │ Float64 │
│ glyco      │ Continuous │ Float64 │
│ Ethanol    │ Continuous │ Float64 │
│ Weight     │ Continuous │ Float64 │
└────────────┴────────────┴─────────┘

Figure 2 shows a heatmap for the correlation matrix of our inputs (and for this purpose we have standardized the data).

Figure 2: Correlation matrix between covariates shows very weak correlations. The only pairs of covariates with correlations larger than 0.3 in absolute value are: creatinine-age, race_2-height and creatinine-weight.
using CairoMakie
# standardize the inputs:
mach = machine(Standardizer(), inputs) |> fit!
whitened_inputs = MLJBase.transform(mach, inputs);
CairoMakie.activate!()
fig = Figure(; size = (1200, 700), fontsize = 20)
ax = Axis(
    fig[1, 1];
    title = "Correlations",
    xticks = (1:10, names(whitened_inputs)),
    yticks = (10:-1:1, names(whitened_inputs)),
)
covariance = (whitened_inputs |> Matrix |> cov)
hm = heatmap!(ax, 1:10, 10:-1:1, covariance)
Colorbar(fig[1, 2], hm)
display(fig)
save(joinpath(LOCAL, "images", "corrMat.svg"), fig)

3.4 Outputs

Moving on to the outputs, to get the empirical Bayes estimates for clearance, central volume and absorption rate, some PK steps are necessary. Based on (Verme et al. 1992), below is the definition of a Pumas PK model with one compartment, first-order absorption and elimination, 3 fixed parameters (clearance tvcl, central volume tvvc, and absorption constant tvka), their random effects, and an additive residual error term (\(\sigma\)). Additionally, reference parameters extracted from (Pinheiro and Bates 2000) are also defined.

pumas_model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end

# Parameters for initialization:
params = (;
    tvcl = ℯ^2.5, # 12.18
    tvvc = ℯ^5.4, # 221
    tvka = ℯ^-0.21, # 0.81
    Ω = [0.056 0.0 0.0; 0.0 0.03 0.0; 0.0 0.0 0.01],
    σ = 0.0123,
)

And the Population required by such a model can be prepared in one step using the read_pumas function. In the beginning, we created the columns cmt and evid because this function requires them.

population = read_pumas(
    df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    covariates = covariates,
    observations = [:conc],
)
Population
  Subjects: 136
  Covariates: Smoke, Heart, Creatinine, Age, Race, Height, glyco, Ethanol, Weight (heterogenous)
  Observations: conc (heterogenous)
Note

A Population is a vector of Subjects. To learn more about Subject and read_pumas, enter ?Subject and ?read_pumas in the Julia REPL (command line).

With the population ready, the PK model is fitted.

fitted_pumas_model =
    fit(pumas_model, population, params, FOCE(); optim_options = (show_trace = false,))
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
FittedPumasModel

Dynamical system type:                 Closed form

Number of subjects:                            136

Observation records:         Active        Missing
    conc:                       361             82
    Total:                      361             82

Number of parameters:      Constant      Optimized
                                  0             10

Likelihood approximation:                     FOCE
Likelihood optimizer:                         BFGS

Termination Reason:                      NoXChange
Log-likelihood value:                   -611.21547

-----------------
       Estimate
-----------------
tvcl    9.5556
tvvc   35.467
tvka    0.42799
Ω₁,₁    0.09674
Ω₂,₁   -0.099655
Ω₃,₁   -0.17095
Ω₂,₂    0.18456
Ω₃,₂    0.1447
Ω₃,₃    0.32621
σ       1.1189
-----------------

And the empirical Bayes estimates are extracted and reorganized into a DataFrame for the outputs.

ebe = only.(empirical_bayes(fitted_pumas_model, population))
ebe_matrix = permutedims(reduce(hcat, ebe)) # Reshape into 136x3 matrix
outputs = DataFrame(ebe_matrix, [:cl, :vc, :ka]) # Wrap in DataFrame
describe(outputs)[:, Not(7)] |> println;
3×6 DataFrame
 Row │ variable  mean          min        median       max       nmissing
     │ Symbol    Float64       Float64    Float64      Float64   Int64
─────┼────────────────────────────────────────────────────────────────────
   1 │ cl        -0.0145753    -0.687295  -0.0182438   0.558038         0
   2 │ vc         0.000559858  -0.794945  -0.00243347  0.911257         0
   3 │ ka         0.0322281    -1.03891    0.0313174   1.17878          0

4 Models

4.1 Linear regression

Now that inputs and outputs are in the desired formats, the ML models can be fitted. The first one is linear regression, which makes a prediction \(\hat{y}\) by adding a bias term \(b\) to the result of weighing the inputs \(x\) by multiplying them by a weight matrix \(W\):

\[ \hat{y} = Wx + b \tag{1}\]

It’s worth pointing out that our outputs, the empirical Bayes estimates, are represented as vectors of 3 elements. Therefore, this is technically a multitarget regression problem, instead of the usual prediction of a single scalar value per data point. Fitting a linear regression model means optimizing the parameters \(W\) and \(b\) to minimize the prediction error. The implementation used here applies the “least squares” algorithm, minimizing the loss \(L\) defined by the sum of squares of the residuals (deviations between predictions \(\hat{y}\) and labels \(y\)):

\[ L = \sum_{data} |\hat{y} - y|^2 \]
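To make the least-squares objective concrete, here is a small self-contained sketch (with synthetic numbers, not the tutorial's data) showing how optimal coefficients can be obtained in one step with Julia's backslash operator, after folding the bias \(b\) into the coefficient matrix via a constant column:

```julia
using LinearAlgebra

# Synthetic stand-ins for inputs (3 points, 2 features) and multitarget
# outputs (2 targets per point); illustrative only.
X = [1.0 2.0; 3.0 4.0; 5.0 7.0]
Y = [1.0 0.0; 2.0 1.0; 3.5 2.0]

Xb = hcat(X, ones(size(X, 1)))  # constant column folds the bias b into W
coeffs = Xb \ Y                 # least-squares solution
residuals = Xb * coeffs .- Y    # the deviations the loss L sums over
```

For this tiny square system the fit happens to be exact; with more data points than parameters, `\` returns the coefficients minimizing the sum-of-squares loss \(L\).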

In the next code block (not needed in the sequel), the model is set up, fitted and then evaluated using 6-fold cross-validation. When there are many covariates, regularization is recommended, but we still want to showcase basic linear regression.

# Linear regression
linear_regressor = LinearRegressor()
mach = machine(linear_regressor, inputs, outputs)
evaluate!(mach; measure = multitarget_l2, verbosity = 0)
PerformanceEvaluation object with these fields:
  model, measure, operation,
  measurement, per_fold, per_observation,
  fitted_params_per_fold, report_per_fold,
  train_test_rows, resampling, repeats
Extract:
┌────────────────────┬───────────┬─────────────┐
│ measure            │ operation │ measurement │
├────────────────────┼───────────┼─────────────┤
│ MultitargetLPLoss( │ predict   │ 0.0705      │
│   p = 2)           │           │             │
└────────────────────┴───────────┴─────────────┘
┌────────────────────────────────────────────────┬─────────┐
│ per_fold                                       │ 1.96*SE │
├────────────────────────────────────────────────┼─────────┤
│ [0.0365, 0.0301, 0.108, 0.0985, 0.109, 0.0421] │ 0.0332  │
└────────────────────────────────────────────────┴─────────┘

The next code block includes the algebraic steps necessary for the linear regression model to process a batch of 50 subjects. After getting the inputs, the parameters are extracted from the trained model. Then, they are manually used for the batched linear algebra operations, as in Equation 1. Afterwards, the trained MLJ model makes predictions based on the same inputs, and both manual and automatic results are compared.

input_batch = inputs[1:50, :] # Input batch with 50 data points
coefficient_matrix = fitted_params(mach)[1]
bias = fitted_params(mach)[2]

# Use coefficients and biases manually
manually_computed_output =
    mapslices(d -> d + bias, Matrix(input_batch) * coefficient_matrix; dims = 2)
model_prediction = MLJBase.predict(mach, input_batch) |> Matrix
manually_computed_output == model_prediction # check results
true
Note

mapslices(f, A; dims) applies the function f to slices of the array A along the dimensions dims. In the previous code block, we used mapslices(d -> d + bias, Matrix(input_batch) * coefficient_matrix; dims = 2) to add the bias to each row of the matrix that results from Matrix(input_batch) * coefficient_matrix. A simpler use case follows: a function calculation processes each row of a matrix A. mapslices can be applied to arrays of any number of dimensions.

julia> A = collect(reshape(1:9, (3, 3)))
3×3 Matrix{Int64}:
 1  4  7
 2  5  8
 3  6  9

julia> calculation(x) = x[1] + x[2] / x[3]

julia> mapslices(calculation, A; dims = 2)
3×1 Matrix{Float64}:
 1.5714285714285714  # 1 + 4/7
 2.625               # 2 + 5/8
 3.6666666666666665  # 3 + 6/9

4.2 Regularization

As mentioned in the previous tutorial in the course, overfitting is adjusting the model too tightly to the training data distribution. It should be avoided so that the model’s performance doesn’t drop too much from the training set to new, unseen cases. Tied to overfitting is the notion of model capacity: the diversity of input-label mappings a model can represent (Ian Goodfellow 2016). Approaches that purposely restrict capacity to avoid overfitting are called regularization.

Two such techniques for linear regression are LASSO (least absolute shrinkage and selection operator) and ridge (Tikhonov) regression. As with many regularization methods, they can be described as including a penalty term in the objective function to be minimized during optimization, as if giving the optimizer a second goal beyond only minimizing the cost/loss. Equation 2 and Equation 3 include the new expressions for the LASSO and ridge losses, respectively.

\[ \sum_{data} |\hat{y} - y|^2 + \lambda \sum_{coefficients} |w| \tag{2}\]

\[ \sum_{data} |\hat{y} - y|^2 + \lambda \sum_{coefficients} w^2 \tag{3}\]

where \(w\) are the coefficients in the weight matrix \(W\) from Equation 1, and \(\lambda\) is a Lagrange multiplier: a weight defining the strength of the regularization. Higher values make the optimizer prioritize constraining the model parameters.
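As a quick numerical illustration of the two penalty terms, here they are computed for an arbitrary coefficient matrix (not one fitted to our data):

```julia
# Arbitrary illustrative coefficient matrix W and regularization strength λ.
W = [0.5 -1.0; 0.0 2.0; -0.25 0.0]
λ = 0.1

l1_penalty = λ * sum(abs, W)    # LASSO term:  λ Σ |w|   (Equation 2)
l2_penalty = λ * sum(abs2, W)   # ridge term:  λ Σ w²    (Equation 3)
```

Note that, unlike the squared penalty, the L1 term penalizes small coefficients at full strength, which is what drives some of them exactly to zero (effectively performing feature selection).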

We implement LASSO using a single-layer neural network. Neural networks, discussed in the next section, include linear models as a very special case. Additionally, we wrap our model in input/output standardization:

# Wrapper for input and output standardization
with_standardization(model) =
    Standardizer() |> TransformedTargetModel(model, transformer = Standardizer)

lasso = with_standardization(
    MultitargetNeuralNetworkRegressor(
        builder = MLJFlux.Linear(σ = identity),
        alpha = 1,
        lambda = 0.1,
        epochs = 50,
        rng = SEED, # seed
        optimiser_changes_trigger_retraining = true,
    ),
)
show(lasso, 3) # display nested hyper-parameters
DeterministicPipeline(
  standardizer = Standardizer(
        features = Symbol[], 
        ignore = false, 
        ordered_factor = false, 
        count = false), 
  transformed_target_model_deterministic = TransformedTargetModelDeterministic(
        model = MultitargetNeuralNetworkRegressor(
              builder = Linear(
                    σ = identity), 
              optimiser = Adam(eta=0.001, beta=(0.9, 0.999), epsilon=1.0e-8), 
              loss = Flux.Losses.mse, 
              epochs = 50, 
              batch_size = 1, 
              lambda = 0.1, 
              alpha = 1.0, 
              rng = 378, 
              optimiser_changes_trigger_retraining = true, 
              acceleration = CPU1{Nothing}(nothing), 
              embedding_dims = Dict{Symbol, Real}()), 
        transformer = Standardizer(
              features = Symbol[], 
              ignore = false, 
              ordered_factor = false, 
              count = false), 
        inverse = nothing, 
        cache = true), 
  cache = true)

Notes:

  • The setting builder = MLJFlux.Linear(σ=identity) is where we make the neural network a linear model.

  • The setting alpha=1 means pure L1-regularization (LASSO). Setting alpha=0 will make this pure L2 (ridge regression).

  • lambda is the regularization strength (\(\lambda\) above).

  • epochs is the neural network iteration parameter.

The ridge regression and multilayer perceptron models considered later can be implemented with the same general structure; only the hyper-parameters builder and alpha will be different.
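For instance, a ridge counterpart of the LASSO model above could be sketched as follows, assuming the same package setup and the with_standardization helper defined earlier; the hyper-parameter values mirror the LASSO block and are illustrative:

```julia
# Sketch: ridge regression via the same single-layer network, assuming the
# tutorial's setup (MLJFlux loaded, SEED and with_standardization defined).
ridge = with_standardization(
    MultitargetNeuralNetworkRegressor(
        builder = MLJFlux.Linear(σ = identity), # still a linear model
        alpha = 0,      # pure L2 penalty: ridge regression
        lambda = 0.1,   # regularization strength λ
        epochs = 50,
        rng = SEED,
        optimiser_changes_trigger_retraining = true,
    ),
)
```

Intermediate values 0 < alpha < 1 mix the two penalties (the elastic net).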

MLJ also provides the MultiTaskLassoRegressor model for our data, but it uses the MLJScikitLearnInterface.jl package as the backend, an interface for a Python package, requiring a more cumbersome installation process. Our neural network implementation will be mathematically equivalent, although training will be a lot slower because we miss out on some optimizations specific to linear models.

When the target is a single variable rather than a vector, you will want to use the linear models provided by GLM.jl or, for regularized variants, MLJLinearModels.jl, both of which have MLJ interfaces.

To get a rough idea of the number of iterations we need, let’s generate an MLJ learning curve.

nested_epochs = :(transformed_target_model_deterministic.model.epochs)
_, _, epochs, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_epochs, lower = 1, upper = 100),
    measure = multitarget_l2,
    verbosity = 0,
);

One then plots loss against epochs, as shown in Figure 3.

Figure 3: Learning curve for epochs
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `epochs` in LASSO regression",
        xlabel = "epochs",
        ylabel = "l2 loss",
        limits = ((0, 100), (0.065, 0.085)),
    )
    scatter!(axis, epochs, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "epochs_learning_curve.svg"), fig)

By default, learning_curve estimates an out-of-sample loss (using the Holdout() resampling strategy) and here we see a slight regularization effect by stopping early (after around 60 epochs the loss begins to deteriorate). However, as we are interested here in L1 regularization, we will want to iterate this model close to convergence of the training loss. So we add the resampling=InSample() option to our learning curve to get the in-sample loss trace:

nested_epochs = :(transformed_target_model_deterministic.model.epochs)
_, _, epochs, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_epochs, lower = 1, upper = 100),
    measure = multitarget_l2,
    resampling = InSample(),
    verbosity = 0,
);
Figure 4: Learning curve for epochs (in-sample)
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `epochs` in LASSO regression (in-sample loss)",
        xlabel = "epochs",
        ylabel = "l2 loss",
        limits = ((0, 100), (0.050, 0.085)),
    )
    scatter!(axis, epochs, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "epochs_learning_curve_in_sample.svg"), fig)

Looks like about 70 iterations is close enough.

Similarly, we’ve generated learning curves to get an idea of an appropriate tuning range for lambda. Here is the curve for the range we decided on:

nested_lambda = :(transformed_target_model_deterministic.model.lambda)
_, _, lambdas, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_lambda, lower = 1, upper = 100, scale = :log10),
    measure = multitarget_l2,
);
Figure 5: Learning curve for lambda
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `lambda` in LASSO",
        xlabel = "lambda",
        xscale = log10,
        ylabel = "l2 loss",
    )
    scatter!(axis, lambdas, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "lambda_learning_curve.svg"), fig)

Here we see a marked regularization effect, although quite strong regularization (lambda $>$ 10) is needed. This suggests that the covariates are not strongly informative.

Putting this disappointing outcome aside for now, we proceed to try other models. First, to define our neural network implementation of the (untuned) ridge regressor, we need only change the alpha hyper-parameter:

ridge = with_standardization(
    MultitargetNeuralNetworkRegressor(
        builder = MLJFlux.Linear(σ = identity),
        alpha = 0,
        lambda = 0.1,
        epochs = 50,
        rng = SEED, # seed
        optimiser_changes_trigger_retraining = true,
    ),
);

Next, we define a function tuned(model) to wrap lasso and ridge in hyper-parameter tuning. In addition to tuning lambda, we will tune the learning rate of the Adam optimiser (adaptive moment estimation) (Kingma and Ba 2017), which updates the learned parameters so as to minimize the loss (given by Equation 2 or Equation 3). The Adam optimization algorithm is a variation on stochastic gradient descent that is very popular in machine learning. A larger learning rate helps the model escape big plateaus in the parameter space faster, but causes the optimizer to overshoot minima; a lower learning rate avoids overshooting, but slows down the optimization.
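To make the update rule concrete, here is a minimal plain-Julia sketch of a single Adam step. The struct name AdamState is our own, and the defaults mirror the usual Adam(eta=0.001, beta=(0.9, 0.999), epsilon=1e-8) settings; this is not Flux's implementation:

```julia
# State carried between Adam steps:
mutable struct AdamState
    m::Vector{Float64}  # first-moment (mean) estimate of the gradient
    v::Vector{Float64}  # second-moment (uncentered variance) estimate
    t::Int              # step counter, used for bias correction
end

function adam_step!(p, g, s::AdamState; eta = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8)
    s.t += 1
    @. s.m = beta1 * s.m + (1 - beta1) * g       # update moment estimates
    @. s.v = beta2 * s.v + (1 - beta2) * g^2
    mhat = s.m ./ (1 - beta1^s.t)                # bias-corrected moments
    vhat = s.v ./ (1 - beta2^s.t)
    @. p -= eta * mhat / (sqrt(vhat) + eps)      # per-coordinate scaled step
    return p
end

# One step on f(p) = sum(abs2, p), whose gradient is 2p; with a learning rate
# of 0.1, each coordinate moves roughly 0.1 toward zero:
p = [1.0, -2.0]
state = AdamState(zeros(2), zeros(2), 0)
adam_step!(p, 2 .* p, state; eta = 0.1)  # ≈ [0.9, -1.9]
```

Note how the per-coordinate scaling by the second-moment estimate makes the first step close to eta in magnitude, regardless of the raw gradient size.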

Note that the hyper-parameter name uses the spelling optimiser, with an “s”.

# Adam instances with different learning rates
optimizers = [Adam(10^i) for i = -6.0:0.0]
nested_optimiser = :(transformed_target_model_deterministic.model.optimiser)
tuned(model) = TunedModel(
    model,
    resampling = CV(),
    tuning = Grid(),
    range = [
        range(model, nested_optimiser, values = optimizers),
        range(model, nested_lambda, lower = 1, upper = 100, scale = :log),
    ],
    measure = multitarget_l2,
)

Our protocol for evaluating the performance of our self-tuning models, based on an L2 multi-target loss function (measure), will be to apply Monte Carlo cross-validation, with two folds, repeated five times, giving us a total of ten estimates on which to base a 95% confidence interval \([a, b]\) (applying the standard fiction of independence and normality). We repeat this process three times, with different RNG seeds for the Monte Carlo, and report the confidence interval \([\min_i a_i, \max_i b_i]\).
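The interval arithmetic for one repetition looks like this, with hypothetical fold losses standing in for the `e.per_fold[1]` values returned by evaluate:

```julia
using Statistics

# Hypothetical per-fold losses (2 folds × 5 repeats = 10 estimates):
cv_estimates = [0.071, 0.065, 0.068, 0.073, 0.062, 0.070, 0.066, 0.069, 0.064, 0.072]

μ = mean(cv_estimates)
δ = 1.96 * std(cv_estimates) / sqrt(length(cv_estimates) - 1)
a, b = μ - δ, μ + δ  # 95% confidence interval for this repetition
```

Over the three seeds, the reported interval is then the union \([\min_i a_i, \max_i b_i]\) of the per-repetition intervals.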

While we should compare the wrapped, self-tuning models directly, this would trigger nested resampling (as discussed in the lecture, “Using MLJ. Lesson 3”). To save run-time in our illustration, we will take a pragmatic approach: extract the best model from the tuning process and estimate its performance instead (effectively “freezing” the hyper-parameters that would otherwise be learned on each CV training fold).

Here are the seed-dependent options we need to pass to evaluate to realize the Monte Carlo cross-validation using a sum-of-squares measure:

monte_carlo_cv(seed) = (
    repeats = 5,
    resampling = CV(nfolds = 2, rng = seed), # shuffling is implied when `rng` is specified
    measure = multitarget_l2,
)

And here’s the function that creates the performances estimates themselves:

nseeds = 3 # 4 or more would be better

# Here `named_models` is one or more pairs of the form, "model name" => model:
function compare(named_models, inputs, outputs; verbosity = 0)
    df =
        map(named_models) do (name, model)
            verbosity > 0 && @info "Evaluating $name..."

            # initialize CI endpoints
            a = Inf
            b = -Inf

            foreach(rand(Xoshiro(SEED), UInt64, nseeds)) do seed
                # for self-tuning models, replace `model` with best model:
                if model isa MLJTuning.DeterministicTunedModel
                    old_flag = model.train_best
                    model.train_best = false  # to save some extra runtime
                    mach = machine(model, inputs, outputs)
                    fit!(mach, verbosity = 0)
                    model.train_best = old_flag # return model to original state
                    model = report(mach).best_model
                end
                e = evaluate(
                    model,
                    inputs,
                    outputs;
                    verbosity = 0,
                    monte_carlo_cv(seed)...,
                )
                cv_estimates = e.per_fold[1]
                μ = e.measurement[1]
                δ = 1.96 * std(cv_estimates) / sqrt(length(cv_estimates) - 1)
                a = min(a, μ - δ)
                b = max(b, μ + δ)
            end
            if verbosity > 0
                ar = round(a, sigdigits = 3)
                br = round(b, sigdigits = 3)
                @info "\t$ar ≤ loss  ≤ $br"
            end
            (; name, loss_lower = a, loss_upper = b, midpoint = mean([a, b]), model)
        end |> DataFrame
    sort!(df, :midpoint)
end

We now compare the performance of ridge and LASSO regressors, with ordinary linear regression. We’ll also include, as a baseline, a regressor that just predicts the mean value of the training target (outputs):

comparisons = compare(
    [
        "linear regression" => linear_regressor,
        "LASSO (untuned)" => lasso,
        "LASSO" => tuned(lasso),
        "ridge (untuned)" => ridge,
        "ridge" => tuned(ridge),
        "constant predictor" => DeterministicConstantRegressor(),
    ],
    inputs,
    outputs,
)
println(comparisons[:, 1:4]);
6×4 DataFrame
 Row │ name                loss_lower  loss_upper  midpoint
     │ String              Float64     Float64     Float64
─────┼───────────────────────────────────────────────────────
   1 │ ridge                0.0571683   0.0745036  0.065836
   2 │ LASSO                0.057764    0.0748687  0.0663164
   3 │ constant predictor   0.059257    0.0771778  0.0682174
   4 │ LASSO (untuned)      0.066826    0.0805361  0.0736811
   5 │ ridge (untuned)      0.06699     0.080693   0.0738415
   6 │ linear regression    0.0660207   0.0822911  0.0741559

The tuned ridge and LASSO regressors score only slightly better than the constant predictor, and from the informal confidence intervals, it’s clear the benefits are not statistically significant. This confirms our earlier suspicion that the covariates are poor predictors of the PK parameters.

To get some insight into the hyper-parameter optimization, let us examine Figure 6, which shows a contour plot of the loss function, in the case of tuning the ridge regressor.

Figure 6: Contour plot of loss in tuning ridge regressor
using CairoMakie
model = tuned(ridge)
mach = machine(model, inputs, outputs) |> fit!
plotting = report(mach).plotting
x = map(plotting.parameter_values[:, 1]) do optimizer
    log10(optimizer.eta)
end;
y = log10.(plotting.parameter_values[:, 2]);
z = plotting.measurements;

# find local min:
j = argmin(z)
min_loss = z[j]
X, Y = x[j], y[j]

# to clip the contours
mask = z .< 0.25
x = x[mask]
y = y[mask]
z = z[mask]

begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        xlabel = "log10 of learning rate",
        ylabel = "log10 of lambda",
        title = "Contours of out-of-sample loss for ridge regression",
    )
    contour!(axis, x, y, log.(z); levels = 20)
    scatter!(axis, X, Y, markersize = 10, color = :black)
    text!(
        axis,
        X + 0.15,
        Y,
        text = "min loss",
        fontsize = 14,
        color = :black,
        align = (:left, :bottom), # Align the text anchor to the bottom-left of the text box
    )
end
display(fig)
save(joinpath(LOCAL, "images", "tuning_contours.svg"), fig)

We see that the loss is generally more sensitive to the learning rate. For smaller learning rates, the poorer performance may be attributable to incomplete training, because epochs is fixed to 50 iterations, a defect we will address later.

Note

The dynamics of overfitting, capacity and regularization is sometimes referred to as the bias-variance tradeoff (Wikipedia 2025):

  1. The bias error comes from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). This can be caused by over-regularizing, e.g. setting the regularization strength \(\lambda\) too high in the LASSO and ridge losses.
  2. The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting). This can happen in the absence of regularization.
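A quick plain-Julia illustration of the variance side: a more flexible model always fits the training data at least as well, which is exactly what makes it prone to chasing noise. The degrees and noise level below are arbitrary choices for illustration:

```julia
using Random, Statistics

# Noisy observations of a smooth function:
rng = Xoshiro(0)
x = collect(range(-1, 1, length = 30))
y = sin.(3 .* x) .+ 0.2 .* randn(rng, 30)

# Design matrix with columns 1, x, x^2, ..., x^degree:
vandermonde(x, degree) = [xi^d for xi in x, d in 0:degree]

function train_mse(degree)
    A = vandermonde(x, degree)
    w = A \ y                    # least-squares coefficients
    return mean(abs2, A * w .- y)
end

train_mse(1), train_mse(9)  # the flexible model fits the training data better
```

The degree-9 fit is guaranteed a lower training error because its column space contains the degree-1 model's, but its out-of-sample error can easily be worse: that gap is the variance.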

4.3 Neural networks from scratch (optional)

In this section, we take a tangent in the workflow to build neural networks step by step, since they are the most discussed class of models in ML. That’s mainly due to their flexibility in complexity and architecture, enabling them to solve an endless variety of problems. As mentioned in the previous section, neural networks can be so simple as to equate to linear regression. But they can also have billions of parameters, distributed across many layers that are connected in very creative ways and apply very specific transformations to data.

The starting point for understanding neural networks is the threshold logic unit (TLU). When it receives a vector of inputs \(x\), it computes the dot product with the TLU’s weights \(w\), i.e. the scalar \(x^T w\).

first_subject_inputs = collect(inputs[1, :]) # Sample subject, containing 10 features
input_size = length(first_subject_inputs) # 10
weights = rand(input_size) # Weights of TLU

# Compare manual and automatic dot products
dot_product_manual = sum(first_subject_inputs .* weights)
dot_product_function = dot(first_subject_inputs, weights) # first_subject_inputs' * weights
dot_product_manual ≈ dot_product_function # Check results
true

Then, the TLU applies an activation function to the result of the dot product. Originally, this was a step function, such as the Heaviside step function, which is 0 if its input is negative; and 1 otherwise. However, over the years, many alternative activation functions have been proposed. In the following code are some common activation functions, including their plots in Figure 7.

input_range = LinRange(-3, 3, 100)

# Apply activations pointwise
relu_output = relu.(input_range)
σ_output = σ.(input_range) # sigmoid
tan_output = tanh.(input_range)
heaviside(x) = x < 0 ? 0 : 1
heaviside_output = heaviside.(input_range)
Figure 7: Common activation functions.
CairoMakie.activate!()
activation_outputs = Dict(
    "ReLU" => relu.(input_range),
    "Sigmoid" => σ.(input_range),
    "Hyperbolic tangent" => tanh.(input_range),
    "Heaviside" => heaviside.(input_range),
)
fig = Figure()
yticks = range(-10, 10, 10)
for ((actName, vals), indices) in zip(activation_outputs, CartesianIndices((2, 2)))
    ax = Axis(
        fig[indices.I...];
        xlabel = "x",
        ylabel = "Activation",
        title = actName,
        ytickformat = "{:.2f}",
    )
    lines!(ax, input_range, vals)
end
save(joinpath(LOCAL, "activations.svg"), fig)

The lower-case sigma function σ used here is defined in Flux.NNlib.σ. It is the sigmoid, an activation function with output in [0, 1], defined by \(\sigma(x) = \frac{1}{1 + e^{-x}}\). Additionally, relu is the ReLU (Rectified Linear Unit) function, defined as \(\mathrm{ReLU}(x) = \max(x, 0)\).

To represent a TLU, the following code block defines a type TLU with two fields: a vector weights, and a function activation. Afterwards, its behavior on an array is defined. It calculates the dot product between the inputs and the weights, then applies the activation function. An example of instantiation and usage are also included.

# Type
struct TLU{T<:Real,F}
    weights::Vector{T}
    activation::F
end
# Operation definition
(t::TLU)(x) = t.activation(dot(t.weights, x))
tlu = TLU(rand(input_size), σ) # Instantiate
tlu(rand(input_size)) # Use on random input
0.9312271831234411
Note

Julia has a very rich type system. In TLU{T<:Real, F}, the fields of the new type TLU will have types T<:Real and F. Here, T and F are parametric types, assuming the types of the inputs provided to the TLU object initialization. However, T<:Real constrains T to be a subtype of Real, i.e. float, integer, rational or irrational. And as for the fields, weights is constrained to be a vector with elements of type T; and activation, to have type F.

After the type description, (t::TLU)(x) defines what happens when we call a TLU object as a function on an input x. Notice that the operations (dot(t.weights, x) |> t.activation) are able to use the input x and the fields (t.weights and t.activation) of the object.

Then, TLU(rand(input_size), σ) creates an instance of the TLU type with the field weights containing rand(input_size); and activation, σ (sigmoid - plotted in Figure 7).

The next building block for neural networks is the perceptron, a combination of multiple TLUs. Each unit individually processes the input through its usual operations as explained above. Also, a bias term associated with the layer is added to the results from the TLUs. Below, a type for the perceptron is defined with two fields: layer, a vector of TLUs; and the bias term to be added to the results. Then, the operation of a Perceptron acting on an array is defined. Lastly, examples of instantiation and usage are also included.

# Type
struct Perceptron{T<:Real}
    layer::AbstractVector{<:TLU}
    bias::T
end
# Operation definition
(p::Perceptron)(x) = map(l -> l(x) .+ p.bias, p.layer)
# Vector of TLUs for layer
units = [TLU(rand(input_size), σ) for _ = 1:10]
per = Perceptron(units, 20) # Instantiation with TLUs and bias
per(rand(input_size)) # Usage on random input
10-element Vector{Float64}:
 20.90470976817967
 20.84300714548213
 20.874802128254913
 20.92651856375763
 20.92217033748405
 20.801243418759828
 20.818704515078064
 20.88633449200735
 20.945884026979886
 20.878219072705406

The next level of abstraction is the multilayer perceptron (MLP). As the name implies, it combines multiple perceptrons in sequence, so the type has a single field: a vector of perceptrons called layers. Its operation on an array is set by a full function definition this time, since it’s a bit more involved. The first layer is applied to the input, following the previous perceptron definition. If there is only one layer, the result is returned; otherwise, a for loop applies the remaining layers sequentially, passing each one’s output to the next. Once the last layer has been applied, the result is returned.

struct MLP
    layers::AbstractVector{<:Perceptron}
end
function (mlp::MLP)(x)
    out = mlp.layers[1](x)
    length(mlp.layers) == 1 && (return out)
    for l in mlp.layers[2:end]
        out = l(out)
    end
    return out
end

As an example that ties all the previous concepts together, the following code includes the detailed construction of an MLP. Figure 8 shows the 10 values in the vector resulting from rand(input_size) |> mlp |> softmax.

out1 = 6
out2 = 4
mlp_output_dim = 10
mlp = MLP([
    # TLUs in first layer need to have 'input_size' weights each, so
    # the dimension of their weight vector matches that of
    # the inputs for the dot products
    Perceptron([TLU(rand(input_size), relu) for _ = 1:out1], 1),

    # Size of weight vectors of TLUs in layer n has to match
    # number of TLUs in layer (n - 1), here out1
    Perceptron([TLU(rand(out1), relu) for _ = 1:out2], 1),

    # Unlike the previous layers, the number of TLUs in the last
    # one isn't arbitrary, as it defines the size of the MLP's output
    Perceptron([TLU(rand(out2), identity) for _ = 1:mlp_output_dim], 1),
])
Figure 8: Output from example MLP after softmax (rand(input_size) |> mlp |> softmax).
using CairoMakie
barplot(rand(input_size) |> mlp |> softmax)

Softmax is the function defined by Equation 4. It transforms a collection of values into a probability distribution, since the new values are nonnegative and add up to 1. The values that result from applying softmax to the output of the neural network in the example can be interpreted as the classification probabilities among mlp_output_dim = 10 categories.

\[ \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \tag{4}\]
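A plain-Julia version of Equation 4, using the standard max-subtraction trick for numerical stability (a sketch, not Flux's softmax):

```julia
# Softmax: exponentiate, then normalize to sum to 1.
function softmax_sketch(x)
    e = exp.(x .- maximum(x))  # shifting by the max leaves the result unchanged
    return e ./ sum(e)
end

p = softmax_sketch([1.0, 2.0, 3.0])
sum(p)  # ≈ 1.0, as required of a probability distribution
```

Subtracting the maximum before exponentiating avoids overflow for large inputs while leaving the ratios, and hence the output, unchanged.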

4.4 Applying neural networks

Going back to our workflow, let’s check what models MLJ.jl provides for our data. As mentioned in the beginning, we are solving a multitarget regression task, hence why most models include “multitarget”, “multitask” and “regressor” in the name.

models(matching(inputs, outputs)) .|> println;
(name = DeterministicConstantRegressor, package_name = MLJModels, ... )
(name = MultiTaskElasticNetCVRegressor, package_name = MLJScikitLearnInterface, ... )
(name = MultiTaskElasticNetRegressor, package_name = MLJScikitLearnInterface, ... )
(name = MultiTaskLassoCVRegressor, package_name = MLJScikitLearnInterface, ... )
(name = MultiTaskLassoRegressor, package_name = MLJScikitLearnInterface, ... )
(name = MultitargetKNNRegressor, package_name = NearestNeighborModels, ... )
(name = MultitargetLinearRegressor, package_name = MultivariateStats, ... )
(name = MultitargetNeuralNetworkRegressor, package_name = MLJFlux, ... )
(name = MultitargetRidgeRegressor, package_name = MultivariateStats, ... )
(name = MultitargetSRRegressor, package_name = SymbolicRegression, ... )

We do not see many models here, because multitarget regression is a very specific task. To avoid being limited by the number of models, the task could have been broken down into smaller parallel tasks, one per PK parameter (Cl, Vc, Ka), possibly mixing in tree-based models to better handle the categorical covariates.
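To see why the per-target decomposition is natural, note that for ordinary least squares the joint multitarget fit coincides exactly with fitting each target column separately (toy numbers, plain Julia):

```julia
# Toy design matrix (4 subjects, 2 covariates) and two targets:
X = [1.0 0.0; 0.0 1.0; 1.0 1.0; 2.0 1.0]
Y = [1.0 2.0; 3.0 4.0; 4.0 6.0; 5.0 8.0]

W_joint = X \ Y                               # all targets fitted at once
W_cols = hcat([X \ Y[:, j] for j in 1:2]...)  # one target at a time
W_joint ≈ W_cols                              # identical coefficient matrices
```

For regularized or nonlinear models the two approaches can differ, since a shared penalty or shared hidden layers couple the targets.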

The next code block defines the neural network’s architecture through a “builder” and Flux.Chain. That is, what types of layers we are going to use, how many of them, and what’s the data flow inside the model. For now, we keep it simple, using a sequence of Dense layers. Note that L2 regularization is used (alpha=0), as in ridge regression.

builder = @builder begin
    Chain(
        Dense(n_in, 12, relu),
        Dense(12, 12, relu),
        Dense(12, 12, relu),
        Dense(12, n_out, identity),
    )
end
neural_network = with_standardization(
    MultitargetNeuralNetworkRegressor(;
        builder,
        batch_size = 32,
        alpha = 0,
        epochs = 100,
        rng = SEED,
        optimiser_changes_trigger_retraining = true,
    ),
)
Note

Flux.jl is one of the main deep learning packages in Julia, providing architecture definitions, training procedures, activation functions, GPU support (through CUDA.jl for NVIDIA cards), loss functions, optimizers, etc. An alternative is Lux.jl.

Mathematically, Dense layers equate to the linear algebra operations in Section 4.1, i.e. matrix multiplication and addition, followed by application of a non-linear activation function. The parameters in the Dense layer definitions are the input and output sizes; the specified activation function is applied pointwise. The output size of one layer matches the input size of the next, since data flows sequentially from layer to layer.
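For example, what a single Dense(3, 2, relu) layer computes can be written out in a few lines of plain Julia. This is a sketch of the mathematics, not Flux's implementation, and the weights below are arbitrary:

```julia
relu_(x) = max(x, 0)     # pointwise activation
W = [0.5 -1.0 0.2;
     0.1  0.3 -0.4]      # 2×3 weight matrix: output size × input size
b = [0.1, -0.2]          # bias vector, one entry per output unit
dense(x) = relu_.(W * x .+ b)

dense([1.0, 0.0, 1.0])   # [0.8, 0.0]
```

In Flux, W and b are the trainable parameters that the optimiser updates during training.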

Note that MLJ will automatically interpolate the correct input and output dimension n_in and n_out using the data and will get rng from the neural_network hyper-parameter rng (set here to SEED). For our data n_in will be 10 (the number of covariates after one-hot encoding) and n_out will be 3, since we are predicting the vector of EBEs for clearance, volume and absorption constant.

Also notice that we have increased the epochs value we used in the single layer LASSO and ridge models, as adding layers generally increases the number of epochs needed for convergence. A more principled approach to controlling epochs is given in the next section.

Now is a good opportunity to discuss batching: many ML models can process multiple inputs at once. Little tuning tends to be done on this hyper-parameter, with common values being powers of 2 between 32 and 512. Batching is useful for speeding up and parallelizing model training and inference (using a trained model). This is especially true for neural networks, which can be parallelized on the GPU (graphics processing unit), a hardware component that handles array operations efficiently in parallel. For large datasets that do not fit in memory, batching becomes a necessity. Lastly, working with batches adds a bit of randomness to the inputs of the neural network, indirectly helping with regularization.
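Minibatching itself is simple to sketch in plain Julia: shuffle the row indices once per epoch and walk through them in chunks (batch size 32, matching the model above):

```julia
using Random

# Split 100 row indices into minibatches of 32:
n_samples, batch_size = 100, 32
idx = shuffle(Xoshiro(0), 1:n_samples)  # shuffling supplies the randomness noted above
batches = collect(Iterators.partition(idx, batch_size))
length(batches), length(batches[end])   # 4 batches; the last holds the 4 leftover rows
```

Each epoch, the model would see the batches in a fresh random order, with the final, smaller batch picking up the leftover rows.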

Let’s estimate the neural network’s model performance, for comparison with the linear models:

comparisons = vcat(
    comparisons,
    compare(
        [
            "neural network (untuned)" => neural_network,
            "neural network" => tuned(neural_network),
        ],
        inputs,
        outputs,
    ),
)
sort!(comparisons, :midpoint)
println(comparisons[:, 1:4]);
8×4 DataFrame
 Row │ name                      loss_lower  loss_upper  midpoint
     │ String                    Float64     Float64     Float64
─────┼─────────────────────────────────────────────────────────────
   1 │ ridge                      0.0571683   0.0745036  0.065836
   2 │ LASSO                      0.057764    0.0748687  0.0663164
   3 │ neural network             0.0591544   0.0771833  0.0681689
   4 │ constant predictor         0.059257    0.0771778  0.0682174
   5 │ LASSO (untuned)            0.066826    0.0805361  0.0736811
   6 │ ridge (untuned)            0.06699     0.080693   0.0738415
   7 │ linear regression          0.0660207   0.0822911  0.0741559
   8 │ neural network (untuned)   0.0729304   0.0912526  0.0820915

4.5 Early-stopping

With neural networks, there are two main ways to decide when training should stop. The simplest is a fixed number of epochs, i.e. a fixed number of passes over the entire training split; once these are done, a validation epoch is performed and training ends. The alternative is early-stopping: every few training epochs, a validation epoch is run to track the validation loss, and stopping criteria end training dynamically when appropriate. When configured correctly, this method provides more flexibility: a fixed number of epochs is arbitrary, whereas early stopping lets the model train for as long as it needs, without overfitting.

Before training the neural networks, here is high-level pseudocode for training an ML model with gradient-based optimization, using a holdout set for validation in early-stopping:

  1. Set up the initial components:
    1. Optimiser \(O\), loss function \(L\), data \(D\), hyperparameters, and model \(M\) with random initial parameters \(p\)
  2. Split the dataset between training and validation
  3. Loop over the following steps:
    1. For a prespecified number of training epochs:
      1. Sample a batch \(B = (x, y)_i\) of data points from the training split, including inputs \(x\) and respective targets/labels \(y\)
      2. Use model \(M\) to get predictions \(\hat{y} = M(x)\)
      3. Calculate the gradient of the loss function with respect to the model parameters, \(g = \nabla_{p} L(p, y, \hat{y})\)
      4. Apply a gradient step with the optimiser to update the model parameters: \(p \leftarrow p - \eta \cdot O(g)\)
    2. Complete a validation epoch in a similar manner, but without gradient calculations or updates; log the average validation error across batches
    3. Check the early-stopping conditions; if any are triggered, stop training
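A bare-bones version of one such stopping condition, the Patience criterion, can be written in plain Julia (a sketch of the idea, not IterationControl.jl's implementation): stop once the validation loss has worsened a given number of times in a row.

```julia
# Returns the index at which training would stop, given a validation-loss history:
function patience_stop_index(val_losses; patience = 5)
    worse = 0
    for i in 2:length(val_losses)
        worse = val_losses[i] > val_losses[i-1] ? worse + 1 : 0
        worse >= patience && return i  # patience exhausted: stop here
    end
    return length(val_losses)          # budget used up without triggering
end

losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.5]
patience_stop_index(losses; patience = 5)  # returns 8: stops before seeing the 0.5
```

The example shows the tradeoff: too little patience can stop training just before a late improvement, while too much patience wastes epochs on a worsening model.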

Below, we prepare another neural network run. First, a helper to shrink the learning rate is defined; afterwards, some iteration-control conditions are set. In MLJ.jl, early-stopping is based on the control cycles from IterationControl.jl, so it’s important to keep in mind that the conditions are tested at the end of a control cycle, which contains Step(n) training epochs. Among these conditions, some work as stopping criteria: there’s NumberLimit for the usual total number of cycles, but also Patience, which is based on the trend of the validation loss across control cycles.

# Function to update learning rate
function shrink_learning_rate!(neural_network)
    core_model = neural_network.transformed_target_model_deterministic.model
    core_model.optimiser = Adam(core_model.optimiser.eta * 0.99)
    return neural_network
end
# stop conditions and learning rate scheduling:
controls = [
    Step(1), # Training epochs per control cycle
    NumberLimit(150), # Don't train for more than this many control cycles
    Patience(5), # Stop after n iterations of worsening in validation loss
    WithModelDo(f = shrink_learning_rate!),
    InvalidValue(), # Stop if loss is NaN, Inf, or -Inf
];

There’s also the option of learning rate scheduling: changing the learning rate as the optimization progresses (Géron 2019). Here, shrink_learning_rate! multiplies the current learning rate by 0.99 at the end of every control cycle (via WithModelDo(f = shrink_learning_rate!)). This lets us start from a higher-than-usual initial learning rate, since it will shrink as training proceeds. This type of dynamic control provides flexibility and tries to enjoy the benefits of both high and low learning rate values when appropriate.
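Multiplying by 0.99 per control cycle is just exponential decay; over the 150-cycle budget the schedule looks like the following plain-Julia sketch (the starting value 1.0 is illustrative):

```julia
# Learning rate in effect after k control cycles of shrinkage; it halves
# roughly every 69 cycles, since 0.99^69 ≈ 0.5:
etas = [1.0 * 0.99^k for k in 0:149]
(etas[1], etas[end])  # decays from 1.0 down to 0.99^149 ≈ 0.224
```

So even over 150 cycles the rate only drops to about a fifth of its starting value; a smaller multiplier would decay it faster.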

Next, we also define a “callback”, a common term used to describe procedures that are executed sporadically throughout iterative processes, such as printing or plotting in real time. Specifically, WithLossDo logs the validation loss history for later plotting and post-processing. Then, as in the previous tutorial, we wrap the model in an IteratedModel and provide some settings.

# Callback
validation_losses = []
callbacks = [WithLossDo(loss -> push!(validation_losses, loss))]

# Wrap the network from before (untuned) with iteration controls:
neural_network_stopped_early = IteratedModel(
    deepcopy(neural_network), # untuned model
    # loss and stopping are based on out-of-sample:
    resampling = Holdout(fraction_train = 0.7);
    measures = multitarget_l2,
    controls = vcat(controls, callbacks),
)
DeterministicIteratedModel(
  model = DeterministicPipeline(
        standardizer = Standardizer(features = Symbol[], …), 
        transformed_target_model_deterministic = TransformedTargetModelDeterministic(model = MultitargetNeuralNetworkRegressor(builder = GenericBuilder(apply = #48), …), …), 
        cache = true), 
  controls = Any[Step(1), NumberLimit(150), Patience(5), WithModelDo{typeof(shrink_learning_rate!)}(shrink_learning_rate!, false, nothing), InvalidValue(), WithLossDo{var"#50#51"}(var"#50#51"(), false, nothing)], 
  resampling = Holdout(
        fraction_train = 0.7, 
        shuffle = false, 
        rng = TaskLocalRNG()), 
  measure = MultitargetLPLoss(p = 2), 
  weights = nothing, 
  class_weights = nothing, 
  operation = nothing, 
  retrain = false, 
  check_measure = true, 
  iteration_parameter = nothing, 
  cache = true)

Let’s train this model on all the data to see the callback in action:

mach = machine(neural_network_stopped_early, inputs, outputs) |> fit!
validation_losses[(end-5):end]
[ Info: Training machine(DeterministicIteratedModel(model = DeterministicPipeline(standardizer = Standardizer(features = Symbol[], …), …), …), …).
[ Info: No iteration parameter specified. Using `iteration_parameter=:(transformed_target_model_deterministic.model.epochs)`. 
[ Info: final loss: 0.06741769922846703
[ Info: final training loss: 0.8812067914648667
[ Info: Stop triggered by EarlyStopping.NumberLimit(150) stopping criterion. 
[ Info: Total of 150 iterations. 
6-element Vector{Any}:
 0.06554591611357158
 0.06197454951679159
 0.06349798250341177
 0.06764061707806036
 0.06576148603202346
 0.06741769922846703

We cannot use tuned(neural_network_stopped_early) here, because the extra wrapping has pushed the hyper-parameters of interest one level deeper. So we’ll do this one by hand. Also, to reduce training times in this illustration, we will use narrower hyper-parameter ranges (obtained separately by running with the original, larger ranges):

deeply_nested_optimiser = :(model.transformed_target_model_deterministic.model.optimiser)
deeply_nested_lambda = :(model.transformed_target_model_deterministic.model.lambda)
tuned_neural_network_stopped_early = TunedModel(
    neural_network_stopped_early,
    resampling = CV(),
    tuning = Grid(goal = 12),
    range = [
        range(
            neural_network_stopped_early,
            deeply_nested_optimiser,
            values = optimizers[3:5],
        ),
        range(Float64, deeply_nested_lambda, lower = 1, upper = 10, scale = :log),
    ],
    measure = multitarget_l2,
)
comparisons = vcat(
    comparisons,
    compare(
        [
            "neural network stopped early (untuned)" => neural_network_stopped_early,
            "neural network stopped early" => tuned_neural_network_stopped_early,
        ],
        inputs,
        outputs,
    ),
)
sort!(comparisons, :midpoint)
println(comparisons[:, 1:4]);
10×4 DataFrame
 Row │ name                               loss_lower  loss_upper  midpoint
     │ String                             Float64     Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────
   1 │ ridge                               0.0571683   0.0745036  0.065836
   2 │ LASSO                               0.057764    0.0748687  0.0663164
   3 │ neural network                      0.0591544   0.0771833  0.0681689
   4 │ constant predictor                  0.059257    0.0771778  0.0682174
   5 │ neural network stopped early        0.0597053   0.0790145  0.0693599
   6 │ neural network stopped early (un…   0.0603822   0.0792296  0.0698059
   7 │ LASSO (untuned)                     0.066826    0.0805361  0.0736811
   8 │ ridge (untuned)                     0.06699     0.080693   0.0738415
   9 │ linear regression                   0.0660207   0.0822911  0.0741559
  10 │ neural network (untuned)            0.0729304   0.0912526  0.0820915

We see that adding early stopping and learning rate shrinkage improves the untuned neural network's performance considerably. The model comparisons are plotted in Figure 9.

Figure 9: Model comparisons
using CairoMakie
model_names = comparisons.name
midpoints = comparisons.midpoint
radii = (comparisons.loss_upper .- comparisons.loss_lower) ./ 2 # interval half-widths
nmodels = length(model_names)

fig = Figure()
axis = Axis(
    fig[1, 1],
    xticks = (1:nmodels, model_names),
    xlabel = "Model",
    ylabel = "L2 loss estimate",
    xticklabelrotation = pi / 2,
    xticklabelalign = (:right, :top),
    xticklabelpad = 10,
)

# Plot the center points (optional, but good practice)
scatter!(
    axis,
    1:nmodels,
    midpoints;
    marker = :circle,
    markersize = 15,
    color = :blue,
    label = "midpoints",
)

# Plot the error bars
errorbars!(
    axis,
    1:nmodels,
    midpoints,
    radii,
    color = :black,
    linewidth = 2,
    whiskerwidth = 10, # Control the width of the small horizontal lines at the ends
)
display(fig)
save(joinpath(LOCAL, "images", "model_comparisons.svg"), fig)

4.6 Performance comparisons

No model considered significantly outperforms the constant predictor, so it seems that our covariates are poor predictors of the PK parameters, as previously discussed. They may have some predictive power, but a larger study is likely needed to establish this possibility.

For the record, here are the details of the “best” model found in our analysis, the optimized ridge regressor (implemented under the hood as a linear neural network with an L2 penalty):

comparisons[1, :model].transformed_target_model_deterministic.model
MultitargetNeuralNetworkRegressor(
  builder = Linear(
        σ = identity), 
  optimiser = Adam(eta=0.001, beta=(0.9, 0.999), epsilon=1.0e-8), 
  loss = Flux.Losses.mse, 
  epochs = 50, 
  batch_size = 1, 
  lambda = 59.948425031894104, 
  alpha = 0.0, 
  rng = 378, 
  optimiser_changes_trigger_retraining = true, 
  acceleration = CPU1{Nothing}(nothing), 
  embedding_dims = Dict{Symbol, Real}())

We see that lambda=59.9, which is a significant amount of regularization.
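To recall what a large lambda does, here is a toy closed-form ridge regression, on made-up data and unrelated to the tutorial's MLJ model, showing the coefficient vector shrinking toward zero as the penalty grows:

```julia
using LinearAlgebra

# Closed-form ridge solution on toy data: β(λ) = (XᵀX + λI)⁻¹ Xᵀy.
X = [1.0 0.5; 0.8 1.2; 1.5 0.9; 0.3 1.1]
y = [1.0, 2.0, 1.5, 1.2]
ridge_coefs(λ) = (X'X + λ * I) \ (X'y)

# A heavier penalty shrinks the coefficient vector toward zero:
norm(ridge_coefs(60.0)) < norm(ridge_coefs(0.01))  # true
```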

Generally, factors other than generalization performance, such as computational cost and ease of interpretation, can also inform the choice of model.

5 Shapley values

One of the main drawbacks of neural networks is their “black-box nature”: we can’t reason at a high level of abstraction about what they are doing to the inputs, or how and why they returned certain outputs. This means they lack interpretability (Molnar 2025). To address this, the field of explainable AI (XAI) was developed to “[…] augment the training process, the learned representations, and the decisions with human-interpretable explanations” (Samek et al. 2021), especially for neural networks.

Among explainable AI methods, Shapley values (Shapley 1951) are a common way to interpret ML models. They were originally proposed in game theory to quantify the contribution of each player in a cooperative game. In the context of neural networks, the players are the input features, and the performance in the “game” is the model’s output. In the following code, ShapML.jl is used to calculate the contribution of each feature, for each subject, to the EBE of the clearance random effect (MLJ.predict(modelMach, x)[1]).
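To make the definition concrete, here is a toy brute-force Shapley computation for a known 3-feature linear model. This has nothing to do with ShapML's sampling approximation; features absent from a coalition are replaced by a baseline value, which encodes the usual independence assumption:

```julia
f(x) = 2x[1] + x[2] - 0.5x[3]       # a known linear "model" to explain
baseline = [0.0, 0.0, 0.0]          # values used for features absent from a coalition
x = [1.0, 2.0, 3.0]                 # the instance being explained
p = 3                               # number of features

phi = zeros(p)
for i in 1:p, mask in 0:(2^p - 1)
    (mask >> (i - 1)) & 1 == 0 || continue           # coalitions S not containing i
    S = [j for j in 1:p if (mask >> (j - 1)) & 1 == 1]
    w = factorial(length(S)) * factorial(p - length(S) - 1) / factorial(p)
    with_i    = [j == i || j in S ? x[j] : baseline[j] for j in 1:p]
    without_i = [j in S ? x[j] : baseline[j] for j in 1:p]
    phi[i] += w * (f(with_i) - f(without_i))         # weighted marginal contribution
end

phi                             # [2.0, 2.0, -1.5] for this linear model and baseline
sum(phi) ≈ f(x) - f(baseline)   # Shapley values add up to the prediction difference
```

For a linear model with this baseline, each Shapley value reduces to coefficient × feature value, which is what the brute force recovers. The exact computation ranges over all 2^p coalitions, which is exactly why ShapML resorts to sampling.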

# Train the optimized LASSO, ridge and neural network models on all data:
named_models = []
for (name, model) in zip(comparisons.name, comparisons.model)
    name in ["LASSO", "ridge", "neural network stopped early"] || continue
    push!(named_models, name => model)
end
named_machines = map(named_models) do (name, model)
    mach = machine(model, inputs, outputs)
    return name => fit!(mach, verbosity = 0)
end

predict_function(mach, data) = DataFrame(; prediction = MLJBase.predict(mach, data).cl)
shapley_given_model_name =
    map(named_machines) do (name, mach)
        name => ShapML.shap(;
            sample_size = 300,
            explain = inputs,
            model = mach,
            predict_function,
            seed = SEED,
        )
    end |> LittleDict
LittleDict{String, DataFrame, Vector{String}, Vector{DataFrame}} with 3 entries:
  "ridge"                        => 1360×6 DataFrame…
  "LASSO"                        => 1360×6 DataFrame…
  "neural network stopped early" => 1360×6 DataFrame

In the ShapML.shap() call above, sample_size is set to 300 so this step doesn’t take too long when running the tutorial. In practice, stochastic estimation such as the one performed here requires more samples (usually in the thousands) to achieve reasonable accuracy. The images that follow used sample_size = 10_000.
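The accuracy gain from a larger sample_size follows the usual Monte-Carlo rate: the standard error of a stochastic estimate shrinks like 1/√n. A quick generic illustration, unrelated to ShapML's internals:

```julia
using Statistics, Random

# Estimate a zero mean from n draws, many times over, and measure the spread
# of the estimator for two sample sizes.
rng = MersenneTwister(0)
spread(n) = std([mean(randn(rng, n)) for _ in 1:200])

s300, s10k = spread(300), spread(10_000)
s300 / s10k   # ≈ sqrt(10_000 / 300) ≈ 5.8
```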

Here’s an extract of some output of the shap function:

shapley = shapley_given_model_name["LASSO"]
show(shapley[[1, 2, 3, 1359, 1260], :]; allcols = true)
5×6 DataFrame
 Row │ index  feature_name  feature_value  shap_effect   shap_effect_sd  intercept
     │ Int64  String        Float64?       Float64       Float64         Float64
─────┼──────────────────────────────────────────────────────────────────────────────
   1 │     1  Smoke                   0.0  -4.23397e-7       6.58263e-7  -0.0142308
   2 │     2  Smoke                   0.0  -4.23547e-7       6.58497e-7  -0.0142308
   3 │     3  Smoke                   1.0   1.01968e-6       6.58058e-7  -0.0142308
   4 │   135  Weight                 91.0   0.000173936      0.00025312  -0.0142308
   5 │    36  Weight                 73.0  -0.000123297      0.00025312  -0.0142308

Figure 10, Figure 11 and Figure 12 show boxplots of the Shapley values per feature for LASSO, ridge and the neural network with early stopping, respectively. Each blue box spans the interquartile range (IQR), with a midline marking the median. The whiskers span 1.5 × IQR, and the red horizontal line indicates y = 0.

Note that covariates with only a few distinct values in the training data have unusual median values. This is because the Shapley values cluster around a few modes. For this reason, we work with the mean rather than the median in the next section.
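For reference, the boxplot elements can be computed directly. Here is a hypothetical sample of Shapley values together with the quantities each plot element encodes:

```julia
using Statistics

v = [-0.02, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.05]  # hypothetical Shapley values
q1, med, q3 = quantile(v, [0.25, 0.5, 0.75])
iqr = q3 - q1                              # height of the blue box
whisker_low = q1 - 1.5iqr                  # lower whisker extent
whisker_high = q3 + 1.5iqr                 # upper whisker extent
```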

Figure 10: Box plot of Shapley values for LASSO.
Figure 11: Box plot of Shapley values for ridge regression.
Figure 12: Box plot of Shapley values for neural network with early stopping.
using CairoMakie
for (name, shapley) in shapley_given_model_name
    fig = Figure(; size = (1200, 700))
    ax = Axis(
        fig[1, 1];
        title = string(name),
        ytickformat = "{:.2e}",
        xticks = (1:10, names(inputs)),
        ylabel = "Shapley values",
    )
    hlines!(ax, 0; color = :red, linewidth = 2)
    boxplot!(
        ax,
        repeat(1:length(names(inputs)); inner = length(population)),
        shapley.shap_effect,
        show_outliers = false,
    )
    display(fig)
    sleep(1)
    snakecase_name = replace(name, ' ' => '_')
    save(joinpath(LOCAL, "images", "shapley_$snakecase_name.svg"), fig)
end

A simple interpretation is that the models treat features with positive mean Shapley values as increasing the quinidine clearance rate, and features with negative mean Shapley values as decreasing it. Although popular, Shapley values have drawbacks (Budhkar et al. 2025). Their exact computation is combinatorial in the number of features, so it can easily become extremely costly. To avoid this in practice, approximations are made, which frequently rest on strong assumptions such as feature independence and linearity. Alternative interpretable machine learning techniques, such as the ones presented in (Budhkar et al. 2025), should also be considered.

6 Feature importance

In ML and data analysis, it’s very common to ask which variables are the best predictors of an outcome. As discussed in the previous section, Shapley values can help answer that. To compare how each of these models prioritized the inputs, the next code block first calculates the mean of the Shapley values across subjects for each model. This information is then organized into a DataFrame.

input_names = names(inputs) # Covariate names

# Mean Shapley values across subjects
mean_shapley_given_model_name = LittleDict()
for (name, shapley) in shapley_given_model_name # Results from each model
    subdfs = collect(groupby(shapley, :feature_name; sort = false))
    mean_shapley_given_model_name[name] = getproperty.(subdfs, :shap_effect) .|> mean
end
mean_shapley_summary =
    hcat(DataFrame(Covariate = input_names), DataFrame(mean_shapley_given_model_name));
println(mean_shapley_summary);
10×4 DataFrame
 Row │ Covariate   ridge         LASSO         neural network stopped early
     │ String      Float64       Float64       Float64
─────┼──────────────────────────────────────────────────────────────────────
   1 │ Smoke        5.37994e-5    5.41508e-8                   -0.000143096
   2 │ Heart        0.000632293   0.000604226                   0.000395539
   3 │ Creatinine  -0.000574477  -0.000912725                  -0.000274037
   4 │ Age         -3.60302e-5   -1.42249e-5                    0.000244547
   5 │ Race__1     -0.000205596  -9.87052e-5                    6.9491e-5
   6 │ Race__2     -3.20377e-6   -2.55908e-6                    0.000308162
   7 │ Height      -3.5353e-5    -1.04663e-5                   -2.65821e-5
   8 │ glyco        0.000804194   0.00104528                    0.000130249
   9 │ Ethanol      0.000110647   2.17291e-5                   -0.000123954
  10 │ Weight      -0.000414675  -1.46269e-5                   -0.000269624

These values seem exceedingly small. Shapley values are computed on the same scale as the target, in this case the clearance rate exponent, which for comparison has these statistics:

@info "cl statistics" mean(outputs.cl) std(outputs.cl)
Info: cl statistics
  mean(outputs.cl) = -0.014575282544285851
  std(outputs.cl) = 0.18909894480413864

We conclude that, in this case, rankings based on Shapley values likely have little meaning.

Proceeding for didactic purposes, we simplify the comparison by using mean_shapley_summary to rank the covariates by the absolute value of their Shapley means. We do this for each of the three models and then print a DataFrame with the covariate ranking for each model. A rank of 1 indicates the greatest absolute Shapley mean. Figure 13 plots the rankings for each model.

# Absolute impact of covariates per model
covariate_rankings = map(2:4) do model_index
    # Temporary dataframe with covariate names and Shapley means for current model
    model_summary = deepcopy(mean_shapley_summary[:, [1, model_index]])
    model_summary[:, 2] = abs.(model_summary[:, 2]) # Replace column with absolute values
    # Sort from highest to lowest absolute mean shapley
    sort!(model_summary, 2; rev = true)
    # After sorting, each covariate's row position is its rank
    return [findfirst(==(name), model_summary.Covariate) for name in input_names]
end
model_names = keys(mean_shapley_given_model_name)
for (model, rankings) in zip(model_names, covariate_rankings)
    println("$model ranking:\n", DataFrame(Covariate = input_names, Rank = rankings), "\n")
end
ridge ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke           7
   2 │ Heart           2
   3 │ Creatinine      3
   4 │ Age             8
   5 │ Race__1         5
   6 │ Race__2        10
   7 │ Height          9
   8 │ glyco           1
   9 │ Ethanol         6
  10 │ Weight          4

LASSO ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke          10
   2 │ Heart           3
   3 │ Creatinine      2
   4 │ Age             7
   5 │ Race__1         4
   6 │ Race__2         9
   7 │ Height          8
   8 │ glyco           1
   9 │ Ethanol         5
  10 │ Weight          6

neural network stopped early ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke           6
   2 │ Heart           1
   3 │ Creatinine      3
   4 │ Age             5
   5 │ Race__1         9
   6 │ Race__2         2
   7 │ Height         10
   8 │ glyco           7
   9 │ Ethanol         8
  10 │ Weight          4

Figure 13: Rankings of covariate absolute impact per model. A rank of 1 (top) indicates the greatest impact on the quinidine clearance random effect.
using CairoMakie
fig = Figure(; size = (850, 700), fontsize = 20)
for (fig_column, model) in enumerate(model_names)
    ax = Axis(
        fig[1, fig_column];
        title = string(model),
        titlegap = 32,
        yreversed = true,
        limits = (-1, 1, 0.5, 10.5),
    )
    hidedecorations!(ax)
    hidespines!(ax)
    for (name, rank) in zip(input_names, covariate_rankings[fig_column])
        text!(ax, 0, rank; text = name, align = (:center, :center))
    end
end
ax = Axis(
    fig[1, 1:3];
    xticks = 0:10,
    limits = (0, 10, 0.5, 10.5),
    yticks = 1:10,
    yreversed = true,
)
hidespines!(ax);
hidedecorations!(ax);
for i = 1:2
    xs = i == 1 ? [2.2, 4.4] : [5.6, 7.8] # x-positions of the two ranking columns
    for (iter, (rank_left, rank_right)) in
        enumerate(zip(covariate_rankings[i], covariate_rankings[i+1]))
        lines!(
            ax,
            xs,
            [rank_left, rank_right];
            linewidth = 3,
            color = iter,
            colormap = :tab10,
            colorrange = (1, 10),
        )
    end
end
display(fig)
save(joinpath(LOCAL, "images", "rankModels.svg"), fig)

7 Experimenting with synthetic data

Given the disappointing results above, we repeat our analyses with synthetic data. This data set is generated using the same patient covariates and the same dosing profiles as the clinical trial. However, we obtain the corresponding PK observations by running a simulation, using a modified version of the NLME model we fitted to the clinical data above, in which CL, Vc and Ka depend on the covariates in a way prescribed by us.

7.1 The population

Since our covariate dependence model will rely on the covariates being continuous quantities, to prepare a Pumas Population for simulation we need to include the pre-processed versions of the covariates, which we have called inputs above. It will be convenient to construct the Population from the dataframe df in the same way we did before, but the covariates in df are unprocessed. We remedy this by adding a new column to df, called x, that stores the pre-processed covariate vectors. The procedure is complicated by the fact that df has multiple rows for each subject, but we can use a left-join to handle this:

inputs_for_join = deepcopy(inputs) # temporary dataframe
inputs_for_join.x = collect.(collect.(eachrow(inputs)))
inputs_for_join.Subject = unique(df.Subject)
df = leftjoin(df, inputs_for_join[:, [:Subject, :x]], on = :Subject)
show(first(select(df, [:Subject, :x]), 3); allcols = true)
3×2 DataFrame
 Row │ Subject  x
     │ Int64    Array…?
─────┼────────────────────────────────────────────
   1 │       1  [0.0, 2.0, 0.0, 60.0, 1.0, 0.0, …
   2 │       1  [0.0, 2.0, 0.0, 60.0, 1.0, 0.0, …
   3 │       1  [0.0, 2.0, 0.0, 60.0, 1.0, 0.0, …
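Under the hood, the join simply broadcasts one vector per subject across that subject's rows, which can be mimicked in plain Julia (toy data here, not the trial's):

```julia
# Each subject has a single covariate vector but several rows in the long table.
subject_col = [1, 1, 1, 2, 2, 3]                     # hypothetical long-format IDs
x_per_subject = Dict(1 => [0.0, 60.0], 2 => [1.0, 70.0], 3 => [0.0, 55.0])

x_col = [x_per_subject[s] for s in subject_col]      # one copy of x per row
x_col[1] === x_col[2]                                # subject 1's rows share the vector
```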

And here is the modified version of the original population:

population = read_pumas(
    df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    covariates = [:x],
    observations = [:conc],
)
Population
  Subjects: 136
  Covariates: x
  Observations: conc

7.2 The simulation

On a log scale, each PK parameter in the simulation model will be the sum of the following contributions:

  • A population-wide constant (e.g., tvcl)

  • A subject-dependent random effect

  • A linear combination of the normalized covariates + small non-linear distortion
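Before looking at the full model, here is a numeric sketch of this construction for CL alone, with made-up values (distort as defined further below):

```julia
using LinearAlgebra

# One PK parameter assembled on the log scale: population constant
# + random effect + (distorted) linear combination of whitened covariates.
nonlinearity = 0.1
distort(t) = t + nonlinearity * sin(2π / (1 + exp(7t)))

tvcl = 9.56                    # population-wide typical value (made up here)
η1 = 0.05                      # one subject's random effect
βcl = [2.0, 1.0, -0.5]         # linear coefficients for three covariates
z = [0.3, -1.2, 0.8]           # that subject's whitened covariates

CL = tvcl * exp(η1 + distort(dot(βcl, z)))
log(CL) ≈ log(tvcl) + η1 + distort(dot(βcl, z))   # the sum-of-contributions view
```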

The vectors of linear coefficients are called βcl, βvc, and βka in the simulation below. Normalization of covariates will be based on all observations. We start by defining a function, whiten(input), for computing the normalized representations of covariate inputs:

input_stats = describe(inputs, :mean, :std)
inputs_mean = input_stats.mean
inputs_std = input_stats.std
whiten(input) = (input - inputs_mean) ./ inputs_std
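As a quick sanity check of what whiten does, here is a self-contained version on a tiny made-up matrix, confirming that whitened columns have mean 0 and standard deviation 1:

```julia
using Statistics

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]         # three subjects, two covariates (made up)
μ = vec(mean(X; dims = 1))
s = vec(std(X; dims = 1))
whiten_demo(x) = (x .- μ) ./ s

W = reduce(hcat, [whiten_demo(row) for row in eachrow(X)])'  # whitened rows
mean(W; dims = 1), std(W; dims = 1)        # columns: mean ≈ 0, std ≈ 1
```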

For the non-linear distortion, we’ll use the distort function below, which is plotted in Figure 14.

nonlinearity = 0.1
distort(t) = t + nonlinearity * sin(2π / (1 + exp(7t)))
distort (generic function with 1 method)
Figure 14: The non-linear distortion used in the simulation model
using CairoMakie

xvals = -1:0.01:1
begin
    fig = Figure()
    axis = Axis(fig[1, 1], limits = ((-1, 1), (-1, 1)), xlabel = "t", ylabel = "distort(t)")
    lines!(axis, xvals, distort.(xvals))
end
display(fig)
save(joinpath(LOCAL, "images", "distort.svg"), fig)

Here then is the simulation model:

ncovariates = size(inputs)[2]
simulation_model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        βcl ∈ VectorDomain(ncovariates)
        βvc ∈ VectorDomain(ncovariates)
        βka ∈ VectorDomain(ncovariates)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @covariates x
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1] + distort(βcl' * whiten(x)))
        Vc = tvvc * exp(η[2] + distort(βvc' * whiten(x)))
        Ka = tvka * exp(η[3] + distort(βka' * whiten(x)))
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end

To finish the specification, we need to choose values for the population-wide parameters appearing under @param. The first three and the last two we will grab from the Pumas NLME model fitted to the clinical data, except that we scale down the variance parameters to reduce noise. The β parameters are new, and we choose them so that CL, Vc and Ka depend only on the first three covariates: Smoke, Heart, and Creatinine.

denoising = 5.0
old_params = coef(fitted_pumas_model)
denoised_params = (
    tvcl = old_params.tvcl,
    tvvc = old_params.tvvc,
    tvka = old_params.tvka,
    Ω = old_params.Ω / denoising,
    σ = old_params.σ / denoising,
)
simulation_params = merge(
    denoised_params,
    (
        βcl = [2, 1, -0.5, 0, 0, 0, 0, 0, 0, 0],
        βvc = [-0.5, 2, 1, 0, 0, 0, 0, 0, 0, 0],
        βka = [1, -0.5, 2, 0, 0, 0, 0, 0, 0, 0],
    ),
)
for key in keys(simulation_params)
    println("$key=$(getproperty(simulation_params, key))")
end
tvcl=9.555618782983991
tvvc=35.46672485436152
tvka=0.42799480784871746
Ω=[0.019347984326835065 -0.019931007248337726 -0.03418931096398294; -0.019931007248337726 0.03691111069450203 0.02894089544618443; -0.03418931096398294 0.02894089544618443 0.0652415561507599]
σ=0.22378102295690297
βcl=[2.0, 1.0, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
βvc=[-0.5, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
βka=[1.0, -0.5, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

And now we simulate:

simulated_population =
    Subject.(simobs(simulation_model, population, simulation_params, rng = Xoshiro(SEED)))

Next we repeat our previous analysis using the synthetic data in place of the clinical data, pretending that we do not know the underlying dependence of PK parameters on the covariates.

7.3 Fitting a covariate-free NLME model to the simulated data

We refit our covariate-free NLME model, pumas_model, to the new data (using the same parameters for initialization as before):

fitted_pumas_model = fit(
    pumas_model,
    simulated_population,
    params,
    FOCE();
    optim_options = (show_trace = false,),
)

And extract the empirical Bayes estimates of the PK random effects, which constitute our new outputs:

ebe = only.(empirical_bayes(fitted_pumas_model, simulated_population));
ebe_matrix = permutedims(reduce(hcat, ebe)) # Reshape into 136x3 matrix
outputs = DataFrame(ebe_matrix, [:cl, :vc, :ka]) # Wrap in DataFrame
describe(outputs)[:, Not(7)] |> println;
3×6 DataFrame
 Row │ variable  mean       min       median     max      nmissing
     │ Symbol    Float64    Float64   Float64    Float64  Int64
─────┼─────────────────────────────────────────────────────────────
   1 │ cl        -0.116974  -2.05461   0.181969  1.74373         0
   2 │ vc        -0.128222  -5.42844  -0.116997  4.94512         0
   3 │ ka         0.157699  -3.79285   0.626246  3.31414         0

7.4 Learning curves for LASSO

Here’s our out-of-sample epochs learning curve for the LASSO model:

_, _, epochs, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_epochs, lower = 1, upper = 100),
    measure = multitarget_l2,
    verbosity = 0,
);

Learning curve for epochs
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `epochs` in LASSO regression (synthetic data)",
        xlabel = "epochs",
        ylabel = "l2 loss",
    )
    scatter!(axis, epochs, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "epochs_learning_curve_synthetic.svg"), fig)

Here’s the in-sample epochs learning curve:

_, _, epochs, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_epochs, lower = 1, upper = 100),
    measure = multitarget_l2,
    resampling = InSample(),
    verbosity = 0,
);

Learning curve for epochs (in-sample)
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `epochs` in LASSO regression (in-sample loss, synthetic data)",
        xlabel = "epochs",
        ylabel = "l2 loss",
    )
    scatter!(axis, epochs, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "epochs_learning_curve_in_sample_synthetic.svg"), fig)

As in the case of real data, it looks like 70 iterations is sufficient (for the default learning rate). Turning to the regularization parameter lambda:

nested_lambda = :(transformed_target_model_deterministic.model.lambda)
_, _, lambdas, losses = learning_curve(
    lasso,
    inputs,
    outputs;
    range = range(lasso, nested_lambda, lower = 0.01, upper = 10, scale = :log10),
    measure = multitarget_l2,
);

Learning curve for lambda
using CairoMakie
begin
    fig = Figure()
    axis = Axis(
        fig[1, 1],
        title = "Varying `lambda` in LASSO (synthetic data)",
        xlabel = "lambda",
        xscale = log10,
        ylabel = "l2 loss",
    )
    scatter!(axis, lambdas, losses)
end
display(fig)
save(joinpath(LOCAL, "images", "lambda_learning_curve_synthetic.svg"), fig)

We’ll assume that we aren’t going to need lambda > 10 or lambda < 0.01.

7.5 The tuned models

Since we will want to modify the lambda range for optimization, we need to redefine our tuning wrapper tuned(model), and the special case tuned_neural_network_stopped_early. The only difference between the following code blocks and the ones used for the clinical data is the change to the lambda range.

optimizers = [Adam(10^i) for i = -6.0:0.0] # same as before
tuned(model) = TunedModel(
    model,
    resampling = CV(),
    tuning = Grid(),
    range = [
        range(model, nested_optimiser, values = optimizers),
        range(model, nested_lambda, lower = 0.01, upper = 10, scale = :log),
    ],
    measure = multitarget_l2,
)

tuned_neural_network_stopped_early = TunedModel(
    neural_network_stopped_early,
    resampling = CV(),
    tuning = Grid(goal = 12),
    range = [
        range(
            neural_network_stopped_early,
            deeply_nested_optimiser,
            values = optimizers[3:5],
        ),
        range(Float64, deeply_nested_lambda, lower = 0.01, upper = 10, scale = :log),
    ],
    measure = multitarget_l2,
)
DeterministicTunedModel(
  model = DeterministicIteratedModel(
        model = DeterministicPipeline(standardizer = Standardizer(features = Symbol[], …), …), 
        controls = Any[Step(1), NumberLimit(150), Patience(5), WithModelDo{typeof(shrink_learning_rate!)}(shrink_learning_rate!, false, nothing), InvalidValue(), WithLossDo{var"#50#51"}(var"#50#51"(), false, nothing)], 
        resampling = Holdout(fraction_train = 0.7, …), 
        measure = MultitargetLPLoss(p = 2), 
        weights = nothing, 
        class_weights = nothing, 
        operation = nothing, 
        retrain = false, 
        check_measure = true, 
        iteration_parameter = nothing, 
        cache = true), 
  tuning = Grid(
        goal = 12, 
        resolution = 10, 
        shuffle = true, 
        rng = TaskLocalRNG()), 
  resampling = CV(
        nfolds = 6, 
        shuffle = false, 
        rng = TaskLocalRNG()), 
  measure = MultitargetLPLoss(p = 2), 
  weights = nothing, 
  class_weights = nothing, 
  operation = nothing, 
  range = ParamRange[NominalRange(model.transformed_target_model_deterministic.model.optimiser = Optimisers.Adam(eta=0.0001, beta=(0.9, 0.999), epsilon=1.0e-8), Optimisers.Adam(eta=0.001, beta=(0.9, 0.999), epsilon=1.0e-8), Optimisers.Adam(eta=0.010000000000000002, beta=(0.9, 0.999), epsilon=1.0e-8)), NumericRange(0.01 ≤ model.transformed_target_model_deterministic.model.lambda ≤ 10.0; origin=5.005, unit=4.995; on log scale)], 
  selection_heuristic = NaiveSelection(nothing), 
  train_best = true, 
  repeats = 1, 
  n = nothing, 
  acceleration = CPU1{Nothing}(nothing), 
  acceleration_resampling = CPU1{Nothing}(nothing), 
  check_measure = true, 
  cache = true, 
  compact_history = true, 
  logger = nothing)

7.6 Model comparisons

We are now ready to compare the performance of our models on the synthetic data:

comparisons = compare(
    [
        "constant predictor" => DeterministicConstantRegressor(),
        "linear regression" => linear_regressor,
        "LASSO (untuned)" => lasso,
        "LASSO" => tuned(lasso),
        "ridge (untuned)" => ridge,
        "ridge" => tuned(ridge),
        "neural network (untuned)" => neural_network,
        "neural network" => tuned(neural_network),
        "neural network stopped early (untuned)" => neural_network_stopped_early,
        "neural network stopped early" => tuned_neural_network_stopped_early,
    ],
    inputs,
    outputs,
)
println(comparisons[:, 1:4]);
10×4 DataFrame
 Row │ name                               loss_lower  loss_upper  midpoint
     │ String                             Float64     Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────
   1 │ neural network                       0.875321     1.28598   1.08065
   2 │ linear regression                    1.26287      1.62451   1.44369
   3 │ neural network stopped early         1.01787      1.93441   1.47614
   4 │ LASSO                                1.3946       1.69573   1.54517
   5 │ ridge                                1.4871       1.89756   1.69233
   6 │ LASSO (untuned)                      1.49964      1.94544   1.72254
   7 │ ridge (untuned)                      1.50592      1.95493   1.73043
   8 │ neural network (untuned)             1.48402      2.08964   1.78683
   9 │ neural network stopped early (un…    2.84025      3.49328   3.16677
  10 │ constant predictor                   3.43954      3.78193   3.61074

These results are plotted in Figure 15.

Figure 15: Model comparisons
using CairoMakie
model_names = comparisons.name
midpoints = comparisons.midpoint
radii = (comparisons.loss_upper .- comparisons.loss_lower) ./ 2 # interval half-widths
nmodels = length(model_names)

fig = Figure()
axis = Axis(
    fig[1, 1],
    xticks = (1:nmodels, model_names),
    xlabel = "Model",
    ylabel = "L2 loss estimate",
    xticklabelrotation = pi / 2,
    xticklabelalign = (:right, :top),
    xticklabelpad = 10,
)

# Plot the center points (optional, but good practice)
scatter!(
    axis,
    1:nmodels,
    midpoints;
    marker = :circle,
    markersize = 15,
    color = :blue,
    label = "midpoints",
)

# Plot the error bars
errorbars!(
    axis,
    1:nmodels,
    midpoints,
    radii,
    color = :black,
    linewidth = 2,
    whiskerwidth = 10, # Control the width of the small horizontal lines at the ends
)
display(fig)
save(joinpath(LOCAL, "images", "model_comparisons_synthetic.svg"), fig)

7.7 Shapley values

We proceed as before to compute Shapley values, but this time consider the neural network without early stopping, as it is now the best performer:

named_models = []
for (name, model) in zip(comparisons.name, comparisons.model)
    name in ["LASSO", "ridge", "neural network"] || continue
    push!(named_models, name => model)
end
named_machines = map(named_models) do (name, model)
    mach = machine(model, inputs, outputs)
    return name => fit!(mach, verbosity = 0)
end
predict_function(mach, data) = DataFrame(; prediction = MLJBase.predict(mach, data).cl)
shapley_given_model_name =
    map(named_machines) do (name, mach)
        name => ShapML.shap(;
            sample_size = 300,
            explain = inputs,
            model = mach,
            predict_function,
            seed = SEED,
        )
    end |> LittleDict
LittleDict{String, DataFrame, Vector{String}, Vector{DataFrame}} with 3 entries:
  "neural network" => 1360×6 DataFrame…
  "LASSO"          => 1360×6 DataFrame…
  "ridge"          => 1360×6 DataFrame

Plotting the results:

Box plot of Shapley values for LASSO.

Box plot of Shapley values for ridge regression.

Box plot of Shapley values for neural network
using CairoMakie
for (name, shapley) in shapley_given_model_name
    fig = Figure(; size = (1200, 700))
    axis = Axis(
        fig[1, 1];
        title = string(name),
        ytickformat = "{:.2e}",
        xticks = (1:10, names(inputs)),
        ylabel = "Shapley values",
    )
    hlines!(axis, 0; color = :red, linewidth = 2)
    boxplot!(
        axis,
        repeat(1:10; inner = length(population)),
        shapley.shap_effect,
        show_outliers = true,
    )

    display(fig)
    sleep(1)
    snakecase_name = replace(name, ' ' => '_')
    save(joinpath(LOCAL, "images", "shapley_$(snakecase_name)_synthetic.svg"), fig)
end

7.8 Feature importances

Finally, we derive Shapley-based feature importances and corresponding rankings. First, we compute the mean values:

input_names = names(inputs) # Covariate names

# Mean Shapley values across subjects
mean_shapley_given_model_name = LittleDict()
for (name, shapley) in shapley_given_model_name # Results from each model
    subdfs = collect(groupby(shapley, :feature_name; sort = false))
    mean_shapley_given_model_name[name] = getproperty.(subdfs, :shap_effect) .|> mean
end
mean_shapley_summary =
    hcat(DataFrame(Covariate = input_names), DataFrame(mean_shapley_given_model_name));
println(mean_shapley_summary);
10×4 DataFrame
 Row │ Covariate   neural network  LASSO         ridge
     │ String      Float64         Float64       Float64
─────┼────────────────────────────────────────────────────────
   1 │ Smoke          0.0104829     0.0230679     0.0236146
   2 │ Heart         -0.0246605    -0.0247285    -0.0251158
   3 │ Creatinine    -0.00391598   -0.00517912   -0.00612581
   4 │ Age           -0.000655213   0.000916393   0.0020135
   5 │ Race__1       -0.000503561   0.00013291    9.83446e-5
   6 │ Race__2        0.00247408    3.45005e-5    5.61629e-5
   7 │ Height         0.00222356    0.000272441   0.000396001
   8 │ glyco          0.0025613     0.000468834   0.000991782
   9 │ Ethanol       -0.00135861   -0.000171509  -0.000961769
  10 │ Weight         0.00604845    0.0003971     0.00136299
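As an aside, the `groupby`/broadcast loop above can be written more compactly with DataFrames' split-apply-combine idiom via `combine`. Here is a minimal, self-contained sketch, with toy data standing in for one model's long-format Shapley table:

```julia
using DataFrames, Statistics

# Toy stand-in for one model's ShapML output
# (long format: one row per subject-feature pair)
shapley = DataFrame(
    feature_name = ["Smoke", "Heart", "Smoke", "Heart"],
    shap_effect = [0.02, -0.03, 0.04, -0.01],
)

# Mean Shapley value per feature, preserving first-appearance order
means = combine(
    groupby(shapley, :feature_name; sort = false),
    :shap_effect => mean => :mean_shap,
)
```

With `sort = false`, groups keep the order in which each feature first appears, matching the loop-based computation above.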

These values still seem exceedingly small, but note that they closely match the linear coefficients prescribed in our simulation model, βcl = [2, 1, -0.5, 0, 0, 0, 0, 0, 0, 0], in both sign and relative magnitude, across all three models.
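As a quick sanity check, the three covariates carrying nonzero simulated coefficients (Smoke, Heart and Creatinine, the first three positions) should occupy the top absolute ranks. A minimal sketch using the LASSO column of the table above, with the covariates in the same order:

```julia
# LASSO mean Shapley values, in the table's covariate order:
# Smoke, Heart, Creatinine, Age, Race__1, Race__2, Height, glyco, Ethanol, Weight
lasso_means = [0.0230679, -0.0247285, -0.00517912, 0.000916393, 0.00013291,
    3.45005e-5, 0.000272441, 0.000468834, -0.000171509, 0.0003971]

# Indices of the three largest absolute mean Shapley values
top3 = sortperm(abs.(lasso_means); rev = true)[1:3]

# These should be positions 1-3, i.e. the covariates with
# nonzero coefficients in the simulation model
sort(top3) == [1, 2, 3]
```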

Turning to feature rankings:

covariate_rankings = map(2:4) do model_index
    # Temporary dataframe with covariate names and Shapley means for current model
    model_summary = deepcopy(mean_shapley_summary[:, [1, model_index]])
    model_summary[:, 2] = abs.(model_summary[:, 2]) # Take absolute values
    # Sort from highest to lowest absolute mean Shapley value
    sort!(model_summary, 2; rev = true)
    # After sorting, each covariate's row index is its rank
    return [findfirst(==(name), model_summary.Covariate) for name in input_names]
end
model_names = keys(mean_shapley_given_model_name)
for (model, rankings) in zip(model_names, covariate_rankings)
    println("$model ranking:\n", DataFrame(Covariate = input_names, Rank = rankings), "\n")
end
neural network ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke           2
   2 │ Heart           1
   3 │ Creatinine      4
   4 │ Age             9
   5 │ Race__1        10
   6 │ Race__2         6
   7 │ Height          7
   8 │ glyco           5
   9 │ Ethanol         8
  10 │ Weight          3

LASSO ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke           2
   2 │ Heart           1
   3 │ Creatinine      3
   4 │ Age             4
   5 │ Race__1         9
   6 │ Race__2        10
   7 │ Height          7
   8 │ glyco           5
   9 │ Ethanol         8
  10 │ Weight          6

ridge ranking:
10×2 DataFrame
 Row │ Covariate   Rank
     │ String      Int64
─────┼───────────────────
   1 │ Smoke           2
   2 │ Heart           1
   3 │ Creatinine      3
   4 │ Age             4
   5 │ Race__1         9
   6 │ Race__2        10
   7 │ Height          8
   8 │ glyco           6
   9 │ Ethanol         7
  10 │ Weight          5
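Note that the rankings above can also be computed without the temporary sorted data frame, using `sortperm` and `invperm`. A minimal sketch using the neural network column of the mean Shapley table:

```julia
# Neural network mean Shapley values, in the table's covariate order
nn_means = [0.0104829, -0.0246605, -0.00391598, -0.000655213, -0.000503561,
    0.00247408, 0.00222356, 0.0025613, -0.00135861, 0.00604845]

# sortperm gives the permutation that sorts by decreasing absolute value;
# invperm converts it into the rank held by each original position
ranks = invperm(sortperm(abs.(nn_means); rev = true))
```

This reproduces the Rank column of the neural network table above.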

Here’s the corresponding visualization:

Rankings of covariate absolute impact for each model. Rank 1 indicates the greatest impact on the Quinidine clearance random effect.
using CairoMakie
fig = Figure(; size = (850, 700), fontsize = 20)
for (fig_column, model) in enumerate(model_names)
    ax = Axis(
        fig[1, fig_column];
        title = string(model),
        titlegap = 32,
        yreversed = true,
        limits = (-1, 1, 0.5, 10.5),
    )
    hidedecorations!(ax)
    hidespines!(ax)
    for (name, rank) in zip(input_names, covariate_rankings[fig_column])
        text!(ax, 0, rank; text = name, align = (:center, :center))
    end
end
ax = Axis(
    fig[1, 1:3];
    xticks = 0:10,
    limits = (0, 10, 0.5, 10.5),
    yticks = 1:10,
    yreversed = true,
)
hidespines!(ax);
hidedecorations!(ax);
for i = 1:2
    # Horizontal extent of the connecting segments between columns i and i+1
    xs = [2.2, 4.4] .+ 3.4 * (i - 1)
    for (iter, (rank_left, rank_right)) in
        enumerate(zip(covariate_rankings[i], covariate_rankings[i+1]))
        lines!(
            ax,
            xs,
            [rank_left, rank_right];
            linewidth = 3,
            color = iter,
            colormap = :tab10,
            colorrange = (1, 10),
        )
    end
end
display(fig)
save(joinpath(LOCAL, "images", "rankModels_synthetic.svg"), fig)

8 References

Budhkar, Aishwarya, Qianqian Song, Jing Su, and Xuhong Zhang. 2025. “Demystifying the Black Box: A Survey on Explainable Artificial Intelligence (XAI) in Bioinformatics.” Computational and Structural Biotechnology Journal 27: 346–59. https://doi.org/10.1016/j.csbj.2024.12.027.
FDA. 2025. “Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products.” 2025. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/considerations-use-artificial-intelligence-support-regulatory-decision-making-drug-and-biological.
Géron, Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. https://www.deeplearningbook.org/.
Kingma, Diederik P., and Jimmy Ba. 2017. “Adam: A Method for Stochastic Optimization.” https://arxiv.org/abs/1412.6980.
Molnar, Christoph. 2025. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 3rd ed. https://christophm.github.io/interpretable-ml-book.
Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in s and s-PLUS. Springer.
Samek, Wojciech, Grégoire Montavon, Sebastian Lapuschkin, Christopher J. Anders, and Klaus-Robert Müller. 2021. “Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications.” Proceedings of the IEEE 109 (3): 247–78. https://doi.org/10.1109/JPROC.2021.3060483.
Shapley, Lloyd S. 1951. Notes on the n-Person Game — II: The Value of an n-Person Game. Santa Monica, CA: RAND Corporation. https://doi.org/10.7249/RM0670.
Verme, Catherine N., Thomas M. Ludden, William A. Clementi, and Steven C. Harris. 1992. “Pharmacokinetics of Quinidine in Male Patients: A Population Analysis.” Clinical Pharmacokinetics 22 (6): 468–80. https://doi.org/10.2165/00003088-199222060-00005.
Wikipedia. 2025. “Bias–Variance Tradeoff.” 2025. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff.
