
LASSO for Covariate Modeling in NLME Models
1 Introduction
LASSO, which stands for Least Absolute Shrinkage and Selection Operator (Ribbing et al. 2007), is a powerful covariate selection method in statistics. It extends the base model by adding a penalty term to the objective function value, aka loss function, proportional to the sum of the absolute values of the coefficients. This penalty shrinks some coefficients to be close to zero. As a result, LASSO helps in reducing model complexity and selecting a subset of relevant covariates from a potentially large set of covariates. Note that the LASSO solution may not be unique, but it may still offer some insights into the data. The covariates with a coefficient close to 0 can be interpreted as not adding any more predictive power to the model, assuming the remaining covariates with non-zero coefficients are already present in the model.
Assuming the LASSO solution included a non-zero coefficient for covariate \(x_1\), LASSO can also be used to answer questions such as: “does including covariate \(x_2\) replace the need for including covariate \(x_1\) in the model?” One can answer this question by not penalizing the coefficient of \(x_2\), performing LASSO, and then observing the effect of that on the coefficient of \(x_1\) which was previously selected when all coefficients were equally penalized. Not penalizing the coefficient of \(x_2\) makes it more favorable to be selected by LASSO if it adds a comparable amount of predictive power to the model as covariate \(x_1\). If covariate \(x_1\) is still significantly more predictive of the response, LASSO will still assign a non-zero coefficient to \(x_1\) despite the coefficient of \(x_2\) not having a penalty.
The rest of this tutorial is organized as follows:
- Reading and pre-processing the data
- Presenting the LASSO PK model
- Cross-validation and hyper-parameter tuning
- Interpreting the results
In this tutorial, the model and dataset from “Pharmacokinetics of Quinidine in Male Patients” (Verme et al. 1992) will be used and adapted. In the original study, the authors analyzed data from 139 adult hospitalized men receiving oral therapy of quinidine, an anti-arrhythmic drug. In previous work, quinidine had shown considerable inter-subject variability in pharmacokinetic (PK) parameters, prompting the authors to search for predictive covariates, especially for clearance. We will use the resulting data to train a LASSO covariate PK model.
2 Loading packages
First, some environment setup is necessary. The MLJ.jl package will be used throughout the tutorial, but the ML concepts and techniques discussed here are tool-agnostic. Depending on how you are running Julia, you may need to add MLJ.jl separately. Just copy, paste and execute the following code block.
import Pkg
Pkg.add("MLJ", preserve = Pkg.PRESERVE_ALL)
Next, the following code brings MLJ.jl and other packages into the environment. A seed for the random number generator is also set for reproducibility.
using CSV, DataFrames, DataFramesMeta, LinearAlgebra
using Random, MLJ, CategoricalArrays, Pumas, CairoMakie
Random.seed!(378)
3 Data processing
3.1 Data reading
We begin by reading the data. The data can be downloaded from Kaggle, an online platform and community for sharing datasets and trained models. Alternatively, the data may be obtained from the RDatasets.jl package, which makes many datasets from the R programming language available in Julia. Right from the start, we take the opportunity for a bit of processing by removing the first column, which just counts lines. Also, this dataset represents missing values with the string “NA”, so we indicate that in the call to CSV.read().
df = CSV.read("./Quinidine.csv", DataFrame; missingstring = ["NA"])[:, 2:end]
describe(df)[:, Not(7)]
Row | variable | mean | min | median | max | nmissing |
---|---|---|---|---|---|---|
Symbol | Union… | Any | Union… | Any | Int64 | |
1 | Subject | 73.9266 | 1 | 74.0 | 223 | 0 |
2 | time | 373.252 | 0.0 | 60.0 | 8095.5 | 0 |
3 | conc | 2.45429 | 0.4 | 2.3 | 9.4 | 1110 |
4 | dose | 225.302 | 83 | 201.0 | 603 | 443 |
5 | interval | 7.10843 | 4 | 6.0 | 12 | 1222 |
6 | Age | 66.7322 | 42 | 66.0 | 92 | 0 |
7 | Height | 69.1931 | 60 | 69.0 | 79 | 0 |
8 | Weight | 79.6608 | 41 | 78.0 | 119 | 0 |
9 | Race | | Black | | Latin | 0 |
10 | Smoke | | no | | yes | 0 |
11 | Ethanol | | current | | none | 0 |
12 | Heart | | Moderate | | Severe | 0 |
13 | Creatinine | | < 50 | | >= 50 | 0 |
14 | glyco | 1.28168 | 0.39 | 1.23 | 3.16 | 0 |
3.2 Categorical variable recoding
As can be seen from the dataset summary printed previously, some preprocessing is still necessary. First, some columns include categories as strings, which will be encoded as integers. Also, two new columns are necessary for future steps, namely the compartment (cmt) to which the drug is administered and evid, indicating dosing events.
@chain df begin
# Encode string categories as integers
@transform! :Smoke = recode(:Smoke, "no" => 0, "yes" => 1)
@transform! :Heart = recode(:Heart, "No/Mild" => 1, "Moderate" => 2, "Severe" => 3)
@transform! :Creatinine = recode(:Creatinine, ">= 50" => 0, "< 50" => 1)
@transform! :Race = recode(:Race, "Caucasian" => 1, "Latin" => 2, "Black" => 3)
@transform! :Ethanol = recode(:Ethanol, "none" => 1, "former" => 2, "current" => 3)
# Fields for pumas_read()
@transform! :cmt = ifelse(ismissing(:dose), missing, 1)
@rtransform! :evid = ismissing(:dose) ? 0 : 1
end
describe(df)[:, Not(7)]
Row | variable | mean | min | median | max | nmissing |
---|---|---|---|---|---|---|
Symbol | Float64 | Real | Float64 | Real | Int64 | |
1 | Subject | 73.9266 | 1 | 74.0 | 223 | 0 |
2 | time | 373.252 | 0.0 | 60.0 | 8095.5 | 0 |
3 | conc | 2.45429 | 0.4 | 2.3 | 9.4 | 1110 |
4 | dose | 225.302 | 83 | 201.0 | 603 | 443 |
5 | interval | 7.10843 | 4 | 6.0 | 12 | 1222 |
6 | Age | 66.7322 | 42 | 66.0 | 92 | 0 |
7 | Height | 69.1931 | 60 | 69.0 | 79 | 0 |
8 | Weight | 79.6608 | 41 | 78.0 | 119 | 0 |
9 | Race | 1.42284 | 1 | 1.0 | 3 | 0 |
10 | Smoke | 0.303875 | 0 | 0.0 | 1 | 0 |
11 | Ethanol | 1.45615 | 1 | 1.0 | 3 | 0 |
12 | Heart | 1.93202 | 1 | 2.0 | 3 | 0 |
13 | Creatinine | 0.28416 | 0 | 0.0 | 1 | 0 |
14 | glyco | 1.28168 | 0.39 | 1.23 | 3.16 | 0 |
15 | cmt | 1.0 | 1 | 1.0 | 1 | 0 |
16 | evid | 0.698844 | 0 | 1.0 | 1 | 0 |
3.3 Data exploration
Since familiarity with the data we are using can be useful for future troubleshooting, we’ll do a bit of exploration. In the code below, the dataset is first sorted by subject ID. Then the number of unique values per column is determined and listed.
sort!(df, :Subject)
uniqs = Dict([Symbol(n) => unique(col) for (n, col) in zip(names(df), eachcol(df))])
for (name, uni) in uniqs
    println("$(rpad(name, 12))$(length(uni))")
end
Smoke 2
interval 5
Heart 3
time 728
Creatinine 2
evid 2
Age 38
Race 3
Height 20
glyco 145
cmt 1
Ethanol 3
Subject 136
dose 16
Weight 58
conc 57
And to get a better idea of the distributions of variables, we can check the histograms of all columns.
CairoMakie.activate!()
fig = Figure()
for (title, cartInd) in zip(names(df), CartesianIndices((4, 4)))
    ax = Axis(fig[cartInd.I...]; title)
    hist!(ax, df[!, title] |> skipmissing |> collect)
end
fig
We can also check how each patient in the dataset looks. In the following block, columns associated with pharmacokinetics are set aside, and shown for a subject.
= select(df, ["Subject", "time", "dose", "conc", "evid"])
pkDF = groupby(pkDF, :Subject) # Group by subject
pkDFsubGroup 1] # Show rows from subject pkDFsubGroup[
Row | Subject | time | dose | conc | evid |
---|---|---|---|---|---|
Int64 | Float64 | Int64? | Float64? | Int64 | |
1 | 1 | 0.0 | 249 | missing | 1 |
2 | 1 | 3.0 | 249 | missing | 1 |
3 | 1 | 6.0 | 249 | missing | 1 |
4 | 1 | 16.0 | 201 | missing | 1 |
5 | 1 | 24.0 | 201 | missing | 1 |
6 | 1 | 32.0 | 201 | missing | 1 |
7 | 1 | 40.0 | 201 | missing | 1 |
8 | 1 | 49.25 | missing | 1.4 | 0 |
9 | 1 | 138.0 | 201 | missing | 1 |
10 | 1 | 145.5 | 201 | missing | 1 |
11 | 1 | 150.0 | 201 | missing | 1 |
12 | 1 | 160.0 | 402 | missing | 1 |
13 | 1 | 168.0 | 402 | missing | 1 |
14 | 1 | 176.0 | 402 | missing | 1 |
15 | 1 | 184.0 | 402 | missing | 1 |
16 | 1 | 193.33 | missing | 3.0 | 0 |
The same is done to the covariate columns in the code below to check how they look.
# Covariates
covas = [:Smoke, :Heart, :Creatinine, :Age, :Race, :Height, :glyco, :Ethanol, :Weight]
covDF = select(df, [:Subject, covas...]) # New DataFrame with covariates and subject column
covSubGroup = groupby(covDF, :Subject) # Group by subjects
covSubGroup[1] # Show rows from subject
Row | Subject | Smoke | Heart | Creatinine | Age | Race | Height | glyco | Ethanol | Weight |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Float64 | Int64 | Int64 | |
1 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
2 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
3 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
4 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
5 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
6 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
7 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
8 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 0.41 | 3 | 106 |
9 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
10 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
11 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
12 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
13 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
14 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
15 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
16 | 1 | 0 | 2 | 0 | 60 | 1 | 69 | 1.11 | 3 | 106 |
3.4 Data encoding and standardization
Now, let’s actually prepare the data for the models. We begin by taking only the first row for each subject and dropping the :Subject column.
baselineInput = unique(covDF, :Subject)[:, Not(:Subject)]
All covariates must be coerced into numerically continuous values. Continuous and ordinal discrete variables need not be modified. Nominal categorical variables with \(n\) values are one-hot encoded using \(n - 1\) binary variables. Consider :Race as an example, which has 3 possible values. The one-hot encoding of :Race involves representing the race of a subject using 2 binary covariates:
- :Race__1: 1 if and only if the race has the value 1 (Caucasian)
- :Race__2: 1 if and only if the race has the value 2 (Latin)
The third race (Black) is considered the reference value. The covariate effect of :Race__1 is therefore interpreted as the effect of the Caucasian race relative to the reference Black race. Similarly, the covariate effect of :Race__2 is interpreted as the effect of the Latin race relative to the reference Black race.
After encoding all covariates using continuous values, standardization is performed to ensure all covariates are on the same scale. As examples from the data summaries earlier, ‘glyco’ (\(\alpha_1\)-acid glycoprotein) is in the [0.39, 3.16] mg/100 dL range, while ‘weight’ is in [41, 119] kg. In LASSO, it is essential for all covariates to have the same scale to ensure their covariate effects are fairly penalized. After standardization, all the covariates will have a mean of 0 and standard deviation of 1. If a covariate has the same value across all subjects (standard deviation of 0), it should be filtered out.
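Before building the actual preprocessing pipeline, here is a minimal hand-rolled sketch of the two transformations on a made-up toy table (the raceDemo DataFrame, the column names and the zscale helper are illustrative assumptions; the tutorial itself relies on the MLJ pipeline shown next).
using Statistics # for mean and std in this illustrative sketch
# Toy data: Race coded 1 = Caucasian, 2 = Latin, 3 = Black (reference)
raceDemo = DataFrame(Race = [1, 2, 3, 1, 3], Weight = [70.0, 82.0, 95.0, 61.0, 88.0])
# One-hot encode Race with n - 1 = 2 binary columns (Black is the reference)
raceDemo.Race__1 = Float64.(raceDemo.Race .== 1) # 1 if and only if Caucasian
raceDemo.Race__2 = Float64.(raceDemo.Race .== 2) # 1 if and only if Latin
# Standardize a numeric column to mean 0 and standard deviation 1
zscale(v) = (v .- mean(v)) ./ std(v)
raceDemo.Weight_std = zscale(raceDemo.Weight)
raceDemo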
# Coerce and one-hot encode the nominal categorical Race covariate
ordFacCoerce = inp -> coerce(inp, Symbol("Race") => OrderedFactor)
contCoerce = ContinuousEncoder(; one_hot_ordered_factors = true, drop_last = true)
# Also standardize, since covariates have different units/sizes
preparePipe = ordFacCoerce |> contCoerce |> Standardizer
# Apply transformations in pipeline
machPrepare = machine(preparePipe, baselineInput)
fit!(machPrepare, verbosity = 0)
stdInputs = MLJ.transform(machPrepare, baselineInput)
# Check results
describe(stdInputs, :mean, :std, :min, :max)
Row | variable | mean | std | min | max |
---|---|---|---|---|---|
Symbol | Float64 | Float64 | Float64 | Float64 | |
1 | Smoke | 6.04092e-17 | 1.0 | -0.700621 | 1.41681 |
2 | Heart | 2.12249e-17 | 1.0 | -1.05675 | 1.33855 |
3 | Creatinine | 7.9185e-17 | 1.0 | -0.59779 | 1.66053 |
4 | Age | -4.89804e-18 | 1.0 | -2.79031 | 2.81836 |
5 | Race__1 | 9.63282e-17 | 1.0 | -1.41681 | 0.700621 |
6 | Race__2 | 4.73477e-17 | 1.0 | -0.586504 | 1.69248 |
7 | Height | 1.96412e-15 | 1.0 | -2.85786 | 2.77932 |
8 | glyco | -3.60822e-16 | 1.0 | -1.72042 | 4.27123 |
9 | Ethanol | 1.26533e-16 | 1.0 | -0.653312 | 2.21283 |
10 | Weight | 1.68166e-16 | 1.0 | -2.46618 | 2.51976 |
Finally, let’s combine all the standardized covariates together in a single column of vectors \(x\):
x_names = names(stdInputs)
stdInputs.x = Vector{Float64}.(eachrow(stdInputs))
stdInputs.Subject = unique(df.Subject)
cov_df = leftjoin(df, stdInputs[:, [:Subject, :x]], on = :Subject)
4 Covariate-free model
4.1 Model definition
Based on (Verme et al. 1992), below is the definition for a Pumas PK model with one compartment, first-order absorption and elimination, 3 fixed effect parameters (clearance tvcl, central volume tvvc, and absorption constant tvka), associated random effects, and an additive residual error term (\(\sigma\)). Additionally, reference parameter values were extracted from (Pinheiro and Bates 2000) to be used as initial parameter values when fitting.
model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end
# Parameters
params = (;
    tvcl = ℯ^2.5, # 12.18
    tvvc = ℯ^5.4, # 221
    tvka = ℯ^-0.21, # 0.81
    Ω = [0.056 0.0 0.0; 0.0 0.03 0.0; 0.0 0.0 0.01],
    σ = 0.0123,
)
The above model does not include any covariates yet. It is generally a good idea to fit such a model first and obtain values for the population parameters before introducing covariates. The optimal population parameter values when fitting the covariate-free model can then be used as initial parameter values when fitting the model with covariates included.
4.2 Pumas population
To fit the above Pumas model to the data, the data needs to be converted to a Pumas Population using the read_pumas function:
pop = read_pumas(
    df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    observations = [:conc],
)
Population
Subjects: 136
Observations: conc
Note that we created the columns cmt and evid because they are required by read_pumas.
4.3 Model fit
Now, we fit the model using the FOCE algorithm:
fpm = fit(model, pop, params, FOCE(), optim_options = (; show_trace = false))
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
FittedPumasModel
Dynamical system type: Closed form
Number of subjects: 136
Observation records: Active Missing
conc: 361 82
Total: 361 82
Number of parameters: Constant Optimized
0 10
Likelihood approximation: FOCE
Likelihood optimizer: BFGS
Termination Reason: NoXChange
Log-likelihood value: -611.20246
-----------------
Estimate
-----------------
tvcl 9.5526
tvvc 35.525
tvka 0.4292
Ω₁,₁ 0.096606
Ω₂,₁ -0.099594
Ω₃,₁ -0.17087
Ω₂,₂ 0.19311
Ω₃,₂ 0.14906
Ω₃,₃ 0.32244
σ 1.1192
-----------------
5 LASSO model
5.1 Laplace prior
The LASSO objective function value to be minimized is:
\[ -\log p(y \mid x, \beta) + \lambda \cdot \left\lVert \beta \right\rVert_1 \]
where \(\beta\) is the set of covariate effects in the model and \(p(y \mid x, \beta)\) is the marginal likelihood of the model given the population data. An equivalent way to implement LASSO is using a Laplace prior distribution on \(\beta\) and using maximum a-posteriori (MAP) estimation to optimize for \(\beta\).
\[ \begin{aligned} \beta_i & \sim \text{Laplace}(0, b) \\ -\log p(\beta) & = \sum_{i=1}^{n_\beta} \Bigg( \log (2 \cdot b) + \frac{|\beta_i|}{b} \Bigg) \\ & = \underbrace{n_\beta \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_\beta} |\beta_i|}_{\left\lVert \beta \right\rVert_1} \\ & = \lambda \cdot \left\lVert \beta \right\rVert_1 + \text{constant} \end{aligned} \]
Therefore, by setting the parameter \(b\) of the Laplace prior, one can control the LASSO penalty parameter \(\lambda = \frac{1}{b}\).
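As a quick numeric sanity check of this equivalence (an illustrative sketch; λ_demo, b_demo and β_demo are made-up values), the negative log density of the Laplace prior equals the L1 penalty plus a constant that does not depend on \(\beta\):
λ_demo = 2.0 # example penalty
b_demo = 1 / λ_demo # corresponding Laplace scale
β_demo = [-0.3, 0.0, 1.2] # example coefficients
neg_log_prior = -sum(Distributions.logpdf.(Distributions.Laplace(0.0, b_demo), β_demo))
penalty = λ_demo * sum(abs, β_demo)
constant = length(β_demo) * log(2 * b_demo)
neg_log_prior ≈ penalty + constant # true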
5.2 Pumas model
To implement the above prior in a Pumas model, the following model can be used:
λ = 1.0 # LASSO penalty parameter
# multi-variate Laplace distribution
function mvlaplace(λ)
    return product_distribution(fill(Distributions.Laplace(0.0, 1 / λ), 10))
end
model_map = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
        βcl ~ mvlaplace(λ)
        βvc ~ mvlaplace(λ)
        βka ~ mvlaplace(λ)
    end
    @covariates x
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1] + dot(x, βcl))
        Vc = tvvc * exp(η[2] + dot(x, βvc))
        Ka = tvka * exp(η[3] + dot(x, βka))
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end
In the above model, all the standardized covariates are assumed to be in a vector \(x\) for each subject. The vector \(x\) has length 10 for each subject.
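Related to the question raised in the introduction, selected coefficients can be left effectively unpenalized by giving them a very wide prior. The helper below is a hedged sketch of one way to do that (the name mvlaplace_masked and the 1e6 scale are our own choices, not part of the original analysis); one would then write, for example, βcl ~ mvlaplace_masked(λ, 10, [2]) instead of βcl ~ mvlaplace(λ) to leave the second covariate effect on CL unpenalized.
# Sketch: Laplace prior with per-coefficient scales; indices listed in
# `unpenalized` get a very wide (effectively flat) prior, the rest get 1/λ.
function mvlaplace_masked(λ, n, unpenalized)
    scales = [i in unpenalized ? 1.0e6 : 1 / λ for i = 1:n]
    return product_distribution([Distributions.Laplace(0.0, s) for s in scales])
end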
5.3 Pumas population with covariates
cov_pop = read_pumas(
    cov_df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    covariates = [:x],
    observations = [:conc],
)
Population
Subjects: 136
Covariates: x
Observations: conc
5.4 Fitting the model
To fit the above model using MAP estimation, call:
# initial parameter values
map_params = merge(init_params(model_map), coef(fpm))
fpm_map =
    fit(model_map, cov_pop, map_params, MAP(FOCE()), optim_options = (; show_trace = false))
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
FittedPumasModel
Dynamical system type: Closed form
Number of subjects: 136
Observation records: Active Missing
conc: 361 82
Total: 361 82
Number of parameters: Constant Optimized
0 40
Likelihood approximation: MAP (FOCE)
Likelihood optimizer: BFGS
Termination Reason: NoObjectiveChange
Log-likelihood value: -582.67
-------------------
Estimate
-------------------
tvcl 9.9192
tvvc 24.887
tvka 0.29886
Ω₁,₁ 0.075356
Ω₂,₁ -0.071626
Ω₃,₁ -0.079186
Ω₂,₂ 0.096738
Ω₃,₂ 0.15427
Ω₃,₃ 0.30102
σ 1.0529
βcl₁ -0.00719
βcl₂ -0.051275
βcl₃ -0.05313
βcl₄ 0.045971
βcl₅ 0.12473
βcl₆ 0.11286
βcl₇ 0.0018929
βcl₈ -0.12279
βcl₉ 0.0068348
βcl₁₀ 0.045669
βvc₁ -0.028743
βvc₂ 0.18605
βvc₃ 0.070931
βvc₄ 0.026971
βvc₅ -0.29503
βvc₆ 0.11275
βvc₇ 0.062718
βvc₈ -0.0074304
βvc₉ -0.25887
βvc₁₀ -0.01051
βka₁ 1.6909e-9
βka₂ 0.051136
βka₃ 0.067026
βka₄ 0.11377
βka₅ -5.6923e-7
βka₆ 0.39926
βka₇ -0.1641
βka₈ 0.15879
βka₉ -0.13896
βka₁₀ -0.068371
-------------------
The following function summarizes the covariate effects and sorts them by their absolute value for any input variable var, where var is either :CL, :Vc or :Ka. A higher absolute covariate effect indicates higher relative importance.
function sort_by_abs(var::Symbol)
    coef_map = coef(fpm_map)
    effect_summary = DataFrame(
        :covariate => x_names,
        :CL => coef_map.βcl,
        :Vc => coef_map.βvc,
        :Ka => coef_map.βka,
    )
    sort!(effect_summary, [var], by = abs, rev = true)
    return effect_summary[:, [:covariate, var]]
end
println(sort_by_abs(:CL), "\n")
println(sort_by_abs(:Vc), "\n")
println(sort_by_abs(:Ka), "\n");
10×2 DataFrame
 Row │ covariate   CL
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__1      0.124728
   2 │ glyco       -0.122792
   3 │ Race__2      0.112856
   4 │ Creatinine  -0.0531301
   5 │ Heart       -0.051275
   6 │ Age          0.0459713
   7 │ Weight       0.0456693
   8 │ Smoke       -0.00719001
   9 │ Ethanol      0.00683482
  10 │ Height       0.00189294

10×2 DataFrame
 Row │ covariate   Vc
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__1     -0.295029
   2 │ Ethanol     -0.258868
   3 │ Heart        0.18605
   4 │ Race__2      0.11275
   5 │ Creatinine   0.0709315
   6 │ Height       0.0627181
   7 │ Smoke       -0.0287432
   8 │ Age          0.0269712
   9 │ Weight      -0.0105103
  10 │ glyco       -0.00743039

10×2 DataFrame
 Row │ covariate   Ka
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__2      0.399258
   2 │ Height      -0.164095
   3 │ glyco        0.158789
   4 │ Ethanol     -0.138958
   5 │ Age          0.11377
   6 │ Weight      -0.068371
   7 │ Creatinine   0.0670258
   8 │ Heart        0.0511355
   9 │ Race__1     -5.69232e-7
  10 │ Smoke        1.69087e-9
Covariates with an effect whose absolute value is less than a user-specified threshold, e.g. 1e-4, can be dropped from the model. A higher threshold can be used if a smaller model is required.
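As a concrete illustration of this rule (a sketch; the threshold value and the variable names are ours), the covariates retained for CL could be listed as follows:
threshold = 1e-4 # user-specified cutoff on the absolute covariate effect
cl_effects = sort_by_abs(:CL)
retained_cl = cl_effects[abs.(cl_effects.CL) .>= threshold, :covariate]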
6 Cross-validation and hyper-parameter selection
6.1 Effect of \(\lambda\)
In the above fit, the value for \(\lambda\) was set to 1.0 upfront. However, that value is a hyper-parameter which needs to be objectively selected. A higher \(\lambda\) has the following effects:
- Covariate effects are more likely to be sparse, and
- The model is more likely to under-fit.
A lower \(\lambda\) has the opposite effects where:
- Covariate effects are less likely to be sparse, and
- The model is more likely to over-fit.
The optimal value of \(\lambda\) is one that maximizes the model’s ability to fit unseen data, not used to fit the model, thus avoiding both under-fitting and over-fitting. To quantify this ability, cross-validation is typically used.
Note that instead of using a single \(\lambda\), one may also use a different \(\lambda\) for each of the covariate effects on CL, Vc and Ka. This will likely give a better model at the end. However, for simplicity, we use a single \(\lambda\) in this tutorial.
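For reference, such a per-parameter penalty could be implemented with a variant of the model builder used later in this tutorial (this sketch and the name createModelPerParam are ours and are not used in the rest of the analysis); the triple (λ_cl, λ_vc, λ_ka) would then be tuned jointly.
function createModelPerParam(λ_cl, λ_vc, λ_ka)
    return @model begin
        @param begin
            tvcl ∈ RealDomain(; lower = 0)
            tvvc ∈ RealDomain(; lower = 0)
            tvka ∈ RealDomain(; lower = 0)
            Ω ∈ PSDDomain(3)
            σ ∈ RealDomain(; lower = 0)
            βcl ~ mvlaplace(λ_cl) # separate penalty for clearance effects
            βvc ~ mvlaplace(λ_vc) # separate penalty for volume effects
            βka ~ mvlaplace(λ_ka) # separate penalty for absorption effects
        end
        @covariates x
        @random η ~ MvNormal(Ω)
        @pre begin
            CL = tvcl * exp(η[1] + dot(x, βcl))
            Vc = tvvc * exp(η[2] + dot(x, βvc))
            Ka = tvka * exp(η[3] + dot(x, βka))
        end
        @dynamics Depots1Central1
        @derived conc ~ @. Normal(Central / Vc, σ)
    end
end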
6.2 Cross-validation
One common strategy to split the data in the context of hyperparameter search is cross-validation (CV), which Figure 1 illustrates. The hyperparameter optimization procedure can be summarized as follows:
1. Split the data into \(k\) ‘folds’ (partitions) of roughly equal size.
2. For each value of \(\lambda\):
   - For each fold of the \(k\) folds:
     - Assign the selected fold as the validation data, and use the remaining folds for fitting.
     - Fit/train the model using the training folds and the selected value of \(\lambda\).
     - Evaluate the fitted model’s performance on the validation fold, e.g. using the log likelihood of the fitted model given the validation data.
   - Take the average performance metric across all folds.
3. Use an optimization or search strategy to choose the best \(\lambda\) that maximizes the average model performance on unseen data.
4. Re-fit the model using the optimal \(\lambda\) and the entire dataset.
K-fold cross-validation can also be used without changing hyperparameters between iterations to get a measure of a model’s performance on unseen data for model comparison.
While there is generally no easy way to select the number of folds K (Kohavi 2001), values from 5 to 10 are common. Higher values increase the computational cost of the procedure, because more fits need to be done, but reduce the bias when estimating the model’s performance. Lower K values are faster to run but more biased.
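To make the fold-splitting step concrete, here is a minimal subject-level sketch using plain index arithmetic (illustration only; K, n_subjects and the variable names are our own choices, and the tutorial below relies on Pumas.cv_populations instead):
K = 5 # number of folds
n_subjects = 136
shuffled = Random.shuffle(1:n_subjects)
folds = [shuffled[i:K:end] for i = 1:K] # K disjoint validation index sets
train_idx = [setdiff(1:n_subjects, f) for f in folds] # matching training sets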
6.3 Select \(\lambda\)
The cross-validation procedure presented above will now be applied repeatedly to determine the best value of \(\lambda\), with the average log likelihood on the validation data used as the metric to compare models.
First, some setup is required. In the following code, the function createModel is defined for code organization. It returns a Pumas model using a provided \(\lambda\). Then, Pumas.cv_populations is used to split the population. In the split object returned, the fields train_data and validation_data are vectors of populations. The former contains the 7 training populations, one per cross-validation fold; the latter, the respective 7 validation populations. Lastly, the values of \(\lambda\) to be tried are defined in lambdas, spanning 1e-5 to 1e3 on a log10 scale.
function createModel(λ)
    return @model begin
        @param begin
            tvcl ∈ RealDomain(; lower = 0)
            tvvc ∈ RealDomain(; lower = 0)
            tvka ∈ RealDomain(; lower = 0)
            Ω ∈ PSDDomain(3)
            σ ∈ RealDomain(; lower = 0)
            βcl ~ mvlaplace(λ)
            βvc ~ mvlaplace(λ)
            βka ~ mvlaplace(λ)
        end
        @covariates x
        @random η ~ MvNormal(Ω)
        @pre begin
            CL = tvcl * exp(η[1] + dot(x, βcl))
            Vc = tvvc * exp(η[2] + dot(x, βvc))
            Ka = tvka * exp(η[3] + dot(x, βka))
        end
        @dynamics Depots1Central1
        @derived conc ~ @. Normal(Central / Vc, σ)
    end
end
# Split population
split = Pumas.cv_populations(cov_pop, KFold(; K = 7), BySubject())
(; train_data, validation_data) = split
lambdas = 10 .^ (range(-5, 3, length = 20))
20-element Vector{Float64}:
1.0e-5
2.6366508987303607e-5
6.951927961775606e-5
0.00018329807108324357
0.0004832930238571752
0.0012742749857031334
0.003359818286283781
0.008858667904100823
0.023357214690901226
0.06158482110660264
0.16237767391887217
0.4281332398719394
1.128837891684689
2.976351441631318
7.847599703514613
20.6913808111479
54.5559478116852
143.8449888287663
379.26901907322497
1000.0
With these definitions, we are able to run the procedure. For optimization, we use a simple log-scale grid search technique, where we iterate over pre-specified \(\lambda\) values. For each \(\lambda\), k-fold cross-validation is done by looping over the train and test pairs of populations in train_data and validation_data. In each fold:
- The model is built with the current \(\lambda\).
- The initial parameters are defined.
- The model is fitted to training data.
- The validation log likelihood of the fitted model is stored.
For each \(\lambda\), the mean validation log likelihood is calculated, averaged across the \(K\) runs. Runs giving -Inf or NaN validation log likelihood values are assumed to be unlucky runs and ignored. If the number of unlucky runs is significant, the values of \(\lambda\) giving such non-finite validation log likelihood values should be penalized and not selected.
cv_result = map(lambdas) do λ
    # cv with current λ
    @info "Running cross-validation for λ = $λ."
    map(zip(train_data, validation_data)) do (train, test)
        # model with current lambda
        fold_model = createModel(λ)
        fold_params = merge(init_params(fold_model), coef(fpm))
        # fit to current train fold
        fold_fpm = fit(
            fold_model,
            train,
            fold_params,
            MAP(FOCE()),
            optim_options = (; show_trace = false),
            verbose = false,
        )
        # evaluate loglikelihood in current validation fold
        return loglikelihood(fold_model, test, coef(fold_fpm), FOCE())
    end
end
medFin = mean.(filter.(isfinite, cv_result))
cvDF = DataFrame(λ = lambdas, mean_val_ll = medFin)
Row | λ | mean_val_ll |
---|---|---|
Float64 | Float64 | |
1 | 1.0e-5 | -93.7011 |
2 | 2.63665e-5 | -93.701 |
3 | 6.95193e-5 | -93.6999 |
4 | 0.000183298 | -93.7007 |
5 | 0.000483293 | -93.6998 |
6 | 0.00127427 | -93.699 |
7 | 0.00335982 | -93.6964 |
8 | 0.00885867 | -93.6889 |
9 | 0.0233572 | -93.6698 |
10 | 0.0615848 | -93.6256 |
11 | 0.162378 | -94.0104 |
12 | 0.428133 | -93.9451 |
13 | 1.12884 | -93.0878 |
14 | 2.97635 | -89.6189 |
15 | 7.8476 | -87.2926 |
16 | 20.6914 | -86.8912 |
17 | 54.5559 | -84.7182 |
18 | 143.845 | -84.7182 |
19 | 379.269 | -84.7182 |
20 | 1000.0 | -84.7182 |
As shown in the final DataFrame, increasing \(\lambda\) keeps improving the model’s performance, which plateaus at the largest values. Recall that \(\lambda\) is a penalty coefficient in the following objective function value being minimized when fitting the model:
\[ -\log p(y \mid x, \beta) + \lambda \cdot \left\lVert \beta \right\rVert_1 \]
A high value of \(\lambda = 1000\) penalizes all the covariates heavily, driving all of their coefficients close to 0, i.e. none of the covariates is deemed significant. This can happen in real analyses where none of the covariates has significant predictive power.
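Before moving on, it can also help to visualize the \(\lambda\) sweep (a sketch using the already-loaded CairoMakie; fig_cv and ax_cv are our own names):
fig_cv = Figure()
ax_cv = Axis(
    fig_cv[1, 1];
    xscale = log10,
    xlabel = "λ",
    ylabel = "mean validation log likelihood",
)
scatterlines!(ax_cv, cvDF.λ, cvDF.mean_val_ll)
fig_cv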
The dataset used above is a real dataset, so we have no control over the result. However, for didactic purposes, let’s simulate data from a presumed covariate model to showcase the next couple of steps that would be carried out in this type of analysis.
Below, simobs uses model_map to simulate observations for cov_pop. The covariate effects (\(\beta\)) of smoke, heart and creatinine are arbitrarily (no biological interpretation intended) and manually specified to impact clearance, volume and absorption, respectively. This way, LASSO will have relationships to identify. The remaining parameters are taken from fpm. The data splits and \(\lambda\) values are then defined.
sims =
    Subject.(
        simobs(
            model_map,
            cov_pop,
            merge(
                (;
                    βcl = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    βvc = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                    βka = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
                ),
                coef(fpm),
            ),
        )
    )
split = Pumas.cv_populations(sims, KFold(; K = 7), BySubject())
(; train_data, validation_data) = split
lambdas = 10 .^ (range(-5, 3, length = 20))
20-element Vector{Float64}:
1.0e-5
2.6366508987303607e-5
6.951927961775606e-5
0.00018329807108324357
0.0004832930238571752
0.0012742749857031334
0.003359818286283781
0.008858667904100823
0.023357214690901226
0.06158482110660264
0.16237767391887217
0.4281332398719394
1.128837891684689
2.976351441631318
7.847599703514613
20.6913808111479
54.5559478116852
143.8449888287663
379.26901907322497
1000.0
Now, the previous steps are repeated to optimize the hyperparameter \(\lambda\) using cross-validation, including taking the average of the validation log likelihood across folds and organizing the results in a DataFrame.
cv_result = map(lambdas) do λ
    # cv with current λ
    @info "Running cross-validation for λ = $λ."
    map(zip(train_data, validation_data)) do (train, test)
        # model with current lambda
        fold_model = createModel(λ)
        fold_params = merge(init_params(fold_model), coef(fpm))
        # fit to current train fold
        fold_fpm = fit(
            fold_model,
            train,
            fold_params,
            MAP(FOCE()),
            optim_options = (; show_trace = false),
            verbose = false,
        )
        # evaluate loglikelihood in current validation fold
        return loglikelihood(fold_model, test, coef(fold_fpm), FOCE())
    end
end
medFin = mean.(filter.(isfinite, cv_result))
cvDF = DataFrame(λ = lambdas, mean_val_ll = medFin)
Row | λ | mean_val_ll |
---|---|---|
Float64 | Float64 | |
1 | 1.0e-5 | -118.028 |
2 | 2.63665e-5 | -118.028 |
3 | 6.95193e-5 | -118.03 |
4 | 0.000183298 | -118.029 |
5 | 0.000483293 | -118.863 |
6 | 0.00127427 | -118.032 |
7 | 0.00335982 | -118.033 |
8 | 0.00885867 | -118.033 |
9 | 0.0233572 | -118.041 |
10 | 0.0615848 | -118.78 |
11 | 0.162378 | -118.463 |
12 | 0.428133 | -118.832 |
13 | 1.12884 | -116.239 |
14 | 2.97635 | -111.059 |
15 | 7.8476 | -109.868 |
16 | 20.6914 | -108.123 |
17 | 54.5559 | -118.349 |
18 | 143.845 | -130.503 |
19 | 379.269 | -148.935 |
20 | 1000.0 | -148.935 |
This time, we can see the performance doesn’t improve monotonically with larger \(\lambda\) values. It reaches an optimum with \(\lambda = 20.7\). To finish the analysis, the best \(\lambda\) can be used to fit the model to the entire (simulated) dataset. Note that the non-zero covariate effects learned indeed correspond to the true covariate effects assumed above.
bestλ = sort(cvDF, :mean_val_ll; rev = true)[1, :λ]
model_bestλ = createModel(bestλ)
params_bestλ = merge(init_params(model_bestλ), coef(fpm))
fpm_bestλ = fit(
    model_bestλ,
    sims,
    params_bestλ,
    MAP(FOCE()),
    optim_options = (; show_trace = false),
    verbose = false,
)
FittedPumasModel
Dynamical system type: Closed form
Number of subjects: 136
Observation records: Active Missing
conc: 443 0
Total: 443 0
Number of parameters: Constant Optimized
0 40
Likelihood approximation: MAP (FOCE)
Likelihood optimizer: BFGS
Termination Reason: NoObjectiveChange
Log-likelihood value: -745.66677
--------------------
Estimate
--------------------
tvcl 8.6963
tvvc 40.644
tvka 0.54269
Ω₁,₁ 0.042493
Ω₂,₁ -0.067561
Ω₃,₁ -0.096912
Ω₂,₂ 0.11312
Ω₃,₂ 0.17362
Ω₃,₃ 0.30136
σ 1.206
βcl₁ 0.90992
βcl₂ 0.0035772
βcl₃ -6.38e-6
βcl₄ -0.003265
βcl₅ -9.5192e-6
βcl₆ 0.0017061
βcl₇ -0.00047544
βcl₈ 0.00011199
βcl₉ -0.074301
βcl₁₀ 0.032713
βvc₁ 0.001681
βvc₂ 0.81006
βvc₃ -0.13354
βvc₄ 0.00020026
βvc₅ 0.011292
βvc₆ -0.0063178
βvc₇ -0.00023663
βvc₈ 0.0016102
βvc₉ -0.0017478
βvc₁₀ -0.0032734
βka₁ 0.00083222
βka₂ -0.0069325
βka₃ 0.80762
βka₄ 0.0032135
βka₅ 0.0014283
βka₆ -0.0016305
βka₇ 0.00096685
βka₈ -0.00024714
βka₉ -0.00069845
βka₁₀ 0.00019081
--------------------
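To tie this back to the thresholding rule discussed earlier, the covariates retained by the best-\(\lambda\) fit can be inspected with a small sketch like the following (best_coef, best_effects and retained are our own names; the same pattern applies to βvc and βka):
best_coef = coef(fpm_bestλ)
best_effects = DataFrame(
    covariate = x_names,
    CL = best_coef.βcl,
    Vc = best_coef.βvc,
    Ka = best_coef.βka,
)
# Covariates whose clearance effect survives the 1e-4 cutoff
retained = best_effects[abs.(best_effects.CL) .>= 1e-4, [:covariate, :CL]]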