LASSO for Covariate Modeling in NLME Models

Authors

Mohamed Tarek

Lucas Pereira

1 Introduction

LASSO, which stands for Least Absolute Shrinkage and Selection Operator (Ribbing et al. 2007), is a powerful covariate selection method in statistics. It augments the base model's objective function (loss function) with a penalty term proportional to the sum of the absolute values of the coefficients. This penalty shrinks some coefficients towards zero. As a result, LASSO helps reduce model complexity and select a subset of relevant covariates from a potentially large set of covariates. Note that the LASSO solution may not be unique, but it may still offer some insights into the data. The covariates with a coefficient close to 0 can be interpreted as not adding any more predictive power to the model, assuming the remaining covariates with non-zero coefficients are already present in the model.
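
To make the penalty concrete before moving to NLME models, the following toy sketch evaluates a LASSO objective for a plain linear regression; the design matrix, response, coefficients, and λ value are made up purely for illustration.

using LinearAlgebra, Random

Random.seed!(1)
# LASSO objective for a linear model y ≈ X * β (toy illustration only).
# The first term measures the data misfit; the L1 penalty grows with the
# absolute values of the coefficients and shrinks them towards zero.
lasso_objective(β, X, y, λ) = 0.5 * sum(abs2, y - X * β) + λ * norm(β, 1)

X = randn(20, 3)
y = X * [1.5, 0.0, -0.7] + 0.1 * randn(20)
lasso_objective([1.5, 0.0, -0.7], X, y, 1.0) # penalty contributes λ * (1.5 + 0.0 + 0.7)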

Assuming the LASSO solution included a non-zero coefficient for covariate \(x_1\), LASSO can also be used to answer questions such as: “does including covariate \(x_2\) replace the need for including covariate \(x_1\) in the model?” One can answer this question by not penalizing the coefficient of \(x_2\), performing LASSO, and then observing the effect of that on the coefficient of \(x_1\) which was previously selected when all coefficients were equally penalized. Not penalizing the coefficient of \(x_2\) makes it more favorable to be selected by LASSO if it adds a comparable amount of predictive power to the model as covariate \(x_1\). If covariate \(x_1\) is still significantly more predictive of the response, LASSO will still assign a non-zero coefficient to \(x_1\) despite the coefficient of \(x_2\) not having a penalty.

The rest of this tutorial is organized as follows:

  1. Reading and pre-processing the data
  2. Presenting the LASSO PK model
  3. Cross-validation and hyper-parameter tuning
  4. Interpreting the results

In this tutorial, the model and dataset from “Pharmacokinetics of Quinidine in Male Patients” (Verme et al. 1992) will be used and adapted. In the original study, the authors analyzed data from 139 adult hospitalized men receiving oral therapy with quinidine, an anti-arrhythmic drug. In previous works, quinidine had shown considerable inter-subject variability in pharmacokinetic (PK) parameters, prompting the authors to search for predictive covariates, especially for clearance. We will use the resulting data to train a LASSO covariate PK model.

2 Loading packages

First, some environment setup is necessary. The MLJ.jl package will be used throughout the tutorial, but the ML concepts and techniques discussed here are tool-agnostic. Depending on how you are running Julia, you may need to add MLJ.jl separately. Just copy, paste and execute the following code block.

import Pkg
Pkg.add("MLJ", preserve = Pkg.PRESERVE_ALL)

Next, the following code brings MLJ.jl and other packages into the environment. A seed for the random number generator is also set for reproducibility.

using CSV, DataFrames, DataFramesMeta, LinearAlgebra
using Random, MLJ, CategoricalArrays, Pumas
Random.seed!(378)

3 Data processing

3.1 Data reading

We begin by reading the data. The data can be downloaded from Kaggle, an online platform and community for sharing datasets and trained models. Alternatively, the data may be obtained from the RDatasets.jl package, which makes many datasets from the R ecosystem available in Julia. Right from the start, we take the opportunity for a bit of processing by removing the first column, which is just a row index. Also, this dataset represents missing values with the string “NA”, so we indicate that in the call to CSV.read().

df = CSV.read("./Quinidine.csv", DataFrame; missingstring = ["NA"])[:, 2:end]
describe(df)[:, Not(7)]
14×6 DataFrame
 Row │ variable    mean      min       median  max     nmissing
     │ Symbol      Union…    Any       Union…  Any     Int64
─────┼──────────────────────────────────────────────────────────
   1 │ Subject     73.9266   1         74.0    223     0
   2 │ time        373.252   0.0       60.0    8095.5  0
   3 │ conc        2.45429   0.4       2.3     9.4     1110
   4 │ dose        225.302   83        201.0   603     443
   5 │ interval    7.10843   4         6.0     12      1222
   6 │ Age         66.7322   42        66.0    92      0
   7 │ Height      69.1931   60        69.0    79      0
   8 │ Weight      79.6608   41        78.0    119     0
   9 │ Race                  Black             Latin   0
  10 │ Smoke                 no                yes     0
  11 │ Ethanol               current           none    0
  12 │ Heart                 Moderate          Severe  0
  13 │ Creatinine            < 50              >= 50   0
  14 │ glyco       1.28168   0.39      1.23    3.16    0

3.2 Categorical variable recoding

As can be seen by the dataset summary printed previously, some preprocessing is still necessary. First, some columns include categories as strings, which will be encoded as integers. Also, two new columns are necessary for future steps, namely the compartment (cmt) to which the drug is administered and evid indicating dosing events.

@chain df begin
    # Encode string categories as integers
    @transform! :Smoke = recode(:Smoke, "no" => 0, "yes" => 1)
    @transform! :Heart = recode(:Heart, "No/Mild" => 1, "Moderate" => 2, "Severe" => 3)
    @transform! :Creatinine = recode(:Creatinine, ">= 50" => 0, "< 50" => 1)
    @transform! :Race = recode(:Race, "Caucasian" => 1, "Latin" => 2, "Black" => 3)
    @transform! :Ethanol = recode(:Ethanol, "none" => 1, "former" => 2, "current" => 3)
    # Fields needed by read_pumas()
    @transform! :cmt = ifelse(ismissing(:dose), missing, 1)
    @rtransform! :evid = ismissing(:dose) ? 0 : 1
end
describe(df)[:, Not(7)]
16×6 DataFrame
 Row │ variable    mean      min   median   max     nmissing
     │ Symbol      Float64   Real  Float64  Real    Int64
─────┼────────────────────────────────────────────────────────
   1 │ Subject     73.9266   1     74.0     223     0
   2 │ time        373.252   0.0   60.0     8095.5  0
   3 │ conc        2.45429   0.4   2.3      9.4     1110
   4 │ dose        225.302   83    201.0    603     443
   5 │ interval    7.10843   4     6.0      12      1222
   6 │ Age         66.7322   42    66.0     92      0
   7 │ Height      69.1931   60    69.0     79      0
   8 │ Weight      79.6608   41    78.0     119     0
   9 │ Race        1.42284   1     1.0      3       0
  10 │ Smoke       0.303875  0     0.0      1       0
  11 │ Ethanol     1.45615   1     1.0      3       0
  12 │ Heart       1.93202   1     2.0      3       0
  13 │ Creatinine  0.28416   0     0.0      1       0
  14 │ glyco       1.28168   0.39  1.23     3.16    0
  15 │ cmt         1.0       1     1.0      1       0
  16 │ evid        0.698844  0     1.0      1       0

3.3 Data exploration

Since familiarity with the data we are using can be useful for future troubleshooting, we’ll do a bit of exploration. In the code below, the dataset is first sorted by subject ID. Then the number of unique values per column is determined and listed.

sort!(df, :Subject)
uniqs = Dict([Symbol(n) => unique(col) for (n, col) in zip(names(df), eachcol(df))])
for (name, uni) in uniqs
    println("$(rpad(name, 12))$(length(uni))")
end
Smoke       2
interval    5
Heart       3
time        728
Creatinine  2
evid        2
Age         38
Race        3
Height      20
glyco       145
cmt         1
Ethanol     3
Subject     136
dose        16
Weight      58
conc        57

And to get a better idea of the distributions of variables, Figure 1 includes the histograms of all columns.

Figure 1: Histograms of variables.
using CairoMakie # plotting backend providing Figure, Axis, hist! and save
CairoMakie.activate!()
fig = Figure()
# One histogram per column, laid out on a 4×4 grid
for (title, cartInd) in zip(names(df), CartesianIndices((4, 4)))
    ax = Axis(fig[cartInd.I...]; title)
    hist!(ax, df[!, title] |> skipmissing |> collect)
end
save("./histograms.pdf", fig)

We can also check what the data look like for an individual patient. In the following block, the pharmacokinetics-related columns are selected and shown for one subject.

pkDF = select(df, ["Subject", "time", "dose", "conc", "evid"])
pkDFsubGroup = groupby(pkDF, :Subject) # Group by subject
pkDFsubGroup[1] # Show rows from subject
16×5 SubDataFrame
 Row │ Subject  time     dose     conc      evid
     │ Int64    Float64  Int64?   Float64?  Int64
─────┼─────────────────────────────────────────────
   1 │       1     0.0       249  missing       1
   2 │       1     3.0       249  missing       1
   3 │       1     6.0       249  missing       1
   4 │       1    16.0       201  missing       1
   5 │       1    24.0       201  missing       1
   6 │       1    32.0       201  missing       1
   7 │       1    40.0       201  missing       1
   8 │       1    49.25  missing      1.4       0
   9 │       1   138.0       201  missing       1
  10 │       1   145.5       201  missing       1
  11 │       1   150.0       201  missing       1
  12 │       1   160.0       402  missing       1
  13 │       1   168.0       402  missing       1
  14 │       1   176.0       402  missing       1
  15 │       1   184.0       402  missing       1
  16 │       1   193.33  missing      3.0       0

The same is done for the covariate columns in the code below to check how they look.

# Covariates
covas = [:Smoke, :Heart, :Creatinine, :Age, :Race, :Height, :glyco, :Ethanol, :Weight]
covDF = select(df, [:Subject, covas...]) # New DataFrame with covariates and subject column
covSubGroup = groupby(covDF, :Subject) # Group by subjects
covSubGroup[1] # Show rows from subject
16×10 SubDataFrame
 Row │ Subject  Smoke  Heart  Creatinine  Age    Race   Height  glyco    Ethanol  Weight
     │ Int64    Int64  Int64  Int64       Int64  Int64  Int64   Float64  Int64    Int64
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │       1      0      2           0     60      1      69     0.41        3     106
   2 │       1      0      2           0     60      1      69     0.41        3     106
   3 │       1      0      2           0     60      1      69     0.41        3     106
   4 │       1      0      2           0     60      1      69     0.41        3     106
   5 │       1      0      2           0     60      1      69     0.41        3     106
   6 │       1      0      2           0     60      1      69     0.41        3     106
   7 │       1      0      2           0     60      1      69     0.41        3     106
   8 │       1      0      2           0     60      1      69     0.41        3     106
   9 │       1      0      2           0     60      1      69     1.11        3     106
  10 │       1      0      2           0     60      1      69     1.11        3     106
  11 │       1      0      2           0     60      1      69     1.11        3     106
  12 │       1      0      2           0     60      1      69     1.11        3     106
  13 │       1      0      2           0     60      1      69     1.11        3     106
  14 │       1      0      2           0     60      1      69     1.11        3     106
  15 │       1      0      2           0     60      1      69     1.11        3     106
  16 │       1      0      2           0     60      1      69     1.11        3     106

3.4 Data encoding and standardization

Now, let’s actually prepare the data for the models. We begin by taking only the first row for each subject and dropping the :Subject column.

baselineInput = unique(covDF, :Subject)[:, Not(:Subject)]

All covariates must be coerced into numerically continuous values. Continuous and ordinal discrete variables need not be modified. Nominal categorical variables with \(n\) values are one-hot encoded using \(n - 1\) binary variables. Consider :Race, which has 3 possible values, as an example. Its one-hot encoding involves representing the race of a subject using 2 binary covariates:

  • :Race__1: 1 if and only if the race has the value 1 (Caucasian)
  • :Race__2: 1 if and only if the race has the value 2 (Latin)

The third race (Black) is considered the reference value. The covariate effect of :Race__1 is therefore interpreted as the effect of the Caucasian race relative to the reference Black race. Similarly, the covariate effect of :Race__2 is interpreted as the effect of the Latin race relative to the reference Black race.
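
To make the encoding concrete (independently of the MLJ pipeline used below), the two dummy columns for :Race could be built by hand as follows; the small race vector is made up for illustration and the column names simply mirror the Race__1 and Race__2 columns produced later.

# Hand-rolled one-hot encoding of :Race using n - 1 = 2 dummy columns,
# with Race == 3 (Black) left as the reference level (illustration only)
race = [1, 2, 3, 1, 3]
race_dummies = DataFrame(
    Race__1 = Float64.(race .== 1), # 1 if Caucasian, 0 otherwise
    Race__2 = Float64.(race .== 2), # 1 if Latin, 0 otherwise
)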

After encoding all covariates using continuous values, standardization is performed to ensure all covariates are on the same scale. As examples from the data summaries earlier, ‘glyco’ (\(\alpha_1\)-acid glycoprotein) is in the [0.39, 3.16] mg/dL range, while ‘weight’ is in [41, 119] kg. In LASSO, it is essential for all covariates to have the same scale to ensure their covariate effects are fairly penalized. After standardization, all the covariates will have a mean of 0 and standard deviation of 1. If a covariate has the same value across all subjects (standard deviation of 0), it should be filtered out.
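
For reference, the standardization performed by the pipeline below amounts to the following column-wise transformation; the standardize_col helper and the zero-variance check are a sketch of what Standardizer does, not part of the tutorial's pipeline.

using Statistics

# z-score standardization: subtract the column mean and divide by its
# standard deviation, giving mean 0 and standard deviation 1
standardize_col(col) = (col .- mean(col)) ./ std(col)

# Columns that are constant across subjects have std == 0 and would produce
# NaNs, so they should be dropped before standardizing
nonconstant = [std(col) > 0 for col in eachcol(baselineInput)]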

# Coerce and one-hot encode the nominal categorical Race covariate
ordFacCoerce = inp -> coerce(inp, Symbol("Race") => OrderedFactor)
contCoerce = ContinuousEncoder(; one_hot_ordered_factors = true, drop_last = true)
# Also standardize, since covariates have different units/sizes
preparePipe = ordFacCoerce |> contCoerce |> Standardizer
# Apply transformations in pipeline
machPrepare = machine(preparePipe, baselineInput)
fit!(machPrepare, verbosity = 0)
stdInputs = MLJ.transform(machPrepare, baselineInput)
# Check results
describe(stdInputs, :mean, :std, :min, :max)
10×5 DataFrame
 Row │ variable    mean          std      min        max
     │ Symbol      Float64       Float64  Float64    Float64
─────┼─────────────────────────────────────────────────────────
   1 │ Smoke        6.04092e-17      1.0  -0.700621  1.41681
   2 │ Heart        2.12249e-17      1.0  -1.05675   1.33855
   3 │ Creatinine   7.9185e-17       1.0  -0.59779   1.66053
   4 │ Age         -4.89804e-18      1.0  -2.79031   2.81836
   5 │ Race__1      9.63282e-17      1.0  -1.41681   0.700621
   6 │ Race__2      4.73477e-17      1.0  -0.586504  1.69248
   7 │ Height       1.96412e-15      1.0  -2.85786   2.77932
   8 │ glyco       -3.60822e-16      1.0  -1.72042   4.27123
   9 │ Ethanol      1.26533e-16      1.0  -0.653312  2.21283
  10 │ Weight       1.68166e-16      1.0  -2.46618   2.51976

Finally, let’s combine all the standardized covariates together in a single column of vectors \(x\):

x_names = names(stdInputs)
stdInputs.x = Vector{Float64}.(eachrow(stdInputs))
stdInputs.Subject = unique(df.Subject)
cov_df = leftjoin(df, stdInputs[:, [:Subject, :x]], on = :Subject)

4 Covariate-free model

4.1 Model definition

Based on (Verme et al. 1992), below is the definition of a Pumas PK model with one compartment, first-order absorption and elimination, 3 fixed-effect parameters (clearance tvcl, central volume tvvc, and absorption rate constant tvka), associated random effects, and an additive residual error term (\(\sigma\)). Additionally, reference parameter values were extracted from (Pinheiro and Bates 2000) to be used as initial parameter values when fitting.

model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end
# Parameters
params = (;
    tvcl = exp(2.5), # 12.18
    tvvc = exp(5.4), # 221
    tvka = exp(-0.21), # 0.81
    Ω = [0.056 0.0 0.0; 0.0 0.03 0.0; 0.0 0.0 0.01],
    σ = 0.0123,
)

The above model does not include any covariates yet. It is generally a good idea to fit such a model first and obtain values for the population parameters before introducing covariates. The optimal population parameter values when fitting the covariate-free model can then be used as initial parameter values when fitting the model with covariates included.

4.2 Pumas population

To fit the above Pumas model to the data, the data needs to be converted to a Pumas Population using the read_pumas function:

pop = read_pumas(
    df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    observations = [:conc],
)
Population
  Subjects: 136
  Observations: conc

Note that we created the columns cmt and evid earlier because they are required by read_pumas.

4.3 Model fit

Now, we fit the model using the FOCE algorithm:

fpm = fit(model, pop, params, FOCE(), optim_options = (; show_trace = false))
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
FittedPumasModel

Successful minimization:                      true

Likelihood approximation:                     FOCE
Likelihood Optimizer:                         BFGS
Dynamical system type:                 Closed form

Log-likelihood value:                   -611.21547
Number of subjects:                            136
Number of parameters:         Fixed      Optimized
                                  0             10
Observation records:         Active        Missing
    conc:                       361             82
    Total:                      361             82

-------------------
         Estimate
-------------------
tvcl      9.5556
tvvc     35.468
tvka      0.42803
Ω₁,₁      0.096751
Ω₂,₁     -0.099672
Ω₃,₁     -0.17099
Ω₂,₂      0.18458
Ω₃,₂      0.14476
Ω₃,₃      0.32631
σ         1.1189
-------------------

5 LASSO model

5.1 Laplace prior

The LASSO objective function value to be minimized is:

\[ -\log p(y \mid x, \beta) + \lambda \cdot \left\lVert \beta \right\rVert_1 \]

where \(\beta\) is the set of covariate effects in the model and \(p(y \mid x, \beta)\) is the marginal likelihood of the model given the population data. An equivalent way to implement LASSO is using a Laplace prior distribution on \(\beta\) and using maximum a-posteriori (MAP) estimation to optimize for \(\beta\).

\[ \begin{aligned} \beta_i & \sim \text{Laplace}(0, b) \\ -\log p(\beta) & = \sum_{i=1}^{n_\beta} \Bigg( \log (2 \cdot b) + \frac{|\beta_i|}{b} \Bigg) \\ & = \underbrace{n_\beta \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_\beta} |\beta_i|}_{\left\lVert \beta \right\rVert_1} \\ & = \lambda \cdot \left\lVert \beta \right\rVert_1 + \text{constant} \end{aligned} \]

Therefore, by setting the parameter \(b\) of the Laplace prior, one can control the LASSO penalty parameter \(\lambda = \frac{1}{b}\).
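
A quick numerical check of this equivalence can be done with Distributions.jl; the scale \(b\) and the \(\beta\) vector below are arbitrary. Up to an additive constant that does not depend on \(\beta\), the negative log prior equals \(\lambda \left\lVert \beta \right\rVert_1\) with \(\lambda = 1/b\).

using Distributions, LinearAlgebra

b = 0.5 # Laplace scale, so λ = 1 / b = 2
β = [0.3, -1.2, 0.0, 0.7]
# negative log prior, summed element-wise over the Laplace components
neg_log_prior = -sum(x -> logpdf(Laplace(0.0, b), x), β)
penalty = (1 / b) * norm(β, 1)
constant = length(β) * log(2 * b)
neg_log_prior ≈ penalty + constant # true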

5.2 Pumas model

To implement the above prior in a Pumas model, the following model can be used:

λ = 1.0 # LASSO penalty parameter
# multi-variate Laplace distribution
function mvlaplace(λ)
    return product_distribution(fill(Distributions.Laplace(0.0, 1 / λ), 10))
end

model_map = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
        βcl ~ mvlaplace(λ)
        βvc ~ mvlaplace(λ)
        βka ~ mvlaplace(λ)
    end
    @covariates x
    @random η ~ MvNormal(Ω)
    @pre begin
        CL = tvcl * exp(η[1] + dot(x, βcl))
        Vc = tvvc * exp(η[2] + dot(x, βvc))
        Ka = tvka * exp(η[3] + dot(x, βka))
    end
    @dynamics Depots1Central1
    @derived conc ~ @. Normal(Central / Vc, σ)
end

In the above model, the standardized covariates of each subject are assumed to be collected in a single vector \(x\) of length 10.

5.3 Pumas population with covariates

cov_pop = read_pumas(
    cov_df;
    id = :Subject,
    time = :time,
    amt = :dose,
    cmt = :cmt,
    evid = :evid,
    covariates = [:x],
    observations = [:conc],
)
Population
  Subjects: 136
  Covariates: x
  Observations: conc

5.4 Fitting the model

To fit the above model using MAP estimation, call:

# initial parameter values
map_params = merge(init_params(model_map), coef(fpm))
fpm_map =
    fit(model_map, cov_pop, map_params, MAP(FOCE()), optim_options = (; show_trace = false))
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Optim/ZhuZN/src/multivariate/optimize/optimize.jl:98
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Optim/ZhuZN/src/multivariate/optimize/optimize.jl:98
FittedPumasModel

Successful minimization:                      true

Likelihood approximation:               MAP (FOCE)
Likelihood Optimizer:                         BFGS
Dynamical system type:                 Closed form

Log-likelihood value:                   -582.65735
Number of subjects:                            136
Number of parameters:         Fixed      Optimized
                                  0             40
Observation records:         Active        Missing
    conc:                       361             82
    Total:                      361             82

---------------------
          Estimate
---------------------
tvcl       9.9191
tvvc      24.675
tvka       0.29697
Ω₁,₁       0.075289
Ω₂,₁      -0.071577
Ω₃,₁      -0.078237
Ω₂,₂       0.095279
Ω₃,₂       0.15054
Ω₃,₃       0.29432
σ          1.0528
βcl₁      -0.0074482
βcl₂      -0.05125
βcl₃      -0.053056
βcl₄       0.046466
βcl₅       0.12498
βcl₆       0.11308
βcl₇       0.0018384
βcl₈      -0.12288
βcl₉       0.0063419
βcl₁₀      0.045607
βvc₁      -0.027769
βvc₂       0.18966
βvc₃       0.072418
βvc₄       0.025773
βvc₅      -0.29648
βvc₆       0.11586
βvc₇       0.06108
βvc₈      -0.0064235
βvc₉      -0.25783
βvc₁₀     -0.0011689
βka₁       9.8459e-5
βka₂       0.053088
βka₃       0.066929
βka₄       0.11245
βka₅       4.4001e-6
βka₆       0.40126
βka₇      -0.16549
βka₈       0.15886
βka₉      -0.13688
βka₁₀     -0.062274
---------------------

The following function summarizes the covariate effects and sorts them by their absolute value for any input variable var, where var is either :CL, :Vc or :Ka. A larger absolute covariate effect indicates a higher relative importance.

function sort_by_abs(var::Symbol)
    coef_map = coef(fpm_map)
    effect_summary = DataFrame(
        :covariate => x_names,
        :CL => coef_map.βcl,
        :Vc => coef_map.βvc,
        :Ka => coef_map.βka,
    )
    sort!(effect_summary, [var], by = abs, rev = true)
    return effect_summary[:, [:covariate, var]]
end
println(sort_by_abs(:CL), "\n")
println(sort_by_abs(:Vc), "\n")
println(sort_by_abs(:Ka), "\n");
10×2 DataFrame
 Row │ covariate   CL
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__1      0.12498
   2 │ glyco       -0.122885
   3 │ Race__2      0.113076
   4 │ Creatinine  -0.0530562
   5 │ Heart       -0.0512502
   6 │ Age          0.0464655
   7 │ Weight       0.0456074
   8 │ Smoke       -0.0074482
   9 │ Ethanol      0.00634193
  10 │ Height       0.00183844

10×2 DataFrame
 Row │ covariate   Vc
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__1     -0.296484
   2 │ Ethanol     -0.257833
   3 │ Heart        0.18966
   4 │ Race__2      0.115858
   5 │ Creatinine   0.0724178
   6 │ Height       0.0610796
   7 │ Smoke       -0.0277692
   8 │ Age          0.0257734
   9 │ glyco       -0.00642353
  10 │ Weight      -0.00116895

10×2 DataFrame
 Row │ covariate   Ka
     │ String      Float64
─────┼─────────────────────────
   1 │ Race__2      0.401256
   2 │ Height      -0.165494
   3 │ glyco        0.158863
   4 │ Ethanol     -0.136882
   5 │ Age          0.112455
   6 │ Creatinine   0.0669291
   7 │ Weight      -0.062274
   8 │ Heart        0.0530876
   9 │ Smoke        9.84587e-5
  10 │ Race__1      4.40006e-6

Covariates with an effect whose absolute value is less than a user-specified threshold, e.g. 1e-4, can be dropped from the model. A higher threshold can be used if a smaller model is required.
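
As a sketch of this pruning step (the threshold value is arbitrary), the clearance covariates surviving the cut can be extracted from the summary returned by sort_by_abs:

# Keep only the covariate effects on CL whose absolute value exceeds the threshold
threshold = 1e-4
cl_effects = sort_by_abs(:CL)
selected_cl = filter(row -> abs(row.CL) > threshold, cl_effects)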

6 Cross-validation and hyper-parameter selection

6.1 Effect of \(\lambda\)

In the above fit, the value for \(\lambda\) was set to 1.0 upfront. However, that value is a hyper-parameter which needs to be objectively selected. A higher \(\lambda\) has the following effects:

  • Covariate effects are more likely to be sparse, and
  • The model is more likely to under-fit.

A lower \(\lambda\) has the opposite effects where:

  • Covariate effects are less likely to be sparse, and
  • The model is more likely to over-fit.

The optimal value of \(\lambda\) is one that maximizes the model’s ability to fit unseen data, not used to fit the model, thus avoiding both under-fitting and over-fitting. To quantify this ability, cross-validation is typically used.

Note that instead of using a single \(\lambda\), one may also use a different \(\lambda\) for each of the covariate effects on CL, Vc and Ka. This will likely give a better model at the end. However, for simplicity, we use a single \(\lambda\) in this tutorial.
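
For reference, such a model could be built by passing a separate Laplace scale to each prior in the @param block; the sketch below mirrors the model above, and the hyperparameters λcl, λvc, and λka are hypothetical values that would each need to be tuned.

# Sketch: one LASSO penalty per PK parameter (not used in the rest of this tutorial)
function model_with_separate_penalties(λcl, λvc, λka)
    return @model begin
        @param begin
            tvcl ∈ RealDomain(; lower = 0)
            tvvc ∈ RealDomain(; lower = 0)
            tvka ∈ RealDomain(; lower = 0)
            Ω ∈ PSDDomain(3)
            σ ∈ RealDomain(; lower = 0)
            βcl ~ mvlaplace(λcl) # penalty λcl on the clearance covariate effects
            βvc ~ mvlaplace(λvc) # penalty λvc on the volume covariate effects
            βka ~ mvlaplace(λka) # penalty λka on the absorption covariate effects
        end
        @covariates x
        @random η ~ MvNormal(Ω)
        @pre begin
            CL = tvcl * exp(η[1] + dot(x, βcl))
            Vc = tvvc * exp(η[2] + dot(x, βvc))
            Ka = tvka * exp(η[3] + dot(x, βka))
        end
        @dynamics Depots1Central1
        @derived conc ~ @. Normal(Central / Vc, σ)
    end
end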

6.2 Cross-validation

One common strategy to split the data in the context of hyperparameter search is cross-validation (CV), which Figure 2 illustrates. The hyperparameter optimization procedure can be summarized as follows:

  1. Split the data into \(k\) ‘folds’ (partitions) of roughly equal size (a minimal index-splitting sketch follows this list).

  2. For each value of \(\lambda\):

    1. For each fold of the \(k\) folds:

      1. Assign the selected fold as the validation data, and use the remaining folds for fitting.
      2. Fit/train the model using the training folds and the selected value of \(\lambda\).
      3. Evaluate the fitted model’s performance on the validation fold, e.g. using the log likelihood of the fitted model given the validation data.
    2. Take the average performance metric across all folds.

  3. Use an optimization or search strategy to choose the best \(\lambda\) that maximizes the average model performance on unseen data.

  4. Re-fit the model using the optimal \(\lambda\) and the entire dataset.
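
As a minimal sketch of the splitting in step 1 (the tutorial itself relies on Pumas.cv_populations below, which handles this for Pumas populations), subject indices could be partitioned as follows; the fold and subject counts are just examples.

# Split n subjects into K folds of roughly equal size (illustration only)
function kfold_indices(n, K)
    perm = Random.shuffle(1:n)
    return [perm[i:K:n] for i = 1:K]
end

folds = kfold_indices(136, 5) # 5 folds over 136 subjects
val_idx = folds[1] # validation subjects for the first fold
train_idx = reduce(vcat, folds[2:end]) # training subjects for the first fold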

Figure 2: Usage of data splits across iterations of cross-validation.

K-fold cross-validation can also be used without changing hyperparameters between iterations to get a measure of a model’s performance on unseen data for model comparison.

While there is generally no easy way to select the number of folds K (Kohavi 2001), values from 5 to 10 are common. Higher values increase the computational cost of the procedure, because more fits need to be done, but reduce the bias when estimating the model’s performance. Lower K values are faster to run but more biased.

6.3 Select \(\lambda\)

The cross-validation procedure presented above will be applied repeatedly to determine the best value of \(\lambda\), using the average log likelihood on the validation data as the metric to compare the models.

First, some setup is required. In the following code, the function createModel is defined for code organization. It returns a Pumas model using a provided \(\lambda\). Then, Pumas.cv_populations is used to split the population. In the split object returned, the fields train_data and test_data are vectors of populations. The former contains 7 training populations, one per cross-validation fold; and the latter, the respective 7 test populations. Lastly, the values of \(\lambda\) to be tried are defined in lambdas, spanning 1e-6 to 1e2 in steps of two orders of magnitude.

function createModel(λ)
    return @model begin
        @param begin
            tvcl ∈ RealDomain(; lower = 0)
            tvvc ∈ RealDomain(; lower = 0)
            tvka ∈ RealDomain(; lower = 0)
            Ω ∈ PSDDomain(3)
            σ ∈ RealDomain(; lower = 0)
            βcl ~ mvlaplace(λ)
            βvc ~ mvlaplace(λ)
            βka ~ mvlaplace(λ)
        end
        @covariates x
        @random η ~ MvNormal(Ω)
        @pre begin
            CL = tvcl * exp(η[1] + dot(x, βcl))
            Vc = tvvc * exp(η[2] + dot(x, βvc))
            Ka = tvka * exp(η[3] + dot(x, βka))
        end
        @dynamics Depots1Central1
        @derived conc ~ @. Normal(Central / Vc, σ)
    end
end
# Split population
split = Pumas.cv_populations(cov_pop, KFold(; K = 7), BySubject())
(; train_data, test_data) = split
lambdas = [10^i for i = -6.0:2.0:3.0]
5-element Vector{Float64}:
   1.0e-6
   0.0001
   0.010000000000000002
   1.0
 100.0

With these definitions, we are able to run the procedure. For optimization, we use a simple log-scale grid search technique, where we iterate over pre-specified \(\lambda\) values. For each \(\lambda\), k-fold cross-validation is done by looping over the train and test pairs of populations in train_data and test_data. In each fold:

  1. The model is built with the current \(\lambda\).
  2. The initial parameters are defined.
  3. The model is fitted to training data.
  4. The validation log likelihood of the fitted model is stored.

For each \(\lambda\), the validation log likelihood is averaged across the \(K\) folds. Folds giving -Inf or NaN validation log likelihood values are treated as unlucky runs and ignored. If the number of such runs is significant, the values of \(\lambda\) producing non-finite validation log likelihoods should be penalized and not selected.

cv_result = map(lambdas) do λ
    # cv with current λ
    println("Cross-validation of λ = $λ")
    map(zip(train_data, test_data)) do (train, test)
        # model with current lambda
        fold_model = createModel(λ)
        fold_params = merge(init_params(fold_model), coef(fpm))
        # fit to current train fold
        fold_fpm = fit(
            fold_model,
            train,
            fold_params,
            MAP(FOCE()),
            optim_options = (; show_trace = false),
            verbose = false,
        )
        # evaluate average loglikelihood in current validation fold
        return loglikelihood(fold_model, test, coef(fold_fpm), FOCE())
    end
end
medFin = mean.(filter.(isfinite, cv_result))
cvDF = DataFrame(λ = lambdas, mean_val_ll = medFin)
Cross-validation of λ = 1.0e-6
Cross-validation of λ = 0.0001
Cross-validation of λ = 0.010000000000000002
Cross-validation of λ = 1.0
Cross-validation of λ = 100.0
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Optim/ZhuZN/src/multivariate/optimize/optimize.jl:98
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Optim/ZhuZN/src/multivariate/optimize/optimize.jl:98
┌ Warning: Terminated early due to NaN in gradient.
└ @ Optim ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Optim/ZhuZN/src/multivariate/optimize/optimize.jl:98
5×2 DataFrame
 Row │ λ        mean_val_ll
     │ Float64  Float64
─────┼──────────────────────
   1 │  1.0e-6     -92.8469
   2 │  0.0001     -92.8468
   3 │  0.01        -92.8382
   4 │  1.0         -92.5988
   5 │ 100.0        -84.7184

As shown in the final DataFrame, increasing \(\lambda\) monotonically improves the model’s performance. Recall that \(\lambda\) is the penalty coefficient in the following objective function, which is minimized when fitting the model:

\[ -\log p(y \mid x, \beta) + \lambda \cdot \left\lVert \beta \right\rVert_1 \]

A high value of \(\lambda = 100\) heavily penalizes all the covariate effects, driving all of their coefficients close to 0, i.e. none of the covariates is selected. This can happen in real analyses where none of the covariates has significant predictive power.

The dataset used above is a real dataset, so we have no control over the result. However, for didactic purposes, let’s simulate data from an assumed covariate model to showcase the next couple of steps that would be carried out in this type of analysis.

Below, simobs uses model_map to simulate observations for cov_pop. The covariate effects (\(\beta\)) of smoke, heart and creatinine are manually set (arbitrarily, with no biological interpretation intended) to impact clearance, volume and absorption, respectively. This way, LASSO will have relationships to identify. The remaining parameters are taken from fpm. The data splits and \(\lambda\) values are then defined.

sims =
    Subject.(
        simobs(
            model_map,
            cov_pop,
            merge(
                (;
                    βcl = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    βvc = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                    βka = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
                ),
                coef(fpm),
            ),
        )
    )
split = Pumas.cv_populations(sims, KFold(; K = 7), BySubject())
(; train_data, test_data) = split
lambdas = [10^i for i = -2.0:2.0]
5-element Vector{Float64}:
   0.010000000000000002
   0.1
   1.0
  10.0
 100.0

Now, the previous steps are repeated to optimize the hyperparameter \(\lambda\) using cross-validation, including taking the average of the validation log likelihood across folds and organizing the results in a DataFrame.

cv_result = map(lambdas) do λ
    # cv with current λ
    println("Cross-validation of λ = $λ")
    map(zip(train_data, test_data)) do (train, test)
        # model with current lambda
        fold_model = createModel(λ)
        fold_params = merge(init_params(fold_model), coef(fpm))
        # fit to current train fold
        fold_fpm = fit(
            fold_model,
            train,
            fold_params,
            MAP(FOCE()),
            optim_options = (; show_trace = false),
            verbose = false,
        )
        # evaluate average loglikelihood in current validation fold
        return loglikelihood(fold_model, test, coef(fold_fpm), FOCE())
    end
end
medFin = mean.(filter.(isfinite, cv_result))
cvDF = DataFrame(λ = lambdas, mean_val_ll = medFin)
Cross-validation of λ = 0.010000000000000002
Cross-validation of λ = 0.1
Cross-validation of λ = 1.0
Cross-validation of λ = 10.0
Cross-validation of λ = 100.0
5×2 DataFrame
 Row │ λ        mean_val_ll
     │ Float64  Float64
─────┼──────────────────────
   1 │   0.01      -118.585
   2 │   0.1       -118.564
   3 │   1.0       -116.5
   4 │  10.0       -107.947
   5 │ 100.0       -125.334

This time, the performance does not improve monotonically with larger \(\lambda\) values. It reaches an optimum at \(\lambda = 10\), then worsens for \(\lambda = 100\). This indicates that regularization is controlling overfitting, but some covariate effects are still relevant to the model’s performance, so suppressing them with a \(\lambda\) that is too large hurts the model’s performance on unseen data. To finish the analysis, the best \(\lambda\) can be used to fit the model to the entire (simulated) dataset. Note that the non-zero covariate effects learned indeed correspond to the true covariate effects assumed above.

bestλ = sort(cvDF, :mean_val_ll; rev = true)[1, :λ]
model_bestλ = createModel(bestλ)
params_bestλ = merge(init_params(model_bestλ), coef(fpm))
fpm_bestλ = fit(
    model_bestλ,
    sims,
    params_bestλ,
    MAP(FOCE()),
    optim_options = (; show_trace = false),
    verbose = false,
)
FittedPumasModel

Successful minimization:                      true

Likelihood approximation:               MAP (FOCE)
Likelihood Optimizer:                         BFGS
Dynamical system type:                 Closed form

Log-likelihood value:                   -735.72061
Number of subjects:                            136
Number of parameters:         Fixed      Optimized
                                  0             40
Observation records:         Active        Missing
    conc:                       443              0
    Total:                      443              0

----------------------
          Estimate
----------------------
tvcl       9.0562
tvvc      35.241
tvka       0.40195
Ω₁,₁       0.030798
Ω₂,₁      -0.048774
Ω₃,₁      -0.068336
Ω₂,₂       0.077248
Ω₃,₂       0.10832
Ω₃,₃       0.15917
σ          1.19
βcl₁       0.9606
βcl₂       0.047617
βcl₃      -5.1505e-7
βcl₄      -0.026219
βcl₅      -0.0069079
βcl₆       9.8418e-7
βcl₇      -3.259e-7
βcl₈       8.0283e-6
βcl₉      -0.076939
βcl₁₀      0.049973
βvc₁       6.8613e-7
βvc₂       0.75209
βvc₃      -0.041448
βvc₄       0.036615
βvc₅       0.059971
βvc₆      -3.6377e-6
βvc₇       1.9082e-7
βvc₈       5.1006e-7
βvc₉       7.2328e-7
βvc₁₀     -4.5441e-7
βka₁       1.9601e-7
βka₂      -0.32809
βka₃       0.9917
βka₄       0.00068273
βka₅       1.1896e-6
βka₆      -1.4441e-8
βka₇      -3.3042e-8
βka₈      -0.0004485
βka₉      -9.478e-7
βka₁₀      0.020677
----------------------

References

Kohavi, Ron. 2001. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection” 14 (March).
Pinheiro, José C., and Douglas M. Bates. 2000. Mixed-Effects Models in s and s-PLUS. Springer.
Ribbing, Jakob, Joakim Nyberg, Ola Caster, and E. Niclas Jonsson. 2007. “The Lasso—a Novel Method for Predictive Covariate Model Building in Nonlinear Mixed Effects Models.” Journal of Pharmacokinetics and Pharmacodynamics 34 (4): 485–517. https://doi.org/10.1007/s10928-007-9057-1.
Verme, Catherine N., Thomas M. Ludden, William A. Clementi, and Steven C. Harris. 1992. “Pharmacokinetics of Quinidine in Male Patients: A Population Analysis.” Clinical Pharmacokinetics 22 (6): 468–80. https://doi.org/10.2165/00003088-199222060-00005.
