```
using PumasPlots
using DeepPumas
using CairoMakie
using Latexify
using StableRNGs
= StableRNG(123)
rng set_theme!(deep_light())
```

# Getting started with DeepNLME

Welcome to the tutorial focusing on the basics of pharmacometric modeling using DeepPumas. DeepPumas is a state-of-the-art tool that unifies statistical, dynamical, and machine learning modeling. Its machine learning capabilities make it uniquely suited for dealing with the complex dynamics often found in pharmacometric data.

This tutorial highlights key features of DeepPumas. You will learn how to model unknown terms in dynamical systems using data-driven techniques and how to improve predictions of longitudinal outcomes by incorporating baseline covariates to individualize these terms. As we go deeper into the practical applications, you will have a broader understanding of how DeepPumas contributes to more accurate and predictive pharmacometric models.

We get started by loading the packages that will be used in the tutorial.

## 1 Generating synthetic data

First, we define a model that we will use to generate a synthetic data set. We will create an indirect response (IDR) model with a Hill-type drug-dependent stimulation. The model will contain between-subject variability from both random effects and covariates. Given only the baseline data, the variability from random effects will be impossible to predict, whereas the variability coming from the covariates should be predictable. We intentionally introduce nonlinear and unconventional covariate effects to ensure that the data set is challenging to model manually, demonstrating the need for DeepPumas’ machine learning capabilities.

```
= @model begin
datamodel @param begin
∈ RealDomain()
tvKa ∈ RealDomain()
tvVc ∈ RealDomain()
tvSmax ∈ RealDomain()
tvSC50 ∈ RealDomain()
tvKout ∈ RealDomain()
Kin ∈ RealDomain()
CL ∈ RealDomain()
n ∈ PDiagDomain(5)
Ω ∈ RealDomain()
σ_pk ∈ RealDomain()
σ_pd end
@random η ~ MvNormal(Ω)
@covariates c1 c2 c3 c4 c5 c6
@pre begin
= tvSmax * exp(η[1]) + 3 * c1 / (10.0 + c1)
Smax = tvSC50 * exp(η[2] + 0.3 * (c2 / 20)^0.75)
SC50 = tvKa * exp(η[3] + 0.3 * c3 * c4)
Ka = tvVc * exp(η[4] + 0.3 * c3)
Vc = tvKout * exp(η[5] + 0.3 * c5 / (c6 + c5))
Kout end
@init R = Kin / Kout
@vars begin
= abs(Central / Vc)
cp = Smax * cp^n / (SC50^n + cp^n)
EFF end
@dynamics begin
' = -Ka * Depot
Depot' = Ka * Depot - (CL / Vc) * Central
Central' = Kin * (1 + EFF) - Kout * R
Rend
@derived begin
~ @. Normal(Central ./ Vc, σ_pk)
yPK ~ @. Normal(R, σ_pd)
yPD end
end;
```

The `@model`

syntax should be fairly simple to understand for an experienced NLME modeler, but we can also automatically generate a set of more readable model equations:

`latexify(datamodel, :dynamics)`

\[\begin{align} \frac{\mathrm{d} \cdot Depot(t)}{\mathrm{d}t} &= - Ka \cdot Depot(t) \\ \frac{\mathrm{d} \cdot Central(t)}{\mathrm{d}t} &= \frac{ - CL \cdot Central(t)}{Vc} + Ka \cdot Depot(t) \\ \frac{\mathrm{d} \cdot R(t)}{\mathrm{d}t} &= Kin \cdot \left( 1 + \frac{\left( \left|\frac{Central(t)}{Vc}\right| \right)^{n} \cdot Smax}{SC50^{n} + \left( \left|\frac{Central(t)}{Vc}\right| \right)^{n}} \right) - Kout \cdot R(t) \end{align}\]

Having defined the `datamodel`

, we now generate a synthetic data set. We then split the dataset into a training set and a validation set. We will use the validation set to evaluate the predictive performance of our fitted model on a set of subjects that the model was not trained on. We also plot some example time courses from the validation set to get a better sense of the data.

```
= (;
data_params = 0.5,
tvKa = 1.0,
tvVc = 0.9,
tvSmax = 0.1,
tvSC50 = 1.2,
tvKout = 1.2,
Kin = 1.0,
CL = 1.5,
n = I(5) * 0.05,
Ω = 1e-1,
σ_pk = 1e-1,
σ_pd
)
= map(1:1520) do i
pop = Subject(
_subj = i,
id = (;
covariates = rand(rng, Gamma(10, 1)),
c1 = rand(rng, Gamma(21, 1)),
c2 = rand(rng, Normal()),
c3 = rand(rng, Normal()),
c4 = rand(rng, Gamma(11, 1)),
c5 = rand(rng, Gamma(11, 1)),
c6
),= DosageRegimen(2.0, ii = 4, addl = 1);
events
)= simobs(datamodel, _subj, data_params; obstimes = 0:15, rng = rng)
sim Subject(sim)
end
## Split the data into different training/test populations
= pop[1:40]
trainpop_small = pop[1:1500]
trainpop_large = pop[(length(trainpop_large)+1):end]
testpop
## Visualize the synthetic data and the predictions of the data-generating model.
= predict(datamodel, testpop, data_params; obstimes = 0:0.1:15);
pred_datamodel
= plotgrid(pred_datamodel[1:6]; observation = :yPK);
plt_pk = plotgrid(pred_datamodel[1:6]; observation = :yPD);
plt_pd
display(plt_pk)
display(plt_pd)
```

With our synthetic data prepared, we can now explore how DeepPumas can identify unknown model dynamics.

## 2 Data-driven identification of model dynamics

Having generated the data, let us now predend that we do not know anything about the pharmacodynamics it. Traditional approaches to such lack of knowledge would include a significant amount of educated guesswork based on domain expertise. Rather than guessing and developing models by trial and error, DeepPumas enables data-driven identification of the parts of the model that we do not know. We will start by writing down the parts of the model that we feel pretty confident about.

```
= @model begin
known_model_parts @param begin
∈ RealDomain(; lower = 0)
tvKa ∈ RealDomain(; lower = 0)
tvVc ∈ RealDomain(; lower = 0)
CL ∈ PDiagDomain(2)
Ω ∈ RealDomain(; lower = 0)
σ_pk end
@random begin
~ MvNormal(Ω)
η end
@pre begin
= tvKa * exp(η[1])
Ka = tvVc * exp(η[2])
Vc end
@dynamics begin
' = -Ka * Depot
Depot' = Ka * Depot - (CL / Vc) * Central
Central# ???
end
@derived begin
~ @. Normal(Central ./ Vc, σ_pk)
yPK # yPD ~ @. Normal(???, σ_pd)
end
end;
```

Recall that we want to capture the unknown parts of the model in a data-driven manner. More specifically, we will use a neural network for this task. We can define a neural network by using the `MLPDomain`

constructor (MLP stands for multi-layer perceptron). For example

`= MLPDomain(4, 4, (1, identity); reg = L2(1.0)) nn `

defines a stand-alone, callable MLP with 4 inputs, 4 hidden units, and a single output. The `reg`

argument specifies the regularization of neural network parameters. Here we use the L2 regularization with a regularization constant of 1. For a more detailed description of the `MLPDomain`

, please see the DeepPumas documentation.

In addition to using `MLPDomain`

s as stand-alone functions, we can define them as part of a `@model`

, allowing for automatic discovery of functional relationships within the model. We start by defining a covariate-free model with an `MLPDomain`

. We then embed the neural network in the derivative of the drug sensitive dynamical variable `R`

which drives the PD observations we are interested in. The rate of change of `R`

is assumed to depend on the drug concentration in the `Central`

compartment and `R`

itself, but not the `Depot`

compartment. Therefore, `Central / Vc`

and `R`

are inputs to the neural network `NN`

. Furthermore, we assume that the initial concentration `R₀`

(which is unique to each patient) may influence the rate of change of `R`

, so we include it as another input to the neural network `NN`

. Lastly, we include three random effects with normal priors as inputs to the neural network to produce individualization of `NN`

.

```
= @model begin
model @param begin
# PK parameters
∈ RealDomain(; lower = 0)
tvKa ∈ RealDomain(; lower = 0)
tvVc ∈ RealDomain(; lower = 0)
tvR₀ ∈ RealDomain(; lower = 0)
CL ∈ PSDDomain(3)
Ω ∈ RealDomain(; lower = 0, init = 10.0)
σ_pk
# PD parameters
## Define a multi-layer perceptron (a neural network) which maps from 6 inputs
## (2 state variables + 4 individual parameters) to a single output.
## Apply L2 regularization (equivalent to a Normal prior).
∈ MLPDomain(6, 6, 5, (1, identity); reg = L2(1.0))
NN ∈ PDiagDomain(; init = 0.1 .* I(3))
Ω_nn ∈ RealDomain(; lower = 0, init = 10.0)
σ_pd end
@random begin
~ MvNormal(Ω)
η ~ MvNormal(Ω_nn)
η_nn end
@pre begin
= tvKa * exp(η[1])
Ka = tvVc * exp(η[2])
Vc = tvR₀ * exp(η[3])
R₀ = fix(NN, R₀, η_nn)
iNN end
@init begin
= R₀
R end
@dynamics begin
' = -Ka * Depot
Depot' = Ka * Depot - (CL / Vc) * Central
Central' = iNN(Central, R)[1]
Rend
@derived begin
~ @. Normal(Central / Vc, σ_pk)
yPK ~ @. Normal(R, σ_pd)
yPD end
end
```

Neural networks can arbitrarily transform their inputs, including inputs from random effects. Therefore, it is not helpful to introduce complex, or even parametric, random effect distributions when these random effects are only used as inputs to a neural network. The neural network can transform our input distribution to whatever potentially complex distribution we need in order to best explain the between-subject variability in the data. So, the choice of a prior random effect distribution matters little for our ability to account for between-subject variability. Instead, we can base our choice of prior on other factors.

- Neural networks are easier to fit if their inputs are standardized. Machine learning practitioners frequently use z-score standardization to transform input to have zero mean and unit standard deviation. Standardization reduces the risk of the first neural network layer becoming saturated, ending up with values that are either too high or too low for a chosen activation function. Saturation results in small gradients which impede model training.
- Usage of a diagonal covariance matrix in the prior penalizes covariance between random effects in the posterior during model fitting. This incentivises the neural network to adopt a functional form wherein the random effect inputs have orthogonal effects on the output. If the random effects describe orthogonal dimensions of the between-subject variability then they are easier to work with later on. Fitting of empirical Bayes estimates (EBEs) becomes more robust, and it becomes easier to use the posterior distribution of the random effects for covariate modeling.

```
= fit(
fpm
model,
trainpop_small,sample_params(model; rng = rng),
MAP(FOCE());
= (; iterations = 125, show_trace = false),
optim_options );
```

`MAP`

?
We use maximum aposteriori (`MAP`

) fitting when we want to include priors on our fixed effects. The parameters (weights and biases) of our neural networks are all fixed effects of the NLME model and their `L1`

or `L2`

regularization are priors. Without `MAP`

we would not be regularizing our neural network.

`FOCE`

to maximize the *marginal*likelihood?

We fit the model by maximizing an approximation of the *marginal* likelihood and this is *very* important for the generalization performance of our fitted model. The marginal likelihood \(L_m\) defined as \[
L_m(\theta, z) = \int L_c(\theta, z, \eta) \cdot p_\eta(\eta \vert \theta) d\eta
\] integrates out the random effects and represents a score of how well the model fits on average, across a range of values of the random effects that are plausible for a given the prior distribution \(p_\eta(\eta \vert \theta)\).

Had we maximized the *conditional* likelihood \(L_c\) or the conditional maximum aposteriori (`JointMAP()`

in Pumas), then we would have allowed an almost complete independence of the values that \(\eta\) takes on for each subject making overfitting much more likely. The marginal likelihood penalizes overfitting since an overfitted \(L_c\) would be sensitive to the precise input of \(\eta\) and would have a very small area-under-the-curve (the integral) that we are using when computing the marginal likelihood. Marginal likelihood imparts a measure of robustness to the variation of random effects.

`FOCE`

is method for approximating the marginal likelihood without having to evaluate this whole integral. It makes assumptions that may be dubious but it still yields good results across a wide range of models (including most DeepPumas models we have tried). We DeepPumas scientists default to it as a matter of pragmatism and only depart from it in the rare cases where the model fit does not succeed with delivering anything that predicts and behaves well.

Model fitting has succeeded if the population and individual predictions match the data well, we confirm this by inspecting the prediction plots

```
= predict(fpm, testpop; obstimes = 0:0.1:10);
pred plotgrid(pred; observation = :yPD)
```

## 3 Sequential covariate modeling

So far our model is capable of fitting the dynamics of the data by learning, among other things, an unknown dynamical term. It also captures the between-subject variability but it does so without using the information provided by the baseline covariates. If there was a way to predict the values of a patient’s empirical bayes estimates from their baseline covariates that would improve our ability to make longitudinal predictions, especially for new patients. Fortunately, `DeepPumas`

implements a machine learning based covariate modeling workflow (for a more detailed description see the documentation on sequential covariate modeling). First of all, we have to `preprocess`

our data by extracting the relevant covariate \(\to\) random effect values mapping which can be done directly on the model we fitted:

`preprocess(fpm)`

However, using this target, which contains only \(40\) patients, would result in a rather small improvement. Different model parts require different amounts of data in order to robustly separate the signal from the noise. One might think that a neural network inserted into a dynamical system would be harder to fit than a neural network that maps a vector of covariates to a vector of targets but that is not always the case. When we fit the parameters of a dynamical system, every data point over time for every patient can provide information about the model parameters. However, when we fit a model that maps baseline covariate model to empirical Bayes estimates, we are trying to find time-independent features of a patient. The number of observations over time does not affect the total number of data points we can utilize to discover the covariate mapping. To find a good, individualizable, dynamical model, we did not need more than \(40\) patients, we might have done well even with less than \(10\). Conversely, we need data from more patients to find an accurate nonlinear relationship between the covariates and the empirical Bayes estimates. Thus, we create a target mapping from `trainpop_large`

, a population of \(1,500\) patients.

`= preprocess(model, trainpop_large, coef(fpm), FOCE()) target `

We then define a neural network `nn`

to model that relationship, fit It, and `augment`

the original fitted NLME model with the fitted covariate model.

```
= MLPDomain(numinputs(target), 10, 10, (numoutputs(target), identity); reg = L2(1e-3))
nn = fit(nn, target, optim_options = (; rng = rng))
fitted_nn = augment(fpm, fitted_nn); augmented_fpm
```

Since some of the between-subject variability in the data generating model comes from random effects, there is *no way* of perfectly predicting the patient outcomes from the covariates. The best we can do is if our augmented model makes the same population prediction as the data generating model when applied to validation data. This would indicate that we have identified an accurate mapping between the covariates and the longitudinal patient outcomes.

```
= predict(augmented_fpm, testpop; obstimes = 0:0.1:10)
pred_augment plotgrid(pred_augment; ipred = false, observation = :yPD)
plotgrid!(
pred_datamodel,= false,
ipred = (;
pred = "Best possible pred",
label = :black,
color = :dash,
linestyle = 2,
linewidth
),= :yPD,
observation )
```

To summare our sequential workflow, we first captured the between-subject variability in random effects. Secondly, we used each patients’ baseline covariates and empirical Bayes estimates to generate a target for further machine learning fitting. Note that the second step happened outside of the NLME model. We simply found a machine learning model that maps from a vector of inputs to a vector of outputs. In this example we used numbers but, in principle, any covariate type would work - images, omics, anything you have. The only thing you would need to adjust is the machine learning model such that it could appropriately map from the chosen covariate type to a vector of numerical outputs.

## 4 Refitting the prior random effect distribution

We have now augmented our fitted NLME model to capture some of the between-subject variability from patient covariates rather than only with random effects. It would therefore be a good idea to re-fit \(\Omega\) and \(\Omega_{nn}\) that control the prior distributions of the random effects. Typically, this prior would no longer be quite as broad as it was before the model augmentation.

```
= fit(
fpm_refit_Ω
augmented_fpm.model,
trainpop_small,coef(augmented_fpm),
MAP(FOCE());
= (; show_trace = false),
optim_options # Fit only the Ω parameters
= filter(i -> !(i in [:Ω, :Ω_nn]), keys(coef(augmented_fpm))),
constantcoef );
```

## 5 Fully joint model fine tuning

Above, you saw a highly successful sequential approach to model fitting. We first fit a model to capture the dynamics and individual variability across our data set. We then fit a covariate model to personalize baseline predictions based on patient characteristics. We finally merged the two sequentially fitted models. The sequential approach typically already performs very well, but we *can* go further and fine tune the entire joint model. However, we have quite a lot of parameters in the augmented NLME model and fitting would take a long time.

```
= fit(
fpm_deep
fpm_refit_Ω.model,
trainpop_small,coef(fpm_refit_Ω),
MAP(FOCE());
= (; time_limit = 5 * 60, show_trace = false),
optim_options );
```

In this case, the benefits of joint fitting would be rather negligible since we both terminated the fit early and we started from the sequentially fitted results which was already very good.

## 6 Classical NLME diagnostics and visual predictive check

Once you have trained your DeepNLME model using DeepPumas, you can run all the diagnostic tools available in Pumas including the visual predictive check (VPC) to further probe the quality of the fitted model. As shown below, the augmented model displays good performance in the diagnostics.

```
= inspect(fpm_refit_Ω)
ins_refit_Ω goodness_of_fit(ins_refit_Ω, figure = (; size = (1000, 1000)))
```

```
= vpc(fpm_refit_Ω)
_vpc vpc_plot(_vpc, figurelegend = (; position = :b, orientation = :horizontal, nbanks = 3))
```

## 7 Conclusion

As shown, DeepPumas enables flexible, nested composition of dynamical, statistical and machine learning modeling. It can be used to model missing dynamical elements in a data-driven way, resulting in prediction of hetereogeneous outcomes across different individuals. Furthermore, the DeepPumas can be used to map the relationship between baseline covariates and patient parameters, enhancing the longitudinal outcome predictions. While no model can perfectly predict future outcomes given imperfect data, the DeepPumas model achieves the best possible predictions, closely matching the predictions made by the data-generating model itself.