Getting started with DeepNLME

Author

Niklas Korsbo

Welcome to this tutorial on the basics of pharmacometric modeling using DeepPumas. DeepPumas is a state-of-the-art tool that unifies statistical, dynamical, and machine learning modeling. Its machine learning capabilities make it uniquely suited for dealing with the complex dynamics often found in pharmacometric data.

This tutorial highlights key features of DeepPumas. You will learn how to model unknown terms in dynamical systems using data-driven techniques and how to improve predictions of longitudinal outcomes by incorporating baseline covariates to individualize these terms. As we work through the practical applications, you will gain a broader understanding of how DeepPumas contributes to more accurate and predictive pharmacometric models.

We get started by loading the packages that will be used in the tutorial.

using PumasPlots
using DeepPumas
using CairoMakie
using Latexify
using StableRNGs
rng = StableRNG(123)
set_theme!(deep_light())

1 Generating synthetic data

First, we define a model that we will use to generate a synthetic data set. We will create an indirect response (IDR) model with a Hill-type drug-dependent stimulation. The model will contain between-subject variability from both random effects and covariates. Given only the baseline data, the variability from random effects will be impossible to predict, whereas the variability coming from the covariates should be predictable. We intentionally introduce nonlinear and unconventional covariate effects to ensure that the data set is challenging to model manually, demonstrating the need for DeepPumas’ machine learning capabilities.

datamodel = @model begin
    @param begin
        tvKa ∈ RealDomain()
        tvVc ∈ RealDomain()
        tvSmax ∈ RealDomain()
        tvSC50 ∈ RealDomain()
        tvKout ∈ RealDomain()
        Kin ∈ RealDomain()
        CL ∈ RealDomain()
        n ∈ RealDomain()
        Ω ∈ PDiagDomain(5)
        σ_pk ∈ RealDomain()
        σ_pd ∈ RealDomain()
    end
    @random η ~ MvNormal(Ω)
    @covariates c1 c2 c3 c4 c5 c6
    @pre begin
        Smax = tvSmax * exp(η[1]) + 3 * c1 / (10.0 + c1)
        SC50 = tvSC50 * exp(η[2] + 0.3 * (c2 / 20)^0.75)
        Ka = tvKa * exp(η[3] + 0.3 * c3 * c4)
        Vc = tvVc * exp(η[4] + 0.3 * c3)
        Kout = tvKout * exp(η[5] + 0.3 * c5 / (c6 + c5))
    end
    @init R = Kin / Kout
    @vars begin
        cp = abs(Central / Vc)
        EFF = Smax * cp^n / (SC50^n + cp^n)
    end
    @dynamics begin
        Depot' = -Ka * Depot
        Central' = Ka * Depot - (CL / Vc) * Central
        R' = Kin * (1 + EFF) - Kout * R
    end
    @derived begin
        yPK ~ @. Normal(Central ./ Vc, σ_pk)
        yPD ~ @. Normal(R, σ_pd)
    end
end;

The @model syntax should be fairly simple to understand for an experienced NLME modeler, but we can also automatically generate a set of more readable model equations:

latexify(datamodel, :dynamics)

\[\begin{align} \frac{\mathrm{d} Depot(t)}{\mathrm{d}t} &= - Ka \cdot Depot(t) \\ \frac{\mathrm{d} Central(t)}{\mathrm{d}t} &= \frac{ - CL \cdot Central(t)}{Vc} + Ka \cdot Depot(t) \\ \frac{\mathrm{d} R(t)}{\mathrm{d}t} &= Kin \cdot \left( 1 + \frac{\left( \left|\frac{Central(t)}{Vc}\right| \right)^{n} \cdot Smax}{SC50^{n} + \left( \left|\frac{Central(t)}{Vc}\right| \right)^{n}} \right) - Kout \cdot R(t) \end{align}\]

Having defined the datamodel, we now generate a synthetic data set. We then split the dataset into a training set and a validation set. We will use the validation set to evaluate the predictive performance of our fitted model on a set of subjects that the model was not trained on. We also plot some example time courses from the validation set to get a better sense of the data.

data_params = (;
    tvKa = 0.5,
    tvVc = 1.0,
    tvSmax = 0.9,
    tvSC50 = 0.1,
    tvKout = 1.2,
    Kin = 1.2,
    CL = 1.0,
    n = 1.5,
    Ω = I(5) * 0.05,
    σ_pk = 1e-1,
    σ_pd = 1e-1,
)

pop = map(1:1520) do i
    _subj = Subject(
        id = i,
        covariates = (;
            c1 = rand(rng, Gamma(10, 1)),
            c2 = rand(rng, Gamma(21, 1)),
            c3 = rand(rng, Normal()),
            c4 = rand(rng, Normal()),
            c5 = rand(rng, Gamma(11, 1)),
            c6 = rand(rng, Gamma(11, 1)),
        ),
        events = DosageRegimen(2.0, ii = 4, addl = 1);
    )
    sim = simobs(datamodel, _subj, data_params; obstimes = 0:15, rng = rng)
    Subject(sim)
end

## Split the data into different training/test populations
trainpop_small = pop[1:40]
trainpop_large = pop[1:1500]
testpop = pop[(length(trainpop_large)+1):end]

## Visualize the synthetic data and the predictions of the data-generating model.
pred_datamodel = predict(datamodel, testpop, data_params; obstimes = 0:0.1:15);

plt_pk = plotgrid(pred_datamodel[1:6]; observation = :yPK);
plt_pd = plotgrid(pred_datamodel[1:6]; observation = :yPD);

display(plt_pk)
display(plt_pd)

Synthetic data of drug concentration over time, along with population predictions and individual predictions of the data-generating model.

The corresponding synthetically generated pharmacodynamic outcome of interest.

With our synthetic data prepared, we can now explore how DeepPumas can identify unknown model dynamics.

2 Data-driven identification of model dynamics

Having generated the data, let us now pretend that we do not know anything about the pharmacodynamics behind it. Traditional approaches to such lack of knowledge would include a significant amount of educated guesswork based on domain expertise. Rather than guessing and developing models by trial and error, DeepPumas enables data-driven identification of the parts of the model that we do not know. We will start by writing down the parts of the model that we feel pretty confident about.

known_model_parts = @model begin
    @param begin
        tvKa ∈ RealDomain(; lower = 0)
        tvVc ∈ RealDomain(; lower = 0)
        CL ∈ RealDomain(; lower = 0)
        Ω ∈ PDiagDomain(2)
        σ_pk ∈ RealDomain(; lower = 0)
    end
    @random begin
        η ~ MvNormal(Ω)
    end
    @pre begin
        Ka = tvKa * exp(η[1])
        Vc = tvVc * exp(η[2])
    end
    @dynamics begin
        Depot' = -Ka * Depot
        Central' = Ka * Depot - (CL / Vc) * Central
        # ???
    end
    @derived begin
        yPK ~ @. Normal(Central ./ Vc, σ_pk)
        # yPD ~ @. Normal(???, σ_pd)
    end
end;

Recall that we want to capture the unknown parts of the model in a data-driven manner. More specifically, we will use a neural network for this task. We can define a neural network by using the MLPDomain constructor (MLP stands for multi-layer perceptron). For example

nn = MLPDomain(4, 4, (1, identity); reg = L2(1.0))

defines a stand-alone, callable MLP with 4 inputs, 4 hidden units, and a single output. The reg argument specifies the regularization of the neural network parameters. Here, we use L2 regularization with a regularization constant of 1. For a more detailed description of MLPDomain, please see the DeepPumas documentation.
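
To build intuition for what such an MLP computes, here is a small, purely illustrative sketch in plain Julia. This is not the DeepPumas implementation; the tanh activation and the random initialization are assumptions made just for this example.

## A hand-written 4-input, 4-hidden-unit, 1-output MLP.
W1, b1 = randn(4, 4), zeros(4)   # first-layer weights and biases
W2, b2 = randn(1, 4), zeros(1)   # output-layer weights and biases
mlp(x) = W2 * tanh.(W1 * x .+ b1) .+ b2

## L2 regularization penalizes the squared norm of all parameters.
l2_penalty(λ) = λ * (sum(abs2, W1) + sum(abs2, b1) + sum(abs2, W2) + sum(abs2, b2))

mlp(randn(4))   # a one-element vector holding the scalar output

During model fitting, a penalty like l2_penalty(1.0) is, in effect, subtracted from the log-likelihood, which is what makes L2 regularization equivalent to a Normal prior on the parameters.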

In addition to using MLPDomains as stand-alone functions, we can define them as part of a @model, allowing for automatic discovery of functional relationships within the model. We start by defining a covariate-free model with an MLPDomain. We then embed the neural network in the derivative of the drug-sensitive dynamical variable R, which drives the PD observations we are interested in. The rate of change of R is assumed to depend on the drug amount in the Central compartment and on R itself, but not on the Depot compartment. Therefore, Central and R are inputs to the neural network NN. Furthermore, we assume that the initial concentration R₀ (which is unique to each patient) may influence the rate of change of R, so we include it as another input to NN. Lastly, we include three random effects with normal priors as inputs to the neural network to enable individualization of NN.

model = @model begin
    @param begin
        # PK parameters
        tvKa ∈ RealDomain(; lower = 0)
        tvVc ∈ RealDomain(; lower = 0)
        tvR₀ ∈ RealDomain(; lower = 0)
        CL ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ_pk ∈ RealDomain(; lower = 0, init = 10.0)

        # PD parameters
        ## Define a multi-layer perceptron (a neural network) which maps from 6 inputs
        ## (2 state variables + 4 individual parameters) to a single output.
        ## Apply L2 regularization (equivalent to a Normal prior).
        NN ∈ MLPDomain(6, 6, 5, (1, identity); reg = L2(1.0))
        Ω_nn ∈ PDiagDomain(; init = 0.1 .* I(3))
        σ_pd ∈ RealDomain(; lower = 0, init = 10.0)
    end
    @random begin
        η ~ MvNormal(Ω)
        η_nn ~ MvNormal(Ω_nn)
    end
    @pre begin
        Ka = tvKa * exp(η[1])
        Vc = tvVc * exp(η[2])
        R₀ = tvR₀ * exp(η[3])
        iNN = fix(NN, R₀, η_nn)
    end
    @init begin
        R = R₀
    end
    @dynamics begin
        Depot' = -Ka * Depot
        Central' = Ka * Depot - (CL / Vc) * Central
        R' = iNN(Central, R)[1]
    end
    @derived begin
        yPK ~ @. Normal(Central / Vc, σ_pk)
        yPD ~ @. Normal(R, σ_pd)
    end
end

Neural networks can arbitrarily transform their inputs, including inputs from random effects. Therefore, it is not helpful to introduce complex, or even parametric, random effect distributions when these random effects are only used as inputs to a neural network. The neural network can transform our input distribution to whatever potentially complex distribution we need in order to best explain the between-subject variability in the data. So, the choice of a prior random effect distribution matters little for our ability to account for between-subject variability. Instead, we can base our choice of prior on other factors.

  1. Neural networks are easier to fit if their inputs are standardized. Machine learning practitioners frequently use z-score standardization to transform inputs to have zero mean and unit standard deviation (a brief illustration follows after this list). Standardization reduces the risk of the first neural network layer becoming saturated, ending up with values that are either too high or too low for a chosen activation function. Saturation results in small gradients, which impede model training.
  2. Usage of a diagonal covariance matrix in the prior penalizes covariance between random effects in the posterior during model fitting. This incentivises the neural network to adopt a functional form wherein the random effect inputs have orthogonal effects on the output. If the random effects describe orthogonal dimensions of the between-subject variability then they are easier to work with later on. Fitting of empirical Bayes estimates (EBEs) becomes more robust, and it becomes easier to use the posterior distribution of the random effects for covariate modeling.
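
As a brief illustration of point 1, z-score standardization is simply a subtraction of the mean and a division by the standard deviation. This generic snippet is only for illustration; the standard-normal random effects we feed into the neural network below are already on this scale.

using Statistics

x = 20 .+ 50 .* rand(100)        # some arbitrary, unstandardized inputs
z = (x .- mean(x)) ./ std(x)     # zero mean, unit standard deviation
mean(z), std(z)                  # ≈ (0.0, 1.0)

With these considerations in mind, we fit the model to the small training population.
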
fpm = fit(
    model,
    trainpop_small,
    sample_params(model; rng = rng),
    MAP(FOCE());
    optim_options = (; iterations = 125, show_trace = false),
);

We use maximum a posteriori (MAP) fitting when we want to include priors on our fixed effects. The parameters (weights and biases) of our neural networks are all fixed effects of the NLME model, and their L1 or L2 regularization acts as a prior. Without MAP, we would not be regularizing our neural network.
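
Schematically, and up to additive constants and the precise variance convention behind the regularization constant \(\lambda\) in L2(λ), the MAP objective and the prior implied by L2 regularization of the neural network parameters \(\theta_{NN}\) can be written as \[ \hat{\theta}_{\mathrm{MAP}} = \underset{\theta}{\arg\max} \left[ \log L(\theta, z) + \log p(\theta) \right], \qquad \log p(\theta_{NN}) = -\lambda \, \lVert \theta_{NN} \rVert_2^2 + \mathrm{const}, \] where \(L\) is the likelihood being maximized, which the next paragraph makes precise. This is a sketch of the correspondence rather than the exact objective used internally.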

We fit the model by maximizing an approximation of the marginal likelihood, and this is very important for the generalization performance of our fitted model. The marginal likelihood \(L_m\), defined as \[ L_m(\theta, z) = \int L_c(\theta, z, \eta) \cdot p_\eta(\eta \vert \theta) \, d\eta, \] integrates out the random effects and represents a score of how well the model fits on average, across the range of random effect values that are plausible under the prior distribution \(p_\eta(\eta \vert \theta)\).

Had we maximized the conditional likelihood \(L_c\) or the conditional maximum a posteriori objective (JointMAP() in Pumas), then we would have allowed the values that \(\eta\) takes on for each subject to be almost completely independent, making overfitting much more likely. The marginal likelihood penalizes overfitting: an overfitted \(L_c\) would be sensitive to the precise value of \(\eta\) and would therefore contribute only a small area under the curve to the integral used when computing the marginal likelihood. The marginal likelihood thus imparts a measure of robustness to variation in the random effects.

FOCE is a method for approximating the marginal likelihood without having to evaluate this whole integral. It makes assumptions that may be dubious, but it still yields good results across a wide range of models (including most DeepPumas models we have tried). We DeepPumas scientists default to it as a matter of pragmatism and only depart from it in the rare cases where the fit does not deliver a model that predicts and behaves well.
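
For intuition, the family of approximations that FOCE belongs to can be sketched via the closely related Laplace approximation, which expands the integrand around the per-subject mode \(\hat{\eta}\) of the joint density: \[ \log L_m(\theta, z) \approx \log L_c(\theta, z, \hat{\eta}) + \log p_\eta(\hat{\eta} \mid \theta) + \frac{k}{2}\log(2\pi) - \frac{1}{2}\log\left|-\mathbf{H}(\hat{\eta})\right|, \] where \(k\) is the number of random effects and \(\mathbf{H}(\hat{\eta})\) is the Hessian of \(\log\left[L_c(\theta, z, \eta)\, p_\eta(\eta \mid \theta)\right]\) with respect to \(\eta\), evaluated at \(\hat{\eta}\). FOCE additionally approximates the Hessian term using a first-order conditional expansion; treat this as a sketch of the idea rather than the exact Pumas implementation.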

Model fitting has succeeded if the population and individual predictions match the data well; we confirm this by inspecting the prediction plots.

pred = predict(fpm, testpop; obstimes = 0:0.1:10);
plotgrid(pred; observation = :yPD)

The individual predictions, “ipreds”, which use the empirical Bayes estimates (EBEs) of the subjects’ random effects, match the data just about perfectly. The “preds” (which use the mode of the random effect prior) are all identical because we have no information that would distinguish between our patients, e.g., covariates or different dosing. Therefore, the “preds” capture the average population behaviour.
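
If you want to inspect those empirical Bayes estimates directly, Pumas exposes them from the fitted model. Treat the snippet below as a sketch; the exact return structure may differ between versions.

## One set of η and η_nn estimates per subject in the training population.
ebes = empirical_bayes(fpm)
ebes[1]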

3 Sequential covariate modeling

So far, our model is capable of fitting the dynamics of the data by learning, among other things, an unknown dynamical term. It also captures the between-subject variability, but it does so without using the information provided by the baseline covariates. If there were a way to predict a patient’s empirical Bayes estimates from their baseline covariates, that would improve our ability to make longitudinal predictions, especially for new patients. Fortunately, DeepPumas implements a machine learning based covariate modeling workflow (for a more detailed description, see the documentation on sequential covariate modeling). First of all, we have to preprocess our data by extracting the relevant covariate \(\to\) random effect mapping, which can be done directly on the model we fitted:

preprocess(fpm)

However, using this target, which contains only \(40\) patients, would result in a rather small improvement. Different model parts require different amounts of data in order to robustly separate the signal from the noise. One might think that a neural network inserted into a dynamical system would be harder to fit than a neural network that maps a vector of covariates to a vector of targets, but that is not always the case. When we fit the parameters of a dynamical system, every data point over time for every patient can provide information about the model parameters. However, when we fit a model that maps baseline covariates to empirical Bayes estimates, we are trying to find time-independent features of a patient, so the number of observations over time does not affect the total number of data points we can utilize to discover the covariate mapping. To find a good, individualizable, dynamical model, we did not need more than \(40\) patients; we might have done well even with fewer than \(10\). Conversely, we need data from more patients to find an accurate nonlinear relationship between the covariates and the empirical Bayes estimates. Thus, we create a target mapping from trainpop_large, a population of \(1,500\) patients.
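
Before doing so, a rough tally makes this asymmetry concrete. This is only back-of-the-envelope bookkeeping under this tutorial's simulation settings (16 observation times and two observed variables per subject); the variable names are made up for illustration.

n_subj_small = length(trainpop_small)              # 40 subjects
n_longitudinal = n_subj_small * length(0:15) * 2   # 1280 observations inform the dynamics
n_target_rows = n_subj_small                       # only 40 rows would inform the covariate model

With that motivation, we build the target from the large training population.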

target = preprocess(model, trainpop_large, coef(fpm), FOCE())

We then define a neural network nn to model that relationship, fit it, and augment the original fitted NLME model with the fitted covariate model.

nn = MLPDomain(numinputs(target), 10, 10, (numoutputs(target), identity); reg = L2(1e-3))
fitted_nn = fit(nn, target, optim_options = (; rng = rng))
augmented_fpm = augment(fpm, fitted_nn);

Since some of the between-subject variability in the data-generating model comes from random effects, there is no way of perfectly predicting the patient outcomes from the covariates. The best we can hope for is that our augmented model makes the same population predictions as the data-generating model when applied to validation data. This would indicate that we have identified an accurate mapping between the covariates and the longitudinal patient outcomes.

pred_augment = predict(augmented_fpm, testpop; obstimes = 0:0.1:10)
plotgrid(pred_augment; ipred = false, observation = :yPD)
plotgrid!(
    pred_datamodel,
    ipred = false,
    pred = (;
        label = "Best possible pred",
        color = :black,
        linestyle = :dash,
        linewidth = 2,
    ),
    observation = :yPD,
)

Model performance on validation data indicates that both the learned dynamical system and the model mapping patient covariates to empirical Bayes estimates are predictively accurate.

To summarize our sequential workflow, we first captured the between-subject variability in random effects. Secondly, we used each patient’s baseline covariates and empirical Bayes estimates to generate a target for further machine learning fitting. Note that the second step happened outside of the NLME model: we simply found a machine learning model that maps from a vector of inputs to a vector of outputs. In this example we used numerical covariates but, in principle, any covariate type would work: images, omics, anything you have. The only thing you would need to adjust is the machine learning model such that it could appropriately map from the chosen covariate type to a vector of numerical outputs.

4 Refitting the prior random effect distribution

We have now augmented our fitted NLME model to capture some of the between-subject variability from patient covariates rather than only with random effects. It would therefore be a good idea to re-fit the \(\Omega\) and \(\Omega_{nn}\) parameters that control the prior distributions of the random effects. Typically, these priors no longer need to be quite as broad as they were before the model augmentation.

fpm_refit_Ω = fit(
    augmented_fpm.model,
    trainpop_small,
    coef(augmented_fpm),
    MAP(FOCE());
    optim_options = (; show_trace = false),
    # Fit only the Ω parameters
    constantcoef = filter(i -> !(i in [:Ω, :Ω_nn]), keys(coef(augmented_fpm))),
);

5 Fully joint model fine tuning

Above, you saw a highly successful sequential approach to model fitting. We first fit a model to capture the dynamics and individual variability across our data set. We then fit a covariate model to personalize baseline predictions based on patient characteristics. We finally merged the two sequentially fitted models. The sequential approach typically already performs very well, but we can go further and fine-tune the entire joint model. However, the augmented NLME model has quite a lot of parameters and fitting it to convergence would take a long time, so here we cap the fitting time instead.

fpm_deep = fit(
    fpm_refit_Ω.model,
    trainpop_small,
    coef(fpm_refit_Ω),
    MAP(FOCE());
    optim_options = (; time_limit = 5 * 60, show_trace = false),
);

In this case, the benefits of joint fitting are rather negligible since we both terminated the fit early and started from the sequentially fitted result, which was already very good.

6 Classical NLME diagnostics and visual predictive check

Once you have trained your DeepNLME model using DeepPumas, you can run all the diagnostic tools available in Pumas including the visual predictive check (VPC) to further probe the quality of the fitted model. As shown below, the augmented model displays good performance in the diagnostics.

ins_refit_Ω = inspect(fpm_refit_Ω)
goodness_of_fit(ins_refit_Ω, figure = (; size = (1000, 1000)))

The results of the classical NLME diagnostic tools applied to the augmented DeepNLME model

_vpc = vpc(fpm_refit_Ω)
vpc_plot(_vpc, figurelegend = (; position = :b, orientation = :horizontal, nbanks = 3))

The visual predictive check plot of the augmented DeepNLME model

7 Conclusion

As shown, DeepPumas enables flexible, nested composition of dynamical, statistical, and machine learning modeling. It can be used to model missing dynamical elements in a data-driven way, resulting in prediction of heterogeneous outcomes across different individuals. Furthermore, DeepPumas can be used to map the relationship between baseline covariates and patient parameters, enhancing the longitudinal outcome predictions. While no model can perfectly predict future outcomes given imperfect data, the DeepPumas model achieves close to the best possible predictions, closely matching the predictions made by the data-generating model itself.