using DeepPumas
using CairoMakie
using StableRNGs
set_mlp_backend(:staticflux)
SciML using DeepPumas
1 Learning objectives
- Explore the use of machine learning models, such as multi-layer perceptrons (MLPs), to model time-series data.
- Understand the concept of neural ordinary differential equations (NODEs) and their application in modeling dynamic systems with doses.
- Learn how to encode domain knowledge into models using universal differential equations (UDEs).
- Understand the limitations of models without random effects in capturing individual variability.
2 Setup
First, we need to load the necessary packages. We also set the backend for the MLPDomain
to use :staticflux
. DeepPumas supports multiple backends for the MLPDomain
. :staticflux
is a good choice for small neural networks which we will use in this tutorial.
3 Synthetic data and true model
The following Pumas model is designed to simulate pharmacokinetic (PK) and pharmacodynamic (PD) data. It assumes a linear one-compartment PK model, oral dosing, and a non-linear elimination PD model, serving as the basis for generating synthetic data.
= @model begin
data_model @param begin
∈ RealDomain(; lower = 0) # absorption rate constant
Ka ∈ RealDomain(; lower = 0) # clearance
CL ∈ RealDomain(; lower = 0) # volume of distribution
Vc ∈ RealDomain(; lower = 0) # maximum effect
Smax ∈ RealDomain(; lower = 0) # Hill coefficient
n ∈ RealDomain(; lower = 0) # concentration at half-maximal effect
SC50 ∈ RealDomain(; lower = 0) # elimination rate constant
Kout ∈ RealDomain(; lower = 0) # input rate constant
Kin ∈ RealDomain(; lower = 0) # proportional error standard deviation
σ end
@init begin
= Kin / Kout # initial response
Response end
@vars begin
:= max(Central / Vc, 0.0) # plasma concentration
cp := Smax * cp^n / (SC50^n + cp^n) # effect function
EFF end
@dynamics begin
' = -Ka * Depot
Depot' = Ka * Depot - (CL / Vc) * Central
Central' = Kin * (1 + EFF) - Kout * Response # response dynamics
Responseend
@derived begin
~ @. Normal(Response, abs(Response) * σ)
Outcome end
end
= (;
true_parameters = 0.5,
Ka = 1.0,
CL = 1.0,
Vc = 2.9,
Smax = 1.5,
n = 0.05,
SC50 = 2.2,
Kout = 0.8,
Kin = 0.02, ## <-- tune the observational noise of the data here
σ )
We will use this model to simulate data for two subjects, A and B, with different dosage regimens. Subject A will receive a dose of 1.0 with an additional dose after 5 hours, while Subject B will receive a lower dose of 0.1 with the same regimen.
We use the simobs
function to simulate observations for each subject based on the defined model and true parameters. The obstimes
argument specifies the time points at which observations are made, and we use a stable random number generator (StableRNG
) to ensure reproducibility.
= DosageRegimen(1.0, addl = 1, ii = 5)
dr_a = Subject(; events = dr_a, id = "Subject A")
subj_a = simobs(data_model, subj_a, true_parameters; obstimes = 0:0.5:15, rng = StableRNG(1))
sim_a = [Subject(sim_a)]
data_a
= DosageRegimen(0.1, addl = 1, ii = 5)
dr_b = Subject(; events = dr_b, id = "Subject B")
subj_b = simobs(data_model, subj_b, true_parameters; obstimes = 0:0.5:15, rng = StableRNG(2))
sim_b = [Subject(sim_b)] data_b
We can visualize the simulated data for both subjects using plotgrid
, which provides a convenient way to plot multiple subjects’ data in a grid layout.
plotgrid(data_a; data = (; label = "Data (subject A)"))
plotgrid!(data_b; data = (; label = "Data (subject B)"), color = :gray)
We overlay the true model predictions on the data for both subjects. This is the ground truth underlying function that we will try to recover using scientific machine learning. To get the predictions, we use the predict
function with the data_model
and the true_parameters
. The obstimes
argument specifies the time points at which we want to evaluate the predictions. Note that we are predicting at more time points than the original data to get a smoother curve.
= predict(data_model, data_a, true_parameters; obstimes = 0:0.01:15)
pred_datamodel_a = predict(data_model, data_b, true_parameters; obstimes = 0:0.01:15)
pred_datamodel_b plotgrid(pred_datamodel_a; ipred = false)
plotgrid!(pred_datamodel_b; data = true, ipred = false)
4 Identification of dynamics using machine learning
One way to identify the dynamics of a system is to use a neural network to model the relationship between time and the observed outcomes. In this section, we will explore how to use a simple neural network function of time to model the dynamics of our system.
The following code defines a model that uses a multi-layer perceptron (MLP) mlp
that takes time as input and outputs a predicted outcome, which we will then fit to the observed data. The MLP is defined with 1 input, 1 output, and 2 hidden layers of 6 nodes each. We also include a regularization term to lower the chances of overfitting but avoid regularizing the output layer, as we want the output to be on the same scale as the observed data. This model is not a SciML model, but rather a machine learning model that maps time to a noisy outcome.
= @model begin
time_model @param begin
∈ MLPDomain(1, 6, 6, (1, identity); reg = L2(1.0; output = false))
mlp ∈ RealDomain(; lower = 0.0)
σ end
@derived begin
:= first.(mlp.(t))
nn_output ~ @. Normal(nn_output, σ)
Outcome end
end
The @derived
block specifies how the outcome is derived from the MLP output, using the first element of the MLP output for each time point. In the @derived
block:
t
is a vector of all time points for which a subject had observations.mlp.(t)
applies themlp
function to each element of t. (the.
“broadcasts” the function over all elements instead of using the vector directly)first
gets the first element of the mlp output (the output is a 1-element vector)
To fit this model to the data, we first need to remove the dose information from the subjects, as this simple model does not account for dosing. We create new subjects without events (doses) for both Subject A and Subject B.
= Subject.(data_a; events = nothing)
data_a_no_dose = Subject.(data_b; events = nothing) data_b_no_dose
We then call the fit
function to fit the model to the data for Subject A, using the init_params
function to initialize the parameters of the model. We use the MAP(NaivePooled())
method to fit the model. MAP
stands for Maximum A Posteriori estimation, and NaivePooled
is used because the model does not have random effects. The reason MAP
is required is that under the hood, the regularization is implemented as a prior distribution on the parameters. Without MAP
, the regularization would not be applied.
= fit(time_model, data_a_no_dose, init_params(time_model), MAP(NaivePooled())) fpm_time
Following the fit, we can visualize the predictions of the model for Subject A. The predict
function is used to generate predictions based on the fitted model, and we specify the observation times at which we want to evaluate the predictions.
First, we plot the predictions for Subject A, overlaying them on the observed data.
= predict(fpm_time; obstimes = 0:0.1:15);
pred_a plotgrid(
pred_a;= (; label = "Pred (subject A)"),
pred = (; label = "Data (subject A)", color = :gray),
data = false,
ipred )
Next, we can also visualize the predictions for Subject B using the same fitted model.
= predict(fpm_time, data_b_no_dose; obstimes = 0:0.1:15);
pred_b plotgrid!(
pred_b,= (; label = "Pred (subject B)", color = :red),
pred = (; label = "Data (subject B)", color = :gray),
data = false,
ipred )
Note that they have the exact same prediction! This is expected since the model does not account for dosing and only uses time as input. The model has learned a function of time that fits the data, but it does not capture the underlying pharmacokinetic and pharmacodynamic processes.
5 Neural ordinary differential equations (NODEs)
For simple dosing regimens, we could have input the dose amount as an additional input to the neural network above. However, this approach does not scale well to more complex dosing regimens, or when we have multiple subjects with different dosing patterns.
To account for the dosing and the underlying dynamics of the system, we can use a neural ordinary differential equation (NODE). This approach allows us to model the dynamics of the system using dynamic compartments and their interactions, while still leveraging the power of neural networks to learn complex relationships. This approach is generally more suitable for PKPD modeling, because it allows us to assign one of the compartments to be the Depot compartment, which is where the drug is administered, and one to be the Central compartment, which is where the drug is distributed and metabolized, and one to be the Response compartment, which is where the effect of the drug is observed.
= @model begin
neural_ode_model @param begin
∈ MLPDomain(3, 6, 6, (3, identity); reg = L2(1.0)) # neural network with 3 inputs and 1 output
mlp ∈ RealDomain(; lower = 0) # initial value of response
Response₀ ∈ RealDomain(; lower = 0) # proportional error standard deviation
σ end
@init Response = Response₀
@dynamics begin
' = mlp(Depot, Central, Response)[1]
Depot' = mlp(Depot, Central, Response)[2]
Central' = mlp(Depot, Central, Response)[3]
Responseend
@derived begin
~ @. Normal(Response, abs(Response) * σ)
Outcome end
end
Note that we still utilize a neural network to model the rates of change of the compartments, but any dosing regimen can now be administered to the Depot compartment. Additionally, the same neural network is used to model the rates of change of all three compartments. The first output of the neural network is used for the Depot compartment, the second output for the Central compartment, and the third output for the Response compartment.
For simplicity, no PK data is used in this model which means that the second compartment’s dynamics is not identifiable and its meaning as the Central compartment will generally not be respected! This is an important point to keep in mind when using NODEs. The meaning you assign to a compartment is not enforced by the model unless you tie a compartment to a specific observed variable that is used for fitting. This is also true to an extent for the Depot compartment despite the fact that it is the only compartment that receives a dose directly. But because the rate of change is a general neural network, there is no guarantee that the Depot compartment will be a decreasing function of time between doses, or that the amount there is only that of the actual dose given.
Next, we fit the model to the data for Subject A. We use the fit
function with the init_params
function to initialize the parameters of the model. init_params
will initialize mlp
to use random parameter values that are sampled once when a model is defined. This means that subsequent calls to init_params
will return the same parameters. This is useful for reproducibility. sample_params
can be used instead to sample new random parameters each time it is called, which is useful for exploring the parameter space and finding good initial values for the optimization. We use MAP(NaivePooled())
again to fit the model.
= fit(neural_ode_model, data_a, init_params(neural_ode_model), MAP(NaivePooled())) fpm_node
After fitting the model, we can visualize the predictions for subjects A and B.
= predict(fpm_node; obstimes = 0:0.01:15)
pred_a plotgrid(
pred_a;= (; label = "Pred (subject A)"),
pred = false,
ipred = (; label = "Data (subject A)", color = :gray),
data
)
= predict(fpm_node, data_b; obstimes = 0:0.01:15)
pred_b plotgrid!(
pred_b,= (; label = "Pred (subject B)", color = :red),
pred = (; label = "Data (subject B)", color = :gray),
data = false,
ipred )
We can see that the predictions are different unlike the previous model, which only used time as input. However, the model is clearly not able to predict the data of Subject B very well, indicating that it was unable to learn a generalizable model that captures the underlying dynamics of the system and how these dynamics depend on the dose level.
Perhaps, this is not surprising since the model is only fitted on a single dose instance. Machine learning needs more data to learn generalizable patterns. So despite the model’s ability to learn the dynamics of Subject A, it does not generalize well to Subject B.
6 Universal differential equation (UDE)
The NODE model used previously did not encode any domain knowledge about the system, beside administering drug to a specific compartment. To improve the model’s ability to generalize across subjects, we can encode more domain knowledge into the model. This is where the concept of Universal Differential Equations (UDEs) comes into play.
Below is a model that uses a traditional one compartment PK model which is known to be the correct data-generating PK model. But beside the specific functional form, this PK model encodes the knowledge that the drug is conserved between the Depot and Central compartments. This means that the drug that enters the Central compartment is equal to the drug that leaves the Depot compartment, minus the drug that is cleared from the Central compartment. This mass conservation property ensures a monotonic decreasing behavior of Depot between doses, and ensures a behavior of Central that is consistent with our understanding and meaning assigned to this compartment. Note that the specific dynamics of the compartments may still not be identifiable from the PD data only. However, we have constrained the set of plausible models to ones that are more biologically reasonable.
= @model begin
ude_model @param begin
∈ MLPDomain(2, 6, 6, (1, identity); reg = L2(0.1)) # neural network with 2 inputs and 1 output
mlp ∈ RealDomain(; lower = 0)
Ka ∈ RealDomain(; lower = 0)
CL ∈ RealDomain(; lower = 0)
Vc ∈ RealDomain(; lower = 0)
Response₀ ∈ RealDomain(; lower = 0) # proportional error standard deviation
σ end
@init Response = Response₀
@dynamics begin
' = -Ka * Depot # known
Depot' = Ka * Depot - (CL / Vc) * Central # known
Central' = mlp(Central / Vc / 100, Response / 10)[1]
Responseend
@derived begin
~ @. Normal(Response, abs(Response) * σ)
Outcome end
end
In this model, the Response compartment is the only part that is still modelled using a neural network. The neural network takes 2 inputs:
- The Central compartment concentration divided by the volume of distribution (Vc) and divided further by 100, and
- The current value of the Response compartment (R) divided by 10.
The division by 100 and 10 ensures that the scales of the inputs are small at the beginning of the fit. At the beginning of a fit, the dynamics will not be well behaved and may lead to unreasonably high values for Central or Response. This can cause fitting issues by saturating the neural network’s activation function (tanh
), leading to non-informative low gradients and poor convergence. By scaling the inputs, we can avoid this issue and allow the model to learn more effectively. The default activation function of the MLP is tanh
, which saturates for large inputs. Alternatively, we could have used softplus
as the activation function, which does not saturate for large inputs. However, for consistency with the previous models and for the smoothness of the tanh
function, we will stick with tanh
for this example.
Next, we fit the model to the data for Subject A and make the same predictions for both subjects A and B as before.
= fit(ude_model, data_a, init_params(ude_model), MAP(NaivePooled())) fpm_ude
= predict(fpm_ude; obstimes = 0:0.1:15);
pred_a plotgrid(
pred_a;= (; label = "Pred (subject A)"),
pred = false,
ipred = (; label = "Data (subject A)", color = :gray),
data
)
= predict(ude_model, data_b, coef(fpm_ude); obstimes = 0:0.1:10);
pred_b plotgrid!(
pred_b,= (; label = "Pred (subject B)", color = :red),
pred = (; label = "Data (subject B)", color = :gray),
data = false,
ipred )
Note that the model is able to generalize better to Subject B compared to the previous NODE model, despite just seeing a single dose level in the training data. In general, encoding more domain knowledge into a model can help improve its ability to generalize and data efficiency. Data efficiency refers to the model’s ability to learn from a limited amount of data, which is particularly important in pharmacometrics where data can be scarce or expensive to obtain.
7 Encoding more domain knowledge
We can go further by encoding knowledge about the functional form of the rate of change of the Response compartment. For example, an additional constraint we can add is that when there is no drug, the Response compartment should follow the classical turnover model:
' = Kin - Kout * Response Response
We also initialize the Response compartment to be the ratio of the input and output rate constants which is the steady state value of the Response compartment when there is no drug. This is a common practice in pharmacometrics.
We achieve this using the following model.
= @model begin
ude_model_knowledge @param begin
∈ MLPDomain(1, 6, 6, (1, identity); reg = L2(0.1), bias = false) # neural network with 1 input and 1 output and no bias/intercept
mlp ∈ RealDomain(; lower = 0)
Ka ∈ RealDomain(; lower = 0)
CL ∈ RealDomain(; lower = 0)
Vc ∈ RealDomain(; lower = 0)
Kout ∈ RealDomain(; lower = 0)
Kin ∈ RealDomain(; lower = 0) # proportional error standard deviation
σ end
@init Response = Kin / Kout
@dynamics begin
' = -Ka * Depot
Depot' = Ka * Depot - (CL / Vc) * Central
Central' = Kin * (1 + mlp(Central / Vc)[1]) - Kout * Response
Responseend
@derived begin
~ @. Normal(Response, abs(Response) * σ)
Outcome end
end
Note that to ensure the output of the neural network is 0 when its input is 0, we avoid using bias terms in the neural network, hence the bias = false
in the MLPDomain
definition. Also because this model is much more constrained and well-behaved, we drop the scaling of the input to the neural network which was unnecessary.
Next, we fit and predict using the new model with the same data for Subject A and B.
=
fpm_knowledge fit(ude_model_knowledge, data_a, init_params(ude_model_knowledge), MAP(NaivePooled());)
= predict(fpm_knowledge; obstimes = 0:0.1:15);
pred_a plotgrid(
pred_a;= false,
ipred = (; label = "Data (subject A)", color = (:black, 0.5)),
data = (; label = "Pred (subject A)"),
pred = (; orientation = :horizontal, nbanks = 2),
legend
)
= predict(ude_model_knowledge, data_b, coef(fpm_knowledge); obstimes = 0:0.1:10);
pred_b plotgrid!(
pred_b;= false,
ipred = (; label = "Data (subject B)", color = :black),
data = (; label = "Pred (subject B)", color = :red),
pred )
Note how the prediction now takes the correct function shape but it is still not able to predict the scale of the response of Subject B very well. To further appreciate this, we overlay the true model predictions on the same plot.
plotgrid!(
pred_datamodel_a;= (; color = (:black, 0.4), label = "Datamodel"),
pred = false,
ipred
)plotgrid!(
pred_datamodel_b;= (; color = (:black, 0.4), label = "Datamodel"),
pred = false,
ipred )
Notice that the steady-state behavior of the model prediction and the true underlying function is the same. Compare the prediction here from that of the first black-box NODE model. While the UDE model prediction is not perfect for Subject B, it is still impressive given that only a single subject was used in the training data. So let’s see what happens when we have more subjects in the training data.
8 Two training subjects
To demonstrate the effect of adding more data, let’s add one more subject with different dose level to the training data. The new subject, Subject C, is given a dose amount of 0.5. Given that Subject B has a dose level of 0.1, by using subjects A and C to fit the model and predicting for Subject B, we are still extrapolating the model to a dose level that is lower than the training data.
First, let’s define the new subject and simulate its data using the same true_parameters
as before.
= DosageRegimen(0.5, addl = 1, ii = 5)
dr_c = Subject(; events = dr_c, id = "Subject C")
subj_c = simobs(data_model, subj_c, true_parameters; obstimes = 0:0.5:15, rng = StableRNG(3)) sim_c
Next, we fit the UDE model to the data for Subjects A and C, using the same fitting procedure as before, and we predict for Subject B.
= [Subject(sim_a), Subject(sim_c)]
data_ac =
fpm_knowledge fit(ude_model_knowledge, data_ac, init_params(ude_model_knowledge), MAP(NaivePooled());)
= predict(ude_model_knowledge, data_b, coef(fpm_knowledge); obstimes = 0:0.1:10);
pred_b plotgrid(
pred_b;= false,
ipred = (; label = "Data (subject B)", color = :black),
data = (; label = "Pred (subject B)", color = :red),
pred )
Much better!
9 Sparse data
Let’s repeat the same analysis above but with sparsely sampled data for the training data. To compensate for the data sparsity per subject, 4 subjects will be used for training the model. This is a more realistic scenario where we often have sparse data for each subject, which can make it difficult to fit the model.
We sample 4 subjects, each with a different dose level, and simulate their data using the same true_parameters
as before. The doses are chosen to be 1.0, 0.8, 0.6, and 0.4. The same Subject B from before will then be used for prediction.
= [1.0, 0.8, 0.6, 0.4]
doses = map(1:4) do i
sparse_training_data = DosageRegimen(doses[i], addl = 1, ii = 5)
dr = Subject(; events = dr, id = "Sparse subject $i")
subj = rand(StableRNG(2i), Uniform(4, 6)) # random step size between 4 and 6 hours
step_size = simobs(
sim
data_model,
subj,
true_parameters;= 0:step_size:15,
obstimes = StableRNG(i),
rng
)return Subject(sim)
end
With the above sampling procedure, each subject has 3 to 4 observations. Next, we fit the model to the sparse data and predict for Subject B as before. Note that Subject B has a dose level that is not in the training data.
= fit(
fpm_knowledge
ude_model_knowledge,
sparse_training_data,init_params(ude_model_knowledge),
MAP(NaivePooled());
)
= predict(ude_model_knowledge, data_b, coef(fpm_knowledge); obstimes = 0:0.1:10);
pred_b plotgrid(
pred_b;= false,
ipred = (; label = "Data (subject B)", color = :black),
data = (; label = "Pred (subject B)", color = :red),
pred )
The model is still able to predict the data for Subject B reasonably well, despite the sparse data and the fact that Subject B has a dose level that is not in the training data. This demonstrates the power of UDEs in modeling complex systems with limited data.
10 Multiple subjects pooled
The previous analysis demonstrated the ability of a UDE model to extrapolate to new dose levels with very limited training data. However, a key assumption was made in the data generating process so far, which is that all subjects were assumed to have the same true_parameters
. So the only sources of variability between subjects are the covariates and residual noise. This assumption is not realistic in practice. In reality, subjects have different individual parameters, e.g. Ka and CL. In the next analysis, we will try to understand how our UDE model behaves once we relax this assumption in the data generating process without changing the fitting model.
To simulate multiple subjects that have different individual PK and PD parameters, we use a nonlinear mixed effects model (NLME) with a @random
block to define random effects for the PK and PD parameters. The model is otherwise identical to the previous data generating model.
= @model begin
data_model_heterogeneous @param begin
∈ RealDomain(; lower = 0)
Ka ∈ RealDomain(; lower = 0)
CL ∈ RealDomain(; lower = 0)
Vc ∈ RealDomain(; lower = 0)
Smax ∈ RealDomain(; lower = 0)
n ∈ RealDomain(; lower = 0)
SC50 ∈ RealDomain(; lower = 0)
Kout ∈ RealDomain(; lower = 0)
Kin ∈ RealDomain(; lower = 0)
σ end
@random begin
~ MvNormal(0.04 * I(5))
η end
@pre begin
= Smax * exp(η[1])
Smaxᵢ = SC50 * exp(η[2])
SC50ᵢ = Ka * exp(η[3])
Kaᵢ = Vc * exp(η[4])
Vcᵢ = Kout * exp(η[5])
Koutᵢ end
@init begin
= Kin / Koutᵢ
R end
@vars begin
:= max(Central / Vcᵢ, 0.0)
cp := Smaxᵢ * cp^n / (SC50ᵢ^n + cp^n)
EFF end
@dynamics begin
' = -Kaᵢ * Depot
Depot' = Kaᵢ * Depot - (CL / Vcᵢ) * Central
Central' = Kin * (1 + EFF) - Koutᵢ * Response
Responseend
@derived begin
~ @. Normal(Response, abs(Response) * σ)
Outcome end
end
To investigate the effect of the existence of random effects in the data generating process, we fix the dose level to 1.0 for all subjects and assume a single dose at baseline, for simplicity.
Using the following code, we can simulate sparse data for 30 subjects. Each subject has 2 observations at random time points within the first 10 hours after the dose. The same true_parameters
are used to simulate the data. However, these parameters now refer to the typical values of the individual parameters due to the existence of random effects in the model.
= [
sims_sparse simobs(
data_model_heterogeneous,Subject(; events = DosageRegimen(1.0), id = i),
true_parameters;= 10 .* sort!(rand(StableRNG(i), 2)),
obstimes = 1:30
) for i
]= Subject.(sims_sparse)
population_sparse plotgrid(population_sparse)
We then fit the same UDE model to the sparse data using the same fitting procedure as before. Note that no random effects are assumed in the fitting model, which is fundamentally incorrect given the data-generating process.
= fit(
fpm_sparse
ude_model_knowledge,
population_sparse,init_params(ude_model_knowledge),
MAP(NaivePooled()),
)
We then plot the predictions of the fitted model for all subjects in the training population and all the observations. The predictions are made at a finer time resolution to visualize the dynamics better.
= predict(fpm_sparse; obstimes = 0:0.01:15);
pred plotgrid(pred)
# plot them all stacked ontop of oneanother
= Figure();
fig = Axis(fig[1, 1]; xlabel = "Time", ylabel = "Outcome", title = "Stacked predictions")
ax for i in eachindex(pred)
plotgrid!([ax], pred[i:i]; data = (; color = Cycled(i)))
end
fig
Since the dose level is the same across all subjects, the only source of variability in the fitting model is assumed to be the residual noise. Therefore, the model is not able to capture the individual differences in the PK and PD parameters. The prediction made by the model is therefore a single curve that represents the average behavior of the population, but it does not capture the systematic individual differences in the PK and PD responses. This is a fundamental limitation of the UDE model because it lacks random effects.
11 Conclusion
In this tutorial, we explored how to use DeepPumas to model pharmacokinetic and pharmacodynamic data using 3 approaches:
- A simple multi-layer perceptron (MLP) that maps time to a noisy outcome.
- A neural ordinary differential equation (NODE) that models the dynamics of the system using dynamic compartments.
- A universal differential equation (UDE) that encodes domain knowledge about the system and can generalize better across subjects.
We demonstrated how to fit these models to synthetic data and how to visualize the predictions. We also explored how to improve the generalization of the models by encoding more domain knowledge, demonstrating the generalization in both the dense and sparse data cases. Finally, we discussed the limitations of the UDE model when random effects are present in the data-generating process but not in the model.