Goal
To generate synthetic data that is indistinguishable in distribution from the real data.
Time-series / longitudinal / panel data
Data structure
- Data is often hierarchical in nature and heterogeneous
- Longitudinal data
    - Every subject has one or more response variables (e.g. PK and PD responses).
    - Every response variable has one or more observations.
    - Response variables can have different numbers of observations across and/or within subjects.
- Image/scan data
    - Every subject has one or more images/scans, e.g. CT scan and MRI scan.
    - Every image/scan has many pixels/points.
    - Images/scans can have different sizes and resolutions across and/or within subjects.
Key point
The model structure must match the structure of the data, e.g.:
- Temporal models are used for time-series data, and
- Spatial models for image/scan/spatial data.
Model structure
- Generative models are probabilistic models that make use of random variables to simulate new data.
- Generative models generate synthetic data \(y\) from:
    - A latent representation \(\eta\) of the data point, modelled as a random variable.
    - A parametric function \(f(\eta)\) which returns either
        - a probability distribution for \(y\) given its latent representation, or
        - a single point \(y\) directly.
- Example
\[
\begin{aligned}
\eta & \sim \mathcal{N} ( 0, 1 ) \\
y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg)
\end{aligned}
\]
Model structure
Latent variables
The latent representation \(\eta\) of the data point is:
- Low-dimensional and fixed-length
- Modelled as a random variable sampled from a parametric prior distribution to allow generating data.
- Also known as a latent variable, embedding variable, latent feature or random effect.
- If \(\eta\) is discrete, it is commonly known as a cluster assignment.
- The latent variable \(\eta\) is per point/subject/image.
- The prior distribution is shared between all points/subjects/images.
Model structure
Decoder
- The parametric function \(f(\eta)\) is also known as a probabilistic decoder, generator, data model or error model.
- The architecture/structure of the decoder \(f\) must match the structure of the data.
Model structure
Parameters
- The parameters of the decoder \(f\) and the prior distribution of \(\eta\) are shared between all the points/subjects/images.
- The parameters of \(f\) and the prior distribution of \(\eta\) are known as population-level parameters in the hierarchical statistics literature.
- A subset of the parameters of \(f\) is also sometimes called the fixed effects in the mixed effects literature.
- In deep learning, all parameters are simply called weights (or weights and biases).
Generative deep learning models
Decoder
The decoder \(f\) returns a distribution whose parameters are described by one or more neural networks. Example:
\[
y \sim f(\eta) = \mathcal{N} ( \text{NN}_{\mu}(\eta), \text{NN}_{\sigma^2}(\eta) )
\]
Latent variables
- \(\eta\) is a vector of random degrees of freedom that the neural networks can use in any way they see fit to generate plausible data.
- What each latent variable does to \(f(\eta)\) and \(y\) is emergent behavior and not something engineered into the model.
\[
\eta \sim \mathcal{N}(0, I)
\]
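As a rough sketch of such a decoder (not from the source): each network is a single hidden layer with random placeholder weights, and a softplus keeps the variances positive.

```python
import numpy as np

rng = np.random.default_rng(1)

def tanh_mlp(eta, W1, b1, W2, b2):
    """One-hidden-layer network: latent vector eta -> output vector."""
    h = np.tanh(W1 @ eta + b1)
    return W2 @ h + b2

# Placeholder weights (in practice these are fitted to data).
latent_dim, hidden, out_dim = 2, 16, 3
params_mu  = [rng.normal(size=(hidden, latent_dim)), np.zeros(hidden),
              rng.normal(size=(out_dim, hidden)),    np.zeros(out_dim)]
params_var = [rng.normal(size=(hidden, latent_dim)), np.zeros(hidden),
              rng.normal(size=(out_dim, hidden)),    np.zeros(out_dim)]

def decoder(eta):
    """f(eta) = N(NN_mu(eta), NN_sigma2(eta)); softplus keeps the variances positive."""
    mu = tanh_mlp(eta, *params_mu)
    sigma2 = np.log1p(np.exp(tanh_mlp(eta, *params_var)))  # softplus
    return mu, sigma2

eta = rng.normal(size=latent_dim)        # eta ~ N(0, I)
mu, sigma2 = decoder(eta)
y = rng.normal(mu, np.sqrt(sigma2))      # y ~ N(mu, diag(sigma2))
```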
Generative pharmacometric models
Decoder
The decoder \(f\) returns a distribution (e.g. a normal distribution) whose parameters (e.g. mean and covariance) are described by either:
- A mechanistic model, often based on differential equations, aka a scientific model,
- A fully empirical model, aka a machine learning model, or
- A semi-mechanistic model, aka a scientific machine learning model.
Generative pharmacometric models
Latent variables
- \(\eta\) is interpreted as a subject’s deviation from the population’s typical value for a specific individual quantity.
- For example, the individual clearance is \(\text{CL} = \text{tvcl} \cdot \exp(\eta_{\text{CL}})\).
- The typical value \(\text{tvcl}\) is a population-level parameter (part of the decoder).
- The prior distribution’s parameters are also population-level parameters.
- The prior distribution of \(\eta\) models the between-subject variability not explained by the covariates.
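A minimal numeric sketch of this parameterization; the values of `tvcl` and the variance of \(\eta_{\text{CL}}\) below are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(2)

tvcl = 5.0        # typical clearance (population-level parameter; illustrative)
omega2_cl = 0.09  # variance of eta_CL (between-subject variability; illustrative)

eta_cl = rng.normal(0.0, np.sqrt(omega2_cl), size=10)  # one eta_CL per subject
cl_individual = tvcl * np.exp(eta_cl)                  # CL = tvcl * exp(eta_CL)
print(cl_individual)
```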
Sampling from the generative model
Sampling from the generative model is a 3-step process:
- Sample a random value for the latent representation \(\eta\) from its prior distribution,
- Decode \(\eta\) using \(f(\eta)\) to produce a distribution for \(y\), conditional on \(\eta\), and
- Sample from the conditional distribution of \(y\).
\[
\begin{aligned}
\eta & \sim \mathcal{N} ( 0, 1 ) \\
y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg)
\end{aligned}
\]
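A minimal sketch of these three steps for the example above, in Python/NumPy (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: sample the latent representation from its prior.
eta = rng.normal(0.0, 1.0)                              # eta ~ N(0, 1)

# Step 2: decode eta into a conditional distribution for y.
mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
cov = np.array([[0.2, 0.0], [0.0, 0.2]])

# Step 3: sample y from the conditional distribution.
y = rng.multivariate_normal(mean, cov)                  # y | eta ~ N(mean, cov)
print(eta, y)
```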
Fitting the generative model
- Goal: find the best values for the parameters of \(f\) and the prior distribution of \(\eta\) by some definition of best.
- One common fitting objective is maximizing the log marginal likelihood (aka evidence):
\[
\sum_{i=1}^N \log \int p(y_i \mid \eta) \cdot p(\eta) \, d\eta
\]
- \(p(y_i \mid \eta)\) is the probability of \(y = y_i\) according to the distribution \(f(\eta)\).
- \(p(\eta)\) is the probability of \(\eta\) according to its prior distribution.
- \(N\) is the number of points/images/subjects in the training data.
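For intuition, the per-subject integral can be approximated by simple Monte Carlo over prior samples, as in the sketch below for the earlier two-dimensional example (illustrative only; practical fitting relies on more efficient approximations of this integral, and the "observations" here are made up).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_marginal_likelihood(ys, n_mc=5_000):
    """Estimate sum_i log int p(y_i | eta) p(eta) deta by Monte Carlo over the prior."""
    etas = rng.normal(0.0, 1.0, size=n_mc)                      # eta ~ N(0, 1)
    means = np.outer(etas, [1.0, 1.0]) + np.array([2.0, 2.0])   # f(eta) means, (n_mc, 2)
    total = 0.0
    for y in ys:
        # p(y | eta): product of two independent N(mean, 0.2) densities per sample.
        p_y_given_eta = norm.pdf(y, loc=means, scale=np.sqrt(0.2)).prod(axis=1)
        total += np.log(p_y_given_eta.mean())                   # average over the prior
    return total

ys = [np.array([2.5, 2.4]), np.array([1.0, 1.2])]               # made-up "observations"
print(log_marginal_likelihood(ys))
```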
Inverse problem
- Instead of generating a synthetic value \(y\) from a random value \(\eta\), we can use the generative model in reverse.
- Given an actual observed value \(y\), we can infer the latent variable \(\eta^*\) that best represents this data point \(y\).
- This is done by solving the following optimization problem for some dissimilarity function \(D\).
- \(D\) should quantify the dissimilarity between the distribution \(f(\eta)\) and the observation \(y\).
\[
\eta^* = g(y) = \text{argmin}_{\eta} D(f(\eta), y)
\]
Inverse problem
- In mixed effects modelling, a common choice for the dissimilarity function \(D(f(\eta), y)\) is:
\[
-\log p(y \mid \eta) - \log p(\eta)
\]
- \(-\log p(y \mid \eta)\) quantifies the dissimilarity between \(f(\eta)\) and \(y\).
- \(-\log p(\eta)\) acts as regularization, which reduces over-fitting when there is a large noise component in \(y\) or when \(y\) is an outlier.
- Over-fitting in this context is often measured using \(\epsilon\)-shrinkage in the mixed effects literature.
- When using this particular definition of \(D\), \(\eta^*\) is called the empirical Bayes estimate.
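A sketch of computing this empirical Bayes estimate for the earlier toy Gaussian example, using `scipy.optimize.minimize`; the observed data point is made up.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

a, b, sigma2 = np.array([1.0, 1.0]), np.array([2.0, 2.0]), 0.2    # decoder constants

def neg_log_joint(eta, y):
    """D(f(eta), y) = -log p(y | eta) - log p(eta) for the toy Gaussian model."""
    mean = a * eta + b
    nll = -norm.logpdf(y, loc=mean, scale=np.sqrt(sigma2)).sum()  # -log p(y | eta)
    nlp = -norm.logpdf(eta, loc=0.0, scale=1.0).sum()             # -log p(eta)
    return nll + nlp

y_obs = np.array([3.1, 2.9])                          # a made-up observed data point
res = minimize(neg_log_joint, x0=np.array([0.0]), args=(y_obs,))
eta_star = res.x[0]                                   # empirical Bayes estimate
print(eta_star)
```

For this linear Gaussian example the optimum is also available in closed form; the numerical optimization mirrors what is done for nonlinear decoders.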
Inverse problem
- More generally, in machine learning, \(\eta^*\) is called an embedding.
- In this optimization, the parameters of \(f\) and the prior distribution are kept fixed.
- Since the dimension of \(\eta\) is often less than that of \(y\), the inverse problem can be used for dimension reduction (aka factor analysis) or feature extraction.
- The inverse problem can also be used to generate a fixed-length encoding of variable-length data.
- If \(\eta\) is discrete, the inverse problem can be used for clustering.
Inverse problem
- Solving the inverse problem is also known as encoding the data point \(y\).
- One can train a surrogate neural network to approximate \(\eta^* = g(y)\) given any \(y\) without doing the optimization every time.
- This surrogate model is usually called an encoder.
- The optimization-based function itself \(g(y)\) can also be considered an exact encoder.
Inverse problem
- Surrogates can also learn an approximation of the distribution of \(\eta\) values that could have plausibly generated \(y\).
- This distribution is known as the posterior distribution \(p(\eta \mid y)\).
- A surrogate model that approximates \(p(\eta \mid y)\) using variational inference is called a probabilistic encoder or a variational encoder.
- Learning a surrogate model for the inverse problem to avoid solving it for every \(y\) is known as amortized learning.
- This is usually done simultaneously while training the decoder/generator.
Popular deep generative models
Variational autoencoder (VAE)
- Latent variables: standard Gaussian
- Decoder:
    - Probabilistic
    - Neural network based
- Encoder:
    - Approximate
    - Probabilistic
    - Neural network based
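A minimal VAE sketch along these lines, assuming PyTorch and a unit-variance Gaussian decoder for simplicity; it shows the probabilistic encoder, the reparameterization trick and the negative ELBO for one batch, not a full training loop.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE sketch: standard Gaussian prior, Gaussian encoder, Gaussian decoder mean."""
    def __init__(self, data_dim=10, latent_dim=2, hidden=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, latent_dim)       # mean of q(eta | y)
        self.enc_logvar = nn.Linear(hidden, latent_dim)   # log-variance of q(eta | y)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim))  # decoder mean NN_mu(eta)

    def forward(self, y):
        h = self.enc(y)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eta = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        y_mean = self.dec(eta)
        # Negative ELBO = reconstruction term + KL(q(eta | y) || N(0, I)),
        # assuming a unit-variance Gaussian decoder.
        recon = 0.5 * ((y - y_mean) ** 2).sum(dim=1)
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1)
        return (recon + kl).mean()

vae = VAE()
loss = vae(torch.randn(8, 10))   # made-up batch of 8 "data points"
loss.backward()                  # gradients for an optimizer step
```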
Popular deep generative models
Normalizing flow (NF)
- Latent variables: standard Gaussian
- Decoder:
    - Deterministic
    - Any invertible function, often based on neural networks
- Encoder:
    - Exact
    - Deterministic
    - Inverse of the decoder’s function
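A minimal sketch of the idea using a one-dimensional affine flow (parameters `a` and `b` are illustrative; real flows compose many invertible neural-network layers): the decoder is deterministic, the encoder is its exact inverse, and the density follows from the change-of-variables formula.

```python
import numpy as np
from scipy.stats import norm

a, b = 0.5, 2.0   # illustrative flow parameters (fitted to data in practice)

def decode(eta):
    """Deterministic, invertible decoder: y = exp(a) * eta + b."""
    return np.exp(a) * eta + b

def encode(y):
    """Exact encoder: the inverse of the decoder."""
    return (y - b) / np.exp(a)

def log_density(y):
    """Change of variables: log p(y) = log p(eta) - log |d decode / d eta|."""
    eta = encode(y)
    return norm.logpdf(eta) - a   # log |dy/deta| = a for this affine map

rng = np.random.default_rng(0)
eta = rng.normal()                # eta ~ N(0, 1)
y = decode(eta)
assert np.isclose(encode(y), eta)
```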
Conditional generative models
Covariates
(Baseline) covariates
\[
\begin{aligned}
\eta & \sim p(.) \\
y & \sim f(\eta, x_b)
\end{aligned}
\]
- \(x_b\) is the set of baseline covariates.
- \(x_b\) can be an image/scan or its embedding!
Covariates
Time as a covariate
\[
\begin{aligned}
\eta & \sim p(.) \\
y(t) & \sim f(\eta, x_b, t)
\end{aligned}
\]
- \(y\) is observed at a few finite time points but can be simulated for any time point or interval.
Covariates
Time-varying covariates
\[
\begin{aligned}
\eta & \sim p(.) \\
y(t) & \sim f(\eta, x_b, t, x_v(t))
\end{aligned}
\]
- \(x_v(t)\) interpolates/extrapolates the finite covariate measurements to any given model time \(t\).
- \(x_v(t)\) could be a time-varying image/scan or its embedding!
One-compartment PK model as a generative model
Latent variables
\[
\begin{aligned}
\eta_a & \sim \mathcal{N}(0, \omega_{\eta_a}^2) \\
\eta_e & \sim \mathcal{N}(0, \omega_{\eta_e}^2) \\
\eta_v & \sim \mathcal{N}(0, \omega_{\eta_v}^2)
\end{aligned}
\]
- Population-level parameters: \(\omega_{\eta_a}^2\), \(\omega_{\eta_e}^2\), \(\omega_{\eta_v}^2\).
One-compartment PK model as a generative model
Decoder
\[
\begin{aligned}
k_{a,\eta} & = k_a \cdot e^{\eta_a} \\
k_{e,\eta} & = k_e \cdot e^{\eta_e} \\
V_{d,\eta} & = V_d \cdot e^{\eta_v} \\
C(t) & = \frac{D \cdot k_{a,\eta}}{V_{d,\eta} \cdot (k_{a,\eta} - k_{e,\eta})} \left( e^{-k_{e,\eta} \cdot t} - e^{-k_{a,\eta} \cdot t} \right) \\
y(t) & \sim \mathcal{N}(C(t), \sigma^2)
\end{aligned}
\]
One-compartment PK model as a generative model
Decoder
- \(C(t)\): drug concentration at time \(t\).
- \(D\): dose administered.
- \(k_{a,\eta}\), \(k_{e,\eta}\), \(V_{d,\eta}\): individual absorption rate, elimination rate, and volume of distribution, incorporating the random effects \(\eta\).
- \(y(t)\): observed response (concentration).
- \(\sigma^2\): variance of Gaussian residual error, same for all the time points.
- Population-level parameters: \(k_a\), \(k_e\), \(V_d\), \(\sigma^2\).
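A sketch that simulates synthetic concentration-time profiles from this model in Python/NumPy; all parameter values are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population-level parameters (illustrative values).
ka, ke, Vd, sigma2 = 1.5, 0.2, 30.0, 0.25        # absorption, elimination, volume, residual variance
omega_a2, omega_e2, omega_v2 = 0.09, 0.04, 0.04  # variances of the random effects
D = 100.0                                        # dose
t = np.linspace(0.25, 24.0, 20)                  # sampling times

def simulate_subject():
    # Latent variables (random effects): one set per subject.
    eta_a = rng.normal(0.0, np.sqrt(omega_a2))
    eta_e = rng.normal(0.0, np.sqrt(omega_e2))
    eta_v = rng.normal(0.0, np.sqrt(omega_v2))
    ka_i, ke_i, Vd_i = ka * np.exp(eta_a), ke * np.exp(eta_e), Vd * np.exp(eta_v)
    # Decoder: analytic one-compartment solution with first-order absorption.
    C = D * ka_i / (Vd_i * (ka_i - ke_i)) * (np.exp(-ke_i * t) - np.exp(-ka_i * t))
    # Residual (within-subject) error.
    return rng.normal(C, np.sqrt(sigma2))

y = np.stack([simulate_subject() for _ in range(10)])   # 10 synthetic subjects
```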
Visual predictive checks
- Two univariate distributions are identical if all their respective quantiles are equal to one another.
- The visual predictive check (VPC) plot commonly checks 3 quantile levels, \([0.1, 0.5, 0.9]\).
Visual predictive checks
- For each quantile level, if the observed data’s quantile as a function of time (or any other covariate) differs from the generated data’s quantile (i.e. lies outside the simulated quantile interval), the two distributions are not similar.
- A bad VPC is more informative than a good one.
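A sketch of the quantile comparison behind a VPC, assuming observed and simulated responses are already arranged on a common time grid; the array shapes and the 90% simulation interval are my assumptions.

```python
import numpy as np

q_levels = [0.1, 0.5, 0.9]

def vpc_quantiles(y_obs, y_sim, q=q_levels):
    """Compare observed quantiles with simulated quantile intervals per time point.

    y_obs: (n_subjects, n_times) observed responses.
    y_sim: (n_replicates, n_subjects, n_times) simulated datasets.
    """
    obs_q = np.quantile(y_obs, q, axis=0)         # (3, n_times) observed quantiles
    sim_q = np.quantile(y_sim, q, axis=1)         # (3, n_replicates, n_times)
    sim_lo = np.quantile(sim_q, 0.05, axis=1)     # (3, n_times) lower simulated band
    sim_hi = np.quantile(sim_q, 0.95, axis=1)     # (3, n_times) upper simulated band
    outside = (obs_q < sim_lo) | (obs_q > sim_hi) # flags possible misfit per time point
    return obs_q, sim_lo, sim_hi, outside

rng = np.random.default_rng(0)
y_obs = rng.normal(size=(40, 12))                 # made-up: 40 subjects, 12 time points
y_sim = rng.normal(size=(200, 40, 12))            # made-up: 200 simulated datasets
obs_q, lo, hi, outside = vpc_quantiles(y_obs, y_sim)
```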
Deep learning fundamentals
Training
- Find network parameters \(W\) that minimize a suitable cost/loss function \[
C(W; x, y) = \sum_{i=1}^N L(y_i, \hat{y}(x_i; W))
\]
- One common choice of \(L(y_i, \hat{y}(x_i; W))\) is: \[
L(y_i, \hat{y}(x_i; W)) = (y_i - \hat{y}(x_i; W))^2
\]
- Iteratively refine the parameters by stochastic gradient descent \[
W^{\text{new}} = W^{\text{old}} - \alpha \cdot \nabla_W C(W; x, y)
\]
- Gradient computed by backpropagation (Rumelhart et al., 1986)
- \(\alpha\) is known as the learning rate
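A minimal sketch of this update rule on a squared-error loss, using mini-batch gradient descent for a linear model with a hand-coded gradient standing in for backpropagation; the data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                                    # made-up inputs
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)  # made-up targets

W = np.zeros(3)          # network parameters (a linear model for simplicity)
alpha = 0.05             # learning rate

for step in range(500):
    idx = rng.choice(len(y), size=32, replace=False)   # mini-batch -> stochastic GD
    xb, yb = x[idx], y[idx]
    y_hat = xb @ W
    grad = -2.0 * xb.T @ (yb - y_hat) / len(yb)        # gradient of the squared loss
    W = W - alpha * grad                               # W_new = W_old - alpha * grad
print(W)
```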
Negative log likelihood as a loss function
The negative log likelihood generalizes the sum of squared errors loss function.
Normal distribution
\[
\begin{aligned}
y_i & \sim \mathcal{N}(\hat{y}(x_i; W), \sigma^2) \\
-\log P(y_i) & = \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{(y_i - \hat{y}(x_i; W))^2}{2 \cdot \sigma^2}
\end{aligned}
\]
Setting \(\sigma^2 = \frac{1}{2}\): \[
\begin{aligned}
-\log P(y) & = -\sum_{i=1}^N \log P(y_i) \\
& = \underbrace{-\frac{N}{2} \cdot \log (2) + \frac{N}{2} \cdot \log (2 \pi)}_{\text{constant}} + \underbrace{\sum_{i=1}^{N} (y_i - \hat{y}(x_i; W))^2}_{C(W; x, y)}
\end{aligned}
\]
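A quick numeric check of this identity, using made-up data and `scipy.stats.norm` for the log density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(size=50)
y_hat = rng.normal(size=50)       # made-up predictions

sigma = np.sqrt(0.5)              # sigma^2 = 1/2
nll = -norm.logpdf(y, loc=y_hat, scale=sigma).sum()
sse = ((y - y_hat) ** 2).sum()
constant = len(y) * (np.log(sigma) + 0.5 * np.log(2 * np.pi))

assert np.isclose(nll, sse + constant)   # NLL = SSE + constant
```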
Feed-forward neural networks
- The basic interface is largely dictated by the task
- Number of input units
- Number of output units
- Activation function for the output layer
- The architecture must be selected carefully
- Number of hidden layers
- Number of units per hidden layer
- Activation function for the hidden layers
- Parameters of the optimization algorithm
Activation functions
- At the output layer we typically use
    - the identity function for regression tasks,
    - the sigmoid function for binary classification tasks,
    - the softmax function for multi-class classification tasks, and
    - the softplus function to ensure non-negativity.
- At the hidden layers,
    - traditional activations are the sigmoid and tanh functions,
    - but deeper networks can suffer from vanishing gradients with them.
    - In this case, experiment with the ReLU and SELU functions.
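For reference, minimal NumPy implementations of the activations mentioned above (SELU omitted; it follows the same pattern with fixed scaling constants):

```python
import numpy as np

def identity(z):       # regression outputs
    return z

def sigmoid(z):        # binary classification outputs
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):        # multi-class classification outputs
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def softplus(z):       # non-negative outputs (e.g. variances)
    return np.log1p(np.exp(z))

def relu(z):           # common hidden-layer activation in deep networks
    return np.maximum(z, 0.0)
```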
Architecture selection
- Typically based on empirical investigation and assessment on withheld data
- Initial exploration to get a feel for reasonable network dimensions
- Hyperparameter search
    - Design a grid of hyperparameters based on the previous exploration
    - If the budget is tight, prioritize the learning rate
    - Exhaustive, random or Bayesian search
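A sketch of such a hyperparameter grid search; `train_and_evaluate` is a placeholder standing in for fitting the network and scoring it on withheld data, and the grid values are illustrative.

```python
import itertools, random

def train_and_evaluate(config):
    """Placeholder for illustration: in practice, train the network with these
    hyperparameters and return the loss on withheld validation data."""
    return random.random()

# Hypothetical grid informed by an initial exploration.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_layers": [1, 2, 3],
    "units_per_layer": [32, 64, 128],
}

best_score, best_config = float("inf"), None
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(config)
    if score < best_score:
        best_score, best_config = score, config
print(best_config)
```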
Under-fitting and over-fitting
Strategies to prevent over-fitting
- Early stopping based on withheld data
- Careful data splitting to avoid data leakage
- Networks as large as necessary, but no larger
- Larger and more diverse training datasets
- Regularization
    - Penalization of the cost function
    - Dropout (Srivastava et al., 2014)
- Data augmentation
Penalization of the cost function
\[
\text{arg min}_W \, C(W; x, y)
\]
- Penalized cost function reduces expressiveness
\[
\text{arg min}_W \, C(W; x, y) + \lambda \cdot \left\lVert W \right\rVert_p^p
\]
- \(\left\lVert W \right\rVert_p\) is the \(L_p\) norm of the parameters \(W\).
- \(\lambda\) is the regularization strength hyperparameter.
Regularization types
- \(L_2\) regularization
- Does not promote sparsity of \(W\)
- Equivalent to the regularization used in ridge regression. \[
\left\lVert W \right\rVert_2^2 = \sum_{i=1}^{n_w} W_i^2
\]
- \(L_1\) regularization
- More likely to give sparse weights \(W\)
- Equivalent to the regularization used in LASSO regression. \[
\left\lVert W \right\rVert_1 = \sum_{i=1}^{n_w} |W_i|
\]
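A small sketch of adding either penalty to a generic cost value (names are mine):

```python
import numpy as np

def penalized_cost(cost, W, lam, p):
    """cost + lambda * ||W||_p^p for p in {1, 2}."""
    if p == 1:
        penalty = np.sum(np.abs(W))      # L1: promotes sparse weights (LASSO-like)
    elif p == 2:
        penalty = np.sum(W ** 2)         # L2: shrinks weights (ridge-like)
    else:
        raise ValueError("p must be 1 or 2 in this sketch")
    return cost + lam * penalty

W = np.array([0.5, -1.2, 0.0, 2.0])
print(penalized_cost(cost=3.7, W=W, lam=0.01, p=2))
```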
Regularization using prior distributions
- Prior distributions generalize the traditional \(L_1\) and \(L_2\) regularization.
- Maximizing the log likelihood + log prior is called maximum a-posteriori (MAP) estimation.
Normal prior is \(L_2\) regularization
\[
\begin{aligned}
W & \sim \mathcal{N}(0, \sigma^2 \cdot I) \\
-\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{W_i^2}{2 \cdot \sigma^2} \Bigg) \\
& = \underbrace{n_w \cdot \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) \Bigg)}_{\text{constant}} + \underbrace{\frac{1}{2 \cdot \sigma^2}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} W_i^2}_{\left\lVert W \right\rVert_2^2} \\
& = \lambda \cdot \left\lVert W \right\rVert_2^2 + \text{constant}
\end{aligned}
\]
Regularization using prior distributions
Laplace prior is \(L_1\) regularization
\[
\begin{aligned}
W_i & \sim \text{Laplace}(0, b) \\
-\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (2 \cdot b) + \frac{|W_i|}{b} \Bigg) \\
& = \underbrace{n_w \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} |W_i|}_{\left\lVert W \right\rVert_1} \\
& = \lambda \cdot \left\lVert W \right\rVert_1 + \text{constant}
\end{aligned}
\]