Goal
To generate synthetic data that is indistinguishable in distribution from the real data.
Time-series / longitudinal / panel data
Key point
The model structure must match the structure of the data, e.g.:
- Temporal models are used for time-series data, and
- Spatial models for image/scan/spatial data.
Model structure
\[
\begin{aligned}
\eta & \sim \mathcal{N} ( 0, 1 ) \\
y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg)
\end{aligned}
\]
Model structure
Latent variables
The latent representation \(\eta\) of the data point is:
- Low-dimensional and fixed-length
- Modelled as a random variable sampled from a parametric prior distribution to allow generating data.
- Also known as a latent variable, embedding variable, latent feature or random effect.
- If \(\eta\) is discrete, it is commonly known as a cluster.
- The latent variable \(\eta\) is per point/subject/image.
- The prior distribution is shared between all points/subjects/images.
Model structure
Decoder
- The parametric function \(f(\eta)\) is also known as a probabilistic decoder, generator, data model or error model.
- The architecture/structure of the decoder \(f\) must match the structure of the data.
Model structure
Parameters
- The parameters of the decoder \(f\) and the prior distribution of \(\eta\) are shared between all the points/subjects/images.
- The parameters of \(f\) and the prior distribution of \(\eta\) are known as population-level parameters in the hierarchical statistics literature.
- A subset of the parameters of \(f\) are also sometimes called fixed effects in the mixed effects literature.
- In deep learning, all parameters are simply called weights, or weights and biases.
Generative deep learning models
Decoder
The decoder \(f\) returns a distribution whose parameters are described by one or more neural networks. Example:
\[
y \sim f(\eta) = \mathcal{N} ( \text{NN}_{\mu}(\eta), \text{NN}_{\sigma^2}(\eta) )
\]
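As a rough illustration, here is a minimal numpy sketch of such a decoder: two small, hypothetical, randomly initialized networks map \(\eta\) to the mean and variance of a normal distribution over \(y\). The layer sizes and weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, randomly initialized toy networks: a shared hidden layer
# followed by a mean head (NN_mu) and a log-variance head (NN_sigma2).
W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)
W_mu, b_mu = rng.normal(size=(3, 8)), np.zeros(3)
W_s, b_s = rng.normal(size=(3, 8)), np.zeros(3)

def decode(eta):
    h = np.tanh(W1 @ eta + b1)        # shared hidden layer
    mu = W_mu @ h + b_mu              # NN_mu(eta)
    sigma2 = np.exp(W_s @ h + b_s)    # NN_sigma2(eta), positive by construction
    return mu, sigma2

eta = rng.normal(size=2)              # eta ~ N(0, I)
mu, sigma2 = decode(eta)
y = rng.normal(mu, np.sqrt(sigma2))   # y ~ N(NN_mu(eta), NN_sigma2(eta))
```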
Latent variables
- \(\eta\) is a vector of random degrees of freedom that the neural networks can use in any way they see fit to generate plausible data.
- What each latent variable does to \(f(\eta)\) and \(y\) is emergent behavior and not something engineered into the model.
\[
\eta \sim \mathcal{N}(0, I)
\]
Generative pharmacometric models
Decoder
The decoder \(f\) returns a distribution (e.g. a normal distribution) whose parameters (e.g. mean and covariance) are described by either:
- A mechanistic model, often based on differential equations, aka a scientific model,
- A fully empirical model, aka a machine learning model, or
- A semi-mechanistic model, aka a scientific machine learning model.
Generative pharmacometric models
Latent variables
- \(\eta\) is interpreted as a subject’s deviation from the population’s typical value for a specific individual quantity.
- For example, the individual clearance is \(\text{CL} = \text{tvcl} \cdot \exp(\eta_{\text{CL}})\).
- The typical value \(\text{tvcl}\) is a population-level parameter (part of the decoder).
- The prior distribution’s parameters are also population-level parameters.
- The prior distribution of \(\eta\) models the between-subject variability not explained by the covariates.
Sampling from the generative model
Sampling from the generative model is a 3-step process:
- Sample a random value for the latent representation \(\eta\) from its prior distribution,
- Decode \(\eta\) using \(f(\eta)\) to produce a distribution for \(y\), conditional on \(\eta\), and
- Sample from the conditional distribution of \(y\).
\[
\begin{aligned}
\eta & \sim \mathcal{N} ( 0, 1 ) \\
y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg)
\end{aligned}
\]
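The three steps map directly onto code. Below is a minimal numpy sketch using the toy linear-Gaussian model above; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1: sample the latent representation from its prior, eta ~ N(0, 1).
eta = rng.normal(0.0, 1.0)

# Step 2: decode eta into a conditional distribution for y
# (toy model above: mean = [1, 1]*eta + [2, 2], covariance = 0.2*I).
mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
cov = 0.2 * np.eye(2)

# Step 3: sample y from the conditional distribution.
y = rng.multivariate_normal(mean, cov)
```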
Fitting the generative model
- Goal: find the best values for the parameters of \(f\) and the prior distribution of \(\eta\) by some definition of best.
- One common fitting objective is maximizing the log marginal likelihood (aka evidence):
\[
\sum_{i=1}^N \log \int p(y_i \mid \eta) \cdot p(\eta) \, d\eta
\]
- \(p(y_i \mid \eta)\) is the probability of \(y = y_i\) according to the distribution \(f(\eta)\).
- \(p(\eta)\) is the probability of \(\eta\) according to its prior distribution.
- \(N\) is the number of points/images/subjects in the training data.
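For the toy model above, the integral inside the sum can be approximated by Monte Carlo, averaging \(p(y_i \mid \eta)\) over draws from the prior. A minimal sketch, with an arbitrary observed value and number of draws:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def log_marginal_likelihood(y_i, n_draws=10_000):
    # Monte Carlo approximation of log ∫ p(y_i | eta) p(eta) d(eta),
    # using draws from the prior p(eta) = N(0, 1).
    etas = rng.normal(0.0, 1.0, size=n_draws)
    means = np.outer(etas, [1.0, 1.0]) + np.array([2.0, 2.0])
    # log p(y_i | eta) under the toy decoder N(mean, 0.2 * I)
    log_p = norm.logpdf(y_i, loc=means, scale=np.sqrt(0.2)).sum(axis=1)
    # log-mean-exp for numerical stability
    return np.logaddexp.reduce(log_p) - np.log(n_draws)

y_obs = np.array([2.5, 2.4])
print(log_marginal_likelihood(y_obs))
```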
Inverse problem
- Instead of generating a synthetic value \(y\) from a random value \(\eta\), we can use the generative model in reverse.
- Given an actual observed value \(y\), we can infer the latent variable \(\eta^*\) that best represents this data point \(y\).
- This is done by solving the following optimization problem for some dissimilarity function \(D\).
- \(D\) should quantify the dissimilarity between the distribution \(f(\eta)\) and the observation \(y\).
\[
\eta^* = g(y) = \text{argmin}_{\eta} D(f(\eta), y)
\]
Inverse problem
- In mixed effects modelling, a common choice for the dissimilarity function \(D(f(\eta), y)\) is:
\[
-\log p(y \mid \eta) - \log p(\eta)
\]
- \(-\log p(y \mid \eta)\) quantifies the dissimilarity between \(f(\eta)\) and \(y\).
- \(-\log p(\eta)\) acts as regularization, which reduces over-fitting when there is a large noise component in \(y\) or when \(y\) is an outlier.
- Over-fitting in this context is often measured using \(\epsilon\)-shrinkage in the mixed effects literature.
- When using this particular definition of \(D\), \(\eta^*\) is called the empirical Bayes estimate.
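A minimal sketch of the empirical Bayes estimate for the toy model, minimizing \(-\log p(y \mid \eta) - \log p(\eta)\) with scipy; the observed value is arbitrary:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Empirical Bayes estimate for the toy model: minimize
#   D(f(eta), y) = -log p(y | eta) - log p(eta).
y_obs = np.array([2.5, 2.4])

def dissimilarity(eta):
    mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
    log_lik = norm.logpdf(y_obs, loc=mean, scale=np.sqrt(0.2)).sum()
    log_prior = norm.logpdf(eta, loc=0.0, scale=1.0)
    return -(log_lik + log_prior)

eta_star = minimize_scalar(dissimilarity).x   # the empirical Bayes estimate
```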
Inverse problem
- More generally, in machine learning, \(\eta^*\) is called an embedding.
- In this optimization, the parameters of \(f\) and the prior distribution are kept fixed.
- Since the dimension of \(\eta\) is often less than that of \(y\), the inverse problem can be used for dimension reduction (aka factor analysis) or feature extraction.
- The inverse problem can also be used to generate a fixed-length encoding of variable-length data.
- If \(\eta\) is discrete, the inverse problem can be used for clustering.
Inverse problem
Solving the inverse problem is also known as encoding the data point \(y\).
One can train a surrogate neural network to approximate \(\eta^* = g(y)\) given any \(y\) without doing the optimization every time.
This surrogate model is usually called an encoder.
The optimization-based function itself \(g(y)\) can also be considered an exact encoder.
Inverse problem
Surrogates can also learn an approximation of the distribution of \(\eta\) values that could have plausibly generated \(y\).
This distribution is known as the posterior distribution \(p(\eta \mid y)\).
A surrogate model that approximates \(p(\eta \mid y)\) using variational inference is called a probabilistic encoder or a variational encoder.
Learning a surrogate model for the inverse problem to avoid solving it for every \(y\) is known as amortized learning.
- This is usually done simultaneously with training the decoder/generator.
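As a rough illustration of amortization, the sketch below simulates \((\eta, y)\) pairs from the toy generative model and fits a linear least-squares regression from \(y\) to \(\eta\) as a crude stand-in for a neural encoder; a real probabilistic encoder (e.g. in a VAE) would be a neural network trained jointly with the decoder.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate (eta, y) pairs from the toy generative model.
etas = rng.normal(0.0, 1.0, size=5000)
ys = (np.outer(etas, [1.0, 1.0]) + np.array([2.0, 2.0])
      + rng.normal(scale=np.sqrt(0.2), size=(5000, 2)))

# Fit a least-squares regression y -> eta (a crude amortized encoder).
X = np.column_stack([ys, np.ones(len(ys))])   # add an intercept column
coef, *_ = np.linalg.lstsq(X, etas, rcond=None)

def encoder(y):
    # Amortized approximation of eta* = g(y): no per-point optimization needed.
    return np.append(y, 1.0) @ coef
```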
Popular deep generative models
Variational autoencoder (VAE)
Popular deep generative models
Normalizing flow (NF)
Conditional generative models
Covariates
(Baseline) covariates
\[
\begin{aligned}
\eta & \sim p(.) \\
y & \sim f(\eta, x_b)
\end{aligned}
\]
- \(x_b\) is the set of baseline covariates.
- \(x_b\) can be an image/scan or its embedding!
Covariates
Time as a covariate
\[
\begin{aligned}
\eta & \sim p(.) \\
y(t) & \sim f(\eta, x_b, t)
\end{aligned}
\]
- \(y\) is observed at a few finite time points but can be simulated for any time point or interval.
Covariates
Time-varying covariates
\[
\begin{aligned}
\eta & \sim p(.) \\
y(t) & \sim f(\eta, x_b, t, x_v(t))
\end{aligned}
\]
- \(x_v(t)\) interpolates/extrapolates the finite covariate measurements to any given model time \(t\).
- \(x_v(t)\) could be a time-varying image/scan or its embedding!
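A minimal sketch of \(x_v(t)\), using hypothetical body-weight measurements and piecewise-linear interpolation (with constant extrapolation beyond the first and last measurement):

```python
import numpy as np

# Hypothetical covariate measurements (e.g. body weight) at a few visit times.
t_obs = np.array([0.0, 24.0, 48.0])
wt_obs = np.array([70.0, 71.5, 69.8])

def x_v(t):
    # Piecewise-linear interpolation so the decoder can query
    # the covariate at any model time t.
    return np.interp(t, t_obs, wt_obs)

x_v(30.0)   # covariate value at a time where no measurement was taken
```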
One-compartment PK model as a generative model
Latent variables
\[
\begin{aligned}
\eta_a & \sim \mathcal{N}(0, \omega_{\eta_a}^2) \\
\eta_e & \sim \mathcal{N}(0, \omega_{\eta_e}^2) \\
\eta_v & \sim \mathcal{N}(0, \omega_{\eta_v}^2)
\end{aligned}
\]
- Population-level parameters: \(\omega_{\eta_a}^2\), \(\omega_{\eta_e}^2\), \(\omega_{\eta_v}^2\).
One-compartment PK model as a generative model
Decoder
\[
\begin{aligned}
k_{a,\eta} & = k_a \cdot e^{\eta_a} \\
k_{e,\eta} & = k_e \cdot e^{\eta_e} \\
V_{d,\eta} & = V_d \cdot e^{\eta_v} \\
C(t) & = \frac{D \cdot k_{a,\eta}}{V_{d,\eta} \cdot (k_{a,\eta} - k_{e,\eta})} \left( e^{-k_{e,\eta} \cdot t} - e^{-k_{a,\eta} \cdot t} \right) \\
y(t) & \sim \mathcal{N}(C(t), \sigma^2)
\end{aligned}
\]
One-compartment PK model as a generative model
Decoder
- \(C(t)\): drug concentration at time \(t\).
- \(D\): dose administered.
- \(k_{a,\eta}\), \(k_{e,\eta}\), \(V_{d,\eta}\): individual absorption rate, elimination rate, and volume of distribution with random effects \(\eta\).
- \(y(t)\): observed response (concentration).
- \(\sigma^2\): variance of Gaussian residual error, same for all the time points.
- Population-level parameters: \(k_a\), \(k_e\), \(V_d\), \(\sigma^2\).
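Putting the pieces together, a minimal numpy sketch that simulates one subject from this generative model; all parameter values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population-level parameters.
ka, ke, Vd = 1.0, 0.1, 20.0              # typical absorption/elimination rates, volume
omega_a, omega_e, omega_v = 0.3, 0.2, 0.25
sigma = 0.5
D = 100.0                                # dose

# Latent variables (random effects) for one simulated subject.
eta_a = rng.normal(0.0, omega_a)
eta_e = rng.normal(0.0, omega_e)
eta_v = rng.normal(0.0, omega_v)

ka_i, ke_i, Vd_i = ka * np.exp(eta_a), ke * np.exp(eta_e), Vd * np.exp(eta_v)

t = np.linspace(0.5, 24.0, 12)           # observation times
C = D * ka_i / (Vd_i * (ka_i - ke_i)) * (np.exp(-ke_i * t) - np.exp(-ka_i * t))
y = rng.normal(C, sigma)                 # observed concentrations with residual error
```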
Visual predictive checks
- Two univariate distributions are identical if all their respective quantiles are equal to one another.
- The visual predictive check (VPC) plot commonly checks 3 quantile levels, \([0.1, 0.5, 0.9]\).
Visual predictive checks
- For each quantile level, if the observed data’s quantile as a function of time (or any other covariate) differs from the generated data’s quantile (i.e. lies outside the simulated quantile interval), the two distributions are not similar.
- A bad VPC is more informative than a good one.
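A minimal sketch of the quantile comparison behind a VPC, using hypothetical placeholder arrays in place of real observed and simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical placeholder data: replace with real observations and with
# simulations from the fitted generative model.
y_obs = rng.lognormal(mean=1.0, sigma=0.3, size=(40, 8))        # (n_subjects, n_times)
y_sim = rng.lognormal(mean=1.0, sigma=0.3, size=(200, 40, 8))   # (n_replicates, n_subjects, n_times)

levels = [0.1, 0.5, 0.9]
obs_q = np.quantile(y_obs, levels, axis=0)           # observed quantiles, (3, n_times)
sim_q = np.quantile(y_sim, levels, axis=1)           # simulated quantiles, (3, n_replicates, n_times)
lo, hi = np.quantile(sim_q, [0.05, 0.95], axis=1)    # 90% simulated quantile intervals

outside = (obs_q < lo) | (obs_q > hi)   # True where an observed quantile falls outside its interval
```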
Deep learning fundamentals
Training
- Find network parameters \(W\) that minimize a suitable cost/loss function \[
C(W; x, y) = \sum_{i=1}^N L(y_i, \hat{y}(x_i; W))
\]
- One common choice of \(L(y_i, \hat{y}(x_i; W))\) is: \[
L(y_i, \hat{y}(x_i; W)) = (y_i - \hat{y}(x_i; W))^2
\]
- Iteratively refine the parameters by stochastic gradient descent \[
W^{\text{new}} = W^{\text{old}} - \alpha \cdot \nabla_W C(W; x, y)
\]
- Gradient computed by backpropagation (Rumelhart et al., 1986)
- \(\alpha\) is known as the learning rate
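A minimal sketch of this training loop for a linear model \(\hat{y} = x W\), used as a stand-in for a network with an analytic gradient instead of backpropagation; the data, batch size and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy training data for a linear model y_hat = x @ W.
x = rng.normal(size=(256, 3))
true_W = np.array([1.5, -2.0, 0.5])
y = x @ true_W + rng.normal(scale=0.1, size=256)

W = np.zeros(3)
alpha = 0.05                                         # learning rate

for step in range(200):
    batch = rng.choice(256, size=32, replace=False)  # stochastic mini-batch
    xb, yb = x[batch], y[batch]
    residual = xb @ W - yb
    grad = 2 * xb.T @ residual / len(batch)          # gradient of the mean squared-error cost
    W = W - alpha * grad                             # SGD update
```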
Negative log likelihood as a loss function
The negative log likelihood generalizes the sum of squared errors loss function.
Normal distribution
\[
\begin{aligned}
y_i & \sim \mathcal{N}(\hat{y}(x_i; W), \sigma^2) \\
-\log P(y_i) & = \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{(y_i - \hat{y}(x_i; W))^2}{2 \cdot \sigma^2}
\end{aligned}
\]
Setting \(\sigma^2 = \frac{1}{2}\): \[
\begin{aligned}
-\log P(y) & = -\sum_{i=1}^N \log P(y_i) \\
& = \underbrace{-\frac{N}{2} \cdot \log (2) + \frac{N}{2} \cdot \log (2 \pi)}_{\text{constant}} + \underbrace{\sum_{i=1}^{N} (y_i - \hat{y}(x_i; W))^2}_{C(W; x, y)}
\end{aligned}
\]
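A quick numerical check of this identity on arbitrary data, using scipy’s normal log-density:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

y = rng.normal(size=100)
y_hat = rng.normal(size=100)

# Negative log likelihood under N(y_hat, sigma^2) with sigma^2 = 1/2 ...
nll = -norm.logpdf(y, loc=y_hat, scale=np.sqrt(0.5)).sum()

# ... equals the sum of squared errors plus a constant that does not depend on y_hat.
N = len(y)
constant = -N / 2 * np.log(2) + N / 2 * np.log(2 * np.pi)
sse = np.sum((y - y_hat) ** 2)

assert np.isclose(nll, constant + sse)
```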
Feed-forward neural networks
Under-fitting and over-fitting
Strategies to prevent over-fitting
- Early stopping based on withheld data
- Careful data splitting to avoid data leakage
- Networks as large as possible, but not larger
- Larger and more diverse training datasets
Regularization
- Penalization of the cost function
- Dropout (Srivastava et al., 2014)
- Data augmentation
Penalization of the cost function
\[
\text{arg min}_W \, C(W; x, y)
\]
- Penalized cost function reduces expressiveness
\[
\text{arg min}_W \, C(W; x, y) + \lambda \cdot \left\lVert W \right\rVert_p^p
\]
- \(\left\lVert W \right\rVert_p\) is the \(L_p\) norm of the parameters \(W\).
- \(\lambda\) is the regularization strength hyperparameter.
Regularization types
\(L_2\) regularization
- Does not promote sparsity of \(W\)
- Equivalent to the regularization used in ridge regression. \[
\left\lVert W \right\rVert_2^2 = \sum_{i=1}^{n_w} W_i^2
\]
\(L_1\) regularization
- More likely to give sparse weights \(W\)
- Equivalent to the regularization used in LASSO regression. \[
\left\lVert W \right\rVert_1 = \sum_{i=1}^{n_w} |W_i|
\]
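A minimal sketch of a squared-error cost with an \(L_1\) or \(L_2\) penalty added; the function and data below are purely illustrative:

```python
import numpy as np

def penalized_cost(W, x, y, lam, p):
    # Unpenalized squared-error cost plus an L_p penalty on the weights.
    residuals = y - x @ W
    cost = np.sum(residuals ** 2)
    if p == 2:
        penalty = np.sum(W ** 2)        # ridge-style penalty, shrinks but rarely zeroes weights
    elif p == 1:
        penalty = np.sum(np.abs(W))     # LASSO-style penalty, tends to produce sparse weights
    else:
        raise ValueError("p must be 1 or 2")
    return cost + lam * penalty

rng = np.random.default_rng(8)
x, y, W = rng.normal(size=(50, 4)), rng.normal(size=50), rng.normal(size=4)
penalized_cost(W, x, y, lam=0.1, p=1)
```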
Regularization using prior distributions
- Prior distributions generalize the traditional \(L_1\) and \(L_2\) regularization.
- Maximizing the log likelihood + log prior is called maximum a posteriori (MAP) estimation.
Normal prior is \(L_2\) regularization
\[
\begin{aligned}
W & \sim \mathcal{N}(0, \sigma^2 \cdot I) \\
-\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{W_i^2}{2 \cdot \sigma^2} \Bigg) \\
& = \underbrace{n_w \cdot \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) \Bigg)}_{\text{constant}} + \underbrace{\frac{1}{2 \cdot \sigma^2}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} W_i^2}_{\left\lVert W \right\rVert_2^2} \\
& = \lambda \cdot \left\lVert W \right\rVert_2^2 + \text{constant}
\end{aligned}
\]
Regularization using prior distributions
Laplace prior is \(L_1\) regularization
\[
\begin{aligned}
W_i & \sim \text{Laplace}(0, b) \\
-\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (2 \cdot b) + \frac{|W_i|}{b} \Bigg) \\
& = \underbrace{n_w \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} |W_i|}_{\left\lVert W \right\rVert_1} \\
& = \lambda \cdot \left\lVert W \right\rVert_1 + \text{constant}
\end{aligned}
\]