A generative model is, at its core, a probability distribution \(p(\mathbf{y})\) over the data space.
“Generative model” is largely a fancy name for a probability distribution that is:
Once you have a distribution that approximates \(p_{\text{data}}(\mathbf{y})\), almost everything a “generative model” does reduces to one of a handful of standard operations on that distribution.
Given a distribution \(p(\mathbf{y})\), we typically want to:
Important
A generative model is “useful” to the extent that these operations are tractable for it.
Important
Generative models can be used for more than just generating new data points!
Most “applications” of generative models are compositions of the basic operations above:
Given labelled data \((\mathbf{x}, \mathbf{y})\), learn the joint distribution \(p(\mathbf{x}, \mathbf{y})\) using a generative model.
Make a prediction for a given \(\mathbf{x}\) by:
If the generative model has a closed-form marginal likelihood \(p(\mathbf{y})\) that matches a standard distribution, e.g. a Gaussian, we can sample from it directly.
If the generative model has latent variables \(\boldsymbol{z}\) and a closed-form conditional distribution \(p(\mathbf{y} \mid \boldsymbol{z})\), we can sample from the model distribution using ancestral sampling:
If the model is an energy-based model, sample with score-based MCMC (e.g. Langevin dynamics), which needs only the Stein score \(\nabla_{\mathbf{y}} \log p(\mathbf{y})\) and not the normalizing constant.
If the model is a diffusion model, sample by solving the reverse SDE/ODE from noise to data.
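As an illustration of the score-based MCMC route mentioned above, here is a minimal sketch of unadjusted Langevin dynamics. It assumes a toy energy-based model whose score is known in closed form (a Gaussian with hypothetical mean `mu` and precision matrix `precision`); the step size and number of steps are illustrative choices, not recommendations.

```python
import numpy as np

def score(y, mu, precision):
    """Score of a Gaussian energy model: grad_y log p(y) = -P (y - mu), P symmetric."""
    return -(y - mu) @ precision

def langevin_sample(score_fn, y0, step=0.01, n_steps=2000, rng=None):
    """Unadjusted Langevin dynamics: y <- y + step * score(y) + sqrt(2 * step) * noise."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(y.shape)
        y = y + step * score_fn(y) + np.sqrt(2.0 * step) * noise
    return y

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                      # hypothetical target mean
precision = np.array([[2.0, 0.0],
                      [0.0, 0.5]])              # hypothetical target precision
y0 = rng.standard_normal((5000, 2))             # 5000 independent chains
samples = langevin_sample(lambda y: score(y, mu, precision), y0, rng=rng)
```

The sampler never evaluates the normalizing constant, only the score. Because no Metropolis correction is applied, the chains target the distribution only approximately, with a bias that shrinks as the step size decreases.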
| Model class | Sample | Log \(p(\mathbf{y})\) | Marginalize | Condition |
|---|---|---|---|---|
| Gaussian / Probabilistic PCA | ✓ | ✓ | ✓ | ✓ |
| Naive Bayes / Probabilistic Circuits | ✓ | ✓ | ✓ | ✓ |
| Gaussian Discriminant Analysis | ✓ | ✓ | ✓ | ✓ |
| (Continuous) Normalizing Flows | ✓ | ✓ | ✗ | ✗ |
| Variational Autoencoder / Diffusion | ✓ | ≈ | ✗ | ✗ |
| Energy-Based Models | ✗ | ✗ | ✗ | ✓ |
| GANs | ✓ | ✗ | ✗ | ✗ |
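To make the first row of the table concrete, the following sketch carries out all four operations (sample, evaluate the log density, marginalize, condition) for a multivariate Gaussian using NumPy. The mean `mu`, covariance `Sigma`, and the index split `a`/`b` are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

def gaussian_logpdf(y, mean, cov):
    """Log density of N(mean, cov) evaluated at a single point y."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(mean) * np.log(2 * np.pi) + logdet + maha)

# 1. Sample from p(y).
samples = rng.multivariate_normal(mu, Sigma, size=20000)

# 2. Evaluate log p(y) at a point.
y = np.array([0.5, 0.5, 0.5])
logp = gaussian_logpdf(y, mu, Sigma)

# 3. Marginalize: p(y_a) is obtained by slicing out the kept coordinates.
a, b = [0, 1], [2]
mu_a, Sigma_aa = mu[a], Sigma[np.ix_(a, a)]

# 4. Condition: p(y_a | y_b) is again Gaussian.
gain = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
mu_cond = mu_a + gain @ (y[b] - mu[b])
Sigma_cond = Sigma_aa - gain @ Sigma[np.ix_(b, a)]
```

A useful sanity check is the chain rule \(\log p(\mathbf{y}) = \log p(\mathbf{y}_a \mid \mathbf{y}_b) + \log p(\mathbf{y}_b)\), which the quantities above satisfy up to floating-point error.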
A simple Gaussian generative model for data \(\mathbf{y} \in \mathbb{R}^D\) is given by:
\[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
Constraining the covariance matrix to the form \(\boldsymbol{\Sigma} = \mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}\) gives the Probabilistic PCA model, whose marginal likelihood is an integral over a latent variable \(\boldsymbol{z}\):
\[ p(\mathbf{y}) = \int p(\mathbf{y} \mid \boldsymbol{z}) \, p(\boldsymbol{z}) \, d\boldsymbol{z} \]
where
\[ \begin{aligned} \boldsymbol{z} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} &\sim \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu} + \mathbf{W}\boldsymbol{z}, \sigma^2 \mathbf{I}) \end{aligned} \]
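The two sampling statements above are exactly an ancestral sampling recipe: draw \(\boldsymbol{z}\) first, then draw \(\mathbf{y}\) given \(\boldsymbol{z}\). The sketch below does this with NumPy and compares the empirical covariance of the samples against \(\mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}\). The dimensions and parameter values are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 4, 2, 50000                  # data dim, latent dim, sample count (illustrative)
mu = np.array([1.0, 0.0, -1.0, 2.0])   # hypothetical mean
W = rng.standard_normal((D, K))        # hypothetical loading matrix
sigma = 0.5                            # hypothetical noise standard deviation

# Ancestral sampling: first the latent, then the observation given the latent.
z = rng.standard_normal((N, K))                            # z ~ N(0, I)
y = mu + z @ W.T + sigma * rng.standard_normal((N, D))     # y | z ~ N(mu + W z, sigma^2 I)

# The marginal covariance implied by the model versus the sample estimate.
Sigma_model = W @ W.T + sigma**2 * np.eye(D)
Sigma_empirical = np.cov(y, rowvar=False)
```

With enough samples, `Sigma_empirical` should approach `Sigma_model`, which is the numerical counterpart of the statement that marginalizing \(\boldsymbol{z}\) yields a Gaussian with low-rank-plus-isotropic covariance.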
\[ \begin{aligned} \log p(\mathbf{z}(0)) - \log p(\mathbf{z}(T)) & = \int_T^0 \frac{d}{dt} \log p(\mathbf{z}(t)) dt \\ \log p(\mathbf{z}(T)) & = \log p(\mathbf{z}(0)) - \int_T^0 \frac{d}{dt} \log p(\mathbf{z}(t)) dt \\ & = \log p(\mathbf{z}(0)) - \int_T^0 -\text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) dt \\ & = \log p(\mathbf{z}(0)) + \int_T^0 \text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) dt \end{aligned} \]
This describes the log density of the generated data \(\mathbf{y} = \mathbf{z}(T)\) in terms of the log density of the base distribution \(p(\mathbf{z}(0))\) and the integral of the trace of the Jacobian of \(f\) backwards in time.
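The identity can be checked numerically on a case where the answer is known. The sketch below assumes hypothetical linear dynamics \(f(\mathbf{z}, t) = \mathbf{A}\mathbf{z}\) and a standard normal base distribution, so that \(\mathbf{z}(T)\) is Gaussian and its exact log density is available for comparison. The matrix `A`, the RK4 integrator, and the step count are illustrative choices.

```python
import numpy as np

# Hypothetical linear dynamics f(z, t) = A z, chosen so the result can be checked exactly.
A = np.array([[-0.5, 1.0],
              [-1.0, 0.3]])

def f(z, t):
    return z @ A.T                      # works for one state or a batch of row vectors

def trace_jacobian(z, t):
    return np.trace(A)                  # df/dz = A for linear dynamics

def std_normal_logpdf(z):
    return -0.5 * (z.size * np.log(2 * np.pi) + z @ z)

def flow(z0, logp0, T=1.0, n_steps=200):
    """RK4 for z(t); midpoint rule for d/dt log p(z(t)) = -Tr(df/dz)."""
    h = T / n_steps
    z, logp, t = np.array(z0, dtype=float), float(logp0), 0.0
    for _ in range(n_steps):
        k1 = f(z, t)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = f(z + h * k3, t + h)
        logp -= h * trace_jacobian(z + 0.5 * h * k1, t + 0.5 * h)
        z = z + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return z, logp

z0 = np.array([0.7, -0.4])
zT, logp_T = flow(z0, std_normal_logpdf(z0))

# Check: for a linear flow, z(T) = M z(0), so z(T) ~ N(0, M M^T).
M = flow(np.eye(2), 0.0)[0].T           # columns of M are the flowed basis vectors
cov = M @ M.T
_, logdet = np.linalg.slogdet(cov)
logp_exact = -0.5 * (2 * np.log(2 * np.pi) + logdet + zT @ np.linalg.solve(cov, zT))
```

Here `logp_T` comes from integrating the trace term along the trajectory, while `logp_exact` comes from the closed-form Gaussian density; the two should agree up to integration error. In a real continuous normalizing flow, \(f\) is a neural network and the trace is typically computed or estimated with automatic differentiation.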
An auto-regressive model is a generative model that factorizes the joint distribution of a set of random variables into a product of conditional distributions, where each variable is conditioned on the previous variables in a specified ordering.
It is common in sequential data modeling, where the ordering of the variables corresponds to the temporal order of the data, e.g. in language modeling or time series forecasting.
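As a small worked instance of this factorization, consider a Gaussian AR(1) process, where each variable depends only on its immediate predecessor. The sketch below samples a sequence one step at a time and evaluates its log density as a sum of conditional log densities. The parameter names `phi` and `sigma` and their values are assumptions for illustration.

```python
import numpy as np

def normal_logpdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def ar1_sample(T, phi, sigma, rng):
    """Sample y_1, ..., y_T in order: y_t | y_{t-1} ~ N(phi * y_{t-1}, sigma^2)."""
    y = np.empty(T)
    y[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))   # stationary initial distribution
    for t in range(1, T):
        y[t] = rng.normal(phi * y[t - 1], sigma)
    return y

def ar1_logpdf(y, phi, sigma):
    """log p(y) = log p(y_1) + sum_t log p(y_t | y_{t-1})."""
    logp = normal_logpdf(y[0], 0.0, sigma / np.sqrt(1 - phi**2))
    logp += normal_logpdf(y[1:], phi * y[:-1], sigma).sum()
    return logp

phi, sigma = 0.8, 0.5                   # hypothetical parameters, |phi| < 1
rng = np.random.default_rng(0)
y = ar1_sample(6, phi, sigma, rng)
logp = ar1_logpdf(y, phi, sigma)
```

Sampling and density evaluation both follow the ordering of the variables, which is what makes auto-regressive models tractable for both operations. Models used in language modeling replace the fixed Gaussian conditional with a learned conditional over the next token.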
Any sequence-to-vector model can be turned into an auto-regressive model by:
Any generative model can be conditioned on time by letting its parameters depend on time \(t\). This makes it possible to model time-varying distributions and to capture temporal dynamics in the data:
\[ \begin{aligned} \boldsymbol{z} &\sim p(\cdot \mid t) \\ \mathbf{y}(t) &\sim p(\cdot \mid \boldsymbol{z}, t) \end{aligned} \]
A special case arises when the latent variable \(\boldsymbol{z}\) is the initial condition of an ODE and the generative process solves that ODE forward in time. This leads to the class of models known as Latent Neural ODEs:
\[ \begin{aligned} \boldsymbol{z}(0) = \boldsymbol{z}_0 &\sim p(\cdot) \\ \frac{d\boldsymbol{z}(t)}{dt} &= f(\boldsymbol{z}(t), t) \\ \mathbf{y}(t) &\sim p(\cdot \mid g(\boldsymbol{z}(t))) \end{aligned} \]
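The generative process of a latent ODE model can be sketched in a few lines. The example below is a toy stand-in, not a trained model: the dynamics `f` (a damped rotation), the decoder `g` (read off the first latent coordinate), the Gaussian observation noise, and the fixed-step Euler solver are all assumptions made for illustration. In a Latent Neural ODE, `f` and `g` would be neural networks and the solver would usually be adaptive.

```python
import numpy as np

A = np.array([[-0.1, 1.0],
              [-1.0, -0.1]])   # hypothetical latent dynamics: a damped rotation

def f(z, t):
    return A @ z

def g(z):
    return z[:1]               # hypothetical decoder: observe the first latent coordinate

def sample_trajectory(times, rng, obs_std=0.05, n_sub=50):
    """Draw z(0), Euler-integrate dz/dt = f(z, t), and emit noisy observations at `times`."""
    z = rng.standard_normal(2)             # z(0) ~ N(0, I)
    t, ys = 0.0, []
    for t_next in times:                   # `times` must be increasing and non-negative
        h = (t_next - t) / n_sub
        for _ in range(n_sub):
            z = z + h * f(z, t)
            t += h
        ys.append(g(z) + obs_std * rng.standard_normal(1))   # y(t) ~ N(g(z(t)), obs_std^2)
    return np.array(ys)

times = np.linspace(0.5, 3.0, 6)
y = sample_trajectory(times, np.random.default_rng(1))
```

All randomness in the latent trajectory enters through the initial condition, so each draw of \(\boldsymbol{z}(0)\) corresponds to one smooth latent curve. The observation times need not be evenly spaced, which is one reason this model class suits irregularly sampled data.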
A stochastic process is a collection of random variables \(\{Y(t) : t \in \mathcal{T}\}\) indexed by a set \(\mathcal{T}\), together with a specification of the joint distribution of \((Y(t_1), \ldots, Y(t_N))\) for any finite subset \(\{t_1, \ldots, t_N\} \subset \mathcal{T}\).
Equivalently, a stochastic process defines a probability distribution over functions \(Y : \mathcal{T} \to \mathbb{R}^D\).
The index set \(\mathcal{T}\) can be:
Given an observed PK profile for a subject \(\{\mathbf{y}(t_i)\}_{i=1}^N\), we can compute the representation of that subject in the latent space, using:
\[ \boldsymbol{z}^* = \arg\max_{\boldsymbol{z}} p(\boldsymbol{z} \mid \{\mathbf{y}(t_i)\}_{i=1}^N) = \arg\max_{\boldsymbol{z}} p(\{\mathbf{y}(t_i)\}_{i=1}^N \mid \boldsymbol{z}) \, p(\boldsymbol{z}) \]
This is the maximum a posteriori (MAP) estimate of the latent variable \(\boldsymbol{z}\) given the observed data.
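A minimal sketch of such a MAP computation is shown below. It assumes a hypothetical one-compartment pharmacokinetic model with a single scalar latent variable, the log elimination rate `log_k`, a Gaussian prior on it, and Gaussian observation noise. The dose, volume, prior, and noise values are invented for illustration, and the maximization uses a plain grid search, which is only practical for a low-dimensional latent.

```python
import numpy as np

def pk_curve(t, log_k, dose=100.0, volume=10.0):
    """Hypothetical one-compartment IV bolus model: C(t) = dose / volume * exp(-k t)."""
    return dose / volume * np.exp(-np.exp(log_k) * t)

def log_posterior(log_k, t, y, noise_std=0.3, prior_mean=np.log(0.2), prior_std=0.5):
    """log p(y | z) + log p(z), up to additive constants that do not affect the argmax."""
    resid = y - pk_curve(t, log_k)
    log_lik = -0.5 * np.sum((resid / noise_std) ** 2)
    log_prior = -0.5 * ((log_k - prior_mean) / prior_std) ** 2
    return log_lik + log_prior

# Simulated observations for one subject (synthetic data, not real measurements).
rng = np.random.default_rng(0)
t_obs = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
true_log_k = np.log(0.3)
y_obs = pk_curve(t_obs, true_log_k) + 0.3 * rng.standard_normal(t_obs.size)

# MAP estimate by grid search over the scalar latent.
grid = np.linspace(np.log(0.01), np.log(2.0), 4001)
log_post = np.array([log_posterior(z, t_obs, y_obs) for z in grid])
z_map = grid[np.argmax(log_post)]
```

With few or noisy observations the prior term pulls `z_map` toward `prior_mean`; with many informative observations the likelihood dominates. For higher-dimensional latents, gradient-based optimization would replace the grid search.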
The shared prior over random effects and other population-level parameters enable information-sharing between subjects.
This allows us to learn the distribution of PK profiles (generative model interpretation) from a heterogeneous dataset of subjects, where some subjects may have sparse data and others may have rich data.
Notice that the model for each subject is a Bayesian model!
The random effects are assigned a prior distribution.
We assume these individual parameters are deterministic for each subject, but we don’t know their values.
For any model with parameters \(\beta\), we have three choices:
Say we decide to be Bayesian and put a prior distribution over the parameters \(\beta\) of our model.
The prior distribution itself has some parameters, called hyper-parameters \(\alpha\).
Hidden Markov Models (HMMs): generative models for sequential data with discrete latent states \(\boldsymbol{z}_t\), Markovian dynamics, and discrete steps in time.
State-Space Models (SSMs): generative models for sequential data with continuous latent states, Markovian dynamics, and discrete steps in time.
Latent Neural SDEs: the stochastic differential equation (SDE) counterpart of latent neural ODEs, with latent dynamics \(d\boldsymbol{z}(t) = f(\boldsymbol{z}(t), t)\,dt + g(\boldsymbol{z}(t), t)\,d\mathbf{W}(t)\), where \(\mathbf{W}(t)\) is a Wiener process.
Copulas: generative models for multivariate data that model the dependence structure between variables separately from their marginal distributions.
Point Processes: generative models for event data that model the times of events as random variables.
Latent Dirichlet Allocation: a hierarchical generative model for discrete data (e.g. text) that models documents as mixtures of topics, where each topic is a distribution over words.
Bayesian Non-Parametric Models: generative models whose effective number of parameters is not fixed in advance, so that model complexity can adapt to the data (e.g. Gaussian processes and Dirichlet process mixtures).
Restricted Boltzmann Machines (RBMs): historically important energy-based generative models (undirected probabilistic graphical models) consisting of a layer of visible units and a layer of hidden units, with undirected connections between the two layers but none within a layer; used for unsupervised learning and dimensionality reduction.
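Of the model classes listed above, the latent SDE is the easiest to simulate directly. The sketch below applies the Euler–Maruyama scheme to an SDE of the form \(d\boldsymbol{z} = f\,dt + g\,d\mathbf{W}\), under the assumption of diagonal (elementwise) noise. The drift and diffusion used in the example define an Ornstein–Uhlenbeck process with hypothetical parameters `theta` and `s`, chosen because its stationary variance \(s^2 / (2\theta)\) is known.

```python
import numpy as np

def euler_maruyama(f, g, z0, T, n_steps, rng):
    """Simulate dz = f(z, t) dt + g(z, t) dW with elementwise (diagonal) noise."""
    h = T / n_steps
    z = np.array(z0, dtype=float)
    t = 0.0
    path = [z.copy()]
    for _ in range(n_steps):
        dW = np.sqrt(h) * rng.standard_normal(z.shape)   # Wiener increment ~ N(0, h)
        z = z + f(z, t) * h + g(z, t) * dW
        t += h
        path.append(z.copy())
    return np.stack(path)

theta, s = 1.0, 0.5                          # hypothetical OU parameters
rng = np.random.default_rng(0)
path = euler_maruyama(
    f=lambda z, t: -theta * z,               # drift: mean reversion toward zero
    g=lambda z, t: s,                        # diffusion: constant noise scale
    z0=2.0 * np.ones(4000),                  # 4000 independent paths, all started at 2
    T=5.0, n_steps=500, rng=rng,
)
final = path[-1]
```

Unlike the ODE case, two paths started from the same initial condition diverge, because noise is injected at every step. In a Latent Neural SDE both `f` and `g` would be learned functions.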