Overview of Generative Models

Authors:

Mohamed Tarek

Overview of Generative Models

Generative Models are Just Distributions

A generative model is, at its core, a probability distribution $p(\mathbf{y})$ over the data space.
“Generative model” is largely a fancy name for a probability distribution that is:
- Possibly high-dimensional (e.g., images, molecules, time series of patient measurements),
- Possibly defined implicitly (e.g., through a neural network), and
- Designed so we can fit it to data and operate on it computationally.
Once you have a distribution that approximates $p_{\text{data}}(\mathbf{y})$, almost everything a “generative model” does reduces to one of a handful of standard operations on that distribution.

The Normal Distribution: A Tiny Generative Model

The simplest non-trivial generative model is the multivariate normal: \[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}\,\exp\!\left(-\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})\right) \]
It is “generative” because it defines a distribution we can sample (aka generate) synthetic data from.
Gaussian distributions are nice to work with because we can evaluate the density, marginalize, condition, compute expectations, and find modes in closed form.

Operations We Typically Want

Given a distribution $p(\mathbf{y})$, we typically want to:

Sample — draw $\mathbf{y} \sim p(\mathbf{y})$.
Evaluate the log density — compute $\log p(\mathbf{y})$ at a given $\mathbf{y}$.
Marginalize — for $\mathbf{y} = (\mathbf{y}_A, \mathbf{y}_B)$, compute $p(\mathbf{y}_A) = \int p(\mathbf{y}_A, \mathbf{y}_B)\, d\mathbf{y}_B$.
Condition — compute $p(\mathbf{y}_A \mid \mathbf{y}_B)$.
Find the mode / MAP — compute $\arg\max_{\mathbf{y}} p(\mathbf{y})$ (or a conditional mode).
Compute expectations — $\mathbb{E}_{p(\mathbf{y})}[h(\mathbf{y})]$ for some function $h$.
Compute a latent representation — compute $p(\boldsymbol{z} \mid \mathbf{y})$ for some latent variable $\boldsymbol{z}$.
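For a bivariate Gaussian, most of these operations have closed forms. The following is a minimal sketch using only the Python standard library; the parameter values and function names are illustrative assumptions, not part of the original material.

```python
import math
import random

# Illustrative parameters (assumptions, not from the text): y = (y_A, y_B).
mu_a, mu_b = 1.0, -2.0
s_aa, s_ab, s_bb = 2.0, 0.8, 1.0   # covariance entries; must be positive definite

def sample(rng):
    """Draw (y_A, y_B) using the Cholesky factor of the 2x2 covariance."""
    l11 = math.sqrt(s_aa)
    l21 = s_ab / l11
    l22 = math.sqrt(s_bb - l21 ** 2)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mu_a + l11 * e1, mu_b + l21 * e1 + l22 * e2

def log_density(y_a, y_b):
    """Evaluate log N(y | mu, Sigma) for the 2x2 case."""
    det = s_aa * s_bb - s_ab ** 2
    da, db = y_a - mu_a, y_b - mu_b
    quad = (s_bb * da * da - 2.0 * s_ab * da * db + s_aa * db * db) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def marginal_a():
    """Marginalize out y_B: p(y_A) = N(mu_a, s_aa)."""
    return mu_a, s_aa

def conditional_a_given_b(y_b):
    """Condition on y_B: p(y_A | y_B) is Gaussian with this mean and variance."""
    mean = mu_a + s_ab / s_bb * (y_b - mu_b)
    var = s_aa - s_ab ** 2 / s_bb
    return mean, var

rng = random.Random(0)
draws = [sample(rng) for _ in range(20000)]
mc_mean_a = sum(d[0] for d in draws) / len(draws)   # Monte Carlo estimate of E[y_A]
cond_mean, cond_var = conditional_a_given_b(-1.0)
```

The mode of a Gaussian is its mean, so the remaining operation needs no extra code here.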

Important

A generative model is “useful” to the extent that these operations are tractable for it.

Generative Models are Multi-Purpose

Important

Generative models can be used for more than just generating new data points!

Most “applications” of generative models are compositions of the basic operations above:

Classification = conditioning on $\mathbf{x}$.
Regression = sampling/expectation under $p(\mathbf{y} \mid \mathbf{x})$.
Representation learning = posterior over latents (conditioning on $\mathbf{y}$).
Density estimation = evaluating $\log p(\mathbf{y})$.
Data generation = sampling $\mathbf{y} \sim p(\mathbf{y})$.
Anomaly detection = thresholding $\log p(\mathbf{y})$ or distance to typical set.
Imputation = sampling/conditioning on observed entries.

Generative Models for Classification

Given labelled data $(\mathbf{x}, \mathbf{y})$, learn the joint distribution $p(\mathbf{x}, \mathbf{y})$ using a generative model.
Make a prediction for a given $\mathbf{x}$ using: \[ \hat{y} = \arg\max_y p(\mathbf{y} \mid \mathbf{x}) = \arg\max_y \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})} = \arg\max_y p(\mathbf{x}, \mathbf{y}) \]
Example: Naive Bayes classifier, Gaussian discriminant analysis (GDA).

Generative Models for Regression

Given labelled data $(\mathbf{x}, \mathbf{y})$, learn the joint distribution $p(\mathbf{x}, \mathbf{y})$ using a generative model.
Make a prediction for a given $\mathbf{x}$ by:
1. Sample $M$ values $\{\mathbf{y}^{(i)}\}_{i=1}^M$ from the conditional distribution $p(\mathbf{y} \mid \mathbf{x})$, either directly if the model provides it in closed form, or using Markov chain Monte Carlo (MCMC).
2. Compute the prediction $\hat{\mathbf{y}}$ as the sample mean, which is a Monte Carlo estimate of the conditional expectation: \[ \hat{\mathbf{y}} = \mathbb{E}_{p(\mathbf{y} \mid \mathbf{x})}[\mathbf{y}] = \int \mathbf{y} \, p(\mathbf{y} \mid \mathbf{x}) \, d\mathbf{y} \approx \frac{1}{M} \sum_{i=1}^M \mathbf{y}^{(i)} \]
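The two steps can be sketched for a joint Gaussian over a scalar $(x, y)$, where the conditional is available in closed form and the Monte Carlo average can be checked against the exact conditional mean. All numbers and names below are illustrative assumptions.

```python
import random

# Hypothetical joint Gaussian over (x, y); the numbers are illustrative assumptions.
mu_x, mu_y = 0.0, 1.0
s_xx, s_xy, s_yy = 1.0, 0.6, 2.0

def sample_y_given_x(x, rng):
    """Draw from p(y | x), which is Gaussian for a joint Gaussian model."""
    mean = mu_y + s_xy / s_xx * (x - mu_x)
    var = s_yy - s_xy ** 2 / s_xx
    return rng.gauss(mean, var ** 0.5)

def predict(x, num_samples=5000, seed=0):
    """Approximate E[y | x] by the average of M conditional samples."""
    rng = random.Random(seed)
    samples = [sample_y_given_x(x, rng) for _ in range(num_samples)]
    return sum(samples) / num_samples

y_hat = predict(2.0)                              # Monte Carlo estimate
y_exact = mu_y + s_xy / s_xx * (2.0 - mu_x)       # closed-form conditional mean
```

For models without a closed-form conditional, only `sample_y_given_x` would change (for example to an MCMC sampler); the averaging step stays the same.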

Generative Models for Representation Learning

Given unlabelled data $\mathbf{y}$, learn a generative model with a latent variable $\boldsymbol{z}$: \[ \begin{aligned} \boldsymbol{z} &\sim p(\boldsymbol{z}) \\ \mathbf{y} &\sim p(\mathbf{y} \mid \boldsymbol{z}) \end{aligned} \]
The posterior distribution $p(\boldsymbol{z} \mid \mathbf{y})$ can be used as a lower-dimensional representation (aka embedding) of the observed data $\mathbf{y}$.
If the latent variable $\boldsymbol{z}$ is low-dimensional and continuous, the mean/mode of the posterior distribution can be used as a low-dimensional embedding of the observed data $\mathbf{y}$.
If the latent variable $\boldsymbol{z}$ is discrete, the mode of the posterior distribution can be used as a clustering of the observed data $\mathbf{y}$.

When latent variables exist, the probability distribution of the observed data $\mathbf{y}$ is given by the marginal likelihood: \[ p(\mathbf{y}) = \int p(\mathbf{y} \mid \boldsymbol{z}) \, p(\boldsymbol{z}) \, d\boldsymbol{z} \]
The marginal likelihood integrates over all possible values of the latent variable $\boldsymbol{z}$, capturing the uncertainty in the latent representation.
Example: Variational Autoencoders (VAEs), Probabilistic Principal Component Analysis (PPCA), Gaussian Mixture Models (GMMs).

Generative Models for Density Estimation

Given unlabelled data $\mathbf{y}$, learn a generative model that provides a tractable expression for the probability density function $p(\mathbf{y})$ of the observed data $\mathbf{y}$.
Example: Normalizing Flows, Probabilistic Principal Component Analysis (PPCA), Probabilistic Circuits.
For these models, $\log p(\mathbf{y})$ is directly available and can be used for anomaly detection, model comparison, and likelihood-ratio tests.

Generative Models for Data Generation

If the generative model has a closed-form expression for the marginal likelihood $p(\mathbf{y})$ which matches any of the known distributions, e.g. Gaussian, we can directly sample from it.
If the generative model has latent variables $\boldsymbol{z}$, and closed form conditional distribution $p(\mathbf{y} \mid \boldsymbol{z})$, we can sample from the model distribution using ancestral sampling:
1. Sample $\boldsymbol{z}^{(i)} \sim p(\boldsymbol{z})$.
2. Sample $\tilde{\mathbf{y}}^{(i)} \sim p(\mathbf{y} \mid \boldsymbol{z}^{(i)})$.
If the model is an energy-based model, sample using the Stein score $\nabla_{\mathbf{y}} \log p(\mathbf{y})$ and MCMC.
If the model is a diffusion model, sample by solving the reverse SDE/ODE from noise to data.
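Ancestral sampling can be sketched with a two-component Gaussian mixture, where the latent $\boldsymbol{z}$ is the component index. The weights, means, and standard deviations are made-up values for illustration.

```python
import random

# Hypothetical two-component Gaussian mixture: z is the component index.
weights = [0.3, 0.7]        # p(z)
means = [-2.0, 3.0]         # mean of p(y | z)
stds = [0.5, 1.0]           # standard deviation of p(y | z)

def ancestral_sample(rng):
    """Step 1: draw z from p(z). Step 2: draw y from p(y | z)."""
    z = rng.choices([0, 1], weights=weights)[0]
    y = rng.gauss(means[z], stds[z])
    return z, y

rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(20000)]
frac_z1 = sum(z for z, _ in samples) / len(samples)   # should be close to 0.7
mean_y = sum(y for _, y in samples) / len(samples)    # should be close to 1.5
```

Discarding the sampled $z$ and keeping only $y$ gives draws from the marginal $p(\mathbf{y})$.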

Tractability Table

Model class	Sample	Log $p(\mathbf{y})$	Marginalize	Condition
Gaussian / Probabilistic PCA	✓	✓	✓	✓
Naive Bayes / Probabilistic Circuits	✓	✓	✓	✓
Gaussian Discriminant Analysis	✓	✓	✓	✓
(Continuous) Normalizing Flows	✓	✓	✗	✗
Variational Autoencoder / Diffusion	✓	≈	✗	✗
Energy-Based Models	✗	✗	✗	✓
GANs	✓	✗	✗	✗

✓: easy / tractable / closed form for at least 1 variant of the model.
≈: can get a lower bound.
✗: generally difficult / intractable, especially in high dimensions.

Summary

A generative model is, fundamentally, a (possibly high-dimensional, possibly implicit) probability distribution.
The Gaussian is the canonical “everything is tractable” baseline.
The interesting design space is: which operations remain tractable, and which ones are sacrificed for expressiveness?
Most “applications” of generative models are recombinations of: sampling, density evaluation, marginalization, conditioning.

Generative Models’ Zoo

Probabilistic PCA

A simple Gaussian generative model for data $\mathbf{y} \in \mathbb{R}^D$ is given by: \[ p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \] One can factorize the covariance matrix as $\boldsymbol{\Sigma} = \mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I}$, which corresponds to the Probabilistic PCA model. \[ p(\mathbf{y}) = \int p(\mathbf{y} \mid \boldsymbol{z}) \, p(\boldsymbol{z}) \, d\boldsymbol{z} \] where

\[ \begin{aligned} \boldsymbol{z} &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} &\sim \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu} + \mathbf{W}\boldsymbol{z}, \sigma^2 \mathbf{I}) \end{aligned} \]

This is also a linear mixed effects model, where $\boldsymbol{z}$ are the random effects and $\boldsymbol{\mu}$ is the fixed effect.
The latent variable $\boldsymbol{z}$ can be interpreted as a low-dimensional representation of the data $\mathbf{y}$.
The parameters $\mathbf{W}$ and $\sigma^2$ control the structure of the covariance matrix of the observed data.
Given data $\mathbf{y}$, the posterior distribution over the latent variable $\boldsymbol{z}$ is also Gaussian and can be computed in closed form, making inference tractable. \[ p(\boldsymbol{z} \mid \mathbf{y}) = \mathcal{N}(\boldsymbol{z} \mid \mathbf{M}^{-1}\mathbf{W}^\top (\mathbf{y} - \boldsymbol{\mu}), \sigma^2 \mathbf{M}^{-1}) \] where $\mathbf{M} = \mathbf{W}^\top \mathbf{W} + \sigma^2 \mathbf{I}$.
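As a sketch of the posterior formula, consider a single latent dimension, so that $\mathbf{M}$ is a scalar and no matrix inversion is needed. The parameter values are assumptions chosen only to make the arithmetic easy to follow.

```python
import random

# Illustrative PPCA parameters (assumptions): D = 3 observed dims, 1 latent dim.
mu = [0.5, -1.0, 2.0]
W = [1.0, 2.0, -1.0]        # single column of the D x 1 loading matrix
sigma2 = 0.25

def posterior(y):
    """Return mean and variance of p(z | y) for a scalar latent z.

    With one latent dimension, M = W^T W + sigma^2 is a scalar, so
    M^{-1} W^T (y - mu) and sigma^2 M^{-1} reduce to divisions.
    """
    M = sum(w * w for w in W) + sigma2
    proj = sum(w * (yi - m) for w, yi, m in zip(W, y, mu))
    return proj / M, sigma2 / M

# Generate one observation by ancestral sampling, then embed it.
rng = random.Random(1)
z_true = rng.gauss(0.0, 1.0)
y_obs = [m + w * z_true + rng.gauss(0.0, sigma2 ** 0.5) for m, w in zip(mu, W)]
z_mean, z_var = posterior(y_obs)
```

The posterior mean `z_mean` is the low-dimensional embedding of `y_obs`; it is shrunk toward zero relative to a least-squares projection because of the $\sigma^2$ term in $\mathbf{M}$.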

Naive Bayes Classifier

Given labelled data $(\mathbf{x}, \mathbf{y})$, with $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \{1, \ldots, K\}$, learn the joint distribution $p(\mathbf{x}, \mathbf{y})$ using a generative model. \[ p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}) \prod_{d=1}^D p(x_d \mid \mathbf{y}) \]
The naive assumption is that the features $x_d$ are conditionally independent given the class label $\mathbf{y}$, which allows for tractable inference and learning.
The model can be trained using maximum likelihood estimation, which reduces to estimating the class priors $p(\mathbf{y})$ and the class-conditional distributions $p(x_d \mid \mathbf{y})$ for each feature and class.

Once the model is trained, we can make predictions for new data points $\mathbf{x}$ by computing the posterior distribution over the class labels $\mathbf{y}$ using Bayes’ rule: \[ p(\mathbf{y} \mid \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})} = \frac{p(\mathbf{y}) \prod_{d=1}^D p(x_d \mid \mathbf{y})}{\sum_{k=1}^K p(k) \prod_{d=1}^D p(x_d \mid k)} \]
The predicted class label $\hat{y}$ can be obtained by taking the class with the highest posterior probability: \[ \hat{y} = \arg\max_y p(\mathbf{y} \mid \mathbf{x}) = \arg\max_y p(\mathbf{y}) \prod_{d=1}^D p(x_d \mid \mathbf{y}) \]
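A minimal sketch of a naive Bayes classifier, assuming Gaussian class-conditionals $p(x_d \mid \mathbf{y})$ for every feature (one common choice; the text does not fix the form). The toy data and function names are made up for illustration.

```python
import math

def fit(X, labels):
    """Maximum likelihood estimates: class priors and per-feature Gaussian parameters."""
    params = {}
    for k in sorted(set(labels)):
        rows = [x for x, y in zip(X, labels) if y == k]
        n, dims = len(rows), len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        # Small floor on the variance avoids division by zero (an added safeguard).
        variances = [max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-6)
                     for d in range(dims)]
        params[k] = (n / len(X), means, variances)
    return params

def log_joint(x, k, params):
    """log p(x, y=k) = log p(k) + sum_d log N(x_d | mean_kd, var_kd)."""
    prior, means, variances = params[k]
    total = math.log(prior)
    for xd, m, v in zip(x, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (xd - m) ** 2 / v
    return total

def posterior(x, params):
    """p(y | x) via Bayes' rule, normalized with the log-sum-exp trick."""
    logs = {k: log_joint(x, k, params) for k in params}
    top = max(logs.values())
    norm = top + math.log(sum(math.exp(v - top) for v in logs.values()))
    return {k: math.exp(v - norm) for k, v in logs.items()}

def predict(x, params):
    """Return the class with the highest posterior probability."""
    post = posterior(x, params)
    return max(post, key=post.get)

# Toy data (made up for illustration).
X = [[1.0, 2.1], [0.8, 1.9], [1.2, 2.0], [3.0, 0.1], [3.2, -0.1], [2.9, 0.0]]
labels = [1, 1, 1, 2, 2, 2]
params = fit(X, labels)
```

Working in log space avoids underflow when the product over features becomes very small.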

Gaussian Discriminant Analysis (GDA)

Gaussian Discriminant Analysis (GDA) is a generative model for classification that assumes the class-conditional distributions are multivariate Gaussians.
Given labelled data $(\mathbf{x}, \mathbf{y})$, with $\mathbf{x} \in \mathbb{R}^D$ and $\mathbf{y} \in \{1, \ldots, K\}$, learn the joint distribution $p(\mathbf{x}, \mathbf{y})$ using a generative model: \[ p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}) \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y) \]
The model can be trained using maximum likelihood estimation, which reduces to estimating the class priors $p(\mathbf{y})$ and the parameters of the class-conditional Gaussians $\boldsymbol{\mu}_y$ and $\boldsymbol{\Sigma}_y$ for each class.

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a special case of GDA where the class-conditional covariance matrices are assumed to be the same across all classes, i.e., $\boldsymbol{\Sigma}_y = \boldsymbol{\Sigma}$ for all $y$.
The joint distribution is given by: \[ p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}) \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}) \]
The decision boundary between classes is linear in the feature space, which is why it is called “Linear” Discriminant Analysis.

The decision boundary is defined by the set of points $\mathbf{x}$ where the posterior probabilities of two classes are equal, which leads to a linear equation in $\mathbf{x}$. \[ \log \frac{p(\mathbf{y}=k \mid \mathbf{x})}{p(\mathbf{y}=j \mid \mathbf{x})} = 0 \] \[ \log \frac{p(k)}{p(j)} - \frac{1}{2} (\boldsymbol{\mu}_k^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \boldsymbol{\mu}_j^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_j) + (\boldsymbol{\mu}_k - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}^{-1} \mathbf{x} = 0 \]
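The linearity can be checked numerically: the log posterior ratio computed from the two class-conditional Gaussians should equal the linear expression on the left-hand side of the boundary equation. The sketch below does this in two dimensions with assumed parameter values.

```python
import math

# Illustrative two-class LDA parameters in 2-D (assumed values).
prior = {1: 0.4, 2: 0.6}
mu = {1: [0.0, 0.0], 2: [2.0, 1.0]}
Sigma = [[1.0, 0.3], [0.3, 2.0]]     # shared covariance

det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
Sigma_inv = [[Sigma[1][1] / det, -Sigma[0][1] / det],
             [-Sigma[1][0] / det, Sigma[0][0] / det]]

def quad(u, v):
    """Return u^T Sigma^{-1} v."""
    return sum(u[i] * Sigma_inv[i][j] * v[j] for i in range(2) for j in range(2))

def log_joint(x, k):
    """log p(x, y=k) up to the normalizing constant shared by both classes."""
    d = [x[0] - mu[k][0], x[1] - mu[k][1]]
    return math.log(prior[k]) - 0.5 * quad(d, d)

def linear_score(x, k=1, j=2):
    """Left-hand side of the boundary equation, which is linear in x."""
    diff = [mu[k][0] - mu[j][0], mu[k][1] - mu[j][1]]
    bias = math.log(prior[k] / prior[j]) - 0.5 * (quad(mu[k], mu[k]) - quad(mu[j], mu[j]))
    return bias + quad(diff, x)

x = [0.7, -1.2]
gap = (log_joint(x, 1) - log_joint(x, 2)) - linear_score(x)   # should be ~0
```

The quadratic terms $\mathbf{x}^\top \boldsymbol{\Sigma}^{-1} \mathbf{x}$ cancel only because both classes share $\boldsymbol{\Sigma}$.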

Quadratic Discriminant Analysis (QDA)

Quadratic Discriminant Analysis (QDA) is GDA in its general form, where each class has its own covariance matrix, i.e., $\boldsymbol{\Sigma}_y$ can vary with $y$.
The joint distribution is given by: \[ p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y}) \,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_y, \boldsymbol{\Sigma}_y) \]
The decision boundary between classes is quadratic in the feature space, which is why it is called “Quadratic” Discriminant Analysis. \[ \log \frac{p(\mathbf{y}=k \mid \mathbf{x})}{p(\mathbf{y}=j \mid \mathbf{x})} = 0 \] \[ \log \frac{p(k)}{p(j)} - \frac{1}{2} \left( \log |\boldsymbol{\Sigma}_k| - \log |\boldsymbol{\Sigma}_j| + (\mathbf{x} - \boldsymbol{\mu}_k)^\top \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) - (\mathbf{x} - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right) = 0 \]

Latent Variable Models

A latent variable model is a generative model that introduces unobserved (latent) variables $\boldsymbol{z}$ to capture the underlying structure of the data $\mathbf{y}$.
The generative process is defined as: \[ \begin{aligned} \boldsymbol{z} &\sim p(\boldsymbol{z}) \\ \mathbf{y} &\sim p(\mathbf{y} \mid \boldsymbol{z}) \end{aligned} \]
The marginal likelihood of the observed data $\mathbf{y}$ is obtained by integrating out the latent variable $\boldsymbol{z}$: \[ p(\mathbf{y}) = \int p(\mathbf{y} \mid \boldsymbol{z}) \, p(\boldsymbol{z}) \, d\boldsymbol{z} \]
Examples of latent variable models include Variational Autoencoders (VAEs), Gaussian Mixture Models (GMMs), and Probabilistic Principal Component Analysis (PPCA).

Generative Adversarial Networks (GANs)

A Generative Adversarial Network (GAN) is a generative model that consists of two neural networks: a generator $G$ and a discriminator $D$.
The generator $G$ takes random noise $\mathbf{z} \sim p(\mathbf{z})$ as input and produces synthetic data $\tilde{\mathbf{y}} = G(\mathbf{z})$.
This is a form of a latent variable model where the latent variable is the noise $\mathbf{z}$ and the conditional distribution $p(\mathbf{y} \mid \boldsymbol{z})$ is implicitly defined by the generator network. \[ \begin{aligned} \mathbf{z} &\sim p(\mathbf{z}) \\ \tilde{\mathbf{y}} & = G(\mathbf{z}) \end{aligned} \]
The discriminator $D$ is used to train the generator, while being trained itself simultaneously to distinguish between real data $\mathbf{y}$ and generated data $\tilde{\mathbf{y}}$.

Flow-Based Models

A flow-based model is a generative model that defines a complex distribution $p(\mathbf{y})$ by applying a sequence of invertible transformations to a simple base distribution $p(\mathbf{z})$ (e.g., a standard Gaussian).
The transformations are designed to be invertible and have a tractable Jacobian determinant, allowing for efficient sampling and density evaluation.
The generative process is defined as: \[ \begin{aligned} \mathbf{z} &\sim p(\mathbf{z}) \\ \mathbf{y} & = f(\mathbf{z}) \end{aligned} \] where $f$ is an invertible function parameterized by a neural network.
Flow-based models allow for both efficient sampling (by sampling $\mathbf{z}$ and applying $f$) and efficient density evaluation (by applying $f^{-1}$ to $\mathbf{y}$ and computing the Jacobian determinant).

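The log density of the generated data can be computed using the change of variables formula: \[ \log p(\mathbf{y}) = \log p(\mathbf{z}) + \log \left| \det \left( \frac{\partial f^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right) \right|, \quad \mathbf{z} = f^{-1}(\mathbf{y}) \] Equivalently, in terms of the forward Jacobian, $\log p(\mathbf{y}) = \log p(\mathbf{z}) - \log \left| \det \left( \partial f(\mathbf{z}) / \partial \mathbf{z} \right) \right|$.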
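A one-dimensional affine flow $y = f(z) = a z + b$ with a standard normal base gives a case where the answer is known, namely $\mathcal{N}(y \mid b, a^2)$, so the change of variables computation can be checked directly. The values of `a` and `b` are assumptions for illustration.

```python
import math

a, b = 2.0, 1.0                      # illustrative flow parameters (assumed)

def f(z):
    return a * z + b

def f_inv(y):
    return (y - b) / a

def base_logpdf(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

def flow_logpdf(y):
    """Change of variables: log p(y) = log p(z) + log|d f^{-1}/dy|, with z = f^{-1}(y)."""
    z = f_inv(y)
    log_abs_det_inv = -math.log(abs(a))      # d f^{-1}/dy = 1/a
    return base_logpdf(z) + log_abs_det_inv

def normal_logpdf(y, mean, std):
    """Reference log density of N(mean, std^2)."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((y - mean) / std) ** 2

y = 0.3
difference = flow_logpdf(y) - normal_logpdf(y, b, abs(a))   # should be ~0
```

Stretching the base distribution by $|a| > 1$ spreads the same probability mass over a larger interval, which is why the log density drops by $\log |a|$.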

Normalizing Flows

Normalizing Flows are a popular class of flow-based models that use a sequence of simple, invertible transformations to build complex distributions. \[ \begin{aligned} \mathbf{z}_0 &\sim p(\mathbf{z}) \\ \mathbf{z}_1 & = f_1(\mathbf{z}_0) \\ \mathbf{z}_2 & = f_2(\mathbf{z}_1) \\ & \vdots \\ \mathbf{y} & = f_K(\mathbf{z}_{K-1}) \end{aligned} \]

Continuous Normalizing Flows

Continuous normalizing flows (CNFs) are a special class of flow-based models where the transformations are defined by continuous-time dynamics, typically modeled as ordinary differential equations (ODEs). \[ \begin{aligned} \mathbf{z}(0) &\sim p(\mathbf{z}) \\ \frac{d\mathbf{z}(t)}{dt} & = f(\mathbf{z}(t), t) \\ \mathbf{y} & = \mathbf{z}(T) \end{aligned} \]
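The ODE is invertible in time under mild conditions on the function $f$ (e.g., Lipschitz continuity in $\mathbf{z}$, which guarantees a unique solution), allowing for sampling by integrating forward in time and density evaluation by integrating backward in time.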

The rate of change of the log density $\log p(\mathbf{z}(t))$ along a trajectory $\mathbf{z}(t)$ is given by the instantaneous change of variables formula: \[ \frac{d}{dt} \log p(\mathbf{z}(t)) = -\text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) \]
For any differentiable scalar function $\phi(t)$, the fundamental theorem of calculus gives: \[ \phi(0) - \phi(T) = \int_T^0 \frac{d\phi(t)}{dt} dt \]
Let $\phi(t) = \log p(\mathbf{z}(t))$ (not to be confused with the dynamics $f(\mathbf{z}(t), t)$).

\[ \begin{aligned} \log p(\mathbf{z}(0)) - \log p(\mathbf{z}(T)) & = \int_T^0 \frac{d}{dt} \log p(\mathbf{z}(t)) dt \\ \log p(\mathbf{z}(T)) & = \log p(\mathbf{z}(0)) - \int_T^0 \frac{d}{dt} \log p(\mathbf{z}(t)) dt \\ & = \log p(\mathbf{z}(0)) - \int_T^0 -\text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) dt \\ & = \log p(\mathbf{z}(0)) + \int_T^0 \text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) dt \end{aligned} \]

This describes the log density of the generated data $\mathbf{y} = \mathbf{z}(T)$ in terms of the log density of the base distribution $p(\mathbf{z}(0))$ and the integral of the trace of the Jacobian of $f$ backwards in time.

Therefore, given a data point $\mathbf{z}(T) = \mathbf{y}$, one can compute the log density $\log p(\mathbf{y})$ by solving the following final value problem (FVP) backwards in time from $t=T$ to $t=0$: \[ \begin{aligned} \frac{d\mathbf{z}(t)}{dt} & = f(\mathbf{z}(t), t) \\ \frac{d l(t)}{dt} & = \text{Tr} \left( \frac{\partial f(\mathbf{z}(t), t)}{\partial \mathbf{z}(t)} \right) \\ \mathbf{z}(T) & = \mathbf{y} \\ l(T) & = 0 \end{aligned} \]
The FVP can be solved using any ODE solver, and the log density can be computed as: \[ \log p(\mathbf{y}) = \log p(\mathbf{z}(0)) + l(0). \]
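The final value problem can be sketched for scalar linear dynamics $f(z, t) = a z$, where the exact answer is known: $\mathbf{y} = e^{aT} \mathbf{z}(0)$ is distributed as $\mathcal{N}(0, e^{2aT})$. The hand-written Runge-Kutta solver, the dynamics, and all constants are assumptions for illustration; in practice $f$ would be a neural network and the solver would come from a library.

```python
import math

a, T = 0.5, 2.0                      # assumed dynamics dz/dt = a * z on [0, T]

def dynamics(z, t):
    return a * z

def trace_jacobian(z, t):
    return a                         # d(a z)/dz; z is scalar, so the trace is a itself

def rhs(state, t):
    """Right-hand side of the augmented system for (z, l)."""
    z, l = state
    return (dynamics(z, t), trace_jacobian(z, t))

def rk4_step(state, t, h):
    """One classical Runge-Kutta step of size h (h < 0 integrates backwards)."""
    k1 = rhs(state, t)
    k2 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), t + 0.5 * h)
    k3 = rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), t + 0.5 * h)
    k4 = rhs(tuple(s + h * k for s, k in zip(state, k3)), t + h)
    return tuple(s + h / 6.0 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))

def cnf_logpdf(y, steps=200):
    """Integrate from t = T down to t = 0, then add the base log density."""
    state, t, h = (y, 0.0), T, -T / steps     # final values z(T) = y, l(T) = 0
    for _ in range(steps):
        state = rk4_step(state, t, h)
        t += h
    z0, l0 = state
    base = -0.5 * math.log(2.0 * math.pi) - 0.5 * z0 * z0   # standard normal base
    return base + l0

y = 1.3
exact = (-0.5 * math.log(2.0 * math.pi * math.exp(2 * a * T))
         - 0.5 * y * y / math.exp(2 * a * T))
```

In higher dimensions the trace of the Jacobian is the expensive part of this computation.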

Energy-Based Models

An energy-based model defines a probability distribution over data $\mathbf{y}$ using an energy function $E(\mathbf{y})$: \[ p(\mathbf{y}) = \frac{1}{Z} \exp(-E(\mathbf{y})) \] where $Z = \int \exp(-E(\mathbf{y})) \, d\mathbf{y}$ is the partition function that normalizes the distribution.
The energy function $E(\mathbf{y})$ can be parameterized by a neural network, allowing for flexible and expressive models.
Inference in energy-based models typically involves sampling from the distribution using Markov Chain Monte Carlo (MCMC) methods, which can be computationally expensive, especially in high-dimensional spaces.
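The sketch below uses random-walk Metropolis, one of the simplest MCMC methods, to sample from an energy-based model without ever computing $Z$. The quadratic energy is an assumption chosen so that the result can be checked: it corresponds to $\mathcal{N}(1, 0.5^2)$. A neural-network energy would be used in exactly the same way.

```python
import math
import random

def energy(y):
    """Illustrative energy (assumed): a quadratic well, so p(y) is N(1, 0.5^2)."""
    return 0.5 * ((y - 1.0) / 0.5) ** 2

def metropolis(num_samples, step=0.8, burn_in=1000, seed=0):
    """Random-walk Metropolis: only energy differences are used, never Z."""
    rng = random.Random(seed)
    y, samples = 0.0, []
    for i in range(num_samples + burn_in):
        proposal = y + rng.gauss(0.0, step)
        # Accept with probability min(1, exp(E(y) - E(proposal))).
        if math.log(rng.random() + 1e-300) < energy(y) - energy(proposal):
            y = proposal
        if i >= burn_in:
            samples.append(y)
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)                            # close to 1.0
var = sum((s - mean) ** 2 for s in samples) / len(samples)    # close to 0.25
```

The partition function cancels in the acceptance ratio, which is why MCMC is the default sampling tool for these models. Successive samples are correlated, so many more steps are needed than with independent draws.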

Diffusion Models

Diffusion models are a class of generative models defined by a noise-adding process that gradually transforms data $\mathbf{y}$ into noise through a forward (diffusion) process. \[ \begin{aligned} \mathbf{z}_0 &\sim p(\mathbf{y}) \\ \mathbf{z}_t &\sim q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}(\mathbf{z}_t \mid \sqrt{1-\beta_t}\,\mathbf{z}_{t-1}, \beta_t \mathbf{I}) \quad \text{for } t=1,\ldots,T\\ \mathbf{z}_T &\sim \mathcal{N}(\mathbf{z}_T \mid \mathbf{0}, \mathbf{I}) \quad \text{(approximately standard Gaussian for large $T$)} \end{aligned} \]
The noise schedule $\{\beta_t\}$ controls how quickly information is destroyed: small $\beta_t$ preserves structure longer, while large $\beta_t$ adds noise more aggressively.
The reverse process is then learned to generate data from noise by reversing the diffusion process, effectively learning a model that generates data from a latent noise $\mathbf{z}$.
The number of steps $T$ can be taken to infinity and $\beta_t$ can be made infinitesimally small, in which case the forward process corresponds to a continuous-time stochastic differential equation (SDE), and the reverse process corresponds to solving a reverse-time SDE.
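The forward process can be simulated directly. The sketch below pushes a scalar data point through $T$ noising steps and checks that the result is close to a standard Gaussian. The linear schedule and its endpoints are assumed, commonly used values rather than something fixed by the text. Composing the Gaussian steps also gives the standard closed form $\mathbf{z}_t \mid \mathbf{z}_0 \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{z}_0, (1 - \bar{\alpha}_t)\mathbf{I})$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, which the code evaluates at $t = T$.

```python
import math
import random

T = 1000
# Linear noise schedule (an assumed, commonly used choice).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
coeffs = [(math.sqrt(1.0 - beta), math.sqrt(beta)) for beta in betas]

def forward_chain(z0, rng):
    """Apply z_t ~ N(sqrt(1 - beta_t) z_{t-1}, beta_t) for t = 1..T."""
    z = z0
    for keep, noise in coeffs:
        z = keep * z + noise * rng.gauss(0.0, 1.0)
    return z

# Fraction of the original signal variance that survives all T steps.
alpha_bar = 1.0
for beta in betas:
    alpha_bar *= 1.0 - beta

rng = random.Random(0)
z0 = 3.0                                  # a scalar "data point"
finals = [forward_chain(z0, rng) for _ in range(1000)]
mean_T = sum(finals) / len(finals)                             # close to 0
var_T = sum((z - mean_T) ** 2 for z in finals) / len(finals)   # close to 1
```

Because `alpha_bar` is tiny at $t = T$, almost no information about `z0` survives, which is what lets the reverse process start from pure noise.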

Probabilistic Graphical Models

Bayesian Networks

A Bayesian Network is a generative model that represents the joint distribution of a set of random variables using a directed acyclic graph (DAG).
Each node in the graph represents a random variable, and the edges represent conditional dependencies between the variables.
The joint distribution is factorized according to the structure of the graph: \[ p(\mathbf{y}) = \prod_{i=1}^D p(y_i \mid \text{Pa}(y_i)) \] where $\text{Pa}(y_i)$ denotes the set of parent nodes of $y_i$ in the graph.
Some variables may be observed (data) and some may be unobserved (latent), and the model can be used for inference, learning, and reasoning about the relationships between variables.

Markov Random Fields (aka Markov Networks)

A Markov Random Field (MRF) is a generative energy-based model that represents the joint distribution of a set of random variables using an undirected graph.
Each node in the graph represents a random variable, and the edges represent undirected dependencies between the variables.
The joint distribution is factorized according to the cliques of the graph: \[ p(\mathbf{y}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{y}_c) \] where $\mathcal{C}$ is the set of cliques in the graph, $\psi_c$ is a potential function defined on the variables in clique $c$, and $Z$ is the partition function that normalizes the distribution.

Auto-Regressive Models

An auto-regressive model is a generative model that factorizes the joint distribution of a set of random variables into a product of conditional distributions, where each variable is conditioned on the previous variables in a specified ordering.
It is common in sequential data modeling, where the ordering of the variables corresponds to the temporal order of the data, e.g. in language modeling or time series forecasting.
Any sequence-to-vector model can be turned into an auto-regressive model by:
1. Generating the first variable $y_1$ unconditionally.
2. Generating the $i$-th variable $y_i$ conditioned on $y_1, \ldots, y_{i-1}$: \[ y_i = f(y_1, \ldots, y_{i-1}) + \epsilon_i \] where $f$ is the sequence-to-vector model and $\epsilon_i$ is a noise term.
3. Adding the generated variable $y_i$ to the conditioning set and repeating step 2 for step $i + 1$.
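The generation loop can be sketched as follows. The function `next_mean` stands in for the sequence-to-vector model $f$; here it is a hypothetical linear rule chosen for illustration, where a real model would be a neural network or similar.

```python
import random

def next_mean(history):
    """Hypothetical f(y_1, ..., y_{i-1}): a linear AR(2)-style rule, for illustration."""
    if len(history) == 1:
        return 0.8 * history[-1]
    return 0.6 * history[-1] + 0.3 * history[-2]

def generate(length, noise_std=0.1, seed=0):
    """Generate a sequence one variable at a time."""
    rng = random.Random(seed)
    ys = [rng.gauss(0.0, 1.0)]                               # step 1: y_1 drawn unconditionally
    while len(ys) < length:
        y_next = next_mean(ys) + rng.gauss(0.0, noise_std)   # step 2: f(history) + noise
        ys.append(y_next)                                    # step 3: extend the conditioning set
    return ys

sequence = generate(10)
```

Each new variable needs the previous ones first, so generation is sequential in the sequence length.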

Probabilistic Circuits

A probabilistic circuit (PC) is a generative model that represents a probability distribution using a computational graph composed of specific types of nodes, such as sum nodes, product nodes and leaf nodes that represent tractable distributions.
Probabilistic circuits made of sum, product and leaf nodes only are also called Sum-Product Networks (SPNs).
The idea is to define (multiple) simple distributions over each variable in $\mathbf{y}$ at the leaf nodes. On their own, these leaf distributions are too simple to describe the data well, but they are tractable.
Then, by combining these simple distributions using sum and product nodes, we can build a more complex distribution that can better fit the data while still maintaining tractability for inference and learning.
SPNs and PCs in general can represent complex distributions while maintaining tractable inference and learning by exploiting the structure of the circuit.

SPN Example

Assume we have 2 variables $\mathbf{y} = (y_1, y_2)$.
We can define 2 leaf nodes for each variable, representing 2 simple distributions over each of $y_1$ and $y_2$ independently, e.g. 2 Gaussians each. \[ \begin{aligned} \text{Leaf node 1: } & p_1(y_1) = \mathcal{N}(y_1 \mid \mu_1, \sigma_1^2) \\ \text{Leaf node 2: } & p_2(y_1) = \mathcal{N}(y_1 \mid \mu_2, \sigma_2^2) \\ \text{Leaf node 3: } & p_3(y_2) = \mathcal{N}(y_2 \mid \mu_3, \sigma_3^2) \\ \text{Leaf node 4: } & p_4(y_2) = \mathcal{N}(y_2 \mid \mu_4, \sigma_4^2) \end{aligned} \]
The sum nodes represent a mixture of distributions that have the same scope (i.e. defined over the same variables). \[ \begin{aligned} \text{Sum node 1: } & p_5(y_1) = w_1 p_1(y_1) + w_2 p_2(y_1) \\ \text{Sum node 2: } & p_6(y_2) = w_3 p_3(y_2) + w_4 p_4(y_2) \end{aligned} \]

While the product nodes represent the joint distribution over 2 disjoint scopes (i.e. defined over different variables) by multiplying the distributions of their children. \[ \begin{aligned} \text{Product node 1: } & p_7(y_1, y_2) = p_5(y_1) \cdot p_6(y_2) \end{aligned} \]
We can replicate the above structure (using separate leaf nodes) to define another product node $p_8(y_1, y_2)$ over $y_1$ and $y_2$ jointly.
We can then combine the two product nodes $p_7$ and $p_8$ using a sum node to define a more complex distribution over $\mathbf{y}$ that captures more complex dependencies between $y_1$ and $y_2$. \[ \begin{aligned} \text{Sum node 3: } & p_9(y_1, y_2) = w_5 p_7(y_1, y_2) + w_6 p_8(y_1, y_2) \end{aligned} \]
The parameters of the model are the parameters of the leaf distributions (e.g. $\mu_i$ and $\sigma_i^2$ for Gaussian leaves) and the weights of the sum nodes (e.g. $w_i$).
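The example circuit can be evaluated in a few lines. The sketch below also shows why marginalization is tractable: integrating out a variable amounts to replacing its leaves by 1, since each leaf is a normalized density. All leaf parameters and weights are made-up values, and the helper names are hypothetical.

```python
import math

def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical leaf parameters (mean, std) and sum weights; values are illustrative.
leaves_y1 = [(-1.0, 0.5), (2.0, 1.0), (0.0, 0.7), (3.0, 0.4)]   # p_1, p_2, and the copies used by p_8
leaves_y2 = [(0.0, 1.0), (4.0, 0.8), (-2.0, 0.6), (1.0, 1.2)]   # p_3, p_4, and the copies used by p_8
w_sum = [(0.3, 0.7), (0.6, 0.4), (0.5, 0.5), (0.2, 0.8)]        # weights of the single-variable sum nodes
w_root = (0.4, 0.6)                                             # w_5, w_6

def leaf(y, params):
    """Leaf value; y=None marginalizes the variable (a normalized leaf integrates to 1)."""
    return 1.0 if y is None else normal_pdf(y, *params)

def mixture(y, leaf_pair, weights):
    """Sum node over two leaves with the same scope."""
    return sum(w * leaf(y, p) for w, p in zip(weights, leaf_pair))

def spn(y1, y2):
    """Root sum node p_9 = w_5 p_7 + w_6 p_8; p_7 and p_8 are products of two sum nodes."""
    p7 = mixture(y1, leaves_y1[0:2], w_sum[0]) * mixture(y2, leaves_y2[0:2], w_sum[1])
    p8 = mixture(y1, leaves_y1[2:4], w_sum[2]) * mixture(y2, leaves_y2[2:4], w_sum[3])
    return w_root[0] * p7 + w_root[1] * p8

joint = spn(0.5, 1.0)          # joint density at a point
marginal_y1 = spn(0.5, None)   # p(y_1 = 0.5), with y_2 integrated out in one pass
```

The conditional follows from the same two evaluations as `spn(0.5, 1.0) / spn(0.5, None)`, which gives $p(y_2 = 1 \mid y_1 = 0.5)$.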

Conditional Generative Models

A conditional generative model defines a distribution over data $\mathbf{y}$ conditioned on some input $\mathbf{x}$, allowing for more flexible modeling of complex relationships between variables.
All generative models can be extended to the conditional setting by allowing the parameters of the model to depend on the conditioning variable $\mathbf{x}$.
This gives rise to models such as Conditional Variational Autoencoders (CVAEs), Conditional Normalizing Flows, Conditional Energy-Based Models, and Conditional Diffusion Models, which can be used for tasks like conditional generation, conditional density estimation, and conditional representation learning.
Alternatively, one can define a joint generative model over both $\mathbf{x}$ and $\mathbf{y}$, and then use the conditional distribution $p(\mathbf{y} \mid \mathbf{x})$ for inference and generation.

Conditioning Generative Models on Time

Any generative model can be conditioned on time by allowing the parameters of the model to depend on time $t$, which allows for modeling time-varying distributions and capturing temporal dynamics in the data. \[ \begin{aligned} \boldsymbol{z} &\sim p(. \mid t) \\ \mathbf{y}(t) &\sim p(. \mid \boldsymbol{z}, t) \end{aligned} \] A special case of this is when the latent variable $\boldsymbol{z}$ is the initial condition of an ODE, and the generative process is defined by solving the ODE forward in time, which leads to the class of models known as Latent Neural ODEs. \[ \begin{aligned} \boldsymbol{z}_0 &\sim p(.) \\ \frac{d\boldsymbol{z}(t)}{dt} & = f(\boldsymbol{z}(t), t) \\ \mathbf{y}(t) & \sim p(. \mid g(\boldsymbol{z}(t))) \end{aligned} \]

Stochastic Process

A stochastic process is a collection of random variables $\{Y(t) : t \in \mathcal{T}\}$ indexed by a set $\mathcal{T}$, together with a specification of the joint distribution of $(Y(t_1), \ldots, Y(t_N))$ for any finite subset $\{t_1, \ldots, t_N\} \subset \mathcal{T}$.
Equivalently, a stochastic process defines a probability distribution over functions $Y : \mathcal{T} \to \mathbb{R}^D$.
The index set $\mathcal{T}$ can be:
- Discrete, e.g. $\mathcal{T} = \{1, 2, \ldots, T\}$ or $\mathcal{T} = \mathbb{N}$ — gives a sequence $(Y_1, Y_2, \ldots)$.
- Continuous, e.g. $\mathcal{T} = [0, T]$ or $\mathcal{T} = \mathbb{R}^d$ — gives a random function over time or a random field over space.

Gaussian Process

A Gaussian process (GP) is a continuous-index stochastic process where any finite collection of function values follows a multivariate Gaussian distribution. \[ \begin{aligned} f &\sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)) \\ \mathbf{y}(t) & = f(t) + \epsilon(t) \end{aligned} \]
$m(\cdot)$ is the mean function, $k(\cdot, \cdot)$ is the covariance kernel, and $\epsilon(t)$ is observation noise.
The index $t$ can live in $\mathbb{R}$ (time), $\mathbb{R}^d$ (space, or arbitrary input features), or any other set on which a valid kernel can be defined.

Given a set of pre-specified input points $\{t_i\}_{i=1}^N$, the function values at these points $\mathbf{y} = (y_1, \ldots, y_N)$ are jointly Gaussian distributed: \[ \mathbf{y} \sim \mathcal{N}(\mathbf{y} \mid \mathbf{m}, \mathbf{K} + \sigma^2 \mathbf{I}) \] where $\mathbf{m} = (m(t_1), \ldots, m(t_N))$ and $\mathbf{K}$ is the covariance matrix with entries $K_{ij} = k(t_i, t_j)$.
The kernel function $k(t, t')$ encodes the covariance structure of the function values at different input points $t$ and $t'$, allowing for “interaction” between different input points and enabling the model to capture complex patterns in the data.
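Example kernel function: radial basis function (RBF) kernel, also known as the squared exponential kernel: \[ k(t, t') = \sigma_f^2 \exp\left(-\frac{\|t - t'\|^2}{2\ell^2}\right) \] where $\sigma_f^2$ is the signal variance (distinct from the observation noise variance $\sigma^2$) and $\ell$ is the length scale.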

Gaussian Process for Regression

The GP can be instantiated at any set of input points $\{t_i\}_{i=1}^N$ to obtain a finite-dimensional Gaussian distribution over the function values at those points.
Assume we observe the (random) function values $\mathbf{y}(t)$ at input points $\{t_i\}_{i=1}^N$, and we want to figure out what the distribution of $\mathbf{y}(t)$ is at new input points $\{t_j^*\}_{j=1}^M$.
The joint distribution over the observed and new function values is Gaussian \[ \begin{bmatrix} \mathbf{y} \\ \mathbf{y}^* \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mathbf{m} \\ \mathbf{m}^* \end{bmatrix}, \begin{bmatrix} \mathbf{K} + \sigma^2 \mathbf{I} & \mathbf{K}^* \\ \mathbf{K}^{*\top} & \mathbf{K}^{**} + \sigma^2 \mathbf{I} \end{bmatrix}\right) \] where $\mathbf{m}^* = (m(t_1^*), \ldots, m(t_M^*))$, $\mathbf{K}^*$ is the covariance matrix between the observed and new input points, and $\mathbf{K}^{**}$ is the covariance matrix between the new input points.

We can then compute the conditional distribution over $\mathbf{y}^*$ given $\mathbf{y}$, which is also Gaussian and can be computed in closed form. \[ p(\mathbf{y}^* \mid \mathbf{y}) = \mathcal{N}(\mathbf{y}^* \mid \mathbf{m}^* + \mathbf{K}^{*\top} (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} (\mathbf{y} - \mathbf{m}), \mathbf{K}^{**} + \sigma^2 \mathbf{I} - \mathbf{K}^{*\top} (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{K}^*) \]
The mean of that conditional distribution can be interpreted as the best guess for the function values at the new input points, having observed the function values at the original input points.
The covariance of the conditional distribution captures the uncertainty in that prediction.
The conditional distribution is often called the “posterior” distribution over the function values at the new input points, given the observed data.
Both the input $t$ and the output $\mathbf{y}(t)$ can be multi-dimensional, going beyond just functions of time to functions of any input space.
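A minimal sketch of the conditional (posterior) computation for scalar inputs, assuming a zero mean function and an RBF kernel with unit signal variance and length scale. The small linear solver is written by hand only to keep the example dependency-free; a numerical library would normally be used instead.

```python
import math

def rbf(t1, t2, variance=1.0, lengthscale=1.0):
    return variance * math.exp(-0.5 * (t1 - t2) ** 2 / lengthscale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def gp_predict(t_obs, y_obs, t_new, noise_var=0.01):
    """Posterior mean and variance of y* at one new input, with zero mean function."""
    A = [[rbf(ti, tj) + (noise_var if i == j else 0.0)
          for j, tj in enumerate(t_obs)] for i, ti in enumerate(t_obs)]
    k_star = [rbf(ti, t_new) for ti in t_obs]
    alpha = solve(A, y_obs)              # (K + sigma^2 I)^{-1} (y - m), with m = 0
    v = solve(A, k_star)                 # (K + sigma^2 I)^{-1} K*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(t_new, t_new) + noise_var - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

t_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [math.sin(t) for t in t_obs]     # toy observations
mean_near, var_near = gp_predict(t_obs, y_obs, 1.0)    # at an observed input
mean_far, var_far = gp_predict(t_obs, y_obs, 10.0)     # far from all observations
```

Near the data the predictive variance collapses toward the noise level, while far from the data the prediction reverts to the prior mean and prior variance.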

This has applications in regression and hyper-parameter tuning of machine learning models, where the random function is the model performance and its inputs are the hyper-parameters.
A Gaussian process is an example of a non-parametric generative model whose flexibility adapts to the data. Predictions are computed by conditioning on the observed data, so the dataset (or a subset of it) must be available at prediction time.
The (hyper-)parameters of a GP are those of the mean function $m(\cdot)$ and the kernel function $k(\cdot, \cdot)$, together with the noise variance $\sigma^2$; they can be learned from data.

Nonlinear Mixed Effects (NLME) Models

NLME models are conditional latent variable generative models of the form: \[ \begin{aligned} \boldsymbol{z} &\sim p(. \mid \mathbf{x}(0)) \\ \mathbf{y} &\sim p(. \mid \boldsymbol{z}, t, \mathbf{x}(t), \mathbf{d}(t)) \end{aligned} \] where $\mathbf{x}(t)$ are time-varying covariates and $\mathbf{d}(t)$ is the dosing regimen.
In pharmacokinetics (PK), we condition on time, covariates and dosing regimen, for example: \[ \begin{aligned} \boldsymbol{z} &\sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Omega}) \\ \begin{bmatrix} \text{CL} \\ \text{V} \end{bmatrix} & = \begin{bmatrix} \text{CL}_\text{pop} \\ \text{V}_\text{pop} \end{bmatrix} \odot \exp(\boldsymbol{z}) \quad \text{where } \odot \text{ is element-wise multiplication} \\ \frac{d \text{Central}(t)}{dt} & = -\text{CL} \cdot \frac{\text{Central}(t)}{\text{V}} + \text{Dose}(t) \\ \mathbf{y}(t) & \sim \mathcal{N}(. \mid \text{Central}(t) / \text{V}, \sigma^2) \end{aligned} \]
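
To make the generative reading concrete, here is a minimal sketch that samples PK profiles from the model above. It assumes a single bolus dose at $t = 0$, so that the ODE has the closed-form solution $\text{Central}(t) = \text{Dose} \cdot \exp(-(\text{CL}/\text{V})\, t)$; the function name and all numerical values are hypothetical.

```python
import numpy as np

def simulate_pk_profiles(n_subjects, times, dose, cl_pop, v_pop, omega, sigma, rng):
    """Sample concentration profiles from the one-compartment NLME model.

    Assumes a single bolus dose at t = 0, for which
    Central(t) = dose * exp(-(CL / V) * t).
    """
    # Random effects z ~ N(0, Omega), one row per subject.
    z = rng.multivariate_normal(np.zeros(2), omega, size=n_subjects)
    cl = cl_pop * np.exp(z[:, 0])        # individual clearance
    v = v_pop * np.exp(z[:, 1])          # individual volume
    central = dose * np.exp(-(cl / v)[:, None] * times[None, :])
    conc = central / v[:, None]          # predicted concentration Central(t) / V
    y = conc + sigma * rng.standard_normal(conc.shape)  # additive residual error
    return z, conc, y

rng = np.random.default_rng(0)
times = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])
omega = np.array([[0.09, 0.0], [0.0, 0.04]])   # illustrative values
z, conc, y = simulate_pk_profiles(
    5, times, dose=100.0, cl_pop=2.0, v_pop=10.0, omega=omega, sigma=0.2, rng=rng
)
```

Each row of `y` is one synthetic subject; different rows differ only through the sampled random effects and the residual noise.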

The NLME model captures the distribution of PK profiles (drug concentration over time) across individuals.
This is a distribution over functions of time.
The model parameters are the fixed effects ($\text{CL}_\text{pop}$ and $\text{V}_\text{pop}$), the covariance of the random effects ($\boldsymbol{\Omega}$) and the residual error variance ($\sigma^2$).
The latent variable $\boldsymbol{z}$ captures the inter-individual variability in PK parameters between subjects that have the same covariates and dosing regimen.
Two subjects with the same covariates and dosing regimen can have different PK profiles due to unobserved factors, which is captured by the latent variable $\boldsymbol{z}$.
This is exactly the random effect in NLME models.

NLME Model for Representation Learning

Given an observed PK profile for a subject $\{\mathbf{y}(t_i)\}_{i=1}^N$, we can compute the representation of that subject in the latent space using: \[ \boldsymbol{z}^* = \arg\max_{\boldsymbol{z}} p(\boldsymbol{z} \mid \{\mathbf{y}(t_i)\}_{i=1}^N) = \arg\max_{\boldsymbol{z}} p(\{\mathbf{y}(t_i)\}_{i=1}^N \mid \boldsymbol{z}) p(\boldsymbol{z}) \]
This is the maximum a posteriori (MAP) estimate of the latent variable $\boldsymbol{z}$ given the observed data.
- It is commonly known as the “empirical Bayes” estimate of the random effect in NLME models.
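
The MAP computation can be sketched by minimizing the negative log posterior numerically. The example below assumes the one-compartment model with a single bolus dose at $t = 0$, known (hypothetical) population parameters, and SciPy's general-purpose `minimize`; NLME software typically uses its own optimizers.

```python
import numpy as np
from scipy.optimize import minimize

def predict_conc(z, times, dose, cl_pop, v_pop):
    """Concentration Central(t) / V for a bolus dose at t = 0 (assumed)."""
    cl = cl_pop * np.exp(z[0])
    v = v_pop * np.exp(z[1])
    return dose / v * np.exp(-(cl / v) * times)

def neg_log_posterior(z, y, times, dose, cl_pop, v_pop, omega_inv, sigma):
    """-log p(y | z) - log p(z), up to constants that do not depend on z."""
    resid = y - predict_conc(z, times, dose, cl_pop, v_pop)
    return 0.5 * np.sum(resid ** 2) / sigma ** 2 + 0.5 * z @ omega_inv @ z

# Hypothetical population parameters and one simulated subject.
times = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])
dose, cl_pop, v_pop, sigma = 100.0, 2.0, 10.0, 0.2
omega = np.diag([0.09, 0.04])
rng = np.random.default_rng(0)
z_true = np.array([0.3, -0.2])
y = predict_conc(z_true, times, dose, cl_pop, v_pop) + sigma * rng.standard_normal(times.size)

# Start from z = 0, i.e. the typical subject, and minimize.
result = minimize(
    neg_log_posterior,
    x0=np.zeros(2),
    args=(y, times, dose, cl_pop, v_pop, np.linalg.inv(omega), sigma),
)
z_map = result.x
```

The quadratic term `0.5 * z @ omega_inv @ z` is the prior contribution; it pulls `z_map` toward zero when the subject has few or noisy observations.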

NLME Model as Empirical Bayes Hierarchical Models

Say we have $N$ subjects, then the hierarchical model can be written as: \[ \begin{aligned} \boldsymbol{z}_i &\sim p(. \mid \mathbf{x}_i(0)) \quad \text{for } i=1,\ldots,N \\ \mathbf{y}_i &\sim p(. \mid \boldsymbol{z}_i, t, \mathbf{x}_i(t), \mathbf{d}_i(t)) \quad \text{for } i=1,\ldots,N \end{aligned} \]
We assume that each subject $i$ has a copy of the model.
The subjects’ models all share the same population parameters ($\text{CL}_\text{pop}$, $\text{V}_\text{pop}$, $\boldsymbol{\Omega}$ and $\sigma^2$).
However, each subject has their own random effect $\boldsymbol{z}_i$ that represents that subject’s deviation from the typical values of the PK parameters.
Because they share $\boldsymbol{\Omega}$ and the other population parameters, the subjects’ models are not completely independent: they are “connected”.
The best NLME model combines the prior distribution over the random effects (i.e. $\boldsymbol{\Omega}$) and the structural model that together best describe the data of all the subjects.

The shared prior over random effects and other population-level parameters enable information-sharing between subjects.
This allows us to learn the distribution of PK profiles (generative model interpretation) from a heterogeneous dataset of subjects, where some subjects may have sparse data and others may have rich data.
Notice that the model for each subject is a Bayesian model!
The random effects are assigned a prior distribution:
- Its parameters are shared across subjects, but it may depend on baseline covariates.
We assume these individual parameters are deterministic for each subject, but we don’t know their values.

So the Bayesian approach is to model them as random variables with a prior distribution.
Given observed data for a specific subject $i$, we can then compute the posterior distribution over the random effect $\boldsymbol{z}_i$ for that subject, which is a form of “personalization” of the model to that subject.
That is, we are being “Bayesian” in our approach to inference for the random effects.
However, the population-level parameters are treated as deterministic parameters.
These are estimated using maximum (marginal) likelihood estimation.
That is, we are being “frequentist” or “empirical” in our approach to inference for the population parameters, because there is no prior distribution over these parameters.
This is why NLME models are often referred to as “empirical Bayes” hierarchical models, because they combine a Bayesian approach to inference for the random effects with a frequentist or empirical approach to inference for the population parameters.
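
The two-step logic can be illustrated on a much simpler linear analogue, where everything is available in closed form. The sketch below assumes the toy model $y_{ij} = \mu + z_i + \epsilon_{ij}$ with $z_i \sim \mathcal{N}(0, \omega^2)$, $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ and known $\sigma$; this is not an NLME model, for which the marginal likelihood typically has no closed form and must be approximated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_obs = 200, 4
mu_true, omega_true, sigma = 1.0, 0.5, 1.0   # sigma assumed known for simplicity

# Simulate a balanced dataset: one row per subject.
z = omega_true * rng.standard_normal(n_subjects)
y = mu_true + z[:, None] + sigma * rng.standard_normal((n_subjects, n_obs))

# Step 1 ("empirical"): maximum marginal likelihood for the population parameters.
# After integrating out z_i, subject means are i.i.d. N(mu, omega^2 + sigma^2 / n_obs).
y_bar = y.mean(axis=1)
mu_hat = y_bar.mean()
omega2_hat = max(y_bar.var() - sigma ** 2 / n_obs, 0.0)

# Step 2 ("Bayes"): posterior of each random effect under the estimated prior.
shrinkage = omega2_hat / (omega2_hat + sigma ** 2 / n_obs)
z_post_mean = shrinkage * (y_bar - mu_hat)    # shrunk toward the population
z_post_var = shrinkage * sigma ** 2 / n_obs   # same for all subjects (balanced design)
```

The `shrinkage` factor shows the information-sharing at work: each subject's estimate is pulled toward the population mean, more strongly when the estimated between-subject variance is small relative to the noise.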

Everyone Eventually Stops Being Bayesian!

For any model with parameters $\beta$, we have three choices:
1. Fix the parameters to known values,
2. Treat the parameters as deterministic unknowns and estimate them using an empirical loss, such as maximum likelihood estimation (frequentist approach), or
3. Treat the parameters as random variables and put a prior distribution over them, and then compute the posterior distribution over the parameters given the observed data (Bayesian approach).
Say we decide to be Bayesian and put a prior distribution over the parameters $\beta$ of our model.
The prior distribution itself has some parameters, called hyper-parameters $\alpha$.

We have the same three choices for $\alpha$ that we had for $\beta$.
We can keep being Bayesian and keep putting priors over the hyper-parameters, which will have their own hyper-hyper-parameters, and so on.
Eventually, we either have to stop and fix the parameters of the prior distribution to known (or assumed) values, or we have to treat them as deterministic unknowns and estimate them using an empirical loss, such as maximum likelihood estimation (empirical Bayes approach).
So everyone eventually stops being Bayesian!
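
As a sketch in symbols, the two ways of stopping can be written side by side. Here $\gamma$ denotes the hyper-hyper-parameters, a symbol introduced only for this illustration: \[ \begin{aligned} \text{Stop by fixing:} \quad & p(\beta, \alpha \mid \mathbf{y}) \propto p(\mathbf{y} \mid \beta)\, p(\beta \mid \alpha)\, p(\alpha \mid \gamma), \quad \gamma \text{ fixed to assumed values} \\ \text{Stop by estimating:} \quad & \hat{\alpha} = \arg\max_{\alpha} \int p(\mathbf{y} \mid \beta)\, p(\beta \mid \alpha)\, d\beta, \quad p(\beta \mid \mathbf{y}, \hat{\alpha}) \propto p(\mathbf{y} \mid \beta)\, p(\beta \mid \hat{\alpha}) \end{aligned} \]
In the NLME setting described earlier, $\beta$ plays the role of the random effects and $\alpha$ the role of the population parameters, which corresponds to the second line.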

Other Generative Models

Hidden Markov Models (HMMs): generative models for sequential data with discrete latent states $\boldsymbol{z}_t$, Markovian dynamics, and discrete steps in time.
State-Space Models (SSMs): generative models for sequential data with continuous latent states, Markovian dynamics, and discrete steps in time.
- Linear-Gaussian models lead to the Kalman filter, while nonlinear dynamics lead to extended Kalman filters or particle filters.
Latent Neural SDEs: stochastic differential equation (SDE) version of latent neural ODEs: $d\boldsymbol{z}(t) = f(\boldsymbol{z}, t)\,dt + g(\boldsymbol{z}, t)\,d\mathbf{W}(t)$.
- Continuous-time version of state-space models.
Copulas: generative models for multivariate data that model the dependence structure between variables separately from their marginal distributions.
- Common families: Gaussian, $t$, Clayton, Gumbel, Frank.
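
For the copula entry above, a minimal sketch of sampling from a Gaussian copula and then attaching arbitrary marginals is shown below. The correlation matrix and the two marginals (exponential and logistic) are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

def gaussian_copula_sample(corr, n_samples, rng):
    """Draw uniforms on [0, 1]^D whose dependence follows a Gaussian copula."""
    z = rng.multivariate_normal(np.zeros(corr.shape[0]), corr, size=n_samples)
    # The standard normal CDF maps each coordinate to a Uniform(0, 1) marginal.
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

rng = np.random.default_rng(0)
corr = np.array([[1.0, 0.8], [0.8, 1.0]])
u = gaussian_copula_sample(corr, 5000, rng)

# Attach marginals through their inverse CDFs (illustrative choices).
x1 = -np.log(1.0 - u[:, 0]) / 2.0            # Exponential with rate 2
x2 = np.log(u[:, 1] / (1.0 - u[:, 1]))       # standard logistic
```

The dependence between `x1` and `x2` comes entirely from `corr`, while their marginal shapes come from the inverse CDFs, which is the separation the copula construction provides.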

Point Processes: generative models for event data that model the times of events as random variables.
- Examples include Poisson processes and Hawkes processes. Used for adverse events, dosing, survival analysis.
Latent Dirichlet Allocation: a hierarchical generative model for discrete data (e.g. text) that models documents as mixtures of topics, where each topic is a distribution over words.
- Not to be confused with linear discriminant analysis, the LDA classifier covered earlier.
Bayesian Non-Parametric Models: a general class of non-parametric generative models, enabling the model to adapt its complexity to the data.
- Examples include the Gaussian Process, Dirichlet Process, and the Indian Buffet Process (IBP).
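
As an illustration of the point process entry above, the simplest case is a homogeneous Poisson process, which can be sampled by accumulating independent exponential gaps between events. The rate and time horizon below are arbitrary values chosen for the sketch.

```python
import numpy as np

def sample_poisson_process(rate, t_end, rng):
    """Event times of a homogeneous Poisson process on [0, t_end]."""
    times = []
    t = rng.exponential(1.0 / rate)
    while t <= t_end:
        times.append(t)
        t += rng.exponential(1.0 / rate)   # i.i.d. exponential inter-event gaps
    return np.array(times)

rng = np.random.default_rng(0)
events = sample_poisson_process(rate=2.0, t_end=50.0, rng=rng)
```

A Hawkes process differs in that each event temporarily raises the rate of future events, so the gaps are no longer independent.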

Restricted Boltzmann Machines (RBMs): historically important energy-based generative models (a type of undirected probabilistic graphical model) with a layer of visible units and a layer of hidden units, connected by undirected edges between the two layers only; used for unsupervised learning and dimensionality reduction.
- The visible units represent the observed data, while the hidden units capture latent features.
- RBMs can be stacked to form Deep Belief Networks (DBNs) for more complex representations.
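
Because there are no connections within a layer, the hidden units are conditionally independent given the visible units and vice versa, which makes block Gibbs sampling straightforward. The sketch below assumes a binary RBM with untrained, randomly initialized weights; the sizes and values are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_vis, b_hid, rng):
    """One block-Gibbs sweep for a binary RBM: sample h | v, then v | h."""
    p_h = sigmoid(v @ W + b_hid)                       # P(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_vis)                     # P(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 3
W = 0.1 * rng.standard_normal((n_vis, n_hid))   # untrained, illustrative weights
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

# Run a short chain from random binary visible states (4 parallel chains).
v = rng.integers(0, 2, size=(4, n_vis)).astype(float)
for _ in range(10):
    v, h = gibbs_step(v, W, b_vis, b_hid, rng)
```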