Machine Learning and Neural Networks

Authors:
Mohamed Tarek, Andreu Vall

Context and evolution

Use cases

Toxicity prediction with deep learning

Image-based diagnostics with deep learning

What is machine learning?

Main categories of machine learning (ML)

Supervised ML

  • Prediction
  • Classification

Unsupervised ML

  • Generative models
  • Feature extraction
  • Clustering

Semi-supervised ML

  • Has components of both supervised and unsupervised ML
  • Examples:
    • Conditional generative models
    • Unsupervised then supervised ML

Reinforcement learning

  • ML for sequential decision-making in a stochastic environment
  • Not in scope for this training

ML algorithms

  • ML using neural networks is known as deep learning (DL).
  • Not all ML algorithms use neural networks.
  • Many ML algorithms use probabilistic models as their work-horse.
  • Probabilistic machine learning algorithms borrow many concepts from computational (Bayesian) statistics.
  • Not all machine learning algorithms use probabilistic models, e.g.
    • K-nearest neighbor algorithm
    • Decision trees
    • Hierarchical clustering
    • DBSCAN

ML is not all new!

  • Machine learning builds on and often re-invents many fields including:
    • Statistics
    • Optimization
    • Image and signal processing
    • Software engineering
  • Related terms:
    • Data mining
    • Data science
    • Artificial intelligence

Probabilistic supervised ML

Scope

  • Data type: labelled data \((\mathbf{x}_i, \mathbf{y}_i)\) for \(i \in 1, \dots, N\)
  • Tasks:
    • Prediction, or
    • Classification
  • Model type:
    • Conditional model of \(\mathbf{y} \mid \mathbf{x}\)
    • Also known as a discriminative model

Prediction model examples

  • Unconstrained response (positive or negative)
    • Prediction is \(\mathbf{f}(\mathbf{x})\).
    • The covariance matrix \(\mathbf{\Sigma}\) is often diagonal and can be fixed (not estimated).

\[ \mathbf{y} \sim \text{Normal}(\mathbf{f}(\mathbf{x}), \mathbf{\Sigma}) \]

  • Positive response
    • Prediction is the mean/median/mode of the distribution.
    • The median is the element-wise exponential of \(\mathbf{f}(\mathbf{x})\).

\[ \mathbf{y} \sim \text{LogNormal}(\mathbf{f}(\mathbf{x}), \mathbf{\Sigma}) \]
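
A minimal NumPy sketch of this positive-response model, with made-up values for \(\mathbf{f}(\mathbf{x})\) and a diagonal \(\mathbf{\Sigma}\), checking numerically that the median of the LogNormal response is the element-wise exponential of \(\mathbf{f}(\mathbf{x})\):

```python
import numpy as np

rng = np.random.default_rng(0)

f_x   = np.array([0.3, -1.2])   # made-up f(x) on the log scale
sigma = np.array([0.5, 0.8])    # made-up standard deviations (square roots of a diagonal Sigma)

# y ~ LogNormal(f(x), Sigma)  <=>  log(y) ~ Normal(f(x), Sigma)
y = np.exp(rng.normal(loc=f_x, scale=sigma, size=(200_000, 2)))

print(np.median(y, axis=0))     # approximately exp(f(x)) element-wise
print(np.exp(f_x))
```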

Classification model examples

  • Binary classification

\[ \begin{aligned} y & \sim \text{Bernoulli}(\text{logistic}(f(\mathbf{x}))) \\ \text{logistic}(z) & = \frac{1}{1 + e^{-z}} \end{aligned} \]

  • Multinomial classification, number of classes \(K > 2\)

\[ \begin{aligned} \mathbf{y} & \sim \text{Categorical}(\mathbf{\text{softmax}}(\mathbf{f}(\mathbf{x}))) \\ \mathbf{\text{softmax}}(\mathbf{z}) & = \Bigg[ \frac{e^{z_l}}{\sum_{j = 1}^K e^{z_j}} \text{ for } l \in 1, \dots, K \Bigg] \\ \end{aligned} \]

  • Prediction is the mode, i.e. the class with the highest probability
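
A minimal NumPy sketch of the logistic and softmax link functions and the resulting class prediction, using made-up scores for \(f(\mathbf{x})\):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) link for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax link for multinomial classification with K > 2 classes."""
    z = z - np.max(z)             # shift for numerical stability; softmax is unchanged
    e = np.exp(z)
    return e / e.sum()

f_x = np.array([0.5, 2.0, -1.0])  # made-up scores f(x) for K = 3 classes
p = softmax(f_x)
print(p, p.sum())                 # class probabilities, summing to 1
print(np.argmax(p))               # prediction = mode, the class with the highest probability
print(logistic(1.3))              # probability of class 1 for a made-up binary score f(x) = 1.3
```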

Probabilistic unsupervised ML

Scope

  • Data type: un-labelled data \(\mathbf{y}_i\) for \(i \in 1, \dots, N\).
  • Tasks:
    • Learning data distributions, aka generative modelling
    • Dimension reduction / feature extraction / embedding
    • Clustering
  • Model type:
    • Generative (simulation) model for \(\mathbf{y}\).
    • Has latent random variables \(\mathbf{z}\), e.g. \(\mathbf{z} \sim \text{Normal}(\mathbf{0}, \mathbf{I})\).
    • Models the distribution of \(\mathbf{y} \mid \mathbf{z}\) explicitly.
    • Can (approximately) infer the inverse distribution \(\mathbf{z} \mid \mathbf{y}\).

Generative model examples

  • Probabilistic principal component analysis
    • \(\mathbf{\Sigma}\) is usually diagonal
    • Equivalent to a linear mixed effects model \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & \sim \text{Normal}(\mathbf{A} \cdot \mathbf{z} + \mathbf{b}, \mathbf{\Sigma}) \end{aligned} \]
  • Decoder of variational autoencoders (VAE)
    • \(\mathbf{\text{NN}_{\mu}}(\mathbf{z})\) and \(\mathbf{\text{NN}_{\Sigma}}(\mathbf{z})\) are neural networks.
    • The output of \(\mathbf{\text{NN}_{\mu}}(\mathbf{z})\) is unconstrained (positive or negative).
    • The output of \(\mathbf{\text{NN}_{\Sigma}}(\mathbf{z})\) is usually a diagonal covariance matrix. \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & \sim \text{Normal} \Big( \mathbf{\text{NN}_{\mu}}(\mathbf{z}), \mathbf{\text{NN}_{\Sigma}}(\mathbf{z}) \Big) \end{aligned} \]

Generative model examples

  • Generator of a generative adversarial network (GAN) \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & = \mathbf{\text{NN}}(\mathbf{z}) \end{aligned} \]

  • Gaussian mixture model

    • \(\mathbf{\Phi}\) is a \(K\)-dimensional vector of mixing probabilities, one per cluster, that sums to 1.
    • \(z \in \{1, \dots, K\}\) is a discrete latent variable, aka the cluster assignment.
    • For each \(z\) value, there are parameters:
      • Mean \(\mathbf{\mu}_z\), and
      • Covariance \(\mathbf{\Sigma}_z\). \[ \begin{aligned} z & \sim \text{Categorical}(\mathbf{\Phi}) \\ \mathbf{y} & \sim \text{Normal} \Big( \mathbf{\mu}_z, \mathbf{\Sigma}_z \Big) \end{aligned} \]
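
A minimal NumPy sketch of sampling from a two-cluster Gaussian mixture, with made-up values for \(\mathbf{\Phi}\), \(\mathbf{\mu}_z\) and \(\mathbf{\Sigma}_z\):

```python
import numpy as np

rng = np.random.default_rng(1)

phi    = np.array([0.3, 0.7])                           # Phi: mixing probabilities, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # mu_z per cluster
sigmas = [np.eye(2), 0.5 * np.eye(2)]                   # Sigma_z per cluster

def sample_gmm(n):
    z = rng.choice(len(phi), size=n, p=phi)             # z ~ Categorical(Phi)
    y = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return z, y

z, y = sample_gmm(5)
print(z)   # cluster assignments
print(y)   # observations drawn from Normal(mu_z, Sigma_z)
```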

Generative models

Goal

To generate synthetic data that is indistinguishable in distribution from the real data.

Images

https://www.thispersondoesnotexist.com

Time-series / longitudinal / panel data

Data structure

  • Data is often hierarchical in nature and heterogeneous
  • Longitudinal data
    • Every subject has one or more response variables (e.g. PK and PD responses)
    • Every response variable has one or more observations.
    • Response variables can have different numbers of observations across and/or within subjects.
  • Image/scan data
    • Every subject has one or more images/scans, e.g. CT scan and MRI scan.
    • Every image/scan has many pixels/points.
    • Images/scans can have different sizes and resolutions across and/or within subjects.

Key point

The model structure must match the structure of the data, e.g.:

  • Temporal models are used for time-series data, and
  • Spatial models for image/scan/spatial data.

Model structure

  • Generative models are probabilistic models that make use of random variables to simulate new data.
  • Generative models generate synthetic data \(y\) from:
    • A latent representation \(\eta\) of the data point, modelled as a random variable.
    • A parametric function \(f(\eta)\) which returns either
      • a probability distribution for \(y\) given its latent representation, or
      • a single point \(y\) directly.
  • Example

\[ \begin{aligned} \eta & \sim \mathcal{N} ( 0, 1 ) \\ y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg) \end{aligned} \]

Model structure

Latent variables

The latent representation \(\eta\) of the data point is:

  • Low-dimensional and fixed-length
  • Modelled as a random variable sampled from a parametric prior distribution to allow generating data.
  • Also known as a latent variable, embedding variable, latent feature or random effect.
  • If \(\eta\) is discrete, it is commonly known as a cluster.
  • The latent variable \(\eta\) is per point/subject/image.
  • The prior distribution is shared between all points/subjects/images.

Model structure

Decoder

  • The parametric function \(f(\eta)\) is also known as a probabilistic decoder, generator, data model or error model.
  • The architecture/structure of the decoder \(f\) must match the structure of the data.

Model structure

Parameters

  • The parameters of the decoder \(f\) and the prior distribution of \(\eta\) are shared between all the points/subjects/images.
  • The parameters of \(f\) and the prior distribution of \(\eta\) are known as population-level parameters in the hierarchical statistics literature.
  • A subset of the parameters of \(f\) are also sometimes called fixed effects in the mixed effects literature.
  • In deep learning, all parameters are simply called weights, or weights and biases.

Generative deep learning models

Decoder

The decoder \(f\) returns a distribution whose parameters are described by one or more neural networks. Example:

\[ y \sim f(\eta) = \mathcal{N} ( \text{NN}_{\mu}(\eta), \text{NN}_{\sigma^2}(\eta) ) \]

Latent variables

  • \(\eta\) is a vector of random degrees of freedom that the neural networks can use in any way they see fit to generate plausible data.
  • What each latent variable does to \(f(\eta)\) and \(y\) is emergent behavior and not something engineered into the model.

\[ \eta \sim \mathcal{N}(0, I) \]

Generative pharmacometric models

Decoder

The decoder \(f\) returns a distribution (e.g. a normal distribution) whose parameters (e.g. mean and covariance) are described by either:

  • A mechanistic model, often based on differential equations, aka a scientific model,
  • A fully empirical model, aka a machine learning model, or
  • A semi-mechanistic model, aka a scientific machine learning model.

Generative pharmacometric models

Latent variables

  • \(\eta\) is interpreted as a subject’s deviation from the population’s typical value for a specific individual quantity.
  • For example, the individual clearance is \(\text{CL} = \text{tvcl} \cdot \exp(\eta_{\text{CL}})\).
  • The typical value \(\text{tvcl}\) is a population-level parameter (part of the decoder).
  • The prior distribution’s parameters are also population-level parameters.
  • The prior distribution of \(\eta\) models the between-subject variability not explained by the covariates.

Sampling from the generative model

Sampling from the generative model is a 3-step process:

  1. Sample a random value for the latent representation \(\eta\) from its prior distribution,
  2. Decode \(\eta\) using \(f(\eta)\) to produce a distribution for \(y\), conditional on \(\eta\), and
  3. Sample from the conditional distribution of \(y\).

\[ \begin{aligned} \eta & \sim \mathcal{N} ( 0, 1 ) \\ y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg) \end{aligned} \]
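
A minimal NumPy sketch of these three steps for the example model above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: sample the latent representation eta from its prior
eta = rng.normal(0.0, 1.0)

# Step 2: decode eta into a conditional distribution for y
mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
cov  = np.array([[0.2, 0.0], [0.0, 0.2]])

# Step 3: sample y from the conditional distribution
y = rng.multivariate_normal(mean, cov)
print(eta, y)
```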

Fitting the generative model

  • Goal: find the best values for the parameters of \(f\) and the prior distribution of \(\eta\) by some definition of best.
  • One common fitting objective is maximizing the log marginal likelihood (aka evidence):

\[ \sum_{i=1}^N \log \int p(y_i \mid \eta) \cdot p(\eta) \, d\eta \]

  • \(p(y_i \mid \eta)\) is the probability of \(y = y_i\) according to the distribution \(f(\eta)\).
  • \(p(\eta)\) is the probability of \(\eta\) according to its prior distribution.
  • \(N\) is the number of points/images/subjects in the training data.
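
The integral generally has no closed form; one simple (if inefficient) way to approximate it is Monte Carlo sampling from the prior. The sketch below does this for a single data point of the running example model, with a made-up observation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

def log_marginal_likelihood(y_i, n_samples=2_000):
    """Monte Carlo estimate of log ∫ p(y_i | eta) p(eta) d(eta) for the example model."""
    etas = rng.normal(0.0, 1.0, size=n_samples)          # eta ~ p(eta) = N(0, 1)
    cov  = np.array([[0.2, 0.0], [0.0, 0.2]])
    log_p = np.array([
        multivariate_normal.logpdf(y_i, mean=np.array([1.0, 1.0]) * e + 2.0, cov=cov)
        for e in etas
    ])
    # log of the average of p(y_i | eta) over prior samples (log-sum-exp for stability)
    return np.log(np.mean(np.exp(log_p - log_p.max()))) + log_p.max()

print(log_marginal_likelihood(np.array([2.5, 2.4])))     # made-up observation
```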

Inverse problem

  • Instead of generating a synthetic value \(y\) from a random value \(\eta\), we can use the generative model in reverse.
  • Given an actual observed value \(y\), we can infer the latent variable \(\eta^*\) that best represents this data point \(y\).
  • This is done by solving the following optimization problem for some dissimilarity function \(D\).
  • \(D\) should quantify the dissimilarity between the distribution \(f(\eta)\) and the observation \(y\).

\[ \eta^* = g(y) = \text{argmin}_{\eta} D(f(\eta), y) \]

Inverse problem

  • In mixed effects modelling, a common choice for the dissimilarity function \(D(f(\eta), y)\) is:

\[ -\log p(y \mid \eta) - \log p(\eta) \]

  • \(-\log p(y \mid \eta)\) quantifies the dissimilarity between \(f(\eta)\) and \(y\).
  • \(-\log p(\eta)\) acts as regularization, which reduces over-fitting when \(y\) has a large noise component or is an outlier.
  • Over-fitting in this context is often measured using \(\epsilon\)-shrinkage in the mixed effects literature.
  • When using this particular definition of \(D\), \(\eta^*\) is called the empirical Bayes estimate.
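
A minimal sketch of computing \(\eta^*\) for the running example model by minimizing \(-\log p(y \mid \eta) - \log p(\eta)\) with scipy, using a made-up observation:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm

def neg_log_joint(eta, y):
    """D(f(eta), y) = -log p(y | eta) - log p(eta) for the running example model."""
    mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
    cov  = np.array([[0.2, 0.0], [0.0, 0.2]])
    return -multivariate_normal.logpdf(y, mean=mean, cov=cov) - norm.logpdf(eta, 0.0, 1.0)

y_obs = np.array([3.1, 2.9])                       # made-up observed data point
res = minimize_scalar(lambda e: neg_log_joint(e, y_obs))
eta_star = res.x                                   # empirical Bayes estimate of eta
print(eta_star)
```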

Inverse problem

  • More generally, in machine learning, \(\eta^*\) is called an embedding.
  • In this optimization, the parameters of \(f\) and the prior distribution are kept fixed.
  • Since the dimension of \(\eta\) is often less than that of \(y\), the inverse problem can be used for dimension reduction (aka factor analysis) or feature extraction.
  • The inverse problem can also be used to generate a fixed-length encoding of variable-length data.
  • If \(\eta\) is discrete, the inverse problem can be used for clustering.

Inverse problem

  • Solving the inverse problem is also known as encoding the data point \(y\).
  • One can train a surrogate neural network to approximate \(\eta^* = g(y)\) given any \(y\) without doing the optimization every time.
    • Useful for big data.
  • This surrogate model is usually called an encoder.
  • The optimization-based function itself \(g(y)\) can also be considered an exact encoder.

Inverse problem

  • Surrogates can also learn an approximation of the distribution of \(\eta\) values that could have plausibly generated \(y\).
  • This distribution is known as the posterior distribution \(p(\eta \mid y)\).
  • A surrogate model that approximates \(p(\eta \mid y)\) using variational inference is called a probabilistic encoder or a variational encoder.
  • Learning a surrogate model for the inverse problem to avoid solving it for every \(y\) is known as amortized learning.
    • This is usually done simultaneously while training the decoder/generator.

Conditional generative models

Covariates

(Baseline) covariates

\[ \begin{aligned} \eta & \sim p(.) \\ y & \sim f(\eta, x_b) \end{aligned} \]

  • \(x_b\) is the set of baseline covariates.
  • \(x_b\) can be an image/scan or its embedding!

Covariates

Time as a covariate

\[ \begin{aligned} \eta & \sim p(.) \\ y(t) & \sim f(\eta, x_b, t) \end{aligned} \]

  • \(y\) is observed at a few finite time points but can be simulated for any time point or interval.

Covariates

Time-varying covariates

\[ \begin{aligned} \eta & \sim p(.) \\ y(t) & \sim f(\eta, x_b, t, x_v(t)) \end{aligned} \]

  • \(x_v(t)\) interpolates/extrapolates the finite covariate measurements to any given model time \(t\).
  • \(x_v(t)\) could be a time-varying image/scan or its embedding!
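
A minimal sketch of constructing \(x_v(t)\) by linear interpolation of a covariate measured at a few made-up visit times (body weight here is only a placeholder):

```python
import numpy as np

# Made-up covariate measurements at a few visit times
t_obs = np.array([0.0, 24.0, 48.0, 96.0])     # hours
x_obs = np.array([70.0, 68.5, 69.0, 71.2])    # e.g. body weight in kg

def x_v(t):
    """Linearly interpolate (and flatly extrapolate) the covariate to any model time t."""
    return np.interp(t, t_obs, x_obs)

print(x_v(36.0))    # interpolated value between the 24 h and 48 h measurements
print(x_v(120.0))   # extrapolated value (held at the last measurement by np.interp)
```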

One-compartment PK model as a generative model

Latent variables

\[ \begin{aligned} \eta_a & \sim \mathcal{N}(0, \omega_{\eta_a}^2) \\ \eta_e & \sim \mathcal{N}(0, \omega_{\eta_e}^2) \\ \eta_v & \sim \mathcal{N}(0, \omega_{\eta_v}^2) \end{aligned} \]

  • Population-level parameters: \(\omega_{\eta_a}^2\), \(\omega_{\eta_e}^2\), \(\omega_{\eta_v}^2\).

One-compartment PK model as a generative model

Decoder

\[ \begin{aligned} k_{a,\eta} & = k_a \cdot e^{\eta_a} \\ k_{e,\eta} & = k_e \cdot e^{\eta_e} \\ V_{d,\eta} & = V_d \cdot e^{\eta_v} \\ C(t) & = \frac{D \cdot k_{a,\eta}}{V_{d,\eta} \cdot (k_{a,\eta} - k_{e,\eta})} \left( e^{-k_{e,\eta} \cdot t} - e^{-k_{a,\eta} \cdot t} \right) \\ y(t) & \sim \mathcal{N}(C(t), \sigma^2) \end{aligned} \]

One-compartment PK model as a generative model

Decoder

  • \(C(t)\): drug concentration at time \(t\).
  • \(D\): dose administered.
  • \(k_{a,\eta}\), \(k_{e,\eta}\), \(V_{d,\eta}\): individual absorption rate, elimination rate, and volume of distribution, including the random effects \(\eta\).
  • \(y(t)\): observed response (concentration).
  • \(\sigma^2\): variance of Gaussian residual error, same for all the time points.
  • Population-level parameters: \(k_a\), \(k_e\), \(V_d\), \(\sigma^2\).
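
A minimal NumPy sketch that samples one subject from this generative model; the population-level parameter values, the dose and the sampling times are all made up:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up population-level parameters
ka, ke, Vd = 1.0, 0.2, 10.0            # typical absorption rate, elimination rate, volume
om_a, om_e, om_v = 0.3, 0.25, 0.2      # omegas (standard deviations of the etas)
sigma = 0.1                            # residual error standard deviation
D = 100.0                              # dose

# Latent variables (random effects) for one subject
eta_a = rng.normal(0.0, om_a)
eta_e = rng.normal(0.0, om_e)
eta_v = rng.normal(0.0, om_v)

# Decoder: individual parameters and concentration-time profile
ka_i, ke_i, Vd_i = ka * np.exp(eta_a), ke * np.exp(eta_e), Vd * np.exp(eta_v)
t = np.linspace(0.5, 24.0, 48)
C = D * ka_i / (Vd_i * (ka_i - ke_i)) * (np.exp(-ke_i * t) - np.exp(-ka_i * t))

# Observation model: Gaussian residual error around C(t)
y = rng.normal(C, sigma)
print(y[:5])
```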

Visual predictive checks

  • Two univariate distributions are identical if all their respective quantiles are equal to one another.
  • The visual predictive check (VPC) plot commonly checks 3 quantile levels, \([0.1, 0.5, 0.9]\).

Visual predictive checks

  • For each quantile level, if the observed data’s quantile as a function of time (or any other covariate) differs from the generated data’s quantile (i.e. lies outside the simulated quantile interval), the two distributions are not similar.
  • A bad VPC is more informative than a good one.
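
A minimal sketch of the quantile comparison underlying a VPC, using placeholder observed and simulated data: compute the \([0.1, 0.5, 0.9]\) quantiles of the observed data per time bin and check whether they lie inside the corresponding simulated quantile intervals.

```python
import numpy as np

rng = np.random.default_rng(5)
levels = [0.1, 0.5, 0.9]

# Placeholder data: 100 observed subjects and 200 simulated replicates, each with 6 time bins
observed  = rng.lognormal(mean=1.0, sigma=0.3, size=(100, 6))
simulated = rng.lognormal(mean=1.0, sigma=0.3, size=(200, 100, 6))

obs_q = np.quantile(observed, levels, axis=0)        # (3, 6): observed quantiles per time bin
sim_q = np.quantile(simulated, levels, axis=1)       # (3, 200, 6): same quantiles per replicate
lo, hi = np.quantile(sim_q, [0.05, 0.95], axis=1)    # 90% simulated interval for each quantile

outside = (obs_q < lo) | (obs_q > hi)   # True where an observed quantile leaves the simulated interval
print(outside)
```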

Deep learning fundamentals

Neural networks

Training

  • Find network parameters \(W\) that minimize a suitable cost/loss function \[ C(W; x, y) = \sum_{i=1}^N L(y_i, \hat{y}(x_i; W)) \]
  • One common choice of \(L(y_i, \hat{y}(x_i; W))\) is: \[ L(y_i, \hat{y}(x_i; W)) = (y_i - \hat{y}(x_i; W))^2 \]
  • Iteratively refine the parameters by stochastic gradient descent \[ W^{\text{new}} = W^{\text{old}} - \alpha \cdot \nabla_W C(W; x, y) \]
  • Gradient computed by backpropagation (Rumelhart et al., 1986)
  • \(\alpha\) is known as the learning rate
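
A minimal NumPy sketch of this loop for the simplest possible "network", a linear model with a squared-error loss; the gradients are written out by hand instead of using backpropagation, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data: y = 3x + 1 plus noise
x = rng.normal(size=(256, 1))
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=(256, 1))

W, b  = np.zeros((1, 1)), np.zeros(1)   # parameters
alpha = 0.1                             # learning rate
batch = 32

for step in range(200):
    idx = rng.choice(len(x), size=batch, replace=False)   # stochastic mini-batch
    xb, yb = x[idx], y[idx]
    err = xb @ W + b - yb
    # Gradients of the mini-batch squared-error cost w.r.t. W and b
    grad_W = 2.0 * xb.T @ err / batch
    grad_b = 2.0 * err.mean(axis=0)
    W -= alpha * grad_W                 # gradient-descent update
    b -= alpha * grad_b

print(W, b)   # should end up close to 3 and 1
```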

Negative log likelihood as a loss function

The negative log likelihood generalizes the sum of squared errors loss function.

Normal distribution

\[ \begin{aligned} y_i & \sim \mathcal{N}(\hat{y}(x_i; W), \sigma^2) \\ -\log P(y_i) & = \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{(y_i - \hat{y}(x_i; W))^2}{2 \cdot \sigma^2} \end{aligned} \]

Setting \(\sigma^2 = \frac{1}{2}\): \[ \begin{aligned} -\log P(y) & = -\sum_{i=1}^N \log P(y_i) \\ & = \underbrace{-\frac{N}{2} \cdot \log (2) + \frac{N}{2} \cdot \log (2 \pi)}_{\text{constant}} + \underbrace{\sum_{i=1}^{N} (y_i - \hat{y}(x_i; W))^2}_{C(W; x, y)} \end{aligned} \]

Control on withheld data

Figure by Hastie et al., 2008

Feed-forward neural networks

  • The basic interface is largely determined by the task
    • Number of input units
    • Number of output units
    • Activation function for the output layer
  • The architecture must be selected carefully
    • Number of hidden layers
    • Number of units per hidden layer
    • Activation function for the hidden layers
    • Parameters of the optimization algorithm

Activation functions

  • At the output layer we typically use
    • the identity function for regression tasks,
    • the sigmoid function for binary classification tasks,
    • the softmax function for multi-class classification tasks, and
    • the softplus function to ensure non-negativity.
  • At the hidden layers,
    • traditional activations are the sigmoid and tanh functions,
    • but deeper networks will suffer from vanishing gradients.
    • In this case, experiment with the ReLU and SELU functions.
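
A minimal NumPy sketch of the activation functions listed above (the SELU constants are the published defaults):

```python
import numpy as np

def identity(z):  return z                              # regression outputs
def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))       # binary classification outputs
def softmax(z):                                         # multi-class classification outputs
    e = np.exp(z - np.max(z))
    return e / e.sum()
def softplus(z):  return np.log1p(np.exp(z))            # non-negative outputs
def relu(z):      return np.maximum(0.0, z)             # common hidden-layer activation
def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, 0.0, 2.0])
for fn in (identity, sigmoid, softplus, relu, selu):
    print(fn.__name__, fn(z))
print("softmax", softmax(z))
```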

Architecture selection

  • Typically based on empirical investigation and assessment on withheld data
  • Initial exploration to get a feel for reasonable network dimensions
  • Hyperparameter search
    • Design a grid of hyperparameters based on the previous exploration
    • In case of a tight budget, pay attention to the learning rate
    • Exhaustive (grid), random, or Bayesian search

Bergstra and Bengio “Random search for hyper-parameter optimization.” (JMLR, 2012)

Under-fitting and over-fitting

Figure by Hastie et al., 2008

Strategies to prevent over-fitting

  • Early stopping based on withheld data
  • Careful data splitting to avoid data leakage
  • Networks as large as necessary, but no larger
  • Larger and more diverse training datasets
  • Regularization
    • Penalization of the cost function
    • Dropout (Srivastava et al., 2014)
    • Data augmentation

Penalization of the cost function

  • Cost function

\[ \text{arg min}_W \, C(W; x, y) \]

  • Penalized cost function reduces expressiveness

\[ \text{arg min}_W \, C(W; x, y) + \lambda \cdot \left\lVert W \right\rVert_p^p \]

  • \(\left\lVert W \right\rVert_p\) is the \(L_p\) norm of the parameters \(W\).
  • \(\lambda\) is the regularization strength hyperparameter.
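
A minimal sketch of evaluating a penalized cost, using a squared-error cost for a linear model as a stand-in for \(C(W; x, y)\) and made-up data and \(\lambda\):

```python
import numpy as np

def cost(W, x, y):
    """Un-penalized squared-error cost for a linear model (stand-in for C(W; x, y))."""
    return np.sum((y - x @ W) ** 2)

def penalized_cost(W, x, y, lam, p=2):
    """C(W; x, y) + lambda * ||W||_p^p for p in {1, 2}."""
    penalty = np.sum(np.abs(W) ** p)
    return cost(W, x, y) + lam * penalty

rng = np.random.default_rng(7)
x = rng.normal(size=(50, 3))
W_true = np.array([1.0, 0.0, -2.0])
y = x @ W_true + 0.1 * rng.normal(size=50)

W = rng.normal(size=3)
print(penalized_cost(W, x, y, lam=0.1, p=2))   # ridge-style (L2) penalty
print(penalized_cost(W, x, y, lam=0.1, p=1))   # LASSO-style (L1) penalty
```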

Regularization types

  • \(L_2\) regularization
    • Does not promote sparsity of \(W\)
    • Equivalent to the regularization used in ridge regression. \[ \left\lVert W \right\rVert_2^2 = \sum_{i=1}^{n_w} W_i^2 \]
  • \(L_1\) regularization
    • More likely to give sparse weights \(W\)
    • Equivalent to the regularization used in LASSO regression. \[ \left\lVert W \right\rVert_1 = \sum_{i=1}^{n_w} |W_i| \]

Regularization using prior distributions

  • Prior distributions generalize the traditional \(L_1\) and \(L_2\) regularization.
  • Maximizing the log likelihood + log prior is called maximum a-posteriori (MAP) estimation.

Normal prior is \(L_2\) regularization

\[ \begin{aligned} W & \sim \mathcal{N}(0, \sigma^2 \cdot I) \\ -\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{W_i^2}{2 \cdot \sigma^2} \Bigg) \\ & = \underbrace{n_w \cdot \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) \Bigg)}_{\text{constant}} + \underbrace{\frac{1}{2 \cdot \sigma^2}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} W_i^2}_{\left\lVert W \right\rVert_2^2} \\ & = \lambda \cdot \left\lVert W \right\rVert_2^2 + \text{constant} \end{aligned} \]

Regularization using prior distributions

Laplace prior is \(L_1\) regularization

\[ \begin{aligned} W_i & \sim \text{Laplace}(0, b) \\ -\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (2 \cdot b) + \frac{|W_i|}{b} \Bigg) \\ & = \underbrace{n_w \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} |W_i|}_{\left\lVert W \right\rVert_1} \\ & = \lambda \cdot \left\lVert W \right\rVert_1 + \text{constant} \end{aligned} \]