Machine Learning and Neural Networks

Authors:
Mohamed Tarek, Andreu Vall

Context and evolution

Use cases

Toxicity prediction with deep learning

Image-based diagnostics with deep learning

What is machine learning?

Main categories of machine learning (ML)

Supervised ML

  • Prediction
  • Classification

Unsupervised ML

  • Generative models
  • Feature extraction
  • Clustering

Semi-supervised ML

  • Has components of both supervised and unsupervised ML
  • Examples:
    • Conditional generative models
    • Unsupervised then supervised ML

Reinforcement learning

  • ML for sequential decision-making in a stochastic environment
  • Not in scope for this training

ML algorithms

  • ML using neural networks is known as deep learning (DL).
  • Not all ML algorithms use neural networks.
  • Many ML algorithms use probabilistic models as their work-horse.
  • Probabilistic machine learning algorithms borrow many concepts from computational (Bayesian) statistics.
  • Not all machine learning algorithms use probabilistic models, e.g.
    • K-nearest neighbor algorithm
    • Decision trees
    • Hierarchical clustering
    • DBSCAN

ML is not all new!

  • Machine learning builds on and often re-invents many fields including:
    • Statistics
    • Optimization
    • Image and signal processing
    • Software engineering
  • Related terms:
    • Data mining
    • Data science
    • Artificial intelligence

Probabilistic supervised ML

Scope

  • Data type: labelled data \((\mathbf{x}_i, \mathbf{y}_i)\) for \(i \in 1, \dots, N\)
  • Tasks:
    • Prediction, or
    • Classification
  • Model type:
    • Conditional model of \(\mathbf{y} \mid \mathbf{x}\)
    • Also known as a discriminative model

Prediction model examples

  • Unconstrained response (positive or negative)
    • Prediction is \(\mathbf{f}(\mathbf{x})\).
    • The covariance matrix \(\mathbf{\Sigma}\) is often diagonal and can be fixed (not estimated).

\[ \mathbf{y} \sim \text{Normal}(\mathbf{f}(\mathbf{x}), \mathbf{\Sigma}) \]

  • Positive response
    • Prediction is the mean/median/mode of the distribution.
    • The median is the element-wise exponential of \(\mathbf{f}(\mathbf{x})\).

\[ \mathbf{y} \sim \text{LogNormal}(\mathbf{f}(\mathbf{x}), \mathbf{\Sigma}) \]
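
A minimal NumPy sketch of this positive-response model, with made-up values for \(\mathbf{f}(\mathbf{x})\) and a diagonal \(\mathbf{\Sigma}\), checking numerically that the median of the LogNormal response is the element-wise exponential of \(\mathbf{f}(\mathbf{x})\):

```python
import numpy as np

rng = np.random.default_rng(0)

f_x   = np.array([0.3, -1.2])   # made-up f(x) on the log scale
sigma = np.array([0.5, 0.8])    # made-up standard deviations (square roots of a diagonal Sigma)

# y ~ LogNormal(f(x), Sigma)  <=>  log(y) ~ Normal(f(x), Sigma)
y = np.exp(rng.normal(loc=f_x, scale=sigma, size=(200_000, 2)))

print(np.median(y, axis=0))     # approximately exp(f(x)) element-wise
print(np.exp(f_x))
```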

Classification model examples

  • Binary classification

\[ \begin{aligned} y & \sim \text{Bernoulli}(\text{logistic}(f(\mathbf{x}))) \\ \text{logistic}(z) & = \frac{1}{1 + e^{-z}} \end{aligned} \]

  • Multinomial classification, number of classes \(K > 2\)

\[ \begin{aligned} \mathbf{y} & \sim \text{Categorical}(\mathbf{\text{softmax}}(\mathbf{f}(\mathbf{x}))) \\ \mathbf{\text{softmax}}(\mathbf{z}) & = \Bigg[ \frac{e^{z_l}}{\sum_{j = 1}^K e^{z_j}} \text{ for } l \in 1, \dots, K \Bigg] \\ \end{aligned} \]

  • Prediction is the mode, i.e. the class with the highest probability
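
A minimal NumPy sketch of the logistic and softmax link functions and the resulting class prediction, using made-up scores for \(f(\mathbf{x})\):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) link for binary classification."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Softmax link for multinomial classification with K > 2 classes."""
    z = z - np.max(z)             # shift for numerical stability; softmax is unchanged
    e = np.exp(z)
    return e / e.sum()

f_x = np.array([0.5, 2.0, -1.0])  # made-up scores f(x) for K = 3 classes
p = softmax(f_x)
print(p, p.sum())                 # class probabilities, summing to 1
print(np.argmax(p))               # prediction = mode, the class with the highest probability
print(logistic(1.3))              # probability of class 1 for a made-up binary score f(x) = 1.3
```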

Probabilistic unsupervised ML

Scope

  • Data type: un-labelled data \(\mathbf{y}_i\) for \(i \in 1, \dots, N\).
  • Tasks:
    • Learning data distributions, aka generative modelling
    • Dimension reduction / feature extraction / embedding
    • Clustering
  • Model type:
    • Generative (simulation) model for \(\mathbf{y}\).
    • Has latent random variables \(\mathbf{z}\), e.g. \(\mathbf{z} \sim \text{Normal}(\mathbf{0}, \mathbf{I})\).
    • Models the distribution of \(\mathbf{y} \mid \mathbf{z}\) explicitly.
    • Can (approximately) infer the inverse distribution \(\mathbf{z} \mid \mathbf{y}\).

Generative model examples

  • Probabilistic principal component analysis
    • \(\mathbf{\Sigma}\) is usually diagonal
    • Equivalent to a linear mixed effects model \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & \sim \text{Normal}(\mathbf{A} \cdot \mathbf{z} + \mathbf{b}, \mathbf{\Sigma}) \end{aligned} \]
  • Decoder of variational autoencoders (VAE)
    • \(\mathbf{\text{NN}_{\mu}}(\mathbf{z})\) and \(\mathbf{\text{NN}_{\Sigma}}(\mathbf{z})\) are neural networks.
    • The output of \(\mathbf{\text{NN}_{\mu}}(\mathbf{z})\) is unconstrained (positive or negative).
    • The output of \(\mathbf{\text{NN}_{\Sigma}}(\mathbf{z})\) is usually a diagonal covariance matrix. \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & \sim \text{Normal} \Big( \mathbf{\text{NN}_{\mu}}(\mathbf{z}), \mathbf{\text{NN}_{\Sigma}}(\mathbf{z}) \Big) \end{aligned} \]

Generative model examples

  • Generator of a generative adversarial network (GAN) \[ \begin{aligned} \mathbf{z} & \sim \text{Normal}(\mathbf{0}, \mathbf{I}) \\ \mathbf{y} & = \mathbf{\text{NN}}(\mathbf{z}) \end{aligned} \]

  • Gaussian mixture model

    • \(\mathbf{\Phi}\) is a \(K\)-dimensional vector of mixing probabilities, one per cluster, that sums to 1.
    • \(z \in \{1, \dots, K\}\) is a discrete latent variable, aka the cluster assignment.
    • For each \(z\) value, there are parameters:
      • Mean \(\mathbf{\mu}_z\), and
      • Covariance \(\mathbf{\Sigma}_z\). \[ \begin{aligned} z & \sim \text{Categorical}(\mathbf{\Phi}) \\ \mathbf{y} & \sim \text{Normal} \Big( \mathbf{\mu}_z, \mathbf{\Sigma}_z \Big) \end{aligned} \]
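
A minimal NumPy sketch of sampling from a two-cluster Gaussian mixture, with made-up values for \(\mathbf{\Phi}\), \(\mathbf{\mu}_z\) and \(\mathbf{\Sigma}_z\):

```python
import numpy as np

rng = np.random.default_rng(1)

phi    = np.array([0.3, 0.7])                           # Phi: mixing probabilities, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # mu_z per cluster
sigmas = [np.eye(2), 0.5 * np.eye(2)]                   # Sigma_z per cluster

def sample_gmm(n):
    z = rng.choice(len(phi), size=n, p=phi)             # z ~ Categorical(Phi)
    y = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return z, y

z, y = sample_gmm(5)
print(z)   # cluster assignments
print(y)   # observations drawn from Normal(mu_z, Sigma_z)
```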

Generative models

Goal

To generate synthetic data that is indistinguishable in distribution from the real data.

Images

https://www.thispersondoesnotexist.com

Time-series / longitudinal / panel data

Data structure

  • Data is often hierarchical in nature and heterogeneous
  • Longitudinal data
    • Every subject has one or more response variables (e.g. PK and PD responses)
    • Every response variable has one or more observations.
    • Response variables can have different numbers of observations across and/or within subjects.
  • Image/scan data
    • Every subject has one or more images/scans, e.g. CT scan and MRI scan.
    • Every image/scan has many pixels/points.
    • Images/scans can have different sizes and resolutions across and/or within subjects.

Key point

The model structure must match the structure of the data, e.g.:

  • Temporal models are used for time-series data, and
  • Spatial models for image/scan/spatial data.

Model structure

  • Generative models are probabilistic models that make use of random variables to simulate new data.
  • Generative models generate synthetic data \(y\) from:
    • A latent representation \(\eta\) of the data point, modelled as a random variable.
    • A parametric function \(f(\eta)\) which returns either
      • a probability distribution for \(y\) given its latent representation, or
      • a single point \(y\) directly.
  • Example

\[ \begin{aligned} \eta & \sim \mathcal{N} ( 0, 1 ) \\ y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg) \end{aligned} \]

Model structure

Latent variables

The latent representation \(\eta\) of the data point is:

  • Low-dimensional and fixed-length
  • Modelled as a random variable sampled from a parametric prior distribution to allow generating data.
  • Also known as a latent variable, embedding variable, latent feature or random effect.
  • If \(\eta\) is discrete, it is commonly known as a cluster.
  • The latent variable \(\eta\) is per point/subject/image.
  • The prior distribution is shared between all points/subjects/images.

Model structure

Decoder

  • The parametric function \(f(\eta)\) is also known as a probabilistic decoder, generator, data model or error model.
  • The architecture/structure of the decoder \(f\) must match the structure of the data.

Model structure

Parameters

  • The parameters of the decoder \(f\) and the prior distribution of \(\eta\) are shared between all the points/subjects/images.
  • The parameters of \(f\) and the prior distribution of \(\eta\) are known as population-level parameters in the hierarchical statistics literature.
  • A subset of the parameters of \(f\) are also sometimes called fixed effects in the mixed effects literature.
  • In deep learning, all parameters are simply called weights, or weights and biases.

Generative deep learning models

Decoder

The decoder \(f\) returns a distribution whose parameters are described by one or more neural networks. Example:

\[ y \sim f(\eta) = \mathcal{N} ( \text{NN}_{\mu}(\eta), \text{NN}_{\sigma^2}(\eta) ) \]

Latent variables

  • \(\eta\) is a vector of random degrees of freedom that the neural networks can use in any way they see fit to generate plausible data.
  • What each latent variable does to \(f(\eta)\) and \(y\) is emergent behavior and not something engineered into the model.

\[ \eta \sim \mathcal{N}(0, I) \]

Generative pharmacometric models

Decoder

The decoder \(f\) returns a distribution (e.g. a normal distribution) whose parameters (e.g. mean and covariance) are described by either:

  • A mechanistic model, often based on differential equations, aka a scientific model,
  • A fully empirical model, aka a machine learning model, or
  • A semi-mechanistic model, aka a scientific machine learning model.

Generative pharmacometric models

Latent variables

  • \(\eta\) is interpreted as a subject’s deviation from the population’s typical value for a specific individual quantity.
  • For example, the individual clearance is \(\text{CL} = \text{tvcl} \cdot \exp(\eta_{\text{CL}})\).
  • The typical value \(\text{tvcl}\) is a population-level parameter (part of the decoder).
  • The prior distribution’s parameters are also population-level parameters.
  • The prior distribution of \(\eta\) models the between-subject variability not explained by the covariates.

Sampling from the generative model

Sampling from the generative model is a 3-step process:

  1. Sample a random value for the latent representation \(\eta\) from its prior distribution,
  2. Decode \(\eta\) using \(f(\eta)\) to produce a distribution for \(y\), conditional on \(\eta\), and
  3. Sample from the conditional distribution of \(y\).

\[ \begin{aligned} \eta & \sim \mathcal{N} ( 0, 1 ) \\ y & \sim f(\eta) = \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix} \cdot \eta + \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} 0.2 & 0 \\ 0 & 0.2 \end{bmatrix} \Bigg) \end{aligned} \]
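
A minimal NumPy sketch of these three steps for the example model above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 1: sample the latent representation eta from its prior
eta = rng.normal(0.0, 1.0)

# Step 2: decode eta into a conditional distribution for y
mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
cov  = np.array([[0.2, 0.0], [0.0, 0.2]])

# Step 3: sample y from the conditional distribution
y = rng.multivariate_normal(mean, cov)
print(eta, y)
```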

Fitting the generative model

  • Goal: find the best values for the parameters of \(f\) and the prior distribution of \(\eta\) by some definition of best.
  • One common fitting objective is maximizing the log marginal likelihood (aka evidence):

\[ \sum_{i=1}^N \log \int p(y_i \mid \eta) \cdot p(\eta) \, d\eta \]

  • \(p(y_i \mid \eta)\) is the probability of \(y = y_i\) according to the distribution \(f(\eta)\).
  • \(p(\eta)\) is the probability of \(\eta\) according to its prior distribution.
  • \(N\) is the number of points/images/subjects in the training data.
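
The integral generally has no closed form; one simple (if inefficient) way to approximate it is Monte Carlo sampling from the prior. The sketch below does this for a single data point of the running example model, with a made-up observation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

def log_marginal_likelihood(y_i, n_samples=2_000):
    """Monte Carlo estimate of log ∫ p(y_i | eta) p(eta) d(eta) for the example model."""
    etas = rng.normal(0.0, 1.0, size=n_samples)          # eta ~ p(eta) = N(0, 1)
    cov  = np.array([[0.2, 0.0], [0.0, 0.2]])
    log_p = np.array([
        multivariate_normal.logpdf(y_i, mean=np.array([1.0, 1.0]) * e + 2.0, cov=cov)
        for e in etas
    ])
    # log of the average of p(y_i | eta) over prior samples (log-sum-exp for stability)
    return np.log(np.mean(np.exp(log_p - log_p.max()))) + log_p.max()

print(log_marginal_likelihood(np.array([2.5, 2.4])))     # made-up observation
```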

Inverse problem

  • Instead of generating a synthetic value \(y\) from a random value \(\eta\), we can use the generative model in reverse.
  • Given an actual observed value \(y\), we can infer the latent variable \(\eta^*\) that best represents this data point \(y\).
  • This is done by solving the following optimization problem for some dissimilarity function \(D\).
  • \(D\) should quantify the dissimilarity between the distribution \(f(\eta)\) and the observation \(y\).

\[ \eta^* = g(y) = \text{argmin}_{\eta} D(f(\eta), y) \]

Inverse problem

  • In mixed effects modelling, a common choice for the dissimilarity function \(D(f(\eta), y)\) is:

\[ -\log p(y \mid \eta) - \log p(\eta) \]

  • \(-\log p(y \mid \eta)\) quantifies the dissimilarity between \(f(\eta)\) and \(y\).
  • \(-\log p(\eta)\) acts as regularization, which reduces over-fitting when \(y\) has a large noise component or is an outlier.
  • Over-fitting in this context is often measured using \(\epsilon\)-shrinkage in the mixed effects literature.
  • When using this particular definition of \(D\), \(\eta^*\) is called the empirical Bayes estimate.
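
A minimal sketch of computing \(\eta^*\) for the running example model by minimizing \(-\log p(y \mid \eta) - \log p(\eta)\) with scipy, using a made-up observation:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm

def neg_log_joint(eta, y):
    """D(f(eta), y) = -log p(y | eta) - log p(eta) for the running example model."""
    mean = np.array([1.0, 1.0]) * eta + np.array([2.0, 2.0])
    cov  = np.array([[0.2, 0.0], [0.0, 0.2]])
    return -multivariate_normal.logpdf(y, mean=mean, cov=cov) - norm.logpdf(eta, 0.0, 1.0)

y_obs = np.array([3.1, 2.9])                       # made-up observed data point
res = minimize_scalar(lambda e: neg_log_joint(e, y_obs))
eta_star = res.x                                   # empirical Bayes estimate of eta
print(eta_star)
```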

Inverse problem

  • More generally, in machine learning, \(\eta^*\) is called an embedding.
  • In this optimization, the parameters of \(f\) and the prior distribution are kept fixed.
  • Since the dimension of \(\eta\) is often less than that of \(y\), the inverse problem can be used for dimension reduction (aka factor analysis) or feature extraction.
  • The inverse problem can also be used to generate a fixed-length encoding of variable-length data.
  • If \(\eta\) is discrete, the inverse problem can be used for clustering.

Inverse problem

  • Solving the inverse problem is also known as encoding the data point \(y\).
  • One can train a surrogate neural network to approximate \(\eta^* = g(y)\) given any \(y\) without doing the optimization every time.
    • Useful for big data.
  • This surrogate model is usually called an encoder.
  • The optimization-based function itself \(g(y)\) can also be considered an exact encoder.

Inverse problem

  • Surrogates can also learn an approximation of the distribution of \(\eta\) values that could have plausibly generated \(y\).
  • This distribution is known as the posterior distribution \(p(\eta \mid y)\).
  • A surrogate model that approximates \(p(\eta \mid y)\) using variational inference is called a probabilistic encoder or a variational encoder.
  • Learning a surrogate model for the inverse problem to avoid solving it for every \(y\) is known as amortized learning.
    • This is usually done simultaneously while training the decoder/generator.

Conditional generative models

Covariates

(Baseline) covariates

\[ \begin{aligned} \eta & \sim p(.) \\ y & \sim f(\eta, x_b) \end{aligned} \]

  • \(x_b\) is the set of baseline covariates.
  • \(x_b\) can be an image/scan or its embedding!

Covariates

Time as a covariate

\[ \begin{aligned} \eta & \sim p(.) \\ y(t) & \sim f(\eta, x_b, t) \end{aligned} \]

  • \(y\) is observed at a few finite time points but can be simulated for any time point or interval.

Covariates

Time-varying covariates

\[ \begin{aligned} \eta & \sim p(.) \\ y(t) & \sim f(\eta, x_b, t, x_v(t)) \end{aligned} \]

  • \(x_v(t)\) interpolates/extrapolates the finite covariate measurements to any given model time \(t\).
  • \(x_v(t)\) could be a time-varying image/scan or its embedding!
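
A minimal sketch of constructing \(x_v(t)\) by linear interpolation of a covariate measured at a few made-up visit times (body weight here is only a placeholder):

```python
import numpy as np

# Made-up covariate measurements at a few visit times
t_obs = np.array([0.0, 24.0, 48.0, 96.0])     # hours
x_obs = np.array([70.0, 68.5, 69.0, 71.2])    # e.g. body weight in kg

def x_v(t):
    """Linearly interpolate (and flatly extrapolate) the covariate to any model time t."""
    return np.interp(t, t_obs, x_obs)

print(x_v(36.0))    # interpolated value between the 24 h and 48 h measurements
print(x_v(120.0))   # extrapolated value (held at the last measurement by np.interp)
```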

One-compartment PK model as a generative model

Latent variables

\[ \begin{aligned} \eta_a & \sim \mathcal{N}(0, \omega_{\eta_a}^2) \\ \eta_e & \sim \mathcal{N}(0, \omega_{\eta_e}^2) \\ \eta_v & \sim \mathcal{N}(0, \omega_{\eta_v}^2) \end{aligned} \]

  • Population-level parameters: \(\omega_{\eta_a}^2\), \(\omega_{\eta_e}^2\), \(\omega_{\eta_v}^2\).

One-compartment PK model as a generative model

Decoder

\[ \begin{aligned} k_{a,\eta} & = k_a \cdot e^{\eta_a} \\ k_{e,\eta} & = k_e \cdot e^{\eta_e} \\ V_{d,\eta} & = V_d \cdot e^{\eta_v} \\ C(t) & = \frac{D \cdot k_{a,\eta}}{V_{d,\eta} \cdot (k_{a,\eta} - k_{e,\eta})} \left( e^{-k_{e,\eta} \cdot t} - e^{-k_{a,\eta} \cdot t} \right) \\ y(t) & \sim \mathcal{N}(C(t), \sigma^2) \end{aligned} \]

One-compartment PK model as a generative model

Decoder

  • \(C(t)\): drug concentration at time \(t\).
  • \(D\): dose administered.
  • \(k_{a,\eta}\), \(k_{e,\eta}\), \(V_{d,\eta}\): individual absorption rate, elimination rate, and volume of distribution, including the random effects \(\eta\).
  • \(y(t)\): observed response (concentration).
  • \(\sigma^2\): variance of Gaussian residual error, same for all the time points.
  • Population-level parameters: \(k_a\), \(k_e\), \(V_d\), \(\sigma^2\).
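
A minimal NumPy sketch that samples one subject from this generative model; the population-level parameter values, the dose and the sampling times are all made up:

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up population-level parameters
ka, ke, Vd = 1.0, 0.2, 10.0            # typical absorption rate, elimination rate, volume
om_a, om_e, om_v = 0.3, 0.25, 0.2      # omegas (standard deviations of the etas)
sigma = 0.1                            # residual error standard deviation
D = 100.0                              # dose

# Latent variables (random effects) for one subject
eta_a = rng.normal(0.0, om_a)
eta_e = rng.normal(0.0, om_e)
eta_v = rng.normal(0.0, om_v)

# Decoder: individual parameters and concentration-time profile
ka_i, ke_i, Vd_i = ka * np.exp(eta_a), ke * np.exp(eta_e), Vd * np.exp(eta_v)
t = np.linspace(0.5, 24.0, 48)
C = D * ka_i / (Vd_i * (ka_i - ke_i)) * (np.exp(-ke_i * t) - np.exp(-ka_i * t))

# Observation model: Gaussian residual error around C(t)
y = rng.normal(C, sigma)
print(y[:5])
```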

Visual predictive checks

  • Two univariate distributions are identical if all their respective quantiles are equal to one another.
  • The visual predictive check (VPC) plot commonly checks 3 quantile levels, \([0.1, 0.5, 0.9]\).

Visual predictive checks

  • For each quantile level, if the observed data’s quantile as a function of time (or any other covariate) differs from the generated data’s quantile (i.e. lies outside the simulated quantile interval), the two distributions are not similar.
  • A bad VPC is more informative than a good one.
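
A minimal sketch of the quantile comparison underlying a VPC, using placeholder observed and simulated data: compute the \([0.1, 0.5, 0.9]\) quantiles of the observed data per time bin and check whether they lie inside the corresponding simulated quantile intervals.

```python
import numpy as np

rng = np.random.default_rng(5)
levels = [0.1, 0.5, 0.9]

# Placeholder data: 100 observed subjects and 200 simulated replicates, each with 6 time bins
observed  = rng.lognormal(mean=1.0, sigma=0.3, size=(100, 6))
simulated = rng.lognormal(mean=1.0, sigma=0.3, size=(200, 100, 6))

obs_q = np.quantile(observed, levels, axis=0)        # (3, 6): observed quantiles per time bin
sim_q = np.quantile(simulated, levels, axis=1)       # (3, 200, 6): same quantiles per replicate
lo, hi = np.quantile(sim_q, [0.05, 0.95], axis=1)    # 90% simulated interval for each quantile

outside = (obs_q < lo) | (obs_q > hi)   # True where an observed quantile leaves the simulated interval
print(outside)
```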

Deep learning fundamentals

Neural networks

Training

  • Find network parameters \(W\) that minimize a suitable cost/loss function \[ C(W; x, y) = \sum_{i=1}^N L(y_i, \hat{y}(x_i; W)) \]
  • One common choice of \(L(y_i, \hat{y}(x_i; W))\) is: \[ L(y_i, \hat{y}(x_i; W)) = (y_i - \hat{y}(x_i; W))^2 \]
  • Iteratively refine the parameters by stochastic gradient descent \[ W^{\text{new}} = W^{\text{old}} - \alpha \cdot \nabla_W C(W; x, y) \]
  • Gradient computed by backpropagation (Rumelhart et al., 1986)
  • \(\alpha\) is known as the learning rate
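
A minimal NumPy sketch of this loop for the simplest possible "network", a linear model with a squared-error loss; the gradients are written out by hand instead of using backpropagation, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy data: y = 3x + 1 plus noise
x = rng.normal(size=(256, 1))
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=(256, 1))

W, b  = np.zeros((1, 1)), np.zeros(1)   # parameters
alpha = 0.1                             # learning rate
batch = 32

for step in range(200):
    idx = rng.choice(len(x), size=batch, replace=False)   # stochastic mini-batch
    xb, yb = x[idx], y[idx]
    err = xb @ W + b - yb
    # Gradients of the mini-batch squared-error cost w.r.t. W and b
    grad_W = 2.0 * xb.T @ err / batch
    grad_b = 2.0 * err.mean(axis=0)
    W -= alpha * grad_W                 # gradient-descent update
    b -= alpha * grad_b

print(W, b)   # should end up close to 3 and 1
```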

Negative log likelihood as a loss function

The negative log likelihood generalizes the sum of squared errors loss function.

Normal distribution

\[ \begin{aligned} y_i & \sim \mathcal{N}(\hat{y}(x_i; W), \sigma^2) \\ -\log P(y_i) & = \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{(y_i - \hat{y}(x_i; W))^2}{2 \cdot \sigma^2} \end{aligned} \]

Setting \(\sigma^2 = \frac{1}{2}\): \[ \begin{aligned} -\log P(y) & = -\sum_{i=1}^N \log P(y_i) \\ & = \underbrace{-\frac{N}{2} \cdot \log (2) + \frac{N}{2} \cdot \log (2 \pi)}_{\text{constant}} + \underbrace{\sum_{i=1}^{N} (y_i - \hat{y}(x_i; W))^2}_{C(W; x, y)} \end{aligned} \]

Control on withheld data

Figure by Hastie et al., 2008

Feed-forward neural networks

  • The basic interface is largely determined by the task
    • Number of input units
    • Number of output units
    • Activation function for the output layer
  • The architecture must be selected carefully
    • Number of hidden layers
    • Number of units per hidden layer
    • Activation function for the hidden layers
    • Parameters of the optimization algorithm

Activation functions

  • At the output layer we typically use
    • the identity function for regression tasks,
    • the sigmoid function for binary classification tasks,
    • the softmax function for multi-class classification tasks, and
    • the softplus function to ensure non-negativity.
  • At the hidden layers,
    • traditional activations are the sigmoid and tanh functions,
    • but deeper networks will suffer from vanishing gradients.
    • In this case, experiment with the ReLU and SELU functions.
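
A minimal NumPy sketch of the activation functions listed above (the SELU constants are the published defaults):

```python
import numpy as np

def identity(z):  return z                              # regression outputs
def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))       # binary classification outputs
def softmax(z):                                         # multi-class classification outputs
    e = np.exp(z - np.max(z))
    return e / e.sum()
def softplus(z):  return np.log1p(np.exp(z))            # non-negative outputs
def relu(z):      return np.maximum(0.0, z)             # common hidden-layer activation
def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-2.0, 0.0, 2.0])
for fn in (identity, sigmoid, softplus, relu, selu):
    print(fn.__name__, fn(z))
print("softmax", softmax(z))
```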

Architecture selection

  • Typically based on empirical investigation and assessment on withheld data
  • Initial exploration to get a feel for reasonable network dimensions
  • Hyperparameter search
    • Design a grid of hyperparameters based on the previous exploration
    • In case of a tight budget, pay attention to the learning rate
    • Exhaustive (grid), random, or Bayesian search

Bergstra and Bengio “Random search for hyper-parameter optimization.” (JMLR, 2012)

Under-fitting and over-fitting

Figure by Hastie et al., 2008

Strategies to prevent over-fitting

  • Early stopping based on withheld data
  • Careful data splitting to avoid data leakage
  • Networks as large as necessary, but no larger
  • Larger and more diverse training datasets
  • Regularization
    • Penalization of the cost function
    • Dropout (Srivastava et al., 2014)
    • Data augmentation

Penalization of the cost function

  • Cost function

\[ \text{arg min}_W \, C(W; x, y) \]

  • Penalized cost function reduces expressiveness

\[ \text{arg min}_W \, C(W; x, y) + \lambda \cdot \left\lVert W \right\rVert_p^p \]

  • \(\left\lVert W \right\rVert_p\) is the \(L_p\) norm of the parameters \(W\).
  • \(\lambda\) is the regularization strength hyperparameter.
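
A minimal sketch of evaluating a penalized cost, using a squared-error cost for a linear model as a stand-in for \(C(W; x, y)\) and made-up data and \(\lambda\):

```python
import numpy as np

def cost(W, x, y):
    """Un-penalized squared-error cost for a linear model (stand-in for C(W; x, y))."""
    return np.sum((y - x @ W) ** 2)

def penalized_cost(W, x, y, lam, p=2):
    """C(W; x, y) + lambda * ||W||_p^p for p in {1, 2}."""
    penalty = np.sum(np.abs(W) ** p)
    return cost(W, x, y) + lam * penalty

rng = np.random.default_rng(7)
x = rng.normal(size=(50, 3))
W_true = np.array([1.0, 0.0, -2.0])
y = x @ W_true + 0.1 * rng.normal(size=50)

W = rng.normal(size=3)
print(penalized_cost(W, x, y, lam=0.1, p=2))   # ridge-style (L2) penalty
print(penalized_cost(W, x, y, lam=0.1, p=1))   # LASSO-style (L1) penalty
```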

Regularization types

  • \(L_2\) regularization
    • Does not promote sparsity of \(W\)
    • Equivalent to the regularization used in ridge regression. \[ \left\lVert W \right\rVert_2^2 = \sum_{i=1}^{n_w} W_i^2 \]
  • \(L_1\) regularization
    • More likely to give sparse weights \(W\)
    • Equivalent to the regularization used in LASSO regression. \[ \left\lVert W \right\rVert_1 = \sum_{i=1}^{n_w} |W_i| \]

Regularization using prior distributions

  • Prior distributions generalize the traditional \(L_1\) and \(L_2\) regularization.
  • Maximizing the log likelihood + log prior is called maximum a-posteriori (MAP) estimation.

Normal prior is \(L_2\) regularization

\[ \begin{aligned} W & \sim \mathcal{N}(0, \sigma^2 \cdot I) \\ -\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) + \frac{W_i^2}{2 \cdot \sigma^2} \Bigg) \\ & = \underbrace{n_w \cdot \Bigg( \log (\sigma) + \frac{1}{2} \cdot \log (2 \pi) \Bigg)}_{\text{constant}} + \underbrace{\frac{1}{2 \cdot \sigma^2}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} W_i^2}_{\left\lVert W \right\rVert_2^2} \\ & = \lambda \cdot \left\lVert W \right\rVert_2^2 + \text{constant} \end{aligned} \]

Regularization using prior distributions

Laplace prior is \(L_1\) regularization

\[ \begin{aligned} W_i & \sim \text{Laplace}(0, b) \\ -\log P(W) & = \sum_{i=1}^{n_w} \Bigg( \log (2 \cdot b) + \frac{|W_i|}{b} \Bigg) \\ & = \underbrace{n_w \cdot \log (2 \cdot b)}_{\text{constant}} + \underbrace{\frac{1}{b}}_{\lambda} \cdot \underbrace{\sum_{i=1}^{n_w} |W_i|}_{\left\lVert W \right\rVert_1} \\ & = \lambda \cdot \left\lVert W \right\rVert_1 + \text{constant} \end{aligned} \]