Mixed effects models and the marginal likelihood

Authors:
Mohamed Tarek

What are mixed effects models?

  • Population parameters \(\theta\)
    • Model parameters modelled as deterministic quantities.
    • A subset of these are commonly referred to as fixed effects.
  • Random effects \(\eta\)
    • Subject-specific parameters.
    • Modelled as random variables.
  • Covariates \(x\)
  • Responses \(y\)
[Figure: Hierarchical NLME model with 3 subjects]

Likelihood

Conditional likelihood

The probability (density) of the response \(y\) under the model, given specific values of \(\theta\), \(\eta\), and \(x\):

\[ p_c(y \mid \theta, \eta, x) \]

Can we fit the model by simply finding the values of \(\theta\) and \(\eta\) that maximize the conditional likelihood?

This can horribly overfit the data: every subject gets its own free \(\eta_i\), so there are far too many parameters!
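
To make the overfitting risk concrete, here is a minimal toy sketch in plain Julia (not Pumas/DeepPumas code; the random-intercept model, the data, and all parameter values are made up purely for illustration):

```julia
# Toy random-intercept model: y[i][j] ~ Normal(θ.μ + η[i], θ.σ) for subject i.
using Distributions

# Conditional log-likelihood: treats both θ and every subject-specific η[i] as known.
function conditional_loglik(θ, η, y)
    sum(logpdf(Normal(θ.μ + η[i], θ.σ), y[i][j])
        for i in eachindex(y) for j in eachindex(y[i]))
end

# Three subjects with a few observations each (made-up numbers).
y = [[1.1, 0.9], [2.3, 2.1, 2.4], [0.2]]
θ = (μ = 1.0, σ = 0.5)
η = [0.0, 1.2, -0.8]
conditional_loglik(θ, η, y)
# Maximizing this jointly over θ and all η[i] adds one free parameter per subject,
# so each η[i] can chase the noise in its own subject's data.
```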

Marginal likelihood

The marginal likelihood integrates the random effects out of the joint probability:

\[ p_m (y \mid \theta, x) = \int \overbrace{p_c (y \mid \theta, \eta, x) \cdot p_{\text{prior}} (\eta \mid \theta)}^{\text{joint probability}} \, d\eta \]

It is the average of the conditional likelihood over \(\eta\), weighted by the prior \(p_{\text{prior}}(\eta \mid \theta)\).
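
The integral can be approximated by brute-force Monte Carlo: draw \(\eta\) from the prior and average the conditional likelihood. A hedged sketch for the same toy random-intercept model (plain Julia, not Pumas code; Pumas/DeepPumas instead uses the FO, FOCE, or Laplace approximations listed later):

```julia
using Distributions, Statistics

# Toy random-intercept model: y[i][j] ~ Normal(θ.μ + η[i], θ.σ), with η[i] ~ Normal(0, θ.ω).
conditional_lik(θ, ηᵢ, yᵢ) = prod(pdf(Normal(θ.μ + ηᵢ, θ.σ), obs) for obs in yᵢ)

# Monte Carlo estimate of one subject's marginal likelihood:
# p(yᵢ | θ) ≈ average of the conditional likelihood over draws from the prior on η.
function marginal_lik(θ, yᵢ; ndraws = 10_000)
    ηs = rand(Normal(0.0, θ.ω), ndraws)
    mean(conditional_lik(θ, η, yᵢ) for η in ηs)
end

y = [[1.1, 0.9], [2.3, 2.1, 2.4], [0.2]]
θ = (μ = 1.0, σ = 0.5, ω = 1.0)
prod(marginal_lik(θ, yᵢ) for yᵢ in y)   # marginal likelihood of the whole population
```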

Comparing models

Plotting the joint probability \(p_c(y \mid \theta, \eta, x) \cdot p_{\text{prior}}(\eta \mid \theta)\) against \(\eta\) for each candidate model: which model has the largest area under the curve? That area is exactly the marginal likelihood.

A good model should have:

  1. A high peak: high joint probability with the data, and
  2. A wide base: low sensitivity to the value of the random effect.

Intuitively, even if a model is able to fit the data very well, it is not a good model if it is very sensitive to the value of the random effect.

Over-fitted models will generally have a very high peak, but a very narrow base.
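
This intuition can be made precise with the Laplace approximation (roughly the idea behind the LaplaceI() algorithm discussed below). Let \(g(\eta) = \log\left[ p_c(y \mid \theta, \eta, x) \, p_{\text{prior}}(\eta \mid \theta) \right]\) be the log joint probability, \(\eta^*\) its maximizer, \(q\) the dimension of \(\eta\), and \(H = -\nabla^2_\eta \, g(\eta^*)\) the curvature at the peak. Then

\[ \log p_m(y \mid \theta, x) \approx \underbrace{g(\eta^*)}_{\text{height of the peak}} + \underbrace{\frac{q}{2} \log(2\pi) - \frac{1}{2} \log \lvert H \rvert}_{\text{width of the base}} \]

A high peak increases the first term; a wide base (small curvature \(H\)) increases the second.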

Decomposing the marginal likelihood

Consider a single subject \(i\). The marginal likelihood is:

\[ p(y_i \mid \theta) = \int p(y_i \mid \eta_i, \theta) \cdot p(\eta_i \mid \theta) \, d\eta_i \]

However, by the chain rule of probability, we can also write the marginal likelihood as a product of one-step-ahead predictive densities:

\[ p(y_i \mid \theta) = \prod_{j=1}^{m_i} p(y_{i,j} \mid y_{i,1:j-1}, \theta) \]

where

  • \(y_{i,1:j}\) are the observations of subject \(i\) up to and including time point \(t_j\), where \(j\) runs from 1 to \(m_i\), and
  • \(m_i\) is the number of longitudinal observations for subject \(i\).
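
For example, for a subject with \(m_i = 3\) observations the decomposition reads

\[ p(y_i \mid \theta) = p(y_{i,1} \mid \theta) \cdot p(y_{i,2} \mid y_{i,1}, \theta) \cdot p(y_{i,3} \mid y_{i,1:2}, \theta) \]

so each factor is the probability of the next observation given everything observed from that subject so far.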

No past observations, \(j = 1\)

\[ p(y_{i,1} \mid y_{i,1:0}, \theta) = p(y_{i,1} \mid \theta) = \int p(y_{i,1} \mid \eta_i, \theta) \cdot p(\eta_i \mid \theta) \, d\eta_i \]

With past observations, \(j > 1\)

\[ p(y_{i,j} \mid y_{i,1:j-1}, \theta) = \int p(y_{i,j} \mid \eta_i, \theta) \cdot p(\eta_i \mid y_{i,1:j-1}, \theta) \, d\eta_i \]

where \(p(\eta_i \mid y_{i,1:j-1}, \theta)\) is the posterior distribution of \(\eta_i\) given partial data \(y_{i,1:j-1}\) and \(\theta\).
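
A hedged numerical sketch of both cases, continuing the toy random-intercept model from earlier (plain Julia, not Pumas code; importance sampling with the prior as proposal is used purely for illustration and is not how Pumas evaluates these integrals):

```julia
using Distributions

# Toy random-intercept model: y ~ Normal(θ.μ + η, θ.σ) with η ~ Normal(0, θ.ω).
obs_lik(θ, η, obs) = pdf(Normal(θ.μ + η, θ.σ), obs)

# One-step-ahead predictive density p(y_next | y_past, θ):
# draw η from the prior, weight each draw by how well it explains y_past
# (an importance-sampling stand-in for the posterior p(η | y_past, θ)),
# then take the weighted average of p(y_next | η, θ).
function one_step_ahead(θ, y_past, y_next; ndraws = 10_000)
    ηs = rand(Normal(0.0, θ.ω), ndraws)
    w  = [prod((obs_lik(θ, η, o) for o in y_past); init = 1.0) for η in ηs]
    sum(w .* [obs_lik(θ, η, y_next) for η in ηs]) / sum(w)
end

θ = (μ = 1.0, σ = 0.5, ω = 1.0)
one_step_ahead(θ, Float64[], 1.1)     # j = 1: no past data, all weights equal 1
one_step_ahead(θ, [2.3, 2.1], 2.4)    # j = 3: the posterior over η does the work
```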

Fitting algorithms in Pumas/DeepPumas

|  | Marginal likelihood | Conditional likelihood |
| --- | --- | --- |
| Prior / regularization on \(\theta\) | MAP(FO()), MAP(FOCE()), MAP(LaplaceI()) | JointMAP() |
| No prior / regularization on \(\theta\) | FO(), FOCE(), LaplaceI() | JointMAP() without priors in the @param block |

FO(), FOCE(), LaplaceI() and SAEM()

\[ \begin{aligned} \theta^* & = \text{arg max}_{\theta} \prod_{i=1}^N p(y_i \mid \theta, x_i) \\ \eta_i^* & = \text{arg max}_{\eta_i} p(y_i \mid \theta = \theta^*, \eta_i, x_i) \cdot p(\eta_i \mid \theta = \theta^*) \end{aligned} \]

MAP(FO()), MAP(FOCE()) and MAP(LaplaceI())

\[ \begin{aligned} \theta^* & = \text{arg max}_{\theta} p(\theta) \cdot \prod_{i=1}^N p(y_i \mid \theta, x_i) \\ \eta_i^* & = \text{arg max}_{\eta_i} p(y_i \mid \theta = \theta^*, \eta_i, x_i) \cdot p(\eta_i \mid \theta = \theta^*) \end{aligned} \]
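
A hedged sketch of this two-stage logic on the toy random-intercept model from earlier (plain Julia; coarse grid searches and a Monte Carlo integral stand in for the optimizers and the marginal-likelihood approximations that Pumas/DeepPumas actually uses, and \(\sigma\) is held fixed just to keep the grid small):

```julia
using Distributions, Statistics

# Toy random-intercept model: y[i][j] ~ Normal(θ.μ + η[i], θ.σ), η[i] ~ Normal(0, θ.ω).
cond_lik(θ, η, yᵢ) = prod(pdf(Normal(θ.μ + η, θ.σ), o) for o in yᵢ)

# Stage 1 objective: noisy Monte Carlo estimate of the marginal likelihood ∏ᵢ p(yᵢ | θ).
function marginal_lik(θ, y; ndraws = 5_000)
    ηs = rand(Normal(0.0, θ.ω), ndraws)
    prod(mean(cond_lik(θ, η, yᵢ) for η in ηs) for yᵢ in y)
end

y = [[1.1, 0.9], [2.3, 2.1, 2.4], [0.2]]

# Stage 1: θ* maximizes the marginal likelihood (coarse grid search, σ held fixed).
grid = [(μ = μ, σ = 0.5, ω = ω) for μ in 0.0:0.1:2.0, ω in 0.2:0.2:2.0]
θstar = argmax(θ -> marginal_lik(θ, y), vec(grid))

# Stage 2: empirical Bayes estimates ηᵢ* maximize the joint p(yᵢ | θ*, ηᵢ) · p(ηᵢ | θ*).
ηgrid = -3.0:0.01:3.0
ηstar = [argmax(η -> cond_lik(θstar, η, yᵢ) * pdf(Normal(0.0, θstar.ω), η), ηgrid)
         for yᵢ in y]
```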

JointMAP()

\[ \begin{aligned} \theta^*, \eta^* & = \text{arg max}_{\theta, \eta} p(\theta, \eta \mid x, y) \\ & = \text{arg max}_{\theta, \eta} p(\theta) \cdot \prod_{i=1}^N p(\eta_i \mid \theta) \cdot p(y_i \mid \theta, \eta_i, x_i) \end{aligned} \]
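
For comparison, a hedged sketch of the JointMAP()-style objective for the same toy model (plain Julia, not Pumas code; the priors on \(\mu\) and \(\omega\) are arbitrary illustrative choices standing in for whatever is declared in the @param block):

```julia
using Distributions

# Toy random-intercept model: y[i][j] ~ Normal(θ.μ + η[i], θ.σ), η[i] ~ Normal(0, θ.ω).
# Joint log-posterior: log p(θ) + Σᵢ [ log p(ηᵢ | θ) + log p(yᵢ | θ, ηᵢ) ].
function joint_logposterior(θ, η, y)
    lp  = logpdf(Normal(0.0, 10.0), θ.μ)       # assumed prior on μ (illustrative only)
    lp += logpdf(LogNormal(0.0, 1.0), θ.ω)     # assumed prior on ω (illustrative only)
    for (ηᵢ, yᵢ) in zip(η, y)
        lp += logpdf(Normal(0.0, θ.ω), ηᵢ)     # random-effect prior p(ηᵢ | θ)
        lp += sum(logpdf(Normal(θ.μ + ηᵢ, θ.σ), o) for o in yᵢ)
    end
    return lp
end

# JointMAP-style fitting maximizes this over θ and all ηᵢ simultaneously;
# dropping the first two terms gives the "no priors in the @param block" variant.
y = [[1.1, 0.9], [2.3, 2.1, 2.4], [0.2]]
joint_logposterior((μ = 1.0, σ = 0.5, ω = 1.0), [0.0, 1.0, -0.7], y)
```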

BayesMCMC()

  • Samples from the joint posterior \(p(\theta, \eta \mid x, y)\).
  • Uses a Markov Chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution.