Random Effects in Bayesian Models

Authors

Jose Storopoli

Mohamed Tarek

using Pumas

This tutorial will present alternate parameterizations in Pumas Bayesian models and the motivation behind them.

We’ll start with reviewing the main limitations of state-of-the-art Markov chain Monte Carlo (MCMC) samplers, when these issues arise, and how to circumvent them.

1 Brief overview of No-U-Turn-Sampler (NUTS)

Markov chain Monte Carlo (MCMC) samplers have the ability of approximating any target posterior density using approximations with discrete sampling in an iterative manner. The target posterior density is explored by proposing new parameters values using an acception/rejection rule.

Modern MCMC samplers, such as Hamiltoninan Monte Carlo (HMC), use the posterior gradient to guide the proposals to areas of high probability, making the posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on two hyperparameters:

  • \(L\): the number of steps to use in guiding the proposal towards areas of high probability
  • \(\epsilon\): the step-size of each step taken into guiding the proposal towards areas of high probability

No-U-Turn-Sampler (NUTS) (Hoffman & Gelman, 2011) is a MCMC Sampler that uses HMC and can automatically tune \(\epsilon\). This tuning happens during the adaptation phase of the sampler (also called “warm-up”), and, once it is done, \(\epsilon\) is fixed for the sampling phase while \(L\) is dynamically chosen in each iteration. The sampling phase is the phase that originates the samples used for the MCMC target posterior density approximation.

For most models NUTS perform as expected and converges fast to the target posterior density. Nevertheless, there is a common posterior density topology that NUTS (and almost all of MCMC samplers) struggle to explore, sample, and approximate.

2 The Devil’s Funnel

The Devil’s funnel, also known as Neal (2003)’s funnel, arises often in hierarchical models (also known as mixed models), especially when there are few data points.

The funnel occurs when the variance of a variable depends on another variable in an exponential scale. A canonical example is:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \eta &\sim \text{Normal}\left( 0, e^{\text{log}\omega} \right) \\ y &\sim \text{Normal}\left( \eta, 0.1 \right) \end{aligned}\]

Let’s visualize the Devil’s funnel by looking at the joint probability of the prior over \(\text{log}\omega\) and \(\eta\):

using CairoMakie
funnel(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, exp(y / 2)), x))
heatmap(-5:0.1:5, -7:0.01:2, funnel)

Here we can see that any HMC sampler will fail to explore the posterior density with a fixed \(\epsilon\). This happens because in the top of the funnel we need high values for \(\epsilon\); whereas in the bottom of the funnel the opposite happens: we need low values for \(\epsilon\). The NUTS sampler can do better than most samplers since it adapts \(L\) in every iteration. Given a fixed small \(\epsilon\), the NUTS sampler uses a very large \(L\) in the top part of the funnel and a small \(L\) at the bottom. Large values of \(L\) cause a severe slowdown in the sampling.

3 Reparameterization

The parameterization for the funnel is called Centered Parameterization (CP). The naming is a little ambiguous, but the idea behind CP is a parameterization where one or more parameters depend on another parameter. This happens with \(\eta\) since its prior distribution’s scale depends on the exponential of \(\text{log}\omega\). Standard deviation and variance parameters are always sampled in log-scale so it doesn’t matter if you use \(\text{log}\omega\) or \(\omega\) in the model. When there are not enough data points per subject, the posterior density is mostly determined by the prior distributions.

One way to circumvent this is to use an alternate parameterization called Non-Centered Parameterization (NCP). NCP is a parameterization where no parameter’s prior depends on another parameter. Here we sample parameters independently and transform them after sampling:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \tilde{\eta} &\sim \text{Normal}(0, 1) \\ \eta &= \tilde{\eta} \cdot e^{\text{log}\omega} \\ y &\sim \text{Normal}(\eta, 0.1) \end{aligned}\]

The priors of \(\text{log}\omega\) and \(\tilde{\eta}\) are independent; and we reconstruct \(\eta\) from \(\tilde{\eta}\) by multiplying it by \(e^{\text{log}\omega}\). Since \(\eta\) and \(\tilde{\eta}\) are Gaussian random variables, we can scale and shift them to derive one from another without changing the data generating process.

While both models are equivalent when simulating and generating synthetic data, the second model’s prior is more regular:

ncp(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, 1), x))
heatmap(-5:0.1:5, -7:0.01:2, ncp)

When there are few data points, the posterior will therefore be much easier to sample from using the NCP formulation than the CP formulation, even when using a fixed a fixed \(L\) and \(\epsilon\).

4 CP versus NCP in Bayesian Pumas Models

pk_cp = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvvc ~ LogNormal(log(70), 0.25)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(2, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
            lower = zeros(2),
            upper = fill(Inf, 2),
            init = ones(2),
        )
    end

    @random begin
        η ~ MvNormal.* C .* ω')
    end

    @pre begin
        # PK parameters
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
    end

    @dynamics begin
        Central' = -CL / Vc * Central
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), cp * σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, σ, C, ω
  Random effects: η
  Covariates: 
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv
pk_ncp = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvvc ~ LogNormal(log(70), 0.25)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(2, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
            lower = zeros(2),
            upper = fill(Inf, 2),
            init = ones(2),
        )
    end

    @random begin
        ηstd ~ MvNormal(I(2))
    end

    @pre begin
        # compute the η from the ηstd
        # using lower Cholesky triangular matrix
        η = ω .* (getchol(C).L * ηstd)

        # PK parameters
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
    end

    @dynamics begin
        Central' = -CL / Vc * Central
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), cp * σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, σ, C, ω
  Random effects: ηstd
  Covariates: 
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv

4.1 CP

Let’s go over what is going on in the pk_cp model.

First, in the @param block we have new parameters C and ω. C is a correlation matrix, which we give LKJ priors already decomposed into Cholesky factors: LKJCholesky. This is useful for performance. Additionally, ω is a vector of standard deviations of our random effects, which we give a truncated MvNormal prior with zero mean, a diagonal covariance matrix, and a lower bound of 0 on the variables. The standard deviations of the MvNormal prior distribution before truncation is 0.4 (we square those because the elements of the diagonal covariance matrix are variances).

Since in the Bayesian approach we need to be careful about how to specify priors, we generally find it more intuitive to specify correlations and standard deviations separately. Specifying full covariance matrices is far more convoluted, error-prone, and non-intuitive. However, you can still work with the covariance matrix Ω directly and give it an InverseWishart or Wishart prior. Call ?InverseWishart or ?Wishart in the Julia REPL for more information on these prior distributions.

Now let’s move to the @random block. Here we are specifying our ηs as having an MvNormal prior with mean zero. The covariance matrix is computed by multiplying every element of the correlation matrix with the product of the corresponding standard deviations: The covariance matrix \(\boldsymbol{\Omega}\) can be decomposed into

\[\boldsymbol{\Omega}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{C}[i,j] \cdot \boldsymbol{\omega}[j]\]

where

  • \(\mathbf{C}\) is the correlation matrix
  • \(\boldsymbol{\omega}\) is the vector of standard deviations

As we can see this is a CP model because the ηs depends on both C and ω and ω is sampled in log-scale because it is positive.

4.2 NCP

What is different in the pk_ncp model is that, instead of having the distribution of ηs being dependent on both C and ω, we have the ηstd in the @random block being independent from any other parameters. ηstd is given an MvNormal prior with mean zero and covariance matrix as the identity matrix.

We reconstruct our \(\eta\)s back in the @pre block by using the lower triangular matrix \(\mathbf{L}\) from the Cholesky factorization of \(\mathbf{C} = \mathbf{L} \mathbf{L}'\), the \(\boldsymbol{\omega}\) and the \(\eta \text{std}\) vectors.

This produces a data generating process that is identical to the CP formulation since in both formulations the conditional distribution of \(\eta\) given \(\mathbf{C}\) and \(\boldsymbol{\omega}\) is a multi-variate normal distribution with the same mean vector and covariance matrix. It is well known that if

\[\eta\text{std} \sim \text{Normal}(0, \mathbf{I})\] and

\[\eta = \mathbf{A} \cdot \eta\text{std},\]

then the distribution of \(\eta\) is given by

\[\eta \sim \text{Normal}(0, \mathbf{A} \cdot \mathbf{A}')\].

Let \(\mathbf{A} = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L}\), i.e., \(\mathbf{A}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{L}[i,j]\). Then

\[\mathbf{A} \cdot \mathbf{A}' = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L} \cdot \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega}) = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{C} \cdot \text{Diagonal}(\boldsymbol{omega}) = \boldsymbol{\Omega}\]

since \(\mathbf{A}' = \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega})\) (transposing swaps the 2 matrices) and because \(\mathbf{C} = \mathbf{L} \cdot \mathbf{L}'\) by definition of the Cholesky factor \(\mathbf{L}\).

Now let’s test both parameterizations.

using PharmaDatasets
pkdata = dataset("iv_sd_3")
pop = read_pumas(pkdata)
iparams = (; tvcl = 1.0, tvvc = 10.0, C = float.(Matrix(I(2))), ω = [0.09, 0.09], σ = 0.3)
cp_fit = fit(
    pk_cp,
    pop[1:10], # just the first 10 subjects
    iparams,
    BayesMCMC(; nsamples = 1_000, nadapts = 500),
);
tcp_fit = discard(cp_fit; burnin = 500)
Chains MCMC chain (500×6×4 Array{Float64, 3}):

Iterations        = 1:1:500
Number of chains  = 4
Samples per chain = 500
Wall duration     = 363.87 seconds
Compute duration  = 1211.26 seconds
parameters        = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂

Summary Statistics
  parameters      mean       std      mcse   ess_bulk   ess_tail               ⋯
      Symbol   Float64   Float64   Float64    Float64    Float64               ⋯

        tvcl    0.9687    0.0262    0.0130     4.0650        NaN   18133591743 ⋯
        tvvc   10.3198    0.2912    0.1444     4.0650        NaN   21173875386 ⋯
           σ    0.2859    0.0313    0.0155     4.0650        NaN   18133591743 ⋯
        C₂,₁   -0.0095    0.0115    0.0057     4.3567        NaN               ⋯
          ω₁    0.0883    0.0010    0.0005     4.0650        NaN   18133591743 ⋯
          ω₂    0.0918    0.0014    0.0007     4.0650        NaN   21173875386 ⋯
                                                               2 columns omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

        tvcl    0.9280    0.9550    0.9755    0.9892    0.9956
        tvvc   10.0489   10.1270   10.2126   10.4053   10.8051
           σ    0.2317    0.2846    0.3034    0.3047    0.3052
        C₂,₁   -0.0287   -0.0130   -0.0050   -0.0015    0.0006
          ω₁    0.0867    0.0877    0.0886    0.0891    0.0893
          ω₂    0.0898    0.0908    0.0920    0.0929    0.0932

Our effective sample size (ESS) is very low for all parameters. In fact, the sampler hasn’t actually explored the posterior density. It was stuck in the same place.

ncp_fit = fit(
    pk_ncp,
    pop[1:10], # just the first 10 subjects
    iparams,
    BayesMCMC(; nsamples = 1_000, nadapts = 500),
);
tncp_fit = discard(ncp_fit; burnin = 500)
Chains MCMC chain (500×6×4 Array{Float64, 3}):

Iterations        = 1:1:500
Number of chains  = 4
Samples per chain = 500
Wall duration     = 142.42 seconds
Compute duration  = 427.24 seconds
parameters        = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂

Summary Statistics
  parameters       mean       std      mcse    ess_bulk    ess_tail      rhat  ⋯
      Symbol    Float64   Float64   Float64     Float64     Float64   Float64  ⋯

        tvcl     1.0065    0.2322    0.0068   1214.5208   1167.0571    1.0071  ⋯
        tvvc   161.0225   21.8840    0.7188    917.7697   1213.3405    0.9999  ⋯
           σ     0.0283    0.0037    0.0001   1073.8000   1355.4603    1.0004  ⋯
        C₂,₁    -0.0830    0.4601    0.0107   1735.2800   1012.4155    1.0016  ⋯
          ω₁     1.2276    0.2061    0.0065    991.1758   1156.2900    1.0064  ⋯
          ω₂     0.1137    0.0925    0.0037    684.5426    794.7249    1.0008  ⋯
                                                                1 column omitted

Quantiles
  parameters       2.5%      25.0%      50.0%      75.0%      97.5%
      Symbol    Float64    Float64    Float64    Float64    Float64

        tvcl     0.6265     0.8361     0.9867     1.1400     1.5211
        tvvc   121.6520   146.1216   159.1555   174.8054   207.0101
           σ     0.0219     0.0256     0.0280     0.0306     0.0364
        C₂,₁    -0.9142    -0.4245    -0.1178     0.2335     0.8786
          ω₁     0.8564     1.0835     1.2164     1.3587     1.6696
          ω₂     0.0057     0.0446     0.0930     0.1605     0.3404

As you can see the ESS for all parameters is much better than for the CP fit. This means that our sampler has actually explored the posterior and sampled good samples from the posterior.

Tip

We are truncating our sampler chain with discard using the burnin keyword argument to specify the number of samples to remove in the beginning of the chain. We generally use burnin = nadapts to remove the first nadapts samples from our chain, so that we only keep the samples from the sampling phase while discarding the samples from the adaptation phase.

Note

The effective sample size (ESS) and other convegence metrics are explained in the Markov Chain Monte Carlo Convergence tutorial. Don’t forget to check it out!

4.3 CP vs NCP

Let’s quantify how much more efficient NCP is compared with CP for this model and data:

mean(ess(tncp_fit)) / mean(ess(tcp_fit))
271.03856495012207

5 Conclusions

Whenever possible try to make your prior and posterior topologies friendly to your MCMC sampler by parameterizing the model with independent parameters. You can do this by having the priors from all parameters being independent from each other and reconstructing dependencies after the prior definitions. This will help you to perform more robust inference.

6 References

Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

Neal, R. M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741.