Random Effects in Bayesian Models

Authors

Jose Storopoli

Mohamed Tarek

using Pumas, Random

This tutorial will present alternate parameterizations in Pumas Bayesian models and the motivation behind them.

We’ll start by reviewing the main limitations of state-of-the-art Markov chain Monte Carlo (MCMC) samplers, when these issues arise, and how to circumvent them.

1 Brief overview of No-U-Turn-Sampler (NUTS)

Markov chain Monte Carlo (MCMC) samplers can approximate any target posterior density with discrete samples drawn iteratively. The target posterior density is explored by proposing new parameter values and accepting or rejecting them according to an acceptance/rejection rule.

Modern MCMC samplers, such as Hamiltonian Monte Carlo (HMC), use the gradient of the log-posterior density to guide proposals toward areas of high probability, making posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on two hyperparameters:

  • \(L\): the number of steps used to guide the proposal toward areas of high probability
  • \(\epsilon\): the step size of each of those steps

No-U-Turn-Sampler (NUTS) (Hoffman & Gelman, 2014) is an MCMC sampler that builds on HMC and automatically tunes \(\epsilon\). This tuning happens during the adaptation phase of the sampler (also called “warm-up”); once it is done, \(\epsilon\) is fixed for the sampling phase while \(L\) is chosen dynamically in each iteration. The sampling phase is the phase that generates the samples used to approximate the target posterior density.
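To make the roles of \(L\) and \(\epsilon\) concrete, below is a minimal sketch of an HMC transition for a one-dimensional standard-normal target, written with the leapfrog integrator that HMC uses internally. This is purely illustrative and is not how Pumas or NUTS are implemented; the function names (neglogp, leapfrog, hmc_step, run_hmc) are made up for this example.

# Illustrative only: a bare-bones HMC transition for a 1-D standard-normal target.
# ϵ is the step size and L the number of leapfrog steps, mirroring the
# hyperparameters discussed above.
neglogp(θ) = 0.5 * θ^2   # negative log-density (up to an additive constant)
grad_neglogp(θ) = θ      # its gradient

function leapfrog(θ, p, ϵ, L)
    p -= ϵ / 2 * grad_neglogp(θ)                  # half step for the momentum
    for i in 1:L
        θ += ϵ * p                                # full step for the position
        i == L || (p -= ϵ * grad_neglogp(θ))      # full momentum steps in between
    end
    p -= ϵ / 2 * grad_neglogp(θ)                  # final half step for the momentum
    return θ, p
end

function hmc_step(θ, ϵ, L)
    p = randn()                                   # draw an auxiliary momentum
    θnew, pnew = leapfrog(θ, p, ϵ, L)
    ΔH = (neglogp(θ) + p^2 / 2) - (neglogp(θnew) + pnew^2 / 2)
    return rand() < exp(ΔH) ? θnew : θ            # Metropolis accept/reject
end

function run_hmc(θ0, ϵ, L, n)
    chain = Vector{typeof(θ0)}(undef, n)
    θ = θ0
    for i in 1:n
        chain[i] = θ = hmc_step(θ, ϵ, L)
    end
    return chain
end

chain = run_hmc(0.0, 0.1, 20, 1_000)              # fixed ϵ = 0.1 and L = 20

Choosing \(\epsilon\) too large makes the accept/reject step discard most proposals, while choosing it too small (or \(L\) too small) makes the proposals barely move; this is exactly the trade-off that NUTS automates.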

For most models, NUTS performs as expected and converges quickly to the target posterior density. Nevertheless, there is a common posterior density topology that NUTS (and almost all MCMC samplers) struggles to explore, sample from, and approximate.

2 The Devil’s Funnel

The Devil’s funnel, also known as Neal’s funnel (Neal, 2003), arises often in hierarchical models (also known as mixed models), especially when there are few data points.

The funnel occurs when the scale of one variable depends exponentially on another variable. A canonical example is:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \eta &\sim \text{Normal}\left( 0, e^{\text{log}\omega} \right) \\ y &\sim \text{Normal}\left( \eta, 0.1 \right) \end{aligned}\]
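Before visualizing the density, we can simulate directly from this generative process. A minimal sketch (variable names are illustrative; Normal and rand are available after the using Pumas statement at the top of this tutorial):

# Simulate draws from the centered-parameterization funnel defined above
Random.seed!(1)
logω = rand(Normal(0, 3), 5_000)        # the log-scale parameter
η = rand.(Normal.(0, exp.(logω)))       # scale depends exponentially on logω
y = rand.(Normal.(η, 0.1))              # observations
extrema(exp.(logω))                     # the scale of η spans many orders of magnitude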

Let’s visualize the Devil’s funnel by looking at the joint prior density over \(\text{log}\omega\) and \(\eta\):

using CairoMakie
# joint prior density: logω ~ Normal(0, 3) on the vertical axis and
# η ~ Normal(0, exp(logω)) on the horizontal axis
funnel(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, exp(y)), x))
heatmap(-5:0.1:5, -7:0.01:2, funnel)

Here we can see that any HMC sampler with a fixed \(\epsilon\) will fail to explore this density properly. At the top of the funnel we need large values of \(\epsilon\), whereas at the bottom (the narrow neck) the opposite holds: we need small values of \(\epsilon\). The NUTS sampler can do better than most samplers since it adapts \(L\) in every iteration: given a fixed small \(\epsilon\), NUTS uses a very large \(L\) in the top part of the funnel and a small \(L\) at the bottom. However, large values of \(L\) cause a severe slowdown in sampling.

3 Reparameterization

The parameterization above is called the Centered Parameterization (CP). The name is a little ambiguous, but the idea behind CP is a parameterization in which the prior of one or more parameters depends on other parameters. This is the case for \(\eta\), since the scale of its prior distribution depends on the exponential of \(\text{log}\omega\). (Standard deviation and variance parameters are always sampled on the log scale, so it does not matter whether you write the model in terms of \(\text{log}\omega\) or \(\omega\).) When there are not enough data points per subject, the posterior density is mostly determined by the prior distributions, so the funnel geometry of the prior carries over to the posterior.

One way to circumvent this is to use an alternative parameterization called the Non-Centered Parameterization (NCP). In NCP, no parameter’s prior depends on another parameter: we sample the parameters independently and transform them afterwards:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \tilde{\eta} &\sim \text{Normal}(0, 1) \\ \eta &= \tilde{\eta} \cdot e^{\text{log}\omega} \\ y &\sim \text{Normal}(\eta, 0.1) \end{aligned}\]

The priors of \(\text{log}\omega\) and \(\tilde{\eta}\) are independent, and we reconstruct \(\eta\) from \(\tilde{\eta}\) by multiplying it by \(e^{\text{log}\omega}\). Since \(\eta\) and \(\tilde{\eta}\) are Gaussian random variables, we can scale and shift one to obtain the other without changing the data-generating process.
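As a quick sanity check (illustrative code, not part of any Pumas model), we can confirm that rescaling standard-normal draws by \(e^{\text{log}\omega}\) reproduces the centered draws in distribution for a fixed \(\text{log}\omega\):

# For a fixed logω, η̃ * exp(logω) with η̃ ~ Normal(0, 1) has the same
# distribution as η ~ Normal(0, exp(logω)) sampled directly
using Statistics
Random.seed!(1)
logω = 1.3                                        # any fixed value
η_cp = rand(Normal(0, exp(logω)), 100_000)        # centered draws
η_ncp = exp(logω) .* rand(Normal(0, 1), 100_000)  # non-centered draws, rescaled
std(η_cp), std(η_ncp), exp(logω)                  # all three are ≈ 3.67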

While both models are equivalent when simulating and generating synthetic data, the second model’s prior is more regular:

# joint density of the now-independent priors: logω ~ Normal(0, 3) and η̃ ~ Normal(0, 1)
ncp(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, 1), x))
heatmap(-5:0.1:5, -7:0.01:2, ncp)

When there are few data points, the posterior will therefore be much easier to sample from using the NCP formulation than the CP formulation, even with a fixed \(L\) and \(\epsilon\). Pumas performs NCP by default. Therefore, the following two models are equivalent:

pk1 = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvvc ~ LogNormal(log(70), 0.25)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(2, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
            lower = zeros(2),
            init = ones(2),
        )
    end

    @random begin
        η ~ MvNormal(Pumas.cor2cov(C, ω))
    end

    @pre begin
        # PK parameters
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
    end

    @dynamics begin
        Central' = -CL / Vc * Central
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), cp * σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, σ, C, ω
  Random effects: η
  Covariates:
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv
pk2 = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvvc ~ LogNormal(log(70), 0.25)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(2, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
            lower = zeros(2),
            init = ones(2),
        )
    end

    @random begin
        ηstd ~ MvNormal(I(2))
    end

    @pre begin
        # compute the η from the ηstd
        # using lower Cholesky triangular matrix
        η = ω .* getchol(C).L * ηstd

        # PK parameters
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
    end

    @dynamics begin
        Central' = -CL / Vc * Central
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), cp * σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, σ, C, ω
  Random effects: ηstd
  Covariates:
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv

Let’s go over what is going on in the model pk1.

First, in the @param block we have two new parameters, C and ω. C is a correlation matrix, to which we give an LKJ prior already decomposed into its Cholesky factor: LKJCholesky. This is useful for performance. Additionally, ω is the vector of standard deviations of our random effects, to which we give a truncated MvNormal prior with zero mean, a diagonal covariance matrix, and a lower bound of 0 on each element. The standard deviations of the MvNormal prior distribution before truncation are 0.4 (we square them because the elements of the diagonal covariance matrix are variances).

Since in the Bayesian approach we need to be careful about how we specify priors, we generally find it more intuitive to specify correlations and standard deviations separately; specifying a full covariance matrix directly is more convoluted, error-prone, and less intuitive. However, you can still work with the covariance matrix Ω directly and give it an InverseWishart or Wishart prior. Call ?InverseWishart or ?Wishart in the Julia REPL for more information on these prior distributions.
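To build intuition for the LKJCholesky prior on C, you can inspect draws from it directly. A minimal sketch, assuming the Distributions.jl LKJCholesky (re-exported by Pumas), whose rand returns a Cholesky factorization:

# Draw from the LKJ prior on the Cholesky factor of a 2×2 correlation matrix
Random.seed!(1)
chol_draw = rand(LKJCholesky(2, 1.0))
C_draw = chol_draw.L * chol_draw.L'   # reconstruct the correlation matrix
C_draw[2, 1]                          # the sampled correlation coefficient

For a 2×2 matrix with shape parameter 1.0, the implied prior on this correlation coefficient is uniform on (-1, 1).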

Now let’s move on to the @random block. Here we specify our ηs as having an MvNormal prior with mean zero, whose covariance matrix is computed from the correlation matrix and the standard deviations using the Pumas.cor2cov function. The covariance matrix \(\boldsymbol{\Omega}\) can be decomposed as:

\[\boldsymbol{\Omega}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{C}[i,j] \cdot \boldsymbol{\omega}[j]\]

where

  • \(\mathbf{C}\) is the correlation matrix
  • \(\boldsymbol{\omega}\) is the vector of standard deviations
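In plain linear algebra, this is \(\boldsymbol{\Omega} = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{C} \cdot \text{Diagonal}(\boldsymbol{\omega})\). A quick numerical check with illustrative values:

# Check Ω[i, j] == ω[i] * C[i, j] * ω[j] element by element
using LinearAlgebra
ω = [0.3, 0.5]                       # illustrative standard deviations
C = [1.0 0.4; 0.4 1.0]               # illustrative correlation matrix
Ω = Diagonal(ω) * C * Diagonal(ω)
Ω ≈ ω .* C .* ω'                     # true: the elementwise scaling gives the same matrix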

While pk1 looks like it is using a CP formulation, Pumas will perform NCP re-parameterization under the hood. However, it is essential to use the Pumas.cor2cov function instead of manually multiplying ω and C using Ω = ω .* C .* ω'. At this stage in the sampling, C is parameterized using its Cholesky factorization. Pumas.cor2cov computes the Cholesky factorization of Ω without constructing the matrix Ω and re-factorizing it when constructing MvNormal. As a result, Pumas.cor2cov is both more numerically stable and memory efficient than manual multiplication.

In pk2, instead of the distribution of the ηs depending on both C and ω, the ηstd in the @random block is independent of all other parameters: ηstd is given an MvNormal prior with mean zero and the identity matrix as covariance matrix.

We then reconstruct our \(\eta\)s in the @pre block using the lower triangular matrix \(\mathbf{L}\) from the Cholesky factorization \(\mathbf{C} = \mathbf{L} \mathbf{L}'\), the vector \(\boldsymbol{\omega}\), and the vector \(\eta \text{std}\).

This produces a data-generating process that is identical to the CP formulation, since in both formulations the conditional distribution of \(\eta\) given \(\mathbf{C}\) and \(\boldsymbol{\omega}\) is a multivariate normal distribution with the same mean vector and covariance matrix. It is well known that if

\[\eta\text{std} \sim \text{Normal}(0, \mathbf{I})\] and

\[\eta = \mathbf{A} \cdot \eta\text{std},\]

then the distribution of \(\eta\) is given by

\[\eta \sim \text{Normal}(0, \mathbf{A} \cdot \mathbf{A}')\].

Let \(\mathbf{A} = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L}\), i.e., \(\mathbf{A}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{L}[i,j]\). Then

\[\mathbf{A} \cdot \mathbf{A}' = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L} \cdot \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega}) = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{C} \cdot \text{Diagonal}(\boldsymbol{\omega}) = \boldsymbol{\Omega}\]

since \(\mathbf{A}' = \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega})\) (transposing reverses the order of the two matrices, and \(\text{Diagonal}(\boldsymbol{\omega})\) is symmetric) and because \(\mathbf{C} = \mathbf{L} \cdot \mathbf{L}'\) by definition of the Cholesky factor \(\mathbf{L}\).
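We can check this identity numerically with the same kind of illustrative values, using cholesky from the LinearAlgebra standard library:

# Verify that A * A' recovers Ω when A = Diagonal(ω) * L and C = L * L'
using LinearAlgebra
ω = [0.3, 0.5]
C = [1.0 0.4; 0.4 1.0]
L = cholesky(C).L                       # lower Cholesky factor of C
A = Diagonal(ω) * L
A * A' ≈ Diagonal(ω) * C * Diagonal(ω)  # true
ηstd = randn(2)
ω .* L * ηstd ≈ A * ηstd                # same construction as in pk2's @pre block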

Now let’s test both models.

using PharmaDatasets
pkdata = dataset("iv_sd_3") # example dataset from PharmaDatasets
pop = read_pumas(pkdata) # parse it into a Pumas population
# initial parameter values for the fit
iparams = (; tvcl = 1.0, tvvc = 10.0, C = float.(Matrix(I(2))), ω = [0.09, 0.09], σ = 0.3)
Random.seed!(123)
res1 = fit(
    pk1,
    pop[1:10], # just the first 10 subjects
    iparams,
    BayesMCMC(; nsamples = 2_000, nadapts = 1_000),
);
dres1 = discard(res1; burnin = 1_000)
dres1
Chains MCMC chain (1000×6×4 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 4
Samples per chain = 1000
Wall duration     = 276.54 seconds
Compute duration  = 895.96 seconds
parameters        = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂

Summary Statistics
  parameters       mean       std      mcse    ess_bulk    ess_tail      rhat  ⋯
      Symbol    Float64   Float64   Float64     Float64     Float64   Float64  ⋯

        tvcl     1.0135    0.2356    0.0053   1928.3705   2399.1246    1.0007  ⋯
        tvvc   161.2139   21.4898    0.5388   1577.1592   2134.6876    1.0006  ⋯
           σ     0.0284    0.0037    0.0001   1701.4757   2549.2565    1.0009  ⋯
        C₂,₁    -0.0735    0.4678    0.0096   2229.8330   1895.3499    1.0004  ⋯
          ω₁     1.2271    0.2070    0.0059   1222.7990   1866.9555    1.0022  ⋯
          ω₂     0.1177    0.0958    0.0029   1112.0181   1843.6374    1.0022  ⋯
                                                                1 column omitted

Quantiles
  parameters       2.5%      25.0%      50.0%      75.0%      97.5%
      Symbol    Float64    Float64    Float64    Float64    Float64

        tvcl     0.6346     0.8498     0.9834     1.1503     1.5587
        tvvc   121.5098   146.8016   159.7654   174.3843   206.4704
           σ     0.0222     0.0258     0.0281     0.0308     0.0364
        C₂,₁    -0.9110    -0.4255    -0.0821     0.2505     0.8661
          ω₁     0.8519     1.0746     1.2147     1.3648     1.6665
          ω₂     0.0044     0.0449     0.0964     0.1662     0.3524

Let’s consider the second model pk2.

Random.seed!(123)
res2 = fit(
    pk2,
    pop[1:10], # just the first 10 subjects
    iparams,
    BayesMCMC(; nsamples = 2_000, nadapts = 1_000),
);
dres2 = discard(res2; burnin = 1_000)
dres2
Chains MCMC chain (1000×6×4 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 4
Samples per chain = 1000
Wall duration     = 274.02 seconds
Compute duration  = 820.48 seconds
parameters        = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂

Summary Statistics
  parameters       mean       std      mcse    ess_bulk    ess_tail      rhat  ⋯
      Symbol    Float64   Float64   Float64     Float64     Float64   Float64  ⋯

        tvcl     1.0135    0.2356    0.0053   1928.3705   2399.1246    1.0007  ⋯
        tvvc   161.2139   21.4898    0.5388   1577.1592   2134.6876    1.0006  ⋯
           σ     0.0284    0.0037    0.0001   1701.4757   2549.2565    1.0009  ⋯
        C₂,₁    -0.0735    0.4678    0.0096   2229.8330   1895.3499    1.0004  ⋯
          ω₁     1.2271    0.2070    0.0059   1222.7990   1866.9555    1.0022  ⋯
          ω₂     0.1177    0.0958    0.0029   1112.0181   1843.6374    1.0022  ⋯
                                                                1 column omitted

Quantiles
  parameters       2.5%      25.0%      50.0%      75.0%      97.5%
      Symbol    Float64    Float64    Float64    Float64    Float64

        tvcl     0.6346     0.8498     0.9834     1.1503     1.5587
        tvvc   121.5098   146.8016   159.7654   174.3843   206.4704
           σ     0.0222     0.0258     0.0281     0.0308     0.0364
        C₂,₁    -0.9110    -0.4255    -0.0821     0.2505     0.8661
          ω₁     0.8519     1.0746     1.2147     1.3648     1.6665
          ω₂     0.0044     0.0449     0.0964     0.1662     0.3524
ess(dres2) ./ ess(dres1)
6-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

As you can see, the ESS is identical for all parameters in both models. This is expected: since Pumas re-parameterizes pk1 into its non-centered form under the hood, both fits run the same sampler on the same model, and with the same random seed they produce identical chains. The other diagnostics also indicate that the sampler has explored the posterior well and produced reliable samples.

Tip

We truncate our chains with discard, using the burnin keyword argument to specify the number of samples to remove from the beginning of each chain. We generally use burnin = nadapts so that we keep only the samples from the sampling phase while discarding those from the adaptation phase.

Note

The effective sample size (ESS) and other convergence metrics are explained in the Markov Chain Monte Carlo Convergence tutorial. Don’t forget to check it out!

4 Conclusions

You can use the CP formulation in Pumas because it is simpler to write, and Pumas will automatically perform the NCP re-parameterization, making your model friendlier to the MCMC sampler. However, always use the Pumas.cor2cov function to build the covariance matrix from the correlation matrix and the standard deviations.

5 References

Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

Neal, R. M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741.