`using Pumas`

# Random Effects in Bayesian Models

This tutorial will present **alternate parameterizations** in Pumas Bayesian models and the motivation behind them.

We’ll start with reviewing the main limitations of state-of-the-art Markov chain Monte Carlo (MCMC) samplers, when these issues arise, and how to circumvent them.

## 1 Brief overview of No-U-Turn-Sampler (NUTS)

Markov chain Monte Carlo (MCMC) samplers have the ability of **approximating any target posterior density using approximations with discrete sampling in an iterative manner**. The target posterior density is **explored by proposing new parameters values using an acception/rejection rule**.

Modern MCMC samplers, such as Hamiltoninan Monte Carlo (HMC), use the **posterior gradient to guide the proposals to areas of high probability**, making the posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on **two hyperparameters**:

- \(L\): the
**number of steps**to use in guiding the proposal towards areas of high probability - \(\epsilon\): the
**step-size**of each step taken into guiding the proposal towards areas of high probability

**N**o-**U**-**T**urn-**S**ampler (NUTS) (Hoffman & Gelman, 2011) is a MCMC Sampler that uses HMC and can **automatically tune \(\epsilon\)**. This tuning happens during the adaptation phase of the sampler (also called “warm-up”), and, once it is done, **\(\epsilon\) is fixed for the sampling phase while \(L\) is dynamically chosen in each iteration**. The sampling phase is the phase that originates the samples used for the MCMC target posterior density approximation.

For most models NUTS perform as expected and converges fast to the target posterior density. Nevertheless, there is a common **posterior density topology that NUTS (and almost all of MCMC samplers) struggle to explore, sample, and approximate**.

## 2 The Devil’s Funnel

The **Devil’s funnel, also known as Neal (2003)’s funnel, arises often in hierarchical models** (also known as *mixed models*), especially when there are few data points.

The funnel occurs when the variance of a variable depends on another variable in an exponential scale. A canonical example is:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \eta &\sim \text{Normal}\left( 0, e^{\text{log}\omega} \right) \\ y &\sim \text{Normal}\left( \eta, 0.1 \right) \end{aligned}\]

Let’s visualize the Devil’s funnel by looking at the joint probability of the prior over \(\text{log}\omega\) and \(\eta\):

```
using CairoMakie
funnel(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, exp(y / 2)), x))
heatmap(-5:0.1:5, -7:0.01:2, funnel)
```

Here we can see that any **HMC sampler will fail to explore the posterior density with a fixed \(\epsilon\)**. This happens because in the *top* of the funnel we need *high* values for \(\epsilon\); whereas in the *bottom* of the funnel the opposite happens: we need *low* values for \(\epsilon\). The NUTS sampler can do better than most samplers since it adapts \(L\) in every iteration. Given a fixed small \(\epsilon\), the NUTS sampler uses a very large \(L\) in the top part of the funnel and a small \(L\) at the bottom. Large values of \(L\) cause a severe slowdown in the sampling.

## 3 Reparameterization

The parameterization for the funnel is called **Centered Parameterization** (CP). The naming is a little ambiguous, but the idea behind CP is a **parameterization where one or more parameters depend on another parameter**. This happens with \(\eta\) since its prior distribution’s scale depends on the exponential of \(\text{log}\omega\). Standard deviation and variance parameters are always sampled in log-scale so it doesn’t matter if you use \(\text{log}\omega\) or \(\omega\) in the model. When there are not enough data points per subject, the posterior density is mostly determined by the prior distributions.

One way to circumvent this is to use an alternate parameterization called **Non-Centered Parameterization** (NCP). NCP is a **parameterization where no parameter’s prior depends on another parameter**. Here we sample parameters independently and transform them after sampling:

\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \tilde{\eta} &\sim \text{Normal}(0, 1) \\ \eta &= \tilde{\eta} \cdot e^{\text{log}\omega} \\ y &\sim \text{Normal}(\eta, 0.1) \end{aligned}\]

The priors of \(\text{log}\omega\) and \(\tilde{\eta}\) are independent; and we reconstruct \(\eta\) from \(\tilde{\eta}\) by multiplying it by \(e^{\text{log}\omega}\). Since \(\eta\) and \(\tilde{\eta}\) are Gaussian random variables, we can scale and shift them to derive one from another without changing the data generating process.

While both models are equivalent when simulating and generating synthetic data, the second model’s prior is more regular:

```
ncp(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, 1), x))
heatmap(-5:0.1:5, -7:0.01:2, ncp)
```

When there are few data points, the posterior will therefore be much easier to sample from using the NCP formulation than the CP formulation, even when using a fixed a fixed \(L\) and \(\epsilon\).

## 4 CP versus NCP in Bayesian Pumas Models

```
= @model begin
pk_cp @param begin
~ LogNormal(log(1.1), 0.25)
tvcl ~ LogNormal(log(70), 0.25)
tvvc ~ truncated(Cauchy(0, 5); lower = 0)
σ ~ LKJCholesky(2, 1.0)
C ~ Constrained(
ω MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
= zeros(2),
lower = fill(Inf, 2),
upper = ones(2),
init
)end
@random begin
~ MvNormal(ω .* C .* ω')
η end
@pre begin
# PK parameters
= tvcl * exp(η[1])
CL = tvvc * exp(η[2])
Vc end
@dynamics begin
' = -CL / Vc * Central
Centralend
@derived begin
:= @. 1_000 * Central / Vc
cp ~ @. LogNormal(log(cp), cp * σ)
dv end
end
```

```
PumasModel
Parameters: tvcl, tvvc, σ, C, ω
Random effects: η
Covariates:
Dynamical system variables: Central
Dynamical system type: Matrix exponential
Derived: dv
Observed: dv
```

```
= @model begin
pk_ncp @param begin
~ LogNormal(log(1.1), 0.25)
tvcl ~ LogNormal(log(70), 0.25)
tvvc ~ truncated(Cauchy(0, 5); lower = 0)
σ ~ LKJCholesky(2, 1.0)
C ~ Constrained(
ω MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
= zeros(2),
lower = fill(Inf, 2),
upper = ones(2),
init
)end
@random begin
~ MvNormal(I(2))
ηstd end
@pre begin
# compute the η from the ηstd
# using lower Cholesky triangular matrix
= ω .* (getchol(C).L * ηstd)
η
# PK parameters
= tvcl * exp(η[1])
CL = tvvc * exp(η[2])
Vc end
@dynamics begin
' = -CL / Vc * Central
Centralend
@derived begin
:= @. 1_000 * Central / Vc
cp ~ @. LogNormal(log(cp), cp * σ)
dv end
end
```

```
PumasModel
Parameters: tvcl, tvvc, σ, C, ω
Random effects: ηstd
Covariates:
Dynamical system variables: Central
Dynamical system type: Matrix exponential
Derived: dv
Observed: dv
```

### 4.1 CP

Let’s go over what is going on in the `pk_cp`

model.

First, in the `@param`

block we have new parameters `C`

and `ω`

. ** C is a correlation matrix, which we give LKJ priors already decomposed into Cholesky factors: LKJCholesky**. This is useful for performance. Additionally,

**, which we give a truncated**

`ω`

is a vector of standard deviations of our random effects**. The standard deviations of the**

`MvNormal`

prior with zero mean, a diagonal covariance matrix, and a lower bound of 0 on the variables`MvNormal`

prior distribution before truncation is 0.4 (we square those because the elements of the diagonal covariance matrix are variances).Since in the Bayesian approach we need to be careful about how to specify priors, we generally find it more intuitive to specify correlations and standard deviations separately. Specifying full covariance matrices is far more convoluted, error-prone, and non-intuitive. However, you can still work with the covariance matrix `Ω`

directly and give it an `InverseWishart`

or `Wishart`

prior. Call `?InverseWishart`

or `?Wishart`

in the Julia REPL for more information on these prior distributions.

Now let’s move to the `@random`

block. Here we are specifying our `η`

s as having an `MvNormal`

prior with mean zero. The covariance matrix is computed **by multiplying every element of the correlation matrix with the product of the corresponding standard deviations**: The covariance matrix \(\boldsymbol{\Omega}\) can be decomposed into

\[\boldsymbol{\Omega}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{C}[i,j] \cdot \boldsymbol{\omega}[j]\]

where

- \(\mathbf{C}\) is the correlation matrix
- \(\boldsymbol{\omega}\) is the vector of standard deviations

As we can see this is a CP model because the `η`

s depends on both `C`

and `ω`

and `ω`

is sampled in log-scale because it is positive.

### 4.2 NCP

What is different in the `pk_ncp`

model is that, instead of having the distribution of `η`

s being dependent on both `C`

and `ω`

, we have the ** ηstd in the @random block being independent from any other parameters**.

`ηstd`

is given an `MvNormal`

prior with mean zero and covariance matrix as the identity matrix.We **reconstruct our \(\eta\)s back in the @pre block** by using the lower triangular matrix \(\mathbf{L}\) from the Cholesky factorization of \(\mathbf{C} = \mathbf{L} \mathbf{L}'\), the \(\boldsymbol{\omega}\) and the \(\eta \text{std}\) vectors.

This produces a data generating process that is identical to the CP formulation since in both formulations the conditional distribution of \(\eta\) given \(\mathbf{C}\) and \(\boldsymbol{\omega}\) is a multi-variate normal distribution with the same mean vector and covariance matrix. It is well known that if

\[\eta\text{std} \sim \text{Normal}(0, \mathbf{I})\] and

\[\eta = \mathbf{A} \cdot \eta\text{std},\]

then the distribution of \(\eta\) is given by

\[\eta \sim \text{Normal}(0, \mathbf{A} \cdot \mathbf{A}')\].

Let \(\mathbf{A} = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L}\), i.e., \(\mathbf{A}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{L}[i,j]\). Then

\[\mathbf{A} \cdot \mathbf{A}' = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L} \cdot \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega}) = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{C} \cdot \text{Diagonal}(\boldsymbol{omega}) = \boldsymbol{\Omega}\]

since \(\mathbf{A}' = \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega})\) (transposing swaps the 2 matrices) and because \(\mathbf{C} = \mathbf{L} \cdot \mathbf{L}'\) by definition of the Cholesky factor \(\mathbf{L}\).

Now let’s test both parameterizations.

```
using PharmaDatasets
= dataset("iv_sd_3")
pkdata = read_pumas(pkdata)
pop = (; tvcl = 1.0, tvvc = 10.0, C = float.(Matrix(I(2))), ω = [0.09, 0.09], σ = 0.3) iparams
```

```
= fit(
cp_fit
pk_cp,1:10], # just the first 10 subjects
pop[
iparams,BayesMCMC(; nsamples = 1_000, nadapts = 500),
Pumas.
);= Pumas.truncate(cp_fit; burnin = 500) tcp_fit
```

```
Chains MCMC chain (500×6×4 Array{Float64, 3}):
Iterations = 1:1:500
Number of chains = 4
Samples per chain = 500
Wall duration = 46.58 seconds
Compute duration = 46.45 seconds
parameters = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂
Summary Statistics
parameters mean std mcse ess_bulk ess_tail ⋯
Symbol Float64 Float64 Float64 Float64 Float64 ⋯
tvcl 0.9948 0.0042 0.0021 4.3589 NaN ⋯
tvvc 10.0762 0.0575 0.0285 4.0650 NaN 18133591743 ⋯
σ 0.2994 0.0046 0.0023 4.4057 11.3623 ⋯
C₂,₁ -0.0022 0.0059 0.0029 4.4344 5.2735 ⋯
ω₁ 0.0903 0.0008 0.0004 4.0650 NaN 18133591743 ⋯
ω₂ 0.0900 0.0016 0.0008 4.0931 NaN ⋯
2 columns omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
tvcl 0.9898 0.9911 0.9947 0.9984 1.0000
tvvc 9.9996 10.0319 10.0821 10.1263 10.1408
σ 0.2917 0.2979 0.3008 0.3023 0.3040
C₂,₁ -0.0122 -0.0035 0.0005 0.0018 0.0023
ω₁ 0.0893 0.0898 0.0902 0.0907 0.0914
ω₂ 0.0874 0.0894 0.0904 0.0910 0.0916
```

Our **e**ffective **s**ample **s**ize (ESS) is very low for all parameters. In fact, the sampler hasn’t actually explored the posterior density. It was stuck in the same place.

```
= fit(
ncp_fit
pk_ncp,1:10], # just the first 10 subjects
pop[
iparams,BayesMCMC(; nsamples = 1_000, nadapts = 500),
Pumas.
);= Pumas.truncate(ncp_fit; burnin = 500) tncp_fit
```

```
Chains MCMC chain (500×6×4 Array{Float64, 3}):
Iterations = 1:1:500
Number of chains = 4
Samples per chain = 500
Wall duration = 146.17 seconds
Compute duration = 146.04 seconds
parameters = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
tvcl 1.0129 0.2304 0.0057 1755.0212 1633.1211 1.0035 ⋯
tvvc 161.9066 21.6390 0.7962 781.4020 1149.4327 1.0156 ⋯
σ 0.0284 0.0038 0.0001 779.2104 967.7757 1.0133 ⋯
C₂,₁ -0.0583 0.4791 0.0135 1183.6184 872.0186 1.0051 ⋯
ω₁ 1.2227 0.2060 0.0067 920.3988 1147.6894 1.0025 ⋯
ω₂ 0.1064 0.0908 0.0033 665.4610 767.7212 1.0064 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
tvcl 0.6433 0.8445 0.9865 1.1483 1.5234
tvvc 122.5657 146.8441 160.8620 175.6208 207.6989
σ 0.0222 0.0257 0.0281 0.0307 0.0365
C₂,₁ -0.8987 -0.4212 -0.0825 0.2940 0.8797
ω₁ 0.8614 1.0738 1.2053 1.3537 1.6682
ω₂ 0.0034 0.0378 0.0844 0.1482 0.3361
```

As you can see the ESS for all parameters is much better than for the CP fit. This means that our sampler has actually explored the posterior and sampled good samples from the posterior.

We are truncating our sampler chain with `Pumas.truncate`

using the `burnin`

keyword argument to specify the number of samples to remove in the beginning of the chain. We generally use `burnin = nadapts`

to remove the first `nadapts`

samples from our chain, so that we **only keep the samples from the sampling phase while discarding the samples from the adaptation phase**.

The **e**ffective **s**ample **s**ize (ESS) and other convegence metrics are explained in the Markov Chain Monte Carlo Convergence tutorial. Don’t forget to check it out!

### 4.3 CP vs NCP

Let’s quantify how much more efficient NCP is compared with CP for this model and data:

`mean(ess(tncp_fit)) / mean(ess(tcp_fit))`

`248.53827019514432`

## 5 Conclusions

Whenever possible try to make your prior and posterior topologies friendly to your MCMC sampler by parameterizing the model with independent parameters. You can do this by having the priors from all parameters being independent from each other and reconstructing dependencies after the prior definitions. This will help you to perform **more robust inference**.

## 6 References

Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

Neal, R. M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741.