using Pumas, Random

Random Effects in Bayesian Models
This tutorial will present alternate parameterizations in Pumas Bayesian models and the motivation behind them.
We’ll start with reviewing the main limitations of state-of-the-art Markov chain Monte Carlo (MCMC) samplers, when these issues arise, and how to circumvent them.
1 Brief overview of No-U-Turn-Sampler (NUTS)
Markov chain Monte Carlo (MCMC) samplers have the ability of approximating any target posterior density using approximations with discrete sampling in an iterative manner. The target posterior density is explored by proposing new parameters values using an acception/rejection rule.
Modern MCMC samplers, such as Hamiltoninan Monte Carlo (HMC), use the posterior gradient to guide the proposals to areas of high probability, making the posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on two hyperparameters:
- \(L\): the number of steps to use in guiding the proposal towards areas of high probability
- \(\epsilon\): the step-size of each step taken into guiding the proposal towards areas of high probability
No-U-Turn-Sampler (NUTS) (Hoffman & Gelman, 2011) is a MCMC Sampler that uses HMC and can automatically tune \(\epsilon\). This tuning happens during the adaptation phase of the sampler (also called “warm-up”), and, once it is done, \(\epsilon\) is fixed for the sampling phase while \(L\) is dynamically chosen in each iteration. The sampling phase is the phase that originates the samples used for the MCMC target posterior density approximation.
For most models NUTS perform as expected and converges fast to the target posterior density. Nevertheless, there is a common posterior density topology that NUTS (and almost all of MCMC samplers) struggle to explore, sample, and approximate.
2 The Devil’s Funnel
The Devil’s funnel, also known as Neal (2003)’s funnel, arises often in hierarchical models (also known as mixed models), especially when there are few data points.
The funnel occurs when the variance of a variable depends on another variable in an exponential scale. A canonical example is:
\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \eta &\sim \text{Normal}\left( 0, e^{\text{log}\omega} \right) \\ y &\sim \text{Normal}\left( \eta, 0.1 \right) \end{aligned}\]
Let’s visualize the Devil’s funnel by looking at the joint probability of the prior over \(\text{log}\omega\) and \(\eta\):
using CairoMakie
funnel(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, exp(y / 2)), x))
heatmap(-5:0.1:5, -7:0.01:2, funnel)
Here we can see that any HMC sampler will fail to explore the posterior density with a fixed \(\epsilon\). This happens because in the top of the funnel we need high values for \(\epsilon\); whereas in the bottom of the funnel the opposite happens: we need low values for \(\epsilon\). The NUTS sampler can do better than most samplers since it adapts \(L\) in every iteration. Given a fixed small \(\epsilon\), the NUTS sampler uses a very large \(L\) in the top part of the funnel and a small \(L\) at the bottom. Large values of \(L\) cause a severe slowdown in the sampling.
3 Reparameterization
The parameterization for the funnel is called Centered Parameterization (CP). The naming is a little ambiguous, but the idea behind CP is a parameterization where one or more parameters depend on another parameter. This happens with \(\eta\) since its prior distribution’s scale depends on the exponential of \(\text{log}\omega\). Standard deviation and variance parameters are always sampled in log-scale so it doesn’t matter if you use \(\text{log}\omega\) or \(\omega\) in the model. When there are not enough data points per subject, the posterior density is mostly determined by the prior distributions.
One way to circumvent this is to use an alternate parameterization called Non-Centered Parameterization (NCP). NCP is a parameterization where no parameter’s prior depends on another parameter. Here we sample parameters independently and transform them after sampling:
\[\begin{aligned} \text{log}\omega &\sim \text{Normal}(0, 3) \\ \tilde{\eta} &\sim \text{Normal}(0, 1) \\ \eta &= \tilde{\eta} \cdot e^{\text{log}\omega} \\ y &\sim \text{Normal}(\eta, 0.1) \end{aligned}\]
The priors of \(\text{log}\omega\) and \(\tilde{\eta}\) are independent; and we reconstruct \(\eta\) from \(\tilde{\eta}\) by multiplying it by \(e^{\text{log}\omega}\). Since \(\eta\) and \(\tilde{\eta}\) are Gaussian random variables, we can scale and shift them to derive one from another without changing the data generating process.
While both models are equivalent when simulating and generating synthetic data, the second model’s prior is more regular:
ncp(x, y) = exp(logpdf(Normal(0, 3), y) + logpdf(Normal(0, 1), x))
heatmap(-5:0.1:5, -7:0.01:2, ncp)
When there are few data points, the posterior will therefore be much easier to sample from using the NCP formulation than the CP formulation, even when using a fixed \(L\) and \(\epsilon\). Pumas performs NCP by default. Therefore, the following 2 models are equivalent:
= @model begin
pk1 @param begin
~ LogNormal(log(1.1), 0.25)
tvcl ~ LogNormal(log(70), 0.25)
tvvc ~ truncated(Cauchy(0, 5); lower = 0)
σ ~ LKJCholesky(2, 1.0)
C ~ Constrained(
ω MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
= zeros(2),
lower = ones(2),
init
)end
@random begin
~ MvNormal(Pumas.cor2cov(C, ω))
η end
@pre begin
# PK parameters
= tvcl * exp(η[1])
CL = tvvc * exp(η[2])
Vc end
@dynamics begin
' = -CL / Vc * Central
Centralend
@derived begin
:= @. 1_000 * Central / Vc
cp ~ @. LogNormal(log(cp), cp * σ)
dv end
end
PumasModel
Parameters: tvcl, tvvc, σ, C, ω
Random effects: η
Covariates:
Dynamical system variables: Central
Dynamical system type: Matrix exponential
Derived: dv
Observed: dv
= @model begin
pk2 @param begin
~ LogNormal(log(1.1), 0.25)
tvcl ~ LogNormal(log(70), 0.25)
tvvc ~ truncated(Cauchy(0, 5); lower = 0)
σ ~ LKJCholesky(2, 1.0)
C ~ Constrained(
ω MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
= zeros(2),
lower = ones(2),
init
)end
@random begin
~ MvNormal(I(2))
ηstd end
@pre begin
# compute the η from the ηstd
# using lower Cholesky triangular matrix
= ω .* getchol(C).L * ηstd
η
# PK parameters
= tvcl * exp(η[1])
CL = tvvc * exp(η[2])
Vc end
@dynamics begin
' = -CL / Vc * Central
Centralend
@derived begin
:= @. 1_000 * Central / Vc
cp ~ @. LogNormal(log(cp), cp * σ)
dv end
end
PumasModel
Parameters: tvcl, tvvc, σ, C, ω
Random effects: ηstd
Covariates:
Dynamical system variables: Central
Dynamical system type: Matrix exponential
Derived: dv
Observed: dv
Let’s go over what is going on in the model pk1
.
First, in the @param
block we have new parameters C
and ω
. C
is a correlation matrix, which we give LKJ priors already decomposed into Cholesky factors: LKJCholesky
. This is useful for performance. Additionally, ω
is a vector of standard deviations of our random effects, which we give a truncated MvNormal
prior with zero mean, a diagonal covariance matrix, and a lower bound of 0 on the variables. The standard deviations of the MvNormal
prior distribution before truncation is 0.4 (we square those because the elements of the diagonal covariance matrix are variances).
Since in the Bayesian approach we need to be careful about how to specify priors, we generally find it more intuitive to specify correlations and standard deviations separately. Specifying full covariance matrices is far more convoluted, error-prone, and non-intuitive. However, you can still work with the covariance matrix Ω
directly and give it an InverseWishart
or Wishart
prior. Call ?InverseWishart
or ?Wishart
in the Julia REPL for more information on these prior distributions.
Now let’s move to the @random
block. Here we are specifying our η
s as having an MvNormal
prior with mean zero. The covariance matrix is computed from the correlation matrix and standard deviations using the Pumas.cor2cov
function. The covariance matrix \(\boldsymbol{\Omega}\) can be decomposed into:
\[\boldsymbol{\Omega}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{C}[i,j] \cdot \boldsymbol{\omega}[j]\]
where
- \(\mathbf{C}\) is the correlation matrix
- \(\boldsymbol{\omega}\) is the vector of standard deviations
While pk1
looks like it is using a CP formulation, Pumas will perform NCP re-parameterization under the hood. However, it is essential to use the Pumas.cor2cov
function instead of manually multiplying ω
and C
using Ω = ω .* C .* ω'
. At this stage in the sampling, C
is parameterized using its Cholesky factorization. Pumas.cor2cov
computes the Cholesky factorization of Ω
without constructing the matrix Ω
and re-factorizing it when constructing MvNormal
. As a result, Pumas.cor2cov
is both more numerically stable and memory efficient than manual multiplication.
In pk2
, instead of having the distribution of η
s being dependent on both C
and ω
, we have the ηstd
in the @random
block being independent from any other parameters. ηstd
is given an MvNormal
prior with mean zero and covariance matrix as the identity matrix.
We reconstruct our \(\eta\)s back in the @pre
block by using the lower triangular matrix \(\mathbf{L}\) from the Cholesky factorization of \(\mathbf{C} = \mathbf{L} \mathbf{L}'\), the \(\boldsymbol{\omega}\) and the \(\eta \text{std}\) vectors.
This produces a data generating process that is identical to the CP formulation since in both formulations the conditional distribution of \(\eta\) given \(\mathbf{C}\) and \(\boldsymbol{\omega}\) is a multi-variate normal distribution with the same mean vector and covariance matrix. It is well known that if
\[\eta\text{std} \sim \text{Normal}(0, \mathbf{I})\] and
\[\eta = \mathbf{A} \cdot \eta\text{std},\]
then the distribution of \(\eta\) is given by
\[\eta \sim \text{Normal}(0, \mathbf{A} \cdot \mathbf{A}')\].
Let \(\mathbf{A} = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L}\), i.e., \(\mathbf{A}[i,j] = \boldsymbol{\omega}[i] \cdot \mathbf{L}[i,j]\). Then
\[\mathbf{A} \cdot \mathbf{A}' = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{L} \cdot \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega}) = \text{Diagonal}(\boldsymbol{\omega}) \cdot \mathbf{C} \cdot \text{Diagonal}(\boldsymbol{omega}) = \boldsymbol{\Omega}\]
since \(\mathbf{A}' = \mathbf{L}' \cdot \text{Diagonal}(\boldsymbol{\omega})\) (transposing swaps the 2 matrices) and because \(\mathbf{C} = \mathbf{L} \cdot \mathbf{L}'\) by definition of the Cholesky factor \(\mathbf{L}\).
Now let’s test both models.
using PharmaDatasets
= dataset("iv_sd_3")
pkdata = read_pumas(pkdata)
pop = (; tvcl = 1.0, tvvc = 10.0, C = float.(Matrix(I(2))), ω = [0.09, 0.09], σ = 0.3) iparams
Random.seed!(123)
= fit(
res1
pk1,1:10], # just the first 10 subjects
pop[
iparams,BayesMCMC(; nsamples = 2_000, nadapts = 1_000),
);= discard(res1; burnin = 1_000) dres1
dres1
Chains MCMC chain (1000×6×4 Array{Float64, 3}): Iterations = 1:1:1000 Number of chains = 4 Samples per chain = 1000 Wall duration = 276.54 seconds Compute duration = 895.96 seconds parameters = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂ Summary Statistics parameters mean std mcse ess_bulk ess_tail rhat ⋯ Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯ tvcl 1.0135 0.2356 0.0053 1928.3705 2399.1246 1.0007 ⋯ tvvc 161.2139 21.4898 0.5388 1577.1592 2134.6876 1.0006 ⋯ σ 0.0284 0.0037 0.0001 1701.4757 2549.2565 1.0009 ⋯ C₂,₁ -0.0735 0.4678 0.0096 2229.8330 1895.3499 1.0004 ⋯ ω₁ 1.2271 0.2070 0.0059 1222.7990 1866.9555 1.0022 ⋯ ω₂ 0.1177 0.0958 0.0029 1112.0181 1843.6374 1.0022 ⋯ 1 column omitted Quantiles parameters 2.5% 25.0% 50.0% 75.0% 97.5% Symbol Float64 Float64 Float64 Float64 Float64 tvcl 0.6346 0.8498 0.9834 1.1503 1.5587 tvvc 121.5098 146.8016 159.7654 174.3843 206.4704 σ 0.0222 0.0258 0.0281 0.0308 0.0364 C₂,₁ -0.9110 -0.4255 -0.0821 0.2505 0.8661 ω₁ 0.8519 1.0746 1.2147 1.3648 1.6665 ω₂ 0.0044 0.0449 0.0964 0.1662 0.3524
Let’s consider the second model pk2
.
Random.seed!(123)
= fit(
res2
pk2,1:10], # just the first 10 subjects
pop[
iparams,BayesMCMC(; nsamples = 2_000, nadapts = 1_000),
);= discard(res2; burnin = 1_000) dres2
dres2
Chains MCMC chain (1000×6×4 Array{Float64, 3}): Iterations = 1:1:1000 Number of chains = 4 Samples per chain = 1000 Wall duration = 274.02 seconds Compute duration = 820.48 seconds parameters = tvcl, tvvc, σ, C₂,₁, ω₁, ω₂ Summary Statistics parameters mean std mcse ess_bulk ess_tail rhat ⋯ Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯ tvcl 1.0135 0.2356 0.0053 1928.3705 2399.1246 1.0007 ⋯ tvvc 161.2139 21.4898 0.5388 1577.1592 2134.6876 1.0006 ⋯ σ 0.0284 0.0037 0.0001 1701.4757 2549.2565 1.0009 ⋯ C₂,₁ -0.0735 0.4678 0.0096 2229.8330 1895.3499 1.0004 ⋯ ω₁ 1.2271 0.2070 0.0059 1222.7990 1866.9555 1.0022 ⋯ ω₂ 0.1177 0.0958 0.0029 1112.0181 1843.6374 1.0022 ⋯ 1 column omitted Quantiles parameters 2.5% 25.0% 50.0% 75.0% 97.5% Symbol Float64 Float64 Float64 Float64 Float64 tvcl 0.6346 0.8498 0.9834 1.1503 1.5587 tvvc 121.5098 146.8016 159.7654 174.3843 206.4704 σ 0.0222 0.0258 0.0281 0.0308 0.0364 C₂,₁ -0.9110 -0.4255 -0.0821 0.2505 0.8661 ω₁ 0.8519 1.0746 1.2147 1.3648 1.6665 ω₂ 0.0044 0.0449 0.0964 0.1662 0.3524
ess(dres2) ./ ess(dres1)
6-element Vector{Float64}:
1.0
1.0
1.0
1.0
1.0
1.0
As you can see, the ESS for all parameters is identical in both models. The other diagnostics also indicate that the sampler has explored the posterior and sampled good samples from the posterior.
We are truncating our sampler chain with discard
using the burnin
keyword argument to specify the number of samples to remove in the beginning of the chain. We generally use burnin = nadapts
to remove the first nadapts
samples from our chain, so that we only keep the samples from the sampling phase while discarding the samples from the adaptation phase.
The effective sample size (ESS) and other convergence metrics are explained in the Markov Chain Monte Carlo Convergence tutorial. Don’t forget to check it out!
4 Conclusions
You can use the CP parameterization in Pumas
because it is simpler to write and Pumas
will automatically perform the NCP re-parameterization making your model more friendly to your MCMC sampler. However, always use the Pumas.cor2cov
function to get the covariance matrix from the correlation matrix and standard deviations.
5 References
Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Neal, R. M. (2003). Slice Sampling. The Annals of Statistics, 31(3), 705–741.