using Pumas
Markov Chain Monte Carlo Convergence
MCMC has an interesting property that it will asymptotically converge to the target distribution.
MCMC asymptotically convergence stems from the Central Limit Theorem (CLT) for Makov Chains. The assumption is that the parameters have finite variance (inherited from the CLT) and that the chain is reversible (also known as detailed balance)
If you are curious about MCMC convergence theory please check Brooks et al. (2011).
That means, if we have all the time in the world, it is guaranteed, irrelevant of the target distribution posterior geometry, MCMC will give you the right answer.
However, we don’t have all the time in the world. Different MCMC algorithms, like HMC and NUTS, can reduce the sampling (and warmup) time necessary for convergence to the target distribution.
Markov chain Monte Carlo (MCMC) samplers have the ability of approximating any target posterior density using approximations with discrete sampling in an iterative manner. The target posterior density is explored by proposing new parameters values using an acception/rejection rule.
Modern MCMC samplers, such as Hamiltoninan Monte Carlo (HMC), uses the log posterior gradient to guide the proposals to areas of high probability, making the posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on two hyperparameters:
- \(L\): the number of steps to use in guiding the proposal towards areas of high probability
- \(\epsilon\): the step-size of each step taken into guiding the proposal towards areas of high probability
No-U-Turn-Sampler (NUTS) (Hoffman & Gelman, 2011) is a MCMC Sampler that uses HMC and can automatically tune \(\epsilon\). This tuning happens during the adaptation phase of the sampler (also called “warm-up”), and, once is done, \(\epsilon\) is fixed for the sampling phase while \(L\) is dynamically chosen in each iteration. The sampling phase is the phase that generates the samples used for the MCMC target posterior density approximation.
For most models, NUTS performs as expected and converges fast to the target posterior density. Nevertheless, there is a common posterior density topology that NUTS (and almost all of MCMC samplers) struggle to explore, sample, and approximate.
1 Can We Prove Convergence?
In the ideal scenario, the NUTS sampler converges to the true posterior and doesn’t miss on any mode.
Unfortunately, this is not easy to prove in general. All the convergence diagnostics are only tests for symptoms of lack of convergence. In other words if all the diagnostics look normal, then we can’t prove that the sampler didn’t converge.
2 Signs of Lack of Convergence
Some signs of lack of convergence are:
- Any of the moments (e.g. the mean or standard deviation) is changing with time. This is diagnosed using stationarity tests by comparing different parts of a single chain to each other.
- Any of the moments is sensitive to the initial parameter values. This is diagnosed using multiple chains by comparing their summary statistics to each other.
While high auto-correlation is not strictly a sign of lack of convergence, samplers with high auto-correlation will require many more samples to get to the same efficiency as another sampler with low auto-correlation. So a low auto-correlation is usually more desirable.
To showcase and give applied explanations of MCMC convergence, let’s recall the model from the Introduction to Bayesian Models in Pumas tutorial:
= @model begin
pk_1cmp @param begin
~ LogNormal(log(3.2), 1)
tvcl ~ LogNormal(log(16.4), 1)
tvv ~ LogNormal(log(3.8), 1)
tvka ~ LogNormal(log(0.04), 0.25)
ω²cl ~ LogNormal(log(0.04), 0.25)
ω²v ~ LogNormal(log(0.04), 0.25)
ω²ka ∈ LogNormal(log(0.2), 0.25)
σ_p end
@random begin
~ Normal(0, sqrt(ω²cl))
ηcl ~ Normal(0, sqrt(ω²v))
ηv ~ Normal(0, sqrt(ω²ka))
ηka end
@covariates begin
Doseend
@pre begin
= tvcl * exp(ηcl)
CL = tvv * exp(ηv)
Vc = tvka * exp(ηka)
Ka end
@dynamics Depots1Central1
@derived begin
:= @. Central / Vc
cp ~ @. Normal(cp, abs(cp) * σ_p)
Conc end
end
PumasModel
Parameters: tvcl, tvv, tvka, ω²cl, ω²v, ω²ka, σ_p
Random effects: ηcl, ηv, ηka
Covariates: Dose
Dynamical system variables: Depot, Central
Dynamical system type: Closed form
Derived: Conc
Observed: Conc
using PharmaDatasets
= dataset("pk_painrelief");
pkpain_df using DataFramesMeta
@chain pkpain_df begin
@rsubset! :Dose != "Placebo"
@rtransform! begin
:amt = :Time == 0 ? parse(Int, chop(:Dose; tail = 3)) : missing
:evid = :Time == 0 ? 1 : 0
:cmt = :Time == 0 ? 1 : 2
end
@rtransform! :Conc = :evid == 1 ? missing : :Conc
end;
first(pkpain_df, 5)
Row | Subject | Time | Conc | PainRelief | PainScore | RemedStatus | Dose | amt | evid | cmt |
---|---|---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64? | Int64 | Int64 | Int64 | String7 | Int64? | Int64 | Int64 | |
1 | 1 | 0.0 | missing | 0 | 3 | 1 | 20 mg | 20 | 1 | 1 |
2 | 1 | 0.5 | 1.15578 | 1 | 1 | 0 | 20 mg | missing | 0 | 2 |
3 | 1 | 1.0 | 1.37211 | 1 | 0 | 0 | 20 mg | missing | 0 | 2 |
4 | 1 | 1.5 | 1.30058 | 1 | 0 | 0 | 20 mg | missing | 0 | 2 |
5 | 1 | 2.0 | 1.19195 | 1 | 1 | 0 | 20 mg | missing | 0 | 2 |
=
pop = read_pumas(
pkpain_noplb
pkpain_df,= :Subject,
id = :Time,
time = :amt,
amt = [:Conc],
observations = [:Dose],
covariates = :evid,
evid = :cmt,
cmt )
Population
Subjects: 120
Covariates: Dose
Observations: Conc
= fit(
pk_1cmp_fit
pk_1cmp,
pop,init_params(pk_1cmp),
BayesMCMC(; nsamples = 2_000, nadapts = 1_000, constantcoef = (; tvka = 2)),
);= Pumas.truncate(pk_1cmp_fit; burnin = 1_000) pk_1cmp_tfit
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
Chains MCMC chain (1000×6×4 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 4
Samples per chain = 1000
Wall duration = 1020.2 seconds
Compute duration = 1018.88 seconds
parameters = tvcl, tvv, ω²cl, ω²v, ω²ka, σ_p
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
tvcl 3.1895 0.0785 0.0048 265.7133 383.2043 1.0103 ⋯
tvv 13.2522 0.2629 0.0135 371.5382 925.1666 1.0230 ⋯
ω²cl 0.0732 0.0079 0.0003 613.7068 1270.1257 1.0026 ⋯
ω²v 0.0459 0.0057 0.0002 1210.5400 2039.7858 1.0025 ⋯
ω²ka 1.1006 0.1437 0.0023 4089.1839 3378.4020 1.0009 ⋯
σ_p 0.1041 0.0024 0.0000 5276.1592 3366.9250 1.0006 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
tvcl 3.0344 3.1392 3.1887 3.2410 3.3484
tvv 12.7367 13.0741 13.2522 13.4270 13.7743
ω²cl 0.0590 0.0677 0.0727 0.0780 0.0897
ω²v 0.0359 0.0419 0.0454 0.0494 0.0584
ω²ka 0.8573 0.9985 1.0912 1.1897 1.4247
σ_p 0.0995 0.1024 0.1041 0.1057 0.1089
3 Diagnostics Plots
There are several diagnostics plots that can be used to assess convergence in a Markov chain:
- Trace Plot
- Cumulative Mean Plot
- Auto-correlation Plot
We’ll dive into each one those plots.
using PumasUtilities
The trace plot of a parameter shows the value of the parameter in each iteration of the MCMC algorithm.
A good trace plot is one that:
- is noisy, not an increase or decreasing line for example
- has a fixed mean
- has a fixed variance
- Shows all chains overlapping with each other, i.e. chain mixing
Let’s see an example with tvcl
and tvv
:
trace_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv], linkyaxes = :none)
You can see that all of 4 chains have mixed well and the trace plot does not show any bias along with the iterations. This means that the chains were stationary along the sampling phase of the MCMC sampler.
The cumulative mean plot of a parameter shows the mean of the parameter value in each MCMC chain up to a certain iteration.
An MCMC chain converging to a stationary posterior distribution should have the cumulative mean of each parameter converge to a fixed value.
Let’s see an example with tvcl
and tvv
:
cummean_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv], linkyaxes = :none)
MCMC chains are prone to auto-correlation between the samples because each sample in the chain is a function of the previous sample.
The auto-correlation plot shows the correlation between every sample with index \(s\) and the corresponding sample with index \(s + \text{lag}\) for all \(s \in [1, N − \text{lag}]\) where \(N\) is the total number of samples.
For each value of lag, we can compute a correlation measure between the samples and their lag-steps-ahead counterparts. The correlation is usually a value between \(0\) and \(1\) but can sometimes be between \(-1\) and \(0\) as well.
The auto-correlation plot shows the lag on the \(x\)-axis and the correlation value on the \(y\)-axis.
For well behaving MCMC chains, when lag increases, the corresponding correlation gets closer to \(0\). This means that there is less and less correlation between any 2 samples further away from each other.
Let’s see an example with tvcl
and tvv
:
autocor_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv])
4 Convergence Metrics
There are a few metrics and diagnostics usually used to assess and diagnose the Markov chains:
- Effective Sample Size (ESS): an approximation of the “number of independent samples” generated by a Markov chain.
- \(\widehat{R}\) (Rhat): potential scale reduction factor, a metric to measure if the Markov chain have mixed, and, potentially, converged.
To find more about convergence metrics, please check Gelman et al. (2013) and McElreath (2020).
The formula for the effective sample size is:
\[\widehat{\eta}_{\text{eff}} = \frac{mn}{1 + \sum^T_{t=1}\widehat{\rho}_t}\]
where:
- \(m\): number of Markov chains
- \(n\): total samples per Markov chain (discarding warmup or burnin)
- \(\widehat{\rho}_t\): an autocorrelation estimate
This formula is an approximation of the “number of independent samples” generated by a Markov chain, since we don’t have a way to recover the true autocorrelation \(\rho\). Instead we rely on an estimate \(\widehat{\rho}\).
To get any ESS from a Pumas Bayesian result from a fit
/Pumas.truncate
. You can use summarystats
and check the :ess
column:
summarystats(pk_1cmp_tfit)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
tvcl 3.1895 0.0785 0.0048 265.7133 383.2043 1.0103 ⋯
tvv 13.2522 0.2629 0.0135 371.5382 925.1666 1.0230 ⋯
ω²cl 0.0732 0.0079 0.0003 613.7068 1270.1257 1.0026 ⋯
ω²v 0.0459 0.0057 0.0002 1210.5400 2039.7858 1.0025 ⋯
ω²ka 1.1006 0.1437 0.0023 4089.1839 3378.4020 1.0009 ⋯
σ_p 0.1041 0.0024 0.0000 5276.1592 3366.9250 1.0006 ⋯
1 column omitted
Or you can also use the ess
function to return a vector of the parameters’ ESS according to the ordering of appearance in summarystats
:
ess(pk_1cmp_tfit)
6-element Vector{Float64}:
264.0006175212188
377.6056199175059
623.2177723207453
1206.7436793411644
3986.1291226015273
5287.071525052197
which makes convenient to apply summaryzing functions:
mean(ess(pk_1cmp_tfit))
1957.4613894590595
The formula for Rhat is:
\[\widehat{R} = \sqrt{\frac{\widehat{\operatorname{var}}^+ \left( \psi \mid y \right)}{W}}\]
where \(\widehat{\operatorname{var}}^+ \left( \psi \mid y \right)\) is the Markov chains’ sample variance for a certain parameter \(\psi\). We calculate it by using a weighted sum of the within-chain \(W\) and between-chain \(B\) variances:
\[\widehat{\operatorname{var}}^+ \left( \psi \mid y \right) = \frac{n - 1}{n} W + \frac{1}{n} B\]
Intuitively, the value is \(\widehat{R} = 1.0\) if all chains are totally convergent. As a heuristic, if \(\widehat{R} > 1.1\), you need to worry because probably the chains have not converged adequately.
To get any Rhat from a Pumas Bayesian result from a fit
/Pumas.truncate
. You can use summarystats
and check the :rhat
column:
summarystats(pk_1cmp_tfit)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
tvcl 3.1895 0.0785 0.0048 265.7133 383.2043 1.0103 ⋯
tvv 13.2522 0.2629 0.0135 371.5382 925.1666 1.0230 ⋯
ω²cl 0.0732 0.0079 0.0003 613.7068 1270.1257 1.0026 ⋯
ω²v 0.0459 0.0057 0.0002 1210.5400 2039.7858 1.0025 ⋯
ω²ka 1.1006 0.1437 0.0023 4089.1839 3378.4020 1.0009 ⋯
σ_p 0.1041 0.0024 0.0000 5276.1592 3366.9250 1.0006 ⋯
1 column omitted
Or you can also use the rhat
function to return a vector of the parameters’ Rhat according to the ordering of appearance in summarystats
:
rhat(pk_1cmp_tfit)
6-element Vector{Float64}:
1.0104571769460664
1.023010370813974
1.0029034515703046
1.0022894768389998
0.9998600814822199
0.9993479088907875
which makes convenient to apply summaryzing functions:
mean(rhat(pk_1cmp_tfit))
1.0063114110903921
5 What To Do If the Markov Chains Do Not Converge?
Before diving into MCMC tweaks and tuning, let’s learn about Gelman’s (2008) Folk Theorem:
“When you have computational problems, often there’s a problem with your model.”
This means that most of the MCMC samplers and algorithms have a great deal of scientific and academic scrutiny, and also engineering stress tests.
Thus, in bad MCMC convergence, since we are Bayesians, we give a higher prior mass on your model having issues than in the MCMC sampler having issues.
There are some things that you can do to make your Markov chains converge:
lower the target acceptance ratio:
fit( ..., BayesMCMC(; target_accept=0.6 # default is 0.8 ) )
re-parameterize your model to have less parameter dependence (check Random Effects in Bayesian Models tutorial)
fix some parameter values to known good values, e.g. values obtained by maximum-a-posteriori (MAP) optimization
initialize the sampling from good parameter values, i.e. good initial values in the
fit
functionuse a stronger prior around suspected good parameter values
simplify your model, e.g. using simpler dynamics
try the marginal MCMC algorithm
MarginalMCMC
instead of the full joint MCMC algorithmBayesMCMC
(check Marginal MCMC Algorithm)
Please refer to the Pumas Bayesian workflow documentation for an in-depth explanation behind these suggestions.
6 References
Gelman, A. (2008). The folk theorem of statistical computing. https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press.
Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.