`using Pumas`

# Markov Chain Monte Carlo Convergence

MCMC has the remarkable property that it **asymptotically converges to the target distribution**.

MCMC's asymptotic convergence stems from the **C**entral **L**imit **T**heorem (CLT) for Markov chains. The assumptions are that the parameters have finite variance (inherited from the CLT) and that the chain is reversible (also known as satisfying detailed balance).

If you are curious about MCMC convergence theory, please check Brooks et al. (2011).

That means, if we have all the time in the world, it is guaranteed that, regardless of the target posterior's geometry, **MCMC will give you the right answer**.

However, we don’t have all the time in the world. Different MCMC algorithms, like HMC and NUTS, can reduce the sampling (and warmup) time necessary for convergence to the target distribution.

Markov chain Monte Carlo (MCMC) samplers can **approximate any target posterior density with discrete samples drawn in an iterative manner**. The target posterior density is **explored by proposing new parameter values using an acceptance/rejection rule**.
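To make the acceptance/rejection rule concrete, here is a minimal random-walk Metropolis sketch in plain Julia (not Pumas code); the standard Normal target, step size, and function names are illustrative assumptions:

```julia
# Toy random-walk Metropolis sampler; the target and step size are illustrative.
using Random, Statistics

function metropolis(logtarget, x0; iters = 50_000, step = 0.5, rng = Random.default_rng())
    samples = Vector{Float64}(undef, iters)
    x = x0
    for i in 1:iters
        proposal = x + step * randn(rng)    # propose a new parameter value
        if log(rand(rng)) < logtarget(proposal) - logtarget(x)
            x = proposal                    # accept: move to the proposal
        end                                 # reject: keep the current value
        samples[i] = x
    end
    return samples
end

logtarget(x) = -x^2 / 2   # log-density of Normal(0, 1) up to an additive constant
samples = metropolis(logtarget, 0.0; rng = Random.MersenneTwister(1))
```

Iterating this accept/reject rule long enough produces samples whose empirical distribution approximates the target density.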

Modern MCMC samplers, such as Hamiltonian Monte Carlo (HMC), use the **log posterior gradient to guide the proposals to areas of high probability**, making the posterior exploration and approximation more efficient than random-walk proposals. However, this guidance scheme depends on **two hyperparameters**:

- \(L\): the **number of steps** to use in guiding the proposal towards areas of high probability
- \(\epsilon\): the **step size** of each step taken in guiding the proposal towards areas of high probability

**N**o-**U**-**T**urn-**S**ampler (NUTS) (Hoffman & Gelman, 2011) is an MCMC sampler that uses HMC and can **automatically tune \(\epsilon\)**. This tuning happens during the adaptation phase of the sampler (also called “warm-up”), and, once it is done, **\(\epsilon\) is fixed for the sampling phase while \(L\) is dynamically chosen in each iteration**. The sampling phase is the phase that generates the samples used to approximate the target posterior density.
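To build intuition for the roles of \(\epsilon\) and \(L\), here is a sketch (not Pumas internals) of the leapfrog integrator that HMC uses to build a proposal; the standard Normal target and the particular values of \(\epsilon\) and \(L\) are illustrative assumptions:

```julia
# Sketch of the leapfrog integrator behind an HMC proposal (illustrative only).
∇logp(x) = -x   # gradient of the log-density of a standard Normal target

function leapfrog_proposal(x, p, ϵ, L)
    p += ϵ / 2 * ∇logp(x)        # initial half step for the momentum p
    for _ in 1:(L - 1)
        x += ϵ * p               # full step for the position x
        p += ϵ * ∇logp(x)        # full step for the momentum
    end
    x += ϵ * p
    p += ϵ / 2 * ∇logp(x)        # final half step for the momentum
    return x, p
end

# A larger L travels further along the trajectory; a smaller ϵ integrates it
# more accurately, keeping the total energy (and acceptance rate) high.
x′, p′ = leapfrog_proposal(1.0, 0.5, 0.01, 100)
```

The quality of the proposal hinges on both hyperparameters, which is why NUTS's automatic tuning of \(\epsilon\) and dynamic choice of \(L\) matter so much in practice.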

For most models, NUTS performs as expected and converges fast to the target posterior density. Nevertheless, there is a common **posterior density topology that NUTS (and almost all MCMC samplers) struggle to explore, sample, and approximate**.

## 1 Can We Prove Convergence?

In the ideal scenario, the NUTS sampler converges to the true posterior and doesn’t miss any mode.

Unfortunately, this is not easy to prove in general. **All the convergence diagnostics are only tests for symptoms of lack of convergence**. In other words, if all the diagnostics look normal, we still cannot prove that the sampler converged; we can only say that we found no symptoms of non-convergence.

## 2 Signs of Lack of Convergence

Some signs of lack of convergence are:

- Any of the **moments (e.g. the mean or standard deviation) is changing with time**. This is diagnosed using stationarity tests by **comparing different parts of a single chain to each other**.
- Any of the **moments is sensitive to the initial parameter values**. This is diagnosed using multiple chains by **comparing their summary statistics to each other**.
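The two symptoms above can be sketched as simple numeric checks; the toy chains, tolerances, and helper names below are illustrative assumptions, not a Pumas API:

```julia
# Toy symptom checks for lack of convergence (illustrative helpers, not Pumas).
using Random, Statistics

Random.seed!(1234)

# 1. Stationarity within a single chain: compare the means of its two halves.
halves_agree(chain; tol = 0.1) =
    abs(mean(chain[1:end÷2]) - mean(chain[end÷2+1:end])) < tol

# 2. Sensitivity to initial values: compare summary statistics across chains.
chains_agree(chains; tol = 0.1) =
    maximum(mean.(chains)) - minimum(mean.(chains)) < tol

stationary = randn(10_000)                                # well-behaved draws
drifting = randn(10_000) .+ range(0, 1; length = 10_000)  # mean changes with time
```

Here `halves_agree(stationary)` holds while `halves_agree(drifting)` fails, because the drifting chain's mean keeps moving with time.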

While high auto-correlation is not strictly a sign of lack of convergence, a sampler with **high auto-correlation will require many more samples to reach the same efficiency** as a sampler with low auto-correlation. So **low auto-correlation is usually more desirable**.

To showcase and give applied explanations of MCMC convergence, let’s recall the model from the Introduction to Bayesian Models in Pumas tutorial:

```
pk_1cmp = @model begin
    @param begin
        tvcl ~ LogNormal(log(3.2), 1)
        tvv ~ LogNormal(log(16.4), 1)
        tvka ~ LogNormal(log(3.8), 1)
        ω²cl ~ LogNormal(log(0.04), 0.25)
        ω²v ~ LogNormal(log(0.04), 0.25)
        ω²ka ~ LogNormal(log(0.04), 0.25)
        σ_p ~ LogNormal(log(0.2), 0.25)
    end
    @random begin
        ηcl ~ Normal(0, sqrt(ω²cl))
        ηv ~ Normal(0, sqrt(ω²v))
        ηka ~ Normal(0, sqrt(ω²ka))
    end
    @covariates begin
        Dose
    end
    @pre begin
        CL = tvcl * exp(ηcl)
        Vc = tvv * exp(ηv)
        Ka = tvka * exp(ηka)
    end
    @dynamics Depots1Central1
    @derived begin
        cp := @. Central / Vc
        Conc ~ @. Normal(cp, abs(cp) * σ_p)
    end
end
```

```
PumasModel
Parameters: tvcl, tvv, tvka, ω²cl, ω²v, ω²ka, σ_p
Random effects: ηcl, ηv, ηka
Covariates: Dose
Dynamical variables: Depot, Central
Derived: Conc
Observed: Conc
```

```
using PharmaDatasets
pkpain_df = dataset("pk_painrelief");
using DataFramesMeta
@chain pkpain_df begin
    @rsubset! :Dose != "Placebo"
    @rtransform! begin
        :amt = :Time == 0 ? parse(Int, chop(:Dose; tail = 3)) : missing
        :evid = :Time == 0 ? 1 : 0
        :cmt = :Time == 0 ? 1 : 2
    end
    @rtransform! :Conc = :evid == 1 ? missing : :Conc
end;
first(pkpain_df, 5)
```

| Row | Subject | Time | Conc | PainRelief | PainScore | RemedStatus | Dose | amt | evid | cmt |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64? | Int64 | Int64 | Int64 | String7 | Int64? | Int64 | Int64 |
| 1 | 1 | 0.0 | missing | 0 | 3 | 1 | 20 mg | 20 | 1 | 1 |
| 2 | 1 | 0.5 | 1.15578 | 1 | 1 | 0 | 20 mg | missing | 0 | 2 |
| 3 | 1 | 1.0 | 1.37211 | 1 | 0 | 0 | 20 mg | missing | 0 | 2 |
| 4 | 1 | 1.5 | 1.30058 | 1 | 0 | 0 | 20 mg | missing | 0 | 2 |
| 5 | 1 | 2.0 | 1.19195 | 1 | 1 | 0 | 20 mg | missing | 0 | 2 |

```
pop = read_pumas(
    pkpain_df,
    id = :Subject,
    time = :Time,
    amt = :amt,
    observations = [:Conc],
    covariates = [:Dose],
    evid = :evid,
    cmt = :cmt,
)
```

```
Population
Subjects: 120
Covariates: Dose
Observations: Conc
```

```
pk_1cmp_fit = fit(
    pk_1cmp,
    pop,
    init_params(pk_1cmp),
    BayesMCMC(; nsamples = 2_000, nadapts = 1_000);
    constantcoef = (; tvka = 2),
);
pk_1cmp_tfit = Pumas.truncate(pk_1cmp_fit; burnin = 1_000)
```

```
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
```

```
Chains MCMC chain (1000×6×4 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 4
Samples per chain = 1000
Wall duration     = 491.39 seconds
Compute duration  = 491.29 seconds
parameters        = tvcl, tvv, ω²cl, ω²v, ω²ka, σ_p

Summary Statistics
  parameters      mean      std     mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64  Float64  Float64     Float64     Float64   Float64   ⋯

        tvcl    3.1844   0.0803   0.0072    123.6916    218.6267    1.0162   ⋯
         tvv   13.2535   0.2700   0.0158    293.3284    587.4938    1.0150   ⋯
        ω²cl    0.0731   0.0081   0.0004    442.7552    795.4423    1.0125   ⋯
         ω²v    0.0461   0.0055   0.0002    710.8515   1395.6361    1.0059   ⋯
        ω²ka    1.0982   0.1437   0.0028   2625.2624   2559.9029    1.0005   ⋯
         σ_p    0.1040   0.0024   0.0000   3864.2401   3177.1123    1.0006   ⋯
                                                             1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

        tvcl    3.0325    3.1285    3.1846    3.2376    3.3476
         tvv   12.7308   13.0714   13.2560   13.4276   13.8053
        ω²cl    0.0591    0.0675    0.0725    0.0782    0.0907
         ω²v    0.0366    0.0422    0.0456    0.0495    0.0586
        ω²ka    0.8442    0.9967    1.0872    1.1887    1.4036
         σ_p    0.0993    0.1024    0.1040    0.1056    0.1088
```

## 3 Diagnostics Plots

There are several diagnostic plots that can be used to assess convergence of a Markov chain:

- **Trace Plot**
- **Cumulative Mean Plot**
- **Auto-correlation Plot**

We’ll dive into each one of those plots.

`using PumasUtilities`

The trace plot of a parameter shows the **value of the parameter in each iteration** of the MCMC algorithm.

A good trace plot is one that:

- **is noisy**, not an increasing or decreasing line, for example
- has a **fixed mean**
- has a **fixed variance**
- shows all **chains overlapping with each other**, i.e. **chain mixing**

Let’s see an example with `tvcl` and `tvv`:

`trace_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv], linkyaxes = :none)`

You can see that all 4 chains have mixed well and the trace plot does not show any bias across the iterations. This means that the chains were stationary during the sampling phase of the MCMC sampler.

The cumulative mean plot of a parameter shows the **mean of the parameter value in each MCMC chain up to a certain iteration**.

An MCMC chain converging to a stationary posterior distribution should have the cumulative mean of each parameter **converge to a fixed value**.
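The cumulative mean is straightforward to compute by hand; in this sketch the toy chain (plain Normal draws centered at 3) and the `cummean` helper name are illustrative assumptions, not Pumas code:

```julia
# Cumulative mean of a chain of draws: entry i is the mean of draws 1..i.
using Random, Statistics

cummean(chain) = cumsum(chain) ./ (1:length(chain))

Random.seed!(1234)
chain = randn(5_000) .+ 3.0   # toy draws centered at 3
cm = cummean(chain)           # flattens out near 3 as iterations accumulate
```

For a converging chain, the curve of `cm` wiggles early on and then settles to a fixed value, which is exactly what the cumulative mean plot shows.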

Let’s see an example with `tvcl` and `tvv`:

`cummean_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv], linkyaxes = :none)`

MCMC chains are prone to auto-correlation between the samples because **each sample in the chain is a function of the previous sample**.

The auto-correlation plot shows the correlation between every sample with index \(s\) and the corresponding sample with index \(s + \text{lag}\) for all \(s \in [1, N − \text{lag}]\) where \(N\) is the total number of samples.

For each value of lag, we can compute a correlation measure between the samples and their lag-steps-ahead counterparts. The correlation is usually a value between \(0\) and \(1\) but can sometimes be between \(-1\) and \(0\) as well.

The auto-correlation plot shows the lag on the \(x\)-axis and the correlation value on the \(y\)-axis.

**For well behaving MCMC chains, when lag increases, the corresponding correlation gets closer to \(0\)**. This means that there is less and less correlation between any 2 samples further away from each other.
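The lag-\(\text{lag}\) autocorrelation defined above can be computed directly by correlating the chain with a shifted copy of itself; the AR(1) toy chain and helper names below are illustrative assumptions:

```julia
# Empirical lag-k autocorrelation: correlate samples s with samples s + lag.
using Random, Statistics

autocor_at(chain, lag) = cor(chain[1:end-lag], chain[1+lag:end])

# An AR(1) chain x[s+1] = ϕ x[s] + noise has autocorrelation ≈ ϕ^lag,
# which decays towards 0 as the lag grows, like a well-behaving MCMC chain.
function ar1(n; ϕ = 0.9, rng = Random.MersenneTwister(1))
    x = zeros(n)
    for s in 2:n
        x[s] = ϕ * x[s-1] + randn(rng)
    end
    return x
end

chain = ar1(50_000)
```

Plotting `autocor_at(chain, lag)` against `lag` reproduces the decaying shape the auto-correlation plot should show.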

Let’s see an example with `tvcl` and `tvv`:

`autocor_plot(pk_1cmp_tfit; parameters = [:tvcl, :tvv])`

## 4 Convergence Metrics

There are a few metrics and diagnostics usually used to assess and diagnose the Markov chains:

- **E**ffective **S**ample **S**ize (ESS): an approximation of the “number of independent samples” generated by a Markov chain.
- \(\widehat{R}\) (Rhat): the potential scale reduction factor, a metric to measure whether the Markov chains have mixed and, potentially, converged.

To learn more about convergence metrics, please check Gelman et al. (2013) and McElreath (2020).

The formula for the effective sample size is:

\[\widehat{n}_{\text{eff}} = \frac{mn}{1 + 2 \sum^T_{t=1}\widehat{\rho}_t}\]

where:

- \(m\): number of Markov chains
- \(n\): total samples per Markov chain (discarding warmup or burnin)
- \(\widehat{\rho}_t\): an autocorrelation estimate

This formula is an approximation of the “number of independent samples” generated by a Markov chain, since we don’t have a way to recover the true autocorrelation \(\rho\). Instead we rely on an estimate \(\widehat{\rho}\).
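Given autocorrelation estimates, the formula is a one-liner; this sketch uses the common form with a factor of 2 on the autocorrelation sum, as in Gelman et al. (2013), and the geometrically decaying \(\widehat{\rho}_t\) values are an illustrative assumption:

```julia
# ESS sketch: m chains, n draws each, and autocorrelation estimates ρ̂_t.
ess_estimate(m, n, ρ̂) = m * n / (1 + 2 * sum(ρ̂))

ρ̂ = [0.5^t for t in 1:20]   # toy autocorrelation estimates decaying to 0
ess_estimate(4, 1_000, ρ̂)    # far fewer "independent samples" than 4_000 draws
```

With zero autocorrelation the ESS equals the raw number of draws; the higher the autocorrelation, the smaller the effective sample size.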

To get the ESS from a Pumas Bayesian result returned by `fit` or `Pumas.truncate`, you can use `summarystats` and check the `:ess` column:

`summarystats(pk_1cmp_tfit)`

```
Summary Statistics
  parameters      mean      std     mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64  Float64  Float64     Float64     Float64   Float64   ⋯

        tvcl    3.1844   0.0803   0.0072    123.6916    218.6267    1.0162   ⋯
         tvv   13.2535   0.2700   0.0158    293.3284    587.4938    1.0150   ⋯
        ω²cl    0.0731   0.0081   0.0004    442.7552    795.4423    1.0125   ⋯
         ω²v    0.0461   0.0055   0.0002    710.8515   1395.6361    1.0059   ⋯
        ω²ka    1.0982   0.1437   0.0028   2625.2624   2559.9029    1.0005   ⋯
         σ_p    0.1040   0.0024   0.0000   3864.2401   3177.1123    1.0006   ⋯
                                                             1 column omitted
```

Or you can use the `ess` function to return a vector of the parameters’ ESS according to the order of appearance in `summarystats`:

`ess(pk_1cmp_tfit)`

```
6-element Vector{Float64}:
124.16285162630935
293.51349327358827
448.20674330276586
725.292557769113
2584.931550281939
3863.7019531098317
```

which makes it convenient to apply summarizing functions:

`mean(ess(pk_1cmp_tfit))`

`1339.9681915605913`

The formula for Rhat is:

\[\widehat{R} = \sqrt{\frac{\widehat{\operatorname{var}}^+ \left( \psi \mid y \right)}{W}}\]

where \(\widehat{\operatorname{var}}^+ \left( \psi \mid y \right)\) is an estimate of the Markov chains’ sample variance for a certain parameter \(\psi\). We calculate it using a weighted sum of the within-chain variance \(W\) and the between-chain variance \(B\):

\[\widehat{\operatorname{var}}^+ \left( \psi \mid y \right) = \frac{n - 1}{n} W + \frac{1}{n} B\]

Intuitively, \(\widehat{R} = 1.0\) if all chains have fully converged. As a heuristic, if \(\widehat{R} > 1.1\), you should worry, because the chains have probably not converged adequately.
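The two formulas above can be sketched directly for chains stored as plain vectors; the `rhat_estimate` helper name and toy chains are illustrative assumptions, not the Pumas implementation:

```julia
# R̂ sketch: within-chain variance W, between-chain variance B, pooled estimate.
using Random, Statistics

function rhat_estimate(chains)
    n = length(first(chains))          # draws per chain
    W = mean(var.(chains))             # within-chain variance
    B = n * var(mean.(chains))         # between-chain variance
    var⁺ = (n - 1) / n * W + B / n     # pooled variance estimate
    return sqrt(var⁺ / W)
end

Random.seed!(1234)
mixed = [randn(1_000) for _ in 1:4]         # chains exploring the same region
stuck = [randn(1_000) .+ 5i for i in 1:4]   # chains stuck in different regions
```

For the well-mixed chains `rhat_estimate(mixed)` sits very close to 1, while the chains stuck in different regions blow past the 1.1 heuristic.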

To get the Rhat from a Pumas Bayesian result returned by `fit` or `Pumas.truncate`, you can use `summarystats` and check the `:rhat` column:

`summarystats(pk_1cmp_tfit)`

```
Summary Statistics
  parameters      mean      std     mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64  Float64  Float64     Float64     Float64   Float64   ⋯

        tvcl    3.1844   0.0803   0.0072    123.6916    218.6267    1.0162   ⋯
         tvv   13.2535   0.2700   0.0158    293.3284    587.4938    1.0150   ⋯
        ω²cl    0.0731   0.0081   0.0004    442.7552    795.4423    1.0125   ⋯
         ω²v    0.0461   0.0055   0.0002    710.8515   1395.6361    1.0059   ⋯
        ω²ka    1.0982   0.1437   0.0028   2625.2624   2559.9029    1.0005   ⋯
         σ_p    0.1040   0.0024   0.0000   3864.2401   3177.1123    1.0006   ⋯
                                                             1 column omitted
```

Or you can use the `rhat` function to return a vector of the parameters’ Rhat according to the order of appearance in `summarystats`:

`rhat(pk_1cmp_tfit)`

```
6-element Vector{Float64}:
1.0166196513521852
1.0150204195950057
1.0120016144164303
1.0052149359720752
1.0001644159768457
0.9999308355271794
```

which makes it convenient to apply summarizing functions:

`mean(rhat(pk_1cmp_tfit))`

`1.008158645473287`

## 5 What To Do If the Markov Chains Do Not Converge?

Before diving into MCMC tweaks and tuning, let’s learn about **Gelman’s (2008) Folk Theorem**:

“When you have computational problems, often there’s a problem with your model.”

This is because most MCMC samplers and algorithms have received a great deal of scientific and academic scrutiny, as well as engineering stress tests.

Thus, when facing bad MCMC convergence, since we are Bayesians, we put higher prior mass on our model having issues than on the MCMC sampler having issues.

There are some things that you can do to make your Markov chains converge:

- **lower the target acceptance ratio**: `fit(..., BayesMCMC(; target_accept = 0.6))` (the default is `0.8`)
- **re-parameterize your model to have less parameter dependence** (check the Random Effects in Bayesian Models tutorial)
- **fix some parameter values to known good values**, e.g. values obtained by maximum-a-posteriori (MAP) optimization
- **initialize the sampling from good parameter values**, i.e. good initial values in the `fit` function
- **use a stronger prior around suspected good parameter values**
- **simplify your model**, e.g. using simpler dynamics
- **try the marginal MCMC algorithm** `MarginalMCMC` **instead of the full joint MCMC algorithm** `BayesMCMC` (check the Marginal MCMC Algorithm tutorial)

Please refer to the Pumas Bayesian workflow documentation for an in-depth explanation behind these suggestions.

## 6 References

Gelman, A. (2008). The folk theorem of statistical computing. https://statmodeling.stat.columbia.edu/2008/05/13/the_folk_theore/

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.

Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv preprint arXiv:1111.4246.

McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC press.

Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov Chain Monte Carlo. CRC Press.