Model Comparison with Crossvalidation

Authors

Jose Storopoli

Mohamed Tarek

using Pumas

In the Bayesian workflow, especially in pharmacometrics modeling, we often end up with many candidate models.

It then becomes imperative to know which model to select.

Note

But What About VPCs?

VPCs are useful for identifying model misfit, but comparing models with VPCs alone is a subjective approach.

There is a more objective approach to comparing Bayesian models, one that uses a robust metric to help us select the best model from a set of candidates.

Having an objective way to compare and choose the best model is very important.

1 Predictive Accuracy

One popular model evaluation criterion is out-of-sample prediction accuracy. This is often quantified using the so-called expected log predictive density (ELPD).

Let \(\mathcal{M}\) be a model with parameters \(\theta\) that describes the data generating process of the observed data \(y\). The ELPD is defined as:

\[\begin{align} \text{elpd} = \int \log P(\tilde{y} \mid y, \mathcal{M}) P_t(\tilde{y}) d\tilde{y} \end{align}\]

where \(\tilde{y}\) is unobserved data, \(P_t(\tilde{y})\) is the true data generating distribution of \(\tilde{y}\) (unknown in practice) and \(P(\tilde{y} \mid y, \mathcal{M})\) is the posterior predictive density defined as:

\[\begin{align} P(\tilde{y} \mid y, \mathcal{M}) = \int P(\tilde{y} \mid \theta, \mathcal{M}) P(\theta \mid y, \mathcal{M}) d\theta \end{align}\]

where \(P(\theta \mid y, \mathcal{M})\) describes the posterior distribution of \(\theta\) given the observed data \(y\) and the model \(\mathcal{M}\). Since the true data generating distribution is unknown, it is common to approximate the ELPD by an empirical distribution over the observed data. One such estimator is the log pointwise predictive density (lppd). Let \(y_i\) be the \(i^{th}\) observation and \(\theta^{[j]}\) be the \(j^{th}\) sample draw from the posterior \(P(\theta \mid y, \mathcal{M})\). The lppd can be calculated using:

\[\begin{align} \text{lppd} & = \frac{1}{N} \sum_{i=1}^N \log P(y_i \mid y, \mathcal{M}) \\ & = \frac{1}{N} \sum_{i=1}^N \log \int P(y_i \mid \theta, \mathcal{M}) P(\theta \mid y, \mathcal{M}) d\theta \\ & \approx \frac{1}{N} \sum_{i=1}^N \log \Big( \frac{1}{M} \sum_{j=1}^M P(y_i \mid \theta^{[j]}, \mathcal{M}) \Big) \end{align}\]
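To make the estimator concrete, here is a minimal sketch in plain Julia (not Pumas-specific) of the last line above, assuming a matrix loglik with loglik[i, j] = \(\log P(y_i \mid \theta^{[j]}, \mathcal{M})\):

# numerically stable log(sum(exp.(x)))
logsumexp(x) = (m = maximum(x); m + log(sum(exp, x .- m)))

# lppd: average the likelihood over the M draws on the log scale,
# then average the pointwise values over the N observations
function lppd(loglik::AbstractMatrix)
    N, M = size(loglik)
    return sum(logsumexp(loglik[i, :]) - log(M) for i = 1:N) / N
end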

A shortcoming of the lppd is that it is not representative of predictive accuracy on unseen data, since \(y_i\) is used both for inference on the posterior and for evaluating the model, so the evaluation is never truly out-of-sample.

Crossvalidation overcomes this problem by ensuring that \(y_i\) is not used for inference on the posterior when evaluating the out-of-sample performance for \(y_i\).

2 Crossvalidation

Crossvalidation (CV) is a resampling method that uses different portions of the data to fit and test a model across iterations.

There are several ways to do CV. The general idea is to leave one or more (K) observations out of the fitting procedure in order to test the model on those left-out observations (a plain-Julia sketch of these splits follows the list):

  • leave-K-out, a generalization of leave-one-out (LOO)
  • leave-future-K-out
  • K-fold
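
Here is a minimal plain-Julia sketch (not the Pumas API) of the fit/test splits each scheme produces for six observations; each pair is (fit indices, test indices). Note that leave-K-out in general considers every size-K subset; consecutive blocks are used here only for illustration.

N = 6

# leave-K-out with K = 2: each consecutive pair of observations is left out once
leave_k_out = [(setdiff(1:N, i:i+1), i:i+1) for i = 1:2:N-1]

# 3-fold: partition the data into 3 folds; each fold serves as the test set once
k_fold = [(setdiff(1:N, f), f) for f in (1:2, 3:4, 5:6)]

# leave-future-K-out with K = 2: fit on the past, test on the next 2 observations
leave_future_k = [(1:i, i+1:i+2) for i = 2:2:N-2]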
Note

Additional Remarks for Hierarchical Models

Depending on the prediction task of interest, one may choose to treat each time observation as a data point or each entire patient/subject as a data point. If the goal is to evaluate the model’s ability to predict responses for new patients, leave-one-subject-out or leave-one-future-subject-out crossvalidation should be used. Alternatively, if the goal is to evaluate the model’s ability to predict future drug concentrations or any other observable time-dependent quantity the model predicts, then leaving future observations out for each subject makes more sense.

In each of these CV techniques we compute the \(\operatorname{elpd}\) by summing the log predictive densities of the left-out observations, conditioned on the observations that were kept in the fit.

For example, this is how we would compute the \(\operatorname{elpd}\) using LOO:

\[\operatorname{elpd}_\text{loo} = \sum^N_{i=1} \log P (y_i \mid y_{-i})\]

where

\[P(y_i \mid y_{-i}) = \int P(y_i \mid \theta) P(\theta \mid y_{-i}) d \theta\]

which is the predictive density conditioned on the data without a single observation \(y_i\).
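
For intuition, here is a toy example of exact LOO in Julia for a conjugate model (hypothetical data; prior \(\theta \sim \operatorname{Normal}(0, 1)\), unit observation variance), where the leave-one-out posterior predictive is available in closed form:

using Distributions

y = [0.2, -0.5, 1.1, 0.7, -0.3]  # hypothetical observations
N = length(y)

# posterior predictive density of yᵢ given the remaining observations
function log_predictive(yi, y_rest)
    n = length(y_rest)
    σ²_n = 1 / (1 + n)        # posterior variance of θ given y_rest
    μ_n = σ²_n * sum(y_rest)  # posterior mean of θ given y_rest
    # posterior predictive for a new observation: Normal(μ_n, σ²_n + 1)
    return logpdf(Normal(μ_n, sqrt(σ²_n + 1)), yi)
end

elpd_loo = sum(log_predictive(y[i], y[setdiff(1:N, i)]) for i = 1:N)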

2.1 Pareto Smoothed Importance Sampling LOO (PSIS-LOO)

Evaluating the \(\operatorname{elpd}\) using any exact CV technique is expensive, since the model must be refitted for each partition of the data.

One approach to overcoming this difficulty is the Pareto-smoothed importance sampling method for leave-one-out crossvalidation (PSIS-LOO-CV). In PSIS-LOO-CV, MCMC is run only once on the full data. The same samples are then re-used in each outer iteration of CV, but with different weights. The weights are determined using importance sampling (IS) by comparing the likelihood with one data point left out to the likelihood of the full dataset. The raw importance weights are then smoothed by fitting a generalized Pareto distribution to them. The smoothed weights can then be used to estimate the \(\operatorname{elpd}\) contribution of each data point.
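To illustrate the importance-sampling idea (without the Pareto-smoothing step, which Pumas handles internally), here is a conceptual sketch reusing the loglik matrix convention from above; a production implementation would also work fully on the log scale:

# importance-sampling LOO with raw (unsmoothed) weights
function loo_elpd_is(loglik::AbstractMatrix)
    N, M = size(loglik)
    elpd = 0.0
    for i = 1:N
        # raw importance ratios rⱼ ∝ 1 / P(yᵢ | θ⁽ʲ⁾);
        # PSIS would smooth the largest ratios with a generalized Pareto fit
        logr = -loglik[i, :]
        logr .= logr .- maximum(logr)      # stabilize before exponentiating
        w = exp.(logr) ./ sum(exp.(logr))  # self-normalized weights
        # self-normalized IS estimate of P(yᵢ | y₋ᵢ)
        elpd += log(sum(w .* exp.(loglik[i, :])))
    end
    return elpd
end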

Besides approximating the \(\operatorname{elpd}\), PSIS-LOO-CV also provides a useful diagnostic: the shape parameter of the generalized Pareto distribution fitted to the raw weights when leaving out each data point. Data points whose removal leads to a large shape parameter are more influential than data points with a low shape parameter. For highly influential points where the Pareto shape parameter is higher than 0.7, the \(\operatorname{elpd}\) contribution of that point can be considered unreliable. In those cases, resampling from the posterior after removing the influential point is recommended instead of relying on PSIS-LOO-CV.

Hence, PSIS-LOO-CV is a good alternative to performing exact crossvalidation: it is orders of magnitude cheaper computationally, while remaining reliable because it provides an objective diagnostic of problems in the \(\operatorname{elpd}\) approximation.

3 How to do Crossvalidation in Bayesian Pumas Models

You can do almost all types of crossvalidation for hierarchical models using Pumas. The main function that performs crossvalidation in Pumas is the crossvalidate function. It takes two positional arguments:

  1. The MCMC result from fit or Pumas.truncate
  2. The crossvalidation algorithm

Besides that, you can do two types of crossvalidation in Pumas:

  • ExactCrossvalidation: Resampling-based CV which performs the MCMC again for each data point or subset left out
  • PSISCrossvalidation: PSIS-based CV which uses the importance sampling and weight smoothing approach to avoid the need for resampling

Finally, for these types of crossvalidation, you can choose:

  • split_method: how to perform the split of the data into fit and test. You have three options:

    • LeaveK: leave-K-out crossvalidation
    • KFold: K-fold crossvalidation
    • LeaveFutureK: leave-future-K-out crossvalidation
  • split_by: guidance on how to split the data into fit and test. You have two options:

    • BySubject: for leaving out subjects
    • ByObservation: for leaving out individual observations per subject

For example, if you want to leave two observations out per subject, you would do:

split_method = LeaveK(; K = 2)
LeaveK{Nothing}(2, nothing)
split_by = ByObservation()
ByObservation{true}()
cv_method = PSISCrossvalidation(; split_method = split_method, split_by = split_by)
PSISCrossvalidation{LeaveK{Nothing}, ByObservation{true}}(LeaveK{Nothing}(2, nothing), ByObservation{true}(), true, 0.7)
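Other combinations follow the same pattern. For example, a leave-one-subject-out configuration, and a leave-future-two-out configuration per subject (the latter assumes LeaveFutureK accepts a K keyword argument like LeaveK does):

# leave-one-subject-out: each subject is left out once
loso_method = PSISCrossvalidation(; split_method = LeaveK(; K = 1), split_by = BySubject())

# leave the last 2 observations of each subject out, refitting for each split
lfo_method = ExactCrossvalidation(; split_method = LeaveFutureK(; K = 2), split_by = ByObservation())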
Tip

Don’t forget to check the Pumas Bayesian workflow documentation for an in-depth explanation of the data splitting methods.

Now let’s recall the model we used in the Random Effects in Bayesian Models tutorial, a one-compartment model. For the sake of comparison, let’s also define a two-compartment model:

pk_1cpt = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvvc ~ LogNormal(log(70), 0.25)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(2, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(2), Diagonal(0.4^2 * ones(2)));
            lower = zeros(2),
            upper = fill(Inf, 2),
            init = ones(2),
        )
    end

    @random begin
        ηstd ~ MvNormal(I(2))
    end

    @pre begin
        # compute the η from the ηstd
        # using lower Cholesky triangular matrix
        η = Diagonal(ω) * (getchol(C).L * ηstd)

        # PK parameters
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
    end

    @dynamics begin
        Central' = -CL / Vc * Central
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, σ, C, ω
  Random effects: ηstd
  Covariates: 
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv
pk_2cpt = @model begin
    @param begin
        tvcl ~ LogNormal(log(1.1), 0.25)
        tvq ~ LogNormal(log(3), 0.5)
        tvvc ~ LogNormal(log(35), 0.25)
        tvvp ~ LogNormal(log(35), 0.5)
        σ ~ truncated(Cauchy(0, 5); lower = 0)
        C ~ LKJCholesky(4, 1.0)
        ω ~ Constrained(
            MvNormal(zeros(4), Diagonal(0.4^2 * ones(4)));
            lower = zeros(4),
            upper = fill(Inf, 4),
            init = ones(4),
        )
    end

    @random begin
        ηstd ~ MvNormal(I(4))
    end

    @pre begin
        # compute the η from the ηstd
        # using lower Cholesky triangular matrix
        η = ω .* (getchol(C).L * ηstd)

        # PK parameters
        CL = tvcl * exp(η[1])
        Q = tvq * exp(η[2])
        Vc = tvvc * exp(η[3])
        Vp = tvvp * exp(η[4])
    end

    @dynamics begin
        Central' = -(CL + Q) / Vc * Central + Q / Vp * Peripheral
        Peripheral' = Q / Vc * Central - Q / Vp * Peripheral
    end

    @derived begin
        cp := @. 1_000 * Central / Vc
        dv ~ @. LogNormal(log(cp), σ)
    end
end
PumasModel
  Parameters: tvcl, tvq, tvvc, tvvp, σ, C, ω
  Random effects: ηstd
  Covariates: 
  Dynamical system variables: Central, Peripheral
  Dynamical system type: Matrix exponential
  Derived: dv
  Observed: dv

We proceed by fitting both models and then truncating the adaptation (burn-in) samples:

using PharmaDatasets
pkdata = dataset("iv_sd_3");
pop = read_pumas(pkdata)
fit_1cpt = fit(
    pk_1cpt,
    pop[1:10], # just the first 10 subjects
    init_params(pk_1cpt),
    Pumas.BayesMCMC(; nsamples = 1_000, nadapts = 500),
);
fit_2cpt = fit(
    pk_2cpt,
    pop[1:10], # just the first 10 subjects
    init_params(pk_2cpt),
    Pumas.BayesMCMC(; nsamples = 1_000, nadapts = 500),
);
tfit_1cpt = Pumas.truncate(fit_1cpt; burnin = 500)
tfit_2cpt = Pumas.truncate(fit_2cpt; burnin = 500)

The next step is to perform crossvalidation on both models:

cv_1cpt = crossvalidate(tfit_1cpt, cv_method);
cv_2cpt = crossvalidate(tfit_2cpt, cv_method);
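
If the PSIS diagnostic flags unreliable points (Pareto shape above 0.7), we could fall back to resampling-based CV with the same splits. A sketch, assuming ExactCrossvalidation accepts the same keyword arguments as PSISCrossvalidation; note that this refits the model for every split, so it is far more expensive:

exact_method = ExactCrossvalidation(; split_method = split_method, split_by = split_by)
# cv_1cpt_exact = crossvalidate(tfit_1cpt, exact_method);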

To conclude, we can estimate the \(\operatorname{elpd}\) from any result of crossvalidate using the elpd function:

elpd(cv_1cpt)
-115.69144690840879
elpd(cv_2cpt)
-83.74872464805097

Hence, judging by the \(\operatorname{elpd}\) values (higher is better), we should choose the two-compartment model.
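
Since elpd returns a scalar estimate, we can also report the difference between the two models directly:

# positive values favor the two-compartment model
elpd_diff = elpd(cv_2cpt) - elpd(cv_1cpt)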

4 References

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.

Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(12), 3571–3594.