Item Response Theory

Author

Mohamed Tarek

Item response theory (IRT) describes how a set of discrete questionnaire or clinical-test responses (“items”) depends on an unobserved latent trait (such as disease severity). Each item has its own characteristic curve, driven by a difficulty (cut-off) and a discrimination, and the latent trait can vary over time and across subjects via a nonlinear mixed-effects (NLME) progression model. This tutorial walks through the different ways of defining and fitting an IRT model in Pumas.

1 Learning objectives

By the end of this tutorial you will be able to:

Describe the bounded ordinal IRT model with item-specific discriminations.
Simulate ordinal longitudinal responses from known parameters.
Fit the IRT model by marginal likelihood maximization.
Obtain asymptotic confidence intervals of the population parameters.
Add weakly informative priors on the population parameters and fit the Bayesian model with maximum a-posteriori (MAP) estimation or Markov chain Monte Carlo (MCMC).

First, we load the necessary packages.

using Pumas
using DataFrames
using DataFramesMeta
using Random
using LinearAlgebra
using CairoMakie
using AlgebraOfGraphics
using SummaryTables
using PumasUtilities
using Pumas: OrderedLogit
Random.seed!(123)

2 What is IRT?

IRT models describe the progression of multiple ordinal outcomes (e.g., questionnaire items, clinical tests) as a function of a latent trait (e.g., disease severity). The different outcomes are called items and the latent trait is often called ability. Each item has its own response curve, often called an item characteristic curve (ICC), which maps the latent trait to a probability of a certain response.

In a clinical setting, we might have a few tests that measure different levels of disease severity at different time points. Each test has its own characteristics (difficulty, discrimination) that determine how likely a patient with a certain disease severity is to respond positively to that test. IRT allows us to model these relationships and estimate both the latent trait (disease severity) and the item parameters (difficulty, discrimination) simultaneously from observed responses. For instance, for test A, a patient with a disease severity of 0.5 might have a 70% chance of responding positively (positive is bad), while the same patient might only have a 30% chance of responding positively to test B. As the disease severity increases, the probability of responding positively to both tests will increase, but at different rates depending on the item/test parameters.

Note

The disease severity (or ability) is a latent trait that we cannot directly observe, but we can infer it from the observed (discrete) responses to the items.

Tip

A mixture of discrete and continuous outcomes can also be modeled in a joint IRT framework, driven by a shared latent disease severity variable. For example, we might have a continuous biomarker that is measured alongside the discrete item responses, and we can model both types of outcomes together to get a more comprehensive understanding of the disease progression.

Warning

Simple IRT models assume that the latent trait (ability) is unidimensional and that all the items measure the same underlying trait. A violation of this assumption can lead to biased parameter estimates and incorrect inferences. If increasing the underlying disease severity leads to a higher score on some items but a lower score on others, then the unidimensionality assumption is likely violated and a more complex model (e.g., multidimensional IRT) may be needed. This can be checked during exploratory data analysis by looking at the correlation structure of the items’ responses. If the items are all positively correlated, this is a good sign that they are measuring the same underlying trait. On the other hand, if some items are negatively correlated with others, this may indicate that they are measuring different traits.

3 Bounded ordinal score models

With the conceptual picture in place, we now fix the notation that will carry us through the rest of the tutorial. Consider a study with \(N\) subjects responding to \(J\) items, where each item has a bounded ordinal response. The model below is the graded response model (GRM) of Samejima (1969), the polytomous extension of standard IRT, with item-specific discriminations. We use the following notation:

\(i \in \{1, \dots, N\}\): subject index
\(j \in \{1, \dots, J\}\): item index
\(K_j\): number of response categories for item \(j\), so the response \(y_{i,j}\) takes values in \(\{1, 2, \dots, K_j\}\)
\(k \in \{1, 2, \dots, K_j\}\): category index
\(y_{i,j}\): observed ordinal response of subject \(i\) to item \(j\)
\(f_i \in \mathbb{R}\): latent ability of subject \(i\)
\(a_j > 0\): discrimination parameter of item \(j\) (controls the slope of the response curve)
\(b_{j,k}\): cut-off (difficulty) for item \(j\) at category threshold \(k\), with \(b_{j,1} < b_{j,2} < \dots < b_{j,K_j - 1}\) ordered (\(K_j - 1\) cut-offs in total)

Important

Pumas’ OrderedLogit distribution assumes the response takes values in \(\{1, 2, \dots, K_j\}\), i.e., consecutive integers starting at 1. If your observed scores are ordinal but encoded differently (e.g., \(\{0, 1, 2, 3\}\), \(\{-2, -1, 0, 1, 2\}\), non-consecutive integers like \(\{0, 1, 2, 4\}\), fractional values such as \(\{0.0, 0.125, 0.25, \dots, 3.0\}\), or textual labels like "low" / "medium" / "high"), map them to \(1, 2, \dots, K_j\) before passing them to read_pumas or simobs. The mapping must preserve the ordering, and the same mapping should be used consistently for all subjects and time points so that a given integer always denotes the same category.

3.1 Category probabilities

The probability of each ordinal category is then defined as the difference between two cumulative probabilities. The cumulative probabilities are defined by the logistic (inverse logit) function as follows:

\[ \begin{aligned} P(y_{i,j} = k) & = P(y_{i,j} \leq k) - P(y_{i,j} \leq k - 1) \quad \forall k \in 1 \dots K_j \\ P(y_{i,j} \leq k) & = \begin{cases} 0 & k = 0 \\ 1 & k = K_j \\ \mathrm{logit}^{-1}\!\left(a_j \cdot (b_{j,k} - f_i)\right) & 1 \leq k \leq K_j - 1 \end{cases} \end{aligned} \]

Each cut-off \(b_{j,k}\) governs the cumulative probability \(P(y_{i,j} \leq k)\) of being in category \(k\) or below, with \(k = 1, \dots, K_j - 1\) indexing the \(K_j - 1\) cut-offs. The discrimination \(a_j > 0\) controls how sharply the response probability changes with ability: items with larger \(a_j\) have sharper transitions between categories (a small change in ability moves a lot of probability mass), while items with smaller \(a_j\) have flatter ICCs.

Note

The OrderedLogit distribution in Pumas. OrderedLogit(η) takes a vector \(\eta \in \mathbb{R}^{K-1}\) of (strictly increasing) values and defines a discrete distribution over \(\{1, 2, \dots, K\}\) via

\[ P(y \leq k) = \mathrm{logit}^{-1}(\eta_k), \qquad k = 1, \dots, K - 1, \]

with the boundary cases \(P(y \leq 0) = 0\) and \(P(y \leq K) = 1\). Each \(\eta_k\) is therefore the log-odds of being in category \(k\) or lower. Per-category probabilities are obtained by differencing:

\[ P(y = k) = P(y \leq k) - P(y \leq k - 1). \]

In the IRT model above, we set \(\eta_k = a_j (b_{j,k} - f_i)\), so larger ability \(f_i\) pushes \(\eta\) downward and shifts probability mass toward higher categories. The ordering \(b_{j,1} < \dots < b_{j,K_j - 1}\) implies \(\eta_1 < \dots < \eta_{K_j - 1}\), which is exactly the monotonicity that OrderedLogit requires.

For a more detailed treatment of ordinal regression with OrderedLogit in Pumas (including exploratory data analysis, covariate effects, and category-probability diagnostics), see the dedicated Ordinal Regression tutorial, which the IRT model here builds upon.

3.2 Example: visualizing ICCs

To build intuition, the figure below shows item characteristic curves (ICCs) for a single ordinal item with \(K = 4\) categories (\(y \in \{1, 2, 3, 4\}\) and \(K - 1 = 3\) cut-offs \(b_{j,1}, b_{j,2}, b_{j,3}\)). The left panel shows the cumulative probabilities \(P(y_{i,j} \leq k)\) as a function of ability \(f_i\), and the right panel shows the per-category probabilities \(P(y_{i,j} = k)\).

logistic(x) = 1 / (1 + exp(-x))

fi = -4:0.05:4

# Single item: 4 categories, 3 cut-offs
a_ex = 1.5
b_ex = [-1.0, 0.0, 1.2]

# Cumulative probabilities P(y ≤ k) for k = 1, 2, 3
cum_probs = [logistic.(a_ex .* (b .- fi)) for b in b_ex]

# Category probabilities P(y = k) for k = 1, 2, 3, 4
p_y1 = cum_probs[1]
p_y2 = cum_probs[2] .- cum_probs[1]
p_y3 = cum_probs[3] .- cum_probs[2]
p_y4 = 1 .- cum_probs[3]

icc_cum_df = DataFrame(
    ability = repeat(collect(fi), 3),
    probability = vcat(cum_probs...),
    threshold = repeat(["P(y ≤ 1)", "P(y ≤ 2)", "P(y ≤ 3)"]; inner = length(fi)),
)

icc_cat_df = DataFrame(
    ability = repeat(collect(fi), 4),
    probability = vcat(p_y1, p_y2, p_y3, p_y4),
    category = repeat(["y = 1", "y = 2", "y = 3", "y = 4"]; inner = length(fi)),
)

fig = Figure(; size = (700, 300))
ag1 = draw!(
    fig[1, 1],
    data(icc_cum_df) *
    mapping(:ability => "ability (f)", :probability, color = :threshold => nonnumeric) *
    visual(Lines; linewidth = 3);
    axis = (; title = "Cumulative ICCs"),
)
legend!(fig[1, 2], ag1)

ag2 = draw!(
    fig[1, 3],
    data(icc_cat_df) *
    mapping(:ability => "ability (f)", :probability, color = :category => nonnumeric) *
    visual(Lines; linewidth = 3);
    axis = (; title = "Per-category ICCs"),
)
legend!(fig[1, 4], ag2)

fig

The cumulative curves are ordered logistic functions of ability, each centred at one cut-off \(b_{j,k}\) and sharing the slope \(a_j\). The per-category probabilities sum to 1 at every value of \(f_i\); the middle categories appear as bumps between adjacent cut-offs, while the extreme categories (\(y = 1\) and \(y = K_j\)) are monotone in ability: \(P(y = 1)\) decreasing and \(P(y = K_j)\) increasing as \(f_i\) grows.

4 A Pumas ordinal IRT model with a linear latent ability

Having grounded the cross-sectional picture in a plot, we now turn it into a longitudinal Pumas model. Concretely, we build a graded response model with \(J = 2\) items (score1, score2), each with \(K = 4\) response categories, and each subject measured at multiple time points. In this first example, the latent ability is taken to be linear in time:

\[ \begin{aligned} f_i(t) & = f_{0,i} + s_i \cdot t \\ f_{0,i} & \sim \mathrm{Normal}\!\left(\mu_f,\; \omega_f^2\right) \\ s_i & \sim \mathrm{Normal}\!\left(\mu_s,\; \omega_s^2\right) \end{aligned} \]

Both the intercept \(f_{0,i}\) and the slope \(s_i\) of this linear trajectory carry fixed and random effects: the population means \(\mu_f, \mu_s\) describe the typical disease trajectory, while the per-subject draws \(f_{0,i}, s_i\) around those means capture inter-individual variability. This makes the model a nonlinear mixed effects (NLME) model. Concretely, the population-level parameters are:

\(\mu_f, \mu_s\): fixed-effect (population) mean intercept and progression slope of ability
\(\omega_f^2, \omega_s^2\): variances of the random effects on the intercept and slope (so \(\omega_f, \omega_s\) are SDs)
\(f_{0,i}, s_i\): subject-specific random effects (the actual baseline ability and slope for subject \(i\))

The item-level parameters per item \(j \in \{1, 2\}\) are:

\(b_j \in \mathbb{R}^{K_j - 1}\): ordered per-category cut-offs of item \(j\), with \(b_{j,1} < b_{j,2} < \dots < b_{j,K_j - 1}\). The length of \(b_j\) is the number of cut-offs, one fewer than the number of categories. Each item has its own cut-off parameter, accommodating items with different numbers of categories.
\(a_j > 0\): discrimination of item \(j\) (positive scalar).

The OrderedLogit threshold input for item \(j\) at time \(t\) is \(\eta_{j,k} = a_j \cdot (b_{j,k} - f_i(t))\) for each cut-off \(k = 1, \dots, K_j - 1\), matching the bounded ordinal model defined earlier: \(\mathrm{logit}^{-1}(\eta_{j,k}) = P(y_{i,j} \leq k)\).

We stick with the linear form throughout the rest of this tutorial because it keeps the focus on the IRT-specific parts of the model. More generally, any disease progression model can play the role of \(f_i(t)\), such as exponential, EMAX, indirect-response, or mechanistic ODE-based forms. Switching between progressions is a localized change to @pre (and optionally @dynamics) and its associated parameters; the rest of the model (item cut-offs, discriminations, OrderedLogit observation) is unaffected. We come back to this in the Extensions section.

4.1 Enforcing ordering via an unconstrained parameterization

The ordered cut-off vector \(b_j\) lives in a constrained region of \(\mathbb{R}^{K_j - 1}\), namely the strictly increasing sequences. To let the optimizer work in a simple, unconstrained space, we introduce an unconstrained surrogate \(\theta^b_j \in \mathbb{R}^{K_j - 1}\) (called θb1 and θb2 in the Pumas @param block) and obtain the ordered cut-offs \(b_j\) from \(\theta^b_j\) via a cumulative log-exp transform:

\[ \begin{aligned} b_{j,1} & = \theta^b_{j,1} & & \text{(first cut-off is unconstrained on the real line)} \\ \theta^b_{j,k} & = \log\!\left(b_{j,k} - b_{j,k-1}\right), \quad k > 1 & & \text{(log of the gap between consecutive cut-offs)} \\ b_{j,k} & = b_{j,k-1} + e^{\theta^b_{j,k}}, \quad k > 1 & & \text{(reconstruct ordered cut-offs)} \end{aligned} \]

So when item \(j\)’s first cut-off is estimated:

\(\theta^b_{j,1}\) is the location of the lowest cut-off on the ability scale, free to take any real value.
\(\theta^b_{j,k}\) for \(k > 1\) is the logarithm of the gap between consecutive cut-offs. Because \(e^{\theta^b_{j,k}} > 0\), the reconstructed \(b_{j,k}\) sequence is strictly increasing for any value of \(\theta^b_j\), so the ordering constraint is satisfied automatically. The \(\theta\) prefix distinguishes them from the ordered cut-offs \(b_j\).

When item \(j\)’s first cut-off is instead fixed (which we do for item 1 below, to anchor the latent scale), the lowest cut-off drops out of \(\theta^b_j\) entirely: \(\theta^b_j\) then consists only of log-gaps, has length \(K_j - 2\), and the reconstruction prepends the fixed value before applying the same cumulative log-exp transform. So the meaning of \(\theta^b_{j,1}\) differs between the anchor item (a log-gap) and the rest (a location).

The reparameterization is implemented as a pair of plain Julia functions used both inside the model and when post-processing the fit:

# Map an unconstrained vector θ ∈ R^m to an ordered vector b ∈ R^m:
#   b[1] = θ[1],  b[k] = b[k-1] + exp(θ[k])  for k > 1.
function to_ordered(θ)
    b = similar(θ)
    b[1] = θ[1]
    for k = 2:length(θ)
        b[k] = b[k-1] + exp(θ[k])
    end
    return b
end

# Inverse: map an ordered vector b ∈ R^m to its unconstrained θ ∈ R^m:
#   θ[1] = b[1],  θ[k] = log(b[k] - b[k-1])  for k > 1.
function to_raw(b)
    θ = similar(b)
    θ[1] = b[1]
    for k = 2:length(b)
        θ[k] = log(b[k] - b[k-1])
    end
    return θ
end

4.2 Pinning a reference cut-off for location identifiability

The model is invariant under a joint shift of the latent ability \(f\) and all cut-offs \(b_{j,k}\) by the same constant, so without an external constraint, the location of \(f\) (and hence of every \(b_{j,k}\)) is unidentified. Only the relative positions of the cut-offs across items carry information; the common origin is a convention. To make the parameters recoverable, we therefore pick any one cut-off and pin it to a fixed value (a “reference cut-off”), and let everything else vary freely. The choice of which cut-off is arbitrary (it does not have to be the lowest, the highest, or come from any particular item). For concreteness, we use

\[ b_{1,1} \;=\; 0 \]

throughout this tutorial. In the Pumas model, this is hard-coded directly inside @pre rather than declared in @param, so \(b_{1,1}\) is not a fitted parameter at all. Concretely, \(\theta^b_1\) now holds only the \(K_1 - 2\) log-gaps of item 1, and the ordered cut-offs are reconstructed by prepending \(0\) before applying the cumulative transform.

4.3 Pinning a reference discrimination for scale identifiability

Location is not the only invariance. The model is also invariant under a joint rescaling: for any \(c > 0\), mapping \(f_i \to c f_i\), \(b_{j,k} \to c b_{j,k}\), and \(a_j \to a_j / c\) leaves \(\eta_{j,k} = a_j(b_{j,k} - f_i)\) unchanged. Fixing \(b_{1,1} = 0\) does not resolve this; it only kills the shift. We therefore additionally pin the discrimination of item 1,

\[ a_1 \;=\; 1, \]

also hard-coded in @pre. Only the remaining discriminations \(a_2, \dots, a_J\) are estimated, and they are interpreted relative to item 1: \(a_j = 1.5\) means item \(j\)’s response curve is 1.5× as steep as item 1’s, not 1.5 on some absolute scale.

4.4 The Pumas model

The Pumas model for the IRT model with a linear latent ability trajectory is defined as follows.

n_items = 2
# Per-item number of categories; can differ across items.
# Each item j has n_cats[j] categories and n_cats[j] - 1 cut-offs.
n_cats = [4, 4]
m = n_cats .- 1

irt_ordinal_model = @model begin
    @param begin
        # Ability fixed effects (population intercept and slope)
        μf ∈ RealDomain()
        μs ∈ RealDomain()
        # Ability random-effect SDs (subject-level variability in intercept and slope)
        ωf ∈ RealDomain(; lower = 0.05)
        ωs ∈ RealDomain(; lower = 1e-3)

        # Per-item log-gap vectors. Item 1's first cut-off is hard-coded to 0
        # in @pre as the location reference (not estimated), so θb1 contributes
        # only m[1] - 1 free entries (the log-gaps). Item 2 contributes m[2]
        # entries (first cut-off + log-gaps).
        θb1 ∈ VectorDomain(m[1] - 1)
        θb2 ∈ VectorDomain(m[2])
        # Discriminations for items 2..n_items, on the positive scale. Item 1's
        # discrimination is hard-coded to 1 in @pre as the scale reference, so
        # a_free has length n_items - 1.
        a_free ∈ VectorDomain(n_items - 1; lower = 0.0)
    end

    @random begin
        # Subject-specific intercept and slope drawn directly from their
        # population distributions, parameterized by mean and SD
        f0 ~ Normal(μf, ωf)
        slope ~ Normal(μs, ωs)
    end

    @pre begin
        # Latent ability trajectory for the subject
        f = f0 + slope * t

        # Reconstruct ordered cut-offs b_j. For item 1, prepend the reference
        # cut-off (hard-coded to 0) before the log-gaps θb1.
        b1_ordered = to_ordered(vcat(0.0, θb1))
        b2_ordered = to_ordered(θb2)

        # Full discrimination vector with a[1] = 1 fixed as the scale reference.
        a = vcat(1.0, a_free)

        # OrderedLogit thresholds: η_{j,k} = a_j (b_{j,k} - f)
        lle1 = a[1] .* (b1_ordered .- f)
        lle2 = a[2] .* (b2_ordered .- f)
    end

    @derived begin
        score1 ~ @. OrderedLogit(lle1)
        score2 ~ @. OrderedLogit(lle2)
    end
end

PumasModel
  Parameters: μf, μs, ωf, ωs, θb1, θb2, a_free
  Random effects: f0, slope
  Covariates:
  Dynamical system variables:
  Dynamical system type: No dynamical model
  Derived: score1, score2
  Observed: score1, score2

5 Simulate data

With the model defined, we generate a synthetic dataset that we can later fit it on, so that we can compare the recovered estimates against parameter values we actually know. We simulate ordinal responses for 300 subjects measured at 5 time points (e.g., weeks 0, 4, 8, 12, 24) from fixed ordinal parameter values using Pumas’ simobs. First we build a template population (one Subject per id), then call simobs at the true parameter values.

n_subjects = 300
obs_times = [0.0, 4.0, 8.0, 12.0, 24.0]

# Target ordered cut-offs per item (length m[j] = n_cats[j] - 1).
# Item 1 anchors the latent scale: its first cut-off is fixed at 0.
b1_ordered_true = [0.0, 1.5, 2.5]
b2_ordered_true = [-1.0, 0.3, 1.2]

# Drop the (fixed-at-0) first cut-off of item 1; keep only the log-gaps,
# matching the @param decomposition where θb1 holds the log-gaps alone.
# a_free holds only the free discriminations (items 2..n_items); item 1's
# discrimination is fixed to 1 inside @pre as the scale reference.
true_params = (
    μf = -0.3,
    ωf = 0.8,
    μs = 0.04,
    ωs = 0.02,
    θb1 = to_raw(b1_ordered_true)[2:end],
    θb2 = to_raw(b2_ordered_true),
    a_free = [0.75],
)

template_pop = map(1:n_subjects) do i
    Subject(; id = i, time = obs_times)
end

sim_obs = simobs(irt_ordinal_model, template_pop, true_params)
irt_df = DataFrame(sim_obs)
first(irt_df, 10)

10×13 DataFrame

Row	id	time	score1	score2	evid	f0	slope	f	b1_ordered	b2_ordered	a	lle1	lle2
	String	Float64	Int64?	Int64?	Int64?	Float64?	Float64?	Float64?	Array…?	Array…?	Array…?	Array…?	Array…?
1	1	0.0	2	4	0	0.482736	0.00734241	0.482736	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-0.482736, 1.01726, 2.01726]	[-1.11205, -0.137052, 0.537948]
2	1	4.0	2	1	0	0.482736	0.00734241	0.512106	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-0.512106, 0.987894, 1.98789]	[-1.13408, -0.159079, 0.515921]
3	1	8.0	2	2	0	0.482736	0.00734241	0.541475	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-0.541475, 0.958525, 1.95852]	[-1.15611, -0.181106, 0.493894]
4	1	12.0	1	2	0	0.482736	0.00734241	0.570845	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-0.570845, 0.929155, 1.92916]	[-1.17813, -0.203134, 0.471866]
5	1	24.0	2	4	0	0.482736	0.00734241	0.658954	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-0.658954, 0.841046, 1.84105]	[-1.24422, -0.269215, 0.405785]
6	2	0.0	1	4	0	1.56184	0.0543695	1.56184	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-1.56184, -0.0618438, 0.938156]	[-1.92138, -0.946383, -0.271383]
7	2	4.0	3	4	0	1.56184	0.0543695	1.77932	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-1.77932, -0.279322, 0.720678]	[-2.08449, -1.10949, -0.434491]
8	2	8.0	1	4	0	1.56184	0.0543695	1.9968	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-1.9968, -0.4968, 0.5032]	[-2.2476, -1.2726, -0.5976]
9	2	12.0	2	4	0	1.56184	0.0543695	2.21428	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-2.21428, -0.714278, 0.285722]	[-2.41071, -1.43571, -0.760708]
10	2	24.0	4	3	0	1.56184	0.0543695	2.86671	[0.0, 1.5, 2.5]	[-1.0, 0.3, 1.2]	[1.0, 0.75]	[-2.86671, -1.36671, -0.366712]	[-2.90003, -1.92503, -1.25003]

5.1 Checking category coverage

Before fitting, it is worth checking that every response category for every score is observed at every time point; otherwise some cut-offs will be poorly identified by the likelihood:

@chain irt_df begin
    groupby([:time, :score1])
    combine(nrow => :n)
    simple_table
end


Time	score1	n

0	1	197
0	2	62
0	3	26
0	4	15
4	1	172
4	2	87
4	3	26
4	4	15
8	1	159
8	2	88
8	3	31
8	4	22
12	1	140
12	2	100
12	3	39
12	4	21
24	1	142
24	2	69
24	3	43
24	4	46

@chain irt_df begin
    groupby([:time, :score2])
    combine(nrow => :n)
    simple_table
end


Time	score2	n

0	1	111
0	2	75
0	3	43
0	4	71
4	1	113
4	2	68
4	3	30
4	4	89
8	1	95
8	2	65
8	3	48
8	4	92
12	1	101
12	2	61
12	3	47
12	4	91
24	1	71
24	2	54
24	3	49
24	4	126

All four categories appear with a reasonable count at every time point for both items, so we proceed.

5.2 Reading into a Pumas population

We convert the simulated DataFrame into a Pumas population, ready for fitting:

irt_pop = read_pumas(
    irt_df;
    id = :id,
    time = :time,
    observations = [:score1, :score2],
    event_data = false,
)

Population
  Subjects: 300
  Observations: score1, score2

6 Fit the IRT model with LaplaceI

With the simulated population defined, we now estimate the population parameters using the LaplaceI() algorithm. We deliberately initialize the optimizer away from the true values so that recovery is not trivial:

# Initial unconstrained θb vectors, split out for clarity:
# - θb1_init holds only the log-gaps of item 1 (length m[1] - 1); the first
#   cut-off is hard-coded to 0 inside @pre and is not estimated.
# - θb2_init holds the first cut-off and log-gaps of item 2 (length m[2]).
θb1_init = fill(-0.5, m[1] - 1)
θb2_init = vcat(0.0, fill(-0.5, m[2] - 1))

init_irt = (
    μf = 0.5,
    ωf = 0.4,
    μs = 0.2,
    ωs = 0.1,
    θb1 = θb1_init,
    θb2 = θb2_init,
    a_free = fill(0.5, n_items - 1),
)

(μf = 0.5, ωf = 0.4, μs = 0.2, ωs = 0.1, θb1 = [-0.5, -0.5], θb2 = [0.0, -0.5, -0.5], a_free = [0.5])

fpm_irt_laplace = fit(irt_ordinal_model, irt_pop, init_irt, LaplaceI())

[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter     Function value   Gradient norm 
     0     4.800428e+03     4.779925e+03
 * time: 0.03578996658325195
     1     4.245452e+03     1.013823e+03
 * time: 3.453160047531128
     2     4.158189e+03     3.264277e+02
 * time: 4.141160011291504
     3     3.977290e+03     1.242868e+03
 * time: 4.800279855728149
     4     3.822159e+03     1.826704e+03
 * time: 5.429880857467651
     5     3.733687e+03     5.928409e+02
 * time: 6.080786943435669
     6     3.704887e+03     3.621482e+02
 * time: 6.718828916549683
     7     3.687178e+03     4.735016e+02
 * time: 7.373612880706787
     8     3.659058e+03     1.472809e+02
 * time: 8.002394914627075
     9     3.654377e+03     9.295584e+01
 * time: 8.651124000549316
    10     3.641771e+03     3.179643e+02
 * time: 9.279438972473145
    11     3.627748e+03     4.153773e+02
 * time: 9.906957864761353
    12     3.616755e+03     1.477176e+02
 * time: 10.556922912597656
    13     3.614091e+03     3.115247e+01
 * time: 11.18484902381897
    14     3.613411e+03     2.419889e+01
 * time: 11.830452919006348
    15     3.612684e+03     4.667978e+01
 * time: 12.461710929870605
    16     3.611347e+03     6.251912e+01
 * time: 13.136571884155273
    17     3.610067e+03     4.765711e+01
 * time: 13.76744794845581
    18     3.609369e+03     1.652206e+01
 * time: 14.395721912384033
    19     3.609052e+03     1.704878e+01
 * time: 15.042192935943604
    20     3.608696e+03     2.589193e+01
 * time: 15.669189929962158
    21     3.608037e+03     4.448229e+01
 * time: 16.316513061523438
    22     3.606979e+03     5.243231e+01
 * time: 16.947618007659912
    23     3.605960e+03     3.656687e+01
 * time: 17.60514497756958
    24     3.605339e+03     1.755167e+01
 * time: 18.244412899017334
    25     3.604951e+03     1.569934e+01
 * time: 18.87156391143799
    26     3.604685e+03     2.354730e+01
 * time: 19.51592493057251
    27     3.604013e+03     3.901258e+01
 * time: 20.140594959259033
    28     3.603063e+03     4.595394e+01
 * time: 20.79408288002014
    29     3.602303e+03     3.541563e+01
 * time: 21.419699907302856
    30     3.601964e+03     1.522746e+01
 * time: 22.045002937316895
    31     3.601848e+03     1.157073e+01
 * time: 22.69025492668152
    32     3.601795e+03     1.111696e+01
 * time: 23.321531057357788
    33     3.601675e+03     1.172637e+01
 * time: 23.9655818939209
    34     3.601483e+03     1.638165e+01
 * time: 24.589326858520508
    35     3.601269e+03     1.431190e+01
 * time: 25.235925912857056
    36     3.601149e+03     8.426453e+00
 * time: 25.861865043640137
    37     3.601094e+03     6.148582e+00
 * time: 26.488147974014282
    38     3.601059e+03     6.060319e+00
 * time: 27.13067889213562
    39     3.600998e+03     1.146124e+01
 * time: 27.762945890426636
    40     3.600882e+03     1.640790e+01
 * time: 28.40818500518799
    41     3.600728e+03     1.643570e+01
 * time: 29.03513503074646
    42     3.600601e+03     9.916291e+00
 * time: 29.663702964782715
    43     3.600552e+03     2.555235e+00
 * time: 30.30987000465393
    44     3.600546e+03     2.600715e+00
 * time: 30.93436098098755
    45     3.600543e+03     2.581232e+00
 * time: 31.58029294013977
    46     3.600533e+03     4.282126e+00
 * time: 32.20481991767883
    47     3.600510e+03     7.019430e+00
 * time: 32.832524061203
    48     3.600455e+03     1.095331e+01
 * time: 33.48176288604736
    49     3.600361e+03     1.401551e+01
 * time: 34.10554099082947
    50     3.600264e+03     1.213499e+01
 * time: 34.76631498336792
    51     3.600196e+03     4.054761e+00
 * time: 35.39568901062012
    52     3.600186e+03     2.280079e-01
 * time: 36.02151894569397
    53     3.600185e+03     3.282533e-02
 * time: 36.665420055389404
    54     3.600185e+03     2.051408e-02
 * time: 37.29280185699463
    55     3.600185e+03     1.828249e-03
 * time: 37.94362998008728
    56     3.600185e+03     1.543443e-03
 * time: 38.56424808502197
    57     3.600185e+03     1.489942e-03
 * time: 39.21896505355835
    58     3.600185e+03     1.352684e-03
 * time: 39.848686933517456
    59     3.600185e+03     1.351321e-03
 * time: 40.52302098274231
    60     3.600185e+03     1.349973e-03
 * time: 41.19292497634888
    61     3.600185e+03     1.348623e-03
 * time: 41.8829550743103
    62     3.600185e+03     1.347275e-03
 * time: 42.55357503890991
    63     3.600185e+03     1.347140e-03
 * time: 43.23446798324585
    64     3.600185e+03     1.347005e-03
 * time: 43.93339395523071
    65     3.600185e+03     1.346870e-03
 * time: 44.61492705345154
    66     3.600185e+03     1.346736e-03
 * time: 45.29696297645569
    67     3.600185e+03     1.346601e-03
 * time: 45.99984002113342
    68     3.600185e+03     1.346466e-03
 * time: 46.68307590484619
    69     3.600185e+03     1.346453e-03
 * time: 47.40495204925537
    70     3.600185e+03     1.346439e-03
 * time: 48.11387801170349
    71     3.600185e+03     1.346426e-03
 * time: 48.8020498752594
    72     3.600185e+03     1.346413e-03
 * time: 49.5105619430542
    73     3.600185e+03     1.346399e-03
 * time: 50.22691607475281
    74     3.600185e+03     1.346386e-03
 * time: 50.96284890174866
    75     3.600185e+03     1.346384e-03
 * time: 51.692229986190796
    76     3.600185e+03     1.346383e-03
 * time: 52.41261291503906
    77     3.600185e+03     1.346382e-03
 * time: 53.13038086891174
    78     3.600185e+03     1.346380e-03
 * time: 53.85036587715149
    79     3.600185e+03     1.346379e-03
 * time: 54.57348895072937
    80     3.600185e+03     1.346379e-03
 * time: 55.303210973739624
    81     3.600185e+03     1.346379e-03
 * time: 56.06073999404907
    82     3.600185e+03     1.346379e-03
 * time: 56.84914207458496
    83     3.600185e+03     1.346379e-03
 * time: 57.636290073394775

FittedPumasModel

Dynamical system type:          No dynamical model

Number of subjects:                            300

Observation records:         Active        Missing
    score1:                    1500              0
    score2:                    1500              0
    Total:                     3000              0

Number of parameters:      Constant      Optimized
                                  0             10

Likelihood approximation:                 LaplaceI
Likelihood optimizer:                         BFGS

Termination Reason:              NoObjectiveChange
Log-likelihood value:                   -3600.1853

--------------------
          Estimate
--------------------
μf        -0.60264
μs         0.044821
ωf         0.8417
ωs         0.038788
θb1₁       0.44853
θb1₂       0.14895
θb2₁      -1.4797
θb2₂       0.47622
θb2₃       0.11072
a_free₁    0.60231
--------------------

Recall the true parameters used to simulate the data, to compare against the estimates above:

true_params

(μf = -0.3, ωf = 0.8, μs = 0.04, ωs = 0.02, θb1 = [0.4054651081081644, 0.0], θb2 = [-1.0, 0.26236426446749106, -0.1053605156578264], a_free = [0.75])

Despite the deliberately off starting point, the optimizer converges to estimates closer (but not identical) to the parameters used to simulate the data. Next, we check whether the confidence intervals of the estimates contain the true values.

7 Confidence intervals

Next, we want to get confidence intervals (CIs) for the population parameters and check that they contain the true values we used to simulate the data. Asymptotic confidence intervals for the population parameters can be obtained using infer on the LaplaceI fitted Pumas model.

infer_irt = infer(fpm_irt_laplace)
ci_df = coeftable(infer_irt)
simple_table(ci_df)

[ Info: Calculating: variance-covariance matrix.
[ Info: Done.


parameter	constant	estimate	se	relative_se	ci_lower	ci_upper

μf	false	-0.603	0.0957	0.159	-0.79	-0.415
μs	false	0.0448	0.00601	0.134	0.033	0.0566
ωf	false	0.842	0.0983	0.117	0.649	1.03
ωs	false	0.0388	0.00864	0.223	0.0218	0.0557
θb1₁	false	0.449	0.0477	0.106	0.355	0.542
θb1₂	false	0.149	0.0818	0.549	-0.0114	0.309
θb2₁	false	-1.48	0.223	0.15	-1.92	-1.04
θb2₂	false	0.476	0.144	0.302	0.194	0.758
θb2₃	false	0.111	0.15	1.35	-0.183	0.404
a_free₁	false	0.602	0.0844	0.14	0.437	0.768

We then append true_value and init_value columns by flattening true_params and init_irt in the same row order that coeftable uses. coeftable lists the population scalars as μf, μs, ωf, ωs (all means first, then all SDs), followed by the vector parameters θb1, θb2, a_free:

ci_df.true_value = vcat(
    [true_params.μf, true_params.μs, true_params.ωf, true_params.ωs],
    true_params.θb1,
    true_params.θb2,
    true_params.a_free,
)
ci_df.init_value = vcat(
    [init_irt.μf, init_irt.μs, init_irt.ωf, init_irt.ωs],
    init_irt.θb1,
    init_irt.θb2,
    init_irt.a_free,
)
simple_table(ci_df)


parameter	constant	estimate	se	relative_se	ci_lower	ci_upper	true_value	init_value

μf	false	-0.603	0.0957	0.159	-0.79	-0.415	-0.3	0.5
μs	false	0.0448	0.00601	0.134	0.033	0.0566	0.04	0.2
ωf	false	0.842	0.0983	0.117	0.649	1.03	0.8	0.4
ωs	false	0.0388	0.00864	0.223	0.0218	0.0557	0.02	0.1
θb1₁	false	0.449	0.0477	0.106	0.355	0.542	0.405	-0.5
θb1₂	false	0.149	0.0818	0.549	-0.0114	0.309	0	-0.5
θb2₁	false	-1.48	0.223	0.15	-1.92	-1.04	-1.0	0
θb2₂	false	0.476	0.144	0.302	0.194	0.758	0.262	-0.5
θb2₃	false	0.111	0.15	1.35	-0.183	0.404	-0.105	-0.5
a_free₁	false	0.602	0.0844	0.14	0.437	0.768	0.75	0.5

Each row now shows the LaplaceI point estimate, its confidence interval, the true value used to simulate the data, and the initial value handed to the optimizer, making it easy to check both that every true parameter falls inside its CI and that the fit moved away from its starting point.

8 Predicted item characteristic curves

We opened the tutorial by plotting ICCs from hypothetical parameters; now we plot the estimated ICCs and overlay the true curves. We reconstruct each item’s cut-offs and discrimination from the fit (via to_ordered, prepending the pinned 0 for item 1) and reuse the per-category probability construction from the introductory example.

# Reconstruct estimated item parameters from the LaplaceI fit
est = coef(fpm_irt_laplace)
b1_hat = to_ordered(vcat(0.0, est.θb1))
b2_hat = to_ordered(est.θb2)
a_hat = vcat(1.0, est.a_free)

# Per-category probabilities P(y = k) over an ability grid for one item,
# given its discrimination `a` and ordered cut-offs `b`.
function category_probs(a, b, grid)
    cum = [logistic.(a .* (bk .- grid)) for bk in b]   # P(y ≤ k), k = 1 … K-1
    K = length(b) + 1
    return [k == 1 ? cum[1] : k == K ? 1 .- cum[end] : cum[k] .- cum[k-1] for k = 1:K]
end

# Tidy long-format ICC table for one item and one parameter source.
function icc_long(a, b, grid, item, source)
    probs = category_probs(a, b, grid)
    K = length(probs)
    DataFrame(
        ability = repeat(collect(grid), K),
        probability = reduce(vcat, probs),
        category = repeat(["y = $k" for k = 1:K]; inner = length(grid)),
        item = "item $item",
        source = source,
    )
end

icc_compare = reduce(
    vcat,
    [
        icc_long(a_hat[1], b1_hat, fi, 1, "estimated"),
        icc_long(a_hat[2], b2_hat, fi, 2, "estimated"),
        icc_long(1.0, b1_ordered_true, fi, 1, "true"),
        icc_long(true_params.a_free[1], b2_ordered_true, fi, 2, "true"),
    ],
)

plt =
    data(icc_compare) *
    mapping(
        :ability => "ability (f)",
        :probability => "P(y = k)",
        color = :category => nonnumeric,
        linestyle = :source => nonnumeric => "source",
        col = :item => nonnumeric,
    ) *
    visual(Lines; linewidth = 2)
draw(plt; figure = (; size = (700, 300)))

The estimated curves (solid) sit close to the true curves (dashed) for both items, confirming visually what the confidence-interval table reported. Item 2’s flatter transitions reflect its smaller estimated discrimination.

9 Model diagnostics and qualification

With real data there is no ground truth to compare against, so we rely on diagnostics that ask whether the fitted model is consistent with the observed responses.

9.1 Information criteria

aic and bic are most useful for comparing competing models (e.g. a linear vs. an EMAX latent trajectory) fitted to the same data:

(; AIC = aic(fpm_irt_laplace), BIC = bic(fpm_irt_laplace))

(AIC = 7220.37065592953, BIC = 7280.434331606032)

9.2 Visual predictive check

The key qualification tool is the visual predictive check (VPC), which overlays the observed category proportions over time on the distribution simulated from the fitted model. Pumas builds a categorical VPC with one panel per response category, so we plot one item at a time. The panels default to first-appearance order, so we pass levels to force them to read \(1 \to K\). Pumas names the per-category columns <observation>_<value>, which we reconstruct from the sorted category values:

# Build sorted level names matching Pumas' `<observation>_<value>` columns.
pop_df = DataFrame(irt_pop)
ordered_levels(score) =
    [Symbol("$(score)_$(v)") for v in sort(unique(skipmissing(pop_df[!, score])))]

ordered_levels (generic function with 1 method)

irt_vpc_1 = vpc(fpm_irt_laplace; observations = [:score1])
vpc_plot(irt_vpc_1; levels = ordered_levels(:score1), figure = (; size = (600, 1200)))

[ Info: Discrete VPC
[ Info: Automatically choosing bandwidth for smooth discrete VPC.
[ Info: Automatic bandwidth choice complete with bandwidth: 24.0.

irt_vpc_2 = vpc(fpm_irt_laplace; observations = [:score2])
vpc_plot(irt_vpc_2; levels = ordered_levels(:score2), figure = (; size = (600, 1200)))

[ Info: Discrete VPC
[ Info: Automatically choosing bandwidth for smooth discrete VPC.
[ Info: Automatic bandwidth choice complete with bandwidth: 24.0.

Each panel shows one response category’s observed proportion over time (for one item) against the simulated prediction band; bands that envelop the observed proportions across categories and items indicate an adequate fit.

10 Priors as regularization when data is sparse

With 300 subjects, every category was observed enough times that pure maximum likelihood worked fine. But what happens when we don’t have that much data?

The fit we just ran uses marginal maximum likelihood (LaplaceI()): each parameter is declared in @param with ∈ RealDomain(...) or ∈ VectorDomain(...), which specifies only its support, with no probabilistic information. With the \(N = 300\) subjects and 5 observations per item per subject we just used, every category of every item is observed enough times that the likelihood alone identifies all parameters well, and MLE works fine.

The story changes when some categories are barely observed: for example, a clinical scale where the worst category is reached by only a handful of subjects, or an item where one threshold is rarely crossed. In that regime, the corresponding cut-off has very little likelihood information, leading to numerical instability and non-estimability of some parameters.

10.1 A prior-aware Pumas model

The standard remedy is to regularize via priors, i.e., to fit by maximum a posteriori (MAP) instead of MLE. In Pumas, this means redeclaring each parameter with the ~ syntax instead of ∈, which both fixes its support and attaches a prior distribution:

irt_ordinal_model_map = @model begin
    @param begin
        # Ability fixed effects: weakly informative Gaussians centred at 0.
        μf ~ Normal(0, 2)
        μs ~ Normal(0, 2)
        # Random-effect SDs: half-Cauchy on the positive line. Heavy tails
        # allow large values; the small lower bound keeps the optimizer
        # from collapsing the SD to exactly zero.
        ωf ~ truncated(Cauchy(0, 1); lower = 0.05)
        ωs ~ truncated(Cauchy(0, 1); lower = 1e-3)

        # Item 1: θb1 contains only log-gaps (first cut-off is fixed to 0).
        # N(0, 1) on each log-gap implies a prior 95% interval of roughly
        # (e^{-2}, e^{2}) ≈ (0.14, 7.4) for the gap between consecutive
        # cut-offs. This is broad relative to typical category spacings but not
        # absurd.
        θb1 ~ MvNormal(zeros(m[1] - 1), I(m[1] - 1))
        # Item 2: θb2[1] is the lowest cut-off (a location on the latent
        # scale) and gets a wider N(0, 2) prior; θb2[2:end] are log-gaps
        # and get the same N(0, 1) prior as for item 1.
        θb2 ~ MvNormal(zeros(m[2]), Diagonal(vcat(4.0, ones(m[2] - 1))))
        # Free item discriminations (items 2..n_items): N(0, 2) on log(a_j).
        # Item 1's discrimination is fixed to 1 as the scale reference.
        a_free ~ MvLogNormal(zeros(n_items - 1), 4 * I(n_items - 1))
    end

    @random begin
        f0 ~ Normal(μf, ωf)
        slope ~ Normal(μs, ωs)
    end

    @pre begin
        f = f0 + slope * t
        b1_ordered = to_ordered(vcat(0.0, θb1))
        b2_ordered = to_ordered(θb2)
        a = vcat(1.0, a_free)
        lle1 = a[1] .* (b1_ordered .- f)
        lle2 = a[2] .* (b2_ordered .- f)
    end

    @derived begin
        score1 ~ @. OrderedLogit(lle1)
        score2 ~ @. OrderedLogit(lle2)
    end
end

PumasModel
  Parameters: μf, μs, ωf, ωs, θb1, θb2, a_free
  Random effects: f0, slope
  Covariates:
  Dynamical system variables:
  Dynamical system type: No dynamical model
  Derived: score1, score2
  Observed: score1, score2

Only the @param block has changed; everything downstream is identical. The prior choices above are deliberately weakly informative, and they treat locations on the latent scale and log-gaps between cut-offs differently. A wide prior on the log scale corresponds to an absurdly wide prior on the gap itself, which is rarely what we want:

\(\mathrm{Normal}(0, 2)\) on the ability fixed effects (\(\mu_f, \mu_s\)) and the lowest cut-off of item 2 (\(\theta^b_{2,1}\), which lives on the ability scale). Wide enough not to dominate informative data, but tight enough to keep the latent scale anchored.
\(\mathrm{Normal}(0, 1)\) on every log-gap component (\(\theta^b_{1,:}\) and \(\theta^b_{2,2:}\)). This places the prior 95% interval for each consecutive-cut-off gap at \((e^{-2}, e^{2}) \approx (0.14, 7.4)\). Bumping this up to \(\mathrm{Normal}(0, 2)\) (the prior we use on locations) would imply gaps anywhere in \((e^{-4}, e^{4}) \approx (0.02, 55)\), which is far too diffuse and gives the optimizer / sampler very little to hold on to in sparse-data regimes.
\(\mathrm{LogNormal}(0, 2)\) on the free item discriminations \(a_{2:J}\) (item 1’s discrimination is fixed to 1 as the scale reference, so it has no prior). Equivalent to \(\mathrm{Normal}(0, 2)\) on \(\log a_j\), putting the bulk of the mass on \(a_j \in (e^{-4}, e^{4})\) on the positive scale.
Half-Cauchy \(\mathrm{Cauchy}(0, 1)\) truncated to a small positive lower bound on the random-effect SDs: heavy-tailed, permits very small values (near-zero between-subject variability) while still allowing large ones.

10.2 Fitting MAP and comparing to MLE

We then use MAP(LaplaceI()) to fit the model, which adds the log-priors to the Laplace-approximated log-likelihood as a regularizer. To make the regularization actually matter, we downsample the simulated population to its first 50 subjects (irt_pop[1:50]). At that size, some categories of each item are only observed a handful of times, so the likelihood alone leaves several cut-offs and discriminations weakly identified:

fpm_irt_map = fit(irt_ordinal_model_map, irt_pop[1:50], init_irt, MAP(LaplaceI()))

[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter     Function value   Gradient norm 
     0     8.188248e+02     1.610446e+03
 * time: 1.4066696166992188e-5
     1     7.504969e+02     8.749098e+02
 * time: 1.4160079956054688
     2     7.089774e+02     1.006082e+02
 * time: 1.5388638973236084
     3     6.925979e+02     1.783234e+02
 * time: 1.6552560329437256
     4     6.659047e+02     5.086880e+02
 * time: 1.7707409858703613
     5     6.546412e+02     1.926765e+02
 * time: 1.9991250038146973
     6     6.420159e+02     2.389700e+02
 * time: 2.098465919494629
     7     6.344980e+02     3.002435e+02
 * time: 2.1976139545440674
     8     6.253718e+02     9.710727e+01
 * time: 2.2968571186065674
     9     6.234522e+02     2.739849e+01
 * time: 2.399775981903076
    10     6.212891e+02     6.323539e+01
 * time: 2.502761125564575
    11     6.195780e+02     6.192034e+01
 * time: 2.684354066848755
    12     6.182941e+02     2.404885e+01
 * time: 2.7825260162353516
    13     6.174628e+02     2.057114e+01
 * time: 2.883608102798462
    14     6.165491e+02     3.855207e+01
 * time: 2.983949899673462
    15     6.151824e+02     5.451810e+01
 * time: 3.097270965576172
    16     6.140967e+02     3.387723e+01
 * time: 3.2113049030303955
    17     6.135866e+02     8.740670e+00
 * time: 3.3626959323883057
    18     6.134429e+02     6.117447e+00
 * time: 3.4617109298706055
    19     6.133251e+02     9.539533e+00
 * time: 3.560352087020874
    20     6.130095e+02     1.932386e+01
 * time: 3.662714958190918
    21     6.125185e+02     2.390563e+01
 * time: 3.776412010192871
    22     6.119794e+02     1.831283e+01
 * time: 3.9129509925842285
    23     6.116847e+02     9.450470e+00
 * time: 4.0128021240234375
    24     6.115896e+02     4.018618e+00
 * time: 4.112075090408325
    25     6.115476e+02     4.064309e+00
 * time: 4.22104811668396
    26     6.114762e+02     6.402699e+00
 * time: 4.3521459102630615
    27     6.113168e+02     1.011667e+01
 * time: 4.451571941375732
    28     6.109975e+02     1.578319e+01
 * time: 4.550947904586792
    29     6.106509e+02     1.637345e+01
 * time: 4.674942970275879
    30     6.104040e+02     2.052358e+01
 * time: 4.773761034011841
    31     6.101593e+02     3.343998e+00
 * time: 4.878731966018677
    32     6.101439e+02     2.237561e+00
 * time: 5.001184940338135
    33     6.101287e+02     2.178054e+00
 * time: 5.099446058273315
    34     6.101142e+02     2.266599e+00
 * time: 5.1979899406433105
    35     6.100966e+02     2.625904e+00
 * time: 5.31984806060791
    36     6.100744e+02     3.410928e+00
 * time: 5.420573949813843
    37     6.100449e+02     2.547888e+00
 * time: 5.52009391784668
    38     6.100163e+02     1.569060e+00
 * time: 5.642934083938599
    39     6.100003e+02     1.424563e+00
 * time: 5.741283893585205
    40     6.099944e+02     1.286109e+00
 * time: 5.840868949890137
    41     6.099904e+02     1.152158e+00
 * time: 5.963870048522949
    42     6.099830e+02     9.712418e-01
 * time: 6.062372922897339
    43     6.099662e+02     1.097342e+00
 * time: 6.161432981491089
    44     6.099290e+02     1.516326e+00
 * time: 6.284941911697388
    45     6.098629e+02     2.189174e+00
 * time: 6.38494610786438
    46     6.097865e+02     1.924357e+00
 * time: 6.484205007553101
    47     6.097423e+02     8.576499e-01
 * time: 6.606229066848755
    48     6.097333e+02     6.700974e-01
 * time: 6.704832077026367
    49     6.097319e+02     6.565466e-01
 * time: 6.803848028182983
    50     6.097303e+02     6.562399e-01
 * time: 6.905033111572266
    51     6.097263e+02     6.279605e-01
 * time: 7.027538061141968
    52     6.097180e+02     7.845362e-01
 * time: 7.126235008239746
    53     6.097045e+02     9.536278e-01
 * time: 7.224438905715942
    54     6.096927e+02     7.077729e-01
 * time: 7.346159934997559
    55     6.096881e+02     4.254716e-01
 * time: 7.4472010135650635
    56     6.096875e+02     6.182014e-02
 * time: 7.546864032745361
    57     6.096875e+02     1.666577e-02
 * time: 7.669086933135986
    58     6.096875e+02     3.632520e-03
 * time: 7.7673680782318115
    59     6.096875e+02     1.289419e-03
 * time: 7.867412090301514
    60     6.096875e+02     3.058822e-04
 * time: 7.988351106643677

FittedPumasModel

Dynamical system type:          No dynamical model

Number of subjects:                             50

Observation records:         Active        Missing
    score1:                     250              0
    score2:                     250              0
    Total:                      500              0

Number of parameters:      Constant      Optimized
                                  0             10

Estimation method:                  MAP (LaplaceI)

--------------------
          Estimate
--------------------
μf        -0.65986
μs         0.051183
ωf         0.89282
ωs         0.045609
θb1₁       0.52516
θb1₂       0.035215
θb2₁      -1.5777
θb2₂       0.6691
θb2₃       0.30676
a_free₁    0.50861
--------------------

We can compare the MAP estimates (50 subjects, priors active) to the earlier LaplaceI estimates (full 300 subjects, MLE) by appending them to ci_df:

map_est = coef(fpm_irt_map)
ci_df.map_50 = vcat(
    [map_est.μf, map_est.μs, map_est.ωf, map_est.ωs],
    map_est.θb1,
    map_est.θb2,
    map_est.a_free,
)
simple_table(ci_df)


parameter	constant	estimate	se	relative_se	ci_lower	ci_upper	true_value	init_value	map_50

μf	false	-0.603	0.0957	0.159	-0.79	-0.415	-0.3	0.5	-0.66
μs	false	0.0448	0.00601	0.134	0.033	0.0566	0.04	0.2	0.0512
ωf	false	0.842	0.0983	0.117	0.649	1.03	0.8	0.4	0.893
ωs	false	0.0388	0.00864	0.223	0.0218	0.0557	0.02	0.1	0.0456
θb1₁	false	0.449	0.0477	0.106	0.355	0.542	0.405	-0.5	0.525
θb1₂	false	0.149	0.0818	0.549	-0.0114	0.309	0	-0.5	0.0352
θb2₁	false	-1.48	0.223	0.15	-1.92	-1.04	-1.0	0	-1.58
θb2₂	false	0.476	0.144	0.302	0.194	0.758	0.262	-0.5	0.669
θb2₃	false	0.111	0.15	1.35	-0.183	0.404	-0.105	-0.5	0.307
a_free₁	false	0.602	0.0844	0.14	0.437	0.768	0.75	0.5	0.509

The MAP estimates on 50 subjects should sit close to the true values for the well-identified parameters and shrink toward the prior mode for the weakly identified ones.

10.3 Marginal log-likelihood diagnostic

A useful diagnostic is to evaluate the marginal log-likelihood on the same 50-subject dataset at both the MAP estimate and the true parameters:

ll_map = loglikelihood(irt_ordinal_model_map, irt_pop[1:50], map_est, LaplaceI())
ll_true = loglikelihood(irt_ordinal_model_map, irt_pop[1:50], true_params, LaplaceI())
(; ll_map, ll_true, Δll = ll_map - ll_true)

(ll_map = -597.9485443464408, ll_true = -601.2558270522165, Δll = 3.3072827057757195)

In this case, the log-likelihood of the MAP estimate is similar to that of the true parameters on the 50-subject dataset, which implies a good fit to the observed data. Notice that the log-likelihood of the MAP estimate is actually higher than that of the true parameters on this dataset. This is because the true parameters, although they generated the data, are not guaranteed to be the best fit to any particular subset of it (especially a small one) due to sampling variability.

11 Markov chain Monte Carlo (MCMC)

To fully quantify the uncertainty in the parameters of the model in the presence of sparse data, a full Bayesian fit is recommended. Pumas has a No-U-Turn Sampler (NUTS)-based MCMC sampler. We can simply use the BayesMCMC algorithm with the same model definition as for MAP. Here, we start the sampler from the MAP estimate:

fpm_irt_bayes = fit(
    irt_ordinal_model_map,
    irt_pop[1:50],
    coef(fpm_irt_map),  # use the MAP estimate as the MCMC initial value
    BayesMCMC(; nsamples = 2_000, nadapts = 1_000, nchains = 4),
)
fpm_irt_bayes_samples = discard(fpm_irt_bayes; burnin = 1_000)

[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.
[ Info: Checking the initial parameter values.
[ Info: The initial log probability and its gradient are finite. Check passed.

Chains MCMC chain (1000×10×4 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 4
Samples per chain = 1000
Wall duration     = 127.89 seconds
Compute duration  = 127.72 seconds
parameters        = μf, μs, ωf, ωs, θb1₁, θb1₂, θb2₁, θb2₂, θb2₃, a_free₁

Use `describe(chains)` for summary statistics and quantiles.

11.1 Convergence diagnostics

The returned object is a collection of MCMC chains rather than a single point estimate. The standard summaries are obtained with describe (mean, std, ESS, R̂, quantiles, etc.):

describe(fpm_irt_bayes_samples)

Chains MCMC chain (1000×10×4 Array{Float64, 3}):

Iterations        = 1:1:1000
Number of chains  = 4
Samples per chain = 1000
Wall duration     = 127.89 seconds
Compute duration  = 127.72 seconds
parameters        = μf, μs, ωf, ωs, θb1₁, θb1₂, θb2₁, θb2₂, θb2₃, a_free₁

Summary Statistics
  parameters      mean       std      mcse    ess_bulk    ess_tail      rhat   ⋯
      Symbol   Float64   Float64   Float64     Float64     Float64   Float64   ⋯

          μf   -0.6869    0.2499    0.0041   3774.5341   2751.9469    1.0008   ⋯
          μs    0.0532    0.0181    0.0003   3292.5996   3055.7186    1.0007   ⋯
          ωf    0.9962    0.2456    0.0067   1315.4304   2099.2138    1.0063   ⋯
          ωs    0.0549    0.0231    0.0008    790.8375   1287.2277    1.0052   ⋯
        θb1₁    0.5437    0.1184    0.0022   2800.5125   2597.8038    1.0018   ⋯
        θb1₂    0.0379    0.1977    0.0028   5004.2390   2882.9351    1.0018   ⋯
        θb2₁   -1.8402    0.5852    0.0141   1925.1006   2226.2480    1.0009   ⋯
        θb2₂    0.7982    0.3087    0.0078   1576.1964   1983.9329    1.0014   ⋯
        θb2₃    0.4328    0.3226    0.0079   1684.9300   2458.5371    1.0021   ⋯
     a_free₁    0.4640    0.1480    0.0036   1521.4819   2258.7931    1.0018   ⋯
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5%
      Symbol   Float64   Float64   Float64   Float64   Float64

          μf   -1.2082   -0.8452   -0.6820   -0.5190   -0.2158
          μs    0.0182    0.0408    0.0531    0.0652    0.0888
          ωf    0.5492    0.8267    0.9865    1.1477    1.5019
          ωs    0.0091    0.0391    0.0546    0.0710    0.1004
        θb1₁    0.3064    0.4662    0.5478    0.6229    0.7685
        θb1₂   -0.3606   -0.0924    0.0471    0.1798    0.3992
        θb2₁   -3.2524   -2.1600   -1.7562   -1.4232   -0.9185
        θb2₂    0.2037    0.5882    0.7895    1.0003    1.4320
        θb2₃   -0.1868    0.2170    0.4281    0.6484    1.0741
     a_free₁    0.2325    0.3594    0.4458    0.5487    0.8150

Before reading anything off the posterior, glance at the rhat and ess_* columns. As a rule of thumb, rhat should be very close to 1.0 (certainly below 1.01) for every parameter, and the effective sample sizes (both bulk and tail) should be at least a few hundred per parameter. If rhat is much above 1.01 or ESS is much smaller than the nominal sample count, the chains have not mixed and the credible intervals below should not be trusted; rerun with more nsamples / nadapts, or revisit the model parameterization and priors.

11.2 Credible intervals

We can get 95% credible intervals for the parameters from the quantiles of the marginal posteriors:

quantile_df = DataFrame(quantile(fpm_irt_bayes_samples))
quantile_df.true_value = vcat(
    [true_params.μf, true_params.μs, true_params.ωf, true_params.ωs],
    true_params.θb1,
    true_params.θb2,
    true_params.a_free,
)
simple_table(quantile_df)


parameters	2.5%	25.0%	50.0%	75.0%	97.5%	true_value

:μf	-1.21	-0.845	-0.682	-0.519	-0.216	-0.3
:μs	0.0182	0.0408	0.0531	0.0652	0.0888	0.04
:ωf	0.549	0.827	0.987	1.15	1.5	0.8
:ωs	0.00906	0.0391	0.0546	0.071	0.1	0.02
:θb1₁	0.306	0.466	0.548	0.623	0.769	0.405
:θb1₂	-0.361	-0.0924	0.0471	0.18	0.399	0
:θb2₁	-3.25	-2.16	-1.76	-1.42	-0.918	-1.0
:θb2₂	0.204	0.588	0.789	1	1.43	0.262
:θb2₃	-0.187	0.217	0.428	0.648	1.07	-0.105
:a_free₁	0.233	0.359	0.446	0.549	0.815	0.75

If the true parameters are not mostly contained in their 95% credible intervals, this could imply that the priors are not weak enough to let the data speak for itself. For more on Bayesian estimation in Pumas, see the Bayesian tutorials in Pumas.

12 Extensions

The model we have built and fitted is intentionally minimal: a linear scalar latent ability \(f_i(t)\) with OrderedLogit observations on two items, no covariates, and no treatment effect. There are at least four ways to extend the IRT model presented here. Each extension requires only a localized change to the model, with the rest of the workflow (simobs, read_pumas, fit(..., LaplaceI()) / MAP(LaplaceI()) / BayesMCMC) unchanged.

12.1 Different disease progression function forms

The linear model \(f_i(t) = f_{0,i} + s_i \, t\) assumes the latent ability changes at a constant rate. In many clinical settings, the disease severity saturates (it changes quickly at first and then approaches a plateau), and a natural parametric alternative is the EMAX form

\[ f_i(t) = f_{0,i} + E_{\max,i} \cdot \frac{t}{EC_{50} + t}, \]

where \(f_{0,i}\) is the subject-specific baseline ability, \(E_{\max,i}\) is the maximum change from baseline (so the asymptote as \(t \to \infty\) is \(f_{0,i} + E_{\max,i}\)), and \(EC_{50} > 0\) is the time at which half of the maximum change is reached. The progression rate at \(t = 0\) is \(E_{\max,i} / EC_{50}\), and the curve approaches its plateau monotonically. Other parametric forms (exponential, Weibull, indirect-response, mechanistic ODE-based, etc.) slot in the same way. Each is a localized change to the @pre block, optionally the @dynamics block, and their associated parameters in @param / @random; the rest of the model (item cut-offs, discriminations, OrderedLogit observation, fitting with LaplaceI) is unaffected.

Sometimes no fixed parametric form is plausible; for example, the trajectory may be non-monotonic and its structure is difficult to specify a priori. In such cases, the structural part of \(f_i(t)\) can be replaced by a neural network:

\[ f_i(t) = \mathrm{NN}_{\boldsymbol{\phi}}\!\left(t,\; \boldsymbol{\eta}_i,\; \boldsymbol{x}_i\right), \]

where \(\boldsymbol{\eta}_i\) are subject-specific random effects, \(\boldsymbol{x}_i\) are covariates, and \(\boldsymbol{\phi}\) are the (population-level) network weights, turning the model to a DeepNLME model. This lets a neural component be embedded directly in @pre (using DeepPumas) alongside the usual @param / @random / @derived machinery. From the rest of the IRT model’s point of view, nothing changes (f is still just a scalar function of time and individual effects), so the same OrderedLogit observation block, simobs, read_pumas, and fit(..., LaplaceI()) workflow apply.

12.2 Covariate and treatment effects

The latent ability in our model depends only on time and subject-specific random effects. In practice, \(f_i(t)\) is often also driven by baseline covariates (such as age, sex, disease subtype, or a baseline biomarker) and by treatment (active vs. placebo, dose, regimen). These can be defined either directly in @pre or using an ODE in the @dynamics block, e.g., using a pharmacokinetic model to drive the disease progression via an exposure-response relationship. The key point is that the IRT-specific parts of the model (item cut-offs, discriminations, OrderedLogit observation) are unaffected by the addition of covariates and treatment effects, so the same workflow applies. If complex covariate relationships are expected, a neural network can be used to flexibly model the covariate effects on the latent trajectory.

12.3 Bounded integer model

The extensions above all leave the OrderedLogit likelihood untouched. The next extension is to replace that likelihood. An alternative is the bounded integer (BI) model, in which the observed score \(y_{i,j} \in \{1, \dots, K_j\}\) arises by discretizing a latent continuous variable on the score scale through a Gaussian CDF, rather than a logistic function as in the OrderedLogit.

12.4 Multi-dimensional IRT

The fourth extension touches the latent state itself. The model above assumes a single scalar latent ability \(f_i(t)\). When the items are driven by multiple, partially independent latent dimensions (for instance, motor function and cognition, or fatigue and pain), the unidimensional assumption (cf. the warning at the top of this tutorial) is violated and a multi-dimensional IRT model is appropriate. The latent state becomes a vector \(\boldsymbol{f}_i(t) \in \mathbb{R}^D\), and each item’s OrderedLogit threshold reads off a linear combination of the latent dimensions via item-specific loadings \(\boldsymbol{w}_j \in \mathbb{R}^D\):

\[ \eta_{j,k} \;=\; a_j \left(b_{j,k} - \boldsymbol{w}_j^\top \boldsymbol{f}_i(t)\right). \]

The multi-dimensional latent trajectory \(\boldsymbol{f}_i(t)\) can be built in two complementary ways:

Traditional disease progression models. Each component of \(\boldsymbol{f}_i(t)\) is given its own structural form (linear, EMAX, exponential, ODE-based, …), with its own fixed and random effects. This is the natural choice when each latent dimension corresponds to a clinically interpretable construct with a well-understood time course.
DeepNLME (using DeepPumas). \(\boldsymbol{f}_i(t) = \mathrm{NN}_{\boldsymbol{\phi}}(t, \boldsymbol{\eta}_i, \boldsymbol{x}_i)\) is the output of a neural network with \(D\) outputs. This is well suited to high-dimensional or multi-modal trajectories where no single parametric form fits well across patients, or to settings with many covariates whose interactions with time are non-trivial.

In both cases, the rest of the IRT model (item cut-offs, discriminations, OrderedLogit observation) is unchanged from the unidimensional case.

Reuse

CC BY-SA 4.0