using Pumas
using DataFrames
using DataFramesMeta
using Random
using LinearAlgebra
using CairoMakie
using AlgebraOfGraphics
using SummaryTables
using PumasUtilities
using Pumas: OrderedLogit
Random.seed!(123)
Item Response Theory
Item response theory (IRT) describes how a set of discrete questionnaire or clinical-test responses (“items”) depends on an unobserved latent trait (such as disease severity). Each item has its own characteristic curve, driven by a difficulty (cut-off) and a discrimination, and the latent trait can vary over time and across subjects via a nonlinear mixed-effects (NLME) progression model. This tutorial walks through the different ways of defining and fitting an IRT model in Pumas.
1 Learning objectives
By the end of this tutorial you will be able to:
- Describe the bounded ordinal IRT model with item-specific discriminations.
- Simulate ordinal longitudinal responses from known parameters.
- Fit the IRT model by marginal likelihood maximization.
- Obtain asymptotic confidence intervals of the population parameters.
- Add weakly informative priors on the population parameters and fit the Bayesian model with maximum a-posteriori (MAP) estimation or Markov chain Monte Carlo (MCMC).
First, we load the necessary packages.
2 What is IRT?
IRT models describe the progression of multiple ordinal outcomes (e.g., questionnaire items, clinical tests) as a function of a latent trait (e.g., disease severity). The different outcomes are called items and the latent trait is often called ability. Each item has its own response curve, often called an item characteristic curve (ICC), which maps the latent trait to a probability of a certain response.
In a clinical setting, we might have a few tests that measure different levels of disease severity at different time points. Each test has its own characteristics (difficulty, discrimination) that determine how likely a patient with a certain disease severity is to respond positively to that test. IRT allows us to model these relationships and estimate both the latent trait (disease severity) and the item parameters (difficulty, discrimination) simultaneously from observed responses. For instance, for test A, a patient with a disease severity of 0.5 might have a 70% chance of responding positively (positive is bad), while the same patient might only have a 30% chance of responding positively to test B. As the disease severity increases, the probability of responding positively to both tests will increase, but at different rates depending on the item/test parameters.
The disease severity (or ability) is a latent trait that we cannot directly observe, but we can infer it from the observed (discrete) responses to the items.
A mixture of discrete and continuous outcomes can also be modeled in a joint IRT framework, driven by a shared latent disease severity variable. For example, we might have a continuous biomarker that is measured alongside the discrete item responses, and we can model both types of outcomes together to get a more comprehensive understanding of the disease progression.
Simple IRT models assume that the latent trait (ability) is unidimensional and that all the items measure the same underlying trait. A violation of this assumption can lead to biased parameter estimates and incorrect inferences. If increasing the underlying disease severity leads to a higher score on some items but a lower score on others, then the unidimensionality assumption is likely violated and a more complex model (e.g., multidimensional IRT) may be needed. This can be checked during exploratory data analysis by looking at the correlation structure of the items’ responses. If the items are all positively correlated, this is a good sign that they are measuring the same underlying trait. On the other hand, if some items are negatively correlated with others, this may indicate that they are measuring different traits.
3 Bounded ordinal score models
With the conceptual picture in place, we now fix the notation that will carry us through the rest of the tutorial. Consider a study with \(N\) subjects responding to \(J\) items, where each item has a bounded ordinal response. The model below is the graded response model (GRM) of Samejima (1969), the polytomous extension of standard IRT, with item-specific discriminations. We use the following notation:
- \(i \in \{1, \dots, N\}\): subject index
- \(j \in \{1, \dots, J\}\): item index
- \(K_j\): number of response categories for item \(j\), so the response \(y_{i,j}\) takes values in \(\{1, 2, \dots, K_j\}\)
- \(k \in \{1, 2, \dots, K_j\}\): category index
- \(y_{i,j}\): observed ordinal response of subject \(i\) to item \(j\)
- \(f_i \in \mathbb{R}\): latent ability of subject \(i\)
- \(a_j > 0\): discrimination parameter of item \(j\) (controls the slope of the response curve)
- \(b_{j,k}\): cut-off (difficulty) for item \(j\) at category threshold \(k\), with \(b_{j,1} < b_{j,2} < \dots < b_{j,K_j - 1}\) ordered (\(K_j - 1\) cut-offs in total)
Pumas’ OrderedLogit distribution assumes the response takes values in \(\{1, 2, \dots, K_j\}\), i.e., consecutive integers starting at 1. If your observed scores are ordinal but encoded differently (e.g., \(\{0, 1, 2, 3\}\), \(\{-2, -1, 0, 1, 2\}\), non-consecutive integers like \(\{0, 1, 2, 4\}\), fractional values such as \(\{0.0, 0.125, 0.25, \dots, 3.0\}\), or textual labels like "low" / "medium" / "high"), map them to \(1, 2, \dots, K_j\) before passing them to read_pumas or simobs. The mapping must preserve the ordering, and the same mapping should be used consistently for all subjects and time points so that a given integer always denotes the same category.
3.1 Category probabilities
The probability of each ordinal category is then defined as the difference between two cumulative probabilities. The cumulative probabilities are defined by the logistic (inverse logit) function as follows:
\[ \begin{aligned} P(y_{i,j} = k) & = P(y_{i,j} \leq k) - P(y_{i,j} \leq k - 1) \quad \forall k \in 1 \dots K_j \\ P(y_{i,j} \leq k) & = \begin{cases} 0 & k = 0 \\ 1 & k = K_j \\ \mathrm{logit}^{-1}\!\left(a_j \cdot (b_{j,k} - f_i)\right) & 1 \leq k \leq K_j - 1 \end{cases} \end{aligned} \]
Each cut-off \(b_{j,k}\) governs the cumulative probability \(P(y_{i,j} \leq k)\) of being in category \(k\) or below, with \(k = 1, \dots, K_j - 1\) indexing the \(K_j - 1\) cut-offs. The discrimination \(a_j > 0\) controls how sharply the response probability changes with ability: items with larger \(a_j\) have sharper transitions between categories (a small change in ability moves a lot of probability mass), while items with smaller \(a_j\) have flatter ICCs.
The OrderedLogit distribution in Pumas. OrderedLogit(η) takes a vector \(\eta \in \mathbb{R}^{K-1}\) of (strictly increasing) values and defines a discrete distribution over \(\{1, 2, \dots, K\}\) via
\[ P(y \leq k) = \mathrm{logit}^{-1}(\eta_k), \qquad k = 1, \dots, K - 1, \]
with the boundary cases \(P(y \leq 0) = 0\) and \(P(y \leq K) = 1\). Each \(\eta_k\) is therefore the log-odds of being in category \(k\) or lower. Per-category probabilities are obtained by differencing:
\[ P(y = k) = P(y \leq k) - P(y \leq k - 1). \]
In the IRT model above, we set \(\eta_k = a_j (b_{j,k} - f_i)\), so larger ability \(f_i\) pushes \(\eta\) downward and shifts probability mass toward higher categories. The ordering \(b_{j,1} < \dots < b_{j,K_j - 1}\) implies \(\eta_1 < \dots < \eta_{K_j - 1}\), which is exactly the monotonicity that OrderedLogit requires.
For a more detailed treatment of ordinal regression with OrderedLogit in Pumas (including exploratory data analysis, covariate effects, and category-probability diagnostics), see the dedicated Ordinal Regression tutorial, which the IRT model here builds upon.
3.2 Example: visualizing ICCs
To build intuition, the figure below shows item characteristic curves (ICCs) for a single ordinal item with \(K = 4\) categories (\(y \in \{1, 2, 3, 4\}\) and \(K - 1 = 3\) cut-offs \(b_{j,1}, b_{j,2}, b_{j,3}\)). The left panel shows the cumulative probabilities \(P(y_{i,j} \leq k)\) as a function of ability \(f_i\), and the right panel shows the per-category probabilities \(P(y_{i,j} = k)\).
logistic(x) = 1 / (1 + exp(-x))
fi = -4:0.05:4
# Single item: 4 categories, 3 cut-offs
a_ex = 1.5
b_ex = [-1.0, 0.0, 1.2]
# Cumulative probabilities P(y ≤ k) for k = 1, 2, 3
cum_probs = [logistic.(a_ex .* (b .- fi)) for b in b_ex]
# Category probabilities P(y = k) for k = 1, 2, 3, 4
p_y1 = cum_probs[1]
p_y2 = cum_probs[2] .- cum_probs[1]
p_y3 = cum_probs[3] .- cum_probs[2]
p_y4 = 1 .- cum_probs[3]
icc_cum_df = DataFrame(
ability = repeat(collect(fi), 3),
probability = vcat(cum_probs...),
threshold = repeat(["P(y ≤ 1)", "P(y ≤ 2)", "P(y ≤ 3)"]; inner = length(fi)),
)
icc_cat_df = DataFrame(
ability = repeat(collect(fi), 4),
probability = vcat(p_y1, p_y2, p_y3, p_y4),
category = repeat(["y = 1", "y = 2", "y = 3", "y = 4"]; inner = length(fi)),
)
fig = Figure(; size = (1100, 380))
ag1 = draw!(
fig[1, 1],
data(icc_cum_df) *
mapping(:ability => "ability (f)", :probability, color = :threshold => nonnumeric) *
visual(Lines; linewidth = 3);
axis = (; title = "Cumulative ICCs"),
)
legend!(fig[1, 2], ag1; tellwidth = true)
ag2 = draw!(
fig[1, 3],
data(icc_cat_df) *
mapping(:ability => "ability (f)", :probability, color = :category => nonnumeric) *
visual(Lines; linewidth = 3);
axis = (; title = "Per-category ICCs"),
)
legend!(fig[1, 4], ag2; tellwidth = true)
figThe cumulative curves are ordered logistic functions of ability, each centred at one cut-off \(b_{j,k}\) and sharing the slope \(a_j\). The per-category probabilities sum to 1 at every value of \(f_i\); the middle categories appear as bumps between adjacent cut-offs, while the extreme categories (\(y = 1\) and \(y = K_j\)) are monotone in ability: \(P(y = 1)\) decreasing and \(P(y = K_j)\) increasing as \(f_i\) grows.
4 A Pumas ordinal IRT model with a linear latent ability
Having grounded the cross-sectional picture in a plot, we now turn it into a longitudinal Pumas model. Concretely, we build a graded response model with \(J = 2\) items (score1, score2), each with \(K = 4\) response categories, and each subject measured at multiple time points. In this first example, the latent ability is taken to be linear in time:
\[ \begin{aligned} f_i(t) & = f_{0,i} + s_i \cdot t \\ f_{0,i} & \sim \mathrm{Normal}\!\left(\mu_f,\; \omega_f^2\right) \\ s_i & \sim \mathrm{Normal}\!\left(\mu_s,\; \omega_s^2\right) \end{aligned} \]
Both the intercept \(f_{0,i}\) and the slope \(s_i\) of this linear trajectory carry fixed and random effects: the population means \(\mu_f, \mu_s\) describe the typical disease trajectory, while the per-subject draws \(f_{0,i}, s_i\) around those means capture inter-individual variability. This makes the model a nonlinear mixed effects (NLME) model. Concretely, the population-level parameters are:
- \(\mu_f, \mu_s\): fixed-effect (population) mean intercept and progression slope of ability
- \(\omega_f^2, \omega_s^2\): variances of the random effects on the intercept and slope (so \(\omega_f, \omega_s\) are SDs)
- \(f_{0,i}, s_i\): subject-specific random effects (the actual baseline ability and slope for subject \(i\))
The item-level parameters per item \(j \in \{1, 2\}\) are:
- \(b_j \in \mathbb{R}^{K_j - 1}\): ordered per-category cut-offs of item \(j\), with \(b_{j,1} < b_{j,2} < \dots < b_{j,K_j - 1}\). The length of \(b_j\) is the number of cut-offs, one fewer than the number of categories. Each item has its own cut-off parameter, accommodating items with different numbers of categories.
- \(a_j > 0\): discrimination of item \(j\) (positive scalar).
The OrderedLogit threshold input for item \(j\) at time \(t\) is \(\eta_{j,k} = a_j \cdot (b_{j,k} - f_i(t))\) for each cut-off \(k = 1, \dots, K_j - 1\), matching the bounded ordinal model defined earlier: \(\mathrm{logit}^{-1}(\eta_{j,k}) = P(y_{i,j} \leq k)\).
We stick with the linear form throughout the rest of this tutorial because it keeps the focus on the IRT-specific parts of the model. More generally, any disease progression model can play the role of \(f_i(t)\), such as exponential, EMAX, indirect-response, or mechanistic ODE-based forms. Switching between progressions is a localized change to @pre (and optionally @dynamics) and its associated parameters; the rest of the model (item cut-offs, discriminations, OrderedLogit observation) is unaffected. We come back to this in the Extensions section.
4.1 Enforcing ordering via an unconstrained parameterization
The ordered cut-off vector \(b_j\) lives in a constrained region of \(\mathbb{R}^{K_j - 1}\), namely the strictly increasing sequences. To let the optimizer work in a simple, unconstrained space, we introduce an unconstrained surrogate \(\theta^b_j \in \mathbb{R}^{K_j - 1}\) (called θb1 and θb2 in the Pumas @param block) and obtain the ordered cut-offs \(b_j\) from \(\theta^b_j\) via a cumulative log-exp transform:
\[ \begin{aligned} b_{j,1} & = \theta^b_{j,1} & & \text{(first cut-off is unconstrained on the real line)} \\ \theta^b_{j,k} & = \log\!\left(b_{j,k} - b_{j,k-1}\right), \quad k > 1 & & \text{(log of the gap between consecutive cut-offs)} \\ b_{j,k} & = b_{j,k-1} + e^{\theta^b_{j,k}}, \quad k > 1 & & \text{(reconstruct ordered cut-offs)} \end{aligned} \]
So when item \(j\)’s first cut-off is estimated:
- \(\theta^b_{j,1}\) is the location of the lowest cut-off on the ability scale, free to take any real value.
- \(\theta^b_{j,k}\) for \(k > 1\) is the logarithm of the gap between consecutive cut-offs. Because \(e^{\theta^b_{j,k}} > 0\), the reconstructed \(b_{j,k}\) sequence is strictly increasing for any value of \(\theta^b_j\), so the ordering constraint is satisfied automatically. The \(\theta\) prefix distinguishes them from the ordered cut-offs \(b_j\).
When item \(j\)’s first cut-off is instead fixed (which we do for item 1 below, to anchor the latent scale), the lowest cut-off drops out of \(\theta^b_j\) entirely: \(\theta^b_j\) then consists only of log-gaps, has length \(K_j - 2\), and the reconstruction prepends the fixed value before applying the same cumulative log-exp transform. So the meaning of \(\theta^b_{j,1}\) differs between the anchor item (a log-gap) and the rest (a location).
The reparameterization is implemented as a pair of plain Julia functions used both inside the model and when post-processing the fit:
# Map an unconstrained vector θ ∈ R^m to an ordered vector b ∈ R^m:
# b[1] = θ[1], b[k] = b[k-1] + exp(θ[k]) for k > 1.
function to_ordered(θ)
b = similar(θ)
b[1] = θ[1]
for k = 2:length(θ)
b[k] = b[k-1] + exp(θ[k])
end
return b
end
# Inverse: map an ordered vector b ∈ R^m to its unconstrained θ ∈ R^m:
# θ[1] = b[1], θ[k] = log(b[k] - b[k-1]) for k > 1.
function to_raw(b)
θ = similar(b)
θ[1] = b[1]
for k = 2:length(b)
θ[k] = log(b[k] - b[k-1])
end
return θ
end4.2 Pinning a reference cut-off for location identifiability
The model is invariant under a joint shift of the latent ability \(f\) and all cut-offs \(b_{j,k}\) by the same constant, so without an external constraint, the location of \(f\) (and hence of every \(b_{j,k}\)) is unidentified. Only the relative positions of the cut-offs across items carry information; the common origin is a convention. To make the parameters recoverable, we therefore pick any one cut-off and pin it to a fixed value (a “reference cut-off”), and let everything else vary freely. The choice of which cut-off is arbitrary (it does not have to be the lowest, the highest, or come from any particular item). For concreteness, we use
\[ b_{1,1} \;=\; 0 \]
throughout this tutorial. In the Pumas model, this is hard-coded directly inside @pre rather than declared in @param, so \(b_{1,1}\) is not a fitted parameter at all. Concretely, \(\theta^b_1\) now holds only the \(K_1 - 2\) log-gaps of item 1, and the ordered cut-offs are reconstructed by prepending \(0\) before applying the cumulative transform.
4.3 Pinning a reference discrimination for scale identifiability
Location is not the only invariance. The model is also invariant under a joint rescaling: for any \(c > 0\), mapping \(f_i \to c f_i\), \(b_{j,k} \to c b_{j,k}\), and \(a_j \to a_j / c\) leaves \(\eta_{j,k} = a_j(b_{j,k} - f_i)\) unchanged. Fixing \(b_{1,1} = 0\) does not resolve this; it only kills the shift. We therefore additionally pin the discrimination of item 1,
\[ a_1 \;=\; 1, \]
also hard-coded in @pre. Only the remaining discriminations \(a_2, \dots, a_J\) are estimated, and they are interpreted relative to item 1: \(a_j = 1.5\) means item \(j\)’s response curve is 1.5× as steep as item 1’s, not 1.5 on some absolute scale.
4.4 The Pumas model
The Pumas model for the IRT model with a linear latent ability trajectory is defined as follows.
n_items = 2
# Per-item number of categories; can differ across items.
# Each item j has n_cats[j] categories and n_cats[j] - 1 cut-offs.
n_cats = [4, 4]
m = n_cats .- 1
irt_ordinal_model = @model begin
@param begin
# Ability fixed effects (population intercept and slope)
μf ∈ RealDomain()
μs ∈ RealDomain()
# Ability random-effect SDs (subject-level variability in intercept and slope)
ωf ∈ RealDomain(; lower = 0.05)
ωs ∈ RealDomain(; lower = 1e-3)
# Per-item log-gap vectors. Item 1's first cut-off is hard-coded to 0
# in @pre as the location reference (not estimated), so θb1 contributes
# only m[1] - 1 free entries (the log-gaps). Item 2 contributes m[2]
# entries (first cut-off + log-gaps).
θb1 ∈ VectorDomain(m[1] - 1)
θb2 ∈ VectorDomain(m[2])
# Discriminations for items 2..n_items, on the positive scale. Item 1's
# discrimination is hard-coded to 1 in @pre as the scale reference, so
# a_free has length n_items - 1.
a_free ∈ VectorDomain(n_items - 1; lower = 0.0)
end
@random begin
# Subject-specific intercept and slope drawn directly from their
# population distributions, parameterized by mean and SD
f0 ~ Normal(μf, ωf)
slope ~ Normal(μs, ωs)
end
@pre begin
# Latent ability trajectory for the subject
f = f0 + slope * t
# Reconstruct ordered cut-offs b_j. For item 1, prepend the reference
# cut-off (hard-coded to 0) before the log-gaps θb1.
b1_ordered = to_ordered(vcat(0.0, θb1))
b2_ordered = to_ordered(θb2)
# Full discrimination vector with a[1] = 1 fixed as the scale reference.
a = vcat(1.0, a_free)
# OrderedLogit thresholds: η_{j,k} = a_j (b_{j,k} - f)
lle1 = a[1] .* (b1_ordered .- f)
lle2 = a[2] .* (b2_ordered .- f)
end
@derived begin
score1 ~ @. OrderedLogit(lle1)
score2 ~ @. OrderedLogit(lle2)
end
endPumasModel
Parameters: μf, μs, ωf, ωs, θb1, θb2, a_free
Random effects: f0, slope
Covariates:
Dynamical system variables:
Dynamical system type: No dynamical model
Derived: score1, score2
Observed: score1, score2
5 Simulate data
With the model defined, we generate a synthetic dataset that we can later fit it on, so that we can compare the recovered estimates against parameter values we actually know. We simulate ordinal responses for 300 subjects measured at 5 time points (e.g., weeks 0, 4, 8, 12, 24) from fixed ordinal parameter values using Pumas’ simobs. First we build a template population (one Subject per id), then call simobs at the true parameter values.
n_subjects = 300
obs_times = [0.0, 4.0, 8.0, 12.0, 24.0]
# Target ordered cut-offs per item (length m[j] = n_cats[j] - 1).
# Item 1 anchors the latent scale: its first cut-off is fixed at 0.
b1_ordered_true = [0.0, 1.5, 2.5]
b2_ordered_true = [-1.0, 0.3, 1.2]
# Drop the (fixed-at-0) first cut-off of item 1; keep only the log-gaps,
# matching the @param decomposition where θb1 holds the log-gaps alone.
# a_free holds only the free discriminations (items 2..n_items); item 1's
# discrimination is fixed to 1 inside @pre as the scale reference.
true_params = (
μf = -0.3,
ωf = 0.8,
μs = 0.04,
ωs = 0.02,
θb1 = to_raw(b1_ordered_true)[2:end],
θb2 = to_raw(b2_ordered_true),
a_free = [0.75],
)
template_pop = map(1:n_subjects) do i
Subject(; id = i, time = obs_times)
end
sim_obs = simobs(irt_ordinal_model, template_pop, true_params)
irt_df = DataFrame(sim_obs)
first(irt_df, 10)| Row | id | time | score1 | score2 | evid | f0 | slope | f | b1_ordered | b2_ordered | a | lle1 | lle2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| String | Float64 | Int64? | Int64? | Int64? | Float64? | Float64? | Float64? | Array…? | Array…? | Array…? | Array…? | Array…? | |
| 1 | 1 | 0.0 | 2 | 4 | 0 | 0.482736 | 0.00734241 | 0.482736 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-0.482736, 1.01726, 2.01726] | [-1.11205, -0.137052, 0.537948] |
| 2 | 1 | 4.0 | 2 | 1 | 0 | 0.482736 | 0.00734241 | 0.512106 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-0.512106, 0.987894, 1.98789] | [-1.13408, -0.159079, 0.515921] |
| 3 | 1 | 8.0 | 2 | 2 | 0 | 0.482736 | 0.00734241 | 0.541475 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-0.541475, 0.958525, 1.95852] | [-1.15611, -0.181106, 0.493894] |
| 4 | 1 | 12.0 | 1 | 2 | 0 | 0.482736 | 0.00734241 | 0.570845 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-0.570845, 0.929155, 1.92916] | [-1.17813, -0.203134, 0.471866] |
| 5 | 1 | 24.0 | 2 | 4 | 0 | 0.482736 | 0.00734241 | 0.658954 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-0.658954, 0.841046, 1.84105] | [-1.24422, -0.269215, 0.405785] |
| 6 | 2 | 0.0 | 1 | 4 | 0 | 1.56184 | 0.0543695 | 1.56184 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-1.56184, -0.0618438, 0.938156] | [-1.92138, -0.946383, -0.271383] |
| 7 | 2 | 4.0 | 3 | 4 | 0 | 1.56184 | 0.0543695 | 1.77932 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-1.77932, -0.279322, 0.720678] | [-2.08449, -1.10949, -0.434491] |
| 8 | 2 | 8.0 | 1 | 4 | 0 | 1.56184 | 0.0543695 | 1.9968 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-1.9968, -0.4968, 0.5032] | [-2.2476, -1.2726, -0.5976] |
| 9 | 2 | 12.0 | 2 | 4 | 0 | 1.56184 | 0.0543695 | 2.21428 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-2.21428, -0.714278, 0.285722] | [-2.41071, -1.43571, -0.760708] |
| 10 | 2 | 24.0 | 4 | 3 | 0 | 1.56184 | 0.0543695 | 2.86671 | [0.0, 1.5, 2.5] | [-1.0, 0.3, 1.2] | [1.0, 0.75] | [-2.86671, -1.36671, -0.366712] | [-2.90003, -1.92503, -1.25003] |
5.1 Checking category coverage
Before fitting, it is worth checking that every response category for every score is observed at every time point; otherwise some cut-offs will be poorly identified by the likelihood:
@chain irt_df begin
groupby([:time, :score1])
combine(nrow => :n)
simple_table
end| Time | score1 | n |
| 0 | 1 | 197 |
| 0 | 2 | 62 |
| 0 | 3 | 26 |
| 0 | 4 | 15 |
| 4 | 1 | 172 |
| 4 | 2 | 87 |
| 4 | 3 | 26 |
| 4 | 4 | 15 |
| 8 | 1 | 159 |
| 8 | 2 | 88 |
| 8 | 3 | 31 |
| 8 | 4 | 22 |
| 12 | 1 | 140 |
| 12 | 2 | 100 |
| 12 | 3 | 39 |
| 12 | 4 | 21 |
| 24 | 1 | 142 |
| 24 | 2 | 69 |
| 24 | 3 | 43 |
| 24 | 4 | 46 |
@chain irt_df begin
groupby([:time, :score2])
combine(nrow => :n)
simple_table
end| Time | score2 | n |
| 0 | 1 | 111 |
| 0 | 2 | 75 |
| 0 | 3 | 43 |
| 0 | 4 | 71 |
| 4 | 1 | 113 |
| 4 | 2 | 68 |
| 4 | 3 | 30 |
| 4 | 4 | 89 |
| 8 | 1 | 95 |
| 8 | 2 | 65 |
| 8 | 3 | 48 |
| 8 | 4 | 92 |
| 12 | 1 | 101 |
| 12 | 2 | 61 |
| 12 | 3 | 47 |
| 12 | 4 | 91 |
| 24 | 1 | 71 |
| 24 | 2 | 54 |
| 24 | 3 | 49 |
| 24 | 4 | 126 |
All four categories appear with a reasonable count at every time point for both items, so we proceed.
5.2 Reading into a Pumas population
We convert the simulated DataFrame into a Pumas population, ready for fitting:
irt_pop = read_pumas(
irt_df;
id = :id,
time = :time,
observations = [:score1, :score2],
event_data = false,
)Population
Subjects: 300
Observations: score1, score2
6 Fit the IRT model with LaplaceI
With the simulated population defined, we now estimate the population parameters using the LaplaceI() algorithm. We deliberately initialize the optimizer away from the true values so that recovery is not trivial:
# Initial unconstrained θb vectors, split out for clarity:
# - θb1_init holds only the log-gaps of item 1 (length m[1] - 1); the first
# cut-off is hard-coded to 0 inside @pre and is not estimated.
# - θb2_init holds the first cut-off and log-gaps of item 2 (length m[2]).
θb1_init = fill(-0.5, m[1] - 1)
θb2_init = vcat(0.0, fill(-0.5, m[2] - 1))
init_irt = (
μf = 0.5,
ωf = 0.4,
μs = 0.2,
ωs = 0.1,
θb1 = θb1_init,
θb2 = θb2_init,
a_free = fill(0.5, n_items - 1),
)(μf = 0.5, ωf = 0.4, μs = 0.2, ωs = 0.1, θb1 = [-0.5, -0.5], θb2 = [0.0, -0.5, -0.5], a_free = [0.5])
fpm_irt_laplace = fit(irt_ordinal_model, irt_pop, init_irt, LaplaceI())[ Info: Checking the initial parameter values. [ Info: The initial negative log likelihood and its gradient are finite. Check passed. Iter Function value Gradient norm 0 4.800428e+03 4.779925e+03 * time: 0.07552695274353027 1 4.245452e+03 1.013823e+03 * time: 8.973871946334839 2 4.158189e+03 3.264277e+02 * time: 10.426065921783447 3 3.977290e+03 1.242868e+03 * time: 13.118886947631836 4 3.822159e+03 1.826704e+03 * time: 14.558562994003296 5 3.733687e+03 5.928409e+02 * time: 15.98140811920166 6 3.704887e+03 3.621482e+02 * time: 17.457737922668457 7 3.687178e+03 4.735016e+02 * time: 18.76045513153076 8 3.659058e+03 1.472809e+02 * time: 20.105556964874268 9 3.654377e+03 9.295584e+01 * time: 21.43154501914978 10 3.641771e+03 3.179643e+02 * time: 22.68164610862732 11 3.627748e+03 4.153773e+02 * time: 24.008630990982056 12 3.616755e+03 1.477176e+02 * time: 25.30159592628479 13 3.614091e+03 3.115247e+01 * time: 27.54238200187683 14 3.613411e+03 2.419889e+01 * time: 30.2689311504364 15 3.612684e+03 4.667978e+01 * time: 32.18584609031677 16 3.611347e+03 6.251912e+01 * time: 33.562819957733154 17 3.610067e+03 4.765711e+01 * time: 34.89968800544739 18 3.609369e+03 1.652206e+01 * time: 36.200438022613525 19 3.609052e+03 1.704878e+01 * time: 37.53099513053894 20 3.608696e+03 2.589193e+01 * time: 38.838845014572144 21 3.608037e+03 4.448229e+01 * time: 40.20035910606384 22 3.606979e+03 5.243231e+01 * time: 41.456571102142334 23 3.605960e+03 3.656687e+01 * time: 42.752665996551514 24 3.605339e+03 1.755167e+01 * time: 44.0217399597168 25 3.604951e+03 1.569934e+01 * time: 45.23465299606323 26 3.604685e+03 2.354730e+01 * time: 46.54598093032837 27 3.604013e+03 3.901258e+01 * time: 47.8078510761261 28 3.603063e+03 4.595394e+01 * time: 49.598448038101196 29 3.602303e+03 3.541563e+01 * time: 51.248960971832275 30 3.601964e+03 1.522746e+01 * time: 52.7619731426239 31 3.601848e+03 1.157073e+01 * time: 54.35477805137634 32 3.601795e+03 1.111696e+01 * time: 55.8671350479126 33 3.601675e+03 1.172637e+01 * time: 57.292197942733765 34 3.601483e+03 1.638165e+01 * time: 58.893232107162476 35 3.601269e+03 1.431190e+01 * time: 61.269081115722656 36 3.601149e+03 8.426453e+00 * time: 64.09379696846008 37 3.601094e+03 6.148582e+00 * time: 65.4372010231018 38 3.601059e+03 6.060319e+00 * time: 66.63735795021057 39 3.600998e+03 1.146124e+01 * time: 67.89059114456177 40 3.600882e+03 1.640790e+01 * time: 69.08458995819092 41 3.600728e+03 1.643570e+01 * time: 70.29219794273376 42 3.600601e+03 9.916291e+00 * time: 71.50445795059204 43 3.600552e+03 2.555235e+00 * time: 72.67200994491577 44 3.600546e+03 2.600715e+00 * time: 73.89862704277039 45 3.600543e+03 2.581232e+00 * time: 75.09082007408142 46 3.600533e+03 4.282126e+00 * time: 76.27247405052185 47 3.600510e+03 7.019430e+00 * time: 77.47766304016113 48 3.600455e+03 1.095331e+01 * time: 78.68084812164307 49 3.600361e+03 1.401551e+01 * time: 79.95507502555847 50 3.600264e+03 1.213499e+01 * time: 81.1827290058136 51 3.600196e+03 4.054761e+00 * time: 82.42414212226868 52 3.600186e+03 2.280079e-01 * time: 83.63864398002625 53 3.600185e+03 3.282533e-02 * time: 84.8054130077362 54 3.600185e+03 2.051408e-02 * time: 86.02464008331299 55 3.600185e+03 1.828249e-03 * time: 87.18229699134827 56 3.600185e+03 1.543443e-03 * time: 88.32853698730469 57 3.600185e+03 1.489942e-03 * time: 89.68591904640198 58 3.600185e+03 1.352684e-03 * time: 90.7310860157013 59 3.600185e+03 1.351321e-03 * time: 92.27502703666687 60 3.600185e+03 1.349973e-03 * time: 93.74305415153503 61 3.600185e+03 1.348623e-03 * time: 95.0069899559021 62 3.600185e+03 1.347275e-03 * time: 96.14704394340515 63 3.600185e+03 1.347140e-03 * time: 97.3240339756012 64 3.600185e+03 1.347005e-03 * time: 98.52478909492493 65 3.600185e+03 1.346870e-03 * time: 100.06579399108887 66 3.600185e+03 1.346736e-03 * time: 101.40681195259094 67 3.600185e+03 1.346601e-03 * time: 102.72308111190796 68 3.600185e+03 1.346466e-03 * time: 104.09223198890686 69 3.600185e+03 1.346453e-03 * time: 105.46969509124756 70 3.600185e+03 1.346439e-03 * time: 106.78732013702393 71 3.600185e+03 1.346426e-03 * time: 108.12729501724243 72 3.600185e+03 1.346413e-03 * time: 109.33165097236633 73 3.600185e+03 1.346399e-03 * time: 110.71047401428223 74 3.600185e+03 1.346386e-03 * time: 111.87297797203064 75 3.600185e+03 1.346384e-03 * time: 113.03497815132141 76 3.600185e+03 1.346383e-03 * time: 114.17545008659363 77 3.600185e+03 1.346382e-03 * time: 115.48273205757141 78 3.600185e+03 1.346380e-03 * time: 117.29748797416687 79 3.600185e+03 1.346379e-03 * time: 119.08050608634949 80 3.600185e+03 1.346379e-03 * time: 120.85689997673035 81 3.600185e+03 1.346379e-03 * time: 122.27842998504639 82 3.600185e+03 1.346379e-03 * time: 123.64098000526428 83 3.600185e+03 1.346379e-03 * time: 124.9803249835968
FittedPumasModel
Dynamical system type: No dynamical model
Number of subjects: 300
Observation records: Active Missing
score1: 1500 0
score2: 1500 0
Total: 3000 0
Number of parameters: Constant Optimized
0 10
Likelihood approximation: LaplaceI
Likelihood optimizer: BFGS
Termination Reason: NoObjectiveChange
Log-likelihood value: -3600.1853
--------------------
Estimate
--------------------
μf -0.60264
μs 0.044821
ωf 0.8417
ωs 0.038788
θb1₁ 0.44853
θb1₂ 0.14895
θb2₁ -1.4797
θb2₂ 0.47622
θb2₃ 0.11072
a_free₁ 0.60231
--------------------
Recall the true parameters used to simulate the data, to compare against the estimates above:
true_params(μf = -0.3, ωf = 0.8, μs = 0.04, ωs = 0.02, θb1 = [0.4054651081081644, 0.0], θb2 = [-1.0, 0.26236426446749106, -0.1053605156578264], a_free = [0.75])
Despite the deliberately off starting point, the optimizer converges to estimates closer (but not identical) to the parameters used to simulate the data. Next, we check whether the confidence intervals of the estimates contain the true values.
7 Confidence intervals
Next, we want to get confidence intervals (CIs) for the population parameters and check that they contain the true values we used to simulate the data. Asymptotic confidence intervals for the population parameters can be obtained using infer on the LaplaceI fitted Pumas model.
infer_irt = infer(fpm_irt_laplace)
ci_df = coeftable(infer_irt)
simple_table(ci_df)[ Info: Calculating: variance-covariance matrix. [ Info: Done.
| parameter | constant | estimate | se | relative_se | ci_lower | ci_upper |
| μf | false | -0.603 | 0.0957 | 0.159 | -0.79 | -0.415 |
| μs | false | 0.0448 | 0.00601 | 0.134 | 0.033 | 0.0566 |
| ωf | false | 0.842 | 0.0983 | 0.117 | 0.649 | 1.03 |
| ωs | false | 0.0388 | 0.00864 | 0.223 | 0.0218 | 0.0557 |
| θb1₁ | false | 0.449 | 0.0477 | 0.106 | 0.355 | 0.542 |
| θb1₂ | false | 0.149 | 0.0818 | 0.549 | -0.0114 | 0.309 |
| θb2₁ | false | -1.48 | 0.223 | 0.15 | -1.92 | -1.04 |
| θb2₂ | false | 0.476 | 0.144 | 0.302 | 0.194 | 0.758 |
| θb2₃ | false | 0.111 | 0.15 | 1.35 | -0.183 | 0.404 |
| a_free₁ | false | 0.602 | 0.0844 | 0.14 | 0.437 | 0.768 |
We then append true_value and init_value columns by flattening true_params and init_irt in the same row order that coeftable uses. coeftable lists the population scalars as μf, μs, ωf, ωs (all means first, then all SDs), followed by the vector parameters θb1, θb2, a_free:
ci_df.true_value = vcat(
[true_params.μf, true_params.μs, true_params.ωf, true_params.ωs],
true_params.θb1,
true_params.θb2,
true_params.a_free,
)
ci_df.init_value = vcat(
[init_irt.μf, init_irt.μs, init_irt.ωf, init_irt.ωs],
init_irt.θb1,
init_irt.θb2,
init_irt.a_free,
)
simple_table(ci_df)| parameter | constant | estimate | se | relative_se | ci_lower | ci_upper | true_value | init_value |
| μf | false | -0.603 | 0.0957 | 0.159 | -0.79 | -0.415 | -0.3 | 0.5 |
| μs | false | 0.0448 | 0.00601 | 0.134 | 0.033 | 0.0566 | 0.04 | 0.2 |
| ωf | false | 0.842 | 0.0983 | 0.117 | 0.649 | 1.03 | 0.8 | 0.4 |
| ωs | false | 0.0388 | 0.00864 | 0.223 | 0.0218 | 0.0557 | 0.02 | 0.1 |
| θb1₁ | false | 0.449 | 0.0477 | 0.106 | 0.355 | 0.542 | 0.405 | -0.5 |
| θb1₂ | false | 0.149 | 0.0818 | 0.549 | -0.0114 | 0.309 | 0 | -0.5 |
| θb2₁ | false | -1.48 | 0.223 | 0.15 | -1.92 | -1.04 | -1.0 | 0 |
| θb2₂ | false | 0.476 | 0.144 | 0.302 | 0.194 | 0.758 | 0.262 | -0.5 |
| θb2₃ | false | 0.111 | 0.15 | 1.35 | -0.183 | 0.404 | -0.105 | -0.5 |
| a_free₁ | false | 0.602 | 0.0844 | 0.14 | 0.437 | 0.768 | 0.75 | 0.5 |
Each row now shows the LaplaceI point estimate, its confidence interval, the true value used to simulate the data, and the initial value handed to the optimizer, making it easy to check both that every true parameter falls inside its CI and that the fit moved away from its starting point.
8 Predicted item characteristic curves
We opened the tutorial by plotting ICCs from hypothetical parameters; now we plot the estimated ICCs and overlay the true curves. We reconstruct each item’s cut-offs and discrimination from the fit (via to_ordered, prepending the pinned 0 for item 1) and reuse the per-category probability construction from the introductory example.
# Reconstruct estimated item parameters from the LaplaceI fit
est = coef(fpm_irt_laplace)
b1_hat = to_ordered(vcat(0.0, est.θb1))
b2_hat = to_ordered(est.θb2)
a_hat = vcat(1.0, est.a_free)
# Per-category probabilities P(y = k) over an ability grid for one item,
# given its discrimination `a` and ordered cut-offs `b`.
function category_probs(a, b, grid)
cum = [logistic.(a .* (bk .- grid)) for bk in b] # P(y ≤ k), k = 1 … K-1
K = length(b) + 1
return [k == 1 ? cum[1] : k == K ? 1 .- cum[end] : cum[k] .- cum[k-1] for k = 1:K]
end
# Tidy long-format ICC table for one item and one parameter source.
function icc_long(a, b, grid, item, source)
probs = category_probs(a, b, grid)
K = length(probs)
DataFrame(
ability = repeat(collect(grid), K),
probability = reduce(vcat, probs),
category = repeat(["y = $k" for k = 1:K]; inner = length(grid)),
item = "item $item",
source = source,
)
end
icc_compare = reduce(
vcat,
[
icc_long(a_hat[1], b1_hat, fi, 1, "estimated"),
icc_long(a_hat[2], b2_hat, fi, 2, "estimated"),
icc_long(1.0, b1_ordered_true, fi, 1, "true"),
icc_long(true_params.a_free[1], b2_ordered_true, fi, 2, "true"),
],
)
plt =
data(icc_compare) *
mapping(
:ability => "ability (f)",
:probability => "P(y = k)",
color = :category => nonnumeric,
linestyle = :source => nonnumeric => "source",
col = :item => nonnumeric,
) *
visual(Lines; linewidth = 2)
draw(plt)The estimated curves (solid) sit close to the true curves (dashed) for both items, confirming visually what the confidence-interval table reported. Item 2’s flatter transitions reflect its smaller estimated discrimination.
9 Model diagnostics and qualification
With real data there is no ground truth to compare against, so we rely on diagnostics that ask whether the fitted model is consistent with the observed responses.
9.1 Information criteria
aic and bic are most useful for comparing competing models (e.g. a linear vs. an EMAX latent trajectory) fitted to the same data:
(; AIC = aic(fpm_irt_laplace), BIC = bic(fpm_irt_laplace))(AIC = 7220.37065592953, BIC = 7280.434331606032)
9.2 Visual predictive check
The key qualification tool is the visual predictive check (VPC), which overlays the observed category proportions over time on the distribution simulated from the fitted model. Pumas builds a categorical VPC with one panel per response category, so we plot one item at a time. The panels default to first-appearance order, so we pass levels to force them to read \(1 \to K\). Pumas names the per-category columns <observation>_<value>, which we reconstruct from the sorted category values:
# Build sorted level names matching Pumas' `<observation>_<value>` columns.
pop_df = DataFrame(irt_pop)
ordered_levels(score) =
[Symbol("$(score)_$(v)") for v in sort(unique(skipmissing(pop_df[!, score])))]ordered_levels (generic function with 1 method)
irt_vpc_1 = vpc(fpm_irt_laplace; observations = [:score1])
vpc_plot(irt_vpc_1; levels = ordered_levels(:score1), figure = (; size = (600, 1200)))[ Info: Discrete VPC [ Info: Automatically choosing bandwidth for smooth discrete VPC. [ Info: Automatic bandwidth choice complete with bandwidth: 24.0.
irt_vpc_2 = vpc(fpm_irt_laplace; observations = [:score2])
vpc_plot(irt_vpc_2; levels = ordered_levels(:score2), figure = (; size = (600, 1200)))[ Info: Discrete VPC [ Info: Automatically choosing bandwidth for smooth discrete VPC. [ Info: Automatic bandwidth choice complete with bandwidth: 24.0.
Each panel shows one response category’s observed proportion over time (for one item) against the simulated prediction band; bands that envelop the observed proportions across categories and items indicate an adequate fit.
10 Priors as regularization when data is sparse
With 300 subjects, every category was observed enough times that pure maximum likelihood worked fine. But what happens when we don’t have that much data?
The fit we just ran uses marginal maximum likelihood (LaplaceI()): each parameter is declared in @param with ∈ RealDomain(...) or ∈ VectorDomain(...), which specifies only its support, with no probabilistic information. With the \(N = 300\) subjects and 5 observations per item per subject we just used, every category of every item is observed enough times that the likelihood alone identifies all parameters well, and MLE works fine.
The story changes when some categories are barely observed: for example, a clinical scale where the worst category is reached by only a handful of subjects, or an item where one threshold is rarely crossed. In that regime, the corresponding cut-off has very little likelihood information, leading to numerical instability and non-estimability of some parameters.
10.1 A prior-aware Pumas model
The standard remedy is to regularize via priors, i.e., to fit by maximum a posteriori (MAP) instead of MLE. In Pumas, this means redeclaring each parameter with the ~ syntax instead of ∈, which both fixes its support and attaches a prior distribution:
irt_ordinal_model_map = @model begin
@param begin
# Ability fixed effects: weakly informative Gaussians centred at 0.
μf ~ Normal(0, 2)
μs ~ Normal(0, 2)
# Random-effect SDs: half-Cauchy on the positive line. Heavy tails
# allow large values; the small lower bound keeps the optimizer
# from collapsing the SD to exactly zero.
ωf ~ truncated(Cauchy(0, 1); lower = 0.05)
ωs ~ truncated(Cauchy(0, 1); lower = 1e-3)
# Item 1: θb1 contains only log-gaps (first cut-off is fixed to 0).
# N(0, 1) on each log-gap implies a prior 95% interval of roughly
# (e^{-2}, e^{2}) ≈ (0.14, 7.4) for the gap between consecutive
# cut-offs. This is broad relative to typical category spacings but not
# absurd.
θb1 ~ MvNormal(zeros(m[1] - 1), I(m[1] - 1))
# Item 2: θb2[1] is the lowest cut-off (a location on the latent
# scale) and gets a wider N(0, 2) prior; θb2[2:end] are log-gaps
# and get the same N(0, 1) prior as for item 1.
θb2 ~ MvNormal(zeros(m[2]), Diagonal(vcat(4.0, ones(m[2] - 1))))
# Free item discriminations (items 2..n_items): N(0, 2) on log(a_j).
# Item 1's discrimination is fixed to 1 as the scale reference.
a_free ~ MvLogNormal(zeros(n_items - 1), 4 * I(n_items - 1))
end
@random begin
f0 ~ Normal(μf, ωf)
slope ~ Normal(μs, ωs)
end
@pre begin
f = f0 + slope * t
b1_ordered = to_ordered(vcat(0.0, θb1))
b2_ordered = to_ordered(θb2)
a = vcat(1.0, a_free)
lle1 = a[1] .* (b1_ordered .- f)
lle2 = a[2] .* (b2_ordered .- f)
end
@derived begin
score1 ~ @. OrderedLogit(lle1)
score2 ~ @. OrderedLogit(lle2)
end
endPumasModel
Parameters: μf, μs, ωf, ωs, θb1, θb2, a_free
Random effects: f0, slope
Covariates:
Dynamical system variables:
Dynamical system type: No dynamical model
Derived: score1, score2
Observed: score1, score2
Only the @param block has changed; everything downstream is identical. The prior choices above are deliberately weakly informative, and they treat locations on the latent scale and log-gaps between cut-offs differently. A wide prior on the log scale corresponds to an absurdly wide prior on the gap itself, which is rarely what we want:
- \(\mathrm{Normal}(0, 2)\) on the ability fixed effects (\(\mu_f, \mu_s\)) and the lowest cut-off of item 2 (\(\theta^b_{2,1}\), which lives on the ability scale). Wide enough not to dominate informative data, but tight enough to keep the latent scale anchored.
- \(\mathrm{Normal}(0, 1)\) on every log-gap component (\(\theta^b_{1,:}\) and \(\theta^b_{2,2:}\)). This places the prior 95% interval for each consecutive-cut-off gap at \((e^{-2}, e^{2}) \approx (0.14, 7.4)\). Bumping this up to \(\mathrm{Normal}(0, 2)\) (the prior we use on locations) would imply gaps anywhere in \((e^{-4}, e^{4}) \approx (0.02, 55)\), which is far too diffuse and gives the optimizer / sampler very little to hold on to in sparse-data regimes.
- \(\mathrm{LogNormal}(0, 2)\) on the free item discriminations \(a_{2:J}\) (item 1’s discrimination is fixed to 1 as the scale reference, so it has no prior). Equivalent to \(\mathrm{Normal}(0, 2)\) on \(\log a_j\), putting the bulk of the mass on \(a_j \in (e^{-4}, e^{4})\) on the positive scale.
- Half-Cauchy \(\mathrm{Cauchy}(0, 1)\) truncated to a small positive lower bound on the random-effect SDs: heavy-tailed, permits very small values (near-zero between-subject variability) while still allowing large ones.
10.2 Fitting MAP and comparing to MLE
We then use MAP(LaplaceI()) to fit the model, which adds the log-priors to the Laplace-approximated log-likelihood as a regularizer. To make the regularization actually matter, we downsample the simulated population to its first 50 subjects (irt_pop[1:50]). At that size, some categories of each item are only observed a handful of times, so the likelihood alone leaves several cut-offs and discriminations weakly identified:
fpm_irt_map = fit(irt_ordinal_model_map, irt_pop[1:50], init_irt, MAP(LaplaceI()))[ Info: Checking the initial parameter values. [ Info: The initial negative log likelihood and its gradient are finite. Check passed. Iter Function value Gradient norm 0 8.188248e+02 1.610446e+03 * time: 3.409385681152344e-5 1 7.504969e+02 8.749098e+02 * time: 2.6990771293640137 2 7.089774e+02 1.006082e+02 * time: 2.8642709255218506 3 6.925979e+02 1.783234e+02 * time: 3.029294967651367 4 6.659047e+02 5.086880e+02 * time: 3.193648099899292 5 6.546412e+02 1.926765e+02 * time: 3.3721721172332764 6 6.420159e+02 2.389700e+02 * time: 3.5493199825286865 7 6.344980e+02 3.002435e+02 * time: 3.719093084335327 8 6.253718e+02 9.710727e+01 * time: 4.0051209926605225 9 6.234522e+02 2.739849e+01 * time: 4.155004978179932 10 6.212891e+02 6.323539e+01 * time: 4.306396007537842 11 6.195780e+02 6.192034e+01 * time: 4.466670036315918 12 6.182941e+02 2.404885e+01 * time: 4.6370110511779785 13 6.174628e+02 2.057114e+01 * time: 4.813821077346802 14 6.165491e+02 3.855207e+01 * time: 5.03554892539978 15 6.151824e+02 5.451810e+01 * time: 5.203447103500366 16 6.140967e+02 3.387723e+01 * time: 5.357213973999023 17 6.135866e+02 8.740670e+00 * time: 5.515713930130005 18 6.134429e+02 6.117447e+00 * time: 5.6992340087890625 19 6.133251e+02 9.539533e+00 * time: 5.936171054840088 20 6.130095e+02 1.932386e+01 * time: 6.084470987319946 21 6.125185e+02 2.390563e+01 * time: 6.2460010051727295 22 6.119794e+02 1.831283e+01 * time: 6.418977975845337 23 6.116847e+02 9.450470e+00 * time: 6.631338119506836 24 6.115896e+02 4.018618e+00 * time: 6.795009136199951 25 6.115476e+02 4.064309e+00 * time: 6.9526190757751465 26 6.114762e+02 6.402699e+00 * time: 7.160815000534058 27 6.113168e+02 1.011667e+01 * time: 7.323364019393921 28 6.109975e+02 1.578319e+01 * time: 7.487235069274902 29 6.106509e+02 1.637345e+01 * time: 7.6808271408081055 30 6.104040e+02 2.052358e+01 * time: 7.83738899230957 31 6.101593e+02 3.343998e+00 * time: 7.988717079162598 32 6.101439e+02 2.237561e+00 * time: 8.135304927825928 33 6.101287e+02 2.178054e+00 * time: 8.311553955078125 34 6.101142e+02 2.266599e+00 * time: 8.44763708114624 35 6.100966e+02 2.625904e+00 * time: 8.583951950073242 36 6.100744e+02 3.410928e+00 * time: 8.754892110824585 37 6.100449e+02 2.547888e+00 * time: 8.890653133392334 38 6.100163e+02 1.569060e+00 * time: 9.026792049407959 39 6.100003e+02 1.424563e+00 * time: 9.194275140762329 40 6.099944e+02 1.286109e+00 * time: 9.332957029342651 41 6.099904e+02 1.152158e+00 * time: 9.468460083007812 42 6.099830e+02 9.712418e-01 * time: 9.635215044021606 43 6.099662e+02 1.097342e+00 * time: 9.783276081085205 44 6.099290e+02 1.516326e+00 * time: 9.931503057479858 45 6.098629e+02 2.189174e+00 * time: 10.07871699333191 46 6.097865e+02 1.924357e+00 * time: 10.256217002868652 47 6.097423e+02 8.576499e-01 * time: 10.401358127593994 48 6.097333e+02 6.700974e-01 * time: 10.553532123565674 49 6.097319e+02 6.565466e-01 * time: 10.7359619140625 50 6.097303e+02 6.562399e-01 * time: 10.89963698387146 51 6.097263e+02 6.279605e-01 * time: 11.059168100357056 52 6.097180e+02 7.845362e-01 * time: 11.247470140457153 53 6.097045e+02 9.536278e-01 * time: 11.397933006286621 54 6.096927e+02 7.077729e-01 * time: 11.557210922241211 55 6.096881e+02 4.254716e-01 * time: 11.742897033691406 56 6.096875e+02 6.182014e-02 * time: 11.908406972885132 57 6.096875e+02 1.666577e-02 * time: 12.07488203048706 58 6.096875e+02 3.632520e-03 * time: 12.246367931365967 59 6.096875e+02 1.289419e-03 * time: 12.44923710823059 60 6.096875e+02 3.058822e-04 * time: 12.609946966171265
FittedPumasModel
Dynamical system type: No dynamical model
Number of subjects: 50
Observation records: Active Missing
score1: 250 0
score2: 250 0
Total: 500 0
Number of parameters: Constant Optimized
0 10
Estimation method: MAP (LaplaceI)
--------------------
Estimate
--------------------
μf -0.65986
μs 0.051183
ωf 0.89282
ωs 0.045609
θb1₁ 0.52516
θb1₂ 0.035215
θb2₁ -1.5777
θb2₂ 0.6691
θb2₃ 0.30676
a_free₁ 0.50861
--------------------
We can compare the MAP estimates (50 subjects, priors active) to the earlier LaplaceI estimates (full 300 subjects, MLE) by appending them to ci_df:
map_est = coef(fpm_irt_map)
ci_df.map_50 = vcat(
[map_est.μf, map_est.μs, map_est.ωf, map_est.ωs],
map_est.θb1,
map_est.θb2,
map_est.a_free,
)
simple_table(ci_df)| parameter | constant | estimate | se | relative_se | ci_lower | ci_upper | true_value | init_value | map_50 |
| μf | false | -0.603 | 0.0957 | 0.159 | -0.79 | -0.415 | -0.3 | 0.5 | -0.66 |
| μs | false | 0.0448 | 0.00601 | 0.134 | 0.033 | 0.0566 | 0.04 | 0.2 | 0.0512 |
| ωf | false | 0.842 | 0.0983 | 0.117 | 0.649 | 1.03 | 0.8 | 0.4 | 0.893 |
| ωs | false | 0.0388 | 0.00864 | 0.223 | 0.0218 | 0.0557 | 0.02 | 0.1 | 0.0456 |
| θb1₁ | false | 0.449 | 0.0477 | 0.106 | 0.355 | 0.542 | 0.405 | -0.5 | 0.525 |
| θb1₂ | false | 0.149 | 0.0818 | 0.549 | -0.0114 | 0.309 | 0 | -0.5 | 0.0352 |
| θb2₁ | false | -1.48 | 0.223 | 0.15 | -1.92 | -1.04 | -1.0 | 0 | -1.58 |
| θb2₂ | false | 0.476 | 0.144 | 0.302 | 0.194 | 0.758 | 0.262 | -0.5 | 0.669 |
| θb2₃ | false | 0.111 | 0.15 | 1.35 | -0.183 | 0.404 | -0.105 | -0.5 | 0.307 |
| a_free₁ | false | 0.602 | 0.0844 | 0.14 | 0.437 | 0.768 | 0.75 | 0.5 | 0.509 |
The MAP estimates on 50 subjects should sit close to the true values for the well-identified parameters and shrink toward the prior mode for the weakly identified ones.
10.3 Marginal log-likelihood diagnostic
A useful diagnostic is to evaluate the marginal log-likelihood on the same 50-subject dataset at both the MAP estimate and the true parameters:
ll_map = loglikelihood(irt_ordinal_model_map, irt_pop[1:50], map_est, LaplaceI())
ll_true = loglikelihood(irt_ordinal_model_map, irt_pop[1:50], true_params, LaplaceI())
(; ll_map, ll_true, Δll = ll_map - ll_true)(ll_map = -597.9485443464408, ll_true = -601.2558270522165, Δll = 3.3072827057757195)
In this case, the log-likelihood of the MAP estimate is similar to that of the true parameters on the 50-subject dataset, which implies a good fit to the observed data. Notice that the log-likelihood of the MAP estimate is actually higher than that of the true parameters on this dataset. This is because the true parameters, although they generated the data, are not guaranteed to be the best fit to any particular subset of it (especially a small one) due to sampling variability.
11 Markov chain Monte Carlo (MCMC)
To fully quantify the uncertainty in the parameters of the model in the presence of sparse data, a full Bayesian fit is recommended. Pumas has a No-U-Turn Sampler (NUTS)-based MCMC sampler. We can simply use the BayesMCMC algorithm with the same model definition as for MAP. Here, we start the sampler from the MAP estimate:
fpm_irt_bayes = fit(
irt_ordinal_model_map,
irt_pop[1:50],
coef(fpm_irt_map), # use the MAP estimate as the MCMC initial value
BayesMCMC(; nsamples = 2_000, nadapts = 1_000, nchains = 4),
)
fpm_irt_bayes_samples = discard(fpm_irt_bayes; burnin = 1_000)[ Info: Checking the initial parameter values. [ Info: The initial log probability and its gradient are finite. Check passed. [ Info: Checking the initial parameter values. [ Info: The initial log probability and its gradient are finite. Check passed. [ Info: Checking the initial parameter values. [ Info: The initial log probability and its gradient are finite. Check passed. [ Info: Checking the initial parameter values. [ Info: The initial log probability and its gradient are finite. Check passed.
Chains MCMC chain (1000×10×4 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 4
Samples per chain = 1000
Wall duration = 222.52 seconds
Compute duration = 222.2 seconds
parameters = μf, μs, ωf, ωs, θb1₁, θb1₂, θb2₁, θb2₂, θb2₃, a_free₁
Use `describe(chains)` for summary statistics and quantiles.
11.1 Convergence diagnostics
The returned object is a collection of MCMC chains rather than a single point estimate. The standard summaries are obtained with describe (mean, std, ESS, R̂, quantiles, etc.):
describe(fpm_irt_bayes_samples)Chains MCMC chain (1000×10×4 Array{Float64, 3}):
Iterations = 1:1:1000
Number of chains = 4
Samples per chain = 1000
Wall duration = 222.52 seconds
Compute duration = 222.2 seconds
parameters = μf, μs, ωf, ωs, θb1₁, θb1₂, θb2₁, θb2₂, θb2₃, a_free₁
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
μf -0.6869 0.2499 0.0041 3774.5341 2751.9469 1.0008 ⋯
μs 0.0532 0.0181 0.0003 3292.5996 3055.7186 1.0007 ⋯
ωf 0.9962 0.2456 0.0067 1315.4304 2099.2138 1.0063 ⋯
ωs 0.0549 0.0231 0.0008 790.8375 1287.2277 1.0052 ⋯
θb1₁ 0.5437 0.1184 0.0022 2800.5125 2597.8038 1.0018 ⋯
θb1₂ 0.0379 0.1977 0.0028 5004.2390 2882.9351 1.0018 ⋯
θb2₁ -1.8402 0.5852 0.0141 1925.1006 2226.2480 1.0009 ⋯
θb2₂ 0.7982 0.3087 0.0078 1576.1964 1983.9329 1.0014 ⋯
θb2₃ 0.4328 0.3226 0.0079 1684.9300 2458.5371 1.0021 ⋯
a_free₁ 0.4640 0.1480 0.0036 1521.4819 2258.7931 1.0018 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
μf -1.2082 -0.8452 -0.6820 -0.5190 -0.2158
μs 0.0182 0.0408 0.0531 0.0652 0.0888
ωf 0.5492 0.8267 0.9865 1.1477 1.5019
ωs 0.0091 0.0391 0.0546 0.0710 0.1004
θb1₁ 0.3064 0.4662 0.5478 0.6229 0.7685
θb1₂ -0.3606 -0.0924 0.0471 0.1798 0.3992
θb2₁ -3.2524 -2.1600 -1.7562 -1.4232 -0.9185
θb2₂ 0.2037 0.5882 0.7895 1.0003 1.4320
θb2₃ -0.1868 0.2170 0.4281 0.6484 1.0741
a_free₁ 0.2325 0.3594 0.4458 0.5487 0.8150
Before reading anything off the posterior, glance at the rhat and ess_* columns. As a rule of thumb, rhat should be very close to 1.0 (certainly below 1.01) for every parameter, and the effective sample sizes (both bulk and tail) should be at least a few hundred per parameter. If rhat is much above 1.01 or ESS is much smaller than the nominal sample count, the chains have not mixed and the credible intervals below should not be trusted; rerun with more nsamples / nadapts, or revisit the model parameterization and priors.
11.2 Credible intervals
We can get 95% credible intervals for the parameters from the quantiles of the marginal posteriors:
quantile_df = DataFrame(quantile(fpm_irt_bayes_samples))
quantile_df.true_value = vcat(
[true_params.μf, true_params.μs, true_params.ωf, true_params.ωs],
true_params.θb1,
true_params.θb2,
true_params.a_free,
)
simple_table(quantile_df)| parameters | 2.5% | 25.0% | 50.0% | 75.0% | 97.5% | true_value |
| :μf | -1.21 | -0.845 | -0.682 | -0.519 | -0.216 | -0.3 |
| :μs | 0.0182 | 0.0408 | 0.0531 | 0.0652 | 0.0888 | 0.04 |
| :ωf | 0.549 | 0.827 | 0.987 | 1.15 | 1.5 | 0.8 |
| :ωs | 0.00906 | 0.0391 | 0.0546 | 0.071 | 0.1 | 0.02 |
| :θb1₁ | 0.306 | 0.466 | 0.548 | 0.623 | 0.769 | 0.405 |
| :θb1₂ | -0.361 | -0.0924 | 0.0471 | 0.18 | 0.399 | 0 |
| :θb2₁ | -3.25 | -2.16 | -1.76 | -1.42 | -0.918 | -1.0 |
| :θb2₂ | 0.204 | 0.588 | 0.789 | 1 | 1.43 | 0.262 |
| :θb2₃ | -0.187 | 0.217 | 0.428 | 0.648 | 1.07 | -0.105 |
| :a_free₁ | 0.233 | 0.359 | 0.446 | 0.549 | 0.815 | 0.75 |
If the true parameters are not mostly contained in their 95% credible intervals, this could imply that the priors are not weak enough to let the data speak for itself. For more on Bayesian estimation in Pumas, see the Bayesian tutorials in Pumas.
12 Extensions
The model we have built and fitted is intentionally minimal: a linear scalar latent ability \(f_i(t)\) with OrderedLogit observations on two items, no covariates, and no treatment effect. There are at least four ways to extend the IRT model presented here. Each extension requires only a localized change to the model, with the rest of the workflow (simobs, read_pumas, fit(..., LaplaceI()) / MAP(LaplaceI()) / BayesMCMC) unchanged.
12.1 Different disease progression function forms
The linear model \(f_i(t) = f_{0,i} + s_i \, t\) assumes the latent ability changes at a constant rate. In many clinical settings, the disease severity saturates (it changes quickly at first and then approaches a plateau), and a natural parametric alternative is the EMAX form
\[ f_i(t) = f_{0,i} + E_{\max,i} \cdot \frac{t}{EC_{50} + t}, \]
where \(f_{0,i}\) is the subject-specific baseline ability, \(E_{\max,i}\) is the maximum change from baseline (so the asymptote as \(t \to \infty\) is \(f_{0,i} + E_{\max,i}\)), and \(EC_{50} > 0\) is the time at which half of the maximum change is reached. The progression rate at \(t = 0\) is \(E_{\max,i} / EC_{50}\), and the curve approaches its plateau monotonically. Other parametric forms (exponential, Weibull, indirect-response, mechanistic ODE-based, etc.) slot in the same way. Each is a localized change to the @pre block, optionally the @dynamics block, and their associated parameters in @param / @random; the rest of the model (item cut-offs, discriminations, OrderedLogit observation, fitting with LaplaceI) is unaffected.
Sometimes no fixed parametric form is plausible; for example, the trajectory may be non-monotonic and its structure is difficult to specify a priori. In such cases, the structural part of \(f_i(t)\) can be replaced by a neural network:
\[ f_i(t) = \mathrm{NN}_{\boldsymbol{\phi}}\!\left(t,\; \boldsymbol{\eta}_i,\; \boldsymbol{x}_i\right), \]
where \(\boldsymbol{\eta}_i\) are subject-specific random effects, \(\boldsymbol{x}_i\) are covariates, and \(\boldsymbol{\phi}\) are the (population-level) network weights, turning the model to a DeepNLME model. This lets a neural component be embedded directly in @pre (using DeepPumas) alongside the usual @param / @random / @derived machinery. From the rest of the IRT model’s point of view, nothing changes (f is still just a scalar function of time and individual effects), so the same OrderedLogit observation block, simobs, read_pumas, and fit(..., LaplaceI()) workflow apply.
12.2 Covariate and treatment effects
The latent ability in our model depends only on time and subject-specific random effects. In practice, \(f_i(t)\) is often also driven by baseline covariates (such as age, sex, disease subtype, or a baseline biomarker) and by treatment (active vs. placebo, dose, regimen). These can be defined either directly in @pre or using an ODE in the @dynamics block, e.g., using a pharmacokinetic model to drive the disease progression via an exposure-response relationship. The key point is that the IRT-specific parts of the model (item cut-offs, discriminations, OrderedLogit observation) are unaffected by the addition of covariates and treatment effects, so the same workflow applies. If complex covariate relationships are expected, a neural network can be used to flexibly model the covariate effects on the latent trajectory.
12.3 Bounded integer model
The extensions above all leave the OrderedLogit likelihood untouched. The next extension is to replace that likelihood. An alternative is the bounded integer (BI) model, in which the observed score \(y_{i,j} \in \{1, \dots, K_j\}\) arises by discretizing a latent continuous variable on the score scale through a Gaussian CDF, rather than a logistic function as in the OrderedLogit.
12.4 Multi-dimensional IRT
The fourth extension touches the latent state itself. The model above assumes a single scalar latent ability \(f_i(t)\). When the items are driven by multiple, partially independent latent dimensions (for instance, motor function and cognition, or fatigue and pain), the unidimensional assumption (cf. the warning at the top of this tutorial) is violated and a multi-dimensional IRT model is appropriate. The latent state becomes a vector \(\boldsymbol{f}_i(t) \in \mathbb{R}^D\), and each item’s OrderedLogit threshold reads off a linear combination of the latent dimensions via item-specific loadings \(\boldsymbol{w}_j \in \mathbb{R}^D\):
\[ \eta_{j,k} \;=\; a_j \left(b_{j,k} - \boldsymbol{w}_j^\top \boldsymbol{f}_i(t)\right). \]
The multi-dimensional latent trajectory \(\boldsymbol{f}_i(t)\) can be built in two complementary ways:
- Traditional disease progression models. Each component of \(\boldsymbol{f}_i(t)\) is given its own structural form (linear, EMAX, exponential, ODE-based, …), with its own fixed and random effects. This is the natural choice when each latent dimension corresponds to a clinically interpretable construct with a well-understood time course.
- DeepNLME (using DeepPumas). \(\boldsymbol{f}_i(t) = \mathrm{NN}_{\boldsymbol{\phi}}(t, \boldsymbol{\eta}_i, \boldsymbol{x}_i)\) is the output of a neural network with \(D\) outputs. This is well suited to high-dimensional or multi-modal trajectories where no single parametric form fits well across patients, or to settings with many covariates whose interactions with time are non-trivial.
In both cases, the rest of the IRT model (item cut-offs, discriminations, OrderedLogit observation) is unchanged from the unidimensional case.