Discrete-Time Markov Models

Authors

Vijay Ivaturi

Andreas Noack

Warning

FiniteMarkov is an experimental feature introduced in Pumas 2.8. The API may change in future releases without deprecation.

1 Introduction

In clinical trials and observational studies, we frequently encounter categorical outcomes that evolve over time. Consider the following scenarios:

  • A patient’s disease severity recorded weekly as “none”, “mild”, “moderate”, or “severe”
  • Treatment response assessed at each visit as “no response”, “partial response”, or “full response”
  • An adverse event grade tracked daily from “absent” through “grade 1” to “grade 3”
  • Sleep stages cycling through “awake”, “light sleep”, “deep sleep”, and “REM”

A common feature of these data is that the current observation depends heavily on the previous one. A patient who was in “mild” disease at the last visit is far more likely to remain in “mild” at the next visit than to suddenly jump to “severe”. This dependence between consecutive observations is called the Markov property, and models designed to capture it are called Markov models.

In this tutorial, we will learn how to build and fit discrete-time Markov models (DTMMs) in Pumas using the FiniteMarkov distribution. We will work through a complete clinical example — from recognizing Markovian features in data to fitting a PK-driven Markov model, interpreting the results, and evaluating the model.

using Pumas
using Pumas.Experimental: FiniteMarkov
using PumasUtilities
using PharmaDatasets
using DataFramesMeta
using CairoMakie
using AlgebraOfGraphics
using GLM: softmax
using SummaryTables
using Printf
using PumasReports

2 What Makes Data “Markovian”?

Before building a Markov model, it helps to understand what Markovian data looks like and why standard approaches fall short.

2.1 Recognizing Markovian Features

Data with Markovian features are characterized by three signatures (Ooi, Plan & Bergstrand, CPT:PSP 2025):

  1. Frequent observations relative to changes — observations are collected more often than the outcome is expected to change. This leads to long runs of the same category.
  2. Many consecutive same-category observations — at the individual level, you see strings of identical values (e.g., “mild, mild, mild, moderate, moderate, moderate”).
  3. Positive correlation between current and previous values — a scatter plot of \(Y_t\) versus \(Y_{t-1}\) shows a clear positive trend.

2.2 Why Not Use Standard Models?

Standard categorical models (e.g., proportional odds, logistic regression) assume that observations are independent. When the data have Markovian features, ignoring this dependence can cause:

  • Inflated Type I error in hypothesis tests — you may falsely identify covariate effects (e.g., a placebo effect) that do not exist
  • Overestimated inter-individual variability (IIV) — the model absorbs the serial correlation into IIV
  • Unrealistic simulations — simulated data will show too many state changes, inflated extreme-value occurrences, and under-predicted time spent in the same state

A dedicated Markov model solves these problems by explicitly conditioning each observation on the previous state.

3 The Markov Chain: Core Concepts

3.1 States and Transitions

A discrete-time Markov chain is a stochastic process where a subject occupies one of \(K\) discrete states at each time point. The probability of being in any state at time \(t\) depends only on the state at time \(t - 1\) — not on any earlier history. This is called the first-order Markov property:

\[ P(Y_t = k_t \mid Y_0, Y_1, \ldots, Y_{t-1}) = P(Y_t = k_t \mid Y_{t-1}) \]

NoteFirst-Order Means Short Memory, Not Small Steps

“First-order” refers to memory depth, not step size. It means the next state depends only on the current state — not on the full history of previous states. A subject can jump from any state to any other state in a single time step; the transition probabilities determine how likely each jump is. If you want to enforce that only adjacent transitions are possible (e.g., a patient cannot skip from “none” to “severe”), you would explicitly set those transition probabilities to zero by fixing the corresponding logits to \(-\infty\). That is a modeling choice, not a property of first-order Markov chains.

Higher-order Markov models exist (where the predicted value depends on two or more previous states), but they require many more parameters and their results are harder to communicate. In pharmacometric practice, first-order models are almost always sufficient.

3.2 The Transition Probability Matrix

The transition probabilities are collected in a transition probability matrix \(P\). For \(K\) states, \(P\) is a \(K \times K\) matrix where each entry \(P_{ij}\) is the probability of moving from source state \(i\) to destination state \(j\):

\[ P = \begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1K} \\ P_{21} & P_{22} & \cdots & P_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ P_{K1} & P_{K2} & \cdots & P_{KK} \end{bmatrix} \]

Each row represents a complete probability distribution: “Given that a subject is currently in state \(i\), what is the probability of transitioning to each possible state?” Because the subject must end up somewhere (including staying in the same state), each row sums to 1.

Let’s consider a concrete example with three disease states (no response, partial response, full response). A plausible transition matrix might look like:

From  To No Response Partial Full
No Response 0.85 0.12 0.03
Partial 0.05 0.80 0.15
Full 0.02 0.08 0.90
TipHow to Read the Transition Matrix

Read by row — each row is a self-contained probability distribution for a subject currently in that state:

  • Row 1 (No Response): A patient currently in “No Response” has an 85% chance of staying there, a 12% chance of improving to “Partial”, and a 3% chance of jumping directly to “Full Response”. These three probabilities sum to 1.
  • Row 2 (Partial): A different patient currently in “Partial” has a 5% chance of worsening to “No Response”, an 80% chance of staying in “Partial”, and a 15% chance of improving to “Full Response”.

Do not read down columns. The 0.12 in Row 1 and the 0.05 in Row 2 of the “No Response” column are about different patients in different current states. Each row is independent — you pick the row that matches the subject’s current state, then read across to find the probabilities for their next state.

The strong diagonal dominance (high same-state probabilities) is typical of Markovian data — subjects tend to stay in their current state from one observation to the next.

3.3 Absorbing States

An absorbing state is one that a subject can never leave. Common examples include dropout, death, or cure. Once reached, \(P(\text{absorbing} \mid \text{absorbing}) = 1\) and all other transitions from that state are zero.

3.4 Equidistant Time Assumption

A key assumption of the standard DTMM is that the effect of time since the last observation on transition probabilities is negligible. This works well when observations are equidistant (e.g., daily, weekly). When time intervals vary substantially, a continuous-time Markov model (CTMM) should be considered instead.

TipDTMM vs. CTMM: Quick Guide

Use a DTMM when:

  • Observations are at regular, equidistant intervals
  • You want a straightforward, easy-to-communicate model
  • The number of states is moderate (2–5)

Consider a CTMM when:

  • Time intervals between observations vary
  • You want to simulate at different time intervals than observed
  • Time-varying covariates change between observation times
  • The variable has many ordered categories (CTMM is more parsimonious)

4 Parameterizing Transitions: The Logit Approach

4.1 Why Not Estimate Probabilities Directly?

Transition probabilities must satisfy two constraints:

  1. Each probability is between 0 and 1
  2. Each row sums to 1

Pumas uses a logit parameterization that maps unconstrained real numbers to valid probabilities.

4.2 The Logistic Transformation

Consider a subject currently in source state \(i\). They can transition to any of \(K\) destination states. We need \(K\) probabilities that are each between 0 and 1 and sum to 1.

The trick is to estimate only \(K - 1\) unconstrained parameters and derive the \(K\)-th probability from the others. We pick one destination state as the reference category — by convention, the last state (state \(K\)). The \(K - 1\) estimated parameters \(\tau_{i1}, \ldots, \tau_{i(K-1)}\) are called logits and can take any value on the real line (\(-\infty\) to \(+\infty\)).

The logistic transformation converts these logits into probabilities via a shared denominator that guarantees the row sums to 1:

\[ D_i = 1 + \sum_{m=1}^{K-1} \exp(\tau_{im}) \]

where \(m\) ranges over the \(K - 1\) non-reference destination states. This denominator is the same for every destination state in the row.

Probability of transitioning to a non-reference state (\(j < K\)):

\[ P(i \to j) = \frac{\exp(\tau_{ij})}{D_i} \]

Each non-reference destination state \(j\) contributes \(\exp(\tau_{ij})\) in the numerator. A larger logit \(\tau_{ij}\) means a larger \(\exp(\tau_{ij})\) and therefore a higher probability of transitioning to state \(j\).

Probability of transitioning to the reference state (\(j = K\)):

\[ P(i \to K) = \frac{1}{D_i} \]

The reference state has no estimated logit parameter — its logit is implicitly fixed at zero, so \(\exp(0) = 1\) appears in the numerator. Its probability is whatever is “left over” after the other states claim their share.

4.3 A Worked Example

Let’s make this concrete. Suppose we have \(K = 3\) destination states, so we estimate \(K - 1 = 2\) logit parameters for this row: \(\tau_1 = 2.0\) and \(\tau_2 = -1.0\). State 3 is the reference (implicit logit = 0).

Step 1 — Compute the shared denominator:

\[ D = 1 + \exp(2.0) + \exp(-1.0) = 1 + 7.389 + 0.368 = 8.757 \]

The leading 1 comes from the reference state: \(\exp(0) = 1\).

Step 2 — Divide each \(\exp(\text{logit})\) by the denominator:

Destination Logit \(\exp(\text{logit})\) Probability = \(\exp / D\)
State 1 2.0 (estimated) 7.389 7.389 / 8.757 = 0.844
State 2 -1.0 (estimated) 0.368 0.368 / 8.757 = 0.042
State 3 0 (reference, fixed) 1.000 1.000 / 8.757 = 0.114
Total 1.000

The probabilities sum to 1, as required. And critically, any real values of \(\tau_1\) and \(\tau_2\) will produce valid probabilities — the optimizer is free to explore the entire real line without ever violating the constraints.

4.4 This Is the Softmax (Multinomial Logistic) Function

The transformation we just performed has a well-known name: softmax (also called the multinomial logistic function). You may already know the binary logistic function, which converts a single logit to a probability:

\[ \text{logistic}(\tau) = \frac{\exp(\tau)}{1 + \exp(\tau)} \]

The softmax is its generalization to \(K\) categories. Given a vector of logits \([\tau_1, \tau_2, \ldots, \tau_K]\), softmax returns a vector of probabilities:

\[ \text{softmax}(\boldsymbol{\tau})_j = \frac{\exp(\tau_j)}{\sum_{m=1}^{K} \exp(\tau_m)} \]

In our parameterization, we estimate \(K - 1\) logits and fix the reference state logit to 0. So the full logit vector is \([\tau_1, \ldots, \tau_{K-1}, 0]\), and softmax produces exactly the probabilities we computed by hand above.

Let’s verify with code:

# The same worked example: τ₁ = 2.0, τ₂ = -1.0, reference logit = 0
softmax([2.0, -1.0, 0.0])
3-element Vector{Float64}:
 0.8437947344813396
 0.042010066134066056
 0.1141951993845945

This matches our manual calculation (0.844, 0.042, 0.114).

NoteBinary Logistic Is a Special Case

When \(K = 2\) (two states), the softmax with logits \([\tau, 0]\) reduces to:

\[ P(\text{State 1}) = \frac{\exp(\tau)}{\exp(\tau) + 1} = \text{logistic}(\tau) \]

So the binary logistic function is just softmax with two categories. The multinomial logistic (softmax) generalizes this to any number of categories.

4.5 Why This Parameterization Is Useful for Drug Effects

When a drug effect is additive on the logit scale, the model is:

\[ \tau_{ij}^{\text{drug}} = \tau_{ij} + \text{slope} \times \text{concentration} \]

A positive slope increases the logit for that transition, which increases the corresponding probability and proportionally decreases all other probabilities from that state. This is both mathematically clean and clinically interpretable.

5 Loading and Exploring the Dataset

Now let’s work through a complete example. We’ll use a clinical trial dataset that records disease states over time alongside PK information about drug exposure.

df = dataset("psptutorials/psp413278/markov-discrete-time")
first(df, 10)
10×10 DataFrame
Row ID TIME AMT II ADDL DV EVID TREA ETA1 ETA2
Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Float64
1 1 0 0 0 0 1 0 0 -0.0754288 -0.761771
2 1 1 0 0 0 1 0 0 -0.0754288 -0.761771
3 1 2 0 0 0 1 0 0 -0.0754288 -0.761771
4 1 3 0 0 0 1 0 0 -0.0754288 -0.761771
5 1 4 0 0 0 1 0 0 -0.0754288 -0.761771
6 1 5 0 0 0 1 0 0 -0.0754288 -0.761771
7 1 6 0 0 0 1 0 0 -0.0754288 -0.761771
8 1 7 0 0 0 1 0 0 -0.0754288 -0.761771
9 1 8 0 0 0 1 0 0 -0.0754288 -0.761771
10 1 9 0 0 0 1 0 0 -0.0754288 -0.761771
overview_table(df)
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 ID
[Int64]
Mean (sd): 50.5 (28.9)
min ≤ med ≤ max:
1 ≤ 51 ≤ 100
IQR (CV): 50 (0.571)
100 distinct values 36650
(100%)
0
(0%)
2 TIME
[Int64]
Mean (sd): 182 (106)
min ≤ med ≤ max:
0 ≤ 182 ≤ 365
IQR (CV): 183 (0.581)
366 distinct values 36650
(100%)
0
(0%)
3 AMT
[Int64]
Mean (sd): 0.136 (3.69)
min ≤ med ≤ max:
0 ≤ 0 ≤ 100
IQR (CV): 0 (27.1)
2 distinct values 36650
(100%)
0
(0%)
4 II
[Int64]
Mean (sd): 0.00136 (0.0369)
min ≤ med ≤ max:
0 ≤ 0 ≤ 1
IQR (CV): 0 (27.1)
2 distinct values 36650
(100%)
0
(0%)
5 ADDL
[Int64]
Mean (sd): 0.497 (13.4)
min ≤ med ≤ max:
0 ≤ 0 ≤ 364
IQR (CV): 0 (27.1)
2 distinct values 36650
(100%)
0
(0%)
6 DV
[Int64]
Mean (sd): 1.83 (0.926)
min ≤ med ≤ max:
0 ≤ 2 ≤ 4
IQR (CV): 2 (0.506)
5 distinct values 36650
(100%)
0
(0%)
7 EVID
[Int64]
Mean (sd): 0.00136 (0.0369)
min ≤ med ≤ max:
0 ≤ 0 ≤ 1
IQR (CV): 0 (27.1)
2 distinct values 36650
(100%)
0
(0%)
8 TREA
[Int64]
Mean (sd): 0.501 (0.5)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1
IQR (CV): 1 (0.999)
2 distinct values 36650
(100%)
0
(0%)
9 ETA1
[Float64]
Mean (sd): -0.0409 (0.293)
min ≤ med ≤ max:
-0.742 ≤ -0.0754 ≤ 0.598
IQR (CV): 0.437 (-7.17)
100 distinct values 36650
(100%)
0
(0%)
10 ETA2
[Float64]
Mean (sd): 0.0188 (0.352)
min ≤ med ≤ max:
-0.762 ≤ 0.0396 ≤ 0.756
IQR (CV): 0.479 (18.7)
100 distinct values 36650
(100%)
0
(0%)
Dimensions: 36650 x 10
Duplicate rows: 0

The dataset contains:

  • ID: Subject identifier
  • TIME: Observation time
  • EVID: Event identifier (1 = dosing event, 0 = observation)
  • AMT, ADDL, II: Dosing information (amount, additional doses, dosing interval)
  • DV: Disease state observation (integer 1, 2, 3, or 4)
  • ETA1, ETA2: Pre-computed individual PK random effects
NotePre-Computed Random Effects

This dataset includes ETA1 and ETA2 columns representing pre-computed individual PK random effects from a prior PK analysis. They are used as covariates to account for individual PK variability without estimating random effects on PK parameters within the Markov model. This sequential approach is common when the focus is on the transition model rather than PK.

5.1 Data Preparation

Before fitting, we need two data wrangling steps:

  1. Add a CMT (compartment) column set to 1 — required by read_pumas
  2. Set DV to missing for dosing events (EVID == 1) — dosing rows carry no disease state observation
@rtransform!(df, :CMT = 1);
@rtransform!(df, :DV = :EVID == 1 ? missing : :DV);
WarningMissing Values in FiniteMarkov

FiniteMarkov does not support missing observations in the disease state sequence. The missing values we just created for dosing events are handled correctly by read_pumas (dosing events are not part of the observation vector). However, if your dataset has missing observations at non-dosing timepoints, you must impute them before fitting.

Now we read the data into a Pumas Population:

pop = read_pumas(
    df;
    id = :ID,
    time = :TIME,
    evid = :EVID,
    cmt = :CMT,
    amt = :AMT,
    addl = :ADDL,
    ii = :II,
    observations = [:DV],
    covariates = [:ETA1, :ETA2],
)
Population
  Subjects: 100
  Covariates: ETA1, ETA2 (heterogenous)
  Observations: DV (heterogenous)

5.2 Exploratory Data Analysis

Good EDA is essential before fitting a Markov model. Following the recommendations of Ooi et al. (2025), we examine three aspects:

  1. The longitudinal profile of state values per subject
  2. The distribution of states over time across the population
  3. The correlation between current and previous states (Markov signature)

5.2.1 Longitudinal Profiles

Let’s first look at the raw observations.

obs_df = @chain df begin
    @rsubset !ismissing(:DV)
    @rtransform :State = Int(:DV)
    @rtransform :StateStr = string(Int(:DV))
end
simple_table(first(obs_df, 5))
ID TIME AMT II ADDL DV EVID TREA ETA1 ETA2 CMT State StateStr
1 0 0 0 0 1 0 0 -0.0754 -0.762 1 1 1
1 1 0 0 0 1 0 0 -0.0754 -0.762 1 1 1
1 2 0 0 0 1 0 0 -0.0754 -0.762 1 1 1
1 3 0 0 0 1 0 0 -0.0754 -0.762 1 1 1
1 4 0 0 0 1 0 0 -0.0754 -0.762 1 1 1
# Longitudinal state profiles for a few subjects
selected_ids = unique(obs_df.ID)[1:min(6, length(unique(obs_df.ID)))]
sub_df = @rsubset(obs_df, :ID in selected_ids)
@rtransform!(sub_df, :IDStr = string(:ID));
plt_long =
    data(sub_df) *
    mapping(:TIME => "Time", :State => "State"; layout = :IDStr => "Subject") *
    visual(Stairs; step = :post)
draw(plt_long; axis = (; yticks = 1:4))

Notice the long runs of the same state — this is the hallmark of Markovian data. Subjects tend to stay in their current state for multiple consecutive observations before transitioning.

5.2.2 Distribution of States Over Time

Next, let’s look at how the population distributes across states at each time point.

state_counts = @chain obs_df begin
    groupby([:TIME, :StateStr])
    @combine :Count = length(:ID)
end
plt_states =
    data(state_counts) *
    mapping(
        :TIME => "Time",
        :Count => "Number of Subjects";
        color = :StateStr => "State",
        layout = :StateStr,
    ) *
    visual(BarPlot; alpha = 0.7)
draw(plt_states)

5.2.3 Transition Frequency Table

One of the most informative diagnostics is a transition frequency table — a count of how often each pair of transitions occurs in the data. This directly reveals the Markov structure:

# Create current-vs-previous state pairs
obs_by_id = groupby(obs_df, :ID)
transitions = DataFrame(ID = Int[], PrevState = Int[], CurrState = Int[])
for group in obs_by_id
    states = group.State
    ids = group.ID
    for i = 2:length(states)
        push!(transitions, (ID = ids[i], PrevState = states[i-1], CurrState = states[i]))
    end
end
simple_table(first(transitions, 5))
ID PrevState CurrState
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
# Transition frequency matrix
trans_counts = @chain transitions begin
    groupby([:PrevState, :CurrState])
    @combine :N = length(:ID)
end
# Pivot to matrix form
trans_wide = unstack(trans_counts, :PrevState, :CurrState, :N)
simple_table(trans_wide)
PrevState 1 2 3 4
1 17432 231 2 5
2 183 8430 128 1
3 1 109 8502 1
4 missing missing missing 1475
TipReading the Transition Table

Each row represents the previous state and each column represents the current state. Look for:

  • Large diagonal values → strong Markov property (subjects tend to stay)
  • Asymmetry → preferred directions of transition (e.g., more 1→2 than 2→1 may indicate disease progression)
  • Zero or rare transitions → these can be fixed or constrained in the model

5.2.4 Computing Initial Logit Estimates from the Data

Good initial values help the optimizer converge faster and avoid local minima. We can derive them directly from the observed transition frequencies.

Step 1 — Convert counts to row-wise probabilities. For each source state (row), divide each transition count by the row total:

# Compute observed transition probabilities from the frequency table
trans_probs = @chain transitions begin
    groupby([:PrevState, :CurrState])
    @combine :N = length(:ID)
end
# Add row totals
row_totals = @chain trans_probs begin
    groupby(:PrevState)
    @combine :RowTotal = sum(:N)
end
trans_probs = leftjoin(trans_probs, row_totals; on = :PrevState)
@rtransform!(trans_probs, :Prob = :N / :RowTotal);

# Display as a pivot table
prob_wide = unstack(trans_probs, :PrevState, :CurrState, :Prob)
simple_table(prob_wide)
PrevState 1 2 3 4
1 0.987 0.0131 0.000113 0.000283
2 0.0209 0.964 0.0146 0.000114
3 0.000116 0.0127 0.987 0.000116
4 missing missing missing 1

Each row sums to 1 — these are the empirical transition probabilities.

Step 2 — Convert probabilities to logits. The logit for transitioning from state \(i\) to state \(j\) (relative to the reference state \(K\)) is:

\[ \tau_{ij} = \log\left(\frac{P_{ij}}{P_{iK}}\right) \]

In our 4-state model, state 4 is the reference category:

K = 4
# For each source state (row), compute logits relative to state K
init_logits = Matrix{Float64}(undef, K - 1, K - 1)
for src = 1:(K-1)
    row_probs = @rsubset(trans_probs, :PrevState == src)
    p_ref = only(@rsubset(row_probs, :CurrState == K).Prob)
    for dst = 1:(K-1)
        p = only(@rsubset(row_probs, :CurrState == dst).Prob)
        init_logits[src, dst] = log(p / p_ref)
    end
    @printf(
        "State %d → logits [τ_%d1, τ_%d2, τ_%d3] = [%8.3f, %8.3f, %8.3f]\n",
        src,
        src,
        src,
        src,
        init_logits[src, 1],
        init_logits[src, 2],
        init_logits[src, 3]
    )
end
State 1 → logits [τ_11, τ_12, τ_13] = [   8.157,    3.833,   -0.916]
State 2 → logits [τ_21, τ_22, τ_23] = [   5.209,    9.040,    4.852]
State 3 → logits [τ_31, τ_32, τ_33] = [   0.000,    4.691,    9.048]
TipUsing These as Initial Values

These data-derived logits are baseline estimates (no drug effect). They make excellent starting values for the θ parameters in the model. The drug effect slopes (slope₁₂, slope₂₃) can be initialized at zero, since the baseline logits already capture the average transition pattern.

Note that state 4 (the absorbing state) has very few transitions from it, so its row is fixed to all \(-\infty\) in the model rather than estimated.

6 Building the Markov Chain Model

Now that we’ve confirmed the Markovian nature of the data, let’s build a model. Our model has three interconnected components:

  1. PK sub-model: A one-compartment model that tracks drug concentration over time
  2. Drug effect: Concentration-dependent shifts on specific transition probabilities
  3. Markov chain: A 4-state discrete-time Markov model (3 active states + 1 absorbing)

6.1 The Full Model

markov_model = @model begin
    @param begin
        """
        Clearance (L/d)
        """
        θCL  RealDomain(lower = 0)
        """
        Volume (L)
        """
        θV  RealDomain(lower = 0)
        """
        Transition logits from State 1
        """
        θ₁₁  RealDomain()
        θ₁₂  RealDomain()
        θ₁₃  RealDomain()
        """
        Transition logits from State 2
        """
        θ₂₁  RealDomain()
        θ₂₂  RealDomain()
        θ₂₃  RealDomain()
        """
        Transition logits from State 3
        """
        θ₃₁  RealDomain()
        θ₃₂  RealDomain()
        θ₃₃  RealDomain()
        """
        Drug effect slope on 1→2 transition
        """
        slope₁₂  RealDomain()
        """
        Drug effect slope on 2→3 transition
        """
        slope₂₃  RealDomain()
    end

    @covariates ETA1 ETA2

    @pre begin
        CL = θCL * exp(ETA1)
        V = θV * exp(ETA2)
    end

    @dynamics begin
        Central' = -CL / V * Central
    end

    @vars begin
        conc = Central / V

        # Drug effects on transition probabilities (logit scale)
        drugep₁₂ = slope₁₂ * conc
        drugep₂₃ = slope₂₃ * conc

        # Transition logits with drug effect added
        θ₁₂de = θ₁₂ + drugep₁₂
        θ₂₃de = θ₂₃ + drugep₂₃
    end

    @derived begin
        # Time-varying transition matrices (logit parameterization)
        # Each matrix is k × (k-1) where k = 4 states
        # Row 4 is the absorbing state: -Inf means zero probability of leaving
:= [
            [
                θ₁₁ _θ₁₂de θ₁₃
                θ₂₁ θ₂₂ _θ₂₃de
                θ₃₁ θ₃₂ θ₃₃
                -Inf -Inf -Inf
            ] for (_θ₁₂de, _θ₂₃de) in zip(θ₁₂de, θ₂₃de)
        ]

        DV ~ FiniteMarkov(Pθ, [θ₁₁, θ₁₂, θ₁₃])
    end
end
PumasModel
  Parameters: θCL, θV, θ₁₁, θ₁₂, θ₁₃, θ₂₁, θ₂₂, θ₂₃, θ₃₁, θ₃₂, θ₃₃, slope₁₂, slope₂₃
  Random effects:
  Covariates: ETA1, ETA2
  Dynamical system variables: Central
  Dynamical system type: Matrix exponential
  Derived: DV, conc, drugep₁₂, drugep₂₃, θ₁₂de, θ₂₃de
  Observed: DV, conc, drugep₁₂, drugep₂₃, θ₁₂de, θ₂₃de

6.2 Step-by-Step Model Walkthrough

Let’s break this model down piece by piece.

6.2.1 The @param Block: What We’re Estimating

The model has 13 parameters:

Parameter Role Domain
θCL, θV PK clearance and volume Fixed (from prior PK analysis)
θ₁₁, θ₁₂, θ₁₃ Logits for transitions from State 1 to States 1, 2, 3 Unconstrained
θ₂₁, θ₂₂, θ₂₃ Logits for transitions from State 2 to States 1, 2, 3 Unconstrained
θ₃₁, θ₃₂, θ₃₃ Logits for transitions from State 3 to States 1, 2, 3 Unconstrained
slope₁₂ Drug effect on 1→2 transition Unconstrained
slope₂₃ Drug effect on 2→3 transition Unconstrained

The nine θ parameters define a \(3 \times 3\) baseline transition matrix (on the logit scale). Each row also has an implicit fourth entry (logit = 0) for State 4 (the absorbing state). So we actually model a \(4 \times 3\) logit matrix — 3 logits per row for 4 states.

TipParameter Count

For \(K\) states, a DTMM needs \(K \times (K-1)\) transition logit parameters (one row per source state, \(K-1\) logits per row). With \(K = 4\): \(4 \times 3 = 12\) parameters. But the absorbing state row is all \(-\infty\) (fixed), so we estimate \(3 \times 3 = 9\) logits, plus 2 drug effect slopes = 11 free parameters.

6.2.2 The @covariates and @pre Blocks: Individual PK

ETA1 and ETA2 are pre-computed individual PK random effects that are read as covariates from the dataset. The @pre block uses these to compute individual clearance and volume:

CL = θCL * exp(ETA1)     # Individual clearance
V  = θV  * exp(ETA2)     # Individual volume

6.2.3 The @dynamics Block: PK Concentration Over Time

A standard one-compartment oral PK model tracks the amount of drug in the central compartment:

Central' = -CL / V * Central

The concentration at each time point is then conc = Central / V.

6.2.4 The @vars Block: Drug Effects on Transitions

This is where pharmacology meets the Markov model. The drug concentration modifies two specific transition logits:

drugep₁₂ = slope₁₂ * conc    # Drug effect on State 1 → State 2
drugep₂₃ = slope₂₃ * conc    # Drug effect on State 2 → State 3

θ₁₂de = θ₁₂ + drugep₁₂      # Modified logit
θ₂₃de = θ₂₃ + drugep₂₃      # Modified logit

The drug effect is additive on the logit scale and linear in concentration (a slope model). This is the most common parameterization for drug effects in DTMMs.

NoteChoosing Which Transitions to Modify

Drug effects need not be placed on all transitions. The choice should be guided by:

  • Clinical hypothesis — which transitions does the drug affect?
  • Data support — rare transitions may not support drug effect estimation
  • Parsimony — each drug effect adds a parameter

In our model, the drug affects transitions toward worse states (1→2 and 2→3), which would be typical for a model where higher drug exposure exacerbates disease progression. Alternatively, in a beneficial drug model, you might place effects on transitions toward better states.

6.2.5 The @derived Block: Assembling the FiniteMarkov Distribution

The transition matrices are assembled as a vector of matrices, one per observation time. This is because the drug concentration (and therefore the transition probabilities) changes over time:

:= [
    [
        θ₁₁ _θ₁₂de θ₁₃        # Row 1: transitions from State 1
        θ₂₁ θ₂₂ _θ₂₃de      # Row 2: transitions from State 2
        θ₃₁ θ₃₂ θ₃₃         # Row 3: transitions from State 3
        -Inf -Inf -Inf          # Row 4: absorbing state (can't leave)
    ] for (_θ₁₂de, _θ₂₃de) in zip(θ₁₂de, θ₂₃de)
]

Each matrix in is \(4 \times 3\) — 4 rows (one per source state) and 3 columns (logits for transition to States 1, 2, and 3). The probability of transitioning to State 4 (the absorbing state) is implicit (the reference category).

The FiniteMarkov call:

DV ~ FiniteMarkov(Pθ, [θ₁₁, θ₁₂, θ₁₃])

takes two arguments:

  1. : The vector of transition matrices (one per observation time)
  2. [θ₁₁, θ₁₂, θ₁₃]: The initial state distribution (logits for States 1, 2, 3; State 4 logit is implicitly 0)
NoteStarting the Markov Chain

The initial state vector [θ₁₁, θ₁₂, θ₁₃] determines the probability of each state at the first observation. The first transition matrix in is not referenced — only the initial distribution is used for time 1. From time 2 onwards, each observation uses the corresponding transition matrix from .

6.3 The Transition Matrix Laid Out

To make the model concrete, here is the matrix of logits that the model assembles at a given time. This is exactly what goes into , with the implicit reference column for State 4 (logit \(= 0\)) shown for completeness:

\[ \begin{array}{c|cccc} \text{From} \,\backslash\, \text{To} & \text{State 1} & \text{State 2} & \text{State 3} & \text{State 4 (ref)} \\ \hline \text{State 1} & \theta_{11} & \theta_{12} + \text{slope}_{12}\, c & \theta_{13} & 0 \\ \text{State 2} & \theta_{21} & \theta_{22} & \theta_{23} + \text{slope}_{23}\, c & 0 \\ \text{State 3} & \theta_{31} & \theta_{32} & \theta_{33} & 0 \\ \text{State 4} & -\infty & -\infty & -\infty & 0 \end{array} \]

where \(c\) is the drug concentration at that time. The transition probabilities are obtained by applying the softmax to each row of this logit matrix. For source state \(i\), the probability of moving to state \(j\) is:

\[ P_{ij} = \frac{\exp(\tau_{ij})}{\sum_{l=1}^{4} \exp(\tau_{il})}, \]

where \(\tau_{ij}\) is the logit in row \(i\), column \(j\) of the array above (and \(\tau_{i4} = 0\) for every active row). For the absorbing State 4, the \(-\infty\) logits give \(P_{4j} = 0\) for \(j \neq 4\) and \(P_{44} = 1\).

The two drug effects (slope₁₂ and slope₂₃) cause the transition probabilities to be time-varying — they shift as drug concentration rises and falls after each dose.

7 Fitting the Model

We fit the model using NaivePooled estimation. Since individual PK variability is already handled via the ETA1/ETA2 covariates, no additional random effects are needed. We hold the PK parameters constant at their known values.

The initial parameter estimates combine the PK values from a prior analysis with the data-derived logits we computed earlier:

params = (
    θCL = 4.8,
    θV = 10.0,
    θ₁₁ = init_logits[1, 1],
    θ₁₂ = init_logits[1, 2],
    θ₁₃ = init_logits[1, 3],
    θ₂₁ = init_logits[2, 1],
    θ₂₂ = init_logits[2, 2],
    θ₂₃ = init_logits[2, 3],
    θ₃₁ = init_logits[3, 1],
    θ₃₂ = init_logits[3, 2],
    θ₃₃ = init_logits[3, 3],
    slope₁₂ = 0.0,
    slope₂₃ = 0.0,
)
(θCL = 4.8, θV = 10.0, θ₁₁ = 8.156624964190355, θ₁₂ = 3.832979798087693, θ₁₃ = -0.916290731874155, θ₂₁ = 5.209486152841421, θ₂₂ = 9.039552050995901, θ₂₃ = 4.852030263919617, θ₃₁ = 0.0, θ₃₂ = 4.6913478822291435, θ₃₃ = 9.048056708918736, slope₁₂ = 0.0, slope₂₃ = 0.0)
fpm = fit(
    markov_model,
    pop,
    params,
    NaivePooled();
    constantcoef = (:θCL, :θV),
    optim_options = (; show_trace = true, show_every = 20),
)
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter     Function value   Gradient norm 
     0     3.467011e+03     1.628651e+03
 * time: 0.09011411666870117
    20     3.419425e+03     1.886967e+01
 * time: 16.273836135864258
    40     3.419316e+03     4.406543e-02
 * time: 25.863924026489258
FittedPumasModel

Dynamical system type:          Matrix exponential

Number of subjects:                            100

Observation records:         Active        Missing
    DV:                       36600              0
    Total:                    36600              0

Number of parameters:      Constant      Optimized
                                  2             11

Likelihood approximation:              NaivePooled
Likelihood optimizer:                         BFGS

Termination Reason:                   GradientNorm
Log-likelihood value:                   -3419.3066

----------------------
            Estimate
----------------------
† θCL        4.8
† θV        10.0
  θ₁₁        8.1627
  θ₁₂        3.4056
  θ₁₃       -0.91548
  θ₂₁        5.2082
  θ₂₂        9.0382
  θ₂₃        4.1391
  θ₃₁       -6.37e-13
  θ₃₂        4.6913
  θ₃₃        9.0481
  slope₁₂    0.037161
  slope₂₃    0.03561
----------------------
† indicates constant parameters
NoteWhy NaivePooled?

NaivePooled treats each subject’s observation sequence as independent conditional on the model parameters and their covariates. The Markov structure already captures within-subject dependence through the conditional transition probabilities. This is the standard and most common estimation approach for DTMMs.

If you wanted to estimate IIV on the transition parameters, you would need LaplaceI instead. Estimation of Markov models with random effects can sometimes be difficult, and this tutorial follows the published tutorial of Ooi et al. (2025) in not estimating with random effects.

TipSetting Initial Values

The transition logit parameters default to 0 (equal probability for all states). If you have prior estimates (e.g., from observed transition frequencies), use them as initial values to speed up convergence. You can compute initial logits from observed transition probabilities as: \(\tau_{ij} = \log(P_{ij} / P_{iK})\) where \(P_{iK}\) is the probability for the reference state.

7.1 Examining the Estimates

simple_table(coeftable(fpm))
parameter constant estimate
θCL true 4.8
θV true 10
θ₁₁ false 8.16
θ₁₂ false 3.41
θ₁₃ false -0.915
θ₂₁ false 5.21
θ₂₂ false 9.04
θ₂₃ false 4.14
θ₃₁ false -6.37e-13
θ₃₂ false 4.69
θ₃₃ false 9.05
slope₁₂ false 0.0372
slope₂₃ false 0.0356

The logit-scale parameters are not directly interpretable. Let’s convert them to probabilities.

8 Interpreting the Results

8.1 Converting Logits to Transition Probabilities

We recover the actual transition probabilities at baseline (zero drug concentration) by assembling the logit matrix and applying the softmax to each row:

pars = coef(fpm)

# Baseline 4×4 logit matrix (zero drug concentration).
# The last column is the reference state (logit 0); the last row is the
# absorbing state (−Inf logits, so it can never leave State 4).
logit_matrix = [
    pars.θ₁₁ pars.θ₁₂ pars.θ₁₃ 0.0
    pars.θ₂₁ pars.θ₂₂ pars.θ₂₃ 0.0
    pars.θ₃₁ pars.θ₃₂ pars.θ₃₃ 0.0
    -Inf -Inf -Inf 0.0
]

# Convert each row of logits to transition probabilities with the softmax
P = mapslices(softmax, logit_matrix; dims = 2)
4×4 Matrix{Float64}:
 0.99109      0.00851461  0.000113116  0.00028256
 0.0210906    0.971553    0.00724125   0.0001154
 0.000116104  0.0126553   0.987113     0.000116104
 0.0          0.0         0.0          1.0

The baseline transition matrix can be displayed as a formatted table:

labels = ["State 1", "State 2", "State 3", "State 4"]
P_table = DataFrame(round.(P; digits = 6), labels)
insertcols!(P_table, 1, "From \\ To" => labels)
simple_table(P_table)
From \\ To State 1 State 2 State 3 State 4
State 1 0.991 0.00852 0.000113 0.000283
State 2 0.0211 0.972 0.00724 0.000115
State 3 0.000116 0.0127 0.987 0.000116
State 4 0 0 0 1
TipWhat to Look For
  • Large diagonal values indicate strong Markov behavior (subjects tend to stay in their current state)
  • The magnitude of State 4 probabilities shows how quickly subjects drop out from each state
  • Asymmetry between \(P_{12}\) and \(P_{21}\) reveals preferred directions of transitions

8.2 State Occupation Probabilities with probstable

Pumas provides the probstable function to compute how the probability of occupying each state evolves over time, accounting for the full sequence of time-varying transition matrices. This is one of the most useful diagnostic tools for Markov models:

pt = probstable(fpm)
simple_table(first(pt, 10))
id time DV_iprob1 DV_iprob2 DV_iprob3 DV_iprob4
1 0 0.991 0.00851 0.000113 0.000283
1 1 0.982 0.0167 0.000285 0.000564
1 2 0.974 0.0246 0.000514 0.000843
1 3 0.966 0.0322 0.000796 0.00112
1 4 0.958 0.0395 0.00113 0.0014
1 5 0.95 0.0466 0.00151 0.00167
1 6 0.943 0.0534 0.00193 0.00195
1 7 0.935 0.0599 0.0024 0.00222
1 8 0.928 0.0662 0.00291 0.00249
1 9 0.922 0.0722 0.00346 0.00276

The columns DV_iprob1 through DV_iprob4 give the probability of being in each state at each time point. These always sum to 1:

# Verify probabilities sum to 1
prob_cols = select(pt, :DV_iprob1, :DV_iprob2, :DV_iprob3, :DV_iprob4)
all_sums = [sum(row) for row in eachrow(prob_cols)]
println("All probability sums ≈ 1.0: ", all(x -> isapprox(x, 1.0), all_sums))
All probability sums ≈ 1.0: true

Let’s visualize the state occupation probabilities for a representative subject:

pt_sub1 = @rsubset(pt, :id == first(pt.id))
pt_long = stack(
    pt_sub1,
    [:DV_iprob1, :DV_iprob2, :DV_iprob3, :DV_iprob4];
    variable_name = :State,
    value_name = :Probability,
)
@rtransform!(
    pt_long,
    :State = replace(
        string(:State),
        "DV_iprob1" => "State 1",
        "DV_iprob2" => "State 2",
        "DV_iprob3" => "State 3",
        "DV_iprob4" => "State 4 (Absorbing)",
    )
);
plt_prob =
    data(pt_long) *
    mapping(:time => "Time", :Probability => "State Probability"; color = :State) *
    visual(Lines; linewidth = 2)
draw(
    plt_prob;
    axis = (; title = "State Occupation Probabilities — Subject $(first(pt.id))",),
    legend = (position = :bottom, title = "State", titlealignment = :left),
)

NoteInterpreting State Occupation Probabilities

These probabilities show the model’s prediction of the chance of being in each state at each time. As time progresses:

  • The absorbing state probability typically increases (more subjects have dropped out)
  • Active state probabilities may shift as drug concentrations change
  • The rate of change reflects both the baseline transition rates and the drug effect

8.3 Understanding the Drug Effect

The drug effects in our model operate on the logit scale. Let’s visualize how drug concentration changes the transition probabilities.

A positive slope₁₂ means that higher drug concentration increases the logit for the 1→2 transition, shifting probability toward State 2. Let’s compute this across a range of concentrations:

concentrations = 0.0:0.5:10.0

probs_12 = map(concentrations) do c
    softmax([pars.θ₁₁, pars.θ₁₂ + pars.slope₁₂ * c, pars.θ₁₃, 0.0])[2]
end

probs_23 = map(concentrations) do c
    softmax([pars.θ₂₁, pars.θ₂₂, pars.θ₂₃ + pars.slope₂₃ * c, 0.0])[3]
end

drug_df = DataFrame(
    Concentration = repeat(collect(concentrations), 2),
    Probability = vcat(probs_12, probs_23),
    Transition = vcat(
        fill("P(State 1 → State 2)", length(concentrations)),
        fill("P(State 2 → State 3)", length(concentrations)),
    ),
)
plt_drug =
    data(drug_df) *
    mapping(
        :Concentration => "Drug Concentration",
        :Probability => "Transition Probability";
        color = :Transition,
    ) *
    visual(Lines; linewidth = 2)
draw(
    plt_drug;
    axis = (; title = "Drug Effect on Transition Probabilities",),
    legend = (position = :bottom, title = "Transition", titlealignment = :left),
)

TipDrug Effect Interpretation

Because the drug effect is on the logit scale, the relationship between concentration and probability is sigmoidal (S-shaped), not linear. At low concentrations, the effect is approximately linear. At high concentrations, the probability saturates (it can never exceed 1). This built-in saturation is a natural advantage of the logit parameterization.

9 Inference and Model Evaluation

9.1 Parameter Uncertainty

FiniteMarkov models fitted with NaivePooled support standard error computation:

fpmi = infer(fpm)
[ Info: Calculating: variance-covariance matrix.
[ Info: Done.
Asymptotic inference results using sandwich estimator

Dynamical system type:          Matrix exponential

Number of subjects:                            100

Observation records:         Active        Missing
    DV:                       36600              0
    Total:                    36600              0

Number of parameters:      Constant      Optimized
                                  2             11

Likelihood approximation:              NaivePooled
Likelihood optimizer:                         BFGS

Termination Reason:                   GradientNorm
Log-likelihood value:                   -3419.3066

------------------------------------------------------------
            Estimate    SE          95.0% C.I.
------------------------------------------------------------
† θCL        4.8        NaN         [  NaN     ;  NaN     ]
† θV        10.0        NaN         [  NaN     ;  NaN     ]
  θ₁₁        8.1627     0.45064     [  7.2794  ;  9.0459  ]
  θ₁₂        3.4056     0.46148     [  2.5012  ;  4.3101  ]
  θ₁₃       -0.91548    0.83588     [ -2.5538  ;  0.72281 ]
  θ₂₁        5.2082     0.99104     [  3.2658  ;  7.1506  ]
  θ₂₂        9.0382     0.99691     [  7.0843  ; 10.992   ]
  θ₂₃        4.1391     1.0318      [  2.1169  ;  6.1614  ]
  θ₃₁       -6.37e-13   1.4186      [ -2.7805  ;  2.7805  ]
  θ₃₂        4.6913     1.0175      [  2.6971  ;  6.6856  ]
  θ₃₃        9.0481     0.99835     [  7.0913  ; 11.005   ]
  slope₁₂    0.037161   0.0044194   [  0.028499;  0.045823]
  slope₂₃    0.03561    0.0061912   [  0.023475;  0.047744]
------------------------------------------------------------
† indicates constant parameters

This provides standard errors, confidence intervals, and p-values for all estimated parameters (on the logit scale).

NoteInference on the Logit Scale

The confidence intervals from infer are on the logit scale. While you can interpret whether individual logit parameters are significantly different from zero, translating these to the probability scale requires care. The logistic is a multivariate transformation — confidence intervals on individual logits do not directly translate to confidence intervals on probabilities. For probability-scale uncertainty, simulation-based approaches (e.g., bootstrap) are recommended.

9.2 Log-Likelihood

The model log-likelihood can be used for model comparison:

ll = loglikelihood(fpm)
println("Log-likelihood: ", ll)
println("-2LL: ", -2 * ll)
Log-likelihood: -3419.3065604734097
-2LL: 6838.613120946819

The \(-2 \text{LL}\) value can be compared between nested models using the likelihood ratio test (the difference is approximately \(\chi^2\) distributed with degrees of freedom equal to the number of extra parameters).

9.3 Simulation and Model Checking

Markov models support simulation via simobs. This is the foundation for model evaluation: if the model is adequate, simulated data should reproduce the key features of the observed data.

sims = simobs(markov_model, pop, coef(fpm))
Simulated population (Vector{<:Subject})
  Simulated subjects: 100
  Simulated variables: DV, conc, drugep₁₂, drugep₂₃, θ₁₂de, θ₂₃de

9.4 Simulation Roundtripping

We can verify model identifiability by simulating data from the fitted model and re-estimating the parameters. If the model is well-identified, the re-estimated parameters should be close to the original:

sim_pop = Subject.(sims)
refit_fpm = fit(
    markov_model,
    sim_pop,
    coef(fpm),
    NaivePooled();
    constantcoef = (:θCL, :θV),
    optim_options = (; show_trace = true, show_every = 20),
)
[ Info: Checking the initial parameter values.
[ Info: The initial negative log likelihood and its gradient are finite. Check passed.
Iter     Function value   Gradient norm 
     0     3.496158e+03     6.586136e+02
 * time: 3.0040740966796875e-5
    20     3.489921e+03     1.445022e+01
 * time: 11.985846996307373
    40     3.488356e+03     8.734019e-01
 * time: 21.271727085113525
    60     3.488335e+03     2.436686e-02
 * time: 31.257397174835205
FittedPumasModel

Dynamical system type:          Matrix exponential

Number of subjects:                            100

Observation records:         Active        Missing
    DV:                       36600              0
    Total:                    36600              0

Number of parameters:      Constant      Optimized
                                  2             11

Likelihood approximation:              NaivePooled
Likelihood optimizer:                         BFGS

Termination Reason:                   GradientNorm
Log-likelihood value:                   -3488.3351

-----------------------
            Estimate
-----------------------
† θCL         4.8
† θV         10.0
  θ₁₁         7.3316
  θ₁₂         2.6501
  θ₁₃        -1.2993
  θ₂₁         5.2781
  θ₂₂         9.12
  θ₂₃         4.2492
  θ₃₁       -12.291
  θ₃₂        19.979
  θ₃₃        24.288
  slope₁₂     0.037031
  slope₂₃     0.027865
-----------------------
† indicates constant parameters
# Compare original and re-estimated parameters
simple_table(compare_estimates(; original = fpm, refit = refit_fpm))
parameter original refit
θCL 4.8 4.8
θV 10 10
θ₁₁ 8.16 7.33
θ₁₂ 3.41 2.65
θ₁₃ -0.915 -1.3
θ₂₁ 5.21 5.28
θ₂₂ 9.04 9.12
θ₂₃ 4.14 4.25
θ₃₁ -6.37e-13 -12.3
θ₃₂ 4.69 20
θ₃₃ 9.05 24.3
slope₁₂ 0.0372 0.037
slope₂₃ 0.0356 0.0279

10 References

  • Ooi QX, Plan E, Bergstrand M. A tutorial on pharmacometric Markov models. CPT: Pharmacometrics & Systems Pharmacology. 2025;14(2):197-216.
  • Girard P, Blaschke TF, Kastrissios H, Sheiner LB. A Markov mixed effect regression model for drug compliance. Statistics in Medicine. 1998;17(20):2313-2333.
  • Karlsson MO, Schoemaker RC, Kemp B, et al. A pharmacodynamic Markov mixed-effects model for the effect of temazepam on sleep. Clinical Pharmacology & Therapeutics. 2000;68(2):175-188.
  • Lacroix BD, Lovern MR, Stockis A, et al. A pharmacodynamic Markov mixed-effects model for determining the effect of exposure to certolizumab pegol on the ACR20 score in patients with rheumatoid arthritis. Clinical Pharmacology & Therapeutics. 2009;86(4):387-395.