DeepPumas Epidemiology: force-of-infection across heterogeneous countries

Author

Niklas Korsbo

DeepPumas allows us to integrate data-driven function identification that accounts for variability within different stratifications of our dataset. In pharmacology, this approach is often used to identify disease-specific functions that are partially individualized across patients. However, a population-subject split is only one of many possible and relevant ways to stratify data.

In this tutorial, we will use DeepPumas to analyze epidemiological data, focusing on a disease that exhibits both similarities and differences in its spread across countries. The nonlinear mixed effects (NLME) framework enables us to effectively manage data from heterogeneous sources, even when it is unbalanced. Each country may differ not only in the mechanisms of disease transmission but also in the availability and quality of data. A key question is whether we can leverage abundant data from certain countries to enhance predictions in countries where data is sparse.

This is a novel application area not only for DeepNLME but also for NLME frameworks in general. While our analysis will focus on a specific question, DeepNLME enables us to separate common trends from individual variations across datasets, opening up a broad range of possible questions to explore.

Here, we will study the force-of-infection of a virus that individuals typically acquire over the course of their lives and do not eliminate. In this case, our data is cross-sectional rather than longitudinal: for each country, we observe the percentage of people testing positive for the virus at different ages, and this percentage increases with age.

We will start by loading a few necessary packages and setting a theme to enhance the visual clarity of our plots.

using DeepPumas
using CairoMakie
using Random
using DataFramesMeta
set_theme!(deep_light())

1 Synthetic data generation

We next generate a synthetic data set in which the fraction of positive tests is stratified by country and age, and in which the country significantly affects the relationship between age and the positive test rate.

We start by defining a force-of-infection function \(λ\) that depends on age.

λ(age) =
    pdf.(Normal(16, 7), age) .+ 1.2 .* pdf.(Normal(37, 10), age) .+
    0.4 .* pdf.(Normal(65, 14), age) .+ 0.02

age = 0:0.1:100
fig = Figure()
ax = Axis(
    fig[1, 1],
    limits = (nothing, nothing, 0, nothing),
    xlabel = "Age (years)",
    ylabel = "Force of infection",
)
lines!(ax, age, λ)
display(fig);

Force-of-infection of the data-generating function. Two peaks of infectiousness are located roughly at late school age and at the age of the students' parents. A third, smaller contribution is centered around age 65.

Next, we define a data-generating model that uses the force-of-infection, applying a country-specific scaling factor to simulate synthetic data for various countries, each with a different number of measurements. For reference, the complete code for the data-generating model and the data generation is provided below, though the model details are not essential for following this tutorial. We also plot the generated data.

datamodel = @model begin
    @param begin
        σ ∈ RealDomain(; lower = 0, init = 10)
        tvc ∈ RealDomain(; lower = 0)
        ω ∈ RealDomain(; lower = 0)
    end
    @random η ~ Normal(0, ω)
    @pre begin
        c = tvc * exp(η)
        _λ = t -> c * λ(t)
    end
    @init S = 1
    @dynamics begin
        S' = -_λ(t) * S
    end
    @derived begin
        """
        Positive test fraction
        """
        PosFrac ~ @. Beta(σ * (1 - 0.99S), σ * (0.99S))
    end
    @observed begin
        iλ = @. _λ[1](t)
    end
end

countries = [
    "Austria",
    "Belgium",
    "Bulgaria",
    "Croatia",
    "Cyprus",
    "Czech Republic",
    "Denmark",
    "Estonia",
    "Finland",
    "France",
    "Germany",
    "Greece",
    "Hungary",
    "Ireland",
    "Italy",
    "Latvia",
    "Lithuania",
    "Sweden",
]
sparse_countries = ["Brazil", "Vietnam"]

data_params = (; tvc = 0.3, σ = 1000.0, ω = 0.6)
obstimes = vcat(0:5, 10:5:100)
Random.seed!(123)

sim_data = map(eachindex(countries)) do i
    simobs(
        datamodel,
        Subject(; id = countries[i]),
        data_params;
        obstimes = sort(sample(obstimes, rand(1:20); replace = false)),
    )
end

sim_data_sparse = map(eachindex(sparse_countries)) do i
    simobs(
        datamodel,
        Subject(; id = sparse_countries[i]),
        data_params;
        obstimes = sort(sample(1:100, rand(1:2); replace = false)),
    )
end

all_countries = vcat(countries, sparse_countries)
all_sims = vcat(sim_data, sim_data_sparse)
all_data = Subject.(all_sims)

train_data = all_data[1:end-4]
val_data = all_data[end-3:end]

plotgrid(
    all_data;
    xlabel = "AGE",
    ylabel = "Fraction positive tests",
    legend = false,
    data = (; markersize = 10),
    axis = (; yticks = 0:0.25:1),
    title = (s, i) -> "$(s.id) ($(s.id in getfield.(val_data, :id) ? "Test" : "Train"))",
)

Synthetic data in which the fraction of people infected by a life-long virus increases differently with age across countries. The countries differ not only in how the infection spreads, but also in the quantity and quality of the collected data. The data is split into a training and a test data set. Only data from countries in the training data set will be used to fit the model.
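The noise-free curve behind each panel can be reproduced without DeepPumas. The following is a minimal plain-Julia sketch, assuming only the data-generating equations above: `S' = -c * λ(t) * S` with `S(0) = 1` has the solution `S(a) = exp(-c * Λ(a))`, where `Λ(a)` is the integral of `λ` from 0 to `a`. The helper names `normpdf`, `foi`, and `susceptible_fraction` are hypothetical and exist only for this illustration, and the trapezoid rule is a stand-in for a proper ODE solver.

```julia
# Hand-written normal density so that the sketch needs no packages.
normpdf(x, μ, σ) = exp(-0.5 * ((x - μ) / σ)^2) / (σ * sqrt(2π))

# Same functional form as the tutorial's λ, written for a scalar age.
foi(a) = normpdf(a, 16, 7) + 1.2 * normpdf(a, 37, 10) + 0.4 * normpdf(a, 65, 14) + 0.02

# Accumulate the integral of λ with the trapezoid rule, then apply
# the closed-form solution S(a) = exp(-c * Λ(a)).
function susceptible_fraction(c; ages = 0:0.1:100)
    Λ = zeros(length(ages))
    for k in 2:length(ages)
        h = ages[k] - ages[k-1]
        Λ[k] = Λ[k-1] + 0.5 * h * (foi(ages[k-1]) + foi(ages[k]))
    end
    return exp.(-c .* Λ)
end

ages = 0:0.1:100
S_typical = susceptible_fraction(0.3; ages)   # c = tvc, i.e. η = 0
positive_fraction = 1 .- 0.99 .* S_typical    # mean of the Beta in `datamodel`
```

A country with a positive `η` has a larger scaling factor `c = tvc * exp(η)` and therefore a curve that rises faster with age; `susceptible_fraction(0.3 * exp(0.6))` gives one such example.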

2 Modeling the data

Now, observing this data but assuming we do not know the original data-generating model, we can define a simple model. We represent the fraction of the population that is susceptible to the virus as \(S\), starting from 1 for newborns. This fraction decreases with age at a rate given by the product of an age-dependent force-of-infection and the proportion of people who are still susceptible at that age. Note that while our independent variable is age, DeepPumas represents it as time t.

The susceptible fraction of the population for country \(i\), \(S_i\), is defined as follows:

\[\begin{equation} \frac{dS_i}{d\mathrm{Age}} = - λ_i(\mathrm{Age}) \cdot S_i \end{equation}\]

where \(λ_i(Age)\) is unknown. We use a neural network to estimate \(λ_i(Age) \approx NN(Age, \eta_i)\), where we assume a single, shared functional form for the force-of-infection across countries. Rather than modeling it with entirely independent functions, we allow the function to be adjusted by a single, tunable random effect, \(\eta_i\), specific to each country. When fitting to a marginalizing likelihood (FO, FOCE, LaplaceI), this single degree of freedom — \(\eta\) — allows us to account for cross-country variability effectively.

The theoretical implications of using marginalizing likelihoods in NLME models go beyond the scope of this tutorial, but in essence, this approach allows the embedded neural network to capture common patterns across the data, while relying on a single dimension of variability that meaningfully influences outcomes across countries. Using two random effects would allow us to capture an additional dimension of variability.
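As a rough sketch of what marginalizing means here, in standard NLME notation that is not taken from the tutorial's code: let \(θ\) denote the fixed effects (including the neural network weights) and \(y_i\) the observations from country \(i\). The marginal likelihood integrates the random effect out of the joint density:

\[ L(θ) = \prod_i \int p(y_i \mid η_i, θ) \, p(η_i) \, dη_i \]

FO, FOCE, and LaplaceI are different approximations of this integral. Because \(η_i\) is integrated out rather than optimized freely for each country, the shared parameters \(θ\) have to explain all countries through one function, with \(η_i\) absorbing only what the prior \(p(η_i)\) makes plausible.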

The neural network itself is a simple multi-layer perceptron (MLP) with normalized age, \(Age/100\), and \(\eta\) as inputs. It consists of two hidden layers, each with five nodes using the tanh activation function, and a single output node with a softplus activation function to ensure positive output.
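The architecture described above can be sketched in plain Julia. This is an illustrative stand-in, not DeepPumas' `MLPDomain` implementation: the `TinyMLP` type, the `softplus_sketch` helper, the random initialization, and the presence of a bias term in every layer are all assumptions made for the sketch.

```julia
using Random

# Smooth, strictly positive output activation.
softplus_sketch(x) = log1p(exp(x))

# Hypothetical container: 2 inputs -> 5 tanh -> 5 tanh -> 1 softplus.
struct TinyMLP
    W1::Matrix{Float64}
    b1::Vector{Float64}
    W2::Matrix{Float64}
    b2::Vector{Float64}
    W3::Matrix{Float64}
    b3::Vector{Float64}
end

# Random weights purely for illustration; a fitted model would supply these.
TinyMLP(rng::AbstractRNG) = TinyMLP(
    0.5 .* randn(rng, 5, 2), zeros(5),
    0.5 .* randn(rng, 5, 5), zeros(5),
    0.5 .* randn(rng, 1, 5), zeros(1),
)

# Forward pass: inputs are the normalized age and the random effect η.
function (m::TinyMLP)(age_scaled, η)
    x = [age_scaled, η]
    h1 = tanh.(m.W1 * x .+ m.b1)
    h2 = tanh.(m.W2 * h1 .+ m.b2)
    return softplus_sketch.(m.W3 * h2 .+ m.b3)  # 1-element vector
end

nn = TinyMLP(MersenneTwister(1))
λ_sketch = only(nn(40 / 100, 0.0))  # force-of-infection at age 40 with η = 0
```

The softplus output guarantees a positive force-of-infection, which in turn keeps the susceptible fraction from ever increasing with age.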

The data contains noise, which is likely related to the population size, \(N\), used to compute the fraction of positive test results for each age group (with ages binned in 1-year intervals). While in some datasets we might have direct access to \(N\), here we estimate it. We model the observed fraction of positive tests, whose expected value is \(1 - S\), as samples from a Beta((1 - S) * N, S * N) distribution. To avoid numerical and boundary issues, we modify this slightly to Beta(abs(N * (1 - 0.99 * S)), abs(N * 0.99 * S)).
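To see why \(N\) behaves like a sample size, it helps to write out the moments of this Beta distribution. With \(α = N(1 - 0.99S)\) and \(β = 0.99NS\), both of which are already non-negative for \(0 \le S \le 1\), we have \(α + β = N\), so the standard Beta formulas give:

\[ \mathrm{E}[\mathrm{PosFrac}] = \frac{α}{α + β} = 1 - 0.99S, \qquad \mathrm{Var}[\mathrm{PosFrac}] = \frac{αβ}{(α + β)^2 (α + β + 1)} = \frac{(1 - 0.99S) \cdot 0.99S}{N + 1} \]

The variance has the same shape as that of a binomial proportion, \(p(1 - p)/N\), with \(p = 1 - 0.99S\). As a worked example, at the initial value \(N = 1000\) and \(S = 0.5\), the mean is \(0.505\) and the standard deviation is \(\sqrt{0.505 \cdot 0.495 / 1001} \approx 0.016\).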

model = @model begin
    @param begin
        N ∈ RealDomain(; lower = 0.0, init = 1000.0)
        λ ∈ MLPDomain(2, 5, 5, (1, softplus))
    end
    @random η ~ Normal(0, 0.1)
    @init S = 1
    @dynamics begin
        S' = -λ(t / 100, η)[1] * S
    end
    @derived PosFrac ~ @. Beta(abs(N * (1 - 0.99S)), abs(N * 0.99S))
    @observed begin
        iλ = @. only(λ(t / 100, η))
    end
end

With the model defined, we fit it by maximum a posteriori (MAP) estimation, using the Laplace approximation of the marginal likelihood, MAP(LaplaceI()).

fpm = fit(
    model,
    train_data,
    init_params(model),
    MAP(LaplaceI());
    checkidentification = false,
    optim_options = (; show_every = 100),
)

Having fitted the model, we now apply it to all data — including the test data we withheld during training.

Our predict function generates two types of predictions. The first is the “Central tendency”, also known as the “Population prediction”. This is a true prediction that does not “peek” at the data being predicted. Since we do not have covariates or additional information beyond the observed infection proportion, this prediction will be identical across all countries. For this prediction, we set the \(\eta\) value to the mean of the prior distribution.

The second type of prediction uses \(\eta\) set to the mean of the approximated random effect posterior distribution, which is based on all the observations available from the country in question. This is often referred to as an “Individual prediction”.
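The difference between the two predictions can be written compactly. This is a conceptual sketch in standard NLME notation, not a description of DeepPumas internals. The population prediction evaluates the model at \(η = 0\), the mean of the prior \(p(η)\). An empirical Bayes estimate for country \(i\) is commonly taken as the posterior mode, which coincides with the mean of a Gaussian (Laplace) approximation of the posterior:

\[ \hat{η}_i = \arg\max_{η} \left[ \log p(y_i \mid η, \hat{θ}) + \log p(η) \right] \]

Here \(\hat{θ}\) denotes the fitted parameters and \(y_i\) the observations from country \(i\). With few observations, the prior term \(\log p(η)\) dominates and \(\hat{η}_i\) stays close to 0; with no observations at all, the individual prediction reduces to the population prediction.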

Our goal is to make accurate predictions of how the force-of-infection and the burden of disease change with age in countries with sparse data. To evaluate the effectiveness of the model, we make individual predictions for all countries — including those withheld from training — and plot the results, overlaying them with the observed data. We also overlay the latent, noise-free, truth that we have available since we know how the data was generated.

plotgrid(
    predict(fpm, all_data; obstimes = 1:100);
    xlabel = "AGE",
    ylabel = "Fraction positive tests",
    pred = (; label = "Central tendency", color = (:black, 0.5)),
    ipred = (; label = "Individual prediction"),
    title = (s, i) -> "$(s.id) ($(s.id in getfield.(val_data, :id) ? "Test" : "Train"))",
    data = (; markersize = 10),
    axis = (; yticks = 0:0.25:1),
)
plotgrid!(
    simobs(
        datamodel,
        all_data,
        data_params,
        getfield.(all_sims, :randeffs);
        simulate_error = false,
        obstimes = 1:100,
    );
    sim = (; linestyle = :dash, markersize = 0, linewidth = 2, label = "Truth"),
)

The fitted model captures the observed data and underlying trends remarkably well, even for countries like Vietnam, which in this case has only two observations. Without leveraging information on typical infectiousness patterns from other countries, estimating the full infection-age curve for Vietnam would have been impossible.

The model accurately matches positive test rates by identifying a country-specific functional form for the force-of-infection. This estimated force-of-infection curve can, of course, be extracted directly from the fitted model.

model_ebes = empirical_bayes(fpm.model, all_data, coef(fpm), FOCE())
plotgrid(
    simobs(fpm.model, all_data, coef(fpm), model_ebes; obstimes = age);
    title = (s, i) -> "$(s.id)",
    sim = (; plot_type = :lines, linewidth = 2, label = "Predicted", color = Cycled(2)),
    observation = :iλ,
    ylabel = "λ(Age)",
)

# Since we use synthetic data, we have the luxury of extracting the ground truth for comparison.
plotgrid!(
    simobs(
        datamodel,
        all_data,
        data_params,
        getfield.(all_sims, :randeffs);
        obstimes = age,
    );
    sim = (; plot_type = :lines, linewidth = 2, label = "Actual", color = Cycled(1)),
    observation = :iλ,
)

Here again, we see that the model estimates align closely with the true values used to generate the data. The model captures the main trends effectively, even in sparsely observed test data. Notably, Vietnam — excluded from model fitting, with only two observations and a distinct force-of-infection profile — was still well-predicted.

3 Conclusions

In summary, the model successfully:

Identified a reasonable functional form for the force-of-infection,
Individualized this force-of-infection using a single random effect, and
Found an appropriate transformation of the random effect distribution, making the prior informative. This transformation helps encode information from rich data sources, guiding accurate predictions even in countries with sparse data.

This tutorial provides just one example of how DeepNLME and DeepPumas can elucidate complex dynamics using data from heterogeneous sources. The same methodology can be applied to more intricate models, such as SIR models where time (rather than age) is the independent variable. Furthermore, after characterizing between-country (or entity-specific) variability through random effects, we can explore predictive patterns in geographical or demographic data to help account for some of this variability.

Reuse

CC BY-SA 4.0