Data Representation in Pumas

Author

Vijay Ivaturi

This guide is designed to help you prepare an existing analysis dataset for use in Pumas, focusing on datasets that already include dosing events, observations, and covariates. It is not intended to demonstrate how to create analysis-ready datasets from source SDTM and ADaM datasets. Instead, the goal is to ensure your dataset is correctly structured and formatted before running your first model in Pumas.

Typically, the process begins by reading a dataset from a spreadsheet or tabular format, such as Excel or CSV, into a DataFrame in Julia. This DataFrame serves as the foundation for analysis. The read_pumas function is then used to convert this DataFrame into a Population object, which is essential for pharmacometric modeling in Pumas. Besides performing this conversion, read_pumas runs checks similar to those NM-TRAN applies when a NONMEM control stream is run, so that problems in the data surface before model execution.

The sections below introduce the tabular layout expected in various use cases, followed by a detailed description of read_pumas and its requirements.

1 Prerequisites

  1. Basic Julia Proficiency: A fundamental understanding of the Julia programming language is beneficial.
  2. Familiarity with DataFrames: Working knowledge of the DataFrames.jl or DataFramesMeta.jl package is helpful for pivoting data (long-to-wide format) and general data wrangling tasks.
  3. Pharmacometric Concepts: A basic background in pharmacokinetics/pharmacodynamics (PK/PD) modeling or general NLME concepts ensures that terms such as “dependent variables,” “dosing events,” and “observations” are clear.
  4. CSV/Spreadsheet Handling: Experience reading in files (CSV or similar formats) and handling missing values simplifies the data import process.
  5. NM-TRAN Data Format (Optional): Knowledge of how NM-TRAN-formatted datasets typically look can assist in drawing parallels with Pumas data requirements, though it is not strictly required.

2 Learning Goals

By the end of this tutorial, participants will be able to:

  • Prepare and structure datasets with single and multiple dependent variables in the wide format required by Pumas.
  • Specify and manage different types of dosing information, including essential and advanced dosing columns.
  • Differentiate between time-dependent and time-independent covariates and handle them appropriately in datasets.
  • Address and manage missing data effectively, ensuring data integrity for modeling.
  • Utilize the read_pumas function to convert datasets into Population objects, performing necessary validations and checks.
  • Inspect and navigate Population objects in Pumas, understanding the structure and components of the data.
  • Apply data wrangling techniques to transform datasets from long to wide format, ensuring compatibility with Pumas.
  • Understand the key differences between Pumas and NONMEM data formats, facilitating a smoother transition for users familiar with NONMEM.

3 Representation of Dependent Variables (Wide Format)

3.1 Rationale for Wide Format

Pumas adopts a wide format for datasets containing multiple dependent variables because it provides several practical advantages:

  1. Clear Separation of Measurement Types: Storing each analyte or measurement type in its own column clarifies which values belong to which endpoint. This clarity reduces the need for additional identifiers (e.g., CMT, DVID) that might complicate the interpretation of rows, especially in models that handle two or more dependent variables simultaneously.

  2. Facilitated Model Specification: In a wide format, each dependent variable can be referenced directly by its column name. This approach simplifies the process of writing models that involve multiple analytes or linked processes (such as PK/PD), making it more transparent where each measurement is used in the model definition.

  3. Flexible Handling of Missing Data: When multiple dependent variables share observation times, it is common for one to be missing while another is present. A wide format makes it easier to store these missing entries independently for each DV without overlapping or clashing in a single “long” DV column.

  4. Consistency of Output: Pumas similarly returns results in a wide format, maintaining alignment between model inputs and outputs. This consistent structure can streamline subsequent data checks, diagnostics, and reporting.

For those accustomed to NM-TRAN’s single DV column plus a CMT or DVID, it can initially appear more convenient to keep a single column for all dependent variables. However, separating them into multiple columns aligns well with the data structures in Julia (e.g., DataFrame columns), reduces indexing overhead, and promotes clearer, more maintainable models when multiple DVs are involved.

3.2 Single Dependent Variable

If a study measures only one type of dependent variable (for example, a single plasma drug concentration), the data can be arranged in a single column (commonly named DV):

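| ID | TIME | AMT | DV  | WT | AGE | SEX |
|----|------|-----|-----|----|-----|-----|
| 1  | 0    | 100 |     | 70 | 45  | M   |
| 1  | 1    | 0   | 8.0 | 70 | 45  | M   |
| 1  | 2    | 0   | 7.1 | 70 | 45  | M   |
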
  • For single-DV datasets, typically no pivoting is necessary; simply specify observations = [:DV] in read_pumas.
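
As a minimal sketch, the single-DV table above can be written directly as a DataFrame (assuming DataFrames.jl is installed). The blank DV on the dosing row becomes missing; the values are the illustrative ones from the table, and the names df_single, dose_rows, and all_dose_dvs_missing are chosen here for illustration.

```julia
using DataFrames

# Single-DV dataset from the table above; the dose row carries no observation,
# so its DV entry is `missing`.
df_single = DataFrame(
    ID = [1, 1, 1],
    TIME = [0, 1, 2],
    AMT = [100, 0, 0],
    DV = [missing, 8.0, 7.1],
    WT = [70, 70, 70],
    AGE = [45, 45, 45],
    SEX = ["M", "M", "M"],
)

# Dose rows (AMT > 0) should have no observation recorded.
dose_rows = df_single[df_single.AMT .> 0, :]
all_dose_dvs_missing = all(ismissing, dose_rows.DV)
```

With Pumas loaded, a table like this would typically be passed on as read_pumas(df_single; id = :ID, time = :TIME, amt = :AMT, observations = [:DV], covariates = [:WT, :AGE, :SEX]).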

3.3 Multiple Dependent Variables

If multiple dependent variables exist (for example, a parent drug and a metabolite, or PK and PD markers), Pumas requires a wide format. In this format, each DV appears in its own column:

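| ID | TIME | AMT | DV_parent | DV_metabolite | WT | AGE | SEX |
|----|------|-----|-----------|---------------|----|-----|-----|
| 1  | 0    | 100 |           |               | 70 | 45  | M   |
| 1  | 1    | 0   | 8.0       | 1.2           | 70 | 45  | M   |
| 1  | 2    | 0   | 6.9       | 1.5           | 70 | 45  | M   |
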
  • NM-TRAN files often store multiple DVs in a long format. In Pumas, each DV must be in a separate column.
  • A “pivot” or “unstack” operation may be necessary if the original data has a single DV column with an identifier column (e.g., CMT or DVID).
Tip

Observation variables in Pumas do not have to be uppercase letters. For illustration purposes while comparing with NONMEM, we use uppercase DV column names in these examples because lowercase letters were not supported prior to NONMEM v7.2. Descriptive names such as conc or painscore are customary in Pumas: because the variables are not in one common column, each can be named for what it measures.

This process is known in data wrangling terminology as pivoting (specifically, “pivot wider”, since a long table is spread into a wide one). In Julia’s DataFrames.jl package, the function to use is unstack. Below is a step-by-step guide.

Detailed Example: Converting an NM-TRAN Style Dataset

Suppose you have a DataFrame named df that looks something like this:

using DataFramesMeta

df = DataFrame(
    ID = 1,
    TIME = repeat([0; 24:12:48; 72:24:120], inner = 2),
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
14×9 DataFrame
Row ID TIME DV CMT EVID AMT WT AGE SEX
Int64 Int64 Float64? Int64 Int64 Int64? Float64 Int64 Int64
1 1 0 missing 1 1 100 66.7 50 1
2 1 0 100.0 2 0 missing 66.7 50 1
3 1 24 9.2 1 0 missing 66.7 50 1
4 1 24 49.0 2 0 missing 66.7 50 1
5 1 36 8.5 1 0 missing 66.7 50 1
6 1 36 32.0 2 0 missing 66.7 50 1
7 1 48 6.4 1 0 missing 66.7 50 1
8 1 48 26.0 2 0 missing 66.7 50 1
9 1 72 4.8 1 0 missing 66.7 50 1
10 1 72 22.0 2 0 missing 66.7 50 1
11 1 96 3.1 1 0 missing 66.7 50 1
12 1 96 28.0 2 0 missing 66.7 50 1
13 1 120 2.5 1 0 missing 66.7 50 1
14 1 120 33.0 2 0 missing 66.7 50 1

Here, CMT=1 corresponds to one dependent variable (e.g., analyte 1), and CMT=2 corresponds to another dependent variable (analyte 2).

Inspect the Dataset

  1. Rows: Each row is an event record (dose or observation).

  2. Columns:

    • ID: Subject identifier (all = 1 in this small example).
    • TIME: Time of the event or observation.
    • DV: Dependent variable (could be concentration of analyte 1 or analyte 2).
    • CMT: Distinguishes the analyte or the “compartment.”
    • EVID: Event ID (1 = dosing event, 0 = observation).
    • AMT: The dose amount; only for dose rows (EVID=1).
    • Additional covariates: WT, AGE, SEX.

Clone CMT as DVID

Pumas uses CMT for specifying dose compartments. It won’t help to identify different DVs in the final wide format, because CMT must be set to missing for observations in Pumas. Hence, create a new column DVID (which is typical in some NM-TRAN data structures):

@transform! df :DVID = :CMT
14×10 DataFrame
Row ID TIME DV CMT EVID AMT WT AGE SEX DVID
Int64 Int64 Float64? Int64 Int64 Int64? Float64 Int64 Int64 Int64
1 1 0 missing 1 1 100 66.7 50 1 1
2 1 0 100.0 2 0 missing 66.7 50 1 2
3 1 24 9.2 1 0 missing 66.7 50 1 1
4 1 24 49.0 2 0 missing 66.7 50 1 2
5 1 36 8.5 1 0 missing 66.7 50 1 1
6 1 36 32.0 2 0 missing 66.7 50 1 2
7 1 48 6.4 1 0 missing 66.7 50 1 1
8 1 48 26.0 2 0 missing 66.7 50 1 2
9 1 72 4.8 1 0 missing 66.7 50 1 1
10 1 72 22.0 2 0 missing 66.7 50 1 2
11 1 96 3.1 1 0 missing 66.7 50 1 1
12 1 96 28.0 2 0 missing 66.7 50 1 2
13 1 120 2.5 1 0 missing 66.7 50 1 1
14 1 120 33.0 2 0 missing 66.7 50 1 2

Now :DVID holds the “type” of DV.

Adjust CMT for Dosing

Next, we set :CMT to missing for observation rows (EVID == 0) and leave it as is for dosing rows:

@rtransform! df :CMT = :EVID != 0 ? :CMT : missing
14×10 DataFrame
Row ID TIME DV CMT EVID AMT WT AGE SEX DVID
Int64 Int64 Float64? Int64? Int64 Int64? Float64 Int64 Int64 Int64
1 1 0 missing 1 1 100 66.7 50 1 1
2 1 0 100.0 missing 0 missing 66.7 50 1 2
3 1 24 9.2 missing 0 missing 66.7 50 1 1
4 1 24 49.0 missing 0 missing 66.7 50 1 2
5 1 36 8.5 missing 0 missing 66.7 50 1 1
6 1 36 32.0 missing 0 missing 66.7 50 1 2
7 1 48 6.4 missing 0 missing 66.7 50 1 1
8 1 48 26.0 missing 0 missing 66.7 50 1 2
9 1 72 4.8 missing 0 missing 66.7 50 1 1
10 1 72 22.0 missing 0 missing 66.7 50 1 2
11 1 96 3.1 missing 0 missing 66.7 50 1 1
12 1 96 28.0 missing 0 missing 66.7 50 1 2
13 1 120 2.5 missing 0 missing 66.7 50 1 1
14 1 120 33.0 missing 0 missing 66.7 50 1 2
  • When EVID=1, we keep the original compartment number (e.g., 1 if dosing to compartment 1).
  • When EVID=0 (observation), we set it to missing because for Pumas, the cmt is not needed for the observation itself.

Unstack (Pivot) to Wide Format

Use unstack from DataFrames.jl to pivot the dataset. The idea is:

  • Column Key: The variable whose unique values will form the new columns (here, :DVID).
  • Value Column: The variable whose values are spread across the new columns (here, :DV).
wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
8×10 DataFrame
Row ID TIME CMT EVID AMT WT AGE SEX DV_1 DV_2
Int64 Int64 Int64? Int64 Int64? Float64 Int64 Int64 Float64? Float64?
1 1 0 1 1 100 66.7 50 1 missing missing
2 1 0 missing 0 missing 66.7 50 1 missing 100.0
3 1 24 missing 0 missing 66.7 50 1 9.2 49.0
4 1 36 missing 0 missing 66.7 50 1 8.5 32.0
5 1 48 missing 0 missing 66.7 50 1 6.4 26.0
6 1 72 missing 0 missing 66.7 50 1 4.8 22.0
7 1 96 missing 0 missing 66.7 50 1 3.1 28.0
8 1 120 missing 0 missing 66.7 50 1 2.5 33.0

Details:

  1. unstack(df, :DVID, :DV) says “take the data from the df DataFrame, create new columns based on the unique values in DVID, and fill these columns with the values from the DV column.”
  2. renamecols = x -> Symbol(:DV_, x) renames the resulting columns from “1”, “2” to “DV_1”, “DV_2”.

After this step, wide_df will have separate columns for each DV type:

  • DV_1 for DVID == 1 (originally CMT == 1)
  • DV_2 for DVID == 2 (originally CMT == 2)

Final Structure

The final wide_df is now in the structure that Pumas can parse for modeling:

  • ID
  • TIME
  • EVID
  • AMT
  • CMT (dosing compartments if needed)
  • DV_1 (observations for analyte 1)
  • DV_2 (observations for analyte 2)
  • Covariates (e.g., WT, AGE, SEX)

Handling Mismatched Observation Times

What if the times do not align between the two DVs? For example:

df = DataFrame(
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
14×9 DataFrame
Row ID TIME DV CMT EVID AMT WT AGE SEX
Int64 Int64 Float64? Int64 Int64 Int64? Float64 Int64 Int64
1 1 0 missing 1 1 100 66.7 50 1
2 1 0 100.0 2 0 missing 66.7 50 1
3 1 12 9.2 1 0 missing 66.7 50 1
4 1 24 49.0 2 0 missing 66.7 50 1
5 1 32 8.5 1 0 missing 66.7 50 1
6 1 36 32.0 2 0 missing 66.7 50 1
7 1 44 6.4 1 0 missing 66.7 50 1
8 1 48 26.0 2 0 missing 66.7 50 1
9 1 66 4.8 1 0 missing 66.7 50 1
10 1 72 22.0 2 0 missing 66.7 50 1
11 1 90 3.1 1 0 missing 66.7 50 1
12 1 96 28.0 2 0 missing 66.7 50 1
13 1 112 2.5 1 0 missing 66.7 50 1
14 1 120 33.0 2 0 missing 66.7 50 1
  1. Repeat the same pivot:
@chain df begin
    @transform! :DVID = :CMT
    @rtransform! :CMT = :EVID == 0 ? missing : :CMT
end

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))
14×10 DataFrame
Row ID TIME CMT EVID AMT WT AGE SEX DV_1 DV_2
Int64 Int64 Int64? Int64 Int64? Float64 Int64 Int64 Float64? Float64?
1 1 0 1 1 100 66.7 50 1 missing missing
2 1 0 missing 0 missing 66.7 50 1 missing 100.0
3 1 12 missing 0 missing 66.7 50 1 9.2 missing
4 1 24 missing 0 missing 66.7 50 1 missing 49.0
5 1 32 missing 0 missing 66.7 50 1 8.5 missing
6 1 36 missing 0 missing 66.7 50 1 missing 32.0
7 1 44 missing 0 missing 66.7 50 1 6.4 missing
8 1 48 missing 0 missing 66.7 50 1 missing 26.0
9 1 66 missing 0 missing 66.7 50 1 4.8 missing
10 1 72 missing 0 missing 66.7 50 1 missing 22.0
11 1 90 missing 0 missing 66.7 50 1 3.1 missing
12 1 96 missing 0 missing 66.7 50 1 missing 28.0
13 1 112 missing 0 missing 66.7 50 1 2.5 missing
14 1 120 missing 0 missing 66.7 50 1 missing 33.0
  2. Result: You will see some rows have missing in DV_1 or DV_2, depending on which one did not have a measurement at that time. Pumas can handle these missing values (missing) in each column.

4 Representation of Dosing Information (amt, cmt, evid, addl, ii, ss, rate, duration)

4.1 Essential Dosing Columns

  • amt: The dose amount.

    • When amt > 0, Pumas interprets the row as a dosing event.
  • cmt: The compartment being dosed (for example, 1, 2, or a string such as "Depot").

    • Typically set to missing for observation rows.
  • evid: The event ID:

    • 0: Observation (no dose).
    • 1: Standard dosing event.
    • 3 or 4: Reset events (less common).
    • If this column does not exist, Pumas infers evid=1 when amt>0, otherwise evid=0.
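
The inference rule for a missing evid column can be mimicked during data preparation. The sketch below uses only DataFrames.jl; the table and the helper name infer_evid are hypothetical, and the code illustrates the rule as described above rather than the Pumas implementation.

```julia
using DataFrames

# Hypothetical dosing/observation records without an EVID column.
events = DataFrame(
    ID = [1, 1, 1],
    TIME = [0.0, 1.0, 2.0],
    AMT = [100.0, missing, 0.0],
    DV = [missing, 8.0, 7.1],
)

# Mirror the inference rule described above: a positive amount marks a dose
# (EVID = 1); a zero or missing amount marks an observation (EVID = 0).
infer_evid(amt) = coalesce(amt, 0) > 0 ? 1 : 0

events.EVID = infer_evid.(events.AMT)

# Only dosing rows keep a compartment; observation rows get `missing`.
events.CMT = [e == 1 ? 1 : missing for e in events.EVID]
```

Creating the column explicitly keeps the dose/observation split visible in the dataset instead of leaving it to inference.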

4.2 Advanced Dosing Columns

  • addl: The number of additional doses.

    • Must be 0 if there are no repeated doses.
    • If > 0, a non-zero ii (interdose interval) must also be provided.
  • ii: The interdose interval.

    • If ii > 0, addl > 0 is expected (and vice versa).
  • ss: The steady-state indicator.

    • 0: Not a steady-state dose.
    • 1: Steady-state dose (compartment amounts reset to steady-state amounts).
    • 2: Add steady-state amounts to the existing amounts.
    • For repeated bolus dosing and infusion with ss, a non-zero ii is required.
    • For a constant infusion steady-state, amt=0 and rate>0 must be combined with ii=0.
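
To make the meaning of addl and ii concrete, the sketch below expands a single dosing record into the dose times it implies. The helper implied_dose_times is a hypothetical name introduced for illustration; it is plain Julia, not a Pumas function, and it ignores steady-state dosing.

```julia
# Expand one dosing record into the dose times it implies.
# `addl` counts the doses *after* the first one, spaced `ii` time units apart.
function implied_dose_times(time, addl, ii)
    # The two columns go together: repeated doses need a positive interval.
    if addl > 0 && ii <= 0
        throw(ArgumentError("addl > 0 requires ii > 0"))
    end
    return [time + k * ii for k in 0:addl]
end

# A dose at time 0 with 6 additional doses every 24 hours: 7 doses in total.
times = implied_dose_times(0, 6, 24)
```

Here times runs from 0 to 144 in steps of 24, which is why a week of once-daily dosing is written with addl = 6 rather than 7.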

4.3 Infusion vs. Bolus

  • rate: The rate of infusion.

  • duration: An alternative to rate. If > 0, Pumas calculates rate = amt / duration.

These columns allow for specification of different dosing regimens.
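
The relationship between rate and duration is simple arithmetic, sketched below in plain Julia. The helper names are hypothetical and the numbers are illustrative.

```julia
# Convert between the two equivalent ways of describing an infusion.
# Both helpers are illustrative and not part of Pumas.
rate_from_duration(amt, duration) = amt / duration
duration_from_rate(amt, rate) = amt / rate

amt = 100.0          # dose amount
duration = 2.0       # infusion length, in the dataset's time units

rate = rate_from_duration(amt, duration)      # 50.0 amount units per time unit
roundtrip = duration_from_rate(amt, rate)     # recovers the 2.0 duration
```

Because either column determines the other once amt is known, a dataset normally needs only one of the two.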

5 Representation of Time-Independent Covariates (Baseline Demographics)

Time-independent covariates, such as weight, age, and sex, remain constant for an individual. For example:

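| ID | TIME | AMT | DV  | WT | AGE | SEX |
|----|------|-----|-----|----|-----|-----|
| 1  | 0    | 100 |     | 70 | 45  | M   |
| 1  | 1    | 0   | 8.0 | 70 | 45  | M   |
| 1  | 2    | 0   | 7.5 | 70 | 45  | M   |
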
  • Values for each covariate must remain the same across all rows for the individual.
  • During the read_pumas call, use covariates = [:WT, :AGE, :SEX] to specify these columns as covariates.
Tip

Values for covariates can be numeric or character, which provides flexibility in the data preparation step.
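
Whether a covariate really is constant within each subject can be checked before calling read_pumas. The following is a minimal sketch using DataFrames.jl; the dataset and the helper n_distinct are hypothetical.

```julia
using DataFrames

# Hypothetical two-subject dataset; subject 2 has an inconsistent WT entry.
df_cov = DataFrame(
    ID = [1, 1, 1, 2, 2],
    TIME = [0, 1, 2, 0, 1],
    WT = [70.0, 70.0, 70.0, 82.0, 81.0],
    SEX = ["M", "M", "M", "F", "F"],
)

# Count distinct non-missing values of each covariate within each subject.
n_distinct(x) = length(unique(skipmissing(x)))
per_subject = combine(groupby(df_cov, :ID), [:WT, :SEX] .=> n_distinct)

# Subjects where a supposedly constant covariate takes more than one value.
varying_wt_ids = per_subject[per_subject.WT_n_distinct .> 1, :ID]
```

Any ID returned here either contains a data-entry error or should be treated as having a time-dependent covariate.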

6 Representation of Time-Dependent Covariates

Time-dependent covariates can be represented by assigning different values at different time points:

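| ID | TIME | AMT | DV  | WT | AGE | SEX | Note                               |
|----|------|-----|-----|----|-----|-----|------------------------------------|
| 1  | 0    | 100 |     | 70 | 45  | M   |                                    |
| 1  | 1    | 0   | 8.0 | 70 | 45  | M   |                                    |
| 1  | 2    | 0   | 7.5 | 69 | 45  | M   | WT changed from 70 to 69 at TIME=2 |
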
  • To indicate time-dependency, simply provide multiple rows for each subject with changing covariate values.
  • Pumas fills missing covariate values, whether constant or time-varying, by piecewise-constant interpolation. The direction is set with the covariates_direction keyword argument of read_pumas: :left is “last observation carried forward” (LOCF) and :right is “next observation carried backward” (NOCB, the NONMEM convention). The read_pumas reference in section 9.1 lists :left as the default. For constant covariates, the interpolated values are simply that constant.
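
The difference between the two interpolation directions can be seen with a small hand-written lookup. This is a sketch in plain Julia, not the Pumas implementation; the function covariate_at is hypothetical, and the example assumes WT was recorded only at TIME = 0 and TIME = 2.

```julia
# Piecewise-constant lookup of a covariate recorded at `times` with `values`.
# direction = :left  -> last observation carried forward (LOCF)
# direction = :right -> next observation carried backward (NOCB)
function covariate_at(t, times, values; direction = :left)
    if direction == :left
        i = searchsortedlast(times, t)        # latest record at or before t
        return values[max(i, 1)]              # before the first record: use the first
    elseif direction == :right
        i = searchsortedfirst(times, t)       # earliest record at or after t
        return values[min(i, length(values))] # after the last record: use the last
    else
        throw(ArgumentError("direction must be :left or :right"))
    end
end

wt_times = [0.0, 2.0]   # WT recorded at TIME = 0 and TIME = 2
wt_values = [70.0, 69.0]

locf = covariate_at(1.0, wt_times, wt_values; direction = :left)   # 70.0
nocb = covariate_at(1.0, wt_times, wt_values; direction = :right)  # 69.0
```

At TIME = 1, LOCF still uses the baseline weight while NOCB already uses the later one, so the choice matters whenever a covariate changes between records.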

7 Additional Intricacies

7.1 Unique Times per Subject

  • Within each subject’s data, Pumas expects unique time values for each kind of record. Two records may share the same time only if they represent different events (for example, a dose and an observation); otherwise one of them should be moved or the two combined. For example, a dependent variable cannot have repeated observations at the same time point, and a covariate cannot have more than one value at a given time.
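
Duplicate records can be located with nonunique from DataFrames.jl. The sketch below uses a hypothetical table; the remedy shown (a negligible time offset) is only one option, and whether to offset, average, or drop a repeated record is an analysis decision.

```julia
using DataFrames

# Hypothetical wide-format records: subject 1 has two `conc` observations at TIME = 24.
obs = DataFrame(
    ID = [1, 1, 1, 1],
    TIME = [0.0, 24.0, 24.0, 36.0],
    EVID = [1, 0, 0, 0],
    conc = [missing, 9.2, 9.4, 8.5],
)

# Flag rows that repeat an earlier (ID, TIME, EVID) combination.
is_dup = nonunique(obs, [:ID, :TIME, :EVID])

# One possible remedy: shift each repeated record by a negligible offset.
obs.TIME[is_dup] .+= 1e-6
```

After the shift, a second call to nonunique on the same columns reports no duplicates.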

7.2 Missing Covariate Information

  • Missing values can be handled by placing missing in the covariate column.
  • Missing covariate entries are filled only by the piecewise-constant interpolation controlled by the covariates_direction argument of read_pumas (LOCF or NOCB); read_pumas applies no other imputation. Any other imputation method must be applied during data preprocessing.

7.3 Observations at Dosing Time

  • Observations (DVs) must be missing on any row where amt > 0.
  • If a numeric DV appears at the same time as a dose, an error from read_pumas will occur.
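
If a raw dataset does record an observation on a dosing row, one way to keep the value is to move it to its own observation row at the same time. The sketch below uses DataFrames.jl on a hypothetical two-row table; it is one possible approach, not a prescribed Pumas workflow.

```julia
using DataFrames

# Hypothetical records where a concentration was entered on the dosing row.
raw = DataFrame(
    ID = [1, 1],
    TIME = [0.0, 1.0],
    AMT = [100.0, missing],
    DV = [0.0, 8.0],
)

# A row is a dose when AMT is present and positive.
is_dose = coalesce.(raw.AMT, 0) .> 0

# Copy each offending observation onto its own observation row (no AMT) ...
moved = raw[is_dose .& .!ismissing.(raw.DV), :]
moved.AMT = fill(missing, nrow(moved))

# ... then blank the DV on the dosing rows and recombine.
raw.DV = [d ? missing : v for (d, v) in zip(is_dose, raw.DV)]
clean = vcat(raw, moved)

# Add EVID and order each subject's records by time, doses before observations.
clean.EVID = Int.(coalesce.(clean.AMT, 0) .> 0)
sort!(clean, [:ID, :TIME, order(:EVID, rev = true)])
```

The result has a dose row with a missing DV followed by a separate observation row at the same time.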

8 Key Differences Between Pumas and NONMEM Data Formats

| Feature             | NONMEM (NM-TRAN)                                  | Pumas                                      |
|---------------------|---------------------------------------------------|--------------------------------------------|
| Format              | Long format                                       | Wide format                                |
| Dependent Variables | Single column for all DVs                         | Separate columns for each DV type          |
| Identifier Columns  | Uses DVID or CMT to identify DV types             | No need for DVID; each DV has its column   |
| Character Values    | Requires special handling (e.g., mapping “M”/“F”) | Handles String types natively              |
| Data Parsing        | Uses DATA and INPUT blocks                        | Uses read_pumas with named arguments       |
| Covariate Handling  | Numeric columns with transformations in model     | Flexible format, can transform in DataFrame |
| Missing Values      | Special codes (e.g., “.”)                         | Native missing type support                |
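
Two of the rows above (missing-value codes and character values) can be illustrated when reading an NM-TRAN style file. This sketch assumes CSV.jl and DataFrames.jl are installed; the file contents are hypothetical and are read from an in-memory buffer instead of a file on disk.

```julia
using CSV, DataFrames

# Hypothetical NM-TRAN style file contents: "." marks missing values.
csv_text = """
ID,TIME,AMT,DV,SEX
1,0,100,.,M
1,1,.,8.0,M
1,2,.,7.1,M
"""

# Map the NM-TRAN missing code to Julia's `missing` while reading.
nm = CSV.read(IOBuffer(csv_text), DataFrame; missingstring = [".", "NA", ""])

# SEX stays as text; no numeric recoding is needed before using it as a covariate.
sex_levels = unique(nm.SEX)
```

After reading, the "." entries in AMT and DV are missing, and SEX can be used as a covariate without first being mapped to numbers.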

9 read_pumas

After the dataset is prepared—whether single-DV or multi-DV (pivoted to wide format)—it is typically loaded into Pumas using the read_pumas function. Under the hood, read_pumas converts the DataFrame into a Population object containing one or more Subject objects. Each subject’s data includes:

  • Subject identifier (ID).
  • Event records (dosing events).
  • Observation records (measurements of dependent variables).
  • Covariate values (time-varying or constant).

9.1 The read_pumas function signature

The read_pumas function constructs a Population object, converting rows from a CSV (or DataFrame) into a validated Pumas format.

read_pumas(filepath::AbstractString; missingstring = ["", ".", "NA"], kwargs...)
read_pumas(df::AbstractDataFrame; kwargs...)
The keyword arguments, with their types and defaults, are:

  • observations (Vector{Symbol}; default: [:dv]): A vector of column names of dependent variables.
  • covariates (Vector{Symbol}; default: Symbol[]): A vector of column names of covariates.
  • id (Symbol; default: :id): The name of the column with the IDs of the individuals. Each individual should have a unique integer or string.
  • time (Symbol; default: :time): The name of the column with the time corresponding to the row. Time should be unique per ID (no duplicate time values for a given subject).
  • evid (Union{Symbol, Nothing}; default: nothing): The name of the column with event IDs, or nothing. Event ID defaults to 0 if the dose amount is 0 or missing, and 1 otherwise. Possible event IDs are:

    • 0: observation
    • 1: dose event
    • 2: other type event
    • 3: reset event (resets amounts in each compartment to zero and resets on/off status to initial)
    • 4: reset and dose event
  • amt (Symbol; default: :amt): The name of the column of dose amounts. If the event ID is specified and non-zero, the dose amount should be non-zero. The default dose amount is 0.
  • addl (Symbol; default: :addl): The name of the column that indicates the number of repeated dose events. The number of additional doses defaults to 0.
  • ii (Symbol; default: :ii): The name of the column of inter-dose intervals. When the number of additional doses (addl) is specified and non-zero, this is the time to the next dose. For steady-state events with multiple infusions or bolus doses, this is the time between implied doses. The default inter-dose interval is 0. Requirements:

    • Must be non-zero for steady-state events with multiple infusions or bolus doses.
    • Must be zero for steady-state events with constant infusion.
  • cmt (Symbol; default: :cmt): The name of the column with the compartment to be dosed. Compartments can be specified by integers, strings, or symbols. The default compartment is 1.
  • rate (Symbol; default: :rate): The name of the column with the rate of administration. A rate of -2 allows the rate to be determined by Dose Control Parameters (DCP). Defaults to 0. Possible values:

    • 0: instantaneous bolus dose
    • > 0: infusion dose administered at a constant rate for a duration equal to amt / rate
    • -2: infusion rate or duration specified by the dose control parameters (see @dosecontrol)
  • ss (Symbol; default: :ss): The name of the column that indicates whether a dose is a steady-state dose. Possible values:

    • 0: dose is not a steady-state dose.
    • 1: dose is a steady-state dose; compartment amounts are reset to the resulting steady-state amounts from the given dose (prior dose events are zeroed out, and infusions in progress or pending additional doses are cancelled).
    • 2: dose is a steady-state dose; compartment amounts are set to the sum of the steady-state amounts plus any amounts that would be present otherwise.
  • route (Symbol, if present): The name of the column that specifies the route of administration.
  • mdv (Union{Symbol, Nothing}; default: nothing): The name of the column that indicates if observations are missing, or nothing.
  • event_data (Bool; default: true): Toggles assertions applicable to event data. Specifically checks if the following columns are present in the DataFrame (either as default or user-defined): :id, :time, and :amt. If no :evid column is present, a warning is thrown and :evid is set to 1 when :amt values are > 0 or not missing, or to 0 when :amt values are missing and observations are not missing. Otherwise, read_pumas will throw an error.
  • covariates_direction (Symbol; default: :left): The direction of covariate interpolation, either :left (LOCF) or :right (NOCB). Note: For models with occasion variables, :left ensures correct interpolation behavior.
  • check (Bool; default: event_data): Toggles NM-TRAN-compliance checks of the input data. Checks if the following columns are present in the DataFrame (either as default or user-defined): :id, :time, :amt, :cmt, :evid, :addl, :ii, :ss, and :route. Additional checks include:

    • All variables in observations must be numeric (Integer or AbstractFloat).
    • :amt must be numeric.
    • :cmt must be a positive Integer, AbstractString, or Symbol.
    • :amt must be missing or 0 when evid = 0; otherwise ≥ 0.
    • All variables in observations must be missing when evid = 1.
    • :ii must be present if :ss is present.
    • :ii must be missing or 0 when evid = 0.
    • :ii must be > 0 if :addl > 0, and vice versa.
    • :addl must be ≥ 0 when evid = 1.
    • :evid must be nonzero when :amt > 0 or when :addl and :ii values are > 0.
  • adjust_evid34 (Bool; default: true): Toggles adjustment of the time vector for reset events (evid = 3 and evid = 4). If true, the time of the previous event is added to the time on record to keep the time vector monotonically increasing.

9.2 How read_pumas Builds a Population

Once a properly formatted DataFrame is available (e.g., wide_df from the multi-DV pivot example or a simpler single-DV dataset), the following can be run:

using Pumas

pop = read_pumas(
    wide_df;
    id = :ID,
    time = :TIME,
    evid = :EVID,     # optional if the column doesn't exist (Pumas will infer)
    amt = :AMT,       # optional if no dosing events in your data
    cmt = :CMT,       # optional if you prefer the default of 1
    observations = [:DV_1, :DV_2],  # One or more DV columns
    covariates = [:WT, :AGE, :SEX], # Any additional columns for subject data
    event_data = true,               # Default. If your dataset includes doses/observations.
)

Here is what happens under the hood:

  1. Check for Required Columns

    • If event_data=true (the default, meaning the data includes doses and observations), Pumas requires columns for id, time, and amt.
    • At least one observation column must be declared via observations = ....
    • If any of these are missing, an error will appear like: “The input must have: id, time, amt, and observations” when event_data is true.
  2. Check for Basic Column Validity

    • If the dataset does not have an evid column, Pumas issues a warning and auto-creates one:

      Warning: Your dataset has a dose event but no evid column…

    • If id is missing (in case event_data=false), an error will appear.

  3. Inferring or Validating Dose Rows

    • If a row has amt > 0, Pumas expects evid in (1,4). If evid is not provided, Pumas sets evid=1 for that row.
    • cmt must be positive (e.g., 1, 2) or a valid symbol/string (e.g., :Depot/"Depot") for dosing rows.
  4. Inferring or Validating Observation Rows

    • If a row has amt == 0 or a missing amt and no evid is provided, Pumas treats it as an observation row (evid=0).
    • Observations at the exact time of dose are not allowed; they must be set to missing. If a numeric observation is accidentally on the same row as a dose, Pumas issues an error.
  5. Check for Data Consistency

    • When evid = 0, amt must be zero or missing.
    • If there are non-numeric entries in a numeric column (amt, a DV column, etc.), Pumas identifies the row and column causing the problem.
    • If advanced dose features (ss, addl, ii, rate, duration) are used, each is checked for internal consistency (e.g., “If addl > 0, ii must be > 0”).
  6. Constructing Subjects

    • read_pumas groups rows by id.
    • Within each subject, Pumas sorts rows by time and checks that no two events have the same timestamp.
    • Covariate columns (e.g., AGE, SEX) become part of the subject’s data. If they are time-varying, they appear as different rows.
  7. Building the Final Population

    • After each subject passes validation, a Subject object is created for them, and these are collected into a Population object.
    • The result is stored in pop, ready for model fitting in Pumas.
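
A few of the checks listed above can be approximated ahead of time with a small hand-rolled function. The sketch below uses only DataFrames.jl; the function preflight and its messages are hypothetical and do not reproduce Pumas internals or its actual error text.

```julia
using DataFrames

# Hand-rolled pre-flight checks that mirror a few of the rules listed above.
# This is an illustration only and does not reproduce Pumas internals.
function preflight(df; id = :ID, time = :TIME, amt = :AMT, evid = :EVID, observations = [:DV])
    problems = String[]
    # 1. Required columns.
    for col in [id, time, amt, observations...]
        string(col) in names(df) || push!(problems, "missing column $col")
    end
    isempty(problems) || return problems
    has_evid = string(evid) in names(df)
    for (i, row) in enumerate(eachrow(df))
        a = coalesce(row[amt], 0)
        e = has_evid ? row[evid] : (a > 0 ? 1 : 0)
        # 2. Observation rows must not carry a dose amount.
        e == 0 && a != 0 && push!(problems, "row $i: amt must be 0 or missing when evid = 0")
        # 3. Dose rows must not carry observations.
        if e == 1 && any(o -> !ismissing(row[o]), observations)
            push!(problems, "row $i: observations must be missing on a dose row")
        end
    end
    return problems
end

bad = DataFrame(ID = [1, 1], TIME = [0, 1], AMT = [100, 0], EVID = [1, 0], DV = [5.0, 8.0])
issues = preflight(bad)
```

Collecting all problems in a vector, rather than stopping at the first, makes it easier to clean a dataset in one pass before handing it to read_pumas.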

9.3 Example Usage

To illustrate the process, let’s walk through an example using a dataset from the PharmaDatasets package, loaded here as warfarin_data.

using PharmaDatasets

warfarin_data = dataset("paganz2024/warfarin_long")
504×8 DataFrame
479 rows omitted
Row ID TIME WEIGHT AGE SEX AMOUNT DVID DV
String3 Float64 Float64 Int64 Int64 Float64? Int64 Float64?
1 1 0.0 66.7 50 1 100.0 0 missing
2 1 0.0 66.7 50 1 missing 2 100.0
3 1 24.0 66.7 50 1 missing 1 9.2
4 1 24.0 66.7 50 1 missing 2 49.0
5 1 36.0 66.7 50 1 missing 1 8.5
6 1 36.0 66.7 50 1 missing 2 32.0
7 1 48.0 66.7 50 1 missing 1 6.4
8 1 48.0 66.7 50 1 missing 2 26.0
9 1 72.0 66.7 50 1 missing 1 4.8
10 1 72.0 66.7 50 1 missing 2 22.0
11 1 96.0 66.7 50 1 missing 1 3.1
12 1 96.0 66.7 50 1 missing 2 28.0
13 1 120.0 66.7 50 1 missing 1 2.5
493 32 24.0 62.0 21 1 missing 1 8.9
494 32 24.0 62.0 21 1 missing 2 36.0
495 32 36.0 62.0 21 1 missing 1 7.7
496 32 36.0 62.0 21 1 missing 2 27.0
497 32 48.0 62.0 21 1 missing 1 6.9
498 32 48.0 62.0 21 1 missing 2 24.0
499 32 72.0 62.0 21 1 missing 1 4.4
500 32 72.0 62.0 21 1 missing 2 23.0
501 32 96.0 62.0 21 1 missing 1 3.5
502 32 96.0 62.0 21 1 missing 2 20.0
503 32 120.0 62.0 21 1 missing 1 2.5
504 32 120.0 62.0 21 1 missing 2 22.0

Inspecting the first few rows of the dataset shows that it has a single DV column and a DVID column that indicates the type of observation. As mentioned in the previous section, this is not the preferred format for Pumas. Instead, we want to pivot the data to have one column per dependent variable. We will also do some data cleaning to make the data more suitable for Pumas, some of which will be explained in the next module.

using PharmaDatasets, DataFramesMeta

warfarin_data = dataset("paganz2024/warfarin_long")

@. warfarin_data[[133, 135, 137, 139], :TIME] += 1e-6 # This is to avoid duplicate time points for observations

# Transform the data in a single chain of operations
warfarin_data_wide = @chain warfarin_data begin
    @rsubset !contains(:ID, "#")
    # Calculate size-based covariates
    @rtransform begin
        # Volume scaling based on 70kg reference weight
        :FSZV = :WEIGHT / 70
        # Clearance scaling with allometric exponent 0.75
        :FSZCL = (:WEIGHT / 70)^0.75
        # Create DV column names (e.g., "DV1", "DV2") from DVID
        :DVNAME = "DV$(:DVID)"
        # Set CMT to 1 for dosing records, missing for observations
        :CMT = ismissing(:AMOUNT) ? missing : 1
        # Set EVID to 1 for dosing records, 0 for observations
        :EVID = ismissing(:AMOUNT) ? 0 : 1
    end
    unstack(Not([:DVID, :DVNAME, :DV]), :DVNAME, :DV)
    rename!(:DV1 => :conc, :DV2 => :pca)
end
317×13 DataFrame
292 rows omitted
Row ID TIME WEIGHT AGE SEX AMOUNT FSZV FSZCL CMT EVID DV0 pca conc
String3 Float64 Float64 Int64 Int64 Float64? Float64 Float64 Int64? Int64 Float64? Float64? Float64?
1 1 0.0 66.7 50 1 100.0 0.952857 0.96443 1 1 missing missing missing
2 1 0.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 100.0 missing
3 1 24.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 49.0 9.2
4 1 36.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 32.0 8.5
5 1 48.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 26.0 6.4
6 1 72.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 22.0 4.8
7 1 96.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 28.0 3.1
8 1 120.0 66.7 50 1 missing 0.952857 0.96443 missing 0 missing 33.0 2.5
9 2 0.0 66.7 31 1 100.0 0.952857 0.96443 1 1 missing missing missing
10 2 0.0 66.7 31 1 missing 0.952857 0.96443 missing 0 missing 100.0 missing
11 2 0.5 66.7 31 1 missing 0.952857 0.96443 missing 0 missing missing 0.0
12 2 2.0 66.7 31 1 missing 0.952857 0.96443 missing 0 missing missing 8.4
13 2 3.0 66.7 31 1 missing 0.952857 0.96443 missing 0 missing missing 9.7
306 31 48.0 83.3 24 1 missing 1.19 1.13936 missing 0 missing 24.0 6.4
307 31 72.0 83.3 24 1 missing 1.19 1.13936 missing 0 missing 22.0 4.5
308 31 96.0 83.3 24 1 missing 1.19 1.13936 missing 0 missing 28.0 3.4
309 31 120.0 83.3 24 1 missing 1.19 1.13936 missing 0 missing 42.0 2.5
310 32 0.0 62.0 21 1 93.0 0.885714 0.912999 1 1 missing missing missing
311 32 0.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 100.0 missing
312 32 24.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 36.0 8.9
313 32 36.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 27.0 7.7
314 32 48.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 24.0 6.9
315 32 72.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 23.0 4.4
316 32 96.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 20.0 3.5
317 32 120.0 62.0 21 1 missing 0.885714 0.912999 missing 0 missing 22.0 2.5

The data wrangling steps above address several key requirements for Pumas data format:

Multiple Dependent Variables

  • Original data was in long format with a single DV column and DVID identifier
  • Used unstack to create wide format with separate columns (:conc, :pca) for each DV type
  • See Multiple Dependent Variables section

Essential Dosing Information

  • Created EVID column (0 for observations, 1 for doses) based on presence of AMOUNT
  • Set CMT to 1 for dosing records and missing for observations
  • See Essential Dosing Columns section

Data Cleaning

  • Removed comment rows (containing “#” in ID)
  • Added descriptive names for DVs (:conc and :pca instead of :DV1 and :DV2)
  • See Additional Intricacies section

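Covariate Preparation

  • Added size-based covariates: FSZV (WEIGHT / 70) for volume scaling and FSZCL ((WEIGHT / 70)^0.75) for clearance scaling
  • See Time-Independent Covariates section

The wide dataset can now be passed to read_pumas: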

using Pumas
pop = read_pumas(
    warfarin_data_wide;
    id = :ID,
    time = :TIME,
    amt = :AMOUNT,
    cmt = :CMT,
    evid = :EVID,
    covariates = [:SEX, :WEIGHT, :FSZV, :FSZCL],
    observations = [:conc, :pca],
)

The function call above constructs a Population object by interpreting each row in warfarin_data_wide according to the specified mappings.

10 Viewing the Dataset After read_pumas

Once read_pumas completes successfully, it returns a Population object, which is a container for one or more Subject objects. Several methods are available for inspection:

  • Number of Subjects
length(pop)  # Returns the number of subjects
  • Accessing Individual Subjects
first(pop)   # Returns the first Subject
pop[2]       # Returns the second Subject, if it exists
  • Subject Fields: Each Subject contains fields such as .events (dose records), .observations (DV measurements), and .covariates. For example:
subj = pop[1]
subj.events       # dosing events for this subject
subj.observations # observed values for each dependent variable
subj.covariates   # covariate values for this subject
  • Printing the Population
pop

Displays a brief summary, including the number of subjects, their IDs, covariates, and observations.

One can also inspect the population in a more detailed way by converting it to a DataFrame.

using DataFrames
DataFrame(pop)

11 Putting Everything Together: An Example Workflow

  1. Read From CSV Into a DataFrame
using CSV, DataFrames, Pumas
df = CSV.read("mydata.csv", DataFrame; missingstring = ["", ".", "NA"])
  2. Pivot If Multiple DVs (Optional)

    • If the data originally has a single DV column plus a column like CMT or DVID, use unstack to create columns such as DV_1, DV_2.
  3. Use read_pumas

pop = read_pumas(
    df;
    observations = [:DV],
    covariates = [:WT, :AGE, :SEX],
    id = :ID,
    time = :TIME,
    amt = :AMT,
    cmt = :CMT,
    # other columns as needed
)
  4. Inspect the Resulting Population
pop
length(pop)
pop[1].events
pop[1].observations
pop[1].covariates

12 Summary

  1. Dependent Variables:

    • Single DV → single column.
    • Multiple DVs → pivot to wide format, one column per DV.
  2. Dosing Information:

    • Required columns for standard dosing: amt, possibly cmt, and evid.
    • Advanced dosing: addl, ii, ss, and rate/duration.
  3. Covariates:

    • Time-independent covariates remain constant for each subject.
    • Time-dependent covariates change over different rows and can be interpolated within Pumas.
  4. Missing Data:

    • Observations must be missing at dosing times.
    • Covariates that are missing can be filled via user-defined methods or left as missing if suitable for the modeling approach.
  5. read_pumas Function:

    • Converts a DataFrame (or file) into a Population object.
    • Performs extensive checks and enforces NMTRAN-like rules if check is true.
  6. Viewing the Dataset:

    • Population objects can be explored by indexing subjects and querying .events, .observations, or .covariates.