Data Representation in Pumas

Author

Vijay Ivaturi

This guide is designed to help you prepare an existing analysis dataset for use in Pumas, focusing on datasets that already include dosing events, observations, and covariates. It is not intended to demonstrate how to create analysis-ready datasets from source SDTM and ADaM datasets. Instead, the goal is to ensure your dataset is correctly structured and formatted before running your first model in Pumas.

Typically, the process begins by reading a dataset from a spreadsheet or tabular format, such as Excel or CSV, into a DataFrame in Julia. The read_pumas function then converts this DataFrame into a Population object, the data structure used for pharmacometric modeling in Pumas. During the conversion, read_pumas performs checks similar to those NM-TRAN runs on a NONMEM dataset, so that data problems surface before a model is executed.

The sections below introduce the expected tabular layout for common use cases, followed by a detailed description of the read_pumas requirements.

1 Prerequisites

Basic Julia Proficiency: A fundamental understanding of the Julia programming language is beneficial.
Familiarity with DataFrames: Working knowledge of the DataFrames.jl or DataFramesMeta.jl package is helpful for pivoting data (long-to-wide format) and general data wrangling tasks.
Pharmacometric Concepts: A basic background in pharmacokinetics/pharmacodynamics (PK/PD) modeling or general NLME concepts ensures that terms such as “dependent variables,” “dosing events,” and “observations” are clear.
CSV/Spreadsheet Handling: Experience reading in files (CSV or similar formats) and handling missing values simplifies the data import process.
NM-TRAN Data Format (Optional): Knowledge of how NM-TRAN-formatted datasets typically look can assist in drawing parallels with Pumas data requirements, though it is not strictly required.

2 Learning Goals

By the end of this tutorial, participants will be able to:

Prepare and structure datasets with single and multiple dependent variables in the wide format required by Pumas.
Specify and manage different types of dosing information, including essential and advanced dosing columns.
Differentiate between time-dependent and time-independent covariates and handle them appropriately in datasets.
Address and manage missing data effectively, ensuring data integrity for modeling.
Utilize the read_pumas function to convert datasets into Population objects, performing necessary validations and checks.
Inspect and navigate Population objects in Pumas, understanding the structure and components of the data.
Apply data wrangling techniques to transform datasets from long to wide format, ensuring compatibility with Pumas.
Understand the key differences between Pumas and NONMEM data formats, facilitating a smoother transition for users familiar with NONMEM.

3 Representation of Dependent Variables (Wide Format)

3.1 Rationale for Wide Format

Pumas adopts a wide format for datasets containing multiple dependent variables because it provides several practical advantages:

Clear Separation of Measurement Types: Storing each analyte or measurement type in its own column clarifies which values belong to which endpoint. This clarity reduces the need for additional identifiers (e.g., CMT, DVID) that might complicate the interpretation of rows, especially in models that handle two or more dependent variables simultaneously.
Facilitated Model Specification: In a wide format, each dependent variable can be referenced directly by its column name. This approach simplifies the process of writing models that involve multiple analytes or linked processes (such as PK/PD), making it more transparent where each measurement is used in the model definition.
Flexible Handling of Missing Data: When multiple dependent variables share observation times, it is common for one to be missing while another is present. A wide format makes it easier to store these missing entries independently for each DV without overlapping or clashing in a single “long” DV column.
Consistency of Output: Pumas similarly returns results in a wide format, maintaining alignment between model inputs and outputs. This consistent structure can streamline subsequent data checks, diagnostics, and reporting.

For those accustomed to NM-TRAN’s single DV column plus a CMT or DVID, it can initially appear more convenient to keep a single column for all dependent variables. However, separating them into multiple columns aligns well with the data structures in Julia (e.g., DataFrame columns), reduces indexing overhead, and promotes clearer, more maintainable models when multiple DVs are involved.

3.2 Single Dependent Variable

If a study measures only one type of dependent variable (for example, a single plasma drug concentration), the data can be arranged in a single column (commonly named DV):

ID	TIME	AMT	DV	WT	AGE	SEX	…
1	0	100		70	45	M	…
1	1	0	8.0	70	45	M	…
1	2	0	7.1	70	45	M	…

For single-DV datasets, typically no pivoting is necessary; simply specify observations = [:DV] in read_pumas.

3.3 Multiple Dependent Variables

If multiple dependent variables exist (for example, a parent drug and a metabolite, or PK and PD markers), Pumas requires a wide format. In this format, each DV appears in its own column:

ID	TIME	AMT	DV_parent	DV_metabolite	WT	AGE	SEX	…
1	0	100			70	45	M	…
1	1	0	8.0	1.2	70	45	M	…
1	2	0	6.9	1.5	70	45	M	…

NM-TRAN files often store multiple DVs in a long format. In Pumas, each DV must be in a separate column.
A “pivot” or “unstack” operation may be necessary if the original data has a single DV column with an identifier column (e.g., CMT or DVID).

Tip

Observation variable names in Pumas do not have to be uppercase. The examples in this section use uppercase names such as DV only to ease comparison with NONMEM, where lowercase letters were not supported prior to v7.2. Descriptive names such as conc or painscore are customary in Pumas because each variable has its own column rather than sharing one common column.

Converting from Long to Wide Format

This process is known in data wrangling terminology as pivoting (specifically, “pivot wider”). In Julia’s DataFrames.jl package, the function to use is unstack. Below is a step-by-step guide.

Detailed Example: Converting an NM-TRAN Style Dataset

Suppose you have a DataFrame named df that looks something like this:

using DataFramesMeta

df = DataFrame(
    ID = 1,
    TIME = repeat([0; 24:12:48; 72:24:120], inner = 2),
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)

14×9 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1
2	1	0	100.0	2	0	missing	66.7	50	1
3	1	24	9.2	1	0	missing	66.7	50	1
4	1	24	49.0	2	0	missing	66.7	50	1
5	1	36	8.5	1	0	missing	66.7	50	1
6	1	36	32.0	2	0	missing	66.7	50	1
7	1	48	6.4	1	0	missing	66.7	50	1
8	1	48	26.0	2	0	missing	66.7	50	1
9	1	72	4.8	1	0	missing	66.7	50	1
10	1	72	22.0	2	0	missing	66.7	50	1
11	1	96	3.1	1	0	missing	66.7	50	1
12	1	96	28.0	2	0	missing	66.7	50	1
13	1	120	2.5	1	0	missing	66.7	50	1
14	1	120	33.0	2	0	missing	66.7	50	1

Here, CMT=1 corresponds to one dependent variable (e.g., analyte 1), and CMT=2 corresponds to another dependent variable (analyte 2).

Inspect the Dataset

Rows: Each row is an event record (dose or observation).
Columns:
- ID: Subject identifier (all = 1 in this small example).
- TIME: Time of the event or observation.
- DV: Dependent variable (could be concentration of analyte 1 or analyte 2).
- CMT: Distinguishes the analyte or the “compartment.”
- EVID: Event ID (1 = dosing event, 0 = observation).
- AMT: The dose amount; only for dose rows (EVID=1).
- Additional covariates: WT, AGE, SEX.

Clone CMT as DVID

Pumas uses CMT for specifying dose compartments. It won’t help to identify different DVs in the final wide format, because CMT must be set to missing for observations in Pumas. Hence, create a new column DVID (which is typical in some NM-TRAN data structures):

@transform! df :DVID = :CMT

14×10 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX	DVID
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1	1
2	1	0	100.0	2	0	missing	66.7	50	1	2
3	1	24	9.2	1	0	missing	66.7	50	1	1
4	1	24	49.0	2	0	missing	66.7	50	1	2
5	1	36	8.5	1	0	missing	66.7	50	1	1
6	1	36	32.0	2	0	missing	66.7	50	1	2
7	1	48	6.4	1	0	missing	66.7	50	1	1
8	1	48	26.0	2	0	missing	66.7	50	1	2
9	1	72	4.8	1	0	missing	66.7	50	1	1
10	1	72	22.0	2	0	missing	66.7	50	1	2
11	1	96	3.1	1	0	missing	66.7	50	1	1
12	1	96	28.0	2	0	missing	66.7	50	1	2
13	1	120	2.5	1	0	missing	66.7	50	1	1
14	1	120	33.0	2	0	missing	66.7	50	1	2

Now :DVID holds the “type” of DV.

Adjust CMT for Dosing

Next, we set :CMT to missing for observation rows (EVID == 0) and leave it as is for dosing rows:

@rtransform! df :CMT = :EVID != 0 ? :CMT : missing

14×10 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX	DVID
	Int64	Int64	Float64?	Int64?	Int64	Int64?	Float64	Int64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1	1
2	1	0	100.0	missing	0	missing	66.7	50	1	2
3	1	24	9.2	missing	0	missing	66.7	50	1	1
4	1	24	49.0	missing	0	missing	66.7	50	1	2
5	1	36	8.5	missing	0	missing	66.7	50	1	1
6	1	36	32.0	missing	0	missing	66.7	50	1	2
7	1	48	6.4	missing	0	missing	66.7	50	1	1
8	1	48	26.0	missing	0	missing	66.7	50	1	2
9	1	72	4.8	missing	0	missing	66.7	50	1	1
10	1	72	22.0	missing	0	missing	66.7	50	1	2
11	1	96	3.1	missing	0	missing	66.7	50	1	1
12	1	96	28.0	missing	0	missing	66.7	50	1	2
13	1	120	2.5	missing	0	missing	66.7	50	1	1
14	1	120	33.0	missing	0	missing	66.7	50	1	2

When EVID=1, we keep the original compartment number (e.g., 1 if dosing to compartment 1).
When EVID=0 (observation), we set it to missing because for Pumas, the cmt is not needed for the observation itself.

Unstack (Pivot) to Wide Format

Use unstack from DataFrames.jl to pivot the dataset. The idea is:

Column Key: The variable whose unique values form the new columns (here, :DVID).
Value Column: The variable whose values are spread across the new columns (here, :DV).

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

8×10 DataFrame

Row	ID	TIME	CMT	EVID	AMT	WT	AGE	SEX	DV_1	DV_2
	Int64	Int64	Int64?	Int64	Int64?	Float64	Int64	Int64	Float64?	Float64?
1	1	0	1	1	100	66.7	50	1	missing	missing
2	1	0	missing	0	missing	66.7	50	1	missing	100.0
3	1	24	missing	0	missing	66.7	50	1	9.2	49.0
4	1	36	missing	0	missing	66.7	50	1	8.5	32.0
5	1	48	missing	0	missing	66.7	50	1	6.4	26.0
6	1	72	missing	0	missing	66.7	50	1	4.8	22.0
7	1	96	missing	0	missing	66.7	50	1	3.1	28.0
8	1	120	missing	0	missing	66.7	50	1	2.5	33.0

Details:

unstack(df, :DVID, :DV) says “take the data from the df DataFrame, create new columns based on the unique values in DVID, and fill these columns with the values from the DV column.”
renamecols = x -> Symbol(:DV_, x) renames the resulting columns from “1”, “2” to “DV_1”, “DV_2”.

After this step, wide_df will have separate columns for each DV type:

DV_1 for DVID == 1 (rows that originally had CMT == 1)
DV_2 for DVID == 2 (rows that originally had CMT == 2)

Final Structure

The final wide_df is now in the structure that Pumas can parse for modeling:

ID
TIME
EVID
AMT
CMT (dosing compartments if needed)
DV_1 (observations for analyte 1)
DV_2 (observations for analyte 2)
Covariates (e.g., WT, AGE, SEX)
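
A quick sanity check after pivoting is to confirm that no observation was lost: the number of non-missing values in the long DV column should equal the total across the new wide columns. The sketch below uses a small hypothetical dataset (not the one above) and only DataFrames.jl functions.

```julia
using DataFrames

# Hypothetical long-format data: one dose row and three observations
long = DataFrame(
    ID = 1,
    TIME = [0, 0, 24, 24],
    DV = [missing, 100.0, 9.2, 49.0],
    DVID = [1, 2, 1, 2],
    EVID = [1, 0, 0, 0],
)

# Same pivot as in the guide: one column per DVID value
wide = unstack(long, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

# Count non-missing observations before and after the pivot
n_long = count(!ismissing, long.DV)
n_wide = count(!ismissing, wide.DV_1) + count(!ismissing, wide.DV_2)
println("long: ", n_long, ", wide: ", n_wide)
```

If the two counts differ, some rows were probably dropped or merged during the pivot and the row keys deserve a second look.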

Handling Mismatched Observation Times

What if the times do not align between the two DVs? For example:

df = DataFrame(
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)

14×9 DataFrame

Row	ID	TIME	DV	CMT	EVID	AMT	WT	AGE	SEX
	Int64	Int64	Float64?	Int64	Int64	Int64?	Float64	Int64	Int64
1	1	0	missing	1	1	100	66.7	50	1
2	1	0	100.0	2	0	missing	66.7	50	1
3	1	12	9.2	1	0	missing	66.7	50	1
4	1	24	49.0	2	0	missing	66.7	50	1
5	1	32	8.5	1	0	missing	66.7	50	1
6	1	36	32.0	2	0	missing	66.7	50	1
7	1	44	6.4	1	0	missing	66.7	50	1
8	1	48	26.0	2	0	missing	66.7	50	1
9	1	66	4.8	1	0	missing	66.7	50	1
10	1	72	22.0	2	0	missing	66.7	50	1
11	1	90	3.1	1	0	missing	66.7	50	1
12	1	96	28.0	2	0	missing	66.7	50	1
13	1	112	2.5	1	0	missing	66.7	50	1
14	1	120	33.0	2	0	missing	66.7	50	1

Repeat the same pivot:

@chain df begin
    @transform! :DVID = :CMT
    @rtransform! :CMT = :EVID == 0 ? missing : :CMT
end

wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

14×10 DataFrame

Row	ID	TIME	CMT	EVID	AMT	WT	AGE	SEX	DV_1	DV_2
	Int64	Int64	Int64?	Int64	Int64?	Float64	Int64	Int64	Float64?	Float64?
1	1	0	1	1	100	66.7	50	1	missing	missing
2	1	0	missing	0	missing	66.7	50	1	missing	100.0
3	1	12	missing	0	missing	66.7	50	1	9.2	missing
4	1	24	missing	0	missing	66.7	50	1	missing	49.0
5	1	32	missing	0	missing	66.7	50	1	8.5	missing
6	1	36	missing	0	missing	66.7	50	1	missing	32.0
7	1	44	missing	0	missing	66.7	50	1	6.4	missing
8	1	48	missing	0	missing	66.7	50	1	missing	26.0
9	1	66	missing	0	missing	66.7	50	1	4.8	missing
10	1	72	missing	0	missing	66.7	50	1	missing	22.0
11	1	90	missing	0	missing	66.7	50	1	3.1	missing
12	1	96	missing	0	missing	66.7	50	1	missing	28.0
13	1	112	missing	0	missing	66.7	50	1	2.5	missing
14	1	120	missing	0	missing	66.7	50	1	missing	33.0

Result: You will see some rows have missing in DV_1 or DV_2, depending on which one did not have a measurement at that time. Pumas can handle these missing values (missing) in each column.

4 Representation of Dosing Information (`amt`, `cmt`, `evid`, `addl`, `ii`, `ss`, `rate`, `duration`)

4.1 Essential Dosing Columns

amt: The dose amount.
- When amt > 0, Pumas interprets the row as a dosing event.
cmt: The compartment being dosed (for example, 1, 2, or a string such as "Depot").
- Typically set to missing for observation rows.
evid: The event ID:
- 0: Observation (no dose).
- 1: Standard dosing event.
- 3 or 4: Reset events (less common).
- If this column does not exist, Pumas infers evid=1 when amt>0, otherwise evid=0.
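
The fallback rule for a missing evid column can be sketched in plain Julia. The helper name infer_evid is hypothetical and is not a Pumas function; it only mirrors the rule stated above, treating a missing amt as "no dose".

```julia
# Hypothetical helper (not part of Pumas): evid = 1 when amt > 0, otherwise 0.
# A missing amt is treated as an observation row.
infer_evid(amt) = (!ismissing(amt) && amt > 0) ? 1 : 0

amt = [100, missing, 0, 50]
evid = infer_evid.(amt)   # apply the rule to every row
println(evid)
```

Supplying an explicit evid column is still the safer choice, because it avoids the warning and makes reset events (3 or 4) possible.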

4.2 Advanced Dosing Columns

addl: The number of additional doses.
- Must be 0 if there are no repeated doses.
- If > 0, a non-zero ii (interdose interval) must also be provided.
ii: The interdose interval.
- If ii > 0, addl > 0 is expected (and vice versa).
ss: The steady-state indicator.
- 0: Not a steady-state dose.
- 1: Steady-state dose (compartment amounts reset to steady-state amounts).
- 2: Add steady-state amounts to the existing amounts.
- For repeated bolus dosing and infusion with ss, a non-zero ii is required.
- For a constant infusion steady-state, amt=0 and rate>0 must be combined with ii=0.
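
To make the relationship between addl and ii concrete, the following sketch lists the dose times implied by a single dosing record. The function implied_dose_times is a hypothetical illustration and not part of Pumas; it also enforces the rule that addl > 0 requires ii > 0.

```julia
# Hypothetical helper: dose times implied by one dosing record.
# `addl` additional doses follow the first one, spaced `ii` time units apart.
function implied_dose_times(time, addl, ii)
    addl >= 0 || throw(ArgumentError("addl must be non-negative"))
    if addl > 0 && ii <= 0
        throw(ArgumentError("addl > 0 requires ii > 0"))
    end
    return [time + k * ii for k in 0:addl]
end

# One dose at time 0 followed by 6 additional doses every 24 time units
times = implied_dose_times(0, 6, 24)
println(times)
```

A record with addl = 6 and ii = 24 therefore stands for seven doses in total, which is why a single row is enough to describe a week of once-daily dosing.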

4.3 Infusion vs. Bolus

rate: The rate of infusion.
- 0: Bolus/instantaneous dose.
- A positive value (> 0): Infusion with a duration of amt / rate.
- -2: Usage of Dose Control Parameters (DCP).
duration: An alternative to rate. If > 0, Pumas calculates rate = amt / duration.

These columns allow for specification of different dosing regimens.
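
The arithmetic linking rate and duration is simple enough to check by hand. The helper names below are hypothetical and only restate rate = amt / duration and its inverse.

```julia
# Hypothetical helpers restating the relationship between rate and duration
infusion_rate(amt, duration) = amt / duration
infusion_duration(amt, rate) = amt / rate

amt = 100.0
rate = infusion_rate(amt, 2.0)            # 100 units given over 2 time units
duration = infusion_duration(amt, rate)   # recovers the 2 time units
println((rate = rate, duration = duration))
```

Because one quantity determines the other, a dataset normally needs only one of the two columns.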

5 Representation of Time-Independent Covariates (Baseline Demographics)

Time-independent covariates, such as weight, age, and sex, remain constant for an individual. For example:

ID	TIME	AMT	DV	WT	AGE	SEX	…
1	0	100		70	45	M	…
1	1	0	8.0	70	45	M	…
1	2	0	7.5	70	45	M	…

Values for each covariate must remain the same across all rows for the individual.
During the read_pumas call, use covariates = [:WT, :AGE, :SEX] to specify these columns as covariates.
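
Before declaring a column as time-independent, it can help to verify that it really is constant within each subject. The sketch below uses hypothetical data in which subject 2 has two different weights, and relies only on DataFrames.jl.

```julia
using DataFrames

# Hypothetical data: WT changes for subject 2
df = DataFrame(
    ID = [1, 1, 1, 2, 2],
    TIME = [0, 1, 2, 0, 1],
    WT = [70, 70, 70, 82, 81],
    SEX = ["M", "M", "M", "F", "F"],
)

# Keep one row per distinct (ID, WT, SEX) combination
combos = unique(df, [:ID, :WT, :SEX])

# A subject with more than one combination has a varying covariate
counts = combine(groupby(combos, :ID), nrow => :n)
varying_ids = counts.ID[counts.n .> 1]
println(varying_ids)
```

Subjects reported here either contain a data entry error or carry a covariate that should be treated as time-dependent.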

Tip

Values for covariates can be numeric or character, which provides flexibility in the data preparation step.

6 Representation of Time-Dependent Covariates

Time-dependent covariates can be represented by assigning different values at different time points:

ID	TIME	AMT	DV	WT	AGE	SEX	…	Note
1	0	100		70	45	M	…
1	1	0	8.0	70	45	M	…
1	2	0	7.5	69	45	M	…	WT changed from 70 to 69 at TIME=2

To indicate time-dependency, simply provide multiple rows for each subject with changing covariate values.
When values of constant or time-varying covariates are missing, Pumas fills them by piecewise-constant interpolation. The direction is set with the covariates_direction keyword argument of read_pumas: :left is “last observation carried forward” (LOCF) and :right is “next observation carried backward” (NOCB, the NONMEM default). The read_pumas argument table in this guide lists :left as the default; pass the argument explicitly when the direction matters for your analysis. For constant covariates, either direction fills the missing rows with the same constant value.
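
The difference between the two interpolation directions is easiest to see on a small vector. The functions locf and nocb below are hypothetical illustrations written in plain Julia; they are not Pumas functions and do not reproduce its internals.

```julia
# Hypothetical helpers: fill `missing` entries of a time-ordered vector.

# LOCF: each missing entry takes the most recent earlier value
function locf(x::AbstractVector)
    out = copy(x)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i - 1]
        end
    end
    return out
end

# NOCB: each missing entry takes the next later value,
# which is LOCF applied to the reversed vector
nocb(x::AbstractVector) = reverse(locf(reverse(x)))

wt = [70, missing, missing, 69]   # weight recorded at the first and last visit
wt_locf = locf(wt)
wt_nocb = nocb(wt)
println(collect(skipmissing(wt_locf)))
println(collect(skipmissing(wt_nocb)))
```

With LOCF the two middle visits inherit 70, while with NOCB they inherit 69, so the choice of direction changes which covariate value the model sees between measurements.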

7 Additional Intricacies

7.1 Unique Times per Subject

Within each subject’s data, Pumas expects unique time values. Two records may share the same time only if they represent different events (for example, a dose and an observation); otherwise one of them should be moved or the two combined. For example, a dependent variable cannot have repeated observations at the same time point, and a covariate cannot have more than one value at a given time.
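
Duplicate observation times can be located before calling read_pumas. The sketch below uses hypothetical data with two observations at TIME = 1 and the nonunique function from DataFrames.jl.

```julia
using DataFrames

# Hypothetical data: a dose and an observation at TIME = 0 (allowed),
# and two observations at TIME = 1 (not allowed)
df = DataFrame(
    ID = [1, 1, 1, 1],
    TIME = [0, 0, 1, 1],
    EVID = [1, 0, 0, 0],
    DV = [missing, 0.0, 8.0, 8.3],
)

# Look only at observation rows
obs = subset(df, :EVID => ByRow(==(0)))

# nonunique flags each row whose (ID, TIME) pair already appeared earlier
dups = obs[nonunique(obs, [:ID, :TIME]), :]
println(dups)
```

How to resolve the flagged rows (averaging, dropping, or shifting one of them slightly in time) depends on the study and is left to the analyst.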

7.2 Missing Covariate Information

Missing values can be handled by placing missing in the covariate column.
read_pumas does not impute missing covariates beyond the piecewise-constant interpolation controlled by its covariates_direction argument. Any other imputation method must be applied during data preprocessing.

7.3 Observations at Dosing Time

Observations (DVs) must be missing on any row where amt > 0.
If a numeric DV appears at the same time as a dose, an error from read_pumas will occur.
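
Rows that carry both a dose and an observation can be found, and if appropriate blanked out, with a few lines of DataFrames.jl. The data below are hypothetical, and setting the value to missing is only one possible fix.

```julia
using DataFrames

# Hypothetical data: DV is (wrongly) recorded on the dose row
df = DataFrame(
    ID = [1, 1, 1],
    TIME = [0, 1, 2],
    AMT = [100, 0, 0],
    DV = [5.0, 8.0, 7.1],
)

# Dose rows are those with amt > 0 (a missing amt counts as no dose)
is_dose = coalesce.(df.AMT .> 0, false)

# Rows that have a dose and a non-missing observation
offending = df[is_dose .& .!ismissing.(df.DV), :]
println(offending)

# One possible fix: allow missing in DV and blank it on dose rows
allowmissing!(df, :DV)
df.DV[is_dose] .= missing
```

Whether the value should be discarded or moved to a separate observation row is a judgment about the study design, so inspect the offending rows before overwriting them.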

8 Key Differences Between Pumas and NONMEM Data Formats

Feature	NONMEM (NM-TRAN)	Pumas
Format	Long format	Wide format
Dependent Variables	Single column for all DVs	Separate columns for each DV type
Identifier Columns	Uses DVID or CMT to identify DV types	No need for DVID; each DV has its column
Character Values	Requires special handling (e.g., mapping “M”/“F”)	Handles String types natively
Data Parsing	Uses $DATA and $INPUT records	Uses `read_pumas` with keyword arguments
Covariate Handling	Numeric columns with transformations in model	Flexible format, can transform in DataFrame
Missing Values	Special codes (e.g., “.”)	Native `missing` type support

9 `read_pumas`

After the dataset is prepared—whether single-DV or multi-DV (pivoted to wide format)—it is typically loaded into Pumas using the read_pumas function. Under the hood, read_pumas converts the DataFrame into a Population object containing one or more Subject objects. Each subject’s data includes:

Subject identifier (ID).
Event records (dosing events).
Observation records (measurements of dependent variables).
Covariate values (time-varying or constant).

9.1 The `read_pumas` function signature

The read_pumas function constructs a Population object, converting rows from a CSV (or DataFrame) into a validated Pumas format.

read_pumas(filepath::AbstractString; missingstring = ["", ".", "NA"], kwargs...)
read_pumas(df::AbstractDataFrame; kwargs...)

Parameter	Type & Default	Description
observations	`Vector{Symbol}` Default: `[:dv]`	A vector of column names of dependent variables.
covariates	`Vector{Symbol}` Default: `Symbol[]`	A vector of column names of covariates.
id	`Symbol` Default: `:id`	The name of the column with the IDs of the individuals. Each individual should have a unique integer or string.
time	`Symbol` Default: `:time`	The name of the column with the time corresponding to the row. Time should be unique per ID (no duplicate time values for a given subject).
evid	`Union{Symbol, Nothing}` Default: `nothing`	The name of the column with event IDs, or `nothing`. Possible event IDs are: • `0` : observation • `1` : dose event • `2` : other type event • `3` : reset event (resets amounts in each compartment to zero and resets on/off status to initial) • `4` : reset and dose event Event ID defaults to `0` if the dose amount is `0` or missing, and `1` otherwise.
amt	`Symbol` Default: `:amt`	The name of the column of dose amounts. If the event ID is specified and non-zero, the dose amount should be non-zero. The default dose amount is `0`.
addl	`Symbol` Default: `:addl`	The name of the column that indicates the number of repeated dose events. The number of additional doses defaults to `0`.
ii	`Symbol` Default: `:ii`	The name of the column of inter-dose intervals. When the number of additional doses (`addl`) is specified and non-zero, this is the time to the next dose. For steady-state events with multiple infusions or bolus doses, this is the time between implied doses. The default inter-dose interval is `0`. Requirements: • Must be non-zero for steady-state events with multiple infusions or bolus doses. • Must be zero for steady-state events with constant infusion.
cmt	`Symbol` Default: `:cmt`	The name of the column with the compartment to be dosed. Compartments can be specified by integers, strings, or symbols. The default compartment is `1`.
rate	`Symbol` Default: `:rate`	The name of the column with the rate of administration. A rate of `-2` allows the rate to be determined by Dose Control Parameters (DCP). Defaults to `0`. Possible values: • `0` : instantaneous bolus dose • `> 0` : infusion dose administered at a constant rate for a duration equal to `amt / rate` • `-2` : infusion rate or duration specified by the dose control parameters (see `@dosecontrol`)
ss	`Symbol` Default: `:ss`	The name of the column that indicates whether a dose is a steady-state dose. Possible values: • `0` : dose is not a steady-state dose. • `1` : dose is a steady-state dose; compartment amounts are reset to the resulting steady-state amounts from the given dose (prior dose events are zeroed out, and infusions in progress or pending additional doses are cancelled). • `2` : dose is a steady-state dose; compartment amounts are set to the sum of the steady-state amounts plus any amounts that would be present otherwise.
route	`Symbol` (if present)	The name of the column that specifies the route of administration.
mdv	`Union{Symbol, Nothing}` Default: `nothing`	The name of the column that indicates if observations are missing, or `nothing`.
event_data	`Bool` Default: `true`	Toggles assertions applicable to event data. Specifically checks if the following columns are present in the DataFrame (either as default or user-defined): `:id`, `:time`, and `:amt`. If no `:evid` column is present, a warning is thrown and `:evid` is set to `1` when `:amt` values are > 0 or not missing, or to `0` when `:amt` values are missing and observations are not missing. Otherwise, `read_pumas` will throw an error.
covariates_direction	`Symbol` Default: `:left`	The direction of covariate interpolation, either `:left` (LOCF) or `:right` (NOCB). Note: For models with occasion variables, `:left` ensures correct interpolation behavior.
check	`Bool` Default: `event_data`	Toggles NMTRAN-compliance checks of the input data. Checks if the following columns are present in the DataFrame (either as default or user-defined): `:id`, `:time`, `:amt`, `:cmt`, `:evid`, `:addl`, `:ii`, `:ss`, and `:route`. Additional checks include: • All variables in `observations` must be numeric (Integer or AbstractFloat). • `:amt` must be numeric. • `:cmt` must be a positive Integer, AbstractString, or Symbol. • `:amt` must be missing or `0` when `evid = 0`; otherwise `≥ 0`. • All variables in `observations` must be missing when `evid = 1`. • `:ii` must be present if `:ss` is present. • `:ii` must be missing or `0` when `evid = 0`. • `:ii` must be > 0 if `:addl` > 0, and vice versa. • `:addl` must be ≥ 0 when `evid = 1`. • `:evid` must be nonzero when `:amt` > 0 or when `:addl` and `:ii` values are > 0.
adjust_evid34	`Bool` Default: `true`	Toggles adjustment of the time vector for reset events (`evid = 3` and `evid = 4`). If `true`, the time of the previous event is added to the time on record to keep the time vector monotonically increasing.

9.2 How `read_pumas` Builds a Population

Once a properly formatted DataFrame is available (e.g., wide_df from the multi-DV pivot example or a simpler single-DV dataset), the following can be run:

using Pumas

pop = read_pumas(
    wide_df;
    id = :ID,
    time = :TIME,
    evid = :EVID,     # optional if the column doesn't exist (Pumas will infer)
    amt = :AMT,       # optional if no dosing events in your data
    cmt = :CMT,       # optional if you prefer the default of 1
    observations = [:DV_1, :DV_2],  # One or more DV columns
    covariates = [:WT, :AGE, :SEX], # Any additional columns for subject data
    event_data = true,               # Default. If your dataset includes doses/observations.
)

Here is what happens under the hood:

Check for Required Columns
- If event_data=true (the default, meaning the data includes doses and observations), Pumas requires columns for id, time, and amt.
- At least one observation column must be declared via observations = ....
- If any of these are missing, an error will appear like: “The input must have: id, time, amt, and observations” when event_data is true.
Check for Basic Column Validity
- If the dataset does not have an evid column, Pumas issues a warning and auto-creates one:
  
  Warning: Your dataset has a dose event but no evid column…
- If id is missing (in case event_data=false), an error will appear.
Inferring or Validating Dose Rows
- If a row has amt > 0, Pumas expects evid in (1,4). If evid is not provided, Pumas sets evid=1 for that row.
- cmt must be positive (e.g., 1, 2) or a valid symbol/string (e.g., :Depot/"Depot") for dosing rows.
Inferring or Validating Observation Rows
- If a row has amt == 0, Pumas treats it as an observation row (evid=0).
- Observations at the exact time of dose are not allowed; they must be set to missing. If a numeric observation is accidentally on the same row as a dose, Pumas issues an error.
Check for Data Consistency
- When evid = 0, amt must be zero or missing.
- If there are non-numeric entries in a numeric column (amt, a DV column, etc.), Pumas identifies the row and column causing the problem.
- If advanced dose features (ss, addl, ii, rate, duration) are used, each is checked for internal consistency (e.g., “If addl > 0, ii must be > 0”).
Constructing Subjects
- read_pumas groups rows by id.
- Within each subject, Pumas sorts rows by time and checks that no two events have the same timestamp.
- Covariate columns (e.g., AGE, SEX) become part of the subject’s data. If they are time-varying, they appear as different rows.
Building the Final Population
- After each subject passes validation, a Subject object is created for them, and these are collected into a Population object.
- The result is stored in pop, ready for model fitting in Pumas.
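
The first of these steps, the check for required columns, is easy to reproduce as a pre-flight test on the DataFrame. The function missing_required_columns below is a hypothetical helper, not part of Pumas; its keyword names simply echo the read_pumas arguments described above.

```julia
using DataFrames

# Hypothetical pre-flight check (not a Pumas function): report which of the
# columns needed for event data are absent from the DataFrame
function missing_required_columns(df; id = :id, time = :time, amt = :amt, observations = [:dv])
    required = string.([id, time, amt, observations...])
    return setdiff(required, names(df))
end

# Hypothetical dataset that lacks a dose amount column
df = DataFrame(ID = [1, 1], TIME = [0, 1], DV = [missing, 8.0])

absent = missing_required_columns(
    df;
    id = :ID,
    time = :TIME,
    amt = :AMT,
    observations = [:DV],
)
println(absent)
```

Running such a check first gives a plain list of missing columns, which can be quicker to act on than an error raised in the middle of the conversion.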

9.3 Example Usage

To illustrate the process, let’s walk through an example using the warfarin_nonmem dataset from the PharmaDatasets package, loaded here as warfarin_data.

using PharmaDatasets

warfarin_data = dataset("pumas/warfarin_nonmem")

515×9 DataFrame

490 rows omitted

Row	id	time	evid	amt	dvid	dv	wtbl	age	sex
	Int64	Float64	Int64	Float64	Int64	Float64	Float64	Int64	String1
1	1	0.0	1	100.0	0	0.0	66.7	50	M
2	1	0.5	0	0.0	1	0.0	66.7	50	M
3	1	1.0	0	0.0	1	1.9	66.7	50	M
4	1	2.0	0	0.0	1	3.3	66.7	50	M
5	1	3.0	0	0.0	1	6.6	66.7	50	M
6	1	6.0	0	0.0	1	9.1	66.7	50	M
7	1	9.0	0	0.0	1	10.8	66.7	50	M
8	1	12.0	0	0.0	1	8.6	66.7	50	M
9	1	24.0	0	0.0	1	5.6	66.7	50	M
10	1	24.0	0	0.0	2	44.0	66.7	50	M
11	1	36.0	0	0.0	1	4.0	66.7	50	M
12	1	36.0	0	0.0	2	27.0	66.7	50	M
13	1	48.0	0	0.0	1	2.7	66.7	50	M
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
504	33	24.0	0	0.0	1	9.2	66.7	50	M
505	33	24.0	0	0.0	2	49.0	66.7	50	M
506	33	36.0	0	0.0	1	8.5	66.7	50	M
507	33	36.0	0	0.0	2	32.0	66.7	50	M
508	33	48.0	0	0.0	1	6.4	66.7	50	M
509	33	48.0	0	0.0	2	26.0	66.7	50	M
510	33	72.0	0	0.0	1	4.8	66.7	50	M
511	33	72.0	0	0.0	2	22.0	66.7	50	M
512	33	96.0	0	0.0	1	3.1	66.7	50	M
513	33	96.0	0	0.0	2	28.0	66.7	50	M
514	33	120.0	0	0.0	1	2.5	66.7	50	M
515	33	120.0	0	0.0	2	33.0	66.7	50	M

Inspecting the first few rows of the dataset shows that it has a single DV column and a DVID column that indicates the type of observation. As mentioned in the previous section, this is not the preferred format for Pumas. Instead, we want to pivot the data to have one column per dependent variable. We will also do some data cleaning to make the data more suitable for Pumas, some of which will be explained in the next module.

using PharmaDatasets, DataFramesMeta

warfarin_data = dataset("pumas/warfarin_nonmem")

# Transform the data in a single chain of operations
warfarin_data_wide = @chain warfarin_data begin
    # Calculate size-based covariates
    @rtransform begin
        # Volume scaling based on 70kg reference weight
        :FSZV = :wtbl / 70
        # Clearance scaling with allometric exponent 0.75
        :FSZCL = (:wtbl / 70)^0.75
        # Create dv column names (e.g., "DV1", "DV2") from DVID
        :dvname = "DV$(:dvid)"
        # Set cmt to 1 for dosing records, missing for observations.
        # amt is 0.0 (not missing) on observation rows, so key on evid.
        :cmt = :evid == 1 ? 1 : missing
    end
    # The evid column already exists in the data and is left unchanged
    unstack(Not([:dvid, :dvname, :dv]), :dvname, :dv)
    rename!(:DV1 => :conc, :DV2 => :pca)
    # Drop the placeholder column created from the dose rows (dvid == 0)
    select!(Not([:DV0]))
end

330×12 DataFrame (excerpt)

Row	id	time	evid	amt	wtbl	age	sex	FSZV	FSZCL	cmt	conc	pca
	Int64	Float64	Int64	Float64	Float64	Int64	String1	Float64	Float64	Int64?	Float64?	Float64?
1	1	0.0	1	100.0	66.7	50	M	0.952857	0.96443	1	missing	missing
2	1	0.5	0	0.0	66.7	50	M	0.952857	0.96443	missing	0.0	missing
3	1	1.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	1.9	missing
4	1	2.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	3.3	missing
5	1	3.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	6.6	missing
6	1	6.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	9.1	missing
7	1	9.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	10.8	missing
8	1	12.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	8.6	missing
9	1	24.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	5.6	44.0
10	1	36.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	4.0	27.0
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
325	33	24.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	9.2	49.0
326	33	36.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	8.5	32.0
327	33	48.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	6.4	26.0
328	33	72.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	4.8	22.0
329	33	96.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	3.1	28.0
330	33	120.0	0	0.0	66.7	50	M	0.952857	0.96443	missing	2.5	33.0

The data wrangling steps above address several key requirements for Pumas data format:

Multiple Dependent Variables

Original data was in long format with a single dv column and dvid identifier
Used unstack to create wide format with separate columns (:conc, :pca) for each DV type
See Multiple Dependent Variables section

Essential Dosing Information

The evid column (0 for observations, 1 for doses) is already present in the source data and is kept as is
Set cmt to 1 for dosing records and missing for observations
See Essential Dosing Columns section

Data Cleaning

Added descriptive names for DVs (:conc and :pca instead of :DV1 and :DV2)
See Additional Intricacies section

Covariate Preparation

Created size-based covariates (:FSZV, :FSZCL) for modeling
Maintained original covariates for baseline weight and sex (:wtbl, :sex)
See Representation of Time-Independent Covariates section
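
The two size-based covariates can be verified by hand for one subject. The snippet below repeats the arithmetic for a baseline weight of 66.7 kg, the value shown for subject 1 in the listing above; the rounding is only for display.

```julia
wtbl = 66.7                  # baseline weight in kg
FSZV = wtbl / 70             # volume scaling relative to a 70 kg reference
FSZCL = (wtbl / 70)^0.75     # clearance scaling with allometric exponent 0.75

println("FSZV  = ", round(FSZV; digits = 6))
println("FSZCL = ", round(FSZCL; digits = 5))
```

Both factors equal 1 for a 70 kg subject, so they express each individual's size relative to the reference weight.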

using Pumas
pop = read_pumas(
    warfarin_data_wide;
    id = :id,
    time = :time,
    amt = :amt,
    cmt = :cmt,
    evid = :evid,
    covariates = [:sex, :wtbl, :FSZV, :FSZCL],
    observations = [:conc, :pca],
)

The function call above constructs a Population object by interpreting each row in warfarin_data_wide according to the specified mappings.

10 Viewing the Dataset After `read_pumas`

Once read_pumas completes successfully, it returns a Population object, which is a container for one or more Subject objects. Several methods are available for inspection:

Number of Subjects

length(pop)  # Returns the number of subjects

Accessing Individual Subjects

first(pop)   # Returns the first Subject
pop[2]       # Returns the second Subject, if it exists. Otherwise, a `BoundsError` is thrown

Subject Fields: Each Subject contains fields such as .events (dose records), .observations (DV measurements), and .covariates. For example:

subj = pop[1]
subj.events       # dosing events for this subject
subj.observations # observed values, one entry per dependent variable
subj.covariates   # covariate values (constant or time-varying)

Printing the Population

pop

Displays a brief summary, including the number of subjects, their IDs, covariates, and observations.

One can also inspect the population in a more detailed way by converting it to a DataFrame.

using DataFrames
DataFrame(pop)

11 Putting Everything Together: An Example Workflow

Read From CSV Into a DataFrame

using CSV, DataFrames, Pumas
df = CSV.read("mydata.csv", DataFrame; missingstring = ["", ".", "NA"])

Pivot If Multiple DVs (Optional)
- If the data originally has a single DV column plus a column like CMT or DVID, use unstack to create columns such as DV_1, DV_2.
Use read_pumas

pop = read_pumas(
    df;
    observations = [:DV],
    covariates = [:WT, :AGE, :SEX],
    id = :ID,
    time = :TIME,
    amt = :AMT,
    cmt = :CMT,
    # other columns as needed
)

Inspect the Resulting Population

pop
length(pop)
pop[1].events
pop[1].observations
pop[1].covariates

12 Summary

Dependent Variables:
- Single DV → single column.
- Multiple DVs → pivot to wide format, one column per DV.
Dosing Information:
- Required columns for standard dosing: amt, possibly cmt, and evid.
- Advanced dosing: addl, ii, ss, and rate/duration.
Covariates:
- Time-independent covariates remain constant for each subject.
- Time-dependent covariates change over different rows and can be interpolated within Pumas.
Missing Data:
- Observations must be missing at dosing times.
- Covariates that are missing can be filled via user-defined methods or left as missing if suitable for the modeling approach.
read_pumas Function:
- Converts a DataFrame (or file) into a Population object.
- Performs extensive checks and enforces NMTRAN-like rules if check is true.
Viewing the Dataset:
- Population objects can be explored by indexing subjects and querying .events, .observations, or .covariates.

Reuse

CC BY-SA 4.0
