This guide is designed to help you prepare an existing analysis dataset for use in Pumas, focusing on datasets that already include dosing events, observations, and covariates. It is not intended to demonstrate how to create analysis-ready datasets from source SDTM and ADaM datasets. Instead, the goal is to ensure your dataset is correctly structured and formatted before running your first model in Pumas.
Typically, the process begins by reading a dataset from a spreadsheet or tabular format, such as Excel or CSV, into a DataFrame in Julia. This DataFrame serves as the foundation for analysis. The read_pumas function is then used to convert this DataFrame into a Population object, which is essential for pharmacometric modeling in Pumas. Besides performing this conversion, read_pumas runs checks similar to those NM-TRAN performs when a NONMEM run is started, so that data problems are caught before model execution.
The sections below describe the tabular layout expected for each use case (dependent variables, dosing records, covariates) and point to the corresponding read_pumas requirements. The read_pumas section near the end documents the function's arguments and checks in detail.
1 Prerequisites
Basic Julia Proficiency: A fundamental understanding of the Julia programming language is beneficial.
Familiarity with DataFrames: Working knowledge of the DataFrames.jl or DataFramesMeta.jl package is helpful for pivoting data (long-to-wide format) and general data wrangling tasks.
Pharmacometric Concepts: A basic background in pharmacokinetics/pharmacodynamics (PK/PD) modeling or general NLME concepts ensures that terms such as “dependent variables,” “dosing events,” and “observations” are clear.
CSV/Spreadsheet Handling: Experience reading in files (CSV or similar formats) and handling missing values simplifies the data import process.
NM-TRAN Data Format (Optional): Knowledge of how NM-TRAN-formatted datasets typically look can assist in drawing parallels with Pumas data requirements, though it is not strictly required.
2 Learning Goals
By the end of this tutorial, participants will be able to:
Prepare and structure datasets with single and multiple dependent variables in the wide format required by Pumas.
Specify and manage different types of dosing information, including essential and advanced dosing columns.
Differentiate between time-dependent and time-independent covariates and handle them appropriately in datasets.
Address and manage missing data effectively, ensuring data integrity for modeling.
Utilize the read_pumas function to convert datasets into Population objects, performing necessary validations and checks.
Inspect and navigate Population objects in Pumas, understanding the structure and components of the data.
Apply data wrangling techniques to transform datasets from long to wide format, ensuring compatibility with Pumas.
Understand the key differences between Pumas and NONMEM data formats, facilitating a smoother transition for users familiar with NONMEM.
3 Representation of Dependent Variables (Wide Format)
3.1 Rationale for Wide Format
Pumas adopts a wide format for datasets containing multiple dependent variables because it provides several practical advantages:
Clear Separation of Measurement Types: Storing each analyte or measurement type in its own column clarifies which values belong to which endpoint. This clarity reduces the need for additional identifiers (e.g., CMT, DVID) that might complicate the interpretation of rows, especially in models that handle two or more dependent variables simultaneously.
Facilitated Model Specification: In a wide format, each dependent variable can be referenced directly by its column name. This approach simplifies the process of writing models that involve multiple analytes or linked processes (such as PK/PD), making it more transparent where each measurement is used in the model definition.
Flexible Handling of Missing Data: When multiple dependent variables share observation times, it is common for one to be missing while another is present. A wide format makes it easier to store these missing entries independently for each DV without overlapping or clashing in a single “long” DV column.
Consistency of Output: Pumas similarly returns results in a wide format, maintaining alignment between model inputs and outputs. This consistent structure can streamline subsequent data checks, diagnostics, and reporting.
For those accustomed to NM-TRAN’s single DV column plus a CMT or DVID, it can initially appear more convenient to keep a single column for all dependent variables. However, separating them into multiple columns aligns well with the data structures in Julia (e.g., DataFrame columns), reduces indexing overhead, and promotes clearer, more maintainable models when multiple DVs are involved.
3.2 Single Dependent Variable
If a study measures only one type of dependent variable (for example, a single plasma drug concentration), the data can be arranged in a single column (commonly named DV):
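| ID | TIME | AMT | DV | WT | AGE | SEX | … |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 100 | | 70 | 45 | M | … |
| 1 | 1 | 0 | 8.0 | 70 | 45 | M | … |
| 1 | 2 | 0 | 7.1 | 70 | 45 | M | … |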
For single-DV datasets, typically no pivoting is necessary; simply specify observations = [:DV] in read_pumas.
3.3 Multiple Dependent Variables
If multiple dependent variables exist (for example, a parent drug and a metabolite, or PK and PD markers), Pumas requires a wide format. In this format, each DV appears in its own column:
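| ID | TIME | AMT | DV_parent | DV_metabolite | WT | AGE | SEX | … |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 100 | | | 70 | 45 | M | … |
| 1 | 1 | 0 | 8.0 | 1.2 | 70 | 45 | M | … |
| 1 | 2 | 0 | 6.9 | 1.5 | 70 | 45 | M | … |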
NM-TRAN files often store multiple DVs in a long format. In Pumas, each DV must be in a separate column.
A “pivot” or “unstack” operation may be necessary if the original data has a single DV column with an identifier column (e.g., CMT or DVID).
Tip
Observation variables in Pumas do not have to be uppercase. The examples below keep the uppercase DV column name for comparison with NONMEM, because lowercase letters were not supported prior to NONMEM v7.2. Descriptive names such as conc or painscore are customary in Pumas: since each variable has its own column, the column name itself can say what was measured.
Converting from Long to Wide Format
This process is known in data wrangling terminology as pivoting; going from long to wide is a “pivot wider” operation (the reverse is “pivot longer”). In Julia’s DataFrames.jl package, the corresponding function is unstack. Below is a step-by-step guide.
Detailed Example: Converting an NM-TRAN Style Dataset
Suppose you have a DataFrame named df that looks something like this:
using DataFramesMeta

df = DataFrame(
    ID = 1,
    TIME = repeat([0; 24:12:48; 72:24:120], inner = 2),
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
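14×9 DataFrame
| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX |
|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 |
| 2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 |
| 13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 |
| 14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 |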
Here, CMT=1 corresponds to one dependent variable (e.g., analyte 1), and CMT=2 corresponds to another dependent variable (analyte 2).
Inspect the Dataset
Rows: Each row is an event record (dose or observation).
Columns:
ID: Subject identifier (all = 1 in this small example).
TIME: Time of the event or observation.
DV: Dependent variable (could be concentration of analyte 1 or analyte 2).
CMT: Distinguishes the analyte or the “compartment.”
EVID: Event ID (1 = dosing event, 0 = observation).
AMT: The dose amount; only for dose rows (EVID=1).
Additional covariates: WT, AGE, SEX.
Clone CMT as DVID
Pumas uses CMT for specifying dose compartments. It won’t help to identify different DVs in the final wide format, because CMT must be set to missing for observations in Pumas. Hence, create a new column DVID (which is typical in some NM-TRAN data structures):
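@transform! df :DVID = :CMT

14×10 DataFrame
| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Float64? | Int64 | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
| 2 | 1 | 0 | 100.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 3 | 1 | 24 | 9.2 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 4 | 1 | 24 | 49.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 5 | 1 | 36 | 8.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 6 | 1 | 36 | 32.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 7 | 1 | 48 | 6.4 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 8 | 1 | 48 | 26.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 9 | 1 | 72 | 4.8 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 10 | 1 | 72 | 22.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 11 | 1 | 96 | 3.1 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 12 | 1 | 96 | 28.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |
| 13 | 1 | 120 | 2.5 | 1 | 0 | missing | 66.7 | 50 | 1 | 1 |
| 14 | 1 | 120 | 33.0 | 2 | 0 | missing | 66.7 | 50 | 1 | 2 |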
Now :DVID holds the “type” of DV.
Adjust CMT for Dosing
Next, we set :CMT to missing for observation rows (EVID == 0) and leave it as is for dosing rows:
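@rtransform! df :CMT = :EVID != 0 ? :CMT : missing

14×10 DataFrame
| Row | ID | TIME | DV | CMT | EVID | AMT | WT | AGE | SEX | DVID |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Float64? | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Int64 |
| 1 | 1 | 0 | missing | 1 | 1 | 100 | 66.7 | 50 | 1 | 1 |
| 2 | 1 | 0 | 100.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 3 | 1 | 24 | 9.2 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 4 | 1 | 24 | 49.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 5 | 1 | 36 | 8.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 6 | 1 | 36 | 32.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 7 | 1 | 48 | 6.4 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 8 | 1 | 48 | 26.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 9 | 1 | 72 | 4.8 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 10 | 1 | 72 | 22.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 11 | 1 | 96 | 3.1 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 12 | 1 | 96 | 28.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |
| 13 | 1 | 120 | 2.5 | missing | 0 | missing | 66.7 | 50 | 1 | 1 |
| 14 | 1 | 120 | 33.0 | missing | 0 | missing | 66.7 | 50 | 1 | 2 |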
When EVID=1, we keep the original compartment number (e.g., 1 if dosing to compartment 1).
When EVID=0 (observation), we set it to missing because for Pumas, the cmt is not needed for the observation itself.
Unstack (Pivot) to Wide Format
Use unstack from DataFrames.jl to pivot the dataset. The idea is:
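Column Key: The variable whose values will form the new columns; here, :DVID.
Value Column: The variable whose values are spread across the new columns; here, :DV.
All remaining columns act as row keys that identify each row of the result.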
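wide_df = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

8×10 DataFrame
| Row | ID | TIME | CMT | EVID | AMT | WT | AGE | SEX | DV_1 | DV_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64? | Int64 | Int64? | Float64 | Int64 | Int64 | Float64? | Float64? |
| 1 | 1 | 0 | 1 | 1 | 100 | 66.7 | 50 | 1 | missing | missing |
| 2 | 1 | 0 | missing | 0 | missing | 66.7 | 50 | 1 | missing | 100.0 |
| 3 | 1 | 24 | missing | 0 | missing | 66.7 | 50 | 1 | 9.2 | 49.0 |
| 4 | 1 | 36 | missing | 0 | missing | 66.7 | 50 | 1 | 8.5 | 32.0 |
| 5 | 1 | 48 | missing | 0 | missing | 66.7 | 50 | 1 | 6.4 | 26.0 |
| 6 | 1 | 72 | missing | 0 | missing | 66.7 | 50 | 1 | 4.8 | 22.0 |
| 7 | 1 | 96 | missing | 0 | missing | 66.7 | 50 | 1 | 3.1 | 28.0 |
| 8 | 1 | 120 | missing | 0 | missing | 66.7 | 50 | 1 | 2.5 | 33.0 |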
Details:
unstack(df, :DVID, :DV) says “take the data from the df DataFrame, create new columns based on the unique values in DVID, and fill these columns with the values from the DV column.”
renamecols = x -> Symbol(:DV_, x) renames the resulting columns from “1”, “2” to “DV_1”, “DV_2”.
After this step, wide_df will have separate columns for each DV type:
DV_1 for DVID==1 (originally CMT==1)
DV_2 for DVID==2 (originally CMT==2)
Final Structure
The final wide_df is now in the structure that Pumas can parse for modeling:
ID
TIME
EVID
AMT
CMT (dosing compartments if needed)
DV_1 (observations for analyte 1)
DV_2 (observations for analyte 2)
Covariates (e.g., WT, AGE, SEX)
Handling Mismatched Observation Times
What if the times do not align between the two DVs? For example:
df = DataFrame(
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)
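Result: After applying the same steps as before (clone CMT as DVID, set CMT to missing on observation rows, then unstack), each observation time gets its own row, and DV_1 or DV_2 is missing wherever that analyte was not measured at that time. Pumas handles these missing values independently in each column.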
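To see this concretely, the sketch below applies the same three steps (clone CMT into DVID, set CMT to missing on observation rows, unstack) to the mismatched data. It uses plain DataFrames.jl calls rather than the DataFramesMeta macros; that substitution is only a convenience for this sketch, and the variable name wide is illustrative.

```julia
using DataFrames

df = DataFrame(
    ID = 1,
    TIME = [0, 0, 12, 24, 32, 36, 44, 48, 66, 72, 90, 96, 112, 120],
    DV = [missing, 100.0, 9.2, 49.0, 8.5, 32.0, 6.4, 26.0, 4.8, 22.0, 3.1, 28.0, 2.5, 33.0],
    CMT = repeat([1, 2], outer = 7),
    EVID = [1; repeat([0], 13)],
    AMT = [100; repeat([missing], 13)],
    WT = 66.7,
    AGE = 50,
    SEX = 1,
)

# Step 1: keep the DV type in its own column before CMT is modified.
df.DVID = copy(df.CMT)

# Step 2: CMT is only kept on dosing rows.
df.CMT = ifelse.(df.EVID .!= 0, df.CMT, missing)

# Step 3: pivot to wide format, one column per DV type.
wide = unstack(df, :DVID, :DV; renamecols = x -> Symbol(:DV_, x))

# Because no two observation rows share a time, every input row stays a separate row.
println(nrow(wide))                    # number of rows in the wide table
println(count(!ismissing, wide.DV_1))  # observed values for analyte 1
println(count(!ismissing, wide.DV_2))  # observed values for analyte 2
```

Compared with the aligned example, the wide table keeps one row per distinct observation time instead of pairing the two analytes on shared rows.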
4 Representation of Dosing Information (amt, cmt, evid, addl, ii, ss, rate, duration)
4.1 Essential Dosing Columns
amt: The dose amount.
    When amt > 0, Pumas interprets the row as a dosing event.
cmt: The compartment being dosed (for example, 1, 2, or a string such as "Depot").
    Typically set to missing for observation rows.
evid: The event ID:
    0: Observation (no dose).
    1: Standard dosing event.
    3 or 4: Reset events (less common).
    If this column does not exist, Pumas infers evid=1 when amt>0, otherwise evid=0.
4.2 Advanced Dosing Columns
addl: The number of additional doses.
    Must be 0 if there are no repeated doses.
    If > 0, a non-zero ii (interdose interval) must also be provided.
ii: The interdose interval.
    If ii > 0, addl > 0 is expected (and vice versa).
ss: The steady-state indicator.
    0: Not a steady-state dose.
    1: Steady-state dose (compartment amounts reset to steady-state amounts).
    2: Add steady-state amounts to the existing amounts.
    For repeated bolus dosing and infusion with ss, a non-zero ii is required.
    For a constant infusion steady-state, amt=0 and rate>0 must be combined with ii=0.
duration: An alternative to rate. If > 0, Pumas calculates rate = amt / duration.
These columns allow for specification of different dosing regimens.
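As a rough illustration of what addl, ii, and duration encode, the sketch below expands a single dose record into the dose times it implies and computes the infusion rate implied by a duration. The helper names implied_dose_times and rate_from_duration are hypothetical and are not part of Pumas; they only restate the arithmetic described above.

```julia
# Hypothetical helper (not a Pumas function): a record at `time` with `addl`
# additional doses spaced `ii` apart implies addl + 1 doses in total.
function implied_dose_times(time, addl, ii)
    addl >= 0 || throw(ArgumentError("addl must be non-negative"))
    addl == 0 || ii > 0 || throw(ArgumentError("addl > 0 requires ii > 0"))
    return [time + k * ii for k in 0:addl]
end

# Hypothetical helper: the rate implied by giving `amt` over `duration`.
rate_from_duration(amt, duration) = amt / duration

# One record with addl = 6 and ii = 24 stands for a week of once-daily dosing.
println(implied_dose_times(0, 6, 24))

# 100 units infused over 2 time units.
println(rate_from_duration(100, 2))
```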
5 Representation of Time-Independent Covariates (Baseline Demographics)
Time-independent covariates, such as weight, age, and sex, remain constant for an individual. For example:
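| ID | TIME | AMT | DV | WT | AGE | SEX | … |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 100 | | 70 | 45 | M | … |
| 1 | 1 | 0 | 8.0 | 70 | 45 | M | … |
| 1 | 2 | 0 | 7.5 | 70 | 45 | M | … |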
Values for each covariate must remain the same across all rows for the individual.
During the read_pumas call, use covariates = [:WT, :AGE, :SEX] to specify these columns as covariates.
Tip
Values for covariates can be numeric or character, which provides flexibility in the data preparation step.
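Before declaring a column as a time-independent covariate, it can help to confirm that it really is constant within each subject. The following is a minimal sketch using DataFrames.jl on made-up data; the helper ndistinct and the column names n_WT and n_SEX are illustrative choices, not Pumas conventions.

```julia
using DataFrames

df = DataFrame(
    ID = [1, 1, 1, 2, 2],
    TIME = [0, 1, 2, 0, 1],
    WT = [70, 70, 70, 82, 81],   # subject 2 has two different weights
    SEX = ["M", "M", "M", "F", "F"],
)

# Number of distinct values in a column (a missing entry counts as its own value).
ndistinct(x) = length(unique(x))

per_subject = combine(
    groupby(df, :ID),
    :WT => ndistinct => :n_WT,
    :SEX => ndistinct => :n_SEX,
)

# Subjects for which at least one covariate is not constant.
varying = filter(row -> row.n_WT > 1 || row.n_SEX > 1, per_subject)
println(varying)
```

Subjects listed in varying either contain a data-entry inconsistency or carry a covariate that should be treated as time-dependent.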
6 Representation of Time-Dependent Covariates
Time-dependent covariates can be represented by assigning different values at different time points:
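| ID | TIME | AMT | DV | WT | AGE | SEX | … | Note |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 100 | | 70 | 45 | M | … | |
| 1 | 1 | 0 | 8.0 | 70 | 45 | M | … | |
| 1 | 2 | 0 | 7.5 | 69 | 45 | M | … | WT changed from 70 to 69 at TIME=2 |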
To indicate time-dependency, simply provide multiple rows for each subject with changing covariate values.
Missing values in both constant and time-varying covariates are filled by piece-wise constant interpolation. For a constant covariate, the filled values are simply that constant. The direction of interpolation is set with the covariates_direction keyword argument of read_pumas: :left is “last observation carried forward” (LOCF) and :right is “next observation carried backward” (NOCB, the NONMEM default). The read_pumas argument reference in Section 9.1 lists :left as the default, so pass covariates_direction = :right explicitly when NONMEM-style NOCB behavior is required.
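The difference between the two directions can be illustrated on a single covariate vector. The sketch below is a hypothetical helper, fill_piecewise, written only to show what LOCF and NOCB do to missing entries; it works on row order rather than on time values and is not how Pumas implements its interpolation.

```julia
# Hypothetical helper: fill missing entries from the nearest non-missing
# neighbour on the left (:left, LOCF) or on the right (:right, NOCB).
function fill_piecewise(values; direction = :left)
    filled = collect(Union{Missing, eltype(values)}, values)  # work on a copy
    order = direction === :left ? eachindex(filled) : reverse(eachindex(filled))
    last_seen = missing
    for i in order
        if ismissing(filled[i])
            filled[i] = last_seen      # stays missing if nothing was seen yet
        else
            last_seen = filled[i]
        end
    end
    return filled
end

wt = [70, missing, missing, 69]

println(fill_piecewise(wt; direction = :left))   # LOCF
println(fill_piecewise(wt; direction = :right))  # NOCB
```

With LOCF the two missing weights take the earlier value 70, whereas with NOCB they take the later value 69.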
7 Additional Intricacies
7.1 Unique Times per Subject
Within each subject’s data, Pumas expects unique time values for each kind of record. Two records may share a time only if they represent different events (for example, a dose and an observation); otherwise one of them should be moved or the two should be combined. For example, a dependent variable cannot have repeated observations at the same time point, and a covariate cannot have more than one value at a given time.
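Repeated observation times can be located before calling read_pumas. The following is a minimal sketch with DataFrames.jl on made-up data; the column names follow the earlier examples and the variable names obs, counts, and duplicates are illustrative.

```julia
using DataFrames

df = DataFrame(
    ID = [1, 1, 1, 1, 2, 2],
    TIME = [0.0, 0.0, 1.0, 1.0, 0.0, 2.0],
    AMT = [100, missing, missing, missing, 100, missing],
    DV = [missing, 0.0, 8.0, 8.2, missing, 5.1],
)

# Keep only rows that carry an observation, then count rows per subject and time.
obs = dropmissing(df, :DV)
counts = combine(groupby(obs, [:ID, :TIME]), nrow => :n)

# Any (ID, TIME) pair with more than one observation needs attention.
duplicates = filter(:n => >(1), counts)
println(duplicates)
```

Here subject 1 has two DV values at TIME = 1.0. The dose and the observation that share TIME = 0.0 are not flagged, because they are different kinds of records.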
7.2 Missing Covariate Information
Missing values can be handled by placing missing in the covariate column.
read_pumas does not apply any model-based imputation. Missing covariate values are either filled during data preprocessing, using whichever imputation method is appropriate, or bridged by the piece-wise constant interpolation (LOCF or NOCB) selected with the covariates_direction argument of read_pumas.
7.3 Observations at Dosing Time
Observations (DVs) must be missing on any row where amt > 0.
If a numeric DV appears at the same time as a dose, an error from read_pumas will occur.
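A quick way to find offending rows before read_pumas reports them is to look for dose rows that also carry an observation. The sketch below uses DataFrames.jl on made-up data; the names is_dose and bad are illustrative. Setting the value to missing is one possible resolution; whether the measurement should instead be moved to its own observation row is a judgment call for the analyst.

```julia
using DataFrames

df = DataFrame(
    ID = [1, 1, 1],
    TIME = [0, 1, 2],
    AMT = [100, missing, missing],
    DV = [5.0, 8.0, 7.1],        # 5.0 sits on the dose row
)

# A row is a dose row when AMT is present and positive.
is_dose = coalesce.(df.AMT .> 0, false)

# Dose rows that also carry a numeric observation.
bad = df[is_dose .& .!ismissing.(df.DV), :]
println(bad)

# One possible fix: blank the observation on dose rows.
df.DV = ifelse.(is_dose, missing, df.DV)
println(df)
```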
8 Key Differences Between Pumas and NONMEM Data Formats
| Feature | NONMEM (NM-TRAN) | Pumas |
|---|---|---|
| Format | Long format | Wide format |
| Dependent Variables | Single column for all DVs | Separate columns for each DV type |
| Identifier Columns | Uses DVID or CMT to identify DV types | No need for DVID; each DV has its column |
| Character Values | Requires special handling (e.g., mapping “M”/“F”) | Handles String types natively |
| Data Parsing | Uses DATA and INPUT blocks | Uses read_pumas with named arguments |
| Covariate Handling | Numeric columns with transformations in model | Flexible format, can transform in DataFrame |
| Missing Values | Special codes (e.g., “.”) | Native missing type support |
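Two of the rows above (character values and missing values) translate into small preprocessing steps when a NONMEM-style file is reused. The following plain-Julia sketch shows the idea on made-up vectors; the coding 1 = male, 0 = female is an assumption made purely for illustration and must be checked against the actual dataset specification.

```julia
# Values as they might appear in a NONMEM-style file read as text.
raw_dv = [".", "8.0", "7.1"]

# Assumed coding for this illustration only: 1 = male, 0 = female.
raw_sex = [1, 0, 1]

# "." becomes a native missing; everything else is parsed as a number.
dv = [s == "." ? missing : parse(Float64, s) for s in raw_dv]

# Numeric codes can be replaced by readable strings.
sex = [code == 1 ? "M" : "F" for code in raw_sex]

println(dv)
println(sex)
```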
9 read_pumas
After the dataset is prepared—whether single-DV or multi-DV (pivoted to wide format)—it is typically loaded into Pumas using the read_pumas function. Under the hood, read_pumas converts the DataFrame into a Population object containing one or more Subject objects. Each subject’s data includes:
Subject identifier (ID).
Event records (dosing events).
Observation records (measurements of dependent variables).
Covariate values (time-varying or constant).
9.1 The read_pumas function signature
The read_pumas function constructs a Population object, converting rows from a CSV (or DataFrame) into a validated Pumas format.
id
    Symbol. Default: :id
    The name of the column with the IDs of the individuals. Each individual should have a unique integer or string.
time
    Symbol. Default: :time
    The name of the column with the time corresponding to the row. Time should be unique per ID (no duplicate time values for a given subject).
evid
    Union{Symbol, Nothing}. Default: nothing
    The name of the column with event IDs, or nothing.
    Possible event IDs are:
    • 0 : observation
    • 1 : dose event
    • 2 : other type event
    • 3 : reset event (resets amounts in each compartment to zero and resets on/off status to initial)
    • 4 : reset and dose event
    Event ID defaults to 0 if the dose amount is 0 or missing, and 1 otherwise.
amt
    Symbol. Default: :amt
    The name of the column of dose amounts. If the event ID is specified and non-zero, the dose amount should be non-zero. The default dose amount is 0.
addl
    Symbol. Default: :addl
    The name of the column that indicates the number of repeated dose events. The number of additional doses defaults to 0.
ii
    Symbol. Default: :ii
    The name of the column of inter-dose intervals. When the number of additional doses (addl) is specified and non-zero, this is the time to the next dose. For steady-state events with multiple infusions or bolus doses, this is the time between implied doses. The default inter-dose interval is 0.
    Requirements:
    • Must be non-zero for steady-state events with multiple infusions or bolus doses.
    • Must be zero for steady-state events with constant infusion.
cmt
    Symbol. Default: :cmt
    The name of the column with the compartment to be dosed. Compartments can be specified by integers, strings, or symbols. The default compartment is 1.
rate
    Symbol. Default: :rate
    The name of the column with the rate of administration. A rate of -2 allows the rate to be determined by Dose Control Parameters (DCP). Defaults to 0.
    Possible values:
    • 0 : instantaneous bolus dose
    • > 0 : infusion dose administered at a constant rate for a duration equal to amt / rate
    • -2 : infusion rate or duration specified by the dose control parameters (see @dosecontrol)
ss
    Symbol. Default: :ss
    The name of the column that indicates whether a dose is a steady-state dose.
    Possible values:
    • 0 : dose is not a steady-state dose.
    • 1 : dose is a steady-state dose; compartment amounts are reset to the resulting steady-state amounts from the given dose (prior dose events are zeroed out, and infusions in progress or pending additional doses are cancelled).
    • 2 : dose is a steady-state dose; compartment amounts are set to the sum of the steady-state amounts plus any amounts that would be present otherwise.
route
    Symbol (if present)
    The name of the column that specifies the route of administration.
mdv
    Union{Symbol, Nothing}. Default: nothing
    The name of the column that indicates if observations are missing, or nothing.
event_data
    Bool. Default: true
    Toggles assertions applicable to event data. Specifically checks if the following columns are present in the DataFrame (either as default or user-defined): :id, :time, and :amt.
    If no :evid column is present, a warning is thrown and :evid is set to 1 when :amt values are > 0 or not missing, or to 0 when :amt values are missing and observations are not missing. Otherwise, read_pumas will throw an error.
covariates_direction
    Symbol. Default: :left
    The direction of covariate interpolation, either :left (LOCF) or :right (NOCB). Note: For models with occasion variables, :left ensures correct interpolation behavior.
check
    Bool. Default: event_data
    Toggles NM-TRAN-compliance checks of the input data. Checks if the following columns are present in the DataFrame (either as default or user-defined): :id, :time, :amt, :cmt, :evid, :addl, :ii, :ss, and :route.
    Additional checks include:
    • All variables in observations must be numeric (Integer or AbstractFloat).
    • :amt must be numeric.
    • :cmt must be a positive Integer, AbstractString, or Symbol.
    • :amt must be missing or 0 when evid = 0; otherwise ≥ 0.
    • All variables in observations must be missing when evid = 1.
    • :ii must be present if :ss is present.
    • :ii must be missing or 0 when evid = 0.
    • :ii must be > 0 if :addl > 0, and vice versa.
    • :addl must be ≥ 0 when evid = 1.
    • :evid must be nonzero when :amt > 0 or when :addl and :ii values are > 0.
adjust_evid34
    Bool. Default: true
    Toggles adjustment of the time vector for reset events (evid = 3 and evid = 4). If true, the time of the previous event is added to the time on record to keep the time vector monotonically increasing.
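A few of the consistency rules listed under check can be expressed as simple per-record tests, which may be useful for screening a dataset before it reaches read_pumas. The function record_problems below is a hypothetical, simplified sketch in plain Julia. It is not how read_pumas is implemented and covers only three of the rules.

```julia
# Hypothetical helper (not part of Pumas): returns a list of rule violations
# for one record, based on a subset of the checks described above.
function record_problems(; evid, amt = missing, addl = 0, ii = 0, dv = missing)
    dose = coalesce(amt, 0)          # treat a missing amount as no dose
    problems = String[]
    if evid == 0 && dose != 0
        push!(problems, "amt must be missing or 0 when evid = 0")
    end
    if evid == 1 && !ismissing(dv)
        push!(problems, "observations must be missing when evid = 1")
    end
    if (addl > 0) != (ii > 0)
        push!(problems, "ii must be > 0 if addl > 0, and vice versa")
    end
    return problems
end

# A consistent repeated-dose record: nothing to report.
println(record_problems(evid = 1, amt = 100, addl = 6, ii = 24))

# A dose amount on an observation row.
println(record_problems(evid = 0, amt = 100))

# A dose row with an observation and with addl but no ii.
println(record_problems(evid = 1, amt = 100, addl = 6, ii = 0, dv = 8.0))
```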
9.2 How read_pumas Builds a Population
Once a properly formatted DataFrame is available (e.g., wide_df from the multi-DV pivot example or a simpler single-DV dataset), the following can be run:
using Pumas

pop = read_pumas(
    wide_df;
    id = :ID,
    time = :TIME,
    evid = :EVID,                    # optional if the column doesn't exist (Pumas will infer)
    amt = :AMT,                      # optional if no dosing events in your data
    cmt = :CMT,                      # optional if you prefer the default of 1
    observations = [:DV_1, :DV_2],   # One or more DV columns
    covariates = [:WT, :AGE, :SEX],  # Any additional columns for subject data
    event_data = true,               # Default. If your dataset includes doses/observations.
)
Here is what happens under the hood:
Check for Required Columns
    If event_data=true (the default, meaning the data includes doses and observations), Pumas requires columns for id, time, and amt.
    At least one observation column must be declared via observations = ....
    If any of these are missing while event_data is true, read_pumas throws an error along the lines of: “The input must have: id, time, amt, and observations”.
Check for Basic Column Validity
    If the dataset does not have an evid column, Pumas issues a warning and auto-creates one:
    Warning: Your dataset has a dose event but no evid column…
    The id column is still required when event_data=false; if it is missing, an error is thrown.
Inferring or Validating Dose Rows
    If a row has amt > 0, Pumas expects evid in (1,4). If evid is not provided, Pumas sets evid=1 for that row.
    cmt must be positive (e.g., 1, 2) or a valid symbol/string (e.g., :Depot/"Depot") for dosing rows.
Inferring or Validating Observation Rows
    If evid is not provided and a row has amt equal to 0 or missing, Pumas treats it as an observation row (evid=0).
    Observations at the exact time of dose are not allowed on the dose row; they must be set to missing. If a numeric observation is accidentally on the same row as a dose, Pumas issues an error.
Check for Data Consistency
    When evid = 0, amt must be zero or missing.
    If there are non-numeric entries in a numeric column (amt, a DV column, etc.), Pumas identifies the row and column causing the problem.
    If advanced dose features (ss, addl, ii, rate, duration) are used, each is checked for internal consistency (e.g., “If addl > 0, ii must be > 0”).
Constructing Subjects
    read_pumas groups rows by id.
    Within each subject, Pumas sorts rows by time and checks that no two events have the same timestamp.
    Covariate columns (e.g., AGE, SEX) become part of the subject’s data. Time-varying covariates are those whose values change across the subject’s rows.
Building the Final Population
    After each subject passes validation, a Subject object is created for them, and these are collected into a Population object.
    The result is stored in pop, ready for model fitting in Pumas.
9.3 Example Usage
To illustrate the process, let’s walk through an example using a warfarin dataset that ships with the PharmaDatasets package, loaded here as warfarin_data.
Inspecting the first few rows of the dataset shows that it has a single DV column and a DVID column that indicates the type of observation. As mentioned in the previous section, this is not the preferred format for Pumas. Instead, we want to pivot the data to have one column per dependent variable. We will also do some data cleaning to make the data more suitable for Pumas, some of which will be explained in the next module.
using PharmaDatasets, DataFramesMeta

warfarin_data = dataset("paganz2024/warfarin_long")

# This is to avoid duplicate time points for observations
@. warfarin_data[[133, 135, 137, 139], :TIME] += 1e-6

# Transform the data in a single chain of operations
warfarin_data_wide = @chain warfarin_data begin
    @rsubset !contains(:ID, "#")
    # Calculate size-based covariates
    @rtransform begin
        # Volume scaling based on 70kg reference weight
        :FSZV = :WEIGHT / 70
        # Clearance scaling with allometric exponent 0.75
        :FSZCL = (:WEIGHT / 70)^0.75
        # Create DV column names (e.g., "DV1", "DV2") from DVID
        :DVNAME = "DV$(:DVID)"
        # Set CMT to 1 for dosing records, missing for observations
        :CMT = ismissing(:AMOUNT) ? missing : 1
        # Set EVID to 1 for dosing records, 0 for observations
        :EVID = ismissing(:AMOUNT) ? 0 : 1
    end
    unstack(Not([:DVID, :DVNAME, :DV]), :DVNAME, :DV)
    rename!(:DV1 => :conc, :DV2 => :pca)
end
317×13 DataFrame (292 rows omitted)
| Row | ID | TIME | WEIGHT | AGE | SEX | AMOUNT | FSZV | FSZCL | CMT | EVID | DV0 | pca | conc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String3 | Float64 | Float64 | Int64 | Int64 | Float64? | Float64 | Float64 | Int64? | Int64 | Float64? | Float64? | Float64? |
| 1 | 1 | 0.0 | 66.7 | 50 | 1 | 100.0 | 0.952857 | 0.96443 | 1 | 1 | missing | missing | missing |
| 2 | 1 | 0.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 100.0 | missing |
| 3 | 1 | 24.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 49.0 | 9.2 |
| 4 | 1 | 36.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 32.0 | 8.5 |
| 5 | 1 | 48.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 26.0 | 6.4 |
| 6 | 1 | 72.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 22.0 | 4.8 |
| 7 | 1 | 96.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 28.0 | 3.1 |
| 8 | 1 | 120.0 | 66.7 | 50 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 33.0 | 2.5 |
| 9 | 2 | 0.0 | 66.7 | 31 | 1 | 100.0 | 0.952857 | 0.96443 | 1 | 1 | missing | missing | missing |
| 10 | 2 | 0.0 | 66.7 | 31 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | 100.0 | missing |
| 11 | 2 | 0.5 | 66.7 | 31 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | missing | 0.0 |
| 12 | 2 | 2.0 | 66.7 | 31 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | missing | 8.4 |
| 13 | 2 | 3.0 | 66.7 | 31 | 1 | missing | 0.952857 | 0.96443 | missing | 0 | missing | missing | 9.7 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 306 | 31 | 48.0 | 83.3 | 24 | 1 | missing | 1.19 | 1.13936 | missing | 0 | missing | 24.0 | 6.4 |
| 307 | 31 | 72.0 | 83.3 | 24 | 1 | missing | 1.19 | 1.13936 | missing | 0 | missing | 22.0 | 4.5 |
| 308 | 31 | 96.0 | 83.3 | 24 | 1 | missing | 1.19 | 1.13936 | missing | 0 | missing | 28.0 | 3.4 |
| 309 | 31 | 120.0 | 83.3 | 24 | 1 | missing | 1.19 | 1.13936 | missing | 0 | missing | 42.0 | 2.5 |
| 310 | 32 | 0.0 | 62.0 | 21 | 1 | 93.0 | 0.885714 | 0.912999 | 1 | 1 | missing | missing | missing |
| 311 | 32 | 0.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 100.0 | missing |
| 312 | 32 | 24.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 36.0 | 8.9 |
| 313 | 32 | 36.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 27.0 | 7.7 |
| 314 | 32 | 48.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 24.0 | 6.9 |
| 315 | 32 | 72.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 23.0 | 4.4 |
| 316 | 32 | 96.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 20.0 | 3.5 |
| 317 | 32 | 120.0 | 62.0 | 21 | 1 | missing | 0.885714 | 0.912999 | missing | 0 | missing | 22.0 | 2.5 |
The data wrangling steps above address several key requirements for Pumas data format:
Multiple Dependent Variables
Original data was in long format with a single DV column and DVID identifier
Used unstack to create wide format with separate columns (:conc, :pca) for each DV type
using Pumas

pop = read_pumas(
    warfarin_data_wide;
    id = :ID,
    time = :TIME,
    amt = :AMOUNT,
    cmt = :CMT,
    evid = :EVID,
    covariates = [:SEX, :WEIGHT, :FSZV, :FSZCL],
    observations = [:conc, :pca],
)
The function call above constructs a Population object by interpreting each row in warfarin_data_wide according to the specified column mappings.
10 Viewing the Dataset After read_pumas
Once read_pumas completes successfully, it returns a Population object, which is a container for one or more Subject objects. Several methods are available for inspection:
Number of Subjects
length(pop) # Returns the number of subjects
Accessing Individual Subjects
first(pop) # Returns the first Subject
pop[2]     # Returns the second Subject, if it exists
Subject Fields Each Subject contains fields like .events (dose records), .observations (DV measurements), and .covariates. For example:
subj = pop[1]
subj.events       # dosing events (dose records)
subj.observations # observed values of the dependent variables
subj.covariates   # covariate values
Printing the Population
pop
Displays a brief summary, including the number of subjects, their IDs, covariates, and observations.
One can also inspect the population in a more detailed way by converting it to a DataFrame.
using DataFrames

DataFrame(pop)
11 Putting Everything Together: An Example Workflow