Module 2: Data Wrangling and Visualization in Julia
1 Module Introduction and Objectives
The target audience is pharmacometricians with experience in the R programming and statistical language who are familiar with dataset preparation and exploratory data analysis in pharmacometrics. The Module makes reference to similarities and differences relative to R and builds on the concepts described in Module 1: Introduction to Julia.
The concepts and Julia packages showcased in this Module are:
- Reading and writing data: CSV.jl and ReadStatTables.jl
- Piping practices: Chain.jl
- Manipulating data frames: DataFrames.jl and DataFramesMeta.jl
- Handling categorical variables: CategoricalArrays.jl
- Handling date/time variables: Dates.jl
- Plotting and visualization: CairoMakie.jl and AlgebraOfGraphics.jl
All output DataFrames have been converted for presentation purposes using SummaryTables.jl.
1.1 Objectives
The objectives of Module 2: Data Wrangling and Visualization in Julia are to:
- Introduce a pharmacometrics analysis dataset (consisting of pharmacokinetic [PK] and pharmacodynamic [PD] data) to serve as the basis for demonstrating data manipulation, summarization, and visualization in Julia
- Highlight important Julia packages required for performing data wrangling and visualization, and demonstrate their implementation through example code
- Provide a brief introduction of the Julia functions (with some reference to R) used to perform data wrangling and visualization activities (particularly in context of exploratory data analysis)
1.2 Example Analysis Dataset
The example pharmacometrics analysis dataset consists of PK and PD data following the administration of warfarin. The dataset was originally published in: O’Reilly (1968). Studies on coumarin anticoagulant drugs. Initiation of warfarin therapy without a loading dose. Circulation 1968, 38:169-177. Included are the plasma warfarin concentrations (PK) and Prothrombin Complex Response (PD) in 30 normal subjects following the administration of a single, oral, loading dose of 1.5 mg/kg warfarin sodium.
The dataset is in a long format whereby individual dosing, PK, and PD observations have unique records (i.e., rows) and are identified by a dependent variable identification number (DVID). The first individual’s records are presented below:
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | . |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | . | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 2 | 33 |
Throughout this Module, the full analysis dataset will be referred to as the DataFrame object examp_df, which will be progressively modified to demonstrate data manipulation techniques in Julia.
2 Specifying Directories
It is generally considered best practice not to hard-code directory paths, particularly when working in a collaborative environment. Julia has some useful functions to assist with identifying and specifying the paths to files and directories:
# Print the current directory
curr_dir = @__DIR__
println("The path to the current directory is:\n", curr_dir)
The path to the current directory is:
/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module
# Print the path to the current file
curr_file = @__FILE__
println("The path to the current file is:\n", curr_file)
The path to the current file is:
/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module/mod2-data-wrangling-visualization.qmd
# Bonus: print the current line number
curr_line = @__LINE__
println("The current line is:\n", curr_line)
The current line is:
187
Note: the strings returned from @__DIR__
and @__FILE__
do not include the terminal slash that may be required when joining paths together.
2.1 Joining and Modifying Paths
Operating systems specify file paths differently (i.e., use of forward slashes “/” versus backslashes “\”), which can cause problems when sharing code. The joinpath function joins path components into a full path as a string. A simple demonstration of the function is shown below, where an empty string is joined with the current directory to return a path including the appropriate terminal slash:
# Create a string that is the current directory joined with the appropriate slash for the operating system
replace_currdir_string = joinpath(curr_dir, "")
"/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module/"
Sometimes it can be useful to obtain only the current filename and not the full file path. Here, basename
can be used to obtain just the filename from the file’s path:
# Print just the current filename
curr_filename = basename(curr_file)
println("The current filename is:\n", curr_filename)
The current filename is:
mod2-data-wrangling-visualization.qmd
3 Reading and Writing Data
There are several Julia packages that handle reading in different file types into the environment that typically contain data for pharmacometrics analysis:
- CSV.jl for .csv extensions
- ReadStatTables.jl for .xpt extensions
- XLSX.jl for .xlsx extensions
This Module will focus on .csv and .xpt formats using CSV.jl and ReadStatTables.jl packages, respectively.
3.1 Comma-Separated Files
This example of reading and writing .csv files in Julia uses CSV.jl.
Reading in a .csv file uses the CSV.read
function which requires:
- The path to the delimited file
- A sink type, for most pharmacometrics applications this will be DataFrame
Other optional keyword arguments can be specified, for example: the string pattern for missing values, the format for Date/Time/DateTime variables (more in Section 8), or the types for variables. Type ?CSV.read
in the REPL to see other options that can be specified when reading in .csv files.
# Specify the filepath to the .csv file
csv_filepath = joinpath(@__DIR__, "warfarin-pkpd-data.csv")
# Read in the .csv file
csv_df = CSV.read(
    csv_filepath,
    DataFrame;
    missingstring = ".",
    types = Dict(
        :ID => String,
        :DATETIME => DateTime,
        :WEIGHT => Float64,
        :AGE => Float64,
        :SEX => Int64,
        :AMOUNT => Float64,
        :DVID => Int64,
        :DV => Float64,
    ),
    dateformat = "yyyy-mm-ddTH:M:S.s",
)
Writing a DataFrame to .csv uses the CSV.write
function which requires:
- The path where the DataFrame will be written
- The DataFrame object that will be written to .csv
Other optional keyword arguments can be specified, for example: the string pattern for missing values. Type ?CSV.write
in the REPL to see other options that can be specified when writing .csv files.
# Specify the filepath to write the new .csv file
csv_output = joinpath(@__DIR__, "warfarin-pkpd-data-out.csv")
# Write to .csv
CSV.write(csv_output, csv_df; missingstring = ".")
3.2 SAS Transport Files
This example of reading and writing SAS transport files in Julia uses ReadStatTables.jl.
Reading in a .xpt file uses the ReadStatTables.readstat
function, which requires the filepath. This function does not return a DataFrame object type; an additional step is required to convert the object to a usable DataFrame:
# Specify the filepath to the .xpt file
xpt_filepath = joinpath(@__DIR__, "warfarin-pkpd-data.xpt")
# Read in the .xpt file
xpt_df = ReadStatTables.readstat(xpt_filepath)
xpt_df = DataFrame(xpt_df)
Other optional keyword arguments can be specified. Type ?ReadStatTables.readstat
in the REPL to see other options that can be specified when reading in .xpt files.
Writing a DataFrame to .xpt uses the ReadStatTables.writestat
function which requires:
- The path where the DataFrame will be written
- The DataFrame object that will be written to .xpt
Other optional keyword arguments can be specified, for example: the string pattern for missing values. Type ?ReadStatTables.writestat
in the REPL to see other options that can be specified when writing .xpt files.
# Specify the filepath to write the new .xpt file
xpt_output = joinpath(@__DIR__, "warfarin-pkpd-data-out.xpt")
# Write to .xpt
ReadStatTables.writestat(xpt_output, xpt_df)
4 DataFrames
Objects of the DataFrame
type represent a data table as a series of vectors, each corresponding to a column or a variable. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs. The DataFrame (when printed in the REPL) will provide information regarding the dimensions, types for each of the columns, and row numbers:
# Create a simple DataFrame
DataFrame(ID = 1:5, SEX = ["Male", "Female", "Female", "Female", "Female"])
Row | ID | SEX |
---|---|---|
 | Int64 | String |
1 | 1 | Male |
2 | 2 | Female |
3 | 3 | Female |
4 | 4 | Female |
5 | 5 | Female |
4.1 Dimensions
Properties of DataFrame objects (i.e., the dimensions, number of rows, number of columns) can be examined, printed in the REPL, and stored as new objects.
# Print the dimensions of the DataFrame (rows, columns)
dims = size(examp_df)
println("The dimensions of the DataFrame are: ", dims)
The dimensions of the DataFrame are: (504, 8)
# Print the number of rows
total_rows = nrow(examp_df)
println("The DataFrame has ", total_rows, " rows")
The DataFrame has 504 rows
# Print the number of columns
total_columns = ncol(examp_df)
println("The DataFrame has ", total_columns, " columns")
The DataFrame has 8 columns
4.2 Viewing
DataFrame objects can be viewed in the REPL or as part of the application. Here, the DataFrame read in from the .xpt dataset (xpt_df from Section 3.2) is viewed in the application using vscodedisplay, demonstrating the presentation of the stored metadata:
# View the DataFrame in VSCode
vscodedisplay(xpt_df)
This provides an interactive view where the DataFrame can be filtered to examine subsets of the DataFrame.
Alternatively, the DataFrame can be printed in the REPL. By default, only a sample of the rows and columns in the DataFrame are printed to fit the screen. Default printing options can be adjusted by the use of show
.
Note: Only the code, and not the output in the REPL, has been provided for the purpose of this Module.
# Print all rows of the DataFrame in the REPL
show(examp_df, allrows = true)
# Print all columns of the DataFrame in the REPL
show(examp_df, allcols = true)
There are options to print a select number of rows in the DataFrame. For example, the first 6 rows of the DataFrame can be extracted using first
:
# Print the first 6 rows of the DataFrame
first_examp_df = first(examp_df, 6)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 |
Additionally, the last 6 rows of the DataFrame can be extracted using last
:
# Print the last 6 rows of the DataFrame
last_examp_df = last(examp_df, 6)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
32 | 1967-10-18T08:00:00 | 62 | 21 | 1 | missing | 1 | 4.4 |
32 | 1967-10-18T08:00:00 | 62 | 21 | 1 | missing | 2 | 23 |
32 | 1967-10-19T08:00:00 | 62 | 21 | 1 | missing | 1 | 3.5 |
32 | 1967-10-19T08:00:00 | 62 | 21 | 1 | missing | 2 | 20 |
32 | 1967-10-20T08:00:00 | 62 | 21 | 1 | missing | 1 | 2.5 |
32 | 1967-10-20T08:00:00 | 62 | 21 | 1 | missing | 2 | 22 |
Extracting the column names of the DataFrame using names
returns a vector:
# Extract the names of the columns
examp_df_names = names(examp_df)
# Print all of the column names
show(examp_df_names)
["ID", "DATETIME", "WEIGHT", "AGE", "SEX", "AMOUNT", "DVID", "DV"]
4.3 Combining DataFrames
DataFrames with a common dimension (i.e., the number of rows or the number of columns) can be concatenated together. The following demonstrates the case of horizontally combining DataFrames with a common number of rows using hcat
:
# Create a vector with length equal to the number of rows in examp_df
# Sampling random integers from 0 to 1000
new_col_df = DataFrame(new_col = rand(0:1:1000, nrow(examp_df)))
# Add this to the DataFrame as a new column
examp_df_newcol = hcat(examp_df, new_col_df)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV | new_col |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing | 720 |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 | 462 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 | 667 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 | 888 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 | 388 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 | 104 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 | 708 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 | 849 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 | 36 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 | 309 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 3.1 | 7 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 28 | 328 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 2.5 | 580 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 33 | 738 |
Alternatively, vcat
is used to vertically concatenate two DataFrames with common columns:
# Create a DataFrame storing a new row to be added to the DataFrame
new_row_df = DataFrame(
    ID = "new_row",
    DATETIME = "2025-01-25T14:00:00",
    WEIGHT = 70.0,
    AGE = 51,
    SEX = 2,
    AMOUNT = missing,
    DVID = 3,
    DV = missing,
)
# Add this to the DataFrame as a new row
examp_df_newrow = vcat(new_row_df, examp_df)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
new_row | 2025-01-25T14:00:00 | 70 | 51 | 2 | missing | 3 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 33 |
In both cases, these functions take additional keyword arguments to handle situations where the column names are not identical. Type ?hcat
and ?vcat
into the REPL to find more information regarding these options. Note: Additional methods of joining DataFrames that do not have a matching number of rows or columns will be described in Section 6.8.
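For instance, a minimal sketch of handling non-identical columns with vcat's cols keyword (the small DataFrames here are illustrative, not part of the analysis dataset):

```julia
using DataFrames

# Two DataFrames whose columns only partially overlap
df_a = DataFrame(ID = [1, 2], WEIGHT = [66.7, 70.0])
df_b = DataFrame(ID = [3, 4], AGE = [50, 21])

# cols = :union keeps all columns, filling unmatched cells with missing
df_union = vcat(df_a, df_b; cols = :union)

# cols = :intersect keeps only the columns common to both DataFrames
df_intersect = vcat(df_a, df_b; cols = :intersect)
```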
4.4 Replacing Missing Values
Missing values can be replaced with a new value broadly across the entire DataFrame using coalesce
. Note: By default, the function takes single values. The broadcast notation coalesce.
is required to apply the function to a vector or a DataFrame (collection of vectors). See Module 1: Introduction to Julia for additional information regarding broadcasting in Julia.
# Replace all missing values in the DataFrame with "."
nomissing_examp_df = coalesce.(examp_df, ".")
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | . |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | . | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 2 | 33 |
4.5 Extracting a Column
A single column can be extracted from a DataFrame to return a vector of all values from that column. The notation <DataFrame>.<Column>
in Julia is similar to that of <data.frame>$<Column>
in R:
# Extract a single column from the DataFrame
ids = examp_df.ID
# Print just the first 10 values in the vector
first_ids = first(ids, 10)
show(first_ids)
AbstractString[String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1")]
Alternatively, a column can be extracted from a DataFrame using getproperty
. This function takes a DataFrame and the target variable name (using the Symbol notation) and its behavior is consistent with dplyr::pull
in R:
# Use getproperty to extract a single column from the DataFrame
# Useful in context of a pipe sequence where the first argument is the
# result of the previous expression
ids = getproperty(examp_df, :ID)
# Print just the first 10 values in the vector
first_ids = first(ids, 10)
show(first_ids)
AbstractString[String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1")]
Both examples return the same results and object type; however, the choice of one over the other depends on the application.
5 Arrays and Vectors
In Julia, an array is a collection of objects stored in a multi-dimensional grid, and a vector is a one-dimensional array. Vectors can be examined and indexed similarly to R.
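As a brief base Julia illustration (with made-up values), vectors use 1-based indexing like R, support range and logical indexing, and end refers to the last element:

```julia
# A simple vector of concentration-like values (illustrative only)
concs = [9.2, 8.5, 6.4, 4.8, 3.1, 2.5]

concs[1]              # first element (1-based, as in R): 9.2
concs[end-1:end]      # range indexing with `end`: [3.1, 2.5]
concs[concs .> 5.0]   # logical indexing via broadcast comparison: [9.2, 8.5, 6.4]
```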
5.1 Unique
The unique values in a vector can be obtained using unique
:
# Show the unique ID values from the analysis dataset
unique_ids = unique(ids)
show(unique_ids)
AbstractString[String3("1"), String3("2"), String3("3"), String3("4"), String3("5"), String3("6"), String3("7"), String3("8"), String3("9"), String3("11") … String3("23"), String3("24"), String3("25"), String3("26"), String3("27"), String3("28"), String3("29"), String3("30"), String3("31"), String3("32")]
5.2 Indexing
Specific values of a vector can be indexed by providing the location. For example, identifying the fifth value of the unique IDs from the analysis dataset requires:
# Show the 5th unique ID value from the analysis dataset
fifth_id = unique_ids[5]
show(fifth_id)
String3("5")
5.3 Length
The length of a vector can also be obtained:
# Determine the number of unique ID values in the analysis dataset
length_unique_ids = length(unique_ids)
println("There are ", length_unique_ids, " individuals in the analysis dataset")
There are 31 individuals in the analysis dataset
5.4 Missing Values
Missing values can be identified by the use of ismissing
.
# Determine if value is missing
test_val = missing
ismissing(test_val)
true
Alternatively, !ismissing
can be used to identify non-missing values.
# Determine if value is not missing
!ismissing(test_val)
false
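These scalar checks extend to whole vectors; a small base Julia sketch (with made-up values) for counting and removing missing values:

```julia
# A vector containing missing values, similar to the DV column
dvs = [missing, 100.0, 9.2, missing, 49.0]

# Count the missing values
n_missing = count(ismissing, dvs)        # 2

# Collect only the non-missing values
observed = collect(skipmissing(dvs))     # [100.0, 9.2, 49.0]
```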
5.5 Filter
Values in a vector can be filtered using filter
and a function specifying the conditions. A copy of the vector is returned with all values for which the function returned false
removed. Note: the function needs to be specified as an anonymous function:
# Filter for DV values that are greater than 10
high_dvs = filter(x -> !ismissing(x) && x > 10, examp_df.DV)
# Check that the minimum DV in high_dvs is greater than 10
println("The minimum value in the new vector is: ", minimum(high_dvs))
The minimum value in the new vector is: 10.1
6 Wrangling DataFrames
Data wrangling and manipulation are predominantly handled by functions available in base Julia and the following packages:
- Data Wrangling with DataFrames.jl Cheat Sheet
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of the dplyr and data.table functions in R (the latter by providing in-place functions). DataFramesMeta.jl provides macros that mirror DataFrames.jl functions with a more convenient syntax, as well as additional functions to assist with data manipulation and summarization. The majority of examples presented in this Module will primarily use functions from DataFramesMeta.jl.
The differences in syntax between the two packages are demonstrated through the use of transform (a function for modifying or adding columns to a DataFrame, akin to dplyr::mutate in R). Note: in both cases, broadcasting (i.e., string.) is required, as transform and @transform do not natively perform row-wise operations:
General syntax: DataFramesFunction(DataFrame, function)
Where function requires the use of pairs (=>
) and functions/anonymous functions to be applied to the currently available column to generate the new column:
ColumnInDataFrame => (function or anonymous function) => NewColumnInDataFrame
# Add a column where the ID values have "_examp" appended to them
# Where transform is the DataFrames.jl function, examp_df is the input DataFrame,
# and :ID => (x -> string.(x,"_examp")) => :new_ID is the function
transform(examp_df, :ID => (x -> string.(x, "_examp")) => :new_ID)
General syntax: @DataFramesMetaFunction DataFrame Assignment/Mutation
Where the Assignment/Mutation has syntax similar to dplyr::mutate
:
NewColumnInDataFrame = function(ColumnInDataFrame)
# Add a column where the ID values have "_examp" appended to them
# Where @transform is the DataFramesMeta.jl macro, examp_df is the input DataFrame,
# and :new_ID = string.(:ID,"_examp") is an assignment
@transform examp_df :new_ID = string.(:ID, "_examp")
It is important to understand the syntax of functions from both DataFrames.jl and DataFramesMeta.jl. While DataFramesMeta.jl syntax is easier to understand (particularly when transitioning from R to Julia programming), the use of anonymous functions with DataFrames.jl makes it easier, for example, to manipulate multiple columns at once. You do not have to use functions from either package exclusively, and can develop a workflow that takes advantage of each.
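For example, a hedged sketch of manipulating multiple columns at once with DataFrames.jl's broadcast pair syntax (the small DataFrame and the doubling operation are purely illustrative):

```julia
using DataFrames

df = DataFrame(ID = ["1", "2"], WEIGHT = [66.7, 62.0], AGE = [50.0, 21.0])

# Broadcasting the pair syntax applies the same anonymous function to
# several source columns, producing one new column per source column
transform(df, [:WEIGHT, :AGE] .=> (x -> x .* 2) .=> [:WEIGHT_X2, :AGE_X2])
```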
Many functions in either package also have “in-place” forms denoted with a ! suffix. These functions directly mutate the DataFrame object serving as input to the function (as opposed to requiring assignment to a new object to store the results). The majority of examples presented in this Module will limit the use of in-place functions.
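The ! naming convention comes from base Julia, where, for example, sort returns a sorted copy while sort! mutates its argument in place; a minimal illustration:

```julia
v = [3, 1, 2]

# sort returns a new, sorted copy; v itself is unchanged
sorted_copy = sort(v)

# sort! mutates v in place; afterwards v is sorted
sort!(v)
```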
Tidier.jl is a Julia data analysis package inspired by R’s tidyverse. It provides an ecosystem of packages that have translated the functions from R/tidyverse to the Julia language.
Adopting these packages in a regular workflow is not recommended, even though individuals with previous R/tidyverse experience may find these functions and their output familiar. This is because:
- Not all tidyverse packages/functions have been translated, so a workflow based on these packages may still be limited
- Many functions are wrappers around the Julia functions that are described in this Module
- Julia and R are different programming languages; packages in the Tidier.jl ecosystem are designed to present Julia code as R code, which may cause scalability issues for larger, more complicated workflows
There will be long-term benefits to understanding and applying the Julia functions presented in this Module with respect to future learning and implementation of Pumas in pharmacometrics analyses.
6.1 Pipes and Chains
Sometimes it can be convenient to pipe functions together in a sequence. Like R, Julia has a native pipe operator, |>. The native Julia pipe can link single-argument actions by taking the result of one action and passing it as the first argument of the next. When the next function takes multiple arguments, the syntax requires the use of anonymous functions.
Chain.jl offers a more convenient syntax as demonstrated below with the use of @rtransform
(row-wise mutation function similar to dplyr::mutate
in R):
# Example of standard declaration
examp_df_idmod = @rtransform(examp_df, :ID = string(:ID, "_001"))

# Example of using native base Julia pipe
examp_df_idmod = examp_df |> x -> @rtransform(x, :ID = string(:ID, "_001"))

# Example of using @chain
examp_df_idmod = @chain examp_df begin
    @rtransform :ID = string(:ID, "_001")
end
Note: If the result should not be passed to the first argument of the next function in the sequence, then _
can be used to inform where the result should be passed. An example is provided in Section 6.2.
The majority of examples presented in this Module will link sequences of actions on DataFrames with @chain
.
6.2 Mutate/Transform
In DataFramesMeta.jl, @transform and @rtransform perform the same actions as dplyr::mutate in R.
@transform returns the original DataFrame and any newly created columns. It does not natively perform row-wise operations; therefore, broadcasting is required in the syntax to imply row-wise operations, or @rtransform can be used instead.
In the example below, @transform is used to generate a record sequence column, where :RECSEQ1 is a vector of numbers from 1 to the number of rows in examp_df. As nrow requires the same DataFrame input as @transform, the _ syntax is used to denote where the previous result should be passed (note that it can be passed to multiple places). A simpler syntax is demonstrated in the second example (:RECSEQ2), where the eachindex function is used to create a range of values for each index in :ID.
# Modify the analysis dataset to add additional variables
examp_df_dose = @chain examp_df begin
    # Create a record sequence column (example 1 demonstrating use of passing previous result
    # into multiple places of the next function)
    @transform _ :RECSEQ1 = 1:nrow(_)
    # Create a record sequence column (example 2, using eachindex syntax)
    @transform :RECSEQ2 = eachindex(:ID)
end
The next example demonstrates the use of row-wise operations and directly applying mutations to the previous DataFrame result using the in-place form @rtransform!. Here, @rtransform is required instead of @transform as each row needs to be separately evaluated to determine the values of :EVID and :CMT conditional on the value of :DVID.
# Modify the analysis dataset to add additional variables
@chain examp_df_dose begin
    # Add columns to specify dosing information for non-linear mixed effects modeling
    @rtransform! begin
        # Set event ID variable (1 = dosing events, 0 = observations)
        :EVID = :DVID == 0 ? 1 : 0
        # Set compartment variable for dosing events (1 = depot)
        :CMT = :DVID == 0 ? 1 : missing
    end
end
# The result of the first individual's records are shown below:
RECSEQ1 | RECSEQ2 | ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
1 | 1 | 1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | 0 | missing |
2 | 2 | 1 | 1967-09-02T08:00:00 | missing | 0 | missing | 2 | 100 |
3 | 3 | 1 | 1967-09-03T08:00:00 | missing | 0 | missing | 1 | 9.2 |
4 | 4 | 1 | 1967-09-03T08:00:00 | missing | 0 | missing | 2 | 49 |
5 | 5 | 1 | 1967-09-03T20:00:00 | missing | 0 | missing | 1 | 8.5 |
6 | 6 | 1 | 1967-09-03T20:00:00 | missing | 0 | missing | 2 | 32 |
7 | 7 | 1 | 1967-09-04T08:00:00 | missing | 0 | missing | 1 | 6.4 |
8 | 8 | 1 | 1967-09-04T08:00:00 | missing | 0 | missing | 2 | 26 |
9 | 9 | 1 | 1967-09-05T08:00:00 | missing | 0 | missing | 1 | 4.8 |
10 | 10 | 1 | 1967-09-05T08:00:00 | missing | 0 | missing | 2 | 22 |
11 | 11 | 1 | 1967-09-06T08:00:00 | missing | 0 | missing | 1 | 3.1 |
12 | 12 | 1 | 1967-09-06T08:00:00 | missing | 0 | missing | 2 | 28 |
13 | 13 | 1 | 1967-09-07T08:00:00 | missing | 0 | missing | 1 | 2.5 |
14 | 14 | 1 | 1967-09-07T08:00:00 | missing | 0 | missing | 2 | 33 |
6.3 Subset
To return a subset of rows from a DataFrame under a given set of conditions, the @subset and @rsubset functions are used. Note: Like @transform, @subset does not perform row-wise operations, and broadcasting is required when defining the conditions for subsetting. Simpler syntax is available with @rsubset, which natively performs row-wise operations.
The dropmissing
function can be used to exclude rows that are associated with missing values for a given set of variables.
# Subset for only observation records and drop the records where DV is missing
examp_df_obs = @chain examp_df_dose begin
    # Retain only observation records
    @rsubset :EVID == 0
    # Exclude records associated with missing DV values
    dropmissing(:DV)
end
6.3.1 Bonus: Subsetting for 1 Row per Individual
A common case in pharmacometrics is to subset the analysis population to 1 row per individual prior to summarizing baseline demographic information. In Julia, the unique function can be applied to a DataFrame to return only unique rows for a given set of variables. This performs similarly to the !duplicated or dplyr::distinct functions in R:
# Subset 1 row per individual
examp_df_one = unique(examp_df_obs, :ID)
6.4 Sort
DataFrames can be arranged/ordered by a set of variables using the @orderby function. This applies to numerical variables (i.e., Int or Float), categorical variables (ordered by assigned levels, otherwise alphabetically), and DateTime variables:
# Arrange the DataFrame by DATETIME
examp_df_time = @orderby examp_df_one :DATETIME
# The result of the first 10 records are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
12 | 1967-06-30T08:00:00 | missing | 0 | missing | 2 | 85 |
30 | 1967-07-01T08:00:00 | missing | 0 | missing | 2 | 100 |
13 | 1967-07-16T08:00:00 | missing | 0 | missing | 2 | 88 |
15 | 1967-07-23T08:00:00 | missing | 0 | missing | 2 | 100 |
7 | 1967-08-05T08:00:00 | missing | 0 | missing | 2 | 88 |
3 | 1967-08-26T08:00:00 | missing | 0 | missing | 2 | 100 |
20 | 1967-08-28T08:00:00 | missing | 0 | missing | 2 | 100 |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | 2 | 100 |
11 | 1967-09-06T08:00:00 | missing | 0 | missing | 2 | 100 |
24 | 1967-09-08T08:00:00 | missing | 0 | missing | 2 | 88 |
6.5 Select
Columns can be selected or removed from the DataFrame using @select. The function returns a DataFrame with columns in the order in which they were selected; therefore, it can also be used to re-order columns for presentation purposes. Additionally, there are helper functions inherited from DataFrames.jl (Not, All, Cols, and Between) that make the syntax for selecting/deselecting columns simpler (refer to DataFramesMeta - @select and @select! for additional information).
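A brief sketch of how these selectors behave, shown with the underlying DataFrames.jl select function on a small illustrative DataFrame:

```julia
using DataFrames

df = DataFrame(ID = ["1"], WEIGHT = [66.7], AGE = [50.0], SEX = [1], DV = [9.2])

select(df, Between(:WEIGHT, :SEX))  # keeps WEIGHT, AGE, SEX (inclusive range)
select(df, Not(:DV))                # keeps everything except DV
select(df, Cols(:ID, :DV))          # keeps exactly the listed columns
```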
# Retain only the necessary columns (re-arrange the order too)
examp_df_select = @select examp_df_dose begin
    :ID
    :DATETIME
    :AMOUNT
    :EVID
    :CMT
    :DV
    :DVID
    :WEIGHT
    :AGE
    :SEX
end
# The result of the first 10 records are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DV | DVID | WEIGHT | AGE | SEX |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | missing | 0 | 66.7 | 50 | 1 |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | 100 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | 9.2 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | 49 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | 8.5 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | 32 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | 6.4 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | 26 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | 4.8 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | 22 | 2 | 66.7 | 50 | 1 |
# Remove the RECSEQ1 and RECSEQ2 columns
examp_df_remove = @select examp_df_dose Not(:RECSEQ1, :RECSEQ2)
# The first 10 records of the result are shown below:
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV | EVID | CMT |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing | 1 | 1 |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 | 0 | missing |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 | 0 | missing |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 | 0 | missing |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 | 0 | missing |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 | 0 | missing |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 | 0 | missing |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 | 0 | missing |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 | 0 | missing |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 | 0 | missing |
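To illustrate the other selection helpers mentioned above, the following sketch (hypothetical, not part of the original example; the resulting columns depend on the example dataset's column order) uses Between and Cols:

```julia
using DataFramesMeta

# Keep every column from AMOUNT through DV (inclusive), based on column order
examp_df_between = @select examp_df_dose Between(:AMOUNT, :DV)

# Keep the ID column plus any column whose name starts with "DV"
# (Cols accepts a mixture of selectors, including regular expressions)
examp_df_cols = @select examp_df_dose Cols(:ID, r"^DV")
```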
6.6 Pivot to Long/Wide
The example analysis dataset used for this Module is presented in a long format. In this format, each row represents a unique observation - this could be observations at different time-points or even different types of observations. In our example, the measurement values are in the DV (dependent variable) column and their identifier values are in the DVID (dependent variable identifier) column, such that dosing records are in rows where :DVID == 0, PK observations are in rows where :DVID == 1, and PD observations are in rows where :DVID == 2.
This is a typical format for population modeling analysis with NONMEM. However, Pumas (covered in later Modules) expects the different types of dependent variables to be presented in a wide data format. In a wide data format, each measurement variable has its own column and is identified by the column name.
In order to interchange between long and wide data formats, the following functions are required from DataFrames.jl:
The unstack function converts the DataFrame from a long format to a wide format (i.e., it unstacks the measurement variables into their own unique columns).
Note: the combine keyword argument helps handle duplicate values through the use of functions/anonymous functions.
# Convert the DataFrame from long to wide format
examp_df_wide = unstack(
    # Input DataFrame
    examp_df_select,
    # colkey (i.e., the identifier variable whose values become the new columns)
    :DVID,
    # value (i.e., the variable that stores the measurement values)
    :DV;
    # Option to rename the new columns created for the measurement variables
    # Here an anonymous function is used to create a new column name based on
    # the identifier variable (DVID = x)
    renamecols = x -> Symbol(:DV_, x),
    # Specify how to handle duplicate records
    # Here an anonymous function is used to specify only taking the first
    # value for 2 observations of the same type at the same time
    combine = x -> first(x),
)
# The first 10 rows of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DV_0 | DV_1 | DV_2 |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | missing | missing | missing |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | missing | missing | 100 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | missing | 9.2 | 49 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | missing | 8.5 | 32 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | missing | 6.4 | 26 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | missing | 4.8 | 22 |
1 | 1967-09-06T08:00:00 | missing | 0 | missing | missing | 3.1 | 28 |
1 | 1967-09-07T08:00:00 | missing | 0 | missing | missing | 2.5 | 33 |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | missing | missing | missing |
2 | 1967-12-08T08:00:00 | missing | 0 | missing | missing | missing | 100 |
The stack function converts the DataFrame from a wide format to a long format (i.e., it stacks the measurement variables into a single column and uses an identifier variable to denote the type of measurement variable).
Note: in this example, converting the wide format DataFrame from unstack back to a long format does not return the original DataFrame. This is because stack retains missing values by default, such that all of the missing :DV_0 values have been stacked as measurement values.
# Convert the DataFrame from wide to long format
examp_df_long = @chain examp_df_wide begin
    stack(
        # Input DataFrame
        _,
        # Measurement variables to be stacked
        [:DV_0, :DV_1, :DV_2],
        # Supply a new column name to be the identifier variable
        variable_name = :DVID,
        # Supply a new column name to store the values of the
        # measurement variables
        value_name = :DV,
    )
end
# The first 10 rows of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | DV_0 | missing |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-06T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-07T08:00:00 | missing | 0 | missing | DV_0 | missing |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | DV_0 | missing |
2 | 1967-12-08T08:00:00 | missing | 0 | missing | DV_0 | missing |
6.7 Rename
Columns can be renamed using @rename with the syntax :new_column_name = :old_column_name:
# Rename columns
@rename examp_df_wide begin
    :id = :ID
    :datetime = :DATETIME
    :amt = :AMOUNT
    :evid = :EVID
    :cmt = :CMT
end
6.8 Join
Joining functions from DataFrames.jl do not have DataFramesMeta.jl counterparts. Available joins of 2 DataFrames include:
- leftjoin: returns all rows that were in the first DataFrame
- rightjoin: returns all rows that were in the second DataFrame
- innerjoin: returns rows with keys that matched in all passed DataFrames
- outerjoin: returns rows with keys that appeared in any of the passed DataFrames
- semijoin: returns the subset of rows in the first DataFrame that did match with the keys in the second DataFrame
- antijoin: returns the subset of rows in the first DataFrame that did not match with the keys in the second DataFrame
- crossjoin: returns the Cartesian product of rows from all passed DataFrames, where the first passed DataFrame is assigned to the dimension that changes the slowest and the last DataFrame is assigned to the dimension that changes the fastest
An example of leftjoin is provided below where a DataFrame containing descriptive labels for SEX is joined with our example analysis DataFrame:
# Create descriptive labels for SEX covariate
cov_labels = DataFrame(SEX = [0, 1], SEXl = ["F", "M"])
# Join the covariate labels with the example analysis dataset
# Use the numeric variable for SEX to join (common variable between
# the datasets)
examp_df_join = leftjoin(examp_df_wide, cov_labels, on = :SEX)
# The first 10 individuals of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | SEX | SEXl |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | 1 | M |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | 1 | M |
3 | 1967-08-26T08:00:00 | 120 | 1 | 1 | 1 | M |
4 | 1967-11-23T08:00:00 | 60 | 1 | 1 | 0 | F |
5 | 1967-12-10T08:00:00 | 113 | 1 | 1 | 1 | M |
6 | 1967-11-19T08:00:00 | 90 | 1 | 1 | 0 | F |
7 | 1967-08-05T08:00:00 | 135 | 1 | 1 | 1 | M |
8 | 1967-12-23T08:00:00 | 75 | 1 | 1 | 0 | F |
9 | 1967-10-18T08:00:00 | 105 | 1 | 1 | 1 | M |
11 | 1967-09-06T08:00:00 | 123 | 1 | 1 | 1 | M |
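To complement the leftjoin example, the sketch below (hypothetical data; only cov_labels is taken from above) uses antijoin to flag rows whose key has no matching label:

```julia
using DataFrames

# Labels only cover SEX values 0 and 1 (as in the example above)
cov_labels = DataFrame(SEX = [0, 1], SEXl = ["F", "M"])

# A hypothetical DataFrame containing an un-labelled SEX value of 2
demo = DataFrame(ID = [1, 2, 3], SEX = [1, 0, 2])

# antijoin keeps only the rows of demo with no key match in cov_labels
unmatched = antijoin(demo, cov_labels, on = :SEX)
# unmatched contains a single row: ID = 3, SEX = 2
```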
6.9 Summarize
Commonly in pharmacometrics, statistical summaries (parametric and non-parametric) are conducted on the demographics of the analysis population by treatment group or other key variables. There are several approaches to generating summaries on a DataFrame with DataFramesMeta.jl. In all examples, the DataFrame returned consists of all grouping variables and all summary variables calculated as part of the summarization.
Many Julia functions do not automatically propagate missing values. The skipmissing function can be wrapped around the target variables to ignore missing values when performing calculations in order to prevent error messages.
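A minimal sketch of this behaviour, using only the Statistics standard library:

```julia
using Statistics

weights = [66.7, missing, 58.0]

# Without skipmissing, the missing value propagates through the calculation
mean(weights)               # missing

# With skipmissing, the missing value is ignored
mean(skipmissing(weights))  # ≈ 62.35
```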
Using a combination of @groupby (splits the DataFrame by each group for the given variable combinations) and @combine (performs the transformations and combines the GroupedDataFrames back together) is similar to dplyr::group_by and dplyr::summarize in R.
Note that, in this example, a variable generated in @combine cannot be used to generate another variable that depends on its values.
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary = @chain examp_df_join begin
    unique(:ID)
    @groupby :SEXl
    @combine begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 50 | 40 | 60 |
The use of @by combines the @groupby and @combine substeps into a single step.
However, it should be noted that, in this example too, a variable generated in @by cannot be used to generate another variable that depends on its values.
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary_atby = @chain examp_df_join begin
    unique(:ID)
    @by :SEXl begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 50 | 40 | 60 |
To generate variables that are dependent on others within the same block, @astable needs to be specified. For example, confidence intervals require the prior calculation of the mean, standard deviation, and degrees of freedom [for a T-distribution]:
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary_astable = @chain examp_df_join begin
    unique(:ID)
    @by :SEXl @astable begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Calculating 90% confidence intervals using a T-distribution and the
        # previously calculated mean, degrees of freedom, and standard deviation
        :lo90_ci = :mean_value + quantile(TDist(:nid), 0.05) * :sd_value / sqrt(:nid)
        :hi90_ci = :mean_value + quantile(TDist(:nid), 0.95) * :sd_value / sqrt(:nid)
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | lo90_ci | hi90_ci | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 70.3 | 77.2 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 44.4 | 58.3 | 50 | 40 | 60 |
7 Categorical Variables
Handling categorical variables (i.e., assigning levels, creating bins from continuous data, etc.) or “factors” (in R terms) is predominantly handled by functions available in the CategoricalArrays.jl package in Julia.
Section 6.8 demonstrated how to assign descriptive labels for a numerical variable like :SEX. However, converting these variables to a Categorical type can be useful for assigning an order to the categories, adding labels for categories that may not be available in the current analysis dataset, and ease of re-labeling categories.
The example below demonstrates how :SEX can be converted to an ordered categorical variable by the use of the categorical function. To add descriptive labels, the recode function is used to define the labels for each value in :SEX (using pairs notation with =>). Labels can also be assigned to values that are not present. For example, a label can be assigned to missing despite there being no missing values for :SEX in the DataFrame. This is extremely useful when summarizing the demographics of an analysis population and noting the level of missing information, as the category is present when the levels of the variable are returned.
Note: The functions below require the use of @transform as these are not row-wise operations but are applied to the whole variable.
# Assign new labels for :SEX
sexcat_examp_df = @chain examp_df_join begin
    @transform @astable begin
        # Make SEX an ordered categorical variable
        :SEX = categorical(:SEX; ordered = true)
        # Assign new labels for each of the categories
        :SEX = recode(:SEX, 0 => "Female", 1 => "Male", missing => "Missing")
    end
end
# Return the levels of :SEX
levels(sexcat_examp_df.SEX)
3-element Vector{String}:
"Female"
"Male"
"Missing"
Categories can be re-ordered by passing a vector specifying the new order of categories to the levels keyword argument of categorical:
# Reorder the categories of :SEX
@transform! sexcat_examp_df :SEX =
categorical(:SEX; ordered = true, levels = ["Male", "Female", "Missing"])
# Return the levels of :SEX
levels(sexcat_examp_df.SEX)
3-element Vector{String}:
"Male"
"Female"
"Missing"
Categories can also be converted back to an integer type using levelcode. The numbers are based on the category’s index in the ordered categorical variable:
# Generate a numerical value to SEX categories
@rtransform! sexcat_examp_df :SEXn = levelcode(:SEX)
# Return the levels of :SEXn
levels(sexcat_examp_df.SEXn)
2-element Vector{Int64}:
1
2
7.1 Binning Continuous Variables
Continuous variables can be binned into multiple groups using the cut function to obtain a Categorical type for that variable. By default, the cut function returns category labels that include the quantile number and the range of values in that quantile:
# Bin age into 2 categories
agebin_examp_df = @chain sexcat_examp_df begin
    # Retain only the first row for each individual so that bins are assigned
    # correctly as subject-level information
    unique(:ID)
    # Cut the AGE variable into 2 categories
    @transform :AGEBIN = cut(:AGE, 2)
    # Retain only the necessary columns
    @select :ID :AGEBIN
    # Merge back with the original DataFrame with time-dependent information
    leftjoin(sexcat_examp_df, _, on = :ID)
end
# Return the levels of :AGEBIN
levels(agebin_examp_df.AGEBIN)
2-element Vector{String}:
"Q1: [21.0, 27.0)"
"Q2: [27.0, 63.0]"
Note: The cut function does have a keyword argument, labels, where descriptive labels for the bins can be provided. However, labels containing the cut-points must be assigned after the generation of the bins using recode:
# Re-label the categories for :AGEBIN
recode!(
    agebin_examp_df.AGEBIN,
    "Q1: [21.0, 27.0)" => "Age < 27 Years",
    "Q2: [27.0, 63.0]" => "Age ≥ 27 Years",
)
# Return the levels of :AGEBIN
levels(agebin_examp_df.AGEBIN)
2-element Vector{String}:
"Age < 27 Years"
"Age ≥ 27 Years"
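As a sketch of that alternative (the ages and labels below are illustrative, not from the original example), descriptive labels can be supplied up front through the labels keyword argument of cut:

```julia
using CategoricalArrays

ages = [21, 25, 30, 45, 63]

# Supply descriptive bin labels directly when cutting
# (these labels cannot reference the computed cut-points, which is why the
# recode approach above is needed when cut-points belong in the label)
agebin = cut(ages, 2; labels = ["Younger", "Older"])

levels(agebin)  # ["Younger", "Older"]
```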
8 Date/Time Variables
The Dates.jl package assists with handling variables that require Date/Time formats. Julia treats dates as a specific type in the base library, allowing for easy handling of Date/Time variables.
In the example dataset for this Module, the :DATETIME column has been consistently represented as a String type. Section 3.1 demonstrated that Date/Time variables can be read into the environment in the correct format through the CSV.read function. However, a variable can also be converted from type String to DateTime using the DateTime function and specifying the Date/Time format of that variable:
# Convert the DATETIME column from String to DateTime format
datetime_examp_df = @chain agebin_examp_df begin
    @rtransform :DATETIME = DateTime(:DATETIME, "yyyy-mm-ddTHH:MM:SS")
end
# Return first 5 values for :DATETIME
first(datetime_examp_df.DATETIME, 5)
5-element Vector{DateTime}:
1967-09-02T08:00:00
1967-09-02T08:00:00
1967-09-03T08:00:00
1967-09-03T20:00:00
1967-09-04T08:00:00
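The same parsing can be sketched on a single value using only the Dates standard library; the dateformat string macro pre-compiles the format, which is more efficient when parsing many values:

```julia
using Dates

# Pre-compile the format once, then reuse it for every value
fmt = dateformat"yyyy-mm-ddTHH:MM:SS"

dt = DateTime("1967-09-02T08:00:00", fmt)

year(dt)  # 1967
hour(dt)  # 8
```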
The Date and Time functions can be used to extract the Date and Time from a DateTime variable:
@rtransform! datetime_examp_df begin
:DATE = Date(:DATETIME)
:TIME = Time(:DATETIME)
end
ID | DATETIME | DATE | TIME | AMOUNT | EVID | CMT |
1 | 1967-09-02T08:00:00 | 1967-09-02 | 08:00:00 | 100 | 1 | 1 |
1 | 1967-09-02T08:00:00 | 1967-09-02 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-03T08:00:00 | 1967-09-03 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-03T20:00:00 | 1967-09-03 | 20:00:00 | missing | 0 | missing |
1 | 1967-09-04T08:00:00 | 1967-09-04 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-05T08:00:00 | 1967-09-05 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-06T08:00:00 | 1967-09-06 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-07T08:00:00 | 1967-09-07 | 08:00:00 | missing | 0 | missing |
Conversely, Date and Time variables can be joined (i.e., added to each other) to create variables of DateTime format:
@rtransform! datetime_examp_df :DATETIME = :DATE + :TIME
Typically, the times of dosing and observation records in a pharmacometrics analysis dataset are not represented in DateTime format but as hours after the first dose or observation for an individual subject. This can be achieved by creating an interim variable for each individual that identifies the time of the first dose, and then subtracting it from all other values of :DATETIME within that individual.
Note: Subtracting 2 DateTime variables returns a value in units of milliseconds. This can be converted to units of hours by dividing by Hour(1):
# Calculate time after dose for each observation
tafd_examp_df = @chain datetime_examp_df begin
    # For each individual, determine time after first dose in hours
    # Group by each ID number (creates a GroupedDataFrame)
    @groupby :ID
    # Modify/transform for each group in the GroupedDataFrame...
    # Note the use of DataFrames.jl transform, not @transform
    transform(_) do group
        @rtransform group @astable begin
            # Determine the DATETIME of the first dose for the individual
            :first_dose = minimum(group.DATETIME[group.EVID.==1])
            # Calculate time after first dose for all records and convert to hours
            :TAFD = (:DATETIME - :first_dose) / Hour(1)
        end
    end
    # Remove the first_dose column (only an intermediate)
    @select Not(:first_dose)
end
ID | DATETIME | TAFD | AMOUNT | EVID | CMT |
1 | 1967-09-02T08:00:00 | 0 | 100 | 1 | 1 |
1 | 1967-09-02T08:00:00 | 0 | missing | 0 | missing |
1 | 1967-09-03T08:00:00 | 24 | missing | 0 | missing |
1 | 1967-09-03T20:00:00 | 36 | missing | 0 | missing |
1 | 1967-09-04T08:00:00 | 48 | missing | 0 | missing |
1 | 1967-09-05T08:00:00 | 72 | missing | 0 | missing |
1 | 1967-09-06T08:00:00 | 96 | missing | 0 | missing |
1 | 1967-09-07T08:00:00 | 120 | missing | 0 | missing |
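The period arithmetic underlying the :TAFD calculation can be checked in isolation with the Dates standard library (values taken from the first individual's records above):

```julia
using Dates

first_dose = DateTime(1967, 9, 2, 8)   # 1967-09-02T08:00:00
obs_time = DateTime(1967, 9, 3, 20)    # 1967-09-03T20:00:00

# Subtracting two DateTime values yields a Millisecond period
elapsed = obs_time - first_dose

# Dividing by Hour(1) converts the period to a plain number of hours
tafd = elapsed / Hour(1)  # 36.0
```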
9 Plotting and Data Visualization
The recommended Julia packages for performing data visualization and generating publication-quality figures are:
- AlgebraOfGraphics.jl provides a set of tools for plotting data in Julia. Its design and functionality are similar to those of ggplot2 in R, whereby it involves the development of layers (data, mapping aesthetics, and geometrics) to build a plot.
- CairoMakie.jl is the underlying plotting system for AlgebraOfGraphics.jl, using a Cairo backend to draw vector graphics to SVG and PNG.
While most plots can be generated by only interacting with AlgebraOfGraphics.jl, it should be emphasized that a good foundational knowledge of CairoMakie.jl will allow additional customization (such as arranging multiple plots).
9.1 The Algebra Of Graphics
The general structure for creating a plot is as follows:
The input DataFrame can be prepared prior to developing the plot. In our example, only the non-missing concentration records are required. It is important to ensure there are no missing values in variables intended to be plotted, as AlgebraOfGraphics functions will throw error messages.
# Generating a DataFrame containing only relevant information
plot_examp_df = @chain tafd_examp_df begin
    # Retain only observation records
    @rsubset :EVID == 0
    # Exclude records with missing PK observations
    dropmissing(:DV_1)
end
The general structure of a layer consists of data, mapping, and visual elements.
Elements of a layer are concatenated using * (the multiplication operator), and multiple layers are superimposed onto each other using + (the addition operator). The implementation is consistent with the order of operations, therefore the order in which layers are sequentially added is important.
Once all layers are combined, the resulting plot is compiled using draw.
# Creating a layer for the plot (observations over time)
p_obs_time_scatter =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    # First: x-axis
    # Second: y-axis
    # [Optional] Third: z-axis (for 3D plots)
    # All other subsequent options: keyword arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Specifying the visuals/geometry
    visual(Scatter)

# Creating a layer for individual lines connecting observations over time
p_obs_time_lines = data(plot_examp_df) * mapping(:TAFD, :DV_1, group = :ID) * visual(Lines)

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter + p_obs_time_lines

# Draw the resulting plot
draw(p_obs_time)
Note: An input DataFrame for plotting is not necessary and vectors can be directly passed to mapping in the absence of data. Additionally, if visual is not supplied as an element for the layer, the default geometry Scatter will be used.
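A minimal sketch of both points (the vectors below are illustrative values, not taken from the example dataset):

```julia
using AlgebraOfGraphics, CairoMakie

# Illustrative vectors passed directly to mapping - no data(...) element required
times = [0, 12, 24, 36, 48]
concs = [10.0, 7.1, 5.0, 3.5, 2.5]

# No visual(...) is supplied, so the default Scatter geometry is used
p_vec = mapping(times, concs)
draw(p_vec)
```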
The composition of the plot can also be simplified, particularly if several layers require common data and mapping elements. The simplification of our example would be:
# Creating a layer for the plot (observations over time)
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Combining the visuals/geometry
    (visual(Scatter) + visual(Lines))

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter_lines

# Draw the resulting plot
draw(p_obs_time)
9.2 Stratification Through Mapping
Several elements of a plot can be modified based on stratification variables. For example, to assign different colors to different values of sex, the appropriate column is passed to the keyword argument color. Other keyword arguments that can be passed are dependent on the visual selected. Common keyword arguments include marker and linestyle.
There are helper functions and syntax that can assist with providing descriptive labels or modifying plotting variables (if not available in the input DataFrame). These use pair syntax (denoted with => to specify the relationships). They take the general form:
:column_name => function() => "New Label"
Where :column_name is the current variable name in the input DataFrame, function() is optional and allows you to pass a function (anonymous or a helper function described below) that modifies the values of the variable, and "New Label" is the new label for :column_name.
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color = p_obs_time_scatter_lines * mapping(color = :SEXl => "Sex");
Helper functions that assist with manipulating the variable prior to plotting include:
sorter, which allows the reordering of values for the variable:
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines * mapping(color = :SEXl => sorter("M", "F") => "Sex");
renamer, which allows renaming of the values for the variable via pairs syntax. This requires specifying the relationship between the old value and the new value, i.e., "old value" => "new value":
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines *
    mapping(color = :SEXl => renamer("F" => "Female", "M" => "Male") => "Sex");
Stratification may fail when non-categorical variables (such as those with types Float or Int) are passed to keyword arguments in mapping. In our example analysis dataset, the SEX variable takes values of 0 or 1. The use of nonnumeric allows these variables to be passed:
# Modifying a previous layer to account for different colors with
# different levels of SEX
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines * mapping(color = :SEX => nonnumeric => "Sex");
9.2.1 Facetting
Plots can be facetted on stratification variables by specifying the keyword arguments row and/or col in mapping, depending on the intended layout. By default, the x- and y-axes are linked between the facets.
# Modifying a previous layer to account for different facets with
# different levels of SEXl
p_obs_time_scatter_lines_facetsex = p_obs_time_scatter_lines * mapping(row = :SEXl);
# Modifying a previous layer to account for different facets with
# different levels of AGEBIN
p_obs_time_scatter_lines_facetage = p_obs_time_scatter_lines * mapping(col = :AGEBIN);
A grid layout can be constructed by passing variables to both the col and row keyword arguments:
# Modifying a previous layer to account for different facets with
# different levels of SEXl and AGEBIN
p_obs_time_scatter_lines_facetgrid =
    p_obs_time_scatter_lines * mapping(col = :SEXl, row = :AGEBIN);
9.3 Visuals (aka Geometrics)
The examples provided heavily demonstrate the use of the Scatter and Lines visuals. However, there are several others available from CairoMakie.jl - Plots. Presented below are some common examples used in pharmacometrics for summarizing the population’s demographics:
# Ensure 1 row per ID before summarizing demographics
oneperid_examp_df = unique(tafd_examp_df, :ID)
Linear regression and LOESS (locally-estimated scatter-plot smoothing) trend lines can be applied over an x-y scatter plot using AlgebraOfGraphics.linear and AlgebraOfGraphics.smooth, respectively. Note: neither linear nor smooth is an argument for visual, and both need to be specified as being from the AlgebraOfGraphics.jl package. As shown below, other visual options can be concatenated to the trend line functions using *.
The confidence interval of the linear regression can be specified by the use of the keyword argument level. Here, 0.95 is the default and corresponds to α = 0.05.
Additional options can be specified as keyword arguments for visual aesthetics associated with lines (CairoMakie.jl - lines).
# Generate a scatter plot of body weight versus age
# Apply a linear regression line
p_weight_age =
    data(oneperid_examp_df) *
    mapping(:AGE, :WEIGHT) *
    (
        visual(Scatter) +
        AlgebraOfGraphics.linear(level = 0.95) * visual(; label = "Linear Regression") +
        AlgebraOfGraphics.smooth() * visual(; color = :red, label = "LOESS")
    )
# Draw the resulting plot
draw(p_weight_age, legend = (; framevisible = false, position = :bottom))
A histogram can be plotted by passing Hist to visual. Additional options can be specified as keyword arguments, including the number of bins, normalization (i.e., density, probability density function, etc.), and visual aesthetics (CairoMakie.jl - hist).
# Generate a histogram of body weight
p_weight_hist =
    data(oneperid_examp_df) *
    mapping(:WEIGHT) *
    visual(Hist; normalization = :pdf, strokecolor = :black, color = :dodgerblue4)
# Draw the resulting plot
draw(p_weight_hist, axis = (; xlabel = "Body Weight (kg)", ylabel = "Probability Density"))
A box-and-whisker plot can be plotted by passing BoxPlot to visual. Additional options can be specified as keyword arguments, including visual aesthetics (CairoMakie.jl - boxplot).
# Generate a box-and-whisker plot of body weight stratified by sex
p_weight_bxplot =
    data(oneperid_examp_df) *
    mapping(:SEXl, :WEIGHT, color = :SEXl) *
    visual(BoxPlot; show_notch = true)
# Draw the resulting plot
draw(p_weight_bxplot, legend = (; show = false))
And now for something completely different…
Generating pairwise plots requires the use of another Julia package called PairPlots.jl. This package uses the Makie plotting library, so there should be familiarity in the context of the other plots presented using AlgebraOfGraphics.jl and CairoMakie.jl.
A pairwise plot can be generated using pairplot. The example code below demonstrates just one method for creating pairwise plots in Julia; it is highly recommended to use help queries where possible, i.e., ?pairplot, and to review the PairPlots.jl Guide.
# Retain only the columns required for plotting
cov_oneperid_examp_df = @select oneperid_examp_df :WEIGHT :AGE

# Construct a figure object
p_covcorr = Figure()
# Construct the pairwise plot into the figure object and specify its position in the grid
pairplot(
    p_covcorr[1, 1],
    # Specify the input DataFrame
    cov_oneperid_examp_df => (
        # Specify elements that should be presented on the off-diagonals
        # Scatter plot with a correlation
        PairPlots.Scatter(marker = '∘', markersize = 24, alpha = 0.5, color = :dodgerblue4),
        PairPlots.Calculation(StatsBase.cor, position = Makie.Point2f(0.2, 0.1)),
        PairPlots.TrendLine(color = :firebrick3),
        # Specify elements that should be presented on the diagonals
        # Histogram with a density distribution
        PairPlots.MarginHist(color = :dodgerblue4, strokewidth = 0.5),
        PairPlots.MarginDensity(color = :black),
    ),
    # Specify if the off-diagonal elements should be presented on both
    # the upper and lower triangles
    fullgrid = false,
    # Re-label variable names to be more descriptive
    labels = Dict(:WEIGHT => "Body Weight (kg)", :AGE => "Age (years)"),
    # Options for labels for axis ticks
    bodyaxis = (; xticklabelrotation = 0, yticklabelrotation = 0),
    diagaxis = (; xticklabelrotation = 0, yticklabelrotation = 0),
)
# Print the plot
p_covcorr
9.4 Figure Customizations
Other figure customizations that globally impact the plot can be modified and are passed to draw through its keyword arguments. These options are typically passed as a NamedTuple.
Figure options predominantly take on the arguments from CairoMakie.jl - Figures. Some example options are provided:
draw(
    p_obs_time,
    figure = (;
        # Modify the figure resolution (in px)
        size = (600, 400),
        # Modify figure padding
        figure_padding = 40,
        # Modify background color
        backgroundcolor = :gray80,
    ),
)
Axis options predominantly take on the arguments from CairoMakie.jl - Creating an Axis. Some example options are provided:
draw(
    p_obs_time,
    axis = (;
        # Add descriptive labels to the plot
        title = "Concentration-Time Profiles",
        xlabel = "Time After First Dose (hours)",
        ylabel = "Concentration (mg/L)",
        # Make the y-axis on a semi-log scale
        yscale = Makie.pseudolog10,
        # Define the placement of axis ticks/labels
        # Specify a range from 0 to 120, every 12 hours
        xticks = [0:12:120;],
        # Specify exact values where ticks/labels should appear
        # Also specify exactly how the labels should appear
        yticks = (
            [0.1, 0.3, 1, 3, 10, 30, 100, 300],
            ["0.1", "0.3", "1", "3", "10", "30", "100", "300"],
        ),
        # Define the limits of the plot
        # Format ((xmin, xmax), (ymin, ymax))
        limits = ((-5, 125), (nothing, nothing)),
    ),
)
A legend appears by default when keyword arguments such as color and marker are specified in mapping. Additionally, a custom legend can be built to describe specific layers by passing the keyword argument label in visual for that layer. In the example below, “Observation” is passed to the label argument for both the Scatter and Lines geometrics. The common label for these geometrics creates a single legend key incorporating both.
By default, a legend is positioned to the right of the plot with a black frame. Legend options predominantly take on the arguments from CairoMakie.jl - Legend. Some example options are provided:
# Creating a layer for the plot (observations over time)
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Combining the visuals/geometry
    (visual(Scatter; label = "Observation") + visual(Lines; label = "Observation"))

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter_lines

# Draw the resulting plot
draw(p_obs_time, legend = (; framevisible = false, position = :bottom))
Note: While not provided in the example, the keyword argument `nbanks` is useful for arranging legends with many elements over multiple rows, and `show = false` is useful to remove the legend completely.
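As a minimal sketch of those two options (assuming the same `p_obs_time` plot object drawn above):

```julia
# Spread a long legend across 2 banks
# (rows when the legend is horizontal, columns when vertical)
draw(p_obs_time, legend = (; position = :bottom, nbanks = 2))

# Remove the legend entirely
draw(p_obs_time, legend = (; show = false))
```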
Some options for facetted plots are specified through the `facet` keyword argument. The x- and y-axes can be linked (where ticks and labels are removed depending on the arrangement of the facets) or not linked (each facet of the plot has its own axis ticks and labels):
```julia
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID, col = :AGEBIN) *
    # Combining the visuals/geometry
    (visual(Scatter) + visual(Lines))

# Draw the resulting plot
draw(p_obs_time_scatter_lines, facet = (;
    # Option to link x and/or y axes
    # Default is :minimal (removes ticks and labels)
    linkxaxes = :minimal,
    linkyaxes = :none,
))
```
All properties defined in `mapping` (such as color, linestyle, marker, layout, etc.) for stratification are passed as scale options using the `scales` function. This is then provided as the second positional argument of `draw`. Further details are specified in Algebra Of Graphics - Scale Options.
For example, to manually set the color palette for the stratification of sex in the plot:
```julia
# Draw the resulting plot
draw(p_obs_time_scatter_lines_color, scales(Color = (; palette = [:blue, :red])))
```
Alternatively, preset color palettes from ColorSchemes.jl can also be applied.
```julia
# Draw the resulting plot
draw(p_obs_time_scatter_lines_color, scales(Color = (; palette = :tab10)))
```
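The same `scales` pattern applies to aesthetics other than color. As a sketch, assuming a hypothetical plot object `p_marker` whose `mapping` includes a `marker` stratifier (e.g., `marker = :SEXC`):

```julia
# Manually set the marker symbols used for each level of the stratifier
draw(p_marker, scales(Marker = (; palette = [:circle, :utriangle])))
```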
9.5 Arranging Multiple Plots
All AlgebraOfGraphics.jl plots can be arranged to create multi-panel figures. The example provided below demonstrates the arrangement of two figures previously generated in this Module: 1) individual concentration-time profiles colored by sex (`p_obs_time_scatter_lines_color`), and 2) box-and-whisker plots depicting the distribution of body weight stratified and colored by sex (`p_weight_bxplot`).
Advanced layouts need to be performed by CairoMakie.jl. The general process involves:
- Creating a `Figure()` object
- Adding elements or layers (or AlgebraOfGraphics.jl plots in this example) to positions in the Figure. Note: a grid layout does not need to be pre-defined in order to arrange the figures and can be specified as each element is added
- Returning and printing the resulting figure
It is important to recognize that axis, legend, and facet options that were specified when generating the AlgebraOfGraphics.jl plot may not be carried over when arranged using CairoMakie.jl. Therefore, CairoMakie.jl functions such as `Axis` and `Legend` will be required to add these layers. Two examples are provided below (note the use of `draw!` here instead of `draw`):
```julia
# Create an empty Figure object
p_sex = Figure()

# Draw the AlgebraOfGraphics.jl scatter plot in the following position:
# Row 1, Column 1
draw!(p_sex[1, 1], p_obs_time_scatter_lines_color)

# Then, draw the AlgebraOfGraphics.jl box-and-whisker plot in the
# following position: Row 1, Column 2
draw!(p_sex[1, 2], p_weight_bxplot)

# Return and print the figure
p_sex
```
```julia
# Create an empty Figure object
p_sex = Figure()

# First, specify axis options for a plot in the following position:
# Row 1, Column 1
ax_sex_scatter = Axis(
    p_sex[1, 1],
    xlabel = "Time After First Dose (hours)",
    ylabel = "Concentration (mg/L)",
)

# Then, draw the AlgebraOfGraphics.jl scatter plot where the Axis
# has been defined
draw!(ax_sex_scatter, p_obs_time_scatter_lines_color)

# Then, specify axis options for a plot in the following position:
# Row 1, Column 2
ax_sex_bxplot = Axis(
    p_sex[1, 2],
    xlabel = "Sex",
    xticks = ([1, 2], ["Female", "Male"]),
    ylabel = "Body Weight (kg)",
)

# Then, draw the AlgebraOfGraphics.jl box-and-whisker plot where
# the Axis has been defined
p_sex_bxplot = draw!(ax_sex_bxplot, p_weight_bxplot)

# Then, add one of the legends from one of the figures to the plot
# (Row 2, Columns 1-2)
legend!(
    p_sex[2, 1:2],
    p_sex_bxplot,
    titleposition = :left,
    orientation = :horizontal,
    position = :bottom,
    framevisible = false,
)

# Return and print the figure
p_sex
```
10 Summary
Julia has several packages for data manipulation and visualization, such as DataFrames.jl, DataFramesMeta.jl, and AlgebraOfGraphics.jl, that offer concepts and functionality similar to `dplyr` and `ggplot2` in R. This Module has aimed to provide a brief introduction to how these packages can be used in the context of pharmacometrics; however, it does not cover their full extent. It is therefore highly recommended to refer to the package documentation and the other PumasAI tutorials linked throughout this Module to explore further functionality.