Module 2: Data Wrangling and Visualization in Julia
1 Module Introduction and Objectives
The target audience is pharmacometricians with experience in the R programming and statistical language who are familiar with dataset preparation and exploratory data analysis in pharmacometrics. The Module makes reference to similarities and differences relative to R and builds on the concepts described in Module 1: Introduction to Julia.
The concepts and Julia packages showcased in this Module are:
- Reading and writing data: CSV.jl and ReadStatTables.jl
- Piping practices: Chain.jl
- Manipulating data frames: DataFrames.jl and DataFramesMeta.jl
- Handling categorical variables: CategoricalArrays.jl
- Handling date/time variables: Dates.jl
- Plotting and visualization: CairoMakie.jl and AlgebraOfGraphics.jl
All output DataFrames have been converted for presentation purposes using SummaryTables.jl.
1.1 Objectives
The objectives of Module 2: Data Wrangling and Visualization in Julia are to:
- Introduce a pharmacometrics analysis dataset (consisting of pharmacokinetic [PK] and pharmacodynamic [PD] data) to serve as the basis for demonstrating data manipulation, summarization, and visualization in Julia
- Highlight important Julia packages required for performing data wrangling and visualization, and demonstrate their implementation through example code
- Provide a brief introduction of the Julia functions (with some reference to R) used to perform data wrangling and visualization activities (particularly in context of exploratory data analysis)
1.2 Example Analysis Dataset
The example pharmacometrics analysis dataset consists of PK and PD data following the administration of warfarin. The dataset was originally published in: O’Reilly (1968). Studies on coumarin anticoagulant drugs. Initiation of warfarin therapy without a loading dose. Circulation 1968, 38:169-177. Included are the plasma warfarin concentrations (PK) and Prothrombin Complex Response (PD) in 30 normal subjects following the administration of a single, oral, loading dose of 1.5 mg/kg warfarin sodium.
The dataset is in a long format whereby individual dosing, PK, and PD observations have unique records (i.e., rows) and are identified by a dependent variable identification number (DVID). The first individual’s records are presented below:
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | . |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | . | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 2 | 33 |
Throughout this Module, the full analysis dataset will be referred to as the DataFrame object examp_df, which will be progressively modified to demonstrate data manipulation techniques in Julia.
2 Specifying Directories
It is generally considered best practice not to hard-code directory paths, particularly when working in a collaborative environment. Julia has some useful functions to assist with identifying and specifying the paths to files and directories:
# Print the current directory
curr_dir = @__DIR__
println("The path to the current directory is:\n", curr_dir)
The path to the current directory is:
/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module
# Print the path to the current file
curr_file = @__FILE__
println("The path to the current file is:\n", curr_file)
The path to the current file is:
/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module/mod2-data-wrangling-visualization.qmd
# Bonus: print the current line number
curr_line = @__LINE__
println("The current line is:\n", curr_line)
The current line is:
187
Note: the strings returned from @__DIR__
and @__FILE__
do not include the terminal slash that may be required when joining paths together.
2.1 Joining and Modifying Paths
Operating systems specify file paths differently (i.e., use of forward slashes “/” versus backslashes “\”), which can cause problems when sharing code. The joinpath function joins path components into a full path as a string. A simple demonstration of the function is shown below, where an empty string is joined with the current directory to return a path including the appropriate terminal slash:
# Create a string that is the current directory joined with the appropriate slash for the operating system
replace_currdir_string = joinpath(curr_dir, "")
"/build/run/_work/PumasTutorials.jl/PumasTutorials.jl/tutorials/LearningPaths/01-LP/02-Module/"
Sometimes it can be useful to obtain only the current filename and not the full file path. Here, basename
can be used to obtain just the filename from the file’s path:
# Print just the current filename
curr_filename = basename(curr_file)
println("The current filename is:\n", curr_filename)
The current filename is:
mod2-data-wrangling-visualization.qmd
3 Reading and Writing Data
There are several Julia packages that handle reading in different file types into the environment that typically contain data for pharmacometrics analysis:
- CSV.jl for .csv extensions
- ReadStatTables.jl for .xpt extensions
- XLSX.jl for .xlsx extensions
This Module will focus on .csv and .xpt formats using CSV.jl and ReadStatTables.jl packages, respectively.
3.1 Comma-Separated Files
This example of reading and writing .csv files in Julia uses CSV.jl.
Reading in a .csv file uses the CSV.read
function which requires:
- The path to the delimited file
- A sink type, for most pharmacometrics applications this will be DataFrame
Other optional keyword arguments can be specified, for example: the string pattern for missing values, the format for Date/Time/DateTime variables (more in Section 8), or the types for variables. Type ?CSV.read
in the REPL to see other options that can be specified when reading in .csv files.
# Specify the filepath to the .csv file
csv_filepath = joinpath(@__DIR__, "warfarin-pkpd-data.csv")
# Read in the .csv file
csv_df = CSV.read(
    csv_filepath,
    DataFrame;
    missingstring = ".",
    types = Dict(
        :ID => String,
        :DATETIME => DateTime,
        :WEIGHT => Float64,
        :AGE => Float64,
        :SEX => Int64,
        :AMOUNT => Float64,
        :DVID => Int64,
        :DV => Float64,
    ),
    dateformat = "yyyy-mm-ddTH:M:S.s",
)
Writing a DataFrame to .csv uses the CSV.write
function which requires:
- The path where the DataFrame will be written
- The DataFrame object that will be written to .csv
Other optional keyword arguments can be specified, for example: the string pattern for missing values. Type ?CSV.write
in the REPL to see other options that can be specified when writing .csv files.
# Specify the filepath to write the new .csv file
csv_output = joinpath(@__DIR__, "warfarin-pkpd-data-out.csv")
# Write to .csv
CSV.write(csv_output, csv_df; missingstring = ".")
3.2 SAS Transport Files
This example of reading and writing SAS transport files in Julia uses ReadStatTables.jl.
Reading in a .xpt file uses the ReadStatTables.readstat
function, which requires the filepath. This function does not return a DataFrame object type; an additional step is required to convert the object to a usable DataFrame:
# Specify the filepath to the .xpt file
xpt_filepath = joinpath(@__DIR__, "warfarin-pkpd-data.xpt")
# Read in the .xpt file
xpt_df = ReadStatTables.readstat(xpt_filepath)
xpt_df = DataFrame(xpt_df)
Other optional keyword arguments can be specified. Type ?ReadStatTables.readstat
in the REPL to see other options that can be specified when reading in .xpt files.
Writing a DataFrame to .xpt uses the ReadStatTables.writestat
function which requires:
- The path where the DataFrame will be written
- The DataFrame object that will be written to .xpt
Other optional keyword arguments can be specified, for example: the string pattern for missing values. Type ?ReadStatTables.writestat
in the REPL to see other options that can be specified when writing .xpt files.
# Specify the filepath to write the new .xpt file
xpt_output = joinpath(@__DIR__, "warfarin-pkpd-data-out.xpt")
# Write to .xpt
ReadStatTables.writestat(xpt_output, xpt_df)
4 DataFrames
Objects of the DataFrame
type represent a data table as a series of vectors, each corresponding to a column or a variable. The simplest way of constructing a DataFrame is to pass column vectors using keyword arguments or pairs. The DataFrame (when printed in the REPL) will provide information regarding the dimensions, types for each of the columns, and row numbers:
# Create a simple DataFrame
DataFrame(ID = 1:5, SEX = ["Male", "Female", "Female", "Female", "Female"])
Row | ID | SEX |
---|---|---|
 | Int64 | String |
1 | 1 | Male |
2 | 2 | Female |
3 | 3 | Female |
4 | 4 | Female |
5 | 5 | Female |
4.1 Dimensions
Properties of DataFrame objects (i.e., the dimensions, number of rows, number of columns) can be examined, printed in the REPL, and stored as new objects.
# Print the dimensions of the DataFrame (rows, columns)
dims = size(examp_df)
println("The dimensions of the DataFrame are: ", dims)
The dimensions of the DataFrame are: (504, 8)
# Print the number of rows
total_rows = nrow(examp_df)
println("The DataFrame has ", total_rows, " rows")
The DataFrame has 504 rows
# Print the number of columns
total_columns = ncol(examp_df)
println("The DataFrame has ", total_columns, " columns")
The DataFrame has 8 columns
4.2 Viewing
DataFrame objects can be viewed in the REPL or as part of the application. Here, the DataFrame read in from the .xpt dataset (xpt_df from Section 3.2) is viewed in the application using vscodedisplay, demonstrating the presentation of the stored metadata:
# View the DataFrame in VSCode
vscodedisplay(xpt_df)
This provides an interactive view where the DataFrame can be filtered to examine subsets of the DataFrame.
Alternatively, the DataFrame can be printed in the REPL. By default, only a sample of the rows and columns in the DataFrame are printed to fit the screen. Default printing options can be adjusted by the use of show
.
Note: Only the code, and not the output in the REPL, has been provided for the purpose of this Module.
# Print all rows of the DataFrame in the REPL
show(examp_df, allrows = true)
# Print all columns of the DataFrame in the REPL
show(examp_df, allcols = true)
There are options to print a select number of rows in the DataFrame. For example, the first 6 rows of the DataFrame can be extracted using first
:
# Print the first 6 rows of the DataFrame
first_examp_df = first(examp_df, 6)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 |
Additionally, the last 6 rows of the DataFrame can be extracted using last
:
# Print the last 6 rows of the DataFrame
last_examp_df = last(examp_df, 6)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
32 | 1967-10-18T08:00:00 | 62 | 21 | 1 | missing | 1 | 4.4 |
32 | 1967-10-18T08:00:00 | 62 | 21 | 1 | missing | 2 | 23 |
32 | 1967-10-19T08:00:00 | 62 | 21 | 1 | missing | 1 | 3.5 |
32 | 1967-10-19T08:00:00 | 62 | 21 | 1 | missing | 2 | 20 |
32 | 1967-10-20T08:00:00 | 62 | 21 | 1 | missing | 1 | 2.5 |
32 | 1967-10-20T08:00:00 | 62 | 21 | 1 | missing | 2 | 22 |
Extracting the column names of the DataFrame using names
returns a vector:
# Extract the names of the columns
examp_df_names = names(examp_df)
# Print all of the column names
show(examp_df_names)
["ID", "DATETIME", "WEIGHT", "AGE", "SEX", "AMOUNT", "DVID", "DV"]
4.3 Combining DataFrames
DataFrames with a common dimension (i.e., the number of rows or the number of columns) can be concatenated together. The following demonstrates the case of horizontally combining DataFrames with a common number of rows using hcat
:
# Create a vector with length equal to the number of rows in examp_df
# Sampling random integers from 0 to 1000
new_col_df = DataFrame(new_col = rand(0:1:1000, nrow(examp_df)))
# Add this to the DataFrame as a new column
examp_df_newcol = hcat(examp_df, new_col_df)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV | new_col |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing | 720 |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 | 462 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 | 667 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 | 888 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 | 388 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 | 104 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 | 708 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 | 849 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 | 36 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 | 309 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 3.1 | 7 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 28 | 328 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 2.5 | 580 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 33 | 738 |
Alternatively, vcat
is used to vertically concatenate two DataFrames with common columns:
# Create a DataFrame storing a new row to be added to the DataFrame
new_row_df = DataFrame(
    ID = "new_row",
    DATETIME = "2025-01-25T14:00:00",
    WEIGHT = 70.0,
    AGE = 51,
    SEX = 2,
    AMOUNT = missing,
    DVID = 3,
    DV = missing,
)
# Add this to the DataFrame as a new row
examp_df_newrow = vcat(new_row_df, examp_df)
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
new_row | 2025-01-25T14:00:00 | 70 | 51 | 2 | missing | 3 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 33 |
In both cases, these functions take additional keyword arguments to handle situations where the column names are not identical. Type ?hcat
and ?vcat
into the REPL to find more information regarding these options. Note: Additional methods of joining DataFrames that do not have a matching number of rows or columns will be described in Section 6.8.
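For instance, a minimal sketch of handling non-identical columns with vcat's cols keyword (the small DataFrames here are illustrative, not part of the analysis dataset):

```julia
using DataFrames

# Two DataFrames whose columns only partially overlap
df_a = DataFrame(ID = [1, 2], WEIGHT = [66.7, 70.0])
df_b = DataFrame(ID = [3, 4], AGE = [50, 21])

# cols = :union keeps all columns, filling unmatched cells with missing
df_union = vcat(df_a, df_b; cols = :union)

# cols = :intersect keeps only the columns common to both DataFrames
df_intersect = vcat(df_a, df_b; cols = :intersect)
```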
4.4 Replacing Missing Values
Missing values can be replaced with a new value broadly across the entire DataFrame using coalesce
. Note: By default, the function takes single values. The broadcast notation coalesce.
is required to apply the function to a vector or a DataFrame (collection of vectors). See Module 1: Introduction to Julia for additional information regarding broadcasting in Julia.
# Replace all missing values in the DataFrame with "."
nomissing_examp_df = coalesce.(examp_df, ".")
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | . |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | . | 2 | 100 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 1 | 9.2 |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | . | 2 | 49 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 1 | 8.5 |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | . | 2 | 32 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 1 | 6.4 |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | . | 2 | 26 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 1 | 4.8 |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | . | 2 | 22 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 1 | 3.1 |
1 | 1967-09-06T08:00:00 | 66.7 | 50 | 1 | . | 2 | 28 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 1 | 2.5 |
1 | 1967-09-07T08:00:00 | 66.7 | 50 | 1 | . | 2 | 33 |
4.5 Extracting a Column
A single column can be extracted from a DataFrame to return a vector of all values from that column. The notation <DataFrame>.<Column>
in Julia is similar to that of <data.frame>$<Column>
in R:
# Extract a single column from the DataFrame
ids = examp_df.ID
# Print just the first 10 values in the vector
first_ids = first(ids, 10)
show(first_ids)
AbstractString[String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1")]
Alternatively, a column can be extracted from a DataFrame using getproperty
. This function takes a DataFrame and the target variable name (using the Symbol notation) and its behavior is consistent with dplyr::pull
in R:
# Use getproperty to extract a single column from the DataFrame
# Useful in context of a pipe sequence where the first argument is the
# result of the previous expression
ids = getproperty(examp_df, :ID)
# Print just the first 10 values in the vector
first_ids = first(ids, 10)
show(first_ids)
AbstractString[String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1"), String3("1")]
Both examples return the same results and object type; however, the choice of one over the other depends on the application.
5 Arrays and Vectors
In Julia, an array is a collection of objects stored in a multi-dimensional grid, and a vector is a one-dimensional array. Vectors can be examined and indexed similarly to R.
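As a brief base Julia illustration (with made-up values), vectors use 1-based indexing like R, support range and logical indexing, and end refers to the last element:

```julia
# A simple vector of concentration-like values (illustrative only)
concs = [9.2, 8.5, 6.4, 4.8, 3.1, 2.5]

concs[1]              # first element (1-based, as in R): 9.2
concs[end-1:end]      # range indexing with `end`: [3.1, 2.5]
concs[concs .> 5.0]   # logical indexing via broadcast comparison: [9.2, 8.5, 6.4]
```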
5.1 Unique
The unique values in a vector can be obtained using unique
:
# Show the unique ID values from the analysis dataset
unique_ids = unique(ids)
show(unique_ids)
AbstractString[String3("1"), String3("2"), String3("3"), String3("4"), String3("5"), String3("6"), String3("7"), String3("8"), String3("9"), String3("11") … String3("23"), String3("24"), String3("25"), String3("26"), String3("27"), String3("28"), String3("29"), String3("30"), String3("31"), String3("32")]
5.2 Indexing
Specific values of a vector can be indexed by providing the location. For example, identifying the fifth value of the unique IDs from the analysis dataset requires:
# Show the 5th unique ID value from the analysis dataset
fifth_id = unique_ids[5]
show(fifth_id)
String3("5")
5.3 Length
The length of a vector can also be obtained:
# Determine the number of unique ID values in the analysis dataset
length_unique_ids = length(unique_ids)
println("There are ", length_unique_ids, " individuals in the analysis dataset")
There are 31 individuals in the analysis dataset
5.4 Missing Values
Missing values can be identified by the use of ismissing
.
# Determine if value is missing
test_val = missing
ismissing(test_val)
true
Alternatively, !ismissing
can be used to identify non-missing values.
# Determine if value is not missing
!ismissing(test_val)
false
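These scalar checks extend to whole vectors; a small base Julia sketch (with made-up values) for counting and removing missing values:

```julia
# A vector containing missing values, similar to the DV column
dvs = [missing, 100.0, 9.2, missing, 49.0]

# Count the missing values
n_missing = count(ismissing, dvs)        # 2

# Collect only the non-missing values
observed = collect(skipmissing(dvs))     # [100.0, 9.2, 49.0]
```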
5.5 Filter
Values in a vector can be filtered using filter
and a function specifying the conditions. A copy of the vector is returned with all values for which the function returned false
removed. Note: the function needs to be specified as an anonymous function:
# Filter for DV values that are greater than 10
high_dvs = filter(x -> !ismissing(x) && x > 10, examp_df.DV)
# Check that the minimum DV in high_dvs is greater than 10
println("The minimum value in the new vector is: ", minimum(high_dvs))
The minimum value in the new vector is: 10.1
6 Wrangling DataFrames
Data wrangling and manipulation are predominantly handled by functions available in base Julia and the following packages:
- Data Wrangling with DataFrames.jl Cheat Sheet
DataFrames.jl provides a set of tools for working with tabular data in Julia. Its design and functionality are similar to those of the dplyr and data.table functions in R (the latter by providing in-place functions). DataFramesMeta.jl provides macros that mirror DataFrames.jl functions with a more convenient syntax, as well as additional functions to assist with data manipulation and summarization. The majority of examples presented in this Module will primarily use functions from DataFramesMeta.jl.
The differences in syntax between the two packages are demonstrated through the use of transform (a function for modifying or adding columns to a DataFrame, akin to dplyr::mutate in R). Note: in both cases, broadcasting (i.e., string.) is required, as transform and @transform do not natively perform row-wise operations:
General syntax: DataFramesFunction(DataFrame, function)
Where function requires the use of pairs (=>
) and functions/anonymous functions to be applied to the currently available column to generate the new column:
ColumnInDataFrame => (function or anonymous function) => NewColumnInDataFrame
# Add a column where the ID values have "_examp" appended to them
# Where transform is the DataFrames.jl function, examp_df is the input DataFrame,
# and :ID => (x -> string.(x,"_examp")) => :new_ID is the function
transform(examp_df, :ID => (x -> string.(x, "_examp")) => :new_ID)
General syntax: @DataFramesMetaFunction DataFrame Assignment/Mutation
Where the Assignment/Mutation has syntax similar to dplyr::mutate
:
NewColumnInDataFrame = function(ColumnInDataFrame)
# Add a column where the ID values have "_examp" appended to them
# Where @transform is the DataFramesMeta.jl macro, examp_df is the input DataFrame,
# and :new_ID = string.(:ID,"_examp") is an assignment
@transform examp_df :new_ID = string.(:ID, "_examp")
It is important to understand the syntax of functions from both DataFrames.jl and DataFramesMeta.jl. While DataFramesMeta.jl syntax is easier to understand (particularly when transitioning from R to Julia programming), the use of anonymous functions with DataFrames.jl makes it easier, for example, to manipulate multiple columns at once. You do not have to use functions from either package exclusively, and can develop a workflow that takes advantage of each.
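For example, a hedged sketch of manipulating multiple columns at once with DataFrames.jl's broadcast pair syntax (the small DataFrame and the doubling operation are purely illustrative):

```julia
using DataFrames

df = DataFrame(ID = ["1", "2"], WEIGHT = [66.7, 62.0], AGE = [50.0, 21.0])

# Broadcasting the pair syntax applies the same anonymous function to
# several source columns, producing one new column per source column
transform(df, [:WEIGHT, :AGE] .=> (x -> x .* 2) .=> [:WEIGHT_X2, :AGE_X2])
```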
Many functions in either package also have “in-place” forms denoted with a ! suffix. These functions directly mutate the DataFrame object serving as input to the function (as opposed to requiring assignment to a new object to store the results). The majority of examples presented in this Module will limit the use of in-place functions.
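The ! naming convention comes from base Julia, where, for example, sort returns a sorted copy while sort! mutates its argument in place; a minimal illustration:

```julia
v = [3, 1, 2]

# sort returns a new, sorted copy; v itself is unchanged
sorted_copy = sort(v)

# sort! mutates v in place; afterwards v is sorted
sort!(v)
```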
Tidier.jl is a Julia data analysis package inspired by R’s tidyverse. It provides an ecosystem of packages that have translated the functions from R/tidyverse to the Julia language.
Adopting these packages in a regular workflow is not recommended, even though individuals with previous R/tidyverse experience may find these functions and their output familiar. This is because:
- Not all tidyverse packages/functions have been translated, so a workflow based on these packages may still be limited
- Many functions are wrappers around the Julia functions that are described in this Module
- Julia and R are different programming languages; packages in the Tidier.jl ecosystem are designed to present Julia code as R code, which may cause scalability issues for larger, more complicated workflows
There will be long-term benefits to understanding and applying the Julia functions presented in this Module with respect to future learning and implementation of Pumas in pharmacometrics analyses.
6.1 Pipes and Chains
Sometimes it can be convenient to pipe functions together in a sequence. Like R, Julia has a native pipe operator, |>. The native Julia pipe can link single-argument actions by taking the result of one action and passing it as the first argument of the next. When the next function takes multiple arguments, the syntax requires the use of anonymous functions.
Chain.jl offers a more convenient syntax as demonstrated below with the use of @rtransform
(row-wise mutation function similar to dplyr::mutate
in R):
# Example of standard declaration
examp_df_idmod = @rtransform(examp_df, :ID = string(:ID, "_001"))

# Example of using native base Julia pipe
examp_df_idmod = examp_df |> x -> @rtransform(x, :ID = string(:ID, "_001"))

# Example of using @chain
examp_df_idmod = @chain examp_df begin
    @rtransform :ID = string(:ID, "_001")
end
Note: If the result should not be passed to the first argument of the next function in the sequence, then _
can be used to inform where the result should be passed. An example is provided in Section 6.2.
The majority of examples presented in this Module will link sequences of actions on DataFrames with @chain
.
6.2 Mutate/Transform
In DataFramesMeta.jl, @transform and @rtransform perform the same actions as dplyr::mutate in R.
@transform returns the original DataFrame and any newly created columns. It does not natively perform row-wise operations; therefore, broadcasting is required in the syntax to imply row-wise operations, or @rtransform can be used instead.
In the example below, @transform is used to generate a record sequence column, where :RECSEQ1 is a vector of numbers from 1 to the number of rows in examp_df. As nrow requires the same DataFrame input as @transform, the _ syntax is used to denote where the previous result should be passed (note that it can be passed to multiple places). A simpler syntax is demonstrated in the second example (:RECSEQ2), where the eachindex function is used to create a range of values for each index in :ID.
# Modify the analysis dataset to add additional variables
examp_df_dose = @chain examp_df begin
    # Create a record sequence column (example 1 demonstrating use of passing previous result
    # into multiple places of the next function)
    @transform _ :RECSEQ1 = 1:nrow(_)
    # Create a record sequence column (example 2, using eachindex syntax)
    @transform :RECSEQ2 = eachindex(:ID)
end
The next example demonstrates the use of row-wise operations and directly applying mutations to the previous DataFrame result using the in-place form @rtransform!. Here, @rtransform is required instead of @transform as each row needs to be separately evaluated to determine the values of :EVID and :CMT conditional on the value of :DVID.
# Modify the analysis dataset to add additional variables
@chain examp_df_dose begin
    # Add columns to specify dosing information for non-linear mixed effects modeling
    @rtransform! begin
        # Set event ID variable (1 = dosing events, 0 = observations)
        :EVID = :DVID == 0 ? 1 : 0
        # Set compartment variable for dosing events (1 = depot)
        :CMT = :DVID == 0 ? 1 : missing
    end
end
# The result of the first individual's records are shown below:
RECSEQ1 | RECSEQ2 | ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
1 | 1 | 1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | 0 | missing |
2 | 2 | 1 | 1967-09-02T08:00:00 | missing | 0 | missing | 2 | 100 |
3 | 3 | 1 | 1967-09-03T08:00:00 | missing | 0 | missing | 1 | 9.2 |
4 | 4 | 1 | 1967-09-03T08:00:00 | missing | 0 | missing | 2 | 49 |
5 | 5 | 1 | 1967-09-03T20:00:00 | missing | 0 | missing | 1 | 8.5 |
6 | 6 | 1 | 1967-09-03T20:00:00 | missing | 0 | missing | 2 | 32 |
7 | 7 | 1 | 1967-09-04T08:00:00 | missing | 0 | missing | 1 | 6.4 |
8 | 8 | 1 | 1967-09-04T08:00:00 | missing | 0 | missing | 2 | 26 |
9 | 9 | 1 | 1967-09-05T08:00:00 | missing | 0 | missing | 1 | 4.8 |
10 | 10 | 1 | 1967-09-05T08:00:00 | missing | 0 | missing | 2 | 22 |
11 | 11 | 1 | 1967-09-06T08:00:00 | missing | 0 | missing | 1 | 3.1 |
12 | 12 | 1 | 1967-09-06T08:00:00 | missing | 0 | missing | 2 | 28 |
13 | 13 | 1 | 1967-09-07T08:00:00 | missing | 0 | missing | 1 | 2.5 |
14 | 14 | 1 | 1967-09-07T08:00:00 | missing | 0 | missing | 2 | 33 |
6.3 Subset
To return a subset of rows from a DataFrame under a given set of conditions, the @subset and @rsubset functions are used. Note: Like @transform, @subset does not perform row-wise operations, and broadcasting is required when defining the conditions for subsetting. Simpler syntax is available with @rsubset, which natively performs row-wise operations.
The dropmissing
function can be used to exclude rows that are associated with missing values for a given set of variables.
# Subset for only observation records and drop the records where DV is missing
examp_df_obs = @chain examp_df_dose begin
    # Retain only observation records
    @rsubset :EVID == 0
    # Exclude records associated with missing DV values
    dropmissing(:DV)
end
6.3.1 Bonus: Subsetting for 1 Row per Individual
A common case in pharmacometrics is to subset the analysis population to 1 row per individual prior to summarizing baseline demographic information. In Julia, the unique function can be applied to a DataFrame to return only unique rows for a given set of variables. This performs similarly to the !duplicated or dplyr::distinct functions in R:
# Subset 1 row per individual
examp_df_one = unique(examp_df_obs, :ID)
6.4 Sort
DataFrames can be arranged/ordered by a set of variables using the @orderby function. This applies to numerical variables (i.e., Int or Float), categorical variables (ordered by assigned levels, otherwise alphabetically), and DateTime variables:
# Arrange the DataFrame by DATETIME
examp_df_time = @orderby examp_df_one :DATETIME
# The result of the first 10 records are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
12 | 1967-06-30T08:00:00 | missing | 0 | missing | 2 | 85 |
30 | 1967-07-01T08:00:00 | missing | 0 | missing | 2 | 100 |
13 | 1967-07-16T08:00:00 | missing | 0 | missing | 2 | 88 |
15 | 1967-07-23T08:00:00 | missing | 0 | missing | 2 | 100 |
7 | 1967-08-05T08:00:00 | missing | 0 | missing | 2 | 88 |
3 | 1967-08-26T08:00:00 | missing | 0 | missing | 2 | 100 |
20 | 1967-08-28T08:00:00 | missing | 0 | missing | 2 | 100 |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | 2 | 100 |
11 | 1967-09-06T08:00:00 | missing | 0 | missing | 2 | 100 |
24 | 1967-09-08T08:00:00 | missing | 0 | missing | 2 | 88 |
6.5 Select
Columns can be selected or removed from the DataFrame using @select. The function returns a DataFrame with columns in the order in which they were selected; therefore, it can also be used to re-order columns for presentation purposes. Additionally, there are helper functions inherited from DataFrames.jl (Not, All, Cols, and Between) that make the syntax for selecting/deselecting columns simpler (refer to DataFramesMeta - @select and @select! for additional information).
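A brief sketch of how these selectors behave, shown with the underlying DataFrames.jl select function on a small illustrative DataFrame:

```julia
using DataFrames

df = DataFrame(ID = ["1"], WEIGHT = [66.7], AGE = [50.0], SEX = [1], DV = [9.2])

select(df, Between(:WEIGHT, :SEX))  # keeps WEIGHT, AGE, SEX (inclusive range)
select(df, Not(:DV))                # keeps everything except DV
select(df, Cols(:ID, :DV))          # keeps exactly the listed columns
```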
# Retain only the necessary columns (re-arrange the order too)
examp_df_select = @select examp_df_dose begin
    :ID
    :DATETIME
    :AMOUNT
    :EVID
    :CMT
    :DV
    :DVID
    :WEIGHT
    :AGE
    :SEX
end
# The result of the first 10 records are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DV | DVID | WEIGHT | AGE | SEX |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | missing | 0 | 66.7 | 50 | 1 |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | 100 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | 9.2 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | 49 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | 8.5 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | 32 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | 6.4 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | 26 | 2 | 66.7 | 50 | 1 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | 4.8 | 1 | 66.7 | 50 | 1 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | 22 | 2 | 66.7 | 50 | 1 |
# Remove the RECSEQ1 and RECSEQ2 columns
examp_df_remove = @select examp_df_dose Not(:RECSEQ1, :RECSEQ2)
# The first 10 records of the result are shown below:
ID | DATETIME | WEIGHT | AGE | SEX | AMOUNT | DVID | DV | EVID | CMT |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | 100 | 0 | missing | 1 | 1 |
1 | 1967-09-02T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 100 | 0 | missing |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 9.2 | 0 | missing |
1 | 1967-09-03T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 49 | 0 | missing |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 1 | 8.5 | 0 | missing |
1 | 1967-09-03T20:00:00 | 66.7 | 50 | 1 | missing | 2 | 32 | 0 | missing |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 6.4 | 0 | missing |
1 | 1967-09-04T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 26 | 0 | missing |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 1 | 4.8 | 0 | missing |
1 | 1967-09-05T08:00:00 | 66.7 | 50 | 1 | missing | 2 | 22 | 0 | missing |
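To illustrate the other selection helpers mentioned above, the following sketch (hypothetical, not part of the original example; the resulting columns depend on the example dataset's column order) uses Between and Cols:

```julia
using DataFramesMeta

# Keep every column from AMOUNT through DV (inclusive), based on column order
examp_df_between = @select examp_df_dose Between(:AMOUNT, :DV)

# Keep the ID column plus any column whose name starts with "DV"
# (Cols accepts a mixture of selectors, including regular expressions)
examp_df_cols = @select examp_df_dose Cols(:ID, r"^DV")
```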
6.6 Pivot to Long/Wide
The example analysis dataset used for this Module is presented in a long format. In this format, each row represents a unique observation - this could be observations at different time-points or even different types of observations. In our example, the measurement values are in the DV (dependent variable) column and their identifier values are in the DVID (dependent variable identifier) column, such that dosing records are in rows where :DVID == 0, PK observations are in rows where :DVID == 1, and PD observations are in rows where :DVID == 2.
This is a typical format for population modeling analysis with NONMEM. However, Pumas (covered in later Modules) expects the different types of dependent variables to be presented in a wide data format. In a wide data format, each measurement variable has its own column and is identified by the column name.
In order to interchange between long and wide data formats, the following functions are required from DataFrames.jl:
The unstack function converts the DataFrame from a long format to a wide format (i.e., it unstacks the measurement variables into their own unique columns).
Note: the combine keyword argument helps handle duplicate values through the use of functions/anonymous functions.
# Convert the DataFrame from long to wide format
examp_df_wide = unstack(
    # Input DataFrame
    examp_df_select,
    # colkey (i.e., the identifier variable whose values become the new columns)
    :DVID,
    # value (i.e., the variable that stores the measurement values)
    :DV;
    # Option to rename the new columns created for the measurement variables
    # Here an anonymous function is used to create a new column name based on
    # the identifier variable (DVID = x)
    renamecols = x -> Symbol(:DV_, x),
    # Specify how to handle duplicate records
    # Here an anonymous function is used to specify only taking the first
    # value for 2 observations of the same type at the same time
    combine = x -> first(x),
)
# The first 10 rows of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DV_0 | DV_1 | DV_2 |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | missing | missing | missing |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | missing | missing | 100 |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | missing | 9.2 | 49 |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | missing | 8.5 | 32 |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | missing | 6.4 | 26 |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | missing | 4.8 | 22 |
1 | 1967-09-06T08:00:00 | missing | 0 | missing | missing | 3.1 | 28 |
1 | 1967-09-07T08:00:00 | missing | 0 | missing | missing | 2.5 | 33 |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | missing | missing | missing |
2 | 1967-12-08T08:00:00 | missing | 0 | missing | missing | missing | 100 |
The stack function converts the DataFrame from a wide format to a long format (i.e., it stacks the measurement variables into a single column and uses an identifier variable to denote the type of measurement variable).
Note: in this example, converting the wide format DataFrame from unstack back to a long format does not return the original DataFrame. This is because stack retains missing values by default, such that all of the missing :DV_0 values have been stacked as measurement values.
# Convert the DataFrame from wide to long format
examp_df_long = @chain examp_df_wide begin
    stack(
        # Input DataFrame
        _,
        # Measurement variables to be stacked
        [:DV_0, :DV_1, :DV_2],
        # Supply a new column name to be the identifier variable
        variable_name = :DVID,
        # Supply a new column name to store the values of the
        # measurement variables
        value_name = :DV,
    )
end
# The first 10 rows of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | DVID | DV |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | DV_0 | missing |
1 | 1967-09-02T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-03T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-03T20:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-04T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-05T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-06T08:00:00 | missing | 0 | missing | DV_0 | missing |
1 | 1967-09-07T08:00:00 | missing | 0 | missing | DV_0 | missing |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | DV_0 | missing |
2 | 1967-12-08T08:00:00 | missing | 0 | missing | DV_0 | missing |
6.7 Rename
Columns can be renamed using @rename with the syntax :new_column_name = :old_column_name:
# Rename columns
@rename examp_df_wide begin
    :id = :ID
    :datetime = :DATETIME
    :amt = :AMOUNT
    :evid = :EVID
    :cmt = :CMT
end
6.8 Join
Joining functions from DataFrames.jl do not have DataFramesMeta.jl counterparts. Available joins of 2 DataFrames include:
- leftjoin: returns all rows that were in the first DataFrame
- rightjoin: returns all rows that were in the second DataFrame
- innerjoin: returns rows with keys that matched in all passed DataFrames
- outerjoin: returns rows with keys that appeared in any of the passed DataFrames
- semijoin: returns the subset of rows in the first DataFrame that did match with the keys in the second DataFrame
- antijoin: returns the subset of rows in the first DataFrame that did not match with the keys in the second DataFrame
- crossjoin: returns the Cartesian product of rows from all passed DataFrames, where the first passed DataFrame is assigned to the dimension that changes the slowest and the last DataFrame is assigned to the dimension that changes the fastest
An example of leftjoin is provided below where a DataFrame containing descriptive labels for SEX is joined with our example analysis DataFrame:
# Create descriptive labels for SEX covariate
cov_labels = DataFrame(SEX = [0, 1], SEXl = ["F", "M"])
# Join the covariate labels with the example analysis dataset
# Use the numeric variable for SEX to join (common variable between
# the datasets)
examp_df_join = leftjoin(examp_df_wide, cov_labels, on = :SEX)
# The first 10 individuals of the result are shown below:
ID | DATETIME | AMOUNT | EVID | CMT | SEX | SEXl |
1 | 1967-09-02T08:00:00 | 100 | 1 | 1 | 1 | M |
2 | 1967-12-08T08:00:00 | 100 | 1 | 1 | 1 | M |
3 | 1967-08-26T08:00:00 | 120 | 1 | 1 | 1 | M |
4 | 1967-11-23T08:00:00 | 60 | 1 | 1 | 0 | F |
5 | 1967-12-10T08:00:00 | 113 | 1 | 1 | 1 | M |
6 | 1967-11-19T08:00:00 | 90 | 1 | 1 | 0 | F |
7 | 1967-08-05T08:00:00 | 135 | 1 | 1 | 1 | M |
8 | 1967-12-23T08:00:00 | 75 | 1 | 1 | 0 | F |
9 | 1967-10-18T08:00:00 | 105 | 1 | 1 | 1 | M |
11 | 1967-09-06T08:00:00 | 123 | 1 | 1 | 1 | M |
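To complement the leftjoin example, the sketch below (hypothetical data; only cov_labels is taken from above) uses antijoin to flag rows whose key has no matching label:

```julia
using DataFrames

# Labels only cover SEX values 0 and 1 (as in the example above)
cov_labels = DataFrame(SEX = [0, 1], SEXl = ["F", "M"])

# A hypothetical DataFrame containing an un-labelled SEX value of 2
demo = DataFrame(ID = [1, 2, 3], SEX = [1, 0, 2])

# antijoin keeps only the rows of demo with no key match in cov_labels
unmatched = antijoin(demo, cov_labels, on = :SEX)
# unmatched contains a single row: ID = 3, SEX = 2
```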
6.9 Summarize
Commonly in pharmacometrics, statistical summaries (parametric and non-parametric) are conducted on the demographics of the analysis population by treatment group or other key variables. There are several approaches to generating summaries on a DataFrame with DataFramesMeta.jl. In all examples, the DataFrame returned consists of all grouping variables and all summary variables calculated as part of the summarization.
Many Julia functions do not automatically propagate missing values. The skipmissing function can be wrapped around the target variables to ignore missing values when performing calculations in order to prevent error messages.
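A minimal sketch of this behaviour, using only the Statistics standard library:

```julia
using Statistics

weights = [66.7, missing, 58.0]

# Without skipmissing, the missing value propagates through the calculation
mean(weights)               # missing

# With skipmissing, the missing value is ignored
mean(skipmissing(weights))  # ≈ 62.35
```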
Using a combination of @groupby (splits the DataFrame by each group for the given variable combinations) and @combine (performs the transformations and combines the GroupedDataFrames back together) is similar to dplyr::group_by and dplyr::summarize in R.
Note that, in this example, a variable generated in @combine cannot be used to generate another variable that depends on its values.
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary = @chain examp_df_join begin
    unique(:ID)
    @groupby :SEXl
    @combine begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 50 | 40 | 60 |
The use of @by combines the @groupby and @combine substeps into a single step.
However, it should be noted that, in this example too, a variable generated in @by cannot be used to generate another variable that depends on its values.
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary_atby = @chain examp_df_join begin
    unique(:ID)
    @by :SEXl begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 50 | 40 | 60 |
To generate variables that are dependent on others within the same block, @astable needs to be specified. For example, confidence intervals require the prior calculation of the mean, standard deviation, and degrees of freedom [for a T-distribution]:
# Summarise WEIGHT for each of the categories in SEX
examp_df_summary_astable = @chain examp_df_join begin
    unique(:ID)
    @by :SEXl @astable begin
        # Number of individuals
        :nid = length(:ID)
        # Mean
        :mean_value = mean(skipmissing(:WEIGHT))
        # Standard deviation
        :sd_value = std(skipmissing(:WEIGHT))
        # Calculating 90% confidence intervals using a T-distribution and the
        # previously calculated mean, degrees of freedom, and standard deviation
        :lo90_ci = :mean_value + quantile(TDist(:nid), 0.05) * :sd_value / sqrt(:nid)
        :hi90_ci = :mean_value + quantile(TDist(:nid), 0.95) * :sd_value / sqrt(:nid)
        # Median
        :median_value = median(skipmissing(:WEIGHT))
        # Minimum
        :min_value = minimum(skipmissing(:WEIGHT))
        # Maximum
        :max_value = maximum(skipmissing(:WEIGHT))
    end
end
SEXl | nid | mean_value | sd_value | lo90_ci | hi90_ci | median_value | min_value | max_value |
M | 26 | 73.7 | 10.3 | 70.3 | 77.2 | 75 | 58 | 102 |
F | 5 | 51.3 | 7.68 | 44.4 | 58.3 | 50 | 40 | 60 |
7 Categorical Variables
Handling categorical variables (i.e., assigning levels, creating bins from continuous data, etc.) or “factors” (in R terms) is predominantly handled by functions available in the CategoricalArrays.jl package in Julia.
Section 6.8 demonstrated how to assign descriptive labels for a numerical variable like :SEX. However, converting these variables to a Categorical type can be useful for assigning an order to the categories, adding labels for categories that may not be available in the current analysis dataset, and ease of re-labeling categories.
The example below demonstrates how :SEX can be converted to an ordered categorical variable by the use of the categorical function. To add descriptive labels, the recode function is used to define the labels for each value in :SEX (using pairs notation with =>). Labels can also be assigned to values that are not present. For example, a label can be assigned to missing despite there being no missing values for :SEX in the DataFrame. This is extremely useful when summarizing the demographics of an analysis population and noting the level of missing information, as the category is present when the levels of the variable are returned.
Note: The functions below require the use of @transform as these are not row-wise operations but are applied to the whole variable.
# Assign new labels for :SEX
sexcat_examp_df = @chain examp_df_join begin
    @transform @astable begin
        # Make SEX an ordered categorical variable
        :SEX = categorical(:SEX; ordered = true)
        # Assign new labels for each of the categories
        :SEX = recode(:SEX, 0 => "Female", 1 => "Male", missing => "Missing")
    end
end
# Return the levels of :SEX
levels(sexcat_examp_df.SEX)
3-element Vector{String}:
"Female"
"Male"
"Missing"
Categories can be re-ordered by passing a vector specifying the new order of categories to the levels keyword argument of categorical:
# Reorder the categories of :SEX
@transform! sexcat_examp_df :SEX =
categorical(:SEX; ordered = true, levels = ["Male", "Female", "Missing"])
# Return the levels of :SEX
levels(sexcat_examp_df.SEX)
3-element Vector{String}:
"Male"
"Female"
"Missing"
Categories can also be converted back to an integer type using levelcode. The numbers are based on the category’s index in the ordered categorical variable:
# Generate a numerical value to SEX categories
@rtransform! sexcat_examp_df :SEXn = levelcode(:SEX)
# Return the levels of :SEXn
levels(sexcat_examp_df.SEXn)
2-element Vector{Int64}:
1
2
7.1 Binning Continuous Variables
Continuous variables can be binned into multiple groups using the cut function to obtain a Categorical type for that variable. By default, the cut function returns category labels that include the quantile number and the range of values in that quantile:
# Bin age into 2 categories
agebin_examp_df = @chain sexcat_examp_df begin
    # Retain only the first row for each individual so that bins are assigned
    # correctly as subject-level information
    unique(:ID)
    # Cut the AGE variable into 2 categories
    @transform :AGEBIN = cut(:AGE, 2)
    # Retain only the necessary columns
    @select :ID :AGEBIN
    # Merge back with the original DataFrame with time-dependent information
    leftjoin(sexcat_examp_df, _, on = :ID)
end
# Return the levels of :AGEBIN
levels(agebin_examp_df.AGEBIN)
2-element Vector{String}:
"Q1: [21.0, 27.0)"
"Q2: [27.0, 63.0]"
Note: The cut function does have a keyword argument, labels, where descriptive labels for the bins can be provided. However, labels containing the cut-points must be assigned after the generation of the bins using recode:
# Re-label the categories for :AGEBIN
recode!(
    agebin_examp_df.AGEBIN,
    "Q1: [21.0, 27.0)" => "Age < 27 Years",
    "Q2: [27.0, 63.0]" => "Age ≥ 27 Years",
)
# Return the levels of :AGEBIN
levels(agebin_examp_df.AGEBIN)
2-element Vector{String}:
"Age < 27 Years"
"Age ≥ 27 Years"
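As a sketch of that alternative (the ages and labels below are illustrative, not from the original example), descriptive labels can be supplied up front through the labels keyword argument of cut:

```julia
using CategoricalArrays

ages = [21, 25, 30, 45, 63]

# Supply descriptive bin labels directly when cutting
# (these labels cannot reference the computed cut-points, which is why the
# recode approach above is needed when cut-points belong in the label)
agebin = cut(ages, 2; labels = ["Younger", "Older"])

levels(agebin)  # ["Younger", "Older"]
```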
8 Date/Time Variables
The Dates.jl package assists with handling variables that require Date/Time formats. Julia treats dates as a specific type in the base library, allowing for easy handling of Date/Time variables.
In the example dataset for this Module, the :DATETIME column has been consistently represented as a String type. Section 3.1 demonstrated that Date/Time variables can be read into the environment in the correct format through the CSV.read function. However, a variable can also be converted from type String to DateTime using the DateTime function and specifying the Date/Time format of that variable:
# Convert the DATETIME column from String to DateTime format
datetime_examp_df = @chain agebin_examp_df begin
    @rtransform :DATETIME = DateTime(:DATETIME, "yyyy-mm-ddTHH:MM:SS")
end
# Return first 5 values for :DATETIME
first(datetime_examp_df.DATETIME, 5)
5-element Vector{DateTime}:
1967-09-02T08:00:00
1967-09-02T08:00:00
1967-09-03T08:00:00
1967-09-03T20:00:00
1967-09-04T08:00:00
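The same parsing can be sketched on a single value using only the Dates standard library; the dateformat string macro pre-compiles the format, which is more efficient when parsing many values:

```julia
using Dates

# Pre-compile the format once, then reuse it for every value
fmt = dateformat"yyyy-mm-ddTHH:MM:SS"

dt = DateTime("1967-09-02T08:00:00", fmt)

year(dt)  # 1967
hour(dt)  # 8
```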
The Date and Time functions can be used to extract the Date and Time from a DateTime variable:
@rtransform! datetime_examp_df begin
:DATE = Date(:DATETIME)
:TIME = Time(:DATETIME)
end
ID | DATETIME | DATE | TIME | AMOUNT | EVID | CMT |
1 | 1967-09-02T08:00:00 | 1967-09-02 | 08:00:00 | 100 | 1 | 1 |
1 | 1967-09-02T08:00:00 | 1967-09-02 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-03T08:00:00 | 1967-09-03 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-03T20:00:00 | 1967-09-03 | 20:00:00 | missing | 0 | missing |
1 | 1967-09-04T08:00:00 | 1967-09-04 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-05T08:00:00 | 1967-09-05 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-06T08:00:00 | 1967-09-06 | 08:00:00 | missing | 0 | missing |
1 | 1967-09-07T08:00:00 | 1967-09-07 | 08:00:00 | missing | 0 | missing |
Conversely, Date and Time variables can be joined (i.e., added to each other) to create variables of DateTime format:
@rtransform! datetime_examp_df :DATETIME = :DATE + :TIME
Typically, the times of dosing and observation records in a pharmacometrics analysis dataset are not represented in DateTime format but as hours after the first dose or observation for an individual subject. This can be achieved by creating an interim variable for each individual that identifies the time of the first dose, and then subtracting it from all other values of :DATETIME within that individual.
Note: Subtracting 2 DateTime variables returns a value in units of milliseconds. This can be converted to units of hours by dividing by Hour(1):
# Calculate time after dose for each observation
tafd_examp_df = @chain datetime_examp_df begin
    # For each individual, determine time after first dose in hours
    # Group by each ID number (creates a GroupedDataFrame)
    @groupby :ID
    # Modify/transform for each group in the GroupedDataFrame...
    # Note the use of DataFrames.jl transform, not @transform
    transform(_) do group
        @rtransform group @astable begin
            # Determine the DATETIME of the first dose for the individual
            :first_dose = minimum(group.DATETIME[group.EVID.==1])
            # Calculate time after first dose for all records and convert to hours
            :TAFD = (:DATETIME - :first_dose) / Hour(1)
        end
    end
    # Remove the first_dose column (only an intermediate)
    @select Not(:first_dose)
end
ID | DATETIME | TAFD | AMOUNT | EVID | CMT |
1 | 1967-09-02T08:00:00 | 0 | 100 | 1 | 1 |
1 | 1967-09-02T08:00:00 | 0 | missing | 0 | missing |
1 | 1967-09-03T08:00:00 | 24 | missing | 0 | missing |
1 | 1967-09-03T20:00:00 | 36 | missing | 0 | missing |
1 | 1967-09-04T08:00:00 | 48 | missing | 0 | missing |
1 | 1967-09-05T08:00:00 | 72 | missing | 0 | missing |
1 | 1967-09-06T08:00:00 | 96 | missing | 0 | missing |
1 | 1967-09-07T08:00:00 | 120 | missing | 0 | missing |
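The period arithmetic underlying the :TAFD calculation can be checked in isolation with the Dates standard library (values taken from the first individual's records above):

```julia
using Dates

first_dose = DateTime(1967, 9, 2, 8)   # 1967-09-02T08:00:00
obs_time = DateTime(1967, 9, 3, 20)    # 1967-09-03T20:00:00

# Subtracting two DateTime values yields a Millisecond period
elapsed = obs_time - first_dose

# Dividing by Hour(1) converts the period to a plain number of hours
tafd = elapsed / Hour(1)  # 36.0
```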
9 Plotting and Data Visualization
The recommended Julia packages for performing data visualization and generating publication-quality figures are:
- AlgebraOfGraphics.jl provides a set of tools for plotting data in Julia. Its design and functionality are similar to those of ggplot2 in R, whereby it involves the development of layers (data, mapping aesthetics, and geometrics) to build a plot.
- CairoMakie.jl is the underlying plotting system for AlgebraOfGraphics.jl, using a Cairo backend to draw vector graphics to SVG and PNG.
While most plots can be generated by only interacting with AlgebraOfGraphics.jl, it should be emphasized that a good foundational knowledge of CairoMakie.jl will allow additional customization (such as arranging multiple plots).
9.1 The Algebra Of Graphics
The general structure for creating a plot is as follows:
The input DataFrame can be prepared prior to developing the plot. In our example, only the non-missing concentration records are required. It is important to ensure there are no missing values in variables intended to be plotted, as AlgebraOfGraphics functions will throw error messages.
# Generating a DataFrame containing only relevant information
plot_examp_df = @chain tafd_examp_df begin
    # Retain only observation records
    @rsubset :EVID == 0
    # Exclude records with missing PK observations
    dropmissing(:DV_1)
end
The general structure of a layer consists of data, mapping, and visual elements.
Elements of a layer are concatenated using * (the multiplication operator), and multiple layers are superimposed onto each other using + (the addition operator). The implementation is consistent with the order of operations, therefore the order in which layers are sequentially added is important.
Once all layers are combined, the resulting plot is compiled using draw.
# Creating a layer for the plot (observations over time)
p_obs_time_scatter =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    # First: x-axis
    # Second: y-axis
    # [Optional] Third: z-axis (for 3D plots)
    # All other subsequent options: keyword arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Specifying the visuals/geometry
    visual(Scatter)

# Creating a layer for individual lines connecting observations over time
p_obs_time_lines = data(plot_examp_df) * mapping(:TAFD, :DV_1, group = :ID) * visual(Lines)

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter + p_obs_time_lines

# Draw the resulting plot
draw(p_obs_time)
Note: An input DataFrame for plotting is not necessary and vectors can be directly passed to mapping in the absence of data. Additionally, if visual is not supplied as an element for the layer, the default geometry Scatter will be used.
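A minimal sketch of both points (the vectors below are illustrative values, not taken from the example dataset):

```julia
using AlgebraOfGraphics, CairoMakie

# Illustrative vectors passed directly to mapping - no data(...) element required
times = [0, 12, 24, 36, 48]
concs = [10.0, 7.1, 5.0, 3.5, 2.5]

# No visual(...) is supplied, so the default Scatter geometry is used
p_vec = mapping(times, concs)
draw(p_vec)
```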
The composition of the plot can also be simplified, particularly if several layers require common data and mapping elements. The simplification of our example would be:
# Creating a layer for the plot (observations over time)
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Combining the visuals/geometry
    (visual(Scatter) + visual(Lines))

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter_lines

# Draw the resulting plot
draw(p_obs_time)
9.2 Stratification Through Mapping
Several elements of a plot can be modified based on stratification variables. For example, to assign different colors to different values of sex, the appropriate column is passed to the keyword argument color. Other keyword arguments that can be passed are dependent on the visual selected. Common keyword arguments include marker and linestyle.
There are helper functions and syntax that can assist with providing descriptive labels or modifying plotting variables (if not available in the input DataFrame). These use pair syntax (denoted with => to specify the relationships). They take the general form:
:column_name => function() => "New Label"
Where :column_name is the current variable name in the input DataFrame, function() is optional and allows you to pass a function (anonymous or a helper function described below) that modifies the values of the variable, and "New Label" is the new label for :column_name.
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color = p_obs_time_scatter_lines * mapping(color = :SEXl => "Sex");
Helper functions that assist with manipulating the variable prior to plotting include:
sorter, which allows the reordering of values for the variable:
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines * mapping(color = :SEXl => sorter("M", "F") => "Sex");
renamer, which allows renaming of the values for the variable via pairs syntax. This requires specifying the relationship between the old value and the new value, i.e., "old value" => "new value":
# Modifying a previous layer to account for different colors with
# different levels of SEXl
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines *
    mapping(color = :SEXl => renamer("F" => "Female", "M" => "Male") => "Sex");
Stratification may fail when non-categorical variables (such as those with types Float or Int) are passed to keyword arguments in mapping. In our example analysis dataset, the SEX variable takes values of 0 or 1. The use of nonnumeric allows these variables to be passed:
# Modifying a previous layer to account for different colors with
# different levels of SEX
p_obs_time_scatter_lines_color =
    p_obs_time_scatter_lines * mapping(color = :SEX => nonnumeric => "Sex");
9.2.1 Facetting
Plots can be facetted on stratification variables by specifying the keyword arguments row and/or col in mapping, depending on the intended layout. By default, the x- and y-axes are linked between the facets.
# Modifying a previous layer to account for different facets with
# different levels of SEXl
p_obs_time_scatter_lines_facetsex = p_obs_time_scatter_lines * mapping(row = :SEXl);
# Modifying a previous layer to account for different facets with
# different levels of AGEBIN
p_obs_time_scatter_lines_facetage = p_obs_time_scatter_lines * mapping(col = :AGEBIN);
A grid layout can be constructed by passing variables to both the col and row keyword arguments:
# Modifying a previous layer to account for different facets with
# different levels of SEXl and AGEBIN
p_obs_time_scatter_lines_facetgrid =
    p_obs_time_scatter_lines * mapping(col = :SEXl, row = :AGEBIN);
9.3 Visuals (aka Geometrics)
The examples provided heavily demonstrate the use of the Scatter and Lines visuals. However, there are several others available from CairoMakie.jl - Plots. Presented below are some common examples used in pharmacometrics for summarizing the population’s demographics:
# Ensure 1 row per ID before summarizing demographics
oneperid_examp_df = unique(tafd_examp_df, :ID)
Linear regression and LOESS (locally-estimated scatter-plot smoothing) trend lines can be applied over an x-y scatter plot using AlgebraOfGraphics.linear and AlgebraOfGraphics.smooth, respectively. Note: neither linear nor smooth is an argument for visual, and both need to be specified as being from the AlgebraOfGraphics.jl package. As shown below, other visual options can be concatenated to the trend line functions using *.
The confidence interval of the linear regression can be specified by the use of the keyword argument level. Here, 0.95 is the default and corresponds to α = 0.05.
Additional options can be specified as keyword arguments for visual aesthetics associated with lines (CairoMakie.jl - lines).
# Generate a scatter plot of body weight versus age
# Apply a linear regression line
p_weight_age =
    data(oneperid_examp_df) *
    mapping(:AGE, :WEIGHT) *
    (
        visual(Scatter) +
        AlgebraOfGraphics.linear(level = 0.95) * visual(; label = "Linear Regression") +
        AlgebraOfGraphics.smooth() * visual(; color = :red, label = "LOESS")
    )
# Draw the resulting plot
draw(p_weight_age, legend = (; framevisible = false, position = :bottom))
A histogram can be plotted by passing Hist to visual. Additional options can be specified as keyword arguments, including the number of bins, normalization (i.e., density, probability density function, etc.), and visual aesthetics (CairoMakie.jl - hist).
# Generate a histogram of body weight
p_weight_hist =
    data(oneperid_examp_df) *
    mapping(:WEIGHT) *
    visual(Hist; normalization = :pdf, strokecolor = :black, color = :dodgerblue4)
# Draw the resulting plot
draw(p_weight_hist, axis = (; xlabel = "Body Weight (kg)", ylabel = "Probability Density"))
A box-and-whisker plot can be plotted by passing BoxPlot to visual. Additional options can be specified as keyword arguments, including visual aesthetics (CairoMakie.jl - boxplot).
# Generate a box-and-whisker plot of body weight stratified by sex
p_weight_bxplot =
    data(oneperid_examp_df) *
    mapping(:SEXl, :WEIGHT, color = :SEXl) *
    visual(BoxPlot; show_notch = true)
# Draw the resulting plot
draw(p_weight_bxplot, legend = (; show = false))
And now for something completely different…
Generating pairwise plots requires the use of another Julia package called PairPlots.jl. This package uses the Makie plotting library, so there should be familiarity in the context of the other plots presented using AlgebraOfGraphics.jl and CairoMakie.jl.
A pairwise plot can be generated using pairplot. The example code below demonstrates just one method for creating pairwise plots in Julia; it is highly recommended to use help queries where possible, i.e., ?pairplot, and to review the PairPlots.jl Guide.
# Retain only the columns required for plotting
cov_oneperid_examp_df = @select oneperid_examp_df :WEIGHT :AGE

# Construct a figure object
p_covcorr = Figure()
# Construct the pairwise plot into the figure object and specify its position in the grid
pairplot(
    p_covcorr[1, 1],
    # Specify the input DataFrame
    cov_oneperid_examp_df => (
        # Specify elements that should be presented on the off-diagonals
        # Scatter plot with a correlation
        PairPlots.Scatter(marker = '∘', markersize = 24, alpha = 0.5, color = :dodgerblue4),
        PairPlots.Calculation(StatsBase.cor, position = Makie.Point2f(0.2, 0.1)),
        PairPlots.TrendLine(color = :firebrick3),
        # Specify elements that should be presented on the diagonals
        # Histogram with a density distribution
        PairPlots.MarginHist(color = :dodgerblue4, strokewidth = 0.5),
        PairPlots.MarginDensity(color = :black),
    ),
    # Specify if the off-diagonal elements should be presented on both
    # the upper and lower triangles
    fullgrid = false,
    # Re-label variable names to be more descriptive
    labels = Dict(:WEIGHT => "Body Weight (kg)", :AGE => "Age (years)"),
    # Options for labels for axis ticks
    bodyaxis = (; xticklabelrotation = 0, yticklabelrotation = 0),
    diagaxis = (; xticklabelrotation = 0, yticklabelrotation = 0),
)
# Print the plot
p_covcorr
9.4 Figure Customizations
Other figure customizations that globally impact the plot can be modified and are passed to draw through its keyword arguments. These options are typically passed as a NamedTuple.
Figure options predominantly take on the arguments from CairoMakie.jl - Figures. Some example options are provided:
draw(
    p_obs_time,
    figure = (;
        # Modify the figure resolution (in px)
        size = (600, 400),
        # Modify figure padding
        figure_padding = 40,
        # Modify background color
        backgroundcolor = :gray80,
    ),
)
Axis options predominantly take on the arguments from CairoMakie.jl - Creating an Axis. Some example options are provided:
draw(
    p_obs_time,
    axis = (;
        # Add descriptive labels to the plot
        title = "Concentration-Time Profiles",
        xlabel = "Time After First Dose (hours)",
        ylabel = "Concentration (mg/L)",
        # Make the y-axis on a semi-log scale
        yscale = Makie.pseudolog10,
        # Define the placement of axis ticks/labels
        # Specify a range from 0 to 120, every 12 hours
        xticks = [0:12:120;],
        # Specify exact values where ticks/labels should appear
        # Also specify exactly how the labels should appear
        yticks = (
            [0.1, 0.3, 1, 3, 10, 30, 100, 300],
            ["0.1", "0.3", "1", "3", "10", "30", "100", "300"],
        ),
        # Define the limits of the plot
        # Format ((xmin, xmax), (ymin, ymax))
        limits = ((-5, 125), (nothing, nothing)),
    ),
)
A legend appears by default when keyword arguments such as color and marker are specified in mapping. Additionally, a custom legend can be built to describe specific layers by passing the keyword argument label in visual for that layer. In the example below, “Observation” is passed to the label argument for both the Scatter and Lines geometrics. The common label for these geometrics creates a single legend key incorporating both.
By default, a legend is positioned to the right of the plot with a black frame. Legend options predominantly take on the arguments from CairoMakie.jl - Legend. Some example options are provided:
# Creating a layer for the plot (observations over time)
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID) *
    # Combining the visuals/geometry
    (visual(Scatter; label = "Observation") + visual(Lines; label = "Observation"))

# Combine the layers of the plot
p_obs_time = p_obs_time_scatter_lines

# Draw the resulting plot
draw(p_obs_time, legend = (; framevisible = false, position = :bottom))
Note: While not provided in the example, the keyword argument `nbanks` is useful for arranging legends with many elements over multiple rows, and `show = false` is useful to remove the legend completely.
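As a minimal sketch of those two options (assuming the same `p_obs_time` plot object drawn above):

```julia
# Spread a long legend across 2 banks
# (rows when the legend is horizontal, columns when vertical)
draw(p_obs_time, legend = (; position = :bottom, nbanks = 2))

# Remove the legend entirely
draw(p_obs_time, legend = (; show = false))
```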
Some options for facetted plots are specified through the `facet` keyword argument. The x- and y-axes can be linked (where ticks and labels are removed depending on the arrangement of the facets) or not linked (each facet of the plot has its own axis ticks and labels):
```julia
p_obs_time_scatter_lines =
    # Specifying the input DataFrame
    data(plot_examp_df) *
    # Specifying the mapping aesthetics using positional arguments
    mapping(:TAFD, :DV_1, group = :ID, col = :AGEBIN) *
    # Combining the visuals/geometry
    (visual(Scatter) + visual(Lines))

# Draw the resulting plot
draw(p_obs_time_scatter_lines, facet = (;
    # Option to link x and/or y axes
    # Default is :minimal (removes ticks and labels)
    linkxaxes = :minimal,
    linkyaxes = :none,
))
```
All properties defined in `mapping` (such as color, linestyle, marker, layout, etc.) for stratification are passed as scale options using the `scales` function. This is then provided as the second positional argument of `draw`. Further details are specified in Algebra Of Graphics - Scale Options.
For example, to manually set the color palette for the stratification of sex in the plot:
```julia
# Draw the resulting plot
draw(p_obs_time_scatter_lines_color, scales(Color = (; palette = [:blue, :red])))
```
Alternatively, preset color palettes from ColorSchemes.jl can also be applied.
```julia
# Draw the resulting plot
draw(p_obs_time_scatter_lines_color, scales(Color = (; palette = :tab10)))
```
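The same `scales` pattern applies to aesthetics other than color. As a sketch, assuming a hypothetical plot object `p_marker` whose `mapping` includes a `marker` stratifier (e.g., `marker = :SEXC`):

```julia
# Manually set the marker symbols used for each level of the stratifier
draw(p_marker, scales(Marker = (; palette = [:circle, :utriangle])))
```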
9.5 Arranging Multiple Plots
All AlgebraOfGraphics.jl plots can be arranged to create multi-panel figures. The example provided below demonstrates the arrangement of two figures previously generated in this Module: 1) individual concentration-time profiles colored by sex (`p_obs_time_scatter_lines_color`), and 2) box-and-whisker plots depicting the distribution of body weight stratified and colored by sex (`p_weight_bxplot`).
Advanced layouts need to be performed by CairoMakie.jl. The general process involves:
- Creating a `Figure()` object
- Adding elements or layers (or AlgebraOfGraphics.jl plots in this example) to positions in the Figure. Note: a grid layout does not need to be pre-defined in order to arrange the figures and can be specified as each element is added
- Returning and printing the resulting figure
It is important to recognize that axis, legend, and facet options that were specified when generating the AlgebraOfGraphics.jl plot may not be carried over when arranged using CairoMakie.jl. Therefore, CairoMakie.jl functions such as `Axis` and `Legend` will be required to add these layers. Two examples are provided below (note the use of `draw!` here instead of `draw`):
```julia
# Create an empty Figure object
p_sex = Figure()

# Draw the AlgebraOfGraphics.jl scatter plot in the following position:
# Row 1, Column 1
draw!(p_sex[1, 1], p_obs_time_scatter_lines_color)

# Then, draw the AlgebraOfGraphics.jl box-and-whisker plot in the
# following position: Row 1, Column 2
draw!(p_sex[1, 2], p_weight_bxplot)

# Return and print the figure
p_sex
```
```julia
# Create an empty Figure object
p_sex = Figure()

# First, specify axis options for a plot in the following position:
# Row 1, Column 1
ax_sex_scatter = Axis(
    p_sex[1, 1],
    xlabel = "Time After First Dose (hours)",
    ylabel = "Concentration (mg/L)",
)

# Then, draw the AlgebraOfGraphics.jl scatter plot where the Axis
# has been defined
draw!(ax_sex_scatter, p_obs_time_scatter_lines_color)

# Then, specify axis options for a plot in the following position:
# Row 1, Column 2
ax_sex_bxplot = Axis(
    p_sex[1, 2],
    xlabel = "Sex",
    xticks = ([1, 2], ["Female", "Male"]),
    ylabel = "Body Weight (kg)",
)

# Then, draw the AlgebraOfGraphics.jl box-and-whisker plot where
# the Axis has been defined
p_sex_bxplot = draw!(ax_sex_bxplot, p_weight_bxplot)

# Then, add one of the legends from one of the figures to the plot
# (Row 2, Columns 1-2)
legend!(
    p_sex[2, 1:2],
    p_sex_bxplot,
    titleposition = :left,
    orientation = :horizontal,
    position = :bottom,
    framevisible = false,
)

# Return and print the figure
p_sex
```
10 Summary
Julia has several packages for data manipulation and visualization, such as DataFrames.jl, DataFramesMeta.jl, and AlgebraOfGraphics.jl, that offer concepts and functionality similar to `dplyr` and `ggplot2` in R. This Module has aimed to provide a brief introduction to how these packages can be used in the context of pharmacometrics; however, it does not cover their full extent. It is therefore highly recommended to refer to the package documentation and the other PumasAI tutorials linked throughout this Module to explore further functionality.