Use of XPT Files in Pumas and R

Author

Jose Storopoli

XPT files facilitate the transfer of data across diverse systems, and serve as a nearly universal data language. Originating from the SAS Institute, they adhere to the SAS Transport (Version 5) format established in the 1980s, ensuring data integrity and compatibility across different software platforms.

Floating Point Format

SAS Transport Version 5, or XPT files, uses a different floating point format than the default in all consumer-grade computers, IEEE-754. The format that XPT files uses is the Hexadecimal Floating Point (HFP) by IBM.

You don’t need to worry about that since this is handled internally by the software implementations.

1 Importance of XPT Files

  • Regulatory Submissions: Regulatory agencies allow data to be submitted in XPT format. In particular, the US FDA allows XPT formatted files for submission of data as part of an Electronic Common Technical Document (eCTD).

    The US Library of Congress maintains an archive of digital data formats with additional background information available here.

  • Data Sharing: The XPT file format enables seamless data sharing among stakeholders, including regulatory bodies, sponsors, and CROs (Contract Research Organizations).

  • Long-term Data Retention: XPT files are also used for archiving, ensuring data can be accessed and understood years after a study’s completion.

2 How Do They Work?

An XPT file is comprised of two main components:

  1. Header Information: Metadata about the dataset, including name, label, version, and the origin SAS library.
  2. Data: The dataset itself, structured for readability across various software systems.

3 XPT Support for Different Languages

We have XPT support in different programming languages:

  • Julia: The ReadStatTables.jl package can read and write XPT files.
  • R: The haven package provides functions to read and write XPT files.
  • Python: The sas7bdat package can read and write XPT files.

In this tutorial, we’ll cover R and Julia examples. For Python, you can refer to the sas7bdat package documentation.

3.1 Handling XPT Files in Julia/Pumas

Any Pumas app can read XPT files using the ReadStatTables.jl package. First you’ll need to import the package:

using ReadStatTables

Then you can read the XPT file using the readstat function:

tb = readstat(joinpath(@__DIR__, "data", "iv_bolus_sd.xpt"))

As you can see, the readstat function returns a ReadStatTable object. But you can (and probably should) convert it to a DataFrame:

using DataFrames
using DataFramesMeta
df = DataFrame(tb)
first(df, 5)
5×6 DataFrame
Row id time conc amt route dose
Float64 Float64 Float64? Float64? String3 Float64
1 1.0 0.0 missing 250.0 iv 250.0
2 1.0 0.0 8.38202 missing iv 250.0
3 1.0 0.5 7.6079 missing iv 250.0
4 1.0 1.0 8.73121 missing iv 250.0
5 1.0 2.0 5.84823 missing iv 250.0

Now you can use the df object as you would with any other DataFrame in Pumas.

3.1.1 Accessing Metadata

XPT Metadata

SAS supports metadata for XPT files. This metadata includes information about the dataset, such as the column names, labels, and types.

This information is also available in the DataFrames.jl package with the [delete|empty][col]metadata[keys][!] family of functions that operates on the metadata of DataFrames. Please check the Metadata DataFrames.jl documentation for more information.

You can access the metadata of the XPT file using the colmetadata function:

tb_metadata = colmetadata(tb)
ColMetaIterator{ReadStatColMeta} with 6 entries:
  :id    => ReadStatColMeta(id, BEST12)
  :time  => ReadStatColMeta(time, BEST12)
  :conc  => ReadStatColMeta(conc, BEST12)
  :amt   => ReadStatColMeta(amt, BEST12)
  :route => ReadStatColMeta(route, )
  :dose  => ReadStatColMeta(dose, BEST12)

This will return a Dict with the column names as keys and the metadata as values. Notice that the returned metadata values are Dicts themselves with the metadata keys as keys and the metadata values as values.

tb_metadata[:conc]
ReadStatColMeta:
  label         => conc
  format        => BEST12
  type          => READSTAT_TYPE_DOUBLE
  value label   => BEST12
  storage width => 8
  display width => 12
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_LEFT

3.1.2 Saving XPT Files in Julia/Pumas

To save a DataFrame as an XPT file, you can use the writestat function. First, let’s load a dataset from PharmaDatasets.jl:

using PharmaDatasets
df = dataset("sdtm_1/ex.xpt")
first(df, 5)
5×17 DataFrame
Row STUDYID DOMAIN USUBJID EXSEQ EXTRT EXCAT EXDOSE EXDOSU EXDOSFRM EXROUTE VISITNUM VISIT EPOCH EXSTDTC EXENDTC EXSTDY EXENDY
String15 String3 String31 Float64 String15 String31 Float64 String3 String15 String15 Float64 String String15 String31 String31 Float64? Float64?
1 DU101001 EX DU101001-00011-00001 1.0 DRUG101 STUDY DRUG ADMINISTRATION 200.0 mg INJECTION INTRAVENOUS 3001.0 Arm B Treatment Cycle 1 Day 1 TREATMENT 2019-07-27T12:34 2019-07-27T12:34 1.0 1.0
2 DU101001 EX DU101001-00011-00001 2.0 DRUG101 STUDY DRUG ADMINISTRATION 200.0 mg INJECTION INTRAVENOUS 3002.0 Arm B Treatment Cycle 2 Day 1 TREATMENT 2019-08-17T18:22 2019-08-17T18:22 22.0 22.0
3 DU101001 EX DU101001-00011-00001 3.0 DRUG101 STUDY DRUG ADMINISTRATION 200.0 mg INJECTION INTRAVENOUS 3003.0 Arm B Treatment Cycle 3 Day 1 TREATMENT 2019-09-07T08:15 2019-09-07T08:15 43.0 43.0
4 DU101001 EX DU101001-00011-00001 4.0 DRUG101 STUDY DRUG ADMINISTRATION 200.0 mg INJECTION INTRAVENOUS 3004.0 Arm B Treatment Cycle 4 Day 1 TREATMENT 2019-09-28T13:46 2019-09-28 64.0 64.0
5 DU101001 EX DU101001-00011-00001 5.0 DRUG101 STUDY DRUG ADMINISTRATION 200.0 mg INJECTION INTRAVENOUS 3005.0 Arm B Treatment Cycle 5 Day 1 TREATMENT 2019-10-19T02:29 2019-10-19T02:29 85.0 85.0
The “sdtm_1/ex.xpt” dataset

I am choosing the ex.xpt file from the sdtm_1 dataset because it was originally an XPT file, and it has dates and datetime objects that are a little tricky to handle in XPT. We’ll use this dataset later to demonstrate how to parse dates and datetimes.

Now we’re going to save this DataFrame as an XPT file:

writestat(joinpath(@__DIR__, "data", "ex.xpt"), df)

3.1.3 Reading Dates and Datetimes

When reading XPT files, you may encounter issues with dates and datetimes. Since XPT files are an old format, they don’t have a native date or datetime type. Instead, the date and time values are stored as the numbers of periods elapsed since a reference date or time point (epoch) chosen by the software. Therefore, knowing the reference date/time and the length of a single period is sufficient for uncovering the represented date/time values for a given format.

If a variable is in a date/time format that can be recognized, the values will be displayed as Julia Date or DateTime. You can find more about these internal translations in the ReadStatTables.jl documentation.

Let’s showcase this with the ex.xpt dataset we saved earlier:

tb = readstat(joinpath(@__DIR__, "data", "ex.xpt"))

In this dataset we have two columns, EXSTDTC and EXENDTC, that are dates:

ReadStatTables.colmetadata(tb, :EXSTDTC)
ReadStatColMeta:
  label         => Start Date/Time of Treatment
  format        => 31
  type          => READSTAT_TYPE_STRING
  value label   => 31
  storage width => 31
  display width => 31
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_LEFT
ReadStatTables.colmetadata(tb, :EXENDTC)
ReadStatColMeta:
  label         => End Date/Time of Treatment
  format        => 31
  type          => READSTAT_TYPE_STRING
  value label   => 31
  storage width => 31
  display width => 31
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_LEFT

As you can see by the metadata labels, these columns represent the Start and End Date/Time of the study, respectively.

However, they are automatically read as Strings by ReadStatTables.jl.

tb.EXENDTC |> typeof
Vector{String31} (alias for Array{String31, 1})
tb.EXSTDTC |> typeof
Vector{String31} (alias for Array{String31, 1})

Hence, we need to convert them to Date and/or DateTime objects. This can be accomplished with the helper functions in the Dates Julia’s standard library module:

using Dates
using DataFrames
using DataFramesMeta
df = DataFrame(tb);
@rtransform df :EXSTDTC_date = DateTime(:EXSTDTC)
ArgumentError: ArgumentError("Cannot parse an empty string as a DateTime")
ArgumentError: Cannot parse an empty string as a DateTime
Stacktrace:
  [1] parse(#unused#::Type{DateTime}, s::String31, df::DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"), Tuple{Dates.DatePart{'y'}, Dates.Delim{Char, 1}, Dates.DatePart{'m'}, Dates.Delim{Char, 1}, Dates.DatePart{'d'}, Dates.Delim{Char, 1}, Dates.DatePart{'H'}, Dates.Delim{Char, 1}, Dates.DatePart{'M'}, Dates.Delim{Char, 1}, Dates.DatePart{'S'}, Dates.Delim{Char, 1}, Dates.DatePart{'s'}}})
    @ Dates ~/_work/_tool/julia/1.9.3/x64/share/julia/stdlib/v1.9/Dates/src/parse.jl:201
  [2] DateTime (repeats 2 times)
    @ ~/_work/_tool/julia/1.9.3/x64/share/julia/stdlib/v1.9/Dates/src/io.jl:568 [inlined]
  [3] iterate
    @ ./generator.jl:47 [inlined]
  [4] collect_to!
    @ ./array.jl:840 [inlined]
  [5] collect_to_with_first!
    @ ./array.jl:818 [inlined]
  [6] _collect(c::Vector{String31}, itr::Base.Generator{Vector{String31}, Type{DateTime}}, #unused#::Base.EltypeUnknown, isz::Base.HasShape{1})
    @ Base ./array.jl:812
  [7] collect_similar(cont::Vector{String31}, itr::Base.Generator{Vector{String31}, Type{DateTime}})
    @ Base ./array.jl:711
  [8] map(f::Type, A::Vector{String31})
    @ Base ./abstractarray.jl:3263
  [9] (::ByRow{DataType})(cols::Vector{String31})
    @ Tables ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Tables/u1X7W/src/utils.jl:229
 [10] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any})
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:562
 [11] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:805
 [12] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1778
 [13] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1698
 [14] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1299
 [15] select
    @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1299 [inlined]
 [16] #transform#589
    @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1379 [inlined]
 [17] transform(df::DataFrame, args::Any)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1379
 [18] top-level scope
    @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/nI3Y1/src/macros.jl:1600

As you can see this errors out because we have some missing data represented as empty strings, i.e. "":

@chain df begin
    @rsubset :EXSTDTC == ""
    @combine :nrow = length(:EXSTDTC)
end
1×1 DataFrame
Row nrow
Int64
1 33

We can fix this by replacing the empty strings with missing:

@rtransform! df :EXSTDTC = :EXSTDTC == "" ? missing : :EXSTDTC
@rtransform! df @passmissing :EXSTDTC_date = DateTime(:EXSTDTC)
@passmissing

Here we use the @passmissing macro to ensure that the DateTime constructor can handle missing values. Please check the Handling NAs and Missing Values tutorial for more information on @passmissing.

Now let’s check the EXSTDTC_date column:

@chain df begin
    @select :EXSTDTC_date
    describe
end
1×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Nothing DateTime DateTime DateTime Int64 Union
1 EXSTDTC_date 2019-01-08T00:00:00 2019-12-16T10:26:00 2020-12-29T09:18:00 33 Union{Missing, DateTime}

It works! You can do the same for the EXENDTC column and any other date or datetime columns you may have in your dataset.

Handling Dates and Datetimes

If you need to handle dates and datetimes in your dataset, don’t forget to check the Handling Dates and Times tutorial

3.2 Handling XPT Files in R

You can also handle XPT files in R using tidyverse’s {haven} package.

The ReadStat C Library

Under the hood both ReadStatTables.jl and {haven} use the ReadStat C library.

Hence, both packages have equivalent features.

Let’s load the {haven} package which exposes a read_xpt function:

library(haven)
full_path <- here::here("docs", "src", "quarto", "data_wrangling", "ex.xpt")
tb <- read_xpt(full_path)
# A tibble: 2,878 × 17
   STUDYID  DOMAIN USUBJID       EXSEQ EXTRT EXCAT EXDOSE EXDOSU EXDOSFRM EXROUTE
   <chr>    <chr>  <chr>         <dbl> <chr> <chr>  <dbl> <chr>  <chr>    <chr>  
 1 DU101001 EX     DU101001-000…     1 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 2 DU101001 EX     DU101001-000…     2 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 3 DU101001 EX     DU101001-000…     3 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 4 DU101001 EX     DU101001-000…     4 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 5 DU101001 EX     DU101001-000…     5 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 6 DU101001 EX     DU101001-000…     6 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 7 DU101001 EX     DU101001-000…     7 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 8 DU101001 EX     DU101001-000…     8 DRUG… STUD…    200 mg     INJECTI… INTRAV…
 9 DU101001 EX     DU101001-000…     9 DRUG… STUD…    200 mg     INJECTI… INTRAV…
10 DU101001 EX     DU101001-000…    10 DRUG… STUD…    200 mg     INJECTI… INTRAV…
# ℹ 2,868 more rows
# ℹ 7 more variables: VISITNUM <dbl>, VISIT <chr>, EPOCH <chr>, EXSTDTC <chr>,
#   EXENDTC <chr>, EXSTDY <dbl>, EXENDY <dbl>
# ℹ Use `print(n = ...)` to see more rows

As you can see this returns a tibble.

You can also get the metadata from a XPT file with str:

str(tb)
tibble [2,878 × 17] (S3: tbl_df/tbl/data.frame)
 $ STUDYID : chr [1:2878] "DU101001" "DU101001" "DU101001" "DU101001" ...
  ..- attr(*, "label")= chr "Study Identifier"
  ..- attr(*, "format.sas")= chr "15"
 $ DOMAIN  : chr [1:2878] "EX" "EX" "EX" "EX" ...
  ..- attr(*, "label")= chr "Domain Abbreviation"
  ..- attr(*, "format.sas")= chr "9"
 $ USUBJID : chr [1:2878] "DU101001-00011-00001" "DU101001-00011-00001" "DU101001-00011-00001" "DU101001-00011-00001" ...
  ..- attr(*, "label")= chr "Unique Subject Identifier"
  ..- attr(*, "format.sas")= chr "31"
 $ EXSEQ   : num [1:2878] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "label")= chr "Sequence Number"
  ..- attr(*, "format.sas")= chr "9"
 $ EXTRT   : chr [1:2878] "DRUG101" "DRUG101" "DRUG101" "DRUG101" ...
  ..- attr(*, "label")= chr "Name of Treatment"
  ..- attr(*, "format.sas")= chr "15"
 $ EXCAT   : chr [1:2878] "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" ...
  ..- attr(*, "label")= chr "Category of Treatment"
  ..- attr(*, "format.sas")= chr "31"
 $ EXDOSE  : num [1:2878] 200 200 200 200 200 200 200 200 200 200 ...
  ..- attr(*, "label")= chr "Dose"
  ..- attr(*, "format.sas")= chr "9"
 $ EXDOSU  : chr [1:2878] "mg" "mg" "mg" "mg" ...
  ..- attr(*, "label")= chr "Dose Units"
  ..- attr(*, "format.sas")= chr "9"
 $ EXDOSFRM: chr [1:2878] "INJECTION" "INJECTION" "INJECTION" "INJECTION" ...
  ..- attr(*, "label")= chr "Dose Form"
  ..- attr(*, "format.sas")= chr "15"
 $ EXROUTE : chr [1:2878] "INTRAVENOUS" "INTRAVENOUS" "INTRAVENOUS" "INTRAVENOUS" ...
  ..- attr(*, "label")= chr "Route of Administration"
  ..- attr(*, "format.sas")= chr "15"
 $ VISITNUM: num [1:2878] 3001 3002 3003 3004 3005 ...
  ..- attr(*, "label")= chr "Visit Number"
  ..- attr(*, "format.sas")= chr "9"
 $ VISIT   : chr [1:2878] "Arm B Treatment Cycle 1 Day 1" "Arm B Treatment Cycle 2 Day 1" "Arm B Treatment Cycle 3 Day 1" "Arm B Treatment Cycle 4 Day 1" ...
  ..- attr(*, "label")= chr "Visit Name"
  ..- attr(*, "format.sas")= chr "30"
 $ EPOCH   : chr [1:2878] "TREATMENT" "TREATMENT" "TREATMENT" "TREATMENT" ...
  ..- attr(*, "label")= chr "Epoch"
  ..- attr(*, "format.sas")= chr "15"
 $ EXSTDTC : chr [1:2878] "2019-07-27T12:34" "2019-08-17T18:22" "2019-09-07T08:15" "2019-09-28T13:46" ...
  ..- attr(*, "label")= chr "Start Date/Time of Treatment"
  ..- attr(*, "format.sas")= chr "31"
 $ EXENDTC : chr [1:2878] "2019-07-27T12:34" "2019-08-17T18:22" "2019-09-07T08:15" "2019-09-28" ...
  ..- attr(*, "label")= chr "End Date/Time of Treatment"
  ..- attr(*, "format.sas")= chr "31"
 $ EXSTDY  : num [1:2878] 1 22 43 64 85 106 127 148 169 190 ...
  ..- attr(*, "label")= chr "Study Day of Start of Treatment"
  ..- attr(*, "format.sas")= chr "9"
 $ EXENDY  : num [1:2878] 1 22 43 64 85 106 127 148 169 190 ...
  ..- attr(*, "label")= chr "Study Day of End of Treatment"
  ..- attr(*, "format.sas")= chr "9"

3.2.1 Saving XPT Files in R

To save a tibble as an XPT file, you can use the write_xpt function:

write_xpt(tb, here::here("docs", "src", "quarto", "data_wrangling", "ex.xpt"))

3.2.2 Reading Dates and Datetimes

As described above for Julia, when reading XPT files in R, you may encounter issues with dates and datetimes. We can see in the metadata above that our datetime columns EXSTDTC and EXENDTC were parsed as a string vector, i.e. as char vectors. Hence, we need to perform the same operation we did before but with tidyverse now:

library(dplyr)
library(tidyr)
library(lubridate)

tb %>%
  mutate(
    EXSTDTC = if_else(EXSTDTC == "", NA_character_, EXSTDTC),
    EXSTDTC_date = ymd_hm(EXSTDTC),
    EXENDTC = if_else(EXENDTC == "", NA_character_, EXENDTC),
    EXENDTC_date = ymd_hm(EXENDTC)
  ) %>%
  select(EXSTDTC_date, EXENDTC_date) %>%
  summarize(across(everything(), list(n = ~ sum(!is.na(.)))))
# A tibble: 1 × 2
  EXSTDTC_date_n EXENDTC_date_n
           <int>          <int>
1           2703           2713

Please refer to the amazing {lubridate} package on how to convert strings to dates/datetimes.

4 NONMEM: XPT to csv

If one is using NONMEM for data analysis, it is important to consider the issue of missing values when saving an XPT file as a CSV file for use as a NONMEM analysis dataset. Using the . (dot) character value is fine for some variables, but often not for covariates, since NONMEM will interpret the . (dot) as a zero (0) in a covariate data item. This can cause issues in your model.

One recommendation is to use a very unusual value, such as -99, to represent missing values in covariates. Then you can handle these values in your model code, in the $PK block, for example.

In order to accomplish this, you can use the replace_na function from the tidyr package in R:

library(tidyr)
df %>% replace_na(list(covariate = -99))

This will replace all NA values in the covariate column with -99.

In Julia, you can use the coalesce function from the DataFrames package:

@rtransform! df :covariate = coalesce(:covariate, -99)

This will replace all missing values in the covariate column with -99.

5 Conclusion

XPT files are a crucial part of data sharing in the pharmaceutical industry. XPT files have advantages over csv and other file formats. An important advantage is the ability to preserve column labels, which may give additional understanding to the content of the data. They are used for regulatory submissions, data sharing, and long-term data retention. You can handle XPT files in Julia using the ReadStatTables.jl package and in R using the {haven} package. Remember to handle dates and datetimes correctly when reading XPT files. Also consider the treatment of missing values.