using ReadStatTables
Use of XPT Files in Pumas and R
XPT files facilitate the transfer of data across diverse systems, and serve as a nearly universal data language. Originating from the SAS Institute, they adhere to the SAS Transport (Version 5) format established in the 1980s, ensuring data integrity and compatibility across different software platforms.
SAS Transport Version 5, or XPT files, uses a different floating point format than the default in all consumer-grade computers, IEEE-754. The format that XPT files uses is the Hexadecimal Floating Point (HFP) by IBM.
You don’t need to worry about that since this is handled internally by the software implementations.
1 Importance of XPT Files
Regulatory Submissions: Regulatory agencies allow data to be submitted in XPT format. In particular, the US FDA allows XPT formatted files for submission of data as part of an Electronic Common Technical Document (eCTD).
The US Library of Congress maintains an archive of digital data formats with additional background information available here.
Data Sharing: The XPT file format enables seamless data sharing among stakeholders, including regulatory bodies, sponsors, and CROs (Contract Research Organizations).
Long-term Data Retention: XPT files are also used for archiving, ensuring data can be accessed and understood years after a study’s completion.
2 How Do They Work?
An XPT file is comprised of two main components:
- Header Information: Metadata about the dataset, including name, label, version, and the origin SAS library.
- Data: The dataset itself, structured for readability across various software systems.
3 XPT Support for Different Languages
We have XPT support in different programming languages:
- Julia: The
ReadStatTables.jl
package can read and write XPT files. - R: The
haven
package provides functions to read and write XPT files. - Python: The
sas7bdat
package can read and write XPT files.
In this tutorial, we’ll cover R and Julia examples. For Python, you can refer to the sas7bdat
package documentation.
3.1 Handling XPT Files in Julia/Pumas
Any Pumas app can read XPT files using the ReadStatTables.jl
package. First you’ll need to import the package:
Then you can read the XPT file using the readstat
function:
= readstat(joinpath(@__DIR__, "data", "iv_bolus_sd.xpt")) tb
As you can see, the readstat
function returns a ReadStatTable
object. But you can (and probably should) convert it to a DataFrame
:
using DataFrames
using DataFramesMeta
= DataFrame(tb)
df first(df, 5)
Row | id | time | conc | amt | route | dose |
---|---|---|---|---|---|---|
Float64 | Float64 | Float64? | Float64? | String3 | Float64 | |
1 | 1.0 | 0.0 | missing | 250.0 | iv | 250.0 |
2 | 1.0 | 0.0 | 8.38202 | missing | iv | 250.0 |
3 | 1.0 | 0.5 | 7.6079 | missing | iv | 250.0 |
4 | 1.0 | 1.0 | 8.73121 | missing | iv | 250.0 |
5 | 1.0 | 2.0 | 5.84823 | missing | iv | 250.0 |
Now you can use the df
object as you would with any other DataFrame
in Pumas.
3.1.1 Accessing Metadata
SAS supports metadata for XPT files. This metadata includes information about the dataset, such as the column names, labels, and types.
This information is also available in the DataFrames.jl
package with the [delete|empty][col]metadata[keys][!]
family of functions that operates on the metadata of DataFrame
s. Please check the Metadata DataFrames.jl
documentation for more information.
You can access the metadata of the XPT file using the colmetadata
function:
= colmetadata(tb) tb_metadata
ColMetaIterator{ReadStatColMeta} with 6 entries:
:id => ReadStatColMeta(id, BEST12)
:time => ReadStatColMeta(time, BEST12)
:conc => ReadStatColMeta(conc, BEST12)
:amt => ReadStatColMeta(amt, BEST12)
:route => ReadStatColMeta(route, )
:dose => ReadStatColMeta(dose, BEST12)
This will return a Dict
with the column names as keys and the metadata as values. Notice that the returned metadata values are Dict
s themselves with the metadata keys as keys and the metadata values as values.
:conc] tb_metadata[
ReadStatColMeta:
label => conc
format => BEST12
type => READSTAT_TYPE_DOUBLE
value label => BEST12
storage width => 8
display width => 12
measure => READSTAT_MEASURE_UNKNOWN
alignment => READSTAT_ALIGNMENT_LEFT
3.1.2 Saving XPT Files in Julia/Pumas
To save a DataFrame
as an XPT file, you can use the writestat
function. First, let’s load a dataset from PharmaDatasets.jl
:
using PharmaDatasets
= dataset("SDTM/DU101001/ex.xpt")
df first(df, 5)
Row | STUDYID | DOMAIN | USUBJID | EXSEQ | EXTRT | EXCAT | EXDOSE | EXDOSU | EXDOSFRM | EXROUTE | VISITNUM | VISIT | EPOCH | EXSTDTC | EXENDTC | EXSTDY | EXENDY |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
String15 | String3 | String31 | Float64 | String15 | String31 | Float64 | String3 | String15 | String15 | Float64 | String | String15 | String31 | String31 | Float64? | Float64? | |
1 | DU101001 | EX | DU101001-00011-00001 | 1.0 | DRUG101 | STUDY DRUG ADMINISTRATION | 200.0 | mg | INJECTION | INTRAVENOUS | 3001.0 | Arm B Treatment Cycle 1 Day 1 | TREATMENT | 2019-07-27T12:34 | 2019-07-27T12:34 | 1.0 | 1.0 |
2 | DU101001 | EX | DU101001-00011-00001 | 2.0 | DRUG101 | STUDY DRUG ADMINISTRATION | 200.0 | mg | INJECTION | INTRAVENOUS | 3002.0 | Arm B Treatment Cycle 2 Day 1 | TREATMENT | 2019-08-17T18:22 | 2019-08-17T18:22 | 22.0 | 22.0 |
3 | DU101001 | EX | DU101001-00011-00001 | 3.0 | DRUG101 | STUDY DRUG ADMINISTRATION | 200.0 | mg | INJECTION | INTRAVENOUS | 3003.0 | Arm B Treatment Cycle 3 Day 1 | TREATMENT | 2019-09-07T08:15 | 2019-09-07T08:15 | 43.0 | 43.0 |
4 | DU101001 | EX | DU101001-00011-00001 | 4.0 | DRUG101 | STUDY DRUG ADMINISTRATION | 200.0 | mg | INJECTION | INTRAVENOUS | 3004.0 | Arm B Treatment Cycle 4 Day 1 | TREATMENT | 2019-09-28T13:46 | 2019-09-28 | 64.0 | 64.0 |
5 | DU101001 | EX | DU101001-00011-00001 | 5.0 | DRUG101 | STUDY DRUG ADMINISTRATION | 200.0 | mg | INJECTION | INTRAVENOUS | 3005.0 | Arm B Treatment Cycle 5 Day 1 | TREATMENT | 2019-10-19T02:29 | 2019-10-19T02:29 | 85.0 | 85.0 |
I am choosing the ex.xpt
file from the SDTM/DU101001
dataset because it was originally an XPT file, and it has dates and datetime objects that are a little tricky to handle in XPT. We’ll use this dataset later to demonstrate how to parse dates and datetimes.
Now we’re going to save this DataFrame
as an XPT file:
writestat(joinpath(@__DIR__, "data", "ex.xpt"), df)
3.1.3 Reading Dates and Datetimes
When reading XPT files, you may encounter issues with dates and datetimes. Since XPT files are an old format, they don’t have a native date or datetime type. Instead, the date and time values are stored as the numbers of periods elapsed since a reference date or time point (epoch) chosen by the software. Therefore, knowing the reference date/time and the length of a single period is sufficient for uncovering the represented date/time values for a given format.
If a variable is in a date/time format that can be recognized, the values will be displayed as Julia Date
or DateTime
. You can find more about these internal translations in the ReadStatTables.jl
documentation.
Let’s showcase this with the ex.xpt
dataset we saved earlier:
= readstat(joinpath(@__DIR__, "data", "ex.xpt")) tb
In this dataset we have two columns, EXSTDTC
and EXENDTC
, that are dates:
colmetadata(tb, :EXSTDTC) ReadStatTables.
ReadStatColMeta:
label => Start Date/Time of Treatment
format => 31
type => READSTAT_TYPE_STRING
value label => 31
storage width => 31
display width => 31
measure => READSTAT_MEASURE_UNKNOWN
alignment => READSTAT_ALIGNMENT_LEFT
colmetadata(tb, :EXENDTC) ReadStatTables.
ReadStatColMeta:
label => End Date/Time of Treatment
format => 31
type => READSTAT_TYPE_STRING
value label => 31
storage width => 31
display width => 31
measure => READSTAT_MEASURE_UNKNOWN
alignment => READSTAT_ALIGNMENT_LEFT
As you can see by the metadata labels, these columns represent the Start and End Date/Time of the study, respectively.
However, they are automatically read as String
s by ReadStatTables.jl
.
|> typeof tb.EXENDTC
Vector{String31} (alias for Array{String31, 1})
|> typeof tb.EXSTDTC
Vector{String31} (alias for Array{String31, 1})
Hence, we need to convert them to Date
and/or DateTime
objects. This can be accomplished with the helper functions in the Dates
Julia’s standard library module:
using Dates
using DataFrames
using DataFramesMeta
= DataFrame(tb); df
@rtransform df :EXSTDTC_date = DateTime(:EXSTDTC)
ArgumentError: ArgumentError("Cannot parse an empty string as a DateTime")
ArgumentError: Cannot parse an empty string as a DateTime
Stacktrace:
[1] parse(::Type{DateTime}, s::String31, df::DateFormat{Symbol("yyyy-mm-dd\\THH:MM:SS.s"), Tuple{Dates.DatePart{'y'}, Dates.Delim{Char, 1}, Dates.DatePart{'m'}, Dates.Delim{Char, 1}, Dates.DatePart{'d'}, Dates.Delim{Char, 1}, Dates.DatePart{'H'}, Dates.Delim{Char, 1}, Dates.DatePart{'M'}, Dates.Delim{Char, 1}, Dates.DatePart{'S'}, Dates.Delim{Char, 1}, Dates.DatePart{'s'}}})
@ Dates ~/run/_work/_tool/julia/1.10.5/x64/share/julia/stdlib/v1.10/Dates/src/parse.jl:201
[2] DateTime (repeats 2 times)
@ ~/run/_work/_tool/julia/1.10.5/x64/share/julia/stdlib/v1.10/Dates/src/io.jl:568 [inlined]
[3] iterate
@ ./generator.jl:47 [inlined]
[4] collect_to!
@ ./array.jl:892 [inlined]
[5] collect_to_with_first!
@ ./array.jl:870 [inlined]
[6] _collect(c::Vector{String31}, itr::Base.Generator{Vector{String31}, Type{DateTime}}, ::Base.EltypeUnknown, isz::Base.HasShape{1})
@ Base ./array.jl:864
[7] collect_similar(cont::Vector{String31}, itr::Base.Generator{Vector{String31}, Type{DateTime}})
@ Base ./array.jl:763
[8] map(f::Type, A::Vector{String31})
@ Base ./abstractarray.jl:3285
[9] (::ByRow{DataType})(cols::Vector{String31})
@ Tables ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/Tables/8p03y/src/utils.jl:229
[10] _transformation_helper(df::DataFrame, col_idx::Int64, ::Base.RefValue{Any})
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:562
[11] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
[12] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
[13] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
[14] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
[15] select
@ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303 [inlined]
[16] transform(df::DataFrame, args::Any)
@ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383
[17] top-level scope
@ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/macros.jl:1599
As you can see this errors out because we have some missing data represented as empty strings, i.e. ""
:
@chain df begin
@rsubset :EXSTDTC == ""
@combine :nrow = length(:EXSTDTC)
end
Row | nrow |
---|---|
Int64 | |
1 | 33 |
We can fix this by replacing the empty strings with missing
:
@rtransform! df :EXSTDTC = :EXSTDTC == "" ? missing : :EXSTDTC
@rtransform! df @passmissing :EXSTDTC_date = DateTime(:EXSTDTC)
@passmissing
Here we use the @passmissing
macro to ensure that the DateTime
constructor can handle missing
values. Please check the Handling NAs
and Missing Values tutorial for more information on @passmissing
.
Now let’s check the EXSTDTC_date
column:
@chain df begin
@select :EXSTDTC_date
describeend
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Nothing | DateTime | DateTime | DateTime | Int64 | Union | |
1 | EXSTDTC_date | 2019-01-08T00:00:00 | 2019-12-16T10:26:00 | 2020-12-29T09:18:00 | 33 | Union{Missing, DateTime} |
It works! You can do the same for the EXENDTC
column and any other date or datetime columns you may have in your dataset.
If you need to handle dates and datetimes in your dataset, don’t forget to check the Handling Dates and Times tutorial
3.2 Handling XPT Files in R
You can also handle XPT files in R using tidyverse’s {haven}
package.
ReadStat
C Library
Under the hood both ReadStatTables.jl
and {haven}
use the ReadStat
C library.
Hence, both packages have equivalent features.
Let’s load the {haven}
package which exposes a read_xpt
function:
library(haven)
<- here::here("docs", "src", "quarto", "data_wrangling", "ex.xpt")
full_path <- read_xpt(full_path) tb
# A tibble: 2,878 × 17
STUDYID DOMAIN USUBJID EXSEQ EXTRT EXCAT EXDOSE EXDOSU EXDOSFRM EXROUTE
<chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
1 DU101001 EX DU101001-000… 1 DRUG… STUD… 200 mg INJECTI… INTRAV…
2 DU101001 EX DU101001-000… 2 DRUG… STUD… 200 mg INJECTI… INTRAV…
3 DU101001 EX DU101001-000… 3 DRUG… STUD… 200 mg INJECTI… INTRAV…
4 DU101001 EX DU101001-000… 4 DRUG… STUD… 200 mg INJECTI… INTRAV…
5 DU101001 EX DU101001-000… 5 DRUG… STUD… 200 mg INJECTI… INTRAV…
6 DU101001 EX DU101001-000… 6 DRUG… STUD… 200 mg INJECTI… INTRAV…
7 DU101001 EX DU101001-000… 7 DRUG… STUD… 200 mg INJECTI… INTRAV…
8 DU101001 EX DU101001-000… 8 DRUG… STUD… 200 mg INJECTI… INTRAV…
9 DU101001 EX DU101001-000… 9 DRUG… STUD… 200 mg INJECTI… INTRAV…
10 DU101001 EX DU101001-000… 10 DRUG… STUD… 200 mg INJECTI… INTRAV…
# ℹ 2,868 more rows
# ℹ 7 more variables: VISITNUM <dbl>, VISIT <chr>, EPOCH <chr>, EXSTDTC <chr>,
# EXENDTC <chr>, EXSTDY <dbl>, EXENDY <dbl>
# ℹ Use `print(n = ...)` to see more rows
As you can see this returns a tibble
.
You can also get the metadata from a XPT file with str
:
str(tb)
tibble [2,878 × 17] (S3: tbl_df/tbl/data.frame)
$ STUDYID : chr [1:2878] "DU101001" "DU101001" "DU101001" "DU101001" ...
..- attr(*, "label")= chr "Study Identifier"
..- attr(*, "format.sas")= chr "15"
$ DOMAIN : chr [1:2878] "EX" "EX" "EX" "EX" ...
..- attr(*, "label")= chr "Domain Abbreviation"
..- attr(*, "format.sas")= chr "9"
$ USUBJID : chr [1:2878] "DU101001-00011-00001" "DU101001-00011-00001" "DU101001-00011-00001" "DU101001-00011-00001" ...
..- attr(*, "label")= chr "Unique Subject Identifier"
..- attr(*, "format.sas")= chr "31"
$ EXSEQ : num [1:2878] 1 2 3 4 5 6 7 8 9 10 ...
..- attr(*, "label")= chr "Sequence Number"
..- attr(*, "format.sas")= chr "9"
$ EXTRT : chr [1:2878] "DRUG101" "DRUG101" "DRUG101" "DRUG101" ...
..- attr(*, "label")= chr "Name of Treatment"
..- attr(*, "format.sas")= chr "15"
$ EXCAT : chr [1:2878] "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" "STUDY DRUG ADMINISTRATION" ...
..- attr(*, "label")= chr "Category of Treatment"
..- attr(*, "format.sas")= chr "31"
$ EXDOSE : num [1:2878] 200 200 200 200 200 200 200 200 200 200 ...
..- attr(*, "label")= chr "Dose"
..- attr(*, "format.sas")= chr "9"
$ EXDOSU : chr [1:2878] "mg" "mg" "mg" "mg" ...
..- attr(*, "label")= chr "Dose Units"
..- attr(*, "format.sas")= chr "9"
$ EXDOSFRM: chr [1:2878] "INJECTION" "INJECTION" "INJECTION" "INJECTION" ...
..- attr(*, "label")= chr "Dose Form"
..- attr(*, "format.sas")= chr "15"
$ EXROUTE : chr [1:2878] "INTRAVENOUS" "INTRAVENOUS" "INTRAVENOUS" "INTRAVENOUS" ...
..- attr(*, "label")= chr "Route of Administration"
..- attr(*, "format.sas")= chr "15"
$ VISITNUM: num [1:2878] 3001 3002 3003 3004 3005 ...
..- attr(*, "label")= chr "Visit Number"
..- attr(*, "format.sas")= chr "9"
$ VISIT : chr [1:2878] "Arm B Treatment Cycle 1 Day 1" "Arm B Treatment Cycle 2 Day 1" "Arm B Treatment Cycle 3 Day 1" "Arm B Treatment Cycle 4 Day 1" ...
..- attr(*, "label")= chr "Visit Name"
..- attr(*, "format.sas")= chr "30"
$ EPOCH : chr [1:2878] "TREATMENT" "TREATMENT" "TREATMENT" "TREATMENT" ...
..- attr(*, "label")= chr "Epoch"
..- attr(*, "format.sas")= chr "15"
$ EXSTDTC : chr [1:2878] "2019-07-27T12:34" "2019-08-17T18:22" "2019-09-07T08:15" "2019-09-28T13:46" ...
..- attr(*, "label")= chr "Start Date/Time of Treatment"
..- attr(*, "format.sas")= chr "31"
$ EXENDTC : chr [1:2878] "2019-07-27T12:34" "2019-08-17T18:22" "2019-09-07T08:15" "2019-09-28" ...
..- attr(*, "label")= chr "End Date/Time of Treatment"
..- attr(*, "format.sas")= chr "31"
$ EXSTDY : num [1:2878] 1 22 43 64 85 106 127 148 169 190 ...
..- attr(*, "label")= chr "Study Day of Start of Treatment"
..- attr(*, "format.sas")= chr "9"
$ EXENDY : num [1:2878] 1 22 43 64 85 106 127 148 169 190 ...
..- attr(*, "label")= chr "Study Day of End of Treatment"
..- attr(*, "format.sas")= chr "9"
3.2.1 Saving XPT Files in R
To save a tibble
as an XPT file, you can use the write_xpt
function:
write_xpt(tb, here::here("docs", "src", "quarto", "data_wrangling", "ex.xpt"))
3.2.2 Reading Dates and Datetimes
As described above for Julia, when reading XPT files in R, you may encounter issues with dates and datetimes. We can see in the metadata above that our datetime columns EXSTDTC
and EXENDTC
were parsed as a string vector, i.e. as char
vectors. Hence, we need to perform the same operation we did before but with tidyverse
now:
library(dplyr)
library(tidyr)
library(lubridate)
%>%
tb mutate(
EXSTDTC = if_else(EXSTDTC == "", NA_character_, EXSTDTC),
EXSTDTC_date = ymd_hm(EXSTDTC),
EXENDTC = if_else(EXENDTC == "", NA_character_, EXENDTC),
EXENDTC_date = ymd_hm(EXENDTC)
%>%
) select(EXSTDTC_date, EXENDTC_date) %>%
summarize(across(everything(), list(n = ~ sum(!is.na(.)))))
# A tibble: 1 × 2
EXSTDTC_date_n EXENDTC_date_n
<int> <int>
1 2703 2713
Please refer to the amazing {lubridate}
package on how to convert strings to dates/datetimes.
4 NONMEM: XPT to csv
If one is using NONMEM for data analysis, it is important to consider the issue of missing values when saving an XPT file as a CSV file for use as a NONMEM analysis dataset. Using the .
(dot) character value is fine for some variables, but often not for covariates, since NONMEM will interpret the .
(dot) as a zero (0
) in a covariate data item. This can cause issues in your model.
One recommendation is to use a very unusual value, such as -99
, to represent missing values in covariates. Then you can handle these values in your model code, in the $PK
block, for example.
In order to accomplish this, you can use the replace_na
function from the tidyr
package in R:
library(tidyr)
%>% replace_na(list(covariate = -99)) df
This will replace all NA
values in the covariate
column with -99
.
In Julia, you can use the coalesce
function from the DataFrames
package:
@rtransform! df :covariate = coalesce(:covariate, -99)
This will replace all missing
values in the covariate
column with -99
.
5 Conclusion
XPT files are a crucial part of data sharing in the pharmaceutical industry. XPT files have advantages over csv and other file formats. An important advantage is the ability to preserve column labels, which may give additional understanding to the content of the data. They are used for regulatory submissions, data sharing, and long-term data retention. You can handle XPT files in Julia using the ReadStatTables.jl
package and in R using the {haven}
package. Remember to handle dates and datetimes correctly when reading XPT files. Also consider the treatment of missing values.