Reading and Writing Data

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

Pumas Workflow

The first step of data analysis is to read data. Julia has a rich ecosystem of packages devoted to reading data. In this tutorial, we will cover 3 main file types:

  • CSV files
  • Excel files
  • SAS data files

These likely constitute the bulk of the file types that you will encounter doing data wrangling in Julia.

Note

If you find yourself with an exotic file type that you need to read data from, don't forget to check the JuliaData organization on GitHub. You might find just the right package for your needs.

Caution

Starting with the next tutorials, we'll use the PharmaDatasets.jl package to load our datasets.

1 ๐Ÿ“ CSV files with CSV.jl

CSV (Comma Separated Values) files are the most common data type you will probably encounter while reading data. CSV files have the .csv extension in the filename.

This file type has the characteristic that it does not use a proprietary protocol: the data is stored in a simple text file.

It uses lines to denote rows (or observations) and frequently the first line is used as a header for the column/variable names.

A specific character is used to separate values. This is often a comma (,) and that's why they are called comma separated values.

Note

Observe that different characters can be used to separate values.

In some European countries, and also in most Latin American countries, the decimal number system uses commas to denote decimals instead of the dot used in the Anglo-Saxon number system, which in turn is adopted by countries like the United States of America, the United Kingdom, and Canada. Thus, \(\pi\) rounded to two decimal places, for example, would be portrayed as \(3,14\) instead of \(3.14\).

So, in some countries, the CSV file uses the semicolon (;) as the character to separate values. Despite being "semicolon separated values", the file extension remains .csv.

Furthermore, the "tab" character (\t) is also used, but the file type changes to TSV (Tab Separated Values) and has the .tsv extension.

In order to read a CSV file into Julia you'll need to first import CSV.jl into your environment. This is done with the using statement:

using CSV

1.1 Reading a Single CSV File

CSV.jl has two main functions to parse and read text-delimited data:

  • CSV.read(): passes the input to a valid sink type that will be the recipient of the input data, such as a DataFrame, without making extra copies.
  • CSV.File()¹: materializes the input to a valid sink type. This means that the input will be copied before being passed to a valid sink.

Our advice is to almost exclusively use CSV.read(). You'll probably never need to read a CSV file with CSV.File().

Also, the most frequent sink you'll use will be a DataFrame (which we will cover in the next tutorials). So, to read a CSV file into a DataFrame, you'll need to pass CSV.read() a file path followed by a sink type.

Since we need DataFrames.jl to use DataFrames as sink, let's import it:

using DataFrames
df = CSV.read("data/iv_sd_demogs.csv", DataFrame)
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Tip

Note that we are using a string to indicate the relative file path of the desired CSV file.

A much better approach would be to create a file path that is robust to different filesystems and also operating systems. This means that if a user using Windows shares their Julia script with another user using macOS, the script won't fail. This is because macOS, for instance, follows a pattern that uses forward slashes, /, for file paths (example: /Users/JuliaUser/project/src/script.jl), whereas Windows uses backslashes, \, for file paths (example: C:\user\JuliaUser\project\src\script.jl).

This can be accomplished with filesystem best practices. Specifically, three functions/commands that are robust to different operating systems are: @__FILE__, dirname(), and joinpath().

  • @__FILE__: returns the file path of the current Julia script.
  • dirname(): returns the directory of a file path.
  • joinpath(): join components to create a full file or directory path.

When we combine those three functions/commands, we can create a robust file path that will not break our Julia script if we share it or run it on a different operating system.

Suppose you have a script inside a src/ folder and you need to access a CSV file in the data/ folder which is located in the same directory as the src/ folder. You can do this with the following Julia command:

my_file = joinpath(dirname(@__FILE__), "..", "data", "file.csv")

The code above extracts the directory of the current Julia script with dirname(@__FILE__) and then joins this with "..", which represents the upper directory, "data", the data/ directory, and "file.csv", the desired CSV file inside the data/ directory.

For more information, see this section of the Julia official documentation.

1.2 Custom CSV File Specifications

Most of the time, a plain CSV.read() without any customized options will correctly parse and read the desired file. This is due to the guessing that CSV.jl does under the hood to infer the specifications that the file is using.

But CSV.jl might sometimes fail to correctly guess the specifications of the underlying file. This is where you'll have to pass the specifications yourself. Fortunately, CSV.read()² has the following arguments:

  • delim: either a character, e.g. ';', or string, e.g. "\t", that will be used to indicate how the values are delimited in a file. If not specified, CSV.jl will try to guess and detect the delimiter from the first 10 rows of the file.
  • decimal: a character that will be used to parse floating point numbers. If ',', then 3,14 will be parsed as a float. The default is a dot, '.'.

See this example where we are reading a CSV file that uses the semicolon as the delimiter and commas as decimals. This is a common specification when reading EU/LatAm CSV files.

# the first two lines of the CSV file
readlines("data/iv_sd_demogs_eu.csv")[1:2]
2-element Vector{String}:
 "ID;AGE;WEIGHT;SCR;ISMALE;eGFR"
 "1;34,823;38,212;1,1129;0;42,635"
# using the keyword arguments delim and decimal
df_eu = CSV.read("data/iv_sd_demogs_eu.csv", DataFrame; delim = ';', decimal = ',')
first(df_eu, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Tip

Sometimes our CSV files have comments or other information in their first few lines. These lines do not constitute the data but might be hard-coded metadata or other important information.

To handle these situations, CSV.File() and CSV.read() have the keyword argument skipto, which accepts an integer specifying the row number where the data starts.

Similarly, if this kind of information or metadata is located at the bottom of the CSV file, you can use the keyword argument footerskip, which also takes an integer and specifies the number of rows to ignore at the end of the file.
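As a quick sketch of both keyword arguments (the file contents below are made up for illustration, and an in-memory IOBuffer stands in for a file on disk):

```julia
using CSV, DataFrames

# made-up file contents: two metadata lines at the top,
# the header on row 3, data on rows 4-5, and one footer note
raw = """
# study: example
# exported: 2021-12-12
ID,CONC
1,13.026
2,14.984
end of export
"""

df = CSV.read(
    IOBuffer(raw),
    DataFrame;
    header = 3,      # the column names are on row 3
    skipto = 4,      # the data starts at row 4
    footerskip = 1,  # ignore the last row of the file
)
```

The resulting df has just the two data rows, with the metadata and footer lines ignored.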

1.3 Specifying Custom Types for Columns while Reading a CSV File

Sometimes you'll want to overrule the automatic type detection that CSV.jl will infer for the columns present in the CSV file. This can be done with the keyword argument types. It accepts several inputs, but the easiest and most customizable is a Julia dictionary, where the keys are either an integer (for column indices) or a string/symbol (for column names) and the values are the desired types.

For example, suppose we want to have the columns :ID and :ISMALE to be parsed as String and Bool respectively:

df_custom_types = CSV.read(
    "data/iv_sd_demogs.csv",
    DataFrame;
    types = Dict(:ID => String, :ISMALE => Bool),
)
first(df_custom_types, 5)
5×6 DataFrame
 Row │ ID      AGE      WEIGHT   SCR      ISMALE  eGFR
     │ String  Float64  Float64  Float64  Bool    Float64
─────┼────────────────────────────────────────────────────
   1 │ 1        34.823   38.212   1.1129   false   42.635
   2 │ 2        32.765   74.838   0.8846    true  126.0
   3 │ 3        35.974   37.303   1.1004    true   48.981
   4 │ 4        38.206   32.969   1.1972    true   38.934
   5 │ 5        33.559   47.139   1.5924   false   37.198

1.4 Selecting and Dropping Columns while Reading a CSV File

By default, CSV.read() will parse and read all the columns in a CSV file. If you want to select or drop specific columns you can do so with the following arguments. Both accept a vector of either strings or integers that represents the names and indices, respectively, of the columns to select or drop.

  • select
  • drop

See the following examples. We are reading the same CSV file as before but now we are selecting and dropping some columns either by name or index.

# reading the same CSV file but now selecting some columns
# with a vector of strings as names

df_select_names = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = ["ID", "AGE"])
first(df_select_names, 5)
5×2 DataFrame
 Row │ ID     AGE
     │ Int64  Float64
─────┼────────────────
   1 │     1   34.823
   2 │     2   32.765
   3 │     3   35.974
   4 │     4   38.206
   5 │     5   33.559
# reading the same CSV file but now selecting some columns
# with a vector of integers as indices

df_select_idxs = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = [1, 2])
first(df_select_idxs, 5)
5×2 DataFrame
 Row │ ID     AGE
     │ Int64  Float64
─────┼────────────────
   1 │     1   34.823
   2 │     2   32.765
   3 │     3   35.974
   4 │     4   38.206
   5 │     5   33.559
Tip

Notice that we can be clever with the indices. If we want to select a range of indices, we can materialize it into a vector using collect():

# reading the same CSV file but now selecting some columns
# with a vector of integers as indices

df_select_idxs2 = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = collect(1:3))
first(df_select_idxs2, 5)
5×3 DataFrame
 Row │ ID     AGE      WEIGHT
     │ Int64  Float64  Float64
─────┼─────────────────────────
   1 │     1   34.823   38.212
   2 │     2   32.765   74.838
   3 │     3   35.974   37.303
   4 │     4   38.206   32.969
   5 │     5   33.559   47.139

Now letโ€™s explore some examples with the drop keyword argument:

# reading the same CSV file but now dropping some columns
# with a vector of strings as names

df_drop_names = CSV.read("data/iv_sd_demogs.csv", DataFrame; drop = ["SCR", "eGFR"])
first(df_drop_names, 5)
5×4 DataFrame
 Row │ ID     AGE      WEIGHT   ISMALE
     │ Int64  Float64  Float64  Int64
─────┼─────────────────────────────────
   1 │     1   34.823   38.212       0
   2 │     2   32.765   74.838       1
   3 │     3   35.974   37.303       1
   4 │     4   38.206   32.969       1
   5 │     5   33.559   47.139       0
# reading the same CSV file but now dropping some columns
# with a vector of integers as indices

df_drop_idx = CSV.read("data/iv_sd_demogs.csv", DataFrame; drop = [4, 6])
first(df_drop_idx, 5)
5×4 DataFrame
 Row │ ID     AGE      WEIGHT   ISMALE
     │ Int64  Float64  Float64  Int64
─────┼─────────────────────────────────
   1 │     1   34.823   38.212       0
   2 │     2   32.765   74.838       1
   3 │     3   35.974   37.303       1
   4 │     4   38.206   32.969       1
   5 │     5   33.559   47.139       0
Tip

You can also pass a vector of Symbols to the select and drop arguments. So feel free to choose whatever works best for you.
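For instance (a sketch using made-up in-memory data via an IOBuffer instead of a file on disk):

```julia
using CSV, DataFrames

# made-up two-row data standing in for a CSV file on disk
raw = "ID,AGE,WEIGHT\n1,34.823,38.212\n2,32.765,74.838"

# a vector of Symbols works just like a vector of strings or indices
df_sym = CSV.read(IOBuffer(raw), DataFrame; select = [:ID, :AGE])
```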

1.5 Specifying Missing Values in a CSV File

By default, every blank value in a CSV file will be parsed as a missing value.

Sometimes missing values will be hardcoded as some other value instead of a blank value. For example, it could be a dot "." or a string "NA". In fact, any value can be specified to represent missing.

Note

We cover how to handle missing/NA values extensively in our Handling NAs and Missing Values tutorial.

Take a look at this dataset. It has several occurrences of "." which confuses Julia into parsing these columns as String types:

missing_df = CSV.read("data/iv_sd_final_target_format.csv", DataFrame)
first(missing_df, 5)
5×11 DataFrame
 Row │ ID     TIME_AFTER_DOSING  CONC      AMT      AGE      WEIGHT   SCR      ISMALE  eGFR     EVID   CMT
     │ Int64  Float64            String15  String3  Float64  Float64  Float64  Int64   Float64  Int64  String1
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1               0.0   .         100       34.823   38.212   1.1129       0   42.635      1  1
   2 │     1               0.25  13.026    .         34.823   38.212   1.1129       0   42.635      0  .
   3 │     1               0.5   14.984    .         34.823   38.212   1.1129       0   42.635      0  .
   4 │     1               0.75  14.16     .         34.823   38.212   1.1129       0   42.635      0  .
   5 │     1               1.0   19.316    .         34.823   38.212   1.1129       0   42.635      0  .

Fortunately, it is very easy to specify which values CSV.jl should treat as missing values. You just have to use the keyword argument missingstring:

missing_df_corrected =
    CSV.read("data/iv_sd_final_target_format.csv", DataFrame; missingstring = ".")
first(missing_df_corrected, 5)
5×11 DataFrame
 Row │ ID     TIME_AFTER_DOSING  CONC      AMT      AGE      WEIGHT   SCR      ISMALE  eGFR     EVID   CMT
     │ Int64  Float64            Float64?  Int64?   Float64  Float64  Float64  Int64   Float64  Int64  Int64?
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1               0.0    missing      100   34.823   38.212   1.1129       0   42.635      1        1
   2 │     1               0.25    13.026  missing   34.823   38.212   1.1129       0   42.635      0  missing
   3 │     1               0.5     14.984  missing   34.823   38.212   1.1129       0   42.635      0  missing
   4 │     1               0.75    14.16   missing   34.823   38.212   1.1129       0   42.635      0  missing
   5 │     1               1.0     19.316  missing   34.823   38.212   1.1129       0   42.635      0  missing
Tip

If you have more than one value representing missing values, the keyword argument missingstring can also accept a vector of strings.
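Here is a sketch of that (with made-up in-memory data, where missing values are hardcoded both as "." and as "NA"):

```julia
using CSV, DataFrames

# made-up data with two different hardcoded missing-value markers
raw = "ID,CONC\n1,.\n2,NA\n3,13.026"

# a vector of strings treats every listed value as missing
df_na = CSV.read(IOBuffer(raw), DataFrame; missingstring = [".", "NA"])
```

The CONC column is then parsed as Float64 with missing entries instead of as a String column.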

1.6 Reading Multiple CSV Files

You might encounter data that is dispersed across several CSV files. The CSV file organizing logic could be something temporal, e.g. results_jan_2021.csv, or spatial, e.g. samples_boston.csv, or any other logic, e.g. results_john.csv.

With CSV.jl, you can simply pass a vector of strings to CSV.read() or CSV.File() and it will then concatenate all the CSV file paths into a single unit and materialize the associated files into the desired sink, such as a DataFrame.

If you have any information that is encoded in the CSV file name, it will be lost if you do not use the source keyword argument. When passing a vector of strings to CSV.read(), you can pass source as a string or symbol. CSV.jl will then create a column named after the string/symbol you've passed, which captures the file name of the originating CSV file for every observation in the final data passed to your sink:

df_multiple_csv = CSV.read(files, DataFrame; source = "DAY")
first(df_multiple_csv, 5)

where files is a vector of CSV file paths as strings.

All the arguments that we've used before (delim, decimal, select, and drop) work fine when CSV.read() receives a vector of strings representing multiple CSV file paths as input. So feel free to use them if necessary.

In the example below, we do just that with the ["data/multiple_csv/12-Dec-21.csv", "data/multiple_csv/13-Dec-21.csv"] as the vector of strings.

files = filter(x -> endswith(x, ".csv"), readdir("data/multiple_csv/"; join = true))
2-element Vector{String}:
 "data/multiple_csv/12-Dec-21.csv"
 "data/multiple_csv/13-Dec-21.csv"
df_multiple_csv = CSV.read(files, DataFrame; source = "DAY")
first(df_multiple_csv, 5)
5×3 DataFrame
 Row │ ID     CONC      DAY
     │ Int64  Float64?  String
─────┼─────────────────────────────────────────────────
   1 │     1   missing  data/multiple_csv/12-Dec-21.csv
   2 │     1    13.026  data/multiple_csv/12-Dec-21.csv
   3 │     1    14.984  data/multiple_csv/12-Dec-21.csv
   4 │     1    14.16   data/multiple_csv/12-Dec-21.csv
   5 │     1    19.316  data/multiple_csv/12-Dec-21.csv

1.7 Writing CSV Files

In order to write CSV files, you'll use the CSV.write() function which can be used in two ways:

  1. By passing a file path as a string as the first argument and a table (such as a DataFrame) as the second argument. For example:

    CSV.write("data/my_file.csv", my_df)

    where my_df is a DataFrame.

  2. By "piping" the table (such as a DataFrame) into the CSV.write() function and also specifying a file path as a string as the first argument:

    my_df |> CSV.write("data/my_file.csv")

    where my_df is a DataFrame.

Tip

All the keyword arguments that we've used so far can be used to write a specific CSV file format. For example:

my_df |> CSV.write("data/my_file.csv"; delim = ';', decimal = ',')

will create an EU/LatAm CSV file.

2 ๐Ÿ“ Excel Files with XLSX.jl

Excel files are also ubiquitous as a data file format. Commonly used file extensions are .xls and .xlsx.

Caution

Prefer .xlsx over .xls files. First, .xlsx files are compressed by default, so they have smaller file sizes than .xls files. Second, and most important, .xls files only support a maximum of 65,536 (\(2^{16}\)) rows and 256 (\(2^8\)) columns, whereas .xlsx files support a maximum of 1,048,576 (\(2^{20}\)) rows and 16,384 (\(2^{14}\)) columns. Finally, whenever the maximum capacity is exceeded, .xls files fail silently.

In order to read an Excel file into Julia you'll need to first import XLSX.jl into your environment. This is done with the using statement:

using XLSX

2.1 Reading Excel Files

XLSX.jl has two main functions to read and write Excel files:

  1. XLSX.readxlsx(file.xlsx): reads the whole file.xlsx and returns a custom type named XLSXFile, which contains a summary of all the sheets present in the file along with their respective cell ranges. XLSXFile works pretty much like a Julia dictionary: you can index it with a specific sheet name and it will return another custom type named Worksheet. But don't worry! Just call XLSX.eachtablerow() on a worksheet and pipe it to a DataFrame and you'll have a DataFrame for your desired indexed sheet. See an example below:
xf = XLSX.readxlsx("data/iv_sd_demogs.xlsx")
XLSXFile("data/iv_sd_demogs.xlsx") containing 1 Worksheet
            sheetname size          range        
-------------------------------------------------
         iv_sd_demogs 101x6         A1:F101      
df = DataFrame(XLSX.eachtablerow(xf["iv_sd_demogs"]))
first(df, 5)
5×6 DataFrame
 Row │ ID   AGE     WEIGHT  SCR     ISMALE  eGFR
     │ Any  Any     Any     Any     Any     Any
─────┼────────────────────────────────────────────
   1 │ 1    34.823  38.212  1.1129  0       42.635
   2 │ 2    32.765  74.838  0.8846  1       126.0
   3 │ 3    35.974  37.303  1.1004  1       48.981
   4 │ 4    38.206  32.969  1.1972  1       38.934
   5 │ 5    33.559  47.139  1.5924  0       37.198
  2. XLSX.readtable(file.xlsx, sheet): reads a specific sheet of the file.xlsx. For the second argument, sheet, you can pass either a string with the sheet name or an integer representing the sheet index. See the following example:
df = DataFrame(XLSX.readtable("data/iv_sd_demogs.xlsx", 1))
first(df, 5)
5×6 DataFrame
 Row │ ID   AGE     WEIGHT  SCR     ISMALE  eGFR
     │ Any  Any     Any     Any     Any     Any
─────┼────────────────────────────────────────────
   1 │ 1    34.823  38.212  1.1129  0       42.635
   2 │ 2    32.765  74.838  0.8846  1       126.0
   3 │ 3    35.974  37.303  1.1004  1       48.981
   4 │ 4    38.206  32.969  1.1972  1       38.934
   5 │ 5    33.559  47.139  1.5924  0       37.198

2.2 Inferring Excel Column Types with infer_eltypes

You might have noticed that all the columns of the dataframes we created from Excel files have the type Any. This is the default behavior of both XLSX.eachtablerow() and XLSX.readtable(). You can change this by passing the keyword argument infer_eltypes = true to XLSX.readtable():

df = DataFrame(XLSX.readtable("data/iv_sd_demogs.xlsx", 1; infer_eltypes = true))
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Caution

XLSX.jl does not have keyword arguments or options to specify missing values. By default, any Excel cell that is blank will be considered missing.

2.3 Writing Excel Files

To write a DataFrame as an Excel file, you can use the XLSX.writetable() function. Be aware that this function accepts the data (a vector of columns) and the column names as separate arguments. These can be obtained with collect(eachcol(df)) and names(df), where df is your dataframe. For example:

XLSX.writetable("filename.xlsx", collect(eachcol(df)), names(df))

Now if you want to write data into separate worksheets, you can do so by passing keyword arguments named after each sheet, each holding a tuple of data and column names. For example:

XLSX.writetable(
    "filename.xlsx",
    sheet_1 = (collect(eachcol(df1)), names(df1)),
    sheet_2 = (collect(eachcol(df2)), names(df2)),
)
Note

names, like the same function in dplyr, will return a vector of strings with the column names of the underlying dataframe.

3 ๐Ÿ“ SAS Data Files with ReadStatTables.jl

We can read SAS files with the package ReadStatTables.jl. We will cover two of the most common SAS data file extensions:

  1. .sas7bdat files
  2. .xpt files
Note

ReadStatTables.jl can read, as the name suggests, files from different statistical software such as Stata, SPSS, SAS, and more.

If you need to import data in formats other than the ones above, don't hesitate to check the ReadStatTables.jl documentation.

First, let's load ReadStatTables.jl:

using ReadStatTables

ReadStatTables.jl has the readstat() function, which takes a file path and returns a ReadStatTable object:

tb = readstat("data/iv_bolus_sd.sas7bdat")
300×6 ReadStatTable:
 Row โ”‚      id     time       conc       amt                  route     dose
     โ”‚ Float64  Float64   Float64?  Float64?  InlineStrings.String3  Float64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1.0      0.0    missing     250.0                     iv    250.0
   2 โ”‚     1.0      0.0    8.38202   missing                     iv    250.0
   3 โ”‚     1.0      0.5     7.6079   missing                     iv    250.0
   4 โ”‚     1.0      1.0    8.73121   missing                     iv    250.0
   5 โ”‚     1.0      2.0    5.84823   missing                     iv    250.0
   6 โ”‚     1.0      6.0    2.25315   missing                     iv    250.0
   7 โ”‚     1.0     10.0    1.17591   missing                     iv    250.0
   8 โ”‚     1.0     16.0   0.346308   missing                     iv    250.0
  โ‹ฎ  โ”‚    โ‹ฎ        โ‹ฎ         โ‹ฎ         โ‹ฎ                โ‹ฎ               โ‹ฎ
 294 โ”‚    30.0      1.0    22.1795   missing                     iv    750.0
 295 โ”‚    30.0      2.0    18.0047   missing                     iv    750.0
 296 โ”‚    30.0      6.0    8.33707   missing                     iv    750.0
 297 โ”‚    30.0     10.0    3.50049   missing                     iv    750.0
 298 โ”‚    30.0     16.0    1.05457   missing                     iv    750.0
 299 โ”‚    30.0     20.0   0.472131   missing                     iv    750.0
 300 โ”‚    30.0     24.0   0.242979   missing                     iv    750.0
                                                             285 rows omitted

which can be easily converted to a DataFrame by calling the DataFrame() constructor on it:

df_sas7bdat = DataFrame(tb)
first(df_sas7bdat, 5)
5×6 DataFrame
 Row │ id       time     conc      amt       route    dose
     │ Float64  Float64  Float64?  Float64?  String3  Float64
─────┼────────────────────────────────────────────────────────
   1 │     1.0      0.0   missing    250.0   iv         250.0
   2 │     1.0      0.0   8.38202  missing   iv         250.0
   3 │     1.0      0.5   7.6079   missing   iv         250.0
   4 │     1.0      1.0   8.73121  missing   iv         250.0
   5 │     1.0      2.0   5.84823  missing   iv         250.0

Ok, that was for .sas7bdat files. We can do the same for .xpt files. But now we are going to do it in a single line of code by combining the last two operations:

df_xpt = DataFrame(readstat("data/iv_bolus_sd.xpt"))
first(df_xpt, 5)
5×6 DataFrame
 Row │ id       time     conc      amt       route    dose
     │ Float64  Float64  Float64?  Float64?  String3  Float64
─────┼────────────────────────────────────────────────────────
   1 │     1.0      0.0   missing    250.0   iv         250.0
   2 │     1.0      0.0   8.38202  missing   iv         250.0
   3 │     1.0      0.5   7.6079   missing   iv         250.0
   4 │     1.0      1.0   8.73121  missing   iv         250.0
   5 │     1.0      2.0   5.84823  missing   iv         250.0
Caution

Note that the package is called ReadStatTables.jl, so it only performs reading of files, not writing. If you need to write a SAS file (or any other statistical software's proprietary file format) for some reason, you'll probably need to export your data as either CSV or Excel and then convert it with an external tool to the desired file format.
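As a sketch of that workaround (with a toy table standing in for data read from a SAS file, and a temporary directory as the made-up output location):

```julia
using CSV, DataFrames

# a toy stand-in for a table read with readstat() and converted to a DataFrame
df = DataFrame(id = [1.0, 1.0], time = [0.0, 0.5], conc = [missing, 7.6079])

# export as a plain CSV file, which an external tool can then convert
# to the desired proprietary format
path = joinpath(mktempdir(), "iv_bolus_sd.csv")
CSV.write(path, df)
```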

Footnotes

  1. technically, this is actually a constructor. ↩︎

  2. CSV.File shares the same arguments as CSV.read, so the same arguments will work for both of them. ↩︎
