Reading and Writing Data

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

Pumas Workflow

The first step of data analysis is to read data. Julia has a rich ecosystem of packages devoted to reading data. In this tutorial, we will cover 3 main file types:

  • CSV files
  • Excel files
  • SAS data files

These likely constitute the bulk of the file types that you will encounter doing data wrangling in Julia.

Note

If you find yourself with an exotic file type that you need to read data from, don't forget to check the JuliaData organization on GitHub. You might find just the right package for your needs.

Caution

Starting with the next tutorials, we'll use the PharmaDatasets.jl package to load our datasets.

1 ๐Ÿ“ CSV files with CSV.jl

CSV (Comma Separated Values) files are the most common data type you will probably encounter while reading data. CSV files have the .csv extension in the filename.

This file type has the characteristic that it does not use a proprietary protocol: the data is stored in a simple text file.

It uses lines to denote rows (or observations) and frequently the first line is used as a header for the column/variable names.

A specific character is used to separate values. This is often a comma (,) and that's why they are called comma separated values.

Note

Observe that different characters can be used to separate values.

In some European countries, and also in most Latin American countries, the decimal number system uses commas to denote decimals instead of the dot used in the Anglo-Saxon number system, which in turn is adopted by countries like the United States of America, the United Kingdom, and Canada. Thus, \(\pi\) rounded to two decimal places, for example, would be portrayed as \(3,14\) instead of \(3.14\).

So, in some countries, the CSV file uses the semicolon (;) as the character to separate values. Despite being "semicolon separated values", the file extension remains .csv.

Furthermore, the "tab" character (\t) is also used, but the file type changes to TSV (Tab Separated Values) and has the .tsv extension.

In order to read a CSV file into Julia you'll need to first import CSV.jl into your environment. This is done with the using statement:

using CSV

1.1 Reading a Single CSV File

CSV.jl has two main functions to parse and read text-delimited data:

  • CSV.read(): passes the input to a valid sink type that will be the recipient of the input data, such as a DataFrame, without making extra copies.
  • CSV.File()¹: materializes the input to a valid sink type. This means that the input will be copied before being passed to a valid sink.

Our advice is to almost exclusively use CSV.read(). You'll probably never need to read a CSV file with CSV.File().

Also, the most frequent sink you'll use will be a DataFrame (which we will cover in the next tutorials). So, to read a CSV file into a DataFrame, you'll need to pass CSV.read() a file path followed by a sink type.

Since we need DataFrames.jl to use DataFrames as sink, let's import it:

using DataFrames
df = CSV.read("data/iv_sd_demogs.csv", DataFrame)
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Tip

Note that we are using a string to indicate the relative file path of the desired CSV file.

A much better approach would be to create a file path that is robust to different filesystems and also operating systems. This means that if a user using Windows shares their Julia script with another user using macOS, the script won't fail. This is because macOS, for instance, follows a pattern that uses forward slashes, /, for file paths (example: /Users/JuliaUser/project/src/script.jl), whereas Windows uses backslashes, \, for file paths (example: C:\user\JuliaUser\project\src\script.jl).

This can be accomplished with filesystem best practices. Specifically, three functions/commands that are robust to different operating systems are: @__FILE__, dirname(), and joinpath().

  • @__FILE__: returns the file path of the current Julia script.
  • dirname(): returns the directory of a file path.
  • joinpath(): join components to create a full file or directory path.

When we combine those three functions/commands, we can create a robust file path that will not break our Julia script if we share it or run it on a different operating system.

Suppose you have a script inside a src/ folder and you need to access a CSV file in the data/ folder which is located in the same directory as the src/ folder. You can do this with the following Julia command:

my_file = joinpath(dirname(@__FILE__), "..", "data", "file.csv")

The code above extracts the directory of the current Julia script with dirname(@__FILE__) and then joins this with "..", which represents the upper directory, "data", the data/ directory, and "file.csv", the desired CSV file inside the data/ directory.

For more information, see this section of the Julia official documentation.

1.2 Custom CSV File Specifications

Most of the time, a plain CSV.read() without any customized options will correctly parse and read the desired file. This is due to the guessing that CSV.jl does under the hood to infer the specifications that the file is using.

But CSV.jl might sometimes fail to correctly guess the specifications of the underlying file. This is where you'll have to pass the specifications yourself. Fortunately, CSV.read()² has the following arguments:

  • delim: either a character, e.g. ';', or string, e.g. "\t", that will be used to indicate how the values are delimited in a file. If not specified, CSV.jl will try to guess and detect the delimiter from the first 10 rows of the file.
  • decimal: a character that will be used to parse floating point numbers. If ',', then 3,14 will be parsed as a float. The default is a dot, '.'.

See this example where we are reading a CSV file that uses the semicolon as the delimiter and commas as decimals. This is a common specification when reading EU/LatAm CSV files.

# the first two lines of the CSV file
readlines("data/iv_sd_demogs_eu.csv")[1:2]
2-element Vector{String}:
 "ID;AGE;WEIGHT;SCR;ISMALE;eGFR"
 "1;34,823;38,212;1,1129;0;42,635"
# using the keyword arguments delim and decimal
df_eu = CSV.read("data/iv_sd_demogs_eu.csv", DataFrame; delim = ';', decimal = ',')
first(df_eu, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Tip

Sometimes our CSV files have comments or other information in their first few lines. These lines do not constitute the data but might be hard-coded metadata or other important information.

To handle these situations, CSV.File() and CSV.read() have the keyword argument skipto, which accepts an integer specifying the row number where the data starts.

Similarly, if this kind of information or metadata is located at the bottom of the CSV file, you can use the keyword argument footerskip, which also takes an integer and specifies the number of rows to ignore at the end of the file.
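As a quick sketch of both keyword arguments (the file contents below are made up for illustration, and an in-memory IOBuffer stands in for a file on disk):

```julia
using CSV, DataFrames

# made-up file contents: two metadata lines at the top,
# the header on row 3, data on rows 4-5, and one footer note
raw = """
# study: example
# exported: 2021-12-12
ID,CONC
1,13.026
2,14.984
end of export
"""

df = CSV.read(
    IOBuffer(raw),
    DataFrame;
    header = 3,      # the column names are on row 3
    skipto = 4,      # the data starts at row 4
    footerskip = 1,  # ignore the last row of the file
)
```

The resulting df has just the two data rows, with the metadata and footer lines ignored.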

1.3 Specifying Custom Types for Columns while Reading a CSV File

Sometimes you'll want to overrule the automatic type detection that CSV.jl will infer for the columns present in the CSV file. This can be done with the keyword argument types. It accepts several inputs, but the easiest and most customizable is a Julia dictionary, where the keys are either an integer (for column indices) or a string/symbol (for column names) and the values are the desired types.

For example, suppose we want to have the columns :ID and :ISMALE to be parsed as String and Bool respectively:

df_custom_types = CSV.read(
    "data/iv_sd_demogs.csv",
    DataFrame;
    types = Dict(:ID => String, :ISMALE => Bool),
)
first(df_custom_types, 5)
5×6 DataFrame
 Row │ ID      AGE      WEIGHT   SCR      ISMALE  eGFR
     │ String  Float64  Float64  Float64  Bool    Float64
─────┼────────────────────────────────────────────────────
   1 │ 1        34.823   38.212   1.1129   false   42.635
   2 │ 2        32.765   74.838   0.8846    true  126.0
   3 │ 3        35.974   37.303   1.1004    true   48.981
   4 │ 4        38.206   32.969   1.1972    true   38.934
   5 │ 5        33.559   47.139   1.5924   false   37.198

1.4 Selecting and Dropping Columns while Reading a CSV File

By default, CSV.read() will parse and read all the columns in a CSV file. If you want to select or drop specific columns you can do so with the following arguments. Both accept a vector of either strings or integers that represents the names and indices, respectively, of the columns to select or drop.

  • select
  • drop

See the following examples. We are reading the same CSV file as before but now we are selecting and dropping some columns either by name or index.

# reading the same CSV file but now selecting some columns
# with a vector of strings as names

df_select_names = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = ["ID", "AGE"])
first(df_select_names, 5)
5×2 DataFrame
 Row │ ID     AGE
     │ Int64  Float64
─────┼────────────────
   1 │     1   34.823
   2 │     2   32.765
   3 │     3   35.974
   4 │     4   38.206
   5 │     5   33.559
# reading the same CSV file but now selecting some columns
# with a vector of integers as indices

df_select_idxs = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = [1, 2])
first(df_select_idxs, 5)
5×2 DataFrame
 Row │ ID     AGE
     │ Int64  Float64
─────┼────────────────
   1 │     1   34.823
   2 │     2   32.765
   3 │     3   35.974
   4 │     4   38.206
   5 │     5   33.559
Tip

Notice that we can be clever with the indices. If we want to select a range of indices, we can materialize it into a vector using collect():

# reading the same CSV file but now selecting some columns
# with a vector of integers as indices

df_select_idxs2 = CSV.read("data/iv_sd_demogs.csv", DataFrame; select = collect(1:3))
first(df_select_idxs2, 5)
5×3 DataFrame
 Row │ ID     AGE      WEIGHT
     │ Int64  Float64  Float64
─────┼─────────────────────────
   1 │     1   34.823   38.212
   2 │     2   32.765   74.838
   3 │     3   35.974   37.303
   4 │     4   38.206   32.969
   5 │     5   33.559   47.139

Now letโ€™s explore some examples with the drop keyword argument:

# reading the same CSV file but now dropping some columns
# with a vector of strings as names

df_drop_names = CSV.read("data/iv_sd_demogs.csv", DataFrame; drop = ["SCR", "eGFR"])
first(df_drop_names, 5)
5×4 DataFrame
 Row │ ID     AGE      WEIGHT   ISMALE
     │ Int64  Float64  Float64  Int64
─────┼─────────────────────────────────
   1 │     1   34.823   38.212       0
   2 │     2   32.765   74.838       1
   3 │     3   35.974   37.303       1
   4 │     4   38.206   32.969       1
   5 │     5   33.559   47.139       0
# reading the same CSV file but now dropping some columns
# with a vector of integers as indices

df_drop_idx = CSV.read("data/iv_sd_demogs.csv", DataFrame; drop = [4, 6])
first(df_drop_idx, 5)
5×4 DataFrame
 Row │ ID     AGE      WEIGHT   ISMALE
     │ Int64  Float64  Float64  Int64
─────┼─────────────────────────────────
   1 │     1   34.823   38.212       0
   2 │     2   32.765   74.838       1
   3 │     3   35.974   37.303       1
   4 │     4   38.206   32.969       1
   5 │     5   33.559   47.139       0
Tip

You can also pass a vector of Symbols to the select and drop arguments. So feel free to choose whatever works best for you.
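For instance (a sketch using made-up in-memory data via an IOBuffer instead of a file on disk):

```julia
using CSV, DataFrames

# made-up two-row data standing in for a CSV file on disk
raw = "ID,AGE,WEIGHT\n1,34.823,38.212\n2,32.765,74.838"

# a vector of Symbols works just like a vector of strings or indices
df_sym = CSV.read(IOBuffer(raw), DataFrame; select = [:ID, :AGE])
```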

1.5 Specifying Missing Values in a CSV File

By default, every blank value in a CSV file will be parsed as a missing value.

Sometimes missing values will be hardcoded as some other value instead of a blank value. For example, it could be a dot "." or a string "NA". In fact, any value can be specified to represent missing.

Note

We cover how to handle missing/NA values extensively in our Handling NAs and Missing Values tutorial.

Take a look at this dataset. It has several occurrences of "." which confuses Julia into parsing these columns as String types:

missing_df = CSV.read("data/iv_sd_final_target_format.csv", DataFrame)
first(missing_df, 5)
5×11 DataFrame
 Row │ ID     TIME_AFTER_DOSING  CONC      AMT      AGE      WEIGHT   SCR      ISMALE  eGFR     EVID   CMT
     │ Int64  Float64            String15  String3  Float64  Float64  Float64  Int64   Float64  Int64  String1
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1               0.0   .         100       34.823   38.212   1.1129       0   42.635      1  1
   2 │     1               0.25  13.026    .         34.823   38.212   1.1129       0   42.635      0  .
   3 │     1               0.5   14.984    .         34.823   38.212   1.1129       0   42.635      0  .
   4 │     1               0.75  14.16     .         34.823   38.212   1.1129       0   42.635      0  .
   5 │     1               1.0   19.316    .         34.823   38.212   1.1129       0   42.635      0  .

Fortunately, it is very easy to specify which values CSV.jl should treat as missing values. You just have to use the keyword argument missingstring:

missing_df_corrected =
    CSV.read("data/iv_sd_final_target_format.csv", DataFrame; missingstring = ".")
first(missing_df_corrected, 5)
5×11 DataFrame
 Row │ ID     TIME_AFTER_DOSING  CONC      AMT      AGE      WEIGHT   SCR      ISMALE  eGFR     EVID   CMT
     │ Int64  Float64            Float64?  Int64?   Float64  Float64  Float64  Int64   Float64  Int64  Int64?
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     1               0.0    missing      100   34.823   38.212   1.1129       0   42.635      1        1
   2 │     1               0.25    13.026  missing   34.823   38.212   1.1129       0   42.635      0  missing
   3 │     1               0.5     14.984  missing   34.823   38.212   1.1129       0   42.635      0  missing
   4 │     1               0.75    14.16   missing   34.823   38.212   1.1129       0   42.635      0  missing
   5 │     1               1.0     19.316  missing   34.823   38.212   1.1129       0   42.635      0  missing
Tip

If you have more than one value representing missing values, the keyword argument missingstring can also accept a vector of strings.
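Here is a sketch of that (with made-up in-memory data, where missing values are hardcoded both as "." and as "NA"):

```julia
using CSV, DataFrames

# made-up data with two different hardcoded missing-value markers
raw = "ID,CONC\n1,.\n2,NA\n3,13.026"

# a vector of strings treats every listed value as missing
df_na = CSV.read(IOBuffer(raw), DataFrame; missingstring = [".", "NA"])
```

The CONC column is then parsed as Float64 with missing entries instead of as a String column.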

1.6 Reading Multiple CSV Files

You might encounter data that is dispersed across several CSV files. The CSV file organizing logic could be something temporal, e.g. results_jan_2021.csv, or spatial, e.g. samples_boston.csv, or any other logic, e.g. results_john.csv.

With CSV.jl, you can simply pass a vector of strings to CSV.read() or CSV.File() and it will then concatenate all the CSV file paths into a single unit and materialize the associated files into the desired sink, such as a DataFrame.

If you have any information that is encoded in the CSV file name, it will be lost if you do not use the source keyword argument. When passing a vector of strings to CSV.read(), you can pass source as a string or symbol. CSV.jl will then create a column named after the string/symbol you've passed, which captures the file name of the originating CSV file for every observation in the final data passed to your sink:

df_multiple_csv = CSV.read(files, DataFrame; source = "DAY")
first(df_multiple_csv, 5)

where files is a vector of CSV file paths as strings.

All the arguments that we've used before (delim, decimal, select, and drop) work fine when CSV.read() receives a vector of strings representing multiple CSV file paths as input. So feel free to use them if necessary.

In the example below, we do just that with the ["data/multiple_csv/12-Dec-21.csv", "data/multiple_csv/13-Dec-21.csv"] as the vector of strings.

files = filter(x -> endswith(x, ".csv"), readdir("data/multiple_csv/"; join = true))
2-element Vector{String}:
 "data/multiple_csv/12-Dec-21.csv"
 "data/multiple_csv/13-Dec-21.csv"
df_multiple_csv = CSV.read(files, DataFrame; source = "DAY")
first(df_multiple_csv, 5)
5×3 DataFrame
 Row │ ID     CONC      DAY
     │ Int64  Float64?  String
─────┼─────────────────────────────────────────────────
   1 │     1   missing  data/multiple_csv/12-Dec-21.csv
   2 │     1    13.026  data/multiple_csv/12-Dec-21.csv
   3 │     1    14.984  data/multiple_csv/12-Dec-21.csv
   4 │     1    14.16   data/multiple_csv/12-Dec-21.csv
   5 │     1    19.316  data/multiple_csv/12-Dec-21.csv

1.7 Writing CSV Files

In order to write CSV files, you'll use the CSV.write() function which can be used in two ways:

  1. By passing a file path as a string as the first argument and a table (such as a DataFrame) as the second argument. For example:

    CSV.write("data/my_file.csv", my_df)

    where my_df is a DataFrame.

  2. By "piping" the table (such as a DataFrame) into the CSV.write() function and also specifying a file path as a string as the first argument:

    my_df |> CSV.write("data/my_file.csv")

    where my_df is a DataFrame.

Tip

All the keyword arguments that we've used so far can be used to write a specific CSV file format. For example:

my_df |> CSV.write("data/my_file.csv"; delim = ';', decimal = ',')

will create an EU/LatAm CSV file.

2 ๐Ÿ“ Excel Files with XLSX.jl

Excel files are also ubiquitous as a data file format. Commonly used file extensions are .xls and .xlsx.

Caution

Prefer .xlsx over .xls files. First, .xlsx files are compressed by default, so they have smaller file sizes than .xls files. Second, and most important, .xls files only support a maximum of 65,536 (\(2^{16}\)) rows and 256 (\(2^8\)) columns, whereas .xlsx files support a maximum of 1,048,576 (\(2^{20}\)) rows and 16,384 (\(2^{14}\)) columns. Finally, whenever the maximum capacity is exceeded, .xls files fail silently.

In order to read an Excel file into Julia you'll need to first import XLSX.jl into your environment. This is done with the using statement:

using XLSX

2.1 Reading Excel Files

XLSX.jl has two main functions to read and write Excel files:

  1. XLSX.readxlsx(file.xlsx): reads the whole file.xlsx and returns a custom type named XLSXFile, which contains a summary of all the sheets present in the file along with their respective cell ranges. XLSXFile works pretty much like a Julia dictionary: you can index it with a specific sheet name and it will return another custom type named Worksheet. But don't worry! Just call XLSX.eachtablerow() on a worksheet and pipe it to a DataFrame and you'll have a DataFrame for your desired indexed sheet. See an example below:
xf = XLSX.readxlsx("data/iv_sd_demogs.xlsx")
XLSXFile("data/iv_sd_demogs.xlsx") containing 1 Worksheet
            sheetname size          range        
-------------------------------------------------
         iv_sd_demogs 101x6         A1:F101      
df = DataFrame(XLSX.eachtablerow(xf["iv_sd_demogs"]))
first(df, 5)
5×6 DataFrame
 Row │ ID   AGE     WEIGHT  SCR     ISMALE  eGFR
     │ Any  Any     Any     Any     Any     Any
─────┼────────────────────────────────────────────
   1 │ 1    34.823  38.212  1.1129  0       42.635
   2 │ 2    32.765  74.838  0.8846  1       126.0
   3 │ 3    35.974  37.303  1.1004  1       48.981
   4 │ 4    38.206  32.969  1.1972  1       38.934
   5 │ 5    33.559  47.139  1.5924  0       37.198
  2. XLSX.readtable(file.xlsx, sheet): reads a specific sheet of the file.xlsx. For the second argument, sheet, you can pass either a string with the sheet name or an integer representing the sheet index. See the following example:
df = DataFrame(XLSX.readtable("data/iv_sd_demogs.xlsx", 1))
first(df, 5)
5×6 DataFrame
 Row │ ID   AGE     WEIGHT  SCR     ISMALE  eGFR
     │ Any  Any     Any     Any     Any     Any
─────┼────────────────────────────────────────────
   1 │ 1    34.823  38.212  1.1129  0       42.635
   2 │ 2    32.765  74.838  0.8846  1       126.0
   3 │ 3    35.974  37.303  1.1004  1       48.981
   4 │ 4    38.206  32.969  1.1972  1       38.934
   5 │ 5    33.559  47.139  1.5924  0       37.198

2.2 Inferring Excel Column Types with infer_eltypes

You might have noticed that all the columns of the dataframes we created from Excel files have the type Any. This is the default behavior of both XLSX.eachtablerow() and XLSX.readtable(). You can change this by passing the keyword argument infer_eltypes = true to XLSX.readtable():

df = DataFrame(XLSX.readtable("data/iv_sd_demogs.xlsx", 1; infer_eltypes = true))
first(df, 5)
5×6 DataFrame
 Row │ ID     AGE      WEIGHT   SCR      ISMALE  eGFR
     │ Int64  Float64  Float64  Float64  Int64   Float64
─────┼───────────────────────────────────────────────────
   1 │     1   34.823   38.212   1.1129       0   42.635
   2 │     2   32.765   74.838   0.8846       1  126.0
   3 │     3   35.974   37.303   1.1004       1   48.981
   4 │     4   38.206   32.969   1.1972       1   38.934
   5 │     5   33.559   47.139   1.5924       0   37.198
Caution

XLSX.jl does not have keyword arguments or options to specify missing values. By default, any Excel cell that is blank will be considered missing.

2.3 Writing Excel Files

To write a DataFrame as an Excel file, you can use the XLSX.writetable() function. Be aware that this function accepts the data (a vector of columns) and the column names as separate arguments. These can be obtained with collect(eachcol(df)) and names(df), where df is your dataframe. For example:

XLSX.writetable("filename.xlsx", collect(eachcol(df)), names(df))

Now if you want to write data into separate worksheets, you can do so by passing keyword arguments named after each sheet, each holding a tuple of data and column names. For example:

XLSX.writetable(
    "filename.xlsx",
    sheet_1 = (collect(eachcol(df1)), names(df1)),
    sheet_2 = (collect(eachcol(df2)), names(df2)),
)
Note

names, like the same function in dplyr, will return a vector of strings with the column names of the underlying dataframe.

3 ๐Ÿ“ SAS Data Files with ReadStatTables.jl

We can read SAS files with the package ReadStatTables.jl. We will cover two of the most common SAS data file extensions:

  1. .sas7bdat files
  2. .xpt files
Note

ReadStatTables.jl can read, as the name suggests, files from different statistical software such as Stata, SPSS, SAS, and more.

If you need to import data in formats other than the ones above, don't hesitate to check the ReadStatTables.jl documentation.

First, let's load ReadStatTables.jl:

using ReadStatTables

ReadStatTables.jl has the readstat() function, which takes a file path and returns a ReadStatTable object:

tb = readstat("data/iv_bolus_sd.sas7bdat")
300×6 ReadStatTable:
 Row โ”‚      id     time       conc       amt                  route     dose
     โ”‚ Float64  Float64   Float64?  Float64?  InlineStrings.String3  Float64
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   1 โ”‚     1.0      0.0    missing     250.0                     iv    250.0
   2 โ”‚     1.0      0.0    8.38202   missing                     iv    250.0
   3 โ”‚     1.0      0.5     7.6079   missing                     iv    250.0
   4 โ”‚     1.0      1.0    8.73121   missing                     iv    250.0
   5 โ”‚     1.0      2.0    5.84823   missing                     iv    250.0
   6 โ”‚     1.0      6.0    2.25315   missing                     iv    250.0
   7 โ”‚     1.0     10.0    1.17591   missing                     iv    250.0
   8 โ”‚     1.0     16.0   0.346308   missing                     iv    250.0
  โ‹ฎ  โ”‚    โ‹ฎ        โ‹ฎ         โ‹ฎ         โ‹ฎ                โ‹ฎ               โ‹ฎ
 294 โ”‚    30.0      1.0    22.1795   missing                     iv    750.0
 295 โ”‚    30.0      2.0    18.0047   missing                     iv    750.0
 296 โ”‚    30.0      6.0    8.33707   missing                     iv    750.0
 297 โ”‚    30.0     10.0    3.50049   missing                     iv    750.0
 298 โ”‚    30.0     16.0    1.05457   missing                     iv    750.0
 299 โ”‚    30.0     20.0   0.472131   missing                     iv    750.0
 300 โ”‚    30.0     24.0   0.242979   missing                     iv    750.0
                                                             285 rows omitted

which can be easily converted to a DataFrame by calling the DataFrame() constructor on it:

df_sas7bdat = DataFrame(tb)
first(df_sas7bdat, 5)
5×6 DataFrame
 Row │ id       time     conc      amt       route    dose
     │ Float64  Float64  Float64?  Float64?  String3  Float64
─────┼────────────────────────────────────────────────────────
   1 │     1.0      0.0   missing    250.0   iv         250.0
   2 │     1.0      0.0   8.38202  missing   iv         250.0
   3 │     1.0      0.5   7.6079   missing   iv         250.0
   4 │     1.0      1.0   8.73121  missing   iv         250.0
   5 │     1.0      2.0   5.84823  missing   iv         250.0

Ok, that was for .sas7bdat files. We can do the same for .xpt files. But now we are going to do it in a single line of code by combining the last two operations:

df_xpt = DataFrame(readstat("data/iv_bolus_sd.xpt"))
first(df_xpt, 5)
5×6 DataFrame
 Row │ id       time     conc      amt       route    dose
     │ Float64  Float64  Float64?  Float64?  String3  Float64
─────┼────────────────────────────────────────────────────────
   1 │     1.0      0.0   missing    250.0   iv         250.0
   2 │     1.0      0.0   8.38202  missing   iv         250.0
   3 │     1.0      0.5   7.6079   missing   iv         250.0
   4 │     1.0      1.0   8.73121  missing   iv         250.0
   5 │     1.0      2.0   5.84823  missing   iv         250.0
Caution

Note that the package is called ReadStatTables.jl, so it only performs reading of files, not writing. If you need to write a SAS file (or any other statistical software's proprietary file format) for some reason, you'll probably need to export your data as either CSV or Excel and then convert it with an external tool to the desired file format.
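As a sketch of that workaround (with a toy table standing in for data read from a SAS file, and a temporary directory as the made-up output location):

```julia
using CSV, DataFrames

# a toy stand-in for a table read with readstat() and converted to a DataFrame
df = DataFrame(id = [1.0, 1.0], time = [0.0, 0.5], conc = [missing, 7.6079])

# export as a plain CSV file, which an external tool can then convert
# to the desired proprietary format
path = joinpath(mktempdir(), "iv_bolus_sd.csv")
CSV.write(path, df)
```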

Footnotes

  1. technically, this is actually a constructor. ↩︎

  2. CSV.File shares the same arguments as CSV.read, so the same arguments will work for both of them. ↩︎
