using DataFrames
using DataFramesMeta
using PharmaDatasets
Manipulating Tables with DataFramesMeta.jl
You now have a handle on basic operations in Julia, and how to read data into a DataFrame
from CSV files, Excel files, or SAS files. Now it’s time to actually DO something with that data.
1 💾 Basic DataFrames.jl
Functionality
First, let’s read in some data. Using PharmaDatasets.jl
we can import the iv_sd_2
dataset as df
:
= dataset("demographics_1")
df first(df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
Let’s get some basic information about each of the columns. We’ll do this in a number of ways to get a feel for it.
1.1 📦️ ↗️ A DataFrame is a Matrix, columns are vectors
Just like a Matrix
in Julia, the contents of a DataFrame
can be accessed with a pair of indices (see Tutorial 1 - Getting started with Julia) - the first index is for the row, and the second index is for the column. Column indices can either be an integer (for the column number) or the column name.
1, 3] # first row, 3rd column df[
38.212
5, "SCR"] # fifth row, column "SCR" df[
1.5924
Also like matrices, we can select “slices” of multiple rows / columns.
10:20, ["ID", "WEIGHT"]] df[
Row | ID | WEIGHT |
---|---|---|
Int64 | Float64 | |
1 | 10 | 50.878 |
2 | 11 | 76.539 |
3 | 12 | 113.91 |
4 | 13 | 65.829 |
5 | 14 | 63.769 |
6 | 15 | 37.54 |
7 | 16 | 70.069 |
8 | 17 | 49.636 |
9 | 18 | 59.107 |
10 | 19 | 74.732 |
11 | 20 | 56.43 |
typeof(df[10:20, ["ID", "WEIGHT"]])
DataFrame
Notice that, when selecting multiple rows and columns, the return value is itself a DataFrame
.
Unlike the Julia Matrix
type, which has a single parameter for the types of its elements, a DataFrame
is essentially a collection of Vector
s, each of which can have their own type.
When slicing a single column of a DataFrame
, the return value is also a vector:
typeof(df[:, "WEIGHT"])
Vector{Float64} (alias for Array{Float64, 1})
typeof(df[:, "ISMALE"])
Vector{Int64} (alias for Array{Int64, 1})
DataFrame
s also have a convenient syntax for selecting an entire column, similar to the df$col_name
syntax in R. In Julia, we instead use a .
:
= df.WEIGHT
vec first(vec, 5)
5-element Vector{Float64}:
38.212
74.838
37.303
32.969
47.139
Actually, there’s one more way to select a whole column - df[!, :WEIGHT]
. The difference between using df[!, index]
and df[:, index]
is that the latter returns a copy of the column, while the former returns the actual underlying object. This can often be better for performance, and it means you can do things like change values, but it also makes it possible to corrupt the underlying object. This is why it uses a !
, to let you know that it’s potentially dangerous.
For example, you could do pop!(df[!, :ID])
, and then the first column would have fewer rows than the others, violating the expectation that all columns have the same number of rows.
Using the “dot” syntax is identical to using the !
form of indexing, so changes made to df.WEIGHT
will be reflected in the parent DataFrame
, for good or ill.
Any function that operates on a vector can therefore be easily used on DataFrame
columns.
sum(df.WEIGHT)
5188.59
using Statistics
mean(df.eGFR)
58.089360000000006
The describe()
function provides a convenient shorthand to get summary information about the columns in your DataFrame
.
describe(df)
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
Symbol | Float64 | Real | Float64 | Real | Int64 | DataType | |
1 | ID | 50.5 | 1 | 50.5 | 100 | 0 | Int64 |
2 | AGE | 42.9947 | 19.187 | 41.2735 | 79.292 | 0 | Float64 |
3 | WEIGHT | 51.8859 | 20.427 | 49.7975 | 113.91 | 0 | Float64 |
4 | SCR | 1.14298 | 0.76055 | 1.1141 | 1.7202 | 0 | Float64 |
5 | ISMALE | 0.47 | 0 | 0.0 | 1 | 0 | Int64 |
6 | eGFR | 58.0894 | 19.026 | 53.8285 | 129.04 | 0 | Float64 |
The default information is quite handy, but you can also tailor the output to suit your needs. For example:
describe(df, :median, :nunique, cols = ["ID", "AGE"])
Row | variable | median | nunique |
---|---|---|---|
Symbol | Float64 | Nothing | |
1 | ID | 50.5 | |
2 | AGE | 41.2735 |
You can even create your own function, which takes a column as an argument and calculates whatever summary you like.
function my_stat(col)
count(>(50), col) # returns the number of entries that are greater than 50
end
my_stat (generic function with 1 method)
describe(df, :median, :mean, my_stat => "gt50")
Row | variable | median | mean | gt50 |
---|---|---|---|---|
Symbol | Float64 | Float64 | Int64 | |
1 | ID | 50.5 | 50.5 | 50 |
2 | AGE | 41.2735 | 42.9947 | 23 |
3 | WEIGHT | 49.7975 | 51.8859 | 49 |
4 | SCR | 1.1141 | 1.14298 | 0 |
5 | ISMALE | 0.0 | 0.47 | 0 |
6 | eGFR | 53.8285 | 58.0894 | 58 |
In situations like this, you can also use Julia’s “anonymous” function (sometimes called “lambda” function) syntax. Wherever a function can be passed as an argument, you can write arg -> action
, where arg
is the thing being passed as an argument, and action
is whatever the function should do.
For example, instead of writing out a definition for my_stat
ahead of time, we could have written:
describe(df, :median, :mean, (col -> count(>(50), col)) => "gt50")
2 🔮 Coming from dplyr
Alright, now that we’ve got the basics, let’s do something more complicated. You might be familiar with many of these operations for selecting, summarizing, and transforming tables using dplyr
in R. Conceptually, there’s very little difference, though some of the verbs and syntax are a little different.
action | dplyr | DataFrames.jl | DataFramesMeta.jl |
---|---|---|---|
keep rows matching criteria | filter |
subset[!] |
@[r]subset[!] |
generate reduction / summary | summarize |
combine |
@combine |
add new columns based on some function | mutate |
transform[!] |
@[r]transform[!] |
select columns by name | select |
select[!] |
@[r]select[!] |
rename columns | rename |
rename[!] |
@rename[!] |
reorder rows | arrange |
sort[!] |
@orderby |
Each of these functions exist in the base DataFrames.jl
package, using a special “mini language”, but we’re going to use the DataFramesMeta.jl
package, which provides a number of “macros” to make this a bit easier. Macros are ways to tweak the way that Julia parses expressions, and can make many operations a bit easier.
2.1 📜 Macro conventions
In Julia, macros always start with the @
character. The DataFramesMeta.jl
macros help to manipulate DataFrame
s in a similar way to what you may be familiar with from dplyr
, but use the same verbs mentioned above. Namely the macros are of the form @subset
, @combine
, @transform
, and @select
. Only sorting has a different verb: @orderby
.
In keeping with the conventions from DataFrames.jl
, these macros work on entire columns, but they also have a “row-wise” version, which starts with an r
(eg @rsubset
is the row-wise @subset
). Finally, by default, these macros are non-mutating - that is, they will return a copy of the DataFrame
with the operation applied, but will leave the original unchanged. Nevertheless, each macro also has an “in-place” version, that will mutate the original. As is the convention in Julia for mutating functions, these macros end with !
(eg @subset!
and @rsubset!
).
Sometimes, we will refer to these macros using notation such as “@[r]subset[!]
”, which just refers to all of the different flavors of this operation.
OK, enough talking - let’s get to some data manipulations!
3 ℹ️ DataFramesMeta.jl
macro basics
Each macro from DataFramesMeta.jl
takes a DataFrame
(or a GroupedDataFrame
- we’ll get to those later) as the first argument. Other arguments depend on the operation, and we use Symbol
s to refer to columns.
A Symbol
starts with :
, eg :ID
. The details of why we use symbols aren’t super important, but it’s worth knowing that, when making a symbol, you can’t start with a number, and can’t have spaces. If your columns have those features, you can make a symbol from a string by doing eg. Symbol("Col with a Space")
.
3.1 🔃 @orderby
Probably the most straightforward operation is @orderby
, which takes a column or columns that should be used to sort the rows of the dataframe.
= @orderby(df, :eGFR)
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 55 | 58.715 | 20.427 | 1.2121 | 1 | 19.026 |
2 | 77 | 71.791 | 42.516 | 1.61 | 1 | 25.018 |
3 | 98 | 61.124 | 27.529 | 1.0137 | 0 | 25.286 |
4 | 8 | 39.897 | 26.452 | 1.0817 | 0 | 28.899 |
5 | 79 | 52.219 | 26.453 | 1.102 | 1 | 29.265 |
Most of these macros have both the “function-like” syntax used above, where the arguments are all enclosed in parentheses, or a form where the parentheses are omitted and arguments are separated by spaces. The example above for instance, could have been written:
= @orderby df :eGFR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 55 | 58.715 | 20.427 | 1.2121 | 1 | 19.026 |
2 | 77 | 71.791 | 42.516 | 1.61 | 1 | 25.018 |
3 | 98 | 61.124 | 27.529 | 1.0137 | 0 | 25.286 |
4 | 8 | 39.897 | 26.452 | 1.0817 | 0 | 28.899 |
5 | 79 | 52.219 | 26.453 | 1.102 | 1 | 29.265 |
= @orderby(df, :ISMALE, :eGFR) # sort first on ISMALE column, then on eGFR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 98 | 61.124 | 27.529 | 1.0137 | 0 | 25.286 |
2 | 8 | 39.897 | 26.452 | 1.0817 | 0 | 28.899 |
3 | 9 | 54.975 | 26.931 | 0.90926 | 0 | 29.731 |
4 | 37 | 32.226 | 27.841 | 1.1696 | 0 | 30.288 |
5 | 6 | 53.758 | 50.819 | 1.6769 | 0 | 30.855 |
Again, we could instead use the non-function (no parentheses) form, where arguments are separated by spaces rather than commas:
= @orderby df :ISMALE :eGFR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 98 | 61.124 | 27.529 | 1.0137 | 0 | 25.286 |
2 | 8 | 39.897 | 26.452 | 1.0817 | 0 | 28.899 |
3 | 9 | 54.975 | 26.931 | 0.90926 | 0 | 29.731 |
4 | 37 | 32.226 | 27.841 | 1.1696 | 0 | 30.288 |
5 | 6 | 53.758 | 50.819 | 1.6769 | 0 | 30.855 |
As less is more, we’ll use the @
macro calls from DataFramesMeta.jl
without parentheses from now on.
If the number of arguments gets too long, we might want to separate them on additional lines. With the functional form, as with any Julia function, we can put line breaks between arguments, eg
@orderby(df, :ISMALE, :eGFR)
but with the non-function form, we need to use a begin
block, eg:
@orderby df begin
:ISMALE
:eGFR
end
Use a -
before the column name to sort in reverse
= @orderby df :ISMALE -:eGFR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 80 | 45.822 | 98.223 | 0.8463 | 0 | 129.04 |
2 | 12 | 48.539 | 113.91 | 1.0889 | 0 | 112.95 |
3 | 16 | 32.41 | 70.069 | 0.84661 | 0 | 105.12 |
4 | 39 | 44.668 | 65.805 | 0.76055 | 0 | 97.378 |
5 | 47 | 41.047 | 93.403 | 1.1259 | 0 | 96.911 |
You can also pass a custom order (in the form of a vector of integers) to @orderby
. So, the same reverse ordering can be accomplished with:
= @orderby df :ISMALE sortperm(:eGFR, rev = true)
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 70 | 21.981 | 56.093 | 0.97028 | 0 | 80.548 |
2 | 19 | 40.826 | 74.732 | 1.1407 | 0 | 76.702 |
3 | 11 | 38.603 | 76.539 | 0.9541 | 0 | 96.028 |
4 | 24 | 33.338 | 58.429 | 1.2144 | 0 | 60.586 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
Since sortperm()
can be given custom ordering functions, you have almost infinite flexibility. See the documentation for sortperm
in the “live docs” for more information
3.2 ❕ @select
and @transform
The @select
macro can also be straightforward. Here, we just pull out the :ID
and :AGE
columns.
= @select df :ID :AGE
my_df first(my_df, 5)
Row | ID | AGE |
---|---|---|
Int64 | Float64 | |
1 | 1 | 34.823 |
2 | 2 | 32.765 |
3 | 3 | 35.974 |
4 | 4 | 38.206 |
5 | 5 | 33.559 |
We can also use a number of special selectors provided by DataFrames.jl
.
Not()
: Selects all columns except those specified (can be supplied either a singleSymbol
or a vector ofSymbol
s)Between()
: takes two arguments, selects those two columns plus any columns between them.Cols()
: allows combining other selectors- Regular expressions: selects columns that match a regular expression. “Regex” is far too complicated to go into here - the short version is that it’s a way of specifying string matches. See the string tutorial for a bit more info.
All()
: just like you think - selects all columns. Why? Well, you can’t get the same column back twice, so this is most often used like “all remaining” in case you want to use the other selectors to move some columns to the beginning.
For reasons that aren’t worth going into (internal “under the hood” stuff), if we’re using these selectors we need to wrap them in $()
to make them work in the macros supplied by DataFramesMeta.jl
.
Let’s look at some examples.
propertynames(df) # for reference
6-element Vector{Symbol}:
:ID
:AGE
:WEIGHT
:SCR
:ISMALE
:eGFR
@select df $(Not(:ID))) |> propertynames (
5-element Vector{Symbol}:
:AGE
:WEIGHT
:SCR
:ISMALE
:eGFR
@select df $(Between(:SCR, :eGFR))) |> propertynames (
3-element Vector{Symbol}:
:SCR
:ISMALE
:eGFR
We can also pass multiple arguments, each one of which will be evaluated in sequence. For example:
@select df begin
(:ID # get :ID
$(Between(:SCR, :eGFR)) # followed by things between :SCR and :eGFR
end) |> propertynames
4-element Vector{Symbol}:
:ID
:SCR
:ISMALE
:eGFR
The following regular expression matches every column that has a capital E:
@select df $(r"E")) |> propertynames (
3-element Vector{Symbol}:
:AGE
:WEIGHT
:ISMALE
Finally, let’s look at combining these selectors:
@select df begin
($(Cols(:ID, r"E"))
$(Between(:SCR, :eGFR))
end) |> propertynames
6-element Vector{Symbol}:
:ID
:AGE
:WEIGHT
:ISMALE
:SCR
:eGFR
Did you notice that :ISMALE
is not between :SCR
and :eGFR
? In this example, it was already “consumed” by the regular expression, so it is not repeated.
3.2.1 Using @select
to make changes
But what if we want to do a bit of transformation? For example, it looks like the :AGE
column is in years - let’s convert it to :AGE_in_months
. To do this, we have to multiply each value by 12.
In R, one can apply many operations to vectors, and it implicitly applies that operation to each element. Eg.
$> v = c(1,2,3)
r
$> v * 12
r1] 12 24 36
[
$> v^2
r1] 1 4 9 [
In general, Julia tries to avoid DWIM (do what I mean) patterns, when the output could be ambiguous. For multiplying a vector by a scalar value, this is unambiguous, mathematically
julia> v = [1, 2, 3];
julia> v * 12
3-element Vector{Int64}:
12
24
36
But other operations are ambiguous so Julia throws an error instead of guessing what you mean.
julia> v^2
ERROR: MethodError: no method matching ^(::Vector{Int64}, ::Int64)
If you want to apply an operation element-wise, you must “broadcast” it, which in Julia has a convenient dot syntax:
julia> v .^ 2
3-element Vector{Int64}:
1
4
9
Here, we use @select
as before, but we add a new column, assigned to the value of the :AGE
column multiplied by 12:
# broadcasting here (.*) is not strictly necessary, but is a good habit to get into
= @select df :ID :AGE_in_months = :AGE .* 12
my_df first(my_df, 5)
Row | ID | AGE_in_months |
---|---|---|
Int64 | Float64 | |
1 | 1 | 417.876 |
2 | 2 | 393.18 |
3 | 3 | 431.688 |
4 | 4 | 458.472 |
5 | 5 | 402.708 |
What if our transformation is a bit more complicated? For example, let’s get the z-scores of the :eGFR
column - in other words, we want to transform the data to be normally distributed with a mean of zero and a standard deviation of 1.
Currently, the :eGFR
column distribution looks like this:
using CairoMakie
hist(df.eGFR)
A z transform (also called “standard score”) involves taking each value, subtracting the mean of the sample, and dividing by the standard deviation of the sample. So, for a sample \[x_i\],
\[z_i = \frac{x_i - \bar{x}}{\sigma_x}\]
Where \(z_i\) is the z-score for element \(i\), \(x_i\) is the measured value, \(\bar{x}\) is the mean of the sample, and \(\sigma_x\) is the standard deviation of the sample.
Or, in Julia code:
function zscore(v)
= mean(v)
μ = std(v)
σ return [(x - μ) / σ for x in v]
end
zscore (generic function with 1 method)
We can now use this function in the @select
macro.
= @select df :ID :eGFR_z = zscore(:eGFR)
df_z first(df_z, 5)
Row | ID | eGFR_z |
---|---|---|
Int64 | Float64 | |
1 | 1 | -0.645386 |
2 | 2 | 2.836 |
3 | 3 | -0.380372 |
4 | 4 | -0.799943 |
5 | 5 | -0.87244 |
We didn’t need to broadcast (“vectorize”) this call with zscore.(:eGFR)
, since the function was written to operate on the whole column. This is why it’s useful to be able to use both column-wise and row-wise operations, depending on what makes the most sense.
hist(df_z.eGFR_z)
It’s often inconvenient to have to write out a separate function, and then call it in the context of a macro. Because of this, DataFramesMeta.jl
macros allow us to substitute complex expressions using begin
blocks.
So, to accomplish the same result as above, we could instead write:
= @select df begin
my_df :ID
# whitespace in Julia mostly doesn't matter, so we can break this up
:eGFR_z = begin
= mean(:eGFR)
μ = std(:eGFR)
σ return [(x - μ) / σ for x in :eGFR]
end
end
first(my_df, 5)
Row | ID | eGFR_z |
---|---|---|
Int64 | Float64 | |
1 | 1 | -0.645386 |
2 | 2 | 2.836 |
3 | 3 | -0.380372 |
4 | 4 | -0.799943 |
5 | 5 | -0.87244 |
3.2.2 @transform
vs @select
The @transform
macro is essentially identical to @select
, except that @select
only returns the columns that you explicitly ask for, while @transform
returns all columns from the original DataFrame
, as well as any new columns you create.
Here’s exactly the same command we just ran, using @transform
instead of @select
(notice we can also leave out the :ID
):
= @transform df begin
my_df :eGFR_z = begin
= mean(:eGFR)
μ = std(:eGFR)
σ - μ) / σ for x in :eGFR]
[(x end
end
first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | eGFR_z |
---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | -0.645386 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 2.836 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | -0.380372 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | -0.799943 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | -0.87244 |
3.2.3 @byrow
We have already seen that we can use the @r...
versions of macros to use row-wise operations. But sometimes, we want to combine operations that require columns with those that would make more sense as row-rise.
One solution to this is to use broadcasting (eg .>
), but we can also use the @byrow
macro within other macro calls. For example, suppose we want to combine the eGFR_z
operation above with the AGE_in_months
operation we performed earlier:
= @transform df begin
my_df :eGFR_z = begin
= mean(:eGFR)
μ = std(:eGFR)
σ - μ) / σ for x in :eGFR]
[(x end
@byrow :AGE_in_months = :AGE * 12
end
first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | eGFR_z | AGE_in_months |
---|---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | -0.645386 | 417.876 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 2.836 | 393.18 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | -0.380372 | 431.688 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | -0.799943 | 458.472 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | -0.87244 | 402.708 |
3.3 🏷️ @rename
The @rename
macro renames columns in a DataFrame
using the syntax :new = :old
. Here’s an example:
= @rename df :serum_creatinine = :SCR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | serum_creatinine | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
Like other macros, @rename
can be used in both multi-argument and “block” format:
= @rename df begin
my_df :age = :AGE
:serum_creatinine = :SCR
end
first(my_df, 5)
Row | ID | age | WEIGHT | serum_creatinine | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
To rename columns to names with spaces you’ll need to escape it with $"..."
:
= @rename df $"Serum Creatinine" = :SCR
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | Serum Creatinine | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
You can use instead a function to rename all columns with the DataFrame.jl
’ rename
function. It has an alternate method that takes a function as the first positional argument followed by a DataFrame
. It also accepts a custom user-defined function.
Check some examples below.
= rename(lowercase, df)
my_df first(my_df, 5)
Row | id | age | weight | scr | ismale | egfr |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
= rename((s -> replace(s, 'A' => 'X')), df)
my_df first(my_df, 5)
Row | ID | XGE | WEIGHT | SCR | ISMXLE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
3.4 ⬜️ ❎ @subset
The @subset
macro is used for selecting rows that have some property, as determined by a boolean operation - that is, something that returns true
or false
. This operation is often called the “predicate”.
For example, suppose we want to keep rows where the :SCR
is greater than the mean :SCR
score:
= @subset df :SCR .> mean(:SCR)
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
2 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
3 | 6 | 53.758 | 50.819 | 1.6769 | 0 | 30.855 |
4 | 13 | 28.818 | 65.829 | 1.3689 | 0 | 63.118 |
5 | 15 | 57.703 | 37.54 | 1.3098 | 1 | 32.758 |
Recall that we need to use the “broadcasted” >
, so that it’s applied to each element. Here, we need the whole column for the mean
function, but if our operation only needs individual rows, we could have used the @rsubset
macro and skipped the broadcast, eg
= @rsubset(df, :SCR > 0.5)
my_df first(my_df, 5)
We can also provide multiple predicates, which are calculated separately, and only rows where each predicate is true will be returned.
Like @select
, we can either provide predicates as arguments in a function form, or use a begin
block where each line represents one predicate. In other words, the following examples are the same:
= @subset df :SCR .> mean(:SCR) :eGFR .< median(:eGFR);
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
2 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
3 | 6 | 53.758 | 50.819 | 1.6769 | 0 | 30.855 |
4 | 15 | 57.703 | 37.54 | 1.3098 | 1 | 32.758 |
5 | 18 | 48.453 | 59.107 | 1.1961 | 0 | 53.408 |
= @subset df begin
my_df :SCR .> mean(:SCR)
:eGFR .< median(:eGFR)
end;
first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR |
---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | |
1 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 |
2 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 |
3 | 6 | 53.758 | 50.819 | 1.6769 | 0 | 30.855 |
4 | 15 | 57.703 | 37.54 | 1.3098 | 1 | 32.758 |
5 | 18 | 48.453 | 59.107 | 1.1961 | 0 | 53.408 |
4 🪄 DataFramesMeta.jl
advanced operations with @astable
With the @astable
macro we can perform advanced operations. The limitations of the @select
and @transform
families of macros are that they cannot operate on multiple columns at once.
For example, suppose you want to take the log
of :AGE
and add to the :WEIGHT
converted to pounds. And you want something more readable inside a begin ... end
block. This is not supported by just using @select
or @transform
macro families:
@rtransform df begin
= log(:AGE)
AGE_log = :WEIGHT * 2.2
WEIGHT_lbs :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end
┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :AGE_log = ... instead.
└ @ DataFramesMeta ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/nI3Y1/src/parsing.jl:373
┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :WEIGHT_lbs = ... instead.
└ @ DataFramesMeta ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/nI3Y1/src/parsing.jl:373
UndefVarError: UndefVarError(:AGE_log)
UndefVarError: `AGE_log` not defined
Stacktrace:
[1] (::var"#34#36")()
@ Main.Notebook ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/nI3Y1/src/parsing.jl:291
[2] #561
@ ./boot.jl:0 [inlined]
[3] iterate
@ ./generator.jl:47 [inlined]
[4] collect(itr::Base.Generator{UnitRange{Int64}, DataFrames.var"#561#562"{var"#34#36"}})
@ Base ./array.jl:782
[5] _empty_selector_helper(fun::var"#34#36", len::Int64)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:586
[6] _transformation_helper(df::DataFrame, col_idx::Vector{Int64}, ::Base.RefValue{Any})
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:591
[7] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:805
[8] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1778
[9] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1698
[10] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1299
[11] transform(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1379
[12] transform(::DataFrame, ::Any, ::Any, ::Vararg{Any})
@ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1379
[13] top-level scope
@ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/nI3Y1/src/macros.jl:1600
In order to perform such an operation you’ll need to use the @astable
macro inside the @select
/@transform
macro after the DataFrame
:
= @rtransform df @astable begin
my_df = log(:AGE)
AGE_log = :WEIGHT * 2.2
WEIGHT_lbs :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end
first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | AGE_log_WEIGHT_lbs |
---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | 87.6167 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 168.133 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | 85.6494 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | 76.1748 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | 107.219 |
You can use @astable
inside any DataFramesMeta.jl
macro if you would like to do operations on multiple columns at once.
5 ⊗ Operations on groups
There are many times when we want to apply operations within groups, rather than on the whole DataFrame
.
For example, suppose we want to calculate z-scores for :WEIGHT
, but we know that this variable will be affected by the sex of the individuals. We therefore want our mean and standard deviation calculations to happen separately for men and for women.
We could make a men_df
and women_df
using @subset
, then calculate the values separately, but that would be inconvenient. Instead, we can use groupby
to create a GroupedDataFrame
.
= groupby(df, :ISMALE)
gdf typeof(gdf)
GroupedDataFrame{DataFrame}
Now, we can use the same operations as before. DataFramesMeta
will apply our @select
, @transform
, and @subset
operations within groups, then return a new DataFrame
.
= wz = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
my_df first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | WEIGHT_z |
---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | -0.807368 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1.57251 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | -0.81391 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | -1.08946 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | -0.334861 |
= Figure()
fig = Axis(fig[1, 1], title = "Not transformed", xlabel = "Weight")
not_tr = Axis(fig[1, 2], title = "Z-score", xlabel = "Z-score")
tr = hist!(not_tr, @rsubset(wz, :ISMALE == 0).WEIGHT)
w = hist!(not_tr, @rsubset(wz, :ISMALE == 1).WEIGHT)
m hist!(tr, @rsubset(wz, :ISMALE == 0).WEIGHT_z)
hist!(tr, @rsubset(wz, :ISMALE == 1).WEIGHT_z)
Legend(
2, 1:2],
fig[
[w, m],"women", "men"],
[= :horizontal,
orientation = false,
tellwidth = true,
tellheight
) fig
Well, that was unexpected…
5.1 ⛓️ @chain
Because we so often want to apply this pattern - grouping on one column, then performing operations on groups, DataFramesMeta
provides the @chain
macro from Chain.jl
, in which each line operates on the line before it, similar to stringing together operations with %>%
in R.
In particular, each line implicitly takes the result of the previous line as the first argument. In other words, what we previously accomplished with
df= groupby(df, :ISMALE)
gdf = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
my_df first(my_df, 5)
becomes:
= @chain df begin
my_df groupby(:ISMALE)
@transform :WEIGHT_z = zscore(:WEIGHT)
end
first(my_df, 5)
Row | ID | AGE | WEIGHT | SCR | ISMALE | eGFR | WEIGHT_z |
---|---|---|---|---|---|---|---|
Int64 | Float64 | Float64 | Float64 | Int64 | Float64 | Float64 | |
1 | 1 | 34.823 | 38.212 | 1.1129 | 0 | 42.635 | -0.807368 |
2 | 2 | 32.765 | 74.838 | 0.8846 | 1 | 126.0 | 1.57251 |
3 | 3 | 35.974 | 37.303 | 1.1004 | 1 | 48.981 | -0.81391 |
4 | 4 | 38.206 | 32.969 | 1.1972 | 1 | 38.934 | -1.08946 |
5 | 5 | 33.559 | 47.139 | 1.5924 | 0 | 37.198 | -0.334861 |
5.2 🧑🍳 @combine
Suppose that we want to aggregate or summarize columns after grouping them, for example, getting summary statistics, or counting the number of rows with particular values. This can be accomplished with the @combine
macro.
For the next example, remember that gdf
is our original DataFrame
grouped on the :ISMALE
column.
= @combine gdf begin
my_df :AGE_μ = mean(:AGE)
:WEIGHT_μ = mean(:WEIGHT)
:total = length(:ID)
:high_eGFR = count(>(80), :eGFR)
end
first(my_df, 5)
Row | ISMALE | AGE_μ | WEIGHT_μ | total | high_eGFR |
---|---|---|---|---|---|
Int64 | Float64 | Float64 | Int64 | Int64 | |
1 | 0 | 41.2514 | 53.4655 | 53 | 10 |
2 | 1 | 44.9605 | 50.1047 | 47 | 9 |
DataFramesMeta.jl
has also a macro that fuses together the groupby
+ @combine
operations.
This is the @by
macro and its syntax is:
@by DataFrame :group_col combine_operations
where :group_col
can be either a single column or a vector of columns for which the data would be grouped by.
See an example below:
= @by df :ISMALE :AGE_μ = mean(:AGE) :WEIGHT_μ = mean(:WEIGHT)
my_df first(my_df, 5)
Row | ISMALE | AGE_μ | WEIGHT_μ |
---|---|---|---|
Int64 | Float64 | Float64 | |
1 | 0 | 41.2514 | 53.4655 |
2 | 1 | 44.9605 | 50.1047 |