Manipulating Tables with DataFramesMeta.jl

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

You now have a handle on basic operations in Julia, and how to read data into a DataFrame from CSV files, Excel files, or SAS files. Now it’s time to actually DO something with that data.

using DataFrames
using DataFramesMeta
using PharmaDatasets

1 💾 Basic DataFrames.jl Functionality

First, let’s read in some data. Using PharmaDatasets.jl we can import the iv_sd_2 dataset as df:

df = dataset("demographics_1")
first(df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198

Let’s get some basic information about each of the columns. We’ll do this in a number of ways to get a feel for it.

1.1 📦️ ↗️ A DataFrame is a Matrix, columns are vectors

Just like a Matrix in Julia, the contents of a DataFrame can be accessed with a pair of indices (see Tutorial 1 - Getting started with Julia) - the first index is for the row, and the second index is for the column. Column indices can either be an integer (for the column number) or the column name.

df[1, 3] # first row, 3rd column
38.212
df[5, "SCR"] # fifth row, column "SCR"
1.5924

Also like matrices, we can select “slices” of multiple rows / columns.

df[10:20, ["ID", "WEIGHT"]]
11×2 DataFrame
Row ID WEIGHT
Int64 Float64
1 10 50.878
2 11 76.539
3 12 113.91
4 13 65.829
5 14 63.769
6 15 37.54
7 16 70.069
8 17 49.636
9 18 59.107
10 19 74.732
11 20 56.43
typeof(df[10:20, ["ID", "WEIGHT"]])
DataFrame

Notice that, when selecting multiple rows and columns, the return value is itself a DataFrame.

Unlike the Julia Matrix type, which has a single parameter for the types of its elements, a DataFrame is essentially a collection of Vectors, each of which can have their own type.

When slicing a single column of a DataFrame, the return value is also a vector:

typeof(df[:, "WEIGHT"])
Vector{Float64} (alias for Array{Float64, 1})
typeof(df[:, "ISMALE"])
Vector{Int64} (alias for Array{Int64, 1})

DataFrames also have a convenient syntax for selecting an entire column, similar to the df$col_name syntax in R. In Julia, we instead use a .:

vec = df.WEIGHT
first(vec, 5)
5-element Vector{Float64}:
 38.212
 74.838
 37.303
 32.969
 47.139
Tip

Actually, there’s one more way to select a whole column - df[!, :WEIGHT]. The difference between using df[!, index] and df[:, index] is that the latter returns a copy of the column, while the former returns the actual underlying object. This can often be better for performance, and it means you can do things like change values, but it also makes it possible to corrupt the underlying object. This is why it uses a !, to let you know that it’s potentially dangerous.

For example, you could do pop!(df[!, :ID]), and then the first column would have fewer rows than the others, violating the expectation that all columns have the same number of rows.

Using the “dot” syntax is identical to using the ! form of indexing, so changes made to df.WEIGHT will be reflected in the parent DataFrame, for good or ill.

Any function that operates on a vector can therefore be easily used on DataFrame columns.

sum(df.WEIGHT)
5188.59
using Statistics
mean(df.eGFR)
58.089360000000006

The describe() function provides a convenient shorthand to get summary information about the columns in your DataFrame.

describe(df)
6×7 DataFrame
Row variable mean min median max nmissing eltype
Symbol Float64 Real Float64 Real Int64 DataType
1 ID 50.5 1 50.5 100 0 Int64
2 AGE 42.9947 19.187 41.2735 79.292 0 Float64
3 WEIGHT 51.8859 20.427 49.7975 113.91 0 Float64
4 SCR 1.14298 0.76055 1.1141 1.7202 0 Float64
5 ISMALE 0.47 0 0.0 1 0 Int64
6 eGFR 58.0894 19.026 53.8285 129.04 0 Float64

The default information is quite handy, but you can also tailor the output to suit your needs. For example:

describe(df, :median, :nunique, cols = ["ID", "AGE"])
2×3 DataFrame
Row variable median nunique
Symbol Float64 Nothing
1 ID 50.5
2 AGE 41.2735

You can even create your own function, which takes a column as an argument and calculates whatever summary you like.

function my_stat(col)
    count(>(50), col) # returns the number of entries that are greater than 50
end
my_stat (generic function with 1 method)
describe(df, :median, :mean, my_stat => "gt50")
6×4 DataFrame
Row variable median mean gt50
Symbol Float64 Float64 Int64
1 ID 50.5 50.5 50
2 AGE 41.2735 42.9947 23
3 WEIGHT 49.7975 51.8859 49
4 SCR 1.1141 1.14298 0
5 ISMALE 0.0 0.47 0
6 eGFR 53.8285 58.0894 58
Tip

In situations like this, you can also use Julia’s “anonymous” function (sometimes called “lambda” function) syntax. Wherever a function can be passed as an argument, you can write arg -> action, where arg is the thing being passed as an argument, and action is whatever the function should do.

For example, instead of writing out a definition for my_stat ahead of time, we could have written:

describe(df, :median, :mean, (col -> count(>(50), col)) => "gt50")

2 🔮 Coming from dplyr

Alright, now that we’ve got the basics, let’s do something more complicated. You might be familiar with many of these operations for selecting, summarizing, and transforming tables using dplyr in R. Conceptually, there’s very little difference, though some of the verbs and syntax are a little different.

action dplyr DataFrames.jl DataFramesMeta.jl
keep rows matching criteria filter subset[!] @[r]subset[!]
generate reduction / summary summarize combine @combine
add new columns based on some function mutate transform[!] @[r]transform[!]
select columns by name select select[!] @[r]select[!]
rename columns rename rename[!] @rename[!]
reorder rows arrange sort[!] @orderby

Each of these functions exist in the base DataFrames.jl package, using a special “mini language”, but we’re going to use the DataFramesMeta.jl package, which provides a number of “macros” to make this a bit easier. Macros are ways to tweak the way that Julia parses expressions, and can make many operations a bit easier.

2.1 📜 Macro conventions

In Julia, macros always start with the @ character. The DataFramesMeta.jl macros help to manipulate DataFrames in a similar way to what you may be familiar with from dplyr, but use the same verbs mentioned above. Namely the macros are of the form @subset, @combine, @transform, and @select. Only sorting has a different verb: @orderby.

In keeping with the conventions from DataFrames.jl, these macros work on entire columns, but they also have a “row-wise” version, which starts with an r (eg @rsubset is the row-wise @subset). Finally, by default, these macros are non-mutating - that is, they will return a copy of the DataFrame with the operation applied, but will leave the original unchanged. Nevertheless, each macro also has an “in-place” version, that will mutate the original. As is the convention in Julia for mutating functions, these macros end with ! (eg @subset! and @rsubset!).

Sometimes, we will refer to these macros using notation such as “@[r]subset[!]”, which just refers to all of the different flavors of this operation.

OK, enough talking - let’s get to some data manipulations!

3 ℹ️ DataFramesMeta.jl macro basics

Each macro from DataFramesMeta.jl takes a DataFrame (or a GroupedDataFrame - we’ll get to those later) as the first argument. Other arguments depend on the operation, and we use Symbols to refer to columns.

Tip

A Symbol starts with :, eg :ID. The details of why we use symbols aren’t super important, but it’s worth knowing that, when making a symbol, you can’t start with a number, and can’t have spaces. If your columns have those features, you can make a symbol from a string by doing eg. Symbol("Col with a Space").

3.1 🔃 @orderby

Probably the most straightforward operation is @orderby, which takes a column or columns that should be used to sort the rows of the dataframe.

my_df = @orderby(df, :eGFR)
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 55 58.715 20.427 1.2121 1 19.026
2 77 71.791 42.516 1.61 1 25.018
3 98 61.124 27.529 1.0137 0 25.286
4 8 39.897 26.452 1.0817 0 28.899
5 79 52.219 26.453 1.102 1 29.265

Most of these macros have both the “function-like” syntax used above, where the arguments are all enclosed in parentheses, or a form where the parentheses are omitted and arguments are separated by spaces. The example above for instance, could have been written:

my_df = @orderby df :eGFR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 55 58.715 20.427 1.2121 1 19.026
2 77 71.791 42.516 1.61 1 25.018
3 98 61.124 27.529 1.0137 0 25.286
4 8 39.897 26.452 1.0817 0 28.899
5 79 52.219 26.453 1.102 1 29.265
my_df = @orderby(df, :ISMALE, :eGFR) # sort first on ISMALE column, then on eGFR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 98 61.124 27.529 1.0137 0 25.286
2 8 39.897 26.452 1.0817 0 28.899
3 9 54.975 26.931 0.90926 0 29.731
4 37 32.226 27.841 1.1696 0 30.288
5 6 53.758 50.819 1.6769 0 30.855

Again, we could instead use the non-function (no parentheses) form, where arguments are separated by spaces rather than commas:

my_df = @orderby df :ISMALE :eGFR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 98 61.124 27.529 1.0137 0 25.286
2 8 39.897 26.452 1.0817 0 28.899
3 9 54.975 26.931 0.90926 0 29.731
4 37 32.226 27.841 1.1696 0 30.288
5 6 53.758 50.819 1.6769 0 30.855
Note

As less is more, we’ll use the @ macro calls from DataFramesMeta.jl without parentheses from now on.

Tip

If the number of arguments gets too long, we might want to separate them on additional lines. With the functional form, as with any Julia function, we can put line breaks between arguments, eg

@orderby(df, :ISMALE, :eGFR)

but with the non-function form, we need to use a begin block, eg:

@orderby df begin
    :ISMALE
    :eGFR
end

Use a - before the column name to sort in reverse

my_df = @orderby df :ISMALE -:eGFR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 80 45.822 98.223 0.8463 0 129.04
2 12 48.539 113.91 1.0889 0 112.95
3 16 32.41 70.069 0.84661 0 105.12
4 39 44.668 65.805 0.76055 0 97.378
5 47 41.047 93.403 1.1259 0 96.911

You can also pass a custom order (in the form of a vector of integers) to @orderby. So, the same reverse ordering can be accomplished with:

my_df = @orderby df :ISMALE sortperm(:eGFR, rev = true)
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 70 21.981 56.093 0.97028 0 80.548
2 19 40.826 74.732 1.1407 0 76.702
3 11 38.603 76.539 0.9541 0 96.028
4 24 33.338 58.429 1.2144 0 60.586
5 5 33.559 47.139 1.5924 0 37.198

Since sortperm() can be given custom ordering functions, you have almost infinite flexibility. See the documentation for sortperm in the “live docs” for more information

3.2@select and @transform

The @select macro can also be straightforward. Here, we just pull out the :ID and :AGE columns.

my_df = @select df :ID :AGE
first(my_df, 5)
5×2 DataFrame
Row ID AGE
Int64 Float64
1 1 34.823
2 2 32.765
3 3 35.974
4 4 38.206
5 5 33.559

We can also use a number of special selectors provided by DataFrames.jl.

  • Not(): Selects all columns except those specified (can be supplied either a single Symbol or a vector of Symbols)
  • Between(): takes two arguments, selects those two columns plus any columns between them.
  • Cols(): allows combining other selectors
  • Regular expressions: selects columns that match a regular expression. “Regex” is far too complicated to go into here - the short version is that it’s a way of specifying string matches. See the string tutorial for a bit more info.
  • All(): just like you think - selects all columns. Why? Well, you can’t get the same column back twice, so this is most often used like “all remaining” in case you want to use the other selectors to move some columns to the beginning.

For reasons that aren’t worth going into (internal “under the hood” stuff), if we’re using these selectors we need to wrap them in $() to make them work in the macros supplied by DataFramesMeta.jl.

Let’s look at some examples.

propertynames(df) # for reference
6-element Vector{Symbol}:
 :ID
 :AGE
 :WEIGHT
 :SCR
 :ISMALE
 :eGFR
(@select df $(Not(:ID))) |> propertynames
5-element Vector{Symbol}:
 :AGE
 :WEIGHT
 :SCR
 :ISMALE
 :eGFR
(@select df $(Between(:SCR, :eGFR))) |> propertynames
3-element Vector{Symbol}:
 :SCR
 :ISMALE
 :eGFR

We can also pass multiple arguments, each one of which will be evaluated in sequence. For example:

(@select df begin
    :ID # get :ID
    $(Between(:SCR, :eGFR)) # followed by things between :SCR and :eGFR
end) |> propertynames
4-element Vector{Symbol}:
 :ID
 :SCR
 :ISMALE
 :eGFR

The following regular expression matches every column that has a capital E:

(@select df $(r"E")) |> propertynames
3-element Vector{Symbol}:
 :AGE
 :WEIGHT
 :ISMALE

Finally, let’s look at combining these selectors:

(@select df begin
    $(Cols(:ID, r"E"))
    $(Between(:SCR, :eGFR))
end) |> propertynames
6-element Vector{Symbol}:
 :ID
 :AGE
 :WEIGHT
 :ISMALE
 :SCR
 :eGFR
Note

Did you notice that :ISMALE is not between :SCR and :eGFR? In this example, it was already “consumed” by the regular expression, so it is not repeated.

3.2.1 Using @select to make changes

But what if we want to do a bit of transformation? For example, it looks like the :AGE column is in years - let’s convert it to :AGE_in_months. To do this, we have to multiply each value by 12.

Caution

In R, one can apply many operations to vectors, and it implicitly applies that operation to each element. Eg.

r$> v = c(1,2,3)

r$> v * 12
[1] 12 24 36

r$> v^2
[1] 1 4 9

In general, Julia tries to avoid DWIM (do what I mean) patterns, when the output could be ambiguous. For multiplying a vector by a scalar value, this is unambiguous, mathematically

julia> v = [1, 2, 3];

julia> v * 12
3-element Vector{Int64}:
 12
 24
 36

But other operations are ambiguous so Julia throws an error instead of guessing what you mean.

julia> v^2
ERROR: MethodError: no method matching ^(::Vector{Int64}, ::Int64)

If you want to apply an operation element-wise, you must “broadcast” it, which in Julia has a convenient dot syntax:

julia> v .^ 2
3-element Vector{Int64}:
 1
 4
 9

Here, we use @select as before, but we add a new column, assigned to the value of the :AGE column multiplied by 12:

# broadcasting here (.*) is not strictly necessary, but is a good habit to get into
my_df = @select df :ID :AGE_in_months = :AGE .* 12
first(my_df, 5)
5×2 DataFrame
Row ID AGE_in_months
Int64 Float64
1 1 417.876
2 2 393.18
3 3 431.688
4 4 458.472
5 5 402.708

What if our transformation is a bit more complicated? For example, let’s get the z-scores of the :eGFR column - in other words, we want to transform the data to be normally distributed with a mean of zero and a standard deviation of 1.

Currently, the :eGFR column distribution looks like this:

using CairoMakie
hist(df.eGFR)

A z transform (also called “standard score”) involves taking each value, subtracting the mean of the sample, and dividing by the standard deviation of the sample. So, for a sample \[x_i\],

\[z_i = \frac{x_i - \bar{x}}{\sigma_x}\]

Where \(z_i\) is the z-score for element \(i\), \(x_i\) is the measured value, \(\bar{x}\) is the mean of the sample, and \(\sigma_x\) is the standard deviation of the sample.

Or, in Julia code:

function zscore(v)
    μ = mean(v)
    σ = std(v)
    return [(x - μ) / σ for x in v]
end
zscore (generic function with 1 method)

We can now use this function in the @select macro.

df_z = @select df :ID :eGFR_z = zscore(:eGFR)
first(df_z, 5)
5×2 DataFrame
Row ID eGFR_z
Int64 Float64
1 1 -0.645386
2 2 2.836
3 3 -0.380372
4 4 -0.799943
5 5 -0.87244
Note

We didn’t need to broadcast (“vectorize”) this call with zscore.(:eGFR), since the function was written to operate on the whole column. This is why it’s useful to be able to use both column-wise and row-wise operations, depending on what makes the most sense.

hist(df_z.eGFR_z)

It’s often inconvenient to have to write out a separate function, and then call it in the context of a macro. Because of this, DataFramesMeta.jl macros allow us to substitute complex expressions using begin blocks.

So, to accomplish the same result as above, we could instead write:

my_df = @select df begin
    :ID
    # whitespace in Julia mostly doesn't matter, so we can break this up
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        return [(x - μ) / σ for x in :eGFR]
    end
end
first(my_df, 5)
5×2 DataFrame
Row ID eGFR_z
Int64 Float64
1 1 -0.645386
2 2 2.836
3 3 -0.380372
4 4 -0.799943
5 5 -0.87244

3.2.2 @transform vs @select

The @transform macro is essentially identical to @select, except that @select only returns the columns that you explicitly ask for, while @transform returns all columns from the original DataFrame, as well as any new columns you create.

Here’s exactly the same command we just ran, using @transform instead of @select (notice we can also leave out the :ID):

my_df = @transform df begin
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        [(x - μ) / σ for x in :eGFR]
    end
end
first(my_df, 5)
5×7 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR eGFR_z
Int64 Float64 Float64 Float64 Int64 Float64 Float64
1 1 34.823 38.212 1.1129 0 42.635 -0.645386
2 2 32.765 74.838 0.8846 1 126.0 2.836
3 3 35.974 37.303 1.1004 1 48.981 -0.380372
4 4 38.206 32.969 1.1972 1 38.934 -0.799943
5 5 33.559 47.139 1.5924 0 37.198 -0.87244

3.2.3 @byrow

We have already seen that we can use the @r... versions of macros to use row-wise operations. But sometimes, we want to combine operations that require columns with those that would make more sense as row-rise.

One solution to this is to use broadcasting (eg .>), but we can also use the @byrow macro within other macro calls. For example, suppose we want to combine the eGFR_z operation above with the AGE_in_months operation we performed earlier:

my_df = @transform df begin
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        [(x - μ) / σ for x in :eGFR]
    end
    @byrow :AGE_in_months = :AGE * 12
end
first(my_df, 5)
5×8 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR eGFR_z AGE_in_months
Int64 Float64 Float64 Float64 Int64 Float64 Float64 Float64
1 1 34.823 38.212 1.1129 0 42.635 -0.645386 417.876
2 2 32.765 74.838 0.8846 1 126.0 2.836 393.18
3 3 35.974 37.303 1.1004 1 48.981 -0.380372 431.688
4 4 38.206 32.969 1.1972 1 38.934 -0.799943 458.472
5 5 33.559 47.139 1.5924 0 37.198 -0.87244 402.708

3.3 🏷️ @rename

The @rename macro renames columns in a DataFrame using the syntax :new = :old. Here’s an example:

my_df = @rename df :serum_creatinine = :SCR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT serum_creatinine ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198

Like other macros, @rename can be used in both multi-argument and “block” format:

my_df = @rename df begin
    :age = :AGE
    :serum_creatinine = :SCR
end
first(my_df, 5)
5×6 DataFrame
Row ID age WEIGHT serum_creatinine ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198

To rename columns to names with spaces you’ll need to escape it with $"...":

my_df = @rename df $"Serum Creatinine" = :SCR
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT Serum Creatinine ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198
Tip

You can use instead a function to rename all columns with the DataFrame.jlrename function. It has an alternate method that takes a function as the first positional argument followed by a DataFrame. It also accepts a custom user-defined function.

Check some examples below.

my_df = rename(lowercase, df)
first(my_df, 5)
5×6 DataFrame
Row id age weight scr ismale egfr
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198
my_df = rename((s -> replace(s, 'A' => 'X')), df)
first(my_df, 5)
5×6 DataFrame
Row ID XGE WEIGHT SCR ISMXLE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 1 34.823 38.212 1.1129 0 42.635
2 2 32.765 74.838 0.8846 1 126.0
3 3 35.974 37.303 1.1004 1 48.981
4 4 38.206 32.969 1.1972 1 38.934
5 5 33.559 47.139 1.5924 0 37.198

3.4 ⬜️ ❎ @subset

The @subset macro is used for selecting rows that have some property, as determined by a boolean operation - that is, something that returns true or false. This operation is often called the “predicate”.

For example, suppose we want to keep rows where the :SCR is greater than the mean :SCR score:

my_df = @subset df :SCR .> mean(:SCR)
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 4 38.206 32.969 1.1972 1 38.934
2 5 33.559 47.139 1.5924 0 37.198
3 6 53.758 50.819 1.6769 0 30.855
4 13 28.818 65.829 1.3689 0 63.118
5 15 57.703 37.54 1.3098 1 32.758
Note

Recall that we need to use the “broadcasted” >, so that it’s applied to each element. Here, we need the whole column for the mean function, but if our operation only needs individual rows, we could have used the @rsubset macro and skipped the broadcast, eg

my_df = @rsubset(df, :SCR > 0.5)
first(my_df, 5)

We can also provide multiple predicates, which are calculated separately, and only rows where each predicate is true will be returned.

Like @select, we can either provide predicates as arguments in a function form, or use a begin block where each line represents one predicate. In other words, the following examples are the same:

my_df = @subset df :SCR .> mean(:SCR) :eGFR .< median(:eGFR);
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 4 38.206 32.969 1.1972 1 38.934
2 5 33.559 47.139 1.5924 0 37.198
3 6 53.758 50.819 1.6769 0 30.855
4 15 57.703 37.54 1.3098 1 32.758
5 18 48.453 59.107 1.1961 0 53.408
my_df = @subset df begin
    :SCR .> mean(:SCR)
    :eGFR .< median(:eGFR)
end;
first(my_df, 5)
5×6 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR
Int64 Float64 Float64 Float64 Int64 Float64
1 4 38.206 32.969 1.1972 1 38.934
2 5 33.559 47.139 1.5924 0 37.198
3 6 53.758 50.819 1.6769 0 30.855
4 15 57.703 37.54 1.3098 1 32.758
5 18 48.453 59.107 1.1961 0 53.408

4 🪄 DataFramesMeta.jl advanced operations with @astable

With the @astable macro we can perform advanced operations. The limitations of the @select and @transform families of macros are that they cannot operate on multiple columns at once.

For example, suppose you want to take the log of :AGE and add to the :WEIGHT converted to pounds. And you want something more readable inside a begin ... end block. This is not supported by just using @select or @transform macro families:

@rtransform df begin
    AGE_log = log(:AGE)
    WEIGHT_lbs = :WEIGHT * 2.2
    :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end
┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :AGE_log = ... instead.
└ @ DataFramesMeta ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/parsing.jl:387
┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :WEIGHT_lbs = ... instead.
└ @ DataFramesMeta ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/parsing.jl:387
UndefVarError: UndefVarError(:AGE_log)
UndefVarError: `AGE_log` not defined
Stacktrace:
  [1] (::var"#34#36")()
    @ Main.Notebook ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/parsing.jl:303
  [2] #560
    @ ./boot.jl:0 [inlined]
  [3] iterate
    @ ./generator.jl:47 [inlined]
  [4] collect(itr::Base.Generator{UnitRange{Int64}, DataFrames.var"#560#561"{var"#34#36"}})
    @ Base ./array.jl:834
  [5] _empty_selector_helper(fun::var"#34#36", len::Int64)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:586
  [6] _transformation_helper(df::DataFrame, col_idx::Vector{Int64}, ::Base.RefValue{Any})
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:591
  [7] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
  [8] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
  [9] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
 [10] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
 [11] transform(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383
 [12] top-level scope
    @ ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/Ga6VV/src/macros.jl:1599

In order to perform such an operation you’ll need to use the @astable macro inside the @select/@transform macro after the DataFrame:

my_df = @rtransform df @astable begin
    AGE_log = log(:AGE)
    WEIGHT_lbs = :WEIGHT * 2.2
    :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end
first(my_df, 5)
5×7 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR AGE_log_WEIGHT_lbs
Int64 Float64 Float64 Float64 Int64 Float64 Float64
1 1 34.823 38.212 1.1129 0 42.635 87.6167
2 2 32.765 74.838 0.8846 1 126.0 168.133
3 3 35.974 37.303 1.1004 1 48.981 85.6494
4 4 38.206 32.969 1.1972 1 38.934 76.1748
5 5 33.559 47.139 1.5924 0 37.198 107.219
Tip

You can use @astable inside any DataFramesMeta.jl macro if you would like to do operations on multiple columns at once.

5 ⊗ Operations on groups

There are many times when we want to apply operations within groups, rather than on the whole DataFrame.

For example, suppose we want to calculate z-scores for :WEIGHT, but we know that this variable will be affected by the sex of the individuals. We therefore want our mean and standard deviation calculations to happen separately for men and for women.

We could make a men_df and women_df using @subset, then calculate the values separately, but that would be inconvenient. Instead, we can use groupby to create a GroupedDataFrame.

gdf = groupby(df, :ISMALE)
typeof(gdf)
GroupedDataFrame{DataFrame}

Now, we can use the same operations as before. DataFramesMeta will apply our @select, @transform, and @subset operations within groups, then return a new DataFrame.

my_df = wz = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
first(my_df, 5)
5×7 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR WEIGHT_z
Int64 Float64 Float64 Float64 Int64 Float64 Float64
1 1 34.823 38.212 1.1129 0 42.635 -0.807368
2 2 32.765 74.838 0.8846 1 126.0 1.57251
3 3 35.974 37.303 1.1004 1 48.981 -0.81391
4 4 38.206 32.969 1.1972 1 38.934 -1.08946
5 5 33.559 47.139 1.5924 0 37.198 -0.334861
fig = Figure()
not_tr = Axis(fig[1, 1], title = "Not transformed", xlabel = "Weight")
tr = Axis(fig[1, 2], title = "Z-score", xlabel = "Z-score")
w = hist!(not_tr, @rsubset(wz, :ISMALE == 0).WEIGHT)
m = hist!(not_tr, @rsubset(wz, :ISMALE == 1).WEIGHT)
hist!(tr, @rsubset(wz, :ISMALE == 0).WEIGHT_z)
hist!(tr, @rsubset(wz, :ISMALE == 1).WEIGHT_z)
Legend(
    fig[2, 1:2],
    [w, m],
    ["women", "men"],
    orientation = :horizontal,
    tellwidth = false,
    tellheight = true,
)
fig

Well, that was unexpected…

5.1 ⛓️ @chain

Because we so often want to apply this pattern - grouping on one column, then performing operations on groups, DataFramesMeta provides the @chain macro from Chain.jl, in which each line operates on the line before it, similar to stringing together operations with %>% in R.

In particular, each line implicitly takes the result of the previous line as the first argument. In other words, what we previously accomplished with

df
gdf = groupby(df, :ISMALE)
my_df = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
first(my_df, 5)

becomes:

my_df = @chain df begin
    groupby(:ISMALE)
    @transform :WEIGHT_z = zscore(:WEIGHT)
end
first(my_df, 5)
5×7 DataFrame
Row ID AGE WEIGHT SCR ISMALE eGFR WEIGHT_z
Int64 Float64 Float64 Float64 Int64 Float64 Float64
1 1 34.823 38.212 1.1129 0 42.635 -0.807368
2 2 32.765 74.838 0.8846 1 126.0 1.57251
3 3 35.974 37.303 1.1004 1 48.981 -0.81391
4 4 38.206 32.969 1.1972 1 38.934 -1.08946
5 5 33.559 47.139 1.5924 0 37.198 -0.334861

5.2 🧑‍🍳 @combine

Suppose that we want to aggregate or summarize columns after grouping them, for example, getting summary statistics, or counting the number of rows with particular values. This can be accomplished with the @combine macro.

For the next example, remember that gdf is our original DataFrame grouped on the :ISMALE column.

my_df = @combine gdf begin
    :AGE_μ = mean(:AGE)
    :WEIGHT_μ = mean(:WEIGHT)
    :total = length(:ID)
    :high_eGFR = count(>(80), :eGFR)
end
first(my_df, 5)
2×5 DataFrame
Row ISMALE AGE_μ WEIGHT_μ total high_eGFR
Int64 Float64 Float64 Int64 Int64
1 0 41.2514 53.4655 53 10
2 1 44.9605 50.1047 47 9
Tip

DataFramesMeta.jl has also a macro that fuses together the groupby + @combine operations.

This is the @by macro and its syntax is:

@by DataFrame :group_col combine_operations

where :group_col can be either a single column or a vector of columns for which the data would be grouped by.

See an example below:

my_df = @by df :ISMALE :AGE_μ = mean(:AGE) :WEIGHT_μ = mean(:WEIGHT)
first(my_df, 5)
2×3 DataFrame
Row ISMALE AGE_μ WEIGHT_μ
Int64 Float64 Float64
1 0 41.2514 53.4655
2 1 44.9605 50.1047