Manipulating Tables with `DataFramesMeta.jl`

Authors

Jose Storopoli

Kevin Bonham

Juan Oneto

You now have a handle on basic operations in Julia, and how to read data into a DataFrame from CSV files, Excel files, or SAS files. Now it’s time to actually DO something with that data.

using DataFrames
using DataFramesMeta
using PharmaDatasets

1 💾 Basic `DataFrames.jl` Functionality

First, let’s read in some data. Using PharmaDatasets.jl, we can load the demographics_1 dataset as df:

df = dataset("demographics_1")
first(df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

Let’s get some basic information about each of the columns. We’ll do this in a number of ways to get a feel for it.

1.1 📦️ ↗️ A DataFrame is a Matrix, columns are vectors

Just like a Matrix in Julia, the contents of a DataFrame can be accessed with a pair of indices (see Tutorial 1 - Getting started with Julia) - the first index is for the row, and the second index is for the column. Column indices can either be an integer (for the column number) or the column name.

df[1, 3] # first row, 3rd column

38.212

df[5, "SCR"] # fifth row, column "SCR"

1.5924

Also like matrices, we can select “slices” of multiple rows / columns.

df[10:20, ["ID", "WEIGHT"]]

11×2 DataFrame

Row	ID	WEIGHT
	Int64	Float64
1	10	50.878
2	11	76.539
3	12	113.91
4	13	65.829
5	14	63.769
6	15	37.54
7	16	70.069
8	17	49.636
9	18	59.107
10	19	74.732
11	20	56.43

typeof(df[10:20, ["ID", "WEIGHT"]])

DataFrame

Notice that, when selecting multiple rows and columns, the return value is itself a DataFrame.

Unlike the Julia Matrix type, which has a single parameter for the types of its elements, a DataFrame is essentially a collection of Vectors, each of which can have their own type.

Accordingly, selecting a single column of a DataFrame returns a Vector rather than a DataFrame:

typeof(df[:, "WEIGHT"])

Vector{Float64} (alias for Array{Float64, 1})

typeof(df[:, "ISMALE"])

Vector{Int64} (alias for Array{Int64, 1})

DataFrames also have a convenient syntax for selecting an entire column, similar to the df$col_name syntax in R. In Julia, we instead use a .:

vec = df.WEIGHT
first(vec, 5)

5-element Vector{Float64}:
 38.212
 74.838
 37.303
 32.969
 47.139

Tip

Actually, there’s one more way to select a whole column - df[!, :WEIGHT]. The difference between using df[!, index] and df[:, index] is that the latter returns a copy of the column, while the former returns the actual underlying object. This can often be better for performance, and it means you can do things like change values, but it also makes it possible to corrupt the underlying object. This is why it uses a !, to let you know that it’s potentially dangerous.

For example, you could do pop!(df[!, :ID]), and then the first column would have fewer rows than the others, violating the expectation that all columns have the same number of rows.

Using the “dot” syntax is identical to using the ! form of indexing, so changes made to df.WEIGHT will be reflected in the parent DataFrame, for good or ill.
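To make the copy / no-copy difference concrete, here is a minimal sketch. It assumes a made-up three-row table called toy (not the tutorial’s dataset), and the variable names are only for illustration.

```julia
using DataFrames

# Hypothetical toy table, not the demographics data used in this tutorial
toy = DataFrame(ID = [1, 2, 3], WEIGHT = [38.2, 74.8, 37.3])

copied = toy[:, :WEIGHT]    # `:` returns a fresh copy of the column
copied[1] = 0.0             # only the copy changes
weight_after_copy_edit = toy.WEIGHT[1]

aliased = toy[!, :WEIGHT]   # `!` returns the stored vector itself
aliased[1] = 0.0            # this edit is visible through `toy`
weight_after_alias_edit = toy.WEIGHT[1]

same_object = aliased === toy.WEIGHT   # dot syntax behaves like `!`
```

After running this, weight_after_copy_edit is still the original value, while weight_after_alias_edit is 0.0, because the second edit went straight into the table.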

Any function that operates on a vector can therefore be easily used on DataFrame columns.

sum(df.WEIGHT)

5188.59

using Statistics

mean(df.eGFR)

58.089360000000006

The describe() function provides a convenient shorthand to get summary information about the columns in your DataFrame.

describe(df)

6×7 DataFrame

Row	variable	mean	min	median	max	nmissing	eltype
	Symbol	Float64	Real	Float64	Real	Int64	DataType
1	ID	50.5	1	50.5	100	0	Int64
2	AGE	42.9947	19.187	41.2735	79.292	0	Float64
3	WEIGHT	51.8859	20.427	49.7975	113.91	0	Float64
4	SCR	1.14298	0.76055	1.1141	1.7202	0	Float64
5	ISMALE	0.47	0	0.0	1	0	Int64
6	eGFR	58.0894	19.026	53.8285	129.04	0	Float64

The default information is quite handy, but you can also tailor the output to suit your needs. For example:

describe(df, :median, :nunique, cols = ["ID", "AGE"])

2×3 DataFrame

Row	variable	median	nunique
	Symbol	Float64	Nothing
1	ID	50.5
2	AGE	41.2735

You can even create your own function, which takes a column as an argument and calculates whatever summary you like.

function my_stat(col)
    count(>(50), col) # returns the number of entries that are greater than 50
end

my_stat (generic function with 1 method)

describe(df, :median, :mean, my_stat => "gt50")

6×4 DataFrame

Row	variable	median	mean	gt50
	Symbol	Float64	Float64	Int64
1	ID	50.5	50.5	50
2	AGE	41.2735	42.9947	23
3	WEIGHT	49.7975	51.8859	49
4	SCR	1.1141	1.14298	0
5	ISMALE	0.0	0.47	0
6	eGFR	53.8285	58.0894	58

Tip

In situations like this, you can also use Julia’s “anonymous” function (sometimes called “lambda” function) syntax. Wherever a function can be passed as an argument, you can write arg -> action, where arg is the thing being passed as an argument, and action is whatever the function should do.

For example, instead of writing out a definition for my_stat ahead of time, we could have written:

describe(df, :median, :mean, (col -> count(>(50), col)) => "gt50")
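Here is a small sketch of the same idea outside of describe, using only the first five WEIGHT values shown at the top of this tutorial. The names my_stat_anon, n_named, n_anon and n_inline are just for illustration.

```julia
# First five WEIGHT values from the table printed earlier
weights = [38.212, 74.838, 37.303, 32.969, 47.139]

my_stat(col) = count(>(50), col)          # a named function
my_stat_anon = col -> count(>(50), col)   # the same thing as an anonymous function

n_named = my_stat(weights)
n_anon = my_stat_anon(weights)

# `>(50)` is itself shorthand for the anonymous function `x -> x > 50`
n_inline = count(x -> x > 50, weights)
```

All three counts agree, since they describe the same computation in different notation.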

2 🔮 Coming from `dplyr`

Alright, now that we’ve got the basics, let’s do something more complicated. You might be familiar with many of these operations for selecting, summarizing, and transforming tables using dplyr in R. Conceptually, there’s very little difference, though some of the verbs and syntax are a little different.

action	dplyr	DataFrames.jl	DataFramesMeta.jl
keep rows matching criteria	`filter`	`subset[!]`	`@[r]subset[!]`
generate reduction / summary	`summarize`	`combine`	`@combine`
add new columns based on some function	`mutate`	`transform[!]`	`@[r]transform[!]`
select columns by name	`select`	`select[!]`	`@[r]select[!]`
rename columns	`rename`	`rename[!]`	`@rename[!]`
reorder rows	`arrange`	`sort[!]`	`@orderby`

Each of these functions exists in the base DataFrames.jl package, using a special “mini language”, but we’re going to use the DataFramesMeta.jl package, which provides a number of “macros” that wrap them in a friendlier syntax. Macros rewrite the code you type before Julia runs it, which can make many operations more concise.

2.1 📜 Macro conventions

In Julia, macros always start with the @ character. The DataFramesMeta.jl macros help to manipulate DataFrames in a way that may be familiar from dplyr, but they keep the DataFrames.jl verbs from the table above. Namely, the macros are @subset, @combine, @transform, @select, and @rename. Only sorting has a different verb: @orderby.

In keeping with the conventions from DataFrames.jl, these macros work on entire columns, but they also have a “row-wise” version, which starts with an r (eg @rsubset is the row-wise @subset). Finally, by default, these macros are non-mutating - that is, they will return a copy of the DataFrame with the operation applied, but will leave the original unchanged. Nevertheless, each macro also has an “in-place” version, that will mutate the original. As is the convention in Julia for mutating functions, these macros end with ! (eg @subset! and @rsubset!).

Sometimes, we will refer to these macros using notation such as “@[r]subset[!]”, which just refers to all of the different flavors of this operation.
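As a quick illustration of these flavors, here is a minimal sketch on a made-up four-row table (toy and its values are assumptions for the example, not part of the tutorial’s data).

```julia
using DataFrames, DataFramesMeta

# Hypothetical toy table
toy = DataFrame(ID = 1:4, SCR = [1.1, 0.9, 1.6, 1.2])

col_wise = @subset toy :SCR .> 1.0   # works on the whole column, so we broadcast
row_wise = @rsubset toy :SCR > 1.0   # works one row at a time, no broadcast needed
rows_before = nrow(toy)               # the non-mutating forms leave `toy` untouched

@rsubset! toy :SCR > 1.0              # the `!` form removes rows from `toy` itself
rows_after = nrow(toy)
```

The column-wise and row-wise calls return the same rows, and only the ! version changes toy.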

OK, enough talking - let’s get to some data manipulations!

3 ℹ️ `DataFramesMeta.jl` macro basics

Each macro from DataFramesMeta.jl takes a DataFrame (or a GroupedDataFrame - we’ll get to those later) as the first argument. Other arguments depend on the operation, and we use Symbols to refer to columns.

Tip

A Symbol starts with :, eg :ID. The details of why we use symbols aren’t super important, but it’s worth knowing that, when making a symbol, you can’t start with a number, and can’t have spaces. If your columns have those features, you can make a symbol from a string by doing eg. Symbol("Col with a Space").
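A short sketch of the two ways to build a Symbol, using only base Julia (the variable names are just for illustration):

```julia
plain = :ID                           # literal syntax
from_string = Symbol("ID")            # the same Symbol, built from a string
spaced = Symbol("Col with a Space")   # names the literal syntax cannot express

is_symbol = plain isa Symbol
same_symbol = from_string === plain
back_to_string = string(spaced)
```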

3.1 🔃 `@orderby`

Probably the most straightforward operation is @orderby, which takes a column or columns that should be used to sort the rows of the dataframe.

my_df = @orderby(df, :eGFR)
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	55	58.715	20.427	1.2121	1	19.026
2	77	71.791	42.516	1.61	1	25.018
3	98	61.124	27.529	1.0137	0	25.286
4	8	39.897	26.452	1.0817	0	28.899
5	79	52.219	26.453	1.102	1	29.265

Most of these macros support both the “function-like” syntax used above, where the arguments are all enclosed in parentheses, and a form where the parentheses are omitted and arguments are separated by spaces. The example above, for instance, could have been written:

my_df = @orderby df :eGFR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	55	58.715	20.427	1.2121	1	19.026
2	77	71.791	42.516	1.61	1	25.018
3	98	61.124	27.529	1.0137	0	25.286
4	8	39.897	26.452	1.0817	0	28.899
5	79	52.219	26.453	1.102	1	29.265

my_df = @orderby(df, :ISMALE, :eGFR) # sort first on ISMALE column, then on eGFR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	98	61.124	27.529	1.0137	0	25.286
2	8	39.897	26.452	1.0817	0	28.899
3	9	54.975	26.931	0.90926	0	29.731
4	37	32.226	27.841	1.1696	0	30.288
5	6	53.758	50.819	1.6769	0	30.855

Again, we could instead use the non-function (no parentheses) form, where arguments are separated by spaces rather than commas:

my_df = @orderby df :ISMALE :eGFR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	98	61.124	27.529	1.0137	0	25.286
2	8	39.897	26.452	1.0817	0	28.899
3	9	54.975	26.931	0.90926	0	29.731
4	37	32.226	27.841	1.1696	0	30.288
5	6	53.758	50.819	1.6769	0	30.855

Note

As less is more, we’ll use the @ macro calls from DataFramesMeta.jl without parentheses from now on.

Tip

If the number of arguments gets too long, we might want to separate them on additional lines. With the functional form, as with any Julia function, we can put line breaks between arguments, eg

@orderby(
    df,
    :ISMALE,
    :eGFR
)

but with the non-function form, we need to use a begin block, eg:

@orderby df begin
    :ISMALE
    :eGFR
end

Use a - before the column name to sort in reverse order:

my_df = @orderby df :ISMALE -:eGFR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	80	45.822	98.223	0.8463	0	129.04
2	12	48.539	113.91	1.0889	0	112.95
3	16	32.41	70.069	0.84661	0	105.12
4	39	44.668	65.805	0.76055	0	97.378
5	47	41.047	93.403	1.1259	0	96.911

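You can also pass an expression that computes a vector of sort keys to @orderby; rows are then ordered by those values. Be careful with sortperm, though: it returns a permutation of row indices rather than each row’s rank, so passing its result directly does not reproduce the reverse ordering above, as the output shows: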

my_df = @orderby df :ISMALE sortperm(:eGFR, rev = true)
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	70	21.981	56.093	0.97028	0	80.548
2	19	40.826	74.732	1.1407	0	76.702
3	11	38.603	76.539	0.9541	0	96.028
4	24	33.338	58.429	1.2144	0	60.586
5	5	33.559	47.139	1.5924	0	37.198

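To order rows the way sortperm would, convert the permutation into ranks with invperm, eg invperm(sortperm(:eGFR, rev = true)). Since sortperm() can be given custom ordering functions, this gives you almost infinite flexibility. See the documentation for sortperm in the “live docs” for more information.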
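To see how a permutation from sortperm relates to sort keys, here is a sketch on a plain vector, using the first four eGFR values from the table at the top of the tutorial. It uses only base Julia; the variable names are just for illustration.

```julia
egfr = [42.635, 126.0, 48.981, 38.934]

# `perm[k]` is the index of the k-th largest value
perm = sortperm(egfr, rev = true)

# `ranks[i]` is the position of element i in the descending order
ranks = invperm(perm)

sorted_by_perm = egfr[perm]            # indexing with the permutation sorts the data
sorted_by_ranks = egfr[sortperm(ranks)] # sorting on the ranks gives the same order
```

Here perm is [2, 3, 1, 4] while ranks is [3, 1, 2, 4]: they are different vectors, and only the ranks work as per-row sort keys.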

3.2 ❕ `@select` and `@transform`

The @select macro can also be straightforward. Here, we just pull out the :ID and :AGE columns.

my_df = @select df :ID :AGE
first(my_df, 5)

5×2 DataFrame

Row	ID	AGE
	Int64	Float64
1	1	34.823
2	2	32.765
3	3	35.974
4	4	38.206
5	5	33.559

We can also use a number of special selectors provided by DataFrames.jl.

Not(): Selects all columns except those specified (can be supplied either a single Symbol or a vector of Symbols)
Between(): takes two arguments, selects those two columns plus any columns between them.
Cols(): allows combining other selectors
Regular expressions: selects columns that match a regular expression. “Regex” is far too complicated to go into here - the short version is that it’s a way of specifying string matches. See the string tutorial for a bit more info.
All(): just like you think - selects all columns. Why? Well, you can’t get the same column back twice, so this is most often used like “all remaining” in case you want to use the other selectors to move some columns to the beginning.

For reasons that aren’t worth going into (internal “under the hood” stuff), if we’re using these selectors we need to wrap them in $() to make them work in the macros supplied by DataFramesMeta.jl.

Let’s look at some examples.

propertynames(df) # for reference

6-element Vector{Symbol}:
 :ID
 :AGE
 :WEIGHT
 :SCR
 :ISMALE
 :eGFR

(@select df $(Not(:ID))) |> propertynames

5-element Vector{Symbol}:
 :AGE
 :WEIGHT
 :SCR
 :ISMALE
 :eGFR

(@select df $(Between(:SCR, :eGFR))) |> propertynames

3-element Vector{Symbol}:
 :SCR
 :ISMALE
 :eGFR

We can also pass multiple arguments, each one of which will be evaluated in sequence. For example:

(@select df begin
    :ID # get :ID
    $(Between(:SCR, :eGFR)) # followed by things between :SCR and :eGFR
end) |> propertynames

4-element Vector{Symbol}:
 :ID
 :SCR
 :ISMALE
 :eGFR

The following regular expression matches every column that has a capital E:

(@select df $(r"E")) |> propertynames

3-element Vector{Symbol}:
 :AGE
 :WEIGHT
 :ISMALE

Finally, let’s look at combining these selectors:

(@select df begin
    $(Cols(:ID, r"E"))
    $(Between(:SCR, :eGFR))
end) |> propertynames

6-element Vector{Symbol}:
 :ID
 :AGE
 :WEIGHT
 :ISMALE
 :SCR
 :eGFR

Note

Did you notice that, in the output, :ISMALE no longer sits between :SCR and :eGFR? In this example, it was already “consumed” by the regular expression, so it is not repeated.

3.2.1 Using `@select` to make changes

But what if we want to do a bit of transformation? For example, it looks like the :AGE column is in years - let’s convert it to :AGE_in_months. To do this, we have to multiply each value by 12.

Caution

In R, one can apply many operations to vectors, and it implicitly applies that operation to each element. Eg.

r$> v = c(1,2,3)

r$> v * 12
[1] 12 24 36

r$> v^2
[1] 1 4 9

In general, Julia tries to avoid DWIM (do what I mean) patterns when the output could be ambiguous. Multiplying a vector by a scalar value is mathematically unambiguous, so it just works:

julia> v = [1, 2, 3];

julia> v * 12
3-element Vector{Int64}:
 12
 24
 36

But other operations are ambiguous so Julia throws an error instead of guessing what you mean.

julia> v^2
ERROR: MethodError: no method matching ^(::Vector{Int64}, ::Int64)

If you want to apply an operation element-wise, you must “broadcast” it, which in Julia has a convenient dot syntax:

julia> v .^ 2
3-element Vector{Int64}:
 1
 4
 9

Here, we use @select as before, but we add a new column, assigned to the value of the :AGE column multiplied by 12:

# broadcasting here (.*) is not strictly necessary, but is a good habit to get into
my_df = @select df :ID :AGE_in_months = :AGE .* 12
first(my_df, 5)

5×2 DataFrame

Row	ID	AGE_in_months
	Int64	Float64
1	1	417.876
2	2	393.18
3	3	431.688
4	4	458.472
5	5	402.708

What if our transformation is a bit more complicated? For example, let’s get the z-scores of the :eGFR column - in other words, we want to rescale the data to have a mean of zero and a standard deviation of 1. (This shifts and scales the values, but does not change the shape of their distribution.)

Currently, the :eGFR column distribution looks like this:

using CairoMakie
hist(df.eGFR)

A z transform (also called “standard score”) involves taking each value, subtracting the mean of the sample, and dividing by the standard deviation of the sample. So, for a sample $x_i$,

\[z_i = \frac{x_i - \bar{x}}{\sigma_x}\]

Where $z_i$ is the z-score for element $i$, $x_i$ is the measured value, $\bar{x}$ is the mean of the sample, and $\sigma_x$ is the standard deviation of the sample.

Or, in Julia code:

function zscore(v)
    μ = mean(v)
    σ = std(v)
    return [(x - μ) / σ for x in v]
end

zscore (generic function with 1 method)
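As a quick sanity check, here is a sketch that applies the same function to a handful of made-up numbers (xs is an assumption for the example, not data from the tutorial) and confirms that the result has mean 0 and standard deviation 1.

```julia
using Statistics

function zscore(v)
    μ = mean(v)
    σ = std(v)
    return [(x - μ) / σ for x in v]
end

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up values
z = zscore(xs)

z_mean = mean(z)   # approximately 0, up to floating point error
z_std = std(z)     # approximately 1
```

The ordering of the values is unchanged: the transform only shifts and rescales them.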

We can now use this function in the @select macro.

df_z = @select df :ID :eGFR_z = zscore(:eGFR)
first(df_z, 5)

5×2 DataFrame

Row	ID	eGFR_z
	Int64	Float64
1	1	-0.645386
2	2	2.836
3	3	-0.380372
4	4	-0.799943
5	5	-0.87244

Note

We didn’t need to broadcast (“vectorize”) this call with zscore.(:eGFR), since the function was written to operate on the whole column. This is why it’s useful to be able to use both column-wise and row-wise operations, depending on what makes the most sense.

hist(df_z.eGFR_z)

It’s often inconvenient to have to write out a separate function, and then call it in the context of a macro. Because of this, DataFramesMeta.jl macros allow us to substitute complex expressions using begin blocks.

So, to accomplish the same result as above, we could instead write:

my_df = @select df begin
    :ID
    # whitespace in Julia mostly doesn't matter, so we can break this up
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        [(x - μ) / σ for x in :eGFR]
    end
end
first(my_df, 5)

5×2 DataFrame

Row	ID	eGFR_z
	Int64	Float64
1	1	-0.645386
2	2	2.836
3	3	-0.380372
4	4	-0.799943
5	5	-0.87244

3.2.2 `@transform` vs `@select`

The @transform macro is essentially identical to @select, except that @select only returns the columns that you explicitly ask for, while @transform returns all columns from the original DataFrame, as well as any new columns you create.

Here’s exactly the same command we just ran, using @transform instead of @select (notice we can also leave out the :ID):

my_df = @transform df begin
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        [(x - μ) / σ for x in :eGFR]
    end
end
first(my_df, 5)

5×7 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	eGFR_z
	Int64	Float64	Float64	Float64	Int64	Float64	Float64
1	1	34.823	38.212	1.1129	0	42.635	-0.645386
2	2	32.765	74.838	0.8846	1	126.0	2.836
3	3	35.974	37.303	1.1004	1	48.981	-0.380372
4	4	38.206	32.969	1.1972	1	38.934	-0.799943
5	5	33.559	47.139	1.5924	0	37.198	-0.87244
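The difference is easiest to see by comparing column names. This is a minimal sketch on a made-up three-row table (toy is an assumption for the example):

```julia
using DataFrames, DataFramesMeta

# Hypothetical toy table
toy = DataFrame(ID = 1:3, AGE = [34.823, 32.765, 35.974])

selected = @select toy :AGE_in_months = :AGE .* 12
transformed = @transform toy :AGE_in_months = :AGE .* 12

selected_names = names(selected)         # only the column we asked for
transformed_names = names(transformed)   # the original columns plus the new one
```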

3.2.3 `@byrow`

We have already seen that we can use the @r... versions of macros to use row-wise operations. But sometimes, we want to combine operations that require whole columns with those that would make more sense as row-wise.

One solution to this is to use broadcasting (eg .>), but we can also use the @byrow macro within other macro calls. For example, suppose we want to combine the eGFR_z operation above with the AGE_in_months operation we performed earlier:

my_df = @transform df begin
    :eGFR_z = begin
        μ = mean(:eGFR)
        σ = std(:eGFR)
        [(x - μ) / σ for x in :eGFR]
    end
    @byrow :AGE_in_months = :AGE * 12
end
first(my_df, 5)

5×8 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	eGFR_z	AGE_in_months
	Int64	Float64	Float64	Float64	Int64	Float64	Float64	Float64
1	1	34.823	38.212	1.1129	0	42.635	-0.645386	417.876
2	2	32.765	74.838	0.8846	1	126.0	2.836	393.18
3	3	35.974	37.303	1.1004	1	48.981	-0.380372	431.688
4	4	38.206	32.969	1.1972	1	38.934	-0.799943	458.472
5	5	33.559	47.139	1.5924	0	37.198	-0.87244	402.708
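To check that the three spellings of a row-wise operation agree, here is a sketch on a made-up table (toy is an assumption for the example). It compares broadcasting, @byrow, and the @r... form of the macro.

```julia
using DataFrames, DataFramesMeta

# Hypothetical toy table
toy = DataFrame(ID = 1:3, AGE = [34.823, 32.765, 35.974])

# 1. column-wise macro with an explicit broadcast
broadcasted = @transform toy :AGE_in_months = :AGE .* 12

# 2. column-wise macro, with one expression marked as row-wise
by_row = @transform toy begin
    @byrow :AGE_in_months = :AGE * 12
end

# 3. row-wise macro, where every expression is row-wise
row_macro = @rtransform toy :AGE_in_months = :AGE * 12

all_equal = (broadcasted == by_row) && (by_row == row_macro)
```

Which one to use is mostly a matter of readability; @byrow is handy when only some expressions in a block are row-wise.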

3.3 🏷️ `@rename`

The @rename macro renames columns in a DataFrame using the syntax :new = :old. Here’s an example:

my_df = @rename df :serum_creatinine = :SCR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	serum_creatinine	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

Like other macros, @rename can be used in both multi-argument and “block” format:

my_df = @rename df begin
    :age = :AGE
    :serum_creatinine = :SCR
end
first(my_df, 5)

5×6 DataFrame

Row	ID	age	WEIGHT	serum_creatinine	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

To rename a column to a name that contains spaces, you’ll need to escape the new name with $"...":

my_df = @rename df $"Serum Creatinine" = :SCR
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	Serum Creatinine	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

Tip

You can instead rename all columns at once with the rename function from DataFrames.jl. It has an alternate method that takes a function as the first positional argument, followed by a DataFrame. The function can be a built-in one or a custom user-defined one.

Check some examples below.

my_df = rename(lowercase, df)
first(my_df, 5)

5×6 DataFrame

Row	id	age	weight	scr	ismale	egfr
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

my_df = rename((s -> replace(s, 'A' => 'X')), df)
first(my_df, 5)

5×6 DataFrame

Row	ID	XGE	WEIGHT	SCR	ISMXLE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	1	34.823	38.212	1.1129	0	42.635
2	2	32.765	74.838	0.8846	1	126.0
3	3	35.974	37.303	1.1004	1	48.981
4	4	38.206	32.969	1.1972	1	38.934
5	5	33.559	47.139	1.5924	0	37.198

3.4 ⬜️ ❎ `@subset`

The @subset macro is used for selecting rows that have some property, as determined by a boolean operation - that is, something that returns true or false. This operation is often called the “predicate”.

For example, suppose we want to keep rows where the :SCR is greater than the mean :SCR score:

my_df = @subset df :SCR .> mean(:SCR)
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	4	38.206	32.969	1.1972	1	38.934
2	5	33.559	47.139	1.5924	0	37.198
3	6	53.758	50.819	1.6769	0	30.855
4	13	28.818	65.829	1.3689	0	63.118
5	15	57.703	37.54	1.3098	1	32.758

Note

Recall that we need to use the “broadcasted” >, so that it’s applied to each element. Here, we need the whole column for the mean function, but if our operation only needs individual rows, we could have used the @rsubset macro and skipped the broadcast, eg

my_df = @rsubset(df, :SCR > 0.5)
first(my_df, 5)

We can also provide multiple predicates, which are calculated separately, and only rows where each predicate is true will be returned.

Like @select, we can either provide the predicates as separate arguments on one line, or use a begin block where each line represents one predicate. In other words, the following examples are the same:

my_df = @subset df :SCR .> mean(:SCR) :eGFR .< median(:eGFR);
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	4	38.206	32.969	1.1972	1	38.934
2	5	33.559	47.139	1.5924	0	37.198
3	6	53.758	50.819	1.6769	0	30.855
4	15	57.703	37.54	1.3098	1	32.758
5	18	48.453	59.107	1.1961	0	53.408

my_df = @subset df begin
    :SCR .> mean(:SCR)
    :eGFR .< median(:eGFR)
end;
first(my_df, 5)

5×6 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR
	Int64	Float64	Float64	Float64	Int64	Float64
1	4	38.206	32.969	1.1972	1	38.934
2	5	33.559	47.139	1.5924	0	37.198
3	6	53.758	50.819	1.6769	0	30.855
4	15	57.703	37.54	1.3098	1	32.758
5	18	48.453	59.107	1.1961	0	53.408
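Since a row is kept only when every predicate is true, several predicates behave like a single predicate joined with “and”. Here is a sketch of that equivalence on a made-up table (toy and the cutoffs 1.0 and 50 are assumptions for the example):

```julia
using DataFrames, DataFramesMeta

# Hypothetical toy table
toy = DataFrame(
    ID = 1:4,
    SCR = [1.2, 0.9, 1.6, 1.3],
    eGFR = [38.9, 126.0, 30.9, 63.1],
)

# two predicates, one per line
both_lines = @subset toy begin
    :SCR .> 1.0
    :eGFR .< 50
end

# one predicate, combined with a broadcasted "and"
single_and = @subset(toy, (:SCR .> 1.0) .& (:eGFR .< 50))

kept_ids = both_lines.ID
```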

4 🪄 `DataFramesMeta.jl` advanced operations with `@astable`

With the @astable macro we can perform more advanced operations. A limitation of the @select and @transform families of macros is that every expression inside them must have the form :new_column = ..., so a block cannot define intermediate variables that are shared between expressions or used to build several columns at once.

For example, suppose you want to take the log of :AGE and add it to the :WEIGHT converted to pounds, and you want to name the intermediate results inside a begin ... end block for readability. This is not supported by the @select or @transform macro families on their own:

@rtransform df begin
    AGE_log = log(:AGE)
    WEIGHT_lbs = :WEIGHT * 2.2
    :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end

┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :AGE_log = ... instead.
└ @ DataFramesMeta ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/parsing.jl:387
┌ Warning: Using an un-quoted Symbol on the LHS is deprecated. Write :WEIGHT_lbs = ... instead.
└ @ DataFramesMeta ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/parsing.jl:387

UndefVarError: `AGE_log` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
  [1] (::var"#34#36")()
    @ Main.Notebook ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/parsing.jl:303
  [2] #560
    @ ./boot.jl:0 [inlined]
  [3] iterate
    @ ./generator.jl:48 [inlined]
  [4] collect(itr::Base.Generator{UnitRange{Int64}, DataFrames.var"#560#561"{var"#34#36"}})
    @ Base ./array.jl:791
  [5] _empty_selector_helper(fun::var"#34#36", len::Int64)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:586
  [6] _transformation_helper(df::DataFrame, col_idx::Vector{Int64}, ::Base.RefValue{Any})
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:591
  [7] select_transform!(::Base.RefValue{Any}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{Symbol}, copycols::Bool, allow_resizing_newdf::Base.RefValue{Bool}, column_to_copy::BitVector)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:805
  [8] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1783
  [9] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1703
 [10] select(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1303
 [11] transform(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, renamecols::Bool, threads::Bool)
    @ DataFrames ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFrames/kcA9R/src/abstractdataframe/selection.jl:1383
 [12] top-level scope
    @ ~/run/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/DataFramesMeta/1Y7m8/src/macros.jl:1599

In order to perform such an operation you’ll need to use the @astable macro inside the @select/@transform macro after the DataFrame:

my_df = @rtransform df @astable begin
    AGE_log = log(:AGE)
    WEIGHT_lbs = :WEIGHT * 2.2
    :AGE_log_WEIGHT_lbs = AGE_log + WEIGHT_lbs
end
first(my_df, 5)

5×7 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	AGE_log_WEIGHT_lbs
	Int64	Float64	Float64	Float64	Int64	Float64	Float64
1	1	34.823	38.212	1.1129	0	42.635	87.6167
2	2	32.765	74.838	0.8846	1	126.0	168.133
3	3	35.974	37.303	1.1004	1	48.981	85.6494
4	4	38.206	32.969	1.1972	1	38.934	76.1748
5	5	33.559	47.139	1.5924	0	37.198	107.219

Tip

You can use @astable inside any DataFramesMeta.jl macro when you need intermediate variables or want to create multiple columns in one block.
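Here is a sketch of creating two columns from one shared intermediate variable. It assumes a made-up two-row table (toy) and reuses the rough factor of 2.2 pounds per kilogram from the example above.

```julia
using DataFrames, DataFramesMeta

# Hypothetical toy table
toy = DataFrame(AGE = [34.823, 32.765], WEIGHT = [38.212, 74.838])

out = @rtransform toy @astable begin
    weight_lbs = :WEIGHT * 2.2                     # intermediate variable, not a column
    :WEIGHT_lbs = weight_lbs                       # first new column
    :AGE_log_WEIGHT_lbs = log(:AGE) + weight_lbs   # second new column
end

new_columns = setdiff(names(out), names(toy))
```

Only the names written with a leading : become columns; weight_lbs is discarded once the block finishes.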

5 ⊗ Operations on groups

There are many times when we want to apply operations within groups, rather than on the whole DataFrame.

For example, suppose we want to calculate z-scores for :WEIGHT, but we know that this variable will be affected by the sex of the individuals. We therefore want our mean and standard deviation calculations to happen separately for men and for women.

We could make a men_df and women_df using @subset, then calculate the values separately, but that would be inconvenient. Instead, we can use groupby to create a GroupedDataFrame.

gdf = groupby(df, :ISMALE)
typeof(gdf)

GroupedDataFrame{DataFrame}

Now, we can use the same operations as before. DataFramesMeta will apply our @select, @transform, and @subset operations within groups, then return a new DataFrame.

my_df = wz = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
first(my_df, 5)

5×7 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	WEIGHT_z
	Int64	Float64	Float64	Float64	Int64	Float64	Float64
1	1	34.823	38.212	1.1129	0	42.635	-0.807368
2	2	32.765	74.838	0.8846	1	126.0	1.57251
3	3	35.974	37.303	1.1004	1	48.981	-0.81391
4	4	38.206	32.969	1.1972	1	38.934	-1.08946
5	5	33.559	47.139	1.5924	0	37.198	-0.334861

fig = Figure()
not_tr = Axis(fig[1, 1], title = "Not transformed", xlabel = "Weight")
tr = Axis(fig[1, 2], title = "Z-score", xlabel = "Z-score")
w = hist!(not_tr, @rsubset(wz, :ISMALE == 0).WEIGHT)
m = hist!(not_tr, @rsubset(wz, :ISMALE == 1).WEIGHT)
hist!(tr, @rsubset(wz, :ISMALE == 0).WEIGHT_z)
hist!(tr, @rsubset(wz, :ISMALE == 1).WEIGHT_z)
Legend(
    fig[2, 1:2],
    [w, m],
    ["women", "men"],
    orientation = :horizontal,
    tellwidth = false,
    tellheight = true,
)
fig

Well, that was unexpected…

5.1 ⛓️ `@chain`

Because we so often want to apply this pattern - grouping on one column, then performing operations on groups, DataFramesMeta provides the @chain macro from Chain.jl, in which each line operates on the line before it, similar to stringing together operations with %>% in R.

In particular, each line implicitly takes the result of the previous line as the first argument. In other words, what we previously accomplished with

df
gdf = groupby(df, :ISMALE)
my_df = @transform gdf :WEIGHT_z = zscore(:WEIGHT)
first(my_df, 5)

becomes:

my_df = @chain df begin
    groupby(:ISMALE)
    @transform :WEIGHT_z = zscore(:WEIGHT)
end
first(my_df, 5)

5×7 DataFrame

Row	ID	AGE	WEIGHT	SCR	ISMALE	eGFR	WEIGHT_z
	Int64	Float64	Float64	Float64	Int64	Float64	Float64
1	1	34.823	38.212	1.1129	0	42.635	-0.807368
2	2	32.765	74.838	0.8846	1	126.0	1.57251
3	3	35.974	37.303	1.1004	1	48.981	-0.81391
4	4	38.206	32.969	1.1972	1	38.934	-1.08946
5	5	33.559	47.139	1.5924	0	37.198	-0.334861
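Chains can be as long as you like. As a sketch, the chain below z-scores a made-up WEIGHT column within each group, then regroups and summarizes the result to confirm that every group ends up with mean 0 and standard deviation 1. The table toy and the column names z_mean and z_std are assumptions for the example.

```julia
using DataFrames, DataFramesMeta, Statistics

# Hypothetical toy table with two groups
toy = DataFrame(
    ISMALE = [0, 0, 0, 1, 1, 1],
    WEIGHT = [50.0, 60.0, 70.0, 70.0, 80.0, 90.0],
)

summary_df = @chain toy begin
    groupby(:ISMALE)      # same as groupby(toy, :ISMALE)
    @transform :WEIGHT_z = (:WEIGHT .- mean(:WEIGHT)) ./ std(:WEIGHT)
    groupby(:ISMALE)      # @transform returned a plain DataFrame, so group again
    @combine begin
        :z_mean = mean(:WEIGHT_z)
        :z_std = std(:WEIGHT_z)
    end
end
```

Each step receives the previous result as its first argument, so none of the intermediate tables needs a name.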

5.2 🧑‍🍳 `@combine`

Suppose that we want to aggregate or summarize columns after grouping them, for example, getting summary statistics, or counting the number of rows with particular values. This can be accomplished with the @combine macro.

For the next example, remember that gdf is our original DataFrame grouped on the :ISMALE column.

my_df = @combine gdf begin
    :AGE_μ = mean(:AGE)
    :WEIGHT_μ = mean(:WEIGHT)
    :total = length(:ID)
    :high_eGFR = count(>(80), :eGFR)
end
first(my_df, 5)

2×5 DataFrame

Row	ISMALE	AGE_μ	WEIGHT_μ	total	high_eGFR
	Int64	Float64	Float64	Int64	Int64
1	0	41.2514	53.4655	53	10
2	1	44.9605	50.1047	47	9

Tip

DataFramesMeta.jl also has a macro that fuses together the groupby + @combine operations.

This is the @by macro and its syntax is:

@by DataFrame :group_col combine_operations

where :group_col can be either a single column or a vector of columns to group the data by.

See an example below:

my_df = @by df :ISMALE :AGE_μ = mean(:AGE) :WEIGHT_μ = mean(:WEIGHT)
first(my_df, 5)

2×3 DataFrame

Row	ISMALE	AGE_μ	WEIGHT_μ
	Int64	Float64	Float64
1	0	41.2514	53.4655
2	1	44.9605	50.1047
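To confirm that @by is a shortcut rather than a different operation, here is a sketch comparing it with the two-step version on a made-up table (toy is an assumption for the example):

```julia
using DataFrames, DataFramesMeta, Statistics

# Hypothetical toy table
toy = DataFrame(ISMALE = [0, 1, 0, 1], WEIGHT = [50.0, 70.0, 60.0, 90.0])

# two steps: group, then combine
toy_gdf = groupby(toy, :ISMALE)
two_steps = @combine toy_gdf :WEIGHT_μ = mean(:WEIGHT)

# one step: @by does the grouping for us
one_step = @by toy :ISMALE :WEIGHT_μ = mean(:WEIGHT)

same_result = two_steps == one_step
```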

Reuse

CC BY-SA 4.0
