Creating tables with SummaryTables.jl

Author

Juan Oneto

Pumas’ SummaryTables.jl is a Julia package that generates publication-ready summary tables for data analysis and reporting. In this tutorial, we will go over some of the fundamental features of the package and explore how to use it to create informative tables with summary statistics, grouped data, and much more.

1 Libraries

Of course, we will need to begin by importing SummaryTables.jl:

using SummaryTables

As for our data, we will be using PharmaDatasets.jl as our source and DataFramesMeta.jl for some data wrangling:

using PharmaDatasets
using DataFrames
using DataFramesMeta

Lastly, we need to import StatsBase.jl to have access to the summary statistics functions that we will be using later on:

using StatsBase

2 Load data

We will load our dataset with PharmaDatasets.jl’s dataset() function:

df = dataset("nca/dapa_IV_ORAL")
first(df, 5)
5×11 DataFrame
Row ID TIME TAD COBS AMT OCC AGE WEIGHT GENDER FORMULATION DOSE
Int64 Float64 Float64 Float64 Int64? Int64 Int64 Float64 Int64 String7 Int64?
1 1 0.0 0.0 157.021 5000 1 44 70.5 0 IV 5000
2 1 0.05 0.05 141.892 missing 1 44 70.5 0 IV missing
3 1 0.35 0.35 116.228 missing 1 44 70.5 0 IV missing
4 1 0.5 0.5 109.353 missing 1 44 70.5 0 IV missing
5 1 0.75 0.75 66.4814 missing 1 44 70.5 0 IV missing

You probably noticed that this dataset contains both IV and oral formulations. Here we will only consider the IV formulation.

Also, the GENDER column is encoded as 0 for males and 1 for females, so we will recode it with the more meaningful labels “Male” and “Female”:

df_iv = @chain df begin
    @rsubset :FORMULATION == "IV"
    @rtransform :GENDER = :GENDER == 0 ? "Male" : "Female"
end;
Tip

Make sure to check our tutorial on Manipulating Tables with DataFramesMeta.jl for a more detailed look into DataFramesMeta.jl’s macros for data wrangling.
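
To see what the two row-wise macros used above do in isolation, here is a minimal sketch on a made-up three-row table. The toy DataFrame and its values are assumptions for illustration only and are not part of the dapa_IV_ORAL dataset:

```julia
using DataFrames
using DataFramesMeta

# Made-up data that mimics the columns used in this section
toy = DataFrame(ID = [1, 1, 2], FORMULATION = ["IV", "ORAL", "IV"], GENDER = [0, 0, 1])

toy_iv = @chain toy begin
    @rsubset :FORMULATION == "IV"                            # keep IV rows only
    @rtransform :GENDER = :GENDER == 0 ? "Male" : "Female"   # recode row by row
end
```

Both macros evaluate their expression once per row, which is why the plain == and the ternary operator work here without broadcasting.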

3 📋 Listing tables

As their name suggests, listing tables are used to list raw values from a dataset, which can be a useful starting point for reporting and further analyses.

In order to create a listing table in SummaryTables.jl, we use the listingtable function, which takes the following positional arguments:

  1. A table: in this case df_iv, which is a DataFrame.
  2. A variable: the variable whose raw values will be listed.

In addition, listingtable supports keyword arguments for grouping columns and rows (cols and rows, respectively).

Note

Our documentation on SummaryTables.jl contains a comprehensive list of all keyword arguments available for listingtable and all the other functions used in this tutorial.

As an example, let’s try to generate a listing table containing the concentration measurements for each of our subjects, which are in the COBS column of our DataFrame:

listingtable(df_iv, :COBS; cols = :TIME, rows = :ID)
TIME
0 0.05 0.35 0.5 0.75 1 2 3 4 6 8 10 12 16 20 24
ID COBS
1 157 142 116 109 66.5 74.8 39.2 25.4 13 3.81 1.47 1.11 0.911 0.83 0.624 0.654
2 59.8 66.4 55.5 59 55.8 53.7 38.9 31 24.2 15.9 10.7 7.33 5.83 3.3 2.32 1.72
3 166 130 127 97.8 86.6 81.9 35.8 22.3 12.8 6.47 4.98 3.38 3.33 2.69 2.22 2.04
4 134 124 112 122 83.4 78.4 48 31.6 18.7 11 6.01 5.02 3.31 3.32 2.45 2.3
5 94.9 94.4 80.7 60.4 54.4 46.1 17.8 5.78 3.51 1.06 0.556 0.537 0.464 0.359 0.295 0.204
6 57.6 56.7 61.8 56.7 50.5 52.6 34.7 28.9 25.9 16.7 11.4 6.88 4.66 2.98 2.01 1.75
7 81.5 70.8 81 78.7 69.3 59.8 39.5 35.1 25.7 12.6 9.81 6.23 6.54 5.14 3.52 3.06
8 113 96.5 92.6 75.8 67.9 68.9 41.2 23.2 16.8 8.01 4.59 3.56 2.27 1.87 1.41 1.13
9 111 123 102 90.3 79.2 63.1 38.1 21.8 9.74 5.21 3.34 2.23 1.51 1.11 0.701 0.694
10 68.4 59.2 54.7 55.3 56.9 47.3 31 24 21.5 11.3 6.54 4.51 3.45 2.63 1.99 1.38
11 111 106 95.6 62.1 60.9 42.6 17.5 7.57 4.11 1.68 0.929 0.851 0.7 0.51 0.357 0.36
12 106 119 84.9 87.3 69.6 63.1 43.3 28.3 19.7 11.4 7.59 5.52 4.38 3.71 2.66 2.63
13 158 130 97.5 89.1 85.1 79.3 60.3 41.6 26 17.8 11.5 9.59 7.18 5.67 4.58 4.26
14 103 133 116 87.9 82.8 67 38.7 17.3 12 4.17 3.15 2.54 2.14 1.76 1.19 1.11
15 95.1 84.9 79.8 67.1 74.8 75.9 45.9 31.9 23.4 11.5 6.57 4.23 4.72 2.74 2.38 2.13
16 88.4 108 88 80.6 68.8 49.3 27.5 15.3 9.15 4.47 2.7 2.18 1.68 1.48 0.987 0.776
17 194 134 96.7 108 92.7 87.1 44.7 24.4 16.4 8.45 5.65 5.74 4.21 2.97 2.14 1.7
18 114 110 82.7 79.8 80.6 53.9 29.1 17.9 7.71 3.04 1.04 0.706 0.652 0.54 0.443 0.418
19 318 274 189 177 132 93 33.3 22.1 14.5 9.95 8 6.73 5.32 4.68 3.06 2.35
20 57.2 53.4 46.2 55 38 36.2 35.7 24.7 16.8 9.46 6.19 3.59 2.03 0.954 0.794 0.622
21 47.2 46.8 42.4 37.8 40.3 40.5 25.9 26 18.4 11 7.51 5.4 4.38 2.82 1.72 1.11
22 86.8 102 86.9 82.5 68.8 64.5 44.9 37.7 25.1 17.5 10.8 7.87 6.97 4.2 3.56 3.29
23 77.3 66.5 72.1 62.9 57 50.3 40 29.8 18.9 11.7 7.18 5.16 3.26 2.58 2.01 1.6
24 66.6 74.4 54.4 48.8 47.4 39.1 26.3 17.4 11.2 3.86 2.12 1.19 0.833 0.755 0.629 0.546
Tip

As you can see, Quarto will automatically display the table that we just generated. If you are interested in taking your results elsewhere, make sure to check our section on generating output later on in this tutorial to learn about the available formats to export your table results.

Warning

Let’s stop here for a second to understand what we just did.

Since we wanted to list our concentration values, we passed COBS as our second positional argument. However, if we had tried to run listingtable(df_iv, :COBS) on its own, we would have gotten a TooManyRowsError, because without grouping factors all 384 COBS values end up in a single table cell:

listingtable(df_iv, :COBS)
SummaryTables.TooManyRowsError: TooManyRowsError: Found a group which has more than one value. This is not allowed, only one value of "COBS" per table cell may exist.

384×1 DataFrame
 Row │ COBS
     │ Float64
─────┼────────────
   1 │ 157.021
   2 │ 141.892
   3 │ 116.228
   4 │ 109.353
   5 │  66.4814
   6 │  74.7532
   7 │  39.1933
   8 │  25.4495
  ⋮  │     ⋮
 378 │   3.86121
 379 │   2.11964
 380 │   1.19236
 381 │   0.832999
 382 │   0.755442
 383 │   0.628669
 384 │   0.546453
  369 rows omitted

Filter your dataset or use additional row or column grouping factors.

The following columns in the dataset are not uniform in this group and could potentially be used: ["ID", "TIME", "TAD", "AMT", "AGE", "WEIGHT", "GENDER", "DOSE"].

As suggested by the error message, we need to add grouping factors on our rows or columns to ensure that none of the groups will have more than one value.
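
One way to check a candidate set of grouping factors before building the table is to count how many rows fall into each group. The following is a minimal sketch with plain DataFrames.jl on made-up data; it is not part of SummaryTables.jl, and the toy values are assumptions:

```julia
using DataFrames

# Made-up long-format data: two subjects, two time points each
toy = DataFrame(
    ID = [1, 1, 2, 2],
    TIME = [0.0, 1.0, 0.0, 1.0],
    COBS = [157.0, 74.8, 59.8, 53.7],
)

# Grouping by ID and TIME: every table cell would hold exactly one value
cell_counts = combine(groupby(toy, [:ID, :TIME]), nrow => :n)
one_value_per_cell = all(==(1), cell_counts.n)

# Grouping by ID alone: each cell would hold two values, which is too many
id_counts = combine(groupby(toy, :ID), nrow => :n)
```

If every count equals 1, the chosen grouping factors identify each observation uniquely, which is the condition the error message asks for.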

In this case, we wanted to use the time values and subject IDs, so that’s how we arrived at our final solution:

listingtable(df_iv, :COBS; cols = :TIME, rows = :ID)
Tip

Notice that by specifying :TIME in cols and :ID in rows we have produced a wide-format table from our original long-format data. This is probably a more readable format for presentations and reports, but if you wish to stick to the long format you can add all grouping factors in rows:

listingtable(df_iv, :COBS; rows = [:ID, :TIME])

Note that in this case, we passed a Vector to the rows keyword argument because we wanted to use multiple columns. This also works for cols.
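
If the long versus wide distinction is new to you, the same reshaping idea can be sketched with DataFrames.jl’s unstack on made-up data. This is only an analogy for the layout and makes no claim about how listingtable works internally:

```julia
using DataFrames

# Made-up long-format data: one row per subject and time point
long = DataFrame(
    ID = [1, 1, 2, 2],
    TIME = [0.0, 1.0, 0.0, 1.0],
    COBS = [157.0, 74.8, 59.8, 53.7],
)

# Wide format: one row per subject, one column per time point
wide = unstack(long, :ID, :TIME, :COBS)
```

The long table has four rows, while the wide one has two rows (one per subject) and one column for each distinct TIME value in addition to ID.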

Another thing that we might want to do is customize the table’s headers, since the current COBS, TIME, and ID headers are not very readable.

To do this, we use Pairs indicating the variable and the name that we want to use in the table (e.g., :ID => "Subject ID").
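
For readers unfamiliar with Pairs, this short sketch in plain Julia shows what the => syntax creates. The variable names are chosen for illustration only:

```julia
# A Pair links a column name to its display label
label = :ID => "Subject ID"

column = first(label)   # the column name, :ID
header = last(label)    # the display label, "Subject ID"

# Several pairs can be collected in a Vector when more than one variable is involved
labels = [:ID => "Subject ID", :TIME => "Time (hours)"]
```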

Let’s take advantage of this and generate a table with more readable headers and units:

listingtable(
    df_iv,
    :COBS => "Concentration (μg/L)";
    cols = :TIME => "Time (hours)",
    rows = :ID => "Subject ID",
)
Time (hours)
0 0.05 0.35 0.5 0.75 1 2 3 4 6 8 10 12 16 20 24
Subject ID Concentration (μg/L)
1 157 142 116 109 66.5 74.8 39.2 25.4 13 3.81 1.47 1.11 0.911 0.83 0.624 0.654
2 59.8 66.4 55.5 59 55.8 53.7 38.9 31 24.2 15.9 10.7 7.33 5.83 3.3 2.32 1.72
3 166 130 127 97.8 86.6 81.9 35.8 22.3 12.8 6.47 4.98 3.38 3.33 2.69 2.22 2.04
4 134 124 112 122 83.4 78.4 48 31.6 18.7 11 6.01 5.02 3.31 3.32 2.45 2.3
5 94.9 94.4 80.7 60.4 54.4 46.1 17.8 5.78 3.51 1.06 0.556 0.537 0.464 0.359 0.295 0.204
6 57.6 56.7 61.8 56.7 50.5 52.6 34.7 28.9 25.9 16.7 11.4 6.88 4.66 2.98 2.01 1.75
7 81.5 70.8 81 78.7 69.3 59.8 39.5 35.1 25.7 12.6 9.81 6.23 6.54 5.14 3.52 3.06
8 113 96.5 92.6 75.8 67.9 68.9 41.2 23.2 16.8 8.01 4.59 3.56 2.27 1.87 1.41 1.13
9 111 123 102 90.3 79.2 63.1 38.1 21.8 9.74 5.21 3.34 2.23 1.51 1.11 0.701 0.694
10 68.4 59.2 54.7 55.3 56.9 47.3 31 24 21.5 11.3 6.54 4.51 3.45 2.63 1.99 1.38
11 111 106 95.6 62.1 60.9 42.6 17.5 7.57 4.11 1.68 0.929 0.851 0.7 0.51 0.357 0.36
12 106 119 84.9 87.3 69.6 63.1 43.3 28.3 19.7 11.4 7.59 5.52 4.38 3.71 2.66 2.63
13 158 130 97.5 89.1 85.1 79.3 60.3 41.6 26 17.8 11.5 9.59 7.18 5.67 4.58 4.26
14 103 133 116 87.9 82.8 67 38.7 17.3 12 4.17 3.15 2.54 2.14 1.76 1.19 1.11
15 95.1 84.9 79.8 67.1 74.8 75.9 45.9 31.9 23.4 11.5 6.57 4.23 4.72 2.74 2.38 2.13
16 88.4 108 88 80.6 68.8 49.3 27.5 15.3 9.15 4.47 2.7 2.18 1.68 1.48 0.987 0.776
17 194 134 96.7 108 92.7 87.1 44.7 24.4 16.4 8.45 5.65 5.74 4.21 2.97 2.14 1.7
18 114 110 82.7 79.8 80.6 53.9 29.1 17.9 7.71 3.04 1.04 0.706 0.652 0.54 0.443 0.418
19 318 274 189 177 132 93 33.3 22.1 14.5 9.95 8 6.73 5.32 4.68 3.06 2.35
20 57.2 53.4 46.2 55 38 36.2 35.7 24.7 16.8 9.46 6.19 3.59 2.03 0.954 0.794 0.622
21 47.2 46.8 42.4 37.8 40.3 40.5 25.9 26 18.4 11 7.51 5.4 4.38 2.82 1.72 1.11
22 86.8 102 86.9 82.5 68.8 64.5 44.9 37.7 25.1 17.5 10.8 7.87 6.97 4.2 3.56 3.29
23 77.3 66.5 72.1 62.9 57 50.3 40 29.8 18.9 11.7 7.18 5.16 3.26 2.58 2.01 1.6
24 66.6 74.4 54.4 48.8 47.4 39.1 26.3 17.4 11.2 3.86 2.12 1.19 0.833 0.755 0.629 0.546

Finally, we might also be interested in a version of this table that summarizes the concentration values across all subjects for each time point.

Luckily for us, listingtable supports adding summary columns or rows by using the keyword arguments summarize_rows/summarize_cols, which take a Vector containing the summary functions that we want to use.

Let’s compute the geometric mean and the coefficient of variation as an example to see how that works. In this case, we will use summarize_rows:

listingtable(
    df_iv,
    :COBS => "Concentration (μg/L)";
    cols = :TIME => "Time (hours)",
    rows = :ID => "Subject ID",
    summarize_rows = [
        geomean => "Geometric mean (μg/L)",
        (i -> 100 * variation(i)) => "CV (%)",
    ],
)
Time (hours)
0 0.05 0.35 0.5 0.75 1 2 3 4 6 8 10 12 16 20 24
Subject ID Concentration (μg/L)
1 157 142 116 109 66.5 74.8 39.2 25.4 13 3.81 1.47 1.11 0.911 0.83 0.624 0.654
2 59.8 66.4 55.5 59 55.8 53.7 38.9 31 24.2 15.9 10.7 7.33 5.83 3.3 2.32 1.72
3 166 130 127 97.8 86.6 81.9 35.8 22.3 12.8 6.47 4.98 3.38 3.33 2.69 2.22 2.04
4 134 124 112 122 83.4 78.4 48 31.6 18.7 11 6.01 5.02 3.31 3.32 2.45 2.3
5 94.9 94.4 80.7 60.4 54.4 46.1 17.8 5.78 3.51 1.06 0.556 0.537 0.464 0.359 0.295 0.204
6 57.6 56.7 61.8 56.7 50.5 52.6 34.7 28.9 25.9 16.7 11.4 6.88 4.66 2.98 2.01 1.75
7 81.5 70.8 81 78.7 69.3 59.8 39.5 35.1 25.7 12.6 9.81 6.23 6.54 5.14 3.52 3.06
8 113 96.5 92.6 75.8 67.9 68.9 41.2 23.2 16.8 8.01 4.59 3.56 2.27 1.87 1.41 1.13
9 111 123 102 90.3 79.2 63.1 38.1 21.8 9.74 5.21 3.34 2.23 1.51 1.11 0.701 0.694
10 68.4 59.2 54.7 55.3 56.9 47.3 31 24 21.5 11.3 6.54 4.51 3.45 2.63 1.99 1.38
11 111 106 95.6 62.1 60.9 42.6 17.5 7.57 4.11 1.68 0.929 0.851 0.7 0.51 0.357 0.36
12 106 119 84.9 87.3 69.6 63.1 43.3 28.3 19.7 11.4 7.59 5.52 4.38 3.71 2.66 2.63
13 158 130 97.5 89.1 85.1 79.3 60.3 41.6 26 17.8 11.5 9.59 7.18 5.67 4.58 4.26
14 103 133 116 87.9 82.8 67 38.7 17.3 12 4.17 3.15 2.54 2.14 1.76 1.19 1.11
15 95.1 84.9 79.8 67.1 74.8 75.9 45.9 31.9 23.4 11.5 6.57 4.23 4.72 2.74 2.38 2.13
16 88.4 108 88 80.6 68.8 49.3 27.5 15.3 9.15 4.47 2.7 2.18 1.68 1.48 0.987 0.776
17 194 134 96.7 108 92.7 87.1 44.7 24.4 16.4 8.45 5.65 5.74 4.21 2.97 2.14 1.7
18 114 110 82.7 79.8 80.6 53.9 29.1 17.9 7.71 3.04 1.04 0.706 0.652 0.54 0.443 0.418
19 318 274 189 177 132 93 33.3 22.1 14.5 9.95 8 6.73 5.32 4.68 3.06 2.35
20 57.2 53.4 46.2 55 38 36.2 35.7 24.7 16.8 9.46 6.19 3.59 2.03 0.954 0.794 0.622
21 47.2 46.8 42.4 37.8 40.3 40.5 25.9 26 18.4 11 7.51 5.4 4.38 2.82 1.72 1.11
22 86.8 102 86.9 82.5 68.8 64.5 44.9 37.7 25.1 17.5 10.8 7.87 6.97 4.2 3.56 3.29
23 77.3 66.5 72.1 62.9 57 50.3 40 29.8 18.9 11.7 7.18 5.16 3.26 2.58 2.01 1.6
24 66.6 74.4 54.4 48.8 47.4 39.1 26.3 17.4 11.2 3.86 2.12 1.19 0.833 0.755 0.629 0.546
Geometric mean (μg/L) 100 96.4 83.3 76.2 67 59.1 35.2 22.7 14.7 7.39 4.5 3.31 2.6 1.95 1.45 1.23
CV (%) 52.3 44.6 35.6 36.4 28.6 26.4 26.4 34.7 41.5 55 59 59 62 60.9 62.1 65.9

We can see that now we have our summary statistics below the observations. Also, notice that we just provided the functions that we wanted to use as summary statistics, and listingtable did all the heavy lifting for us.
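
To make the two summary functions less of a black box, the sketch below compares StatsBase.jl’s geomean and variation with their textbook definitions on a made-up vector (the values are an assumption for illustration):

```julia
using Statistics
using StatsBase

# Made-up concentrations
x = [2.0, 8.0, 4.0]

gm = geomean(x)                      # geometric mean from StatsBase
gm_manual = exp(mean(log.(x)))       # same quantity via its definition

cv = 100 * variation(x)              # coefficient of variation, in percent
cv_manual = 100 * std(x) / mean(x)   # standard deviation relative to the mean
```

Any function that takes the values of a group and returns a single value can be used in the same position, which is why the anonymous function i -> 100 * variation(i) works as a summary.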

4 📝 Summary tables

Summary tables are used to present summary statistics for a variable of interest in the dataset.

In SummaryTables.jl, we generate a summary table with the summarytable function, which takes the same positional arguments as listingtable:

  1. A table: our DataFrame.
  2. A variable: the variable from which the summary statistics will be computed.

In addition to that, we need to specify the summary keyword argument, which should be a Vector containing the summary functions that we want to use. Let’s try to create an example to see how that works.

For this example, we turn our attention to the covariates included in our dataset. Since each subject’s covariates are repeated on every one of their rows, we first need to reduce the dataset to one row per subject:

df_cov = unique(df_iv, :ID);
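
As a small illustration of what this call does, here is a sketch on made-up data. It also checks the assumption that the covariate is constant within each subject, since unique keeps only the first row it finds for each ID:

```julia
using DataFrames

# Made-up data: the covariate WEIGHT is repeated on every row of a subject
toy = DataFrame(
    ID = [1, 1, 2, 2],
    TIME = [0.0, 1.0, 0.0, 1.0],
    WEIGHT = [70.5, 70.5, 62.0, 62.0],
)

# Keep the first row found for each ID
toy_cov = unique(toy, :ID)

# Sanity check: number of distinct WEIGHT values per subject (should be 1)
weights_per_id = combine(groupby(toy, :ID), :WEIGHT => (w -> length(unique(w))) => :n_weights)
```

If a covariate changed over time, the first-row rule would silently pick the baseline value, so a check like this is worth running on real data.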

Now let’s create a table for the weight values. We will use the mean and standard deviation for our summary statistics, and we will include the number of subjects used in the calculation:

summarytable(
    df_cov,
    :WEIGHT => "Weight (kg)";
    summary = [mean => "Mean", std => "σ", length => "n"],
)
Weight (kg)
Mean 66
σ 8.17
n 24
Tip

You can also use the pair syntax to specify the names used to refer to the summary functions.

That last table was useful, but it is somewhat simple. Let’s add a grouping variable using cols:

summarytable(
    df_cov,
    :WEIGHT => "Weight (kg)",
    summary = [mean => "Mean", std => "σ", length => "n"],
    cols = :GENDER => "Sex",
)
Sex
Female Male
Weight (kg)
Mean 60.3 69.4
σ 7.27 6.75
n 9 15

Now we get our summary statistics for weight, but separated by sex. We can also add a grouping factor related to age, so let’s compute a categorical variable from it:

age_median = median(df_cov.AGE);

@rtransform! df_cov :AGE_cat = :AGE < age_median ? "Younger" : "Older";

Here we divided subjects into two groups: “Younger” and “Older”. A subject is considered “Younger” if their age is less than the median; otherwise, they fall into the “Older” category.
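
The sketch below applies the same rule to a made-up vector of ages to show how ties are handled. The ages are an assumption chosen so that two of them equal the median:

```julia
using Statistics

ages = [41, 42, 44, 44, 46, 51]   # made-up ages
age_median = median(ages)

# Ages equal to the median are not "less than" it, so they count as "Older"
age_cat = [age < age_median ? "Younger" : "Older" for age in ages]

n_older = count(==("Older"), age_cat)
n_younger = count(==("Younger"), age_cat)
```

Because subjects whose age equals the median are assigned to “Older”, a median split like this one does not necessarily produce two groups of equal size.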

Now we can add this variable to the table. In this case, we will add it as a row:

summarytable(
    df_cov,
    :WEIGHT => "Weight (kg)",
    summary = [mean => "Mean", std => "σ", length => "n"],
    cols = :GENDER => "Sex",
    rows = :AGE_cat => "Age group",
)
Sex
Female Male
Age group Weight (kg)
Older Mean 58.2 72.2
σ 6.95 6.13
n 7 7
Younger Mean 67.4 67
σ 1.91 6.68
n 2 8

One last interesting thing you might want to know is that you don’t necessarily have to use the functions provided by StatsBase.jl. In fact, summary accepts user-defined functions.

As an example, let’s create a function that counts how many subjects have a weight greater than 70 kg:

more_than_70(x) = count(>(70), x);
summarytable(
    df_cov,
    :WEIGHT => "Weight (kg)",
    summary = [mean => "Mean", std => "σ", length => "n", more_than_70 => " > 70 kg"],
    cols = :GENDER => "Sex",
    rows = :AGE_cat => "Age group",
)
Sex
Female Male
Age group Weight (kg)
Older Mean 58.2 72.2
σ 6.95 6.13
n 7 7
> 70 kg 0 5
Younger Mean 67.4 67
σ 1.91 6.68
n 2 8
> 70 kg 0 2
Tip

We encourage you to check our tutorial on functions to learn more about defining functions in Julia.
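
As a quick check of what such a user-defined summary function returns on its own, here is a sketch on made-up weights. The second function, pct_more_than_70, is a hypothetical variation added for illustration and does not appear elsewhere in this tutorial:

```julia
using Statistics

weights = [65.0, 72.5, 80.1, 70.0]   # made-up weights in kg

# Number of values strictly greater than 70
more_than_70(x) = count(>(70), x)

# Hypothetical variant: share of values strictly greater than 70, in percent
pct_more_than_70(x) = 100 * mean(>(70), x)

n_above = more_than_70(weights)
pct_above = pct_more_than_70(weights)
```

Both functions take the values of a group and return a single number, which is the shape expected of the entries in summary.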

5 🥇 Table one

SummaryTables.jl’s table_one function allows you to create tables following the commonly used “Table 1” format to summarize patient baseline characteristics.

table_one also takes two positional arguments:

  1. A table: in this case df_cov, which is a DataFrame.
  2. A Vector of variables.

Again, we’ll focus on describing our subjects using their covariates, so let’s generate a table with our continuous covariates:

table_one(df_cov, [:WEIGHT => "Weight (kg)", :AGE => "Age (years)"])
Overall
Weight (kg)
Mean (SD) 66 (8.17)
Median [Min, Max] 67.8 [49.1, 84.1]
Age (years)
Mean (SD) 44.6 (3.11)
Median [Min, Max] 44 [41, 51]

As you can see, table_one shows some interesting summary statistics such as the mean and the standard deviation for the variables that we specified. Let’s now try to generate a table for our discrete covariates:

table_one(df_cov, [:GENDER => "Sex", :AGE_cat => "Age group"])
Overall
Sex
Female 9 (37.5%)
Male 15 (62.5%)
Age group
Older 14 (58.3%)
Younger 10 (41.7%)

Notice that when we pass discrete variables we get back the number of occurrences and their corresponding percentages instead of the summary statistics from before.
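
The counts and percentages shown for Sex can be reproduced by hand, as in the sketch below. The helper count_pct is hypothetical, and this is not necessarily how table_one computes its output; the input simply rebuilds the 9 and 15 counts from the table above:

```julia
# Rebuild a vector with the same counts as the Sex rows above
sex = vcat(fill("Female", 9), fill("Male", 15))

# Hypothetical helper: one "level n (pct%)" line per distinct value
function count_pct(values)
    n_total = length(values)
    return map(sort(unique(values))) do level
        n = count(==(level), values)
        pct = round(100 * n / n_total; digits = 1)
        string(level, " ", n, " (", pct, "%)")
    end
end

summary_lines = count_pct(sex)
```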

Lastly, we don’t have to separate continuous and discrete variables. You can include all of them in the same call and table_one will adjust its output accordingly:

table_one(df_cov, [
    :AGE => "Age (years)", # A continuous covariate
    :WEIGHT => "Weight (kg)",
    :GENDER => "Sex", # A discrete covariate
])
Overall
Age (years)
Mean (SD) 44.6 (3.11)
Median [Min, Max] 44 [41, 51]
Weight (kg)
Mean (SD) 66 (8.17)
Median [Min, Max] 67.8 [49.1, 84.1]
Sex
Female 9 (37.5%)
Male 15 (62.5%)

Similarly to what we have been doing before, we can use grouping variables, which in this case are added through the groupby keyword argument:

table_one(
    df_cov,
    [:AGE => "Age (years)", :WEIGHT => "Weight (kg)"];
    groupby = [:AGE_cat => "Age group", :GENDER => "Sex"],
    show_n = true,
)
Age group
Older (n=14) Younger (n=10)
Sex Sex
Overall (n=24) Female (n=7) Male (n=7) Female (n=2) Male (n=8)
Age (years)
Mean (SD) 44.6 (3.11) 46.6 (2.57) 46.9 (2.04) 41.5 (0.707) 41.6 (0.744)
Median [Min, Max] 44 [41, 51] 46 [44, 51] 47 [44, 50] 41.5 [41, 42] 41.5 [41, 43]
Weight (kg)
Mean (SD) 66 (8.17) 58.2 (6.95) 72.2 (6.13) 67.4 (1.91) 67 (6.68)
Median [Min, Max] 67.8 [49.1, 84.1] 55.5 [49.1, 68.7] 70.5 [64.8, 84.1] 67.4 [66, 68.7] 67.8 [54.9, 76.4]
Note

We also set the show_n keyword argument to true in order to get the number of rows associated with the grouping factors.

table_one supports many other useful keyword arguments, including some related to hypothesis testing. We encourage you to check our documentation on table_one to learn more about the use of this function.

6 📤 Generating output

SummaryTables.jl currently supports exporting your results to HTML and LaTeX formats.

You can save your table into a file using show in the following way:

table = table_one(...) # Using table_one as an example

# Save in HTML format
open("table.html", "w") do io
    show(io, MIME"text/html"(), table)
end

# Save as LaTeX code
open("table.tex", "w") do io
    show(io, MIME"text/latex"(), table)
end

This will surely come in handy when trying to include the tables you generate with SummaryTables.jl in your publications and reports.
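
If you export tables often, the pattern above can be wrapped in a small function. The sketch below is a hypothetical helper (save_as is not part of SummaryTables.jl) that relies only on Base Julia; it uses Base’s HTML object as a stand-in so that it runs without any table at hand:

```julia
# Hypothetical helper: write any object that supports `mime` to `path`
function save_as(path::AbstractString, mime::MIME, obj)
    # Fail early if the object has no representation for this MIME type
    showable(mime, obj) || throw(ArgumentError("object cannot be shown as $mime"))
    open(path, "w") do io
        show(io, mime, obj)
    end
    return path
end

# Stand-in object from Base; a table would be passed in the same position
demo = HTML("<b>demo</b>")
path = save_as(joinpath(mktempdir(), "demo.html"), MIME"text/html"(), demo)
```

The showable check turns a silent fallback to the plain-text representation into an explicit error, which is usually preferable when the file is meant for a report.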

7 Conclusions

We hope this has been helpful in introducing you to the different types of tables that you can generate with SummaryTables.jl. We covered listing tables, which are useful for displaying raw data, as well as summary tables, which are perfect for presenting summarized data. Lastly, we discussed how to create tables in the commonly used “Table 1” format.

Make sure to check our documentation on SummaryTables.jl for more information on the available keyword arguments for each of the functions covered here and other examples on the use of this package.