using Pumas # available with Pumas products
using Bioequivalence # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using DataFramesMeta
using StatsBase
using GLM
using SummaryTables

A Course on Bioequivalence: Unit 7 - Understanding marginal means
This is Unit 7 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In Units 1–5 we studied the introductory statistical details of bioequivalence. Then in Unit 6 we momentarily departed from the theory to explore PumasCP. In this unit we pick up the statistical discussion as we learn about marginal means.
In particular, when we carry out a bioequivalence study comparing R and T (or potentially more formulations), in addition to estimating the geometric mean ratio (GMR) and obtaining a deciding confidence interval for it, we may also be interested in the geometric means of R and T separately. This can be for the purposes of comparative studies or for further understanding of the drug product in future studies. This is the topic of marginal means (also called least squares means in the classic SAS based literature).
We unpack these concepts in this unit:
- We understand the difference between marginal means and naive means.
- We see in which cases these means agree.
- We understand the way in which marginal means are computed and learn to treat the types of means separately.
We use the packages loaded at the top of this unit. We have used each of these packages in previous units.
2 Starting with naive means
Consider this simple example dataset for a parallel design where we assume the data has not been log transformed.
data = DataFrame(
    id = 1:10,
    seq = vcat(fill("R", 5), fill("T", 5)),
    per = fill(1, 10),
    AUC = [17.7, missing, 14.7, 20.55, 13.8, 15.45, 13.65, missing, 18.75, 22.95],
)
simple_table(data)
id | seq | per | AUC |
1 | R | 1 | 17.7 |
2 | R | 1 | missing |
3 | R | 1 | 14.7 |
4 | R | 1 | 20.6 |
5 | R | 1 | 13.8 |
6 | T | 1 | 15.4 |
7 | T | 1 | 13.6 |
8 | T | 1 | missing |
9 | T | 1 | 18.8 |
10 | T | 1 | 23 |
If we were interested in the mean of R and the mean of T, then the natural way to present these means is as geometric means. One way to compute these means is with the groupby and combine functions, where we then use the geomean function together with skipmissing since there are some missing observations:
gms = combine(groupby(data, :seq), :AUC => (x -> geomean(skipmissing(x))) => :GM)
simple_table(gms, round_digits = 4)
seq | GM |
R | 16.48 |
T | 17.36 |
As you can see, if we apply pumas_be to this dataset, we obtain the exact same means, 16.48 and 17.36. Note that when we call the pumas_be function we map the sequence column to :seq and the period column to :per:
pumas_be(data, sequence = :seq, period = :per)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 4 |
T | 4 |
Paradigm: Parallel design | ||||
Model: T test (equal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: AUC | ||||
Formulations: Reference(R), Test(T) | ||||
Results(AUC) | Assessment | Criteria | ||
R | Geometric Naive Mean | 16.48 | ||
T | Geometric Naive Mean | 17.36 | ||
Geometric Mean T/R Ratio (%) | 105.3 | |||
Degrees of Freedom | 6 | |||
90% Confidence Interval (%) | [79.39, 139.7] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 19.41 | 0.1923 | ||
In this case, Pumas writes Naive Mean to indicate that no model based mean computation is carried out. All that was done is the geometric mean computation on the data (equivalently, the exponent of the arithmetic mean of the log transformed data).
Indeed in a parallel design, there are no marginal means, only naive means.
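As a small illustration of this equivalence, here is a sketch (the four non-missing R values are copied by hand from the table above) showing that the direct geometric mean and the exponent of the mean of the logs coincide:

```julia
using StatsBase, Statistics

# Non-missing AUC values for formulation R, copied from the table above
x = [17.7, 14.7, 20.55, 13.8]

gm_direct = geomean(x)            # direct geometric mean
gm_via_logs = exp(mean(log.(x)))  # exponent of the arithmetic mean of the logs

gm_direct ≈ gm_via_logs           # the two computations coincide
```

Rounded to two digits, both give the 16.48 reported for R above.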
The GMR is the ratio of the geometric means
Importantly since,
\[ \text{GMR} = \frac{\text{GM}_T}{\text{GM}_R}, \]
you may check that (up to rounding),
\[ \texttt{105.3} = \texttt{100} \times \frac{\texttt{17.36}}{\texttt{16.48}}. \]
GMR_from_naive_means = round(100 * 17.36 / 16.48, digits = 1)
105.3
Hence the output of pumas_be
above is consistent in the sense that our estimates for the geometric means and the estimates for the geometric mean ratio agree. Such consistency between the formulation means and the GMR is obviously important when we report the study.
It would have been problematic and a cause for concern if,
\[ \text{GMR} \neq \frac{\text{GM}_T}{\text{GM}_R}. \]
Yet as we saw, this is not the case here.
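We can also reproduce this consistency check directly from the raw data rather than the rounded means. Here is a sketch, with the non-missing AUC values copied by hand from the parallel dataset above:

```julia
using StatsBase

# Non-missing AUC values, copied from the parallel dataset above
R = [17.7, 14.7, 20.55, 13.8]
T = [15.45, 13.65, 18.75, 22.95]

# The GMR as a percentage: ratio of the two geometric means
GMR_pct = round(100 * geomean(T) / geomean(R), digits = 1)  # 105.3, as reported
```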
3 A problem with naive means for non-parallel designs
When we move onto a non-parallel design where the GMR is a result of a statistical model, problems may occur.
To explore this, let us consider the example \(2 \times 2\) crossover dataset appearing in Park, Kim, and Bae (2020), which we also used in Unit 5:
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)
simple_table(data_2020_PKB, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
3 | TR | 1 | 105.2 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
6 | RT | 1 | 105.3 |
Now to compute the geometric naive means, we can take the following steps:
data2 = @rtransform data_2020_PKB :trt = string(:sequence[:period])
gms = combine(groupby(data2, :trt), :cmax => geomean => :GM)
sort!(gms, :trt)
simple_table(gms, round_digits = 4)
trt | GM |
R | 165.2 |
T | 147.9 |
The first line above creates a derived dataset, data2, with an additional column, :trt, which stands for the treatment or formulation. We then do a similar computation as we did for the parallel design, noting that here we do not need to apply skipmissing. The sort! line just sorts so that R appears before T.
Now if we apply pumas_be to the dataset, you can see that the geometric naive means are 165.2 and 147.9, exactly as we manually computed above.
pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
The GMR is NOT always the ratio of the geometric naive means
As you can see, the output presents the estimated GMR as 87.08 (as a percentage). As we learned in Unit 5, this is a model-based estimate of the GMR. This GMR is consistent with the confidence interval [55.16, 137.5], also obtained as a percentage. For example, you may verify (up to rounding errors) that it is at the “center” of the confidence interval:
correct_center_of_ci = round(100exp((log(137.5 / 100) + log(55.16 / 100)) / 2), digits = 2)
87.09
Above we put quotes around “center” since the GMR is at the center on the log scale, not on the natural scale:

wrong_center_of_ci = (137.5 + 55.16) / 2

96.33

This quantity, 96.33, is meaningless.
Now returning to the GMR, 87.08, if we try to compute the ratio from the two geometric naive means we get a different value:
GMR_from_naive_means = round(100 * 147.9 / 165.2, digits = 2)
89.53
This discrepancy may be a problem when we report the results: the model based estimate of the GMR differs from what we compute with simple arithmetic. This is not surprising, because the GMR resulted from the coefficients of the linear model and not from simple arithmetic computations. This is where the concept of marginal means is useful.
4 Introducing marginal means
Continuing to inspect the pumas_be output above, observe that we present a Geometric Marginal Mean for R at 160.5 and for T at 139.8. If we compute the GMR from these values, then we get:
GMR_from_marginal_means = round(100 * 139.8 / 160.5, digits = 2)
87.1
This quantity exactly agrees with the (model based) GMR estimate presented above (up to rounding). As such, when considering the mean values of the endpoints for the two formulations, it is generally recommended to use the marginal means and not the naive means. Nevertheless, pumas_be always presents the naive means as well, since these are basic summary statistics of the data, not necessarily related to the statistical model used.
It is key to understand that these marginal means are model based. In contrast to naive means that can easily be explained as being computed from the data, the marginal means are a slightly more complex beast.
Most pumas_be outputs present marginal means. The exceptions are parallel designs and the nonparametric paradigm. For a broad overview, see the Quick Run-through of Pumas Bioequivalence Outputs tutorial.
Cases where marginal and naive means agree
For certain datasets the marginal and naive means agree. This happens, in particular, when the number of observations in each formulation-by-period cell is balanced.
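One way to see whether a dataset is balanced in this sense is to count the observations in each formulation-by-period cell. Here is a sketch of such a check (the dataset is re-created so the snippet is self-contained); unequal counts indicate an unbalanced design:

```julia
using DataFrames, DataFramesMeta

# The crossover dataset from above, copied here for self-containment
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)

# Derive the treatment for each row, then count rows per (treatment, period) cell
df = @rtransform data_2020_PKB :trt = string(:sequence[:period])
cell_counts = sort!(combine(groupby(df, [:trt, :period]), nrow => :n), [:trt, :period])
# Here each formulation has 3 observations in period 1 but only 2 in period 2,
# so this dataset is unbalanced
```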
As we did in Unit 5, let us also create a variant of the dataset that is a complete case analysis:
ids_with_both_periods =
    (@rsubset @combine(@groupby(data_2020_PKB, :id), :count = length(:id)) :count == 2).id
data_2020_PKB_complete_case = @rsubset data_2020_PKB :id in ids_with_both_periods
simple_table(data_2020_PKB_complete_case, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
Now in this case you can see that the naive means and the marginal means agree:
pumas_be(data_2020_PKB_complete_case, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 2 | 2 |
TR | 2 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 184.9 | ||
Geometric Naive Mean | 184.9 | |||
T | Geometric Marginal Mean | 161 | ||
Geometric Naive Mean | 161 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.1476 | |||
Period (p-value) | 0.4684 | |||
5 How marginal means are computed
To understand marginal means, let us dive a bit more into the linear model, and also now view it as an ANOVA model.
Getting back to a basic linear model
In Unit 5 we saw that the linear model statistical formula for a crossover study is:
endpoint ~ 1 + sequence + formulation + period + sequence & id_within_sequence
Now for simplicity, let us consider a hypothetical case where we use a simpler model:
endpoint ~ formulation*period
Here formulation*period denotes crossing formulation and period, that is, both main effects plus their interaction. This expands to the model formula:
endpoint ~ 1 + formulation + period + formulation&period
With this simple example we can understand the basics of marginal means.
As there are two levels for formulation (R and T) and two levels for period (1 and 2), the associated mathematical formula for such a model is just,
\[ y = \beta_0 + \beta_1 x^{(\texttt{T})} + \beta_2 x^{(\texttt{2})} + \beta_3x^{(\texttt{T})}x^{(\texttt{2})} + \varepsilon. \]
Here \(x^{(\texttt{T})}\) and \(x^{(\texttt{2})}\) are boolean (\(0\) or \(1\)) variables, similarly to before, and \(x^{(\texttt{T})}x^{(\texttt{2})}\) is the interaction term, which only evaluates to \(1\) when we are in formulation T and period 2. With such a model, the prediction for the four possible cases is done via:
- \(\hat{y}_{\texttt{R}, \texttt{1}} = \hat{\beta}_0\).
- \(\hat{y}_{\texttt{R}, \texttt{2}} = \hat{\beta}_0 + \hat{\beta}_2\).
- \(\hat{y}_{\texttt{T}, \texttt{1}} = \hat{\beta}_0 + \hat{\beta}_1\).
- \(\hat{y}_{\texttt{T}, \texttt{2}} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3\).
This way of representing two categorical variables in a linear model was introduced in Unit 5.
Let us use this model on the following transformation of the data_2020_PKB dataset:
data_simple = @rtransform data_2020_PKB :logcmax = round(log(:cmax), digits = 2)
@rtransform! data_simple :formulation = string(:sequence[:period])
@transform! data_simple :period = string.(:period)
@select! data_simple :formulation :period :logcmax
simple_table(data_simple, round_mode = nothing)
formulation | period | logcmax |
T | 1 | 5.6 |
R | 2 | 6.02 |
T | 1 | 4.79 |
R | 2 | 4.92 |
T | 1 | 4.66 |
R | 1 | 4.51 |
T | 2 | 4.23 |
R | 1 | 5.43 |
T | 2 | 5.71 |
R | 1 | 4.66 |
Here is the fitting of the linear model:
model = lm(@formula(logcmax ~ formulation * period), data_simple)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
logcmax ~ 1 + formulation + period + formulation & period
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 4.86667 0.387752 12.55 <1e-04 3.91787 5.81546
formulation: T 0.15 0.548365 0.27 0.7936 -1.1918 1.4918
period: 2 0.603333 0.61309 0.98 0.3631 -0.896844 2.10351
formulation: T & period: 2 -0.65 0.867041 -0.75 0.4818 -2.77157 1.47157
────────────────────────────────────────────────────────────────────────────────────────
And here is an interpretation of the estimates for the four possible combinations:
β₀, β₁, β₂, β₃ = coef(model);
ŷ_R_1 = round(β₀, digits = 3)
ŷ_R_2 = round(β₀ + β₂, digits = 3)
ŷ_T_1 = round(β₀ + β₁, digits = 3)
ŷ_T_2 = round(β₀ + β₁ + β₂ + β₃, digits = 3)

ests_GLM = [ŷ_R_1 ŷ_R_2; ŷ_T_1 ŷ_T_2]

function cell_estimates(ests)
    println("formulation \\ period:\t\t", 1, "\t\t\t ", 2)
    println("\t R: \t\t\t\t ", ests[1, 1], "\t\t\t", ests[1, 2])
    println("\t T: \t\t\t\t ", ests[2, 1], "\t\t\t", ests[2, 2])
end

cell_estimates(ests_GLM)
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
And here is a verification of this using the predict function from the GLM package:
pred_df = DataFrame(formulation = ["R", "T", "R", "T"], period = string.([1, 1, 2, 2]))
reshape(round.(predict(model, pred_df), digits = 3), 2, 2) |> cell_estimates
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
Marginal means from the linear model
Now when we consider marginal means from this fit, in our case for formulation (the effect of interest), what we are speaking about is the population average of the above table across both values of period. The key here is that the population average treats the estimates as true population values, and then averages across the other variables (across period in this case). For example, one way to carry this out is to use the mean function with dims = 2, which takes the arithmetic mean of each row:
round.(mean(ests_GLM, dims = 2), digits = 2)
2×1 Matrix{Float64}:
5.17
4.99
Here the first value is the marginal mean for R and is the average of 4.867 and 5.47:
R_marginal_mean_GLM = round((4.867 + 5.47) / 2, digits = 2)
5.17
Similarly, the second value is the marginal mean for T and is the average of 5.017 and 4.97:
T_marginal_mean_GLM = round((5.017 + 4.97) / 2, digits = 2)
4.99
Such computation of marginal means can be carried out since the estimates are based on the coefficients:
\[ \left[ \begin{matrix} \beta_0 & \beta_0 + \beta_2 \\ \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{matrix} \right] \]
So if we compute the mean of the first row we get,
\[ \text{Marginal mean of~}\texttt{R} = \frac{1}{2}\Big(\underbrace{\beta_0}_{\hat{y}_{\texttt{R}, \texttt{1}}} + \underbrace{\beta_0 + \beta_2}_{\hat{y}_{\texttt{R}, \texttt{2}} } \Big) = \beta_0 + \frac{1}{2}\beta_2. \]
Similarly for the second row we get,
\[ \text{Marginal mean of~}\texttt{T} = \frac{1}{2}\Big(\hat{y}_{\texttt{T}, \texttt{1}} + \hat{y}_{\texttt{T}, \texttt{2}} \Big) = \beta_0 + \beta_1 + \frac{1}{2}\big(\beta_2 + \beta_3\big). \]
You can see that implementing these expressions agrees with the above:
round.([β₀ + 0.5β₂, β₀ + β₁ + 0.5(β₂ + β₃)], digits = 2)
2-element Vector{Float64}:
5.17
4.99
An associated two way ANOVA
Let us also consider an alternative way to represent this model: as a two way ANOVA model. With this, we write,
\[ y_{ij} = \mu + s_i + p_j + \gamma_{ij} + \varepsilon, \]
where \(y_{ij}\) is the response (endpoint on log scale) based on the first effect (formulation) being in level \(i\) and the second effect (period) being in level \(j\).
Here \(\mu\) is some fixed overall mean and \(s_i\) and \(p_j\) are the deviations of this mean based on the levels \(i\) and \(j\). In this specific case \(i=1,2\) and \(j=1,2\), but in greater generality we could have had more than two levels for \(s_i\) and/or for \(p_j\). The term \(\gamma_{ij}\) is the interaction term.
When treated as such a two way ANOVA model, we may consider the endpoint data organized as follows:
 | Period 1 (\(j=1\)) | Period 2 (\(j=2\)) |
---|---|---|
R \((i=1)\) | \(y^{(11)}_{1}, y^{(11)}_{2}, \ldots, y^{(11)}_{n_{11}}\) | \(y^{(12)}_{1}, y^{(12)}_{2}, \ldots, y^{(12)}_{n_{12}}\) |
T \((i=2)\) | \(y^{(21)}_{1}, y^{(21)}_{2}, \ldots, y^{(21)}_{n_{21}}\) | \(y^{(22)}_{1}, y^{(22)}_{2}, \ldots, y^{(22)}_{n_{22}}\) |
Here there are \(n_{11}\) observations from R in period 1, \(n_{12}\) observations from R in period 2, and so forth.
To organize the data in this ANOVA form, let us carry out the following:
formulations = sort(unique(data_simple.formulation))
periods = sort(unique(data_simple.period))
@show formulations
@show periods

anova_data = [
    (@rsubset data_simple (:formulation .== form .&& :period .== per)).logcmax
    for form in formulations, per in periods
]

println("\nANOVA form of the data:")
println("formulation \\ period:\t\t", periods[1], "\t\t\t\t\t", periods[2])
println("\t ", formulations[1], "\t\t\t\t", anova_data[1, :])
println("\t ", formulations[2], "\t\t\t\t", anova_data[2, :])
formulations = ["R", "T"]
periods = ["1", "2"]
ANOVA form of the data:
formulation \ period: 1 2
R [[4.51, 5.43, 4.66], [6.02, 4.92]]
T [[5.6, 4.79, 4.66], [4.23, 5.71]]
So with such a data organization, instead of considering the data in the original tabular form, we have data arrays for each combination of formulation and period.
Naive means with ANOVA
Such an organization naturally lends itself to easy computation of the naive means. For example:
R_data = vcat(anova_data[1, :]...)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
4.51 5.43 4.66 6.02 4.92
T_data = vcat(anova_data[2, :]...)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
5.6 4.79 4.66 4.23 5.71
The naive means are then just arithmetic means over these values:
R_naive_mean = round(mean(R_data), digits = 2)
T_naive_mean = round(mean(T_data), digits = 2)

@show R_naive_mean
@show T_naive_mean;
R_naive_mean = 5.11
T_naive_mean = 5.0
As you can see, these naive means differ from the marginal means that we computed above. Let us repeat the marginal means for comparison:
@show R_marginal_mean_GLM
@show T_marginal_mean_GLM;
R_marginal_mean_GLM = 5.17
T_marginal_mean_GLM = 4.99
ANOVA estimates
Now one nice property of an ANOVA model is that we can get estimates of the various cells (R period 1, R period 2, etc.) by taking the arithmetic mean of the respective cell. For example, just by applying the mean function on each cell in anova_data, we get the ANOVA estimates:
ests_ANOVA = round.(mean.(anova_data), digits = 3)
cell_estimates(ests_ANOVA)
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
These ANOVA based estimates are the same as the GLM estimates. Hence we can compute the marginal means from them just as easily:
round.(mean(ests_ANOVA, dims = 2), digits = 2)
2×1 Matrix{Float64}:
5.17
4.99
Why naive means and marginal means agree with balanced data
With the ANOVA based representation, it is easy to see why the naive means and marginal means differ in the example above. Take the R formulation as an example.
The naive mean was computed using R_data:
R_data
1×5 adjoint(::Vector{Float64}) with eltype Float64:
4.51 5.43 4.66 6.02 4.92
Here the first three observations are for period 1 and the other two observations are for period 2. Schematically, this is done via,
\[ 5.11 = \frac{1}{5}\Big(\underbrace{4.51 + 5.43 + 4.66}_{\text{Period 1}} + \underbrace{6.02 + 4.92}_{\text{Period 2}} \Big) \]
In contrast, the marginal mean is the average of the averages. Namely, we first average within Period 1 and within Period 2, and only then compute the overall average:
\[ 5.17 = \frac{1}{2}\Big(\frac{1}{3}(4.51 + 5.43 + 4.66) + \frac{1}{2}(6.02 + 4.92) \Big) \]
We thus see that our marginal and naive means generally do not agree in such an unbalanced data situation.
Now if our data were balanced, for example without the 4.66 observation, the respective values would agree; namely, you can verify that this equality holds:
\[ \frac{1}{4}\Big(\underbrace{4.51 + 5.43}_{\text{Period 1}} + \underbrace{6.02 + 4.92}_{\text{Period 2}} \Big) = \frac{1}{2}\Big(\frac{1}{2}(4.51 + 5.43) + \frac{1}{2}(6.02 + 4.92) \Big). \]
Such a pattern continues in all balanced data situations. This is why the marginal and naive means agree with balanced data.
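This equality is easy to check numerically. Here is a minimal sketch with the R values of the balanced variant (the 4.66 observation dropped):

```julia
using Statistics

period1 = [4.51, 5.43]  # R observations in period 1 (balanced variant)
period2 = [6.02, 4.92]  # R observations in period 2

naive = mean(vcat(period1, period2))            # mean over all values at once
marginal = (mean(period1) + mean(period2)) / 2  # average of per-period averages

naive ≈ marginal  # with equal cell sizes the two means coincide
```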
6 Conclusion
This unit addressed a critical, and often confusing, topic in bioequivalence analysis: the distinction between naive and marginal means. We demonstrated that while naive means—simple geometric averages of the data—are intuitive, they can create a significant inconsistency in crossover studies. Specifically, the ratio of naive means for the test and reference products does not generally match the model-derived geometric mean ratio (GMR), especially when the data is unbalanced.
To provide a complete picture, pumas_be thoughtfully calculates and presents both naive and marginal means in its output. The naive means serve as a valuable, straightforward summary statistic of the raw data, reflecting the simple average observed for each formulation. They are useful for an initial check and data description.
However, the marginal means (or least-squares means) are the model-based estimates. They are adjusted for other effects in the model (like period and sequence) and are therefore inherently consistent with the overall analysis. The key takeaway is a clear guideline for reporting: for a coherent and defensible study report, the marginal means are the correct values to report when discussing the individual performance of each formulation. Crucially, their ratio will precisely equal the GMR, providing a unified and logically sound set of results.
By presenting both, pumas_be equips the analyst with the tools for both initial data exploration (via naive means) and formal, consistent reporting (via marginal means).
7 Unit exercises
Confirming the Discrepancy in an Unbalanced Crossover Study
The central problem this unit addresses is that for unbalanced data, the ratio of naive geometric means does not equal the model-based GMR. Let’s prove this for ourselves using the data_2020_PKB dataset.
a. First, create a new column :trt that indicates the treatment (R or T) for each observation based on its sequence and period.
b. Using groupby and combine, calculate the naive geometric mean of cmax for R and for T across the entire dataset.
c. Calculate the GMR by taking the ratio of the naive geometric mean of T to that of R (and multiply by 100).
d. Now, run pumas_be(data_2020_PKB, endpoint = :cmax). Compare the GMR you calculated in part (c) to the Geometric Mean Ratio reported in the pumas_be output. Confirm that they are different.
Verifying Consistency in a Balanced Crossover Study
This unit claimed that naive and marginal means agree when the data is balanced. Let’s verify this using the data_2020_PKB_complete_case dataset, which represents a balanced subset of the original data.
a. As in Exercise 1, calculate the naive geometric means for R and T for the cmax endpoint, but this time using the data_2020_PKB_complete_case dataset.
b. Run pumas_be(data_2020_PKB_complete_case, endpoint = :cmax).
c. Compare the naive means you calculated in part (a) to the Geometric Marginal Mean values reported in the pumas_be output. Are they the same (up to rounding)?
d. Now, take the ratio of the marginal means from the pumas_be output. Does it match the reported Geometric Mean Ratio? This exercise demonstrates how consistency is achieved in both balanced and unbalanced cases when using the correct means.
Manual Calculation of Marginal Means
To solidify your understanding of how marginal means are derived from a model, let’s manually calculate them using the simplified linear model presented in the tutorial.
a. Using the data_simple DataFrame created in the tutorial, fit the linear model: model = lm(@formula(logcmax ~ formulation * period), data_simple).
b. Extract the model coefficients using coef(model). Let’s call them β₀, β₁, β₂, β₃.
c. Using the formulas derived in the unit, calculate the log-scale marginal mean for formulation R (β₀ + 0.5β₂) and formulation T (β₀ + β₁ + 0.5(β₂ + β₃)).
d. Exponentiate your results from part (c) to convert the log-scale marginal means back to the original scale (geometric marginal means).
e. Compare the values you calculated in (d) to the Geometric Marginal Mean values from the pumas_be output in Exercise 1. They will be very close but not identical. Why do you think there is a small difference? (Hint: Compare the model formula used here to the full model formula for a crossover study mentioned in Unit 5.)
Conceptual Question: The Impact of Imbalance
This question requires no coding, only reasoning based on the unit’s concepts.
Imagine a \(2 \times 2\) crossover study that starts with 20 subjects in the TR sequence and 20 subjects in the RT sequence.
- Scenario A: One subject from the TR sequence drops out after Period 1.
- Scenario B: Ten subjects from the TR sequence drop out after Period 1.
In which scenario (A or B) would you expect the difference between the naive geometric means and the marginal means to be larger? Explain your reasoning in one or two sentences, referencing the concept of data imbalance.