using Pumas # available with Pumas products
using Bioequivalence # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using DataFramesMeta
using StatsBase
using GLM
using SummaryTables

A Course on Bioequivalence: Unit 7 - Understanding marginal means
This is Unit 7 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In Units 1–5 we studied the introductory statistical details of bioequivalence. Then in Unit 6 we momentarily departed from the theory to explore PumasCP. In this unit we pick up the statistical discussion as we learn about marginal means.
In particular, when we carry out a bioequivalence study comparing R and T (or potentially more formulations), in addition to estimating the geometric mean ratio (GMR) and obtaining a deciding confidence interval for it, we may also be interested in the geometric means of R and T separately. This can be for the purposes of comparative studies or for further understanding of the drug product in future studies. This is the topic of marginal means (also called least squares means in the classic SAS based literature).
We unpack these concepts in this unit:
- We understand the difference between marginal means and naive means.
- We see in which cases these means agree.
- We understand the way in which marginal means are computed and learn to treat the types of means separately.
We use the packages loaded at the top of this unit. We have used each of these packages in previous units.
2 Starting with naive means
Consider this simple example dataset for a parallel design where we assume the data has not been log transformed.
data = DataFrame(
    id = 1:10,
    seq = vcat(fill("R", 5), fill("T", 5)),
    per = fill(1, 10),
    AUC = [17.7, missing, 14.7, 20.55, 13.8, 15.45, 13.65, missing, 18.75, 22.95],
)
simple_table(data)
id | seq | per | AUC |
1 | R | 1 | 17.7 |
2 | R | 1 | missing |
3 | R | 1 | 14.7 |
4 | R | 1 | 20.6 |
5 | R | 1 | 13.8 |
6 | T | 1 | 15.4 |
7 | T | 1 | 13.6 |
8 | T | 1 | missing |
9 | T | 1 | 18.8 |
10 | T | 1 | 23 |
If we were interested in the mean of R and the mean of T, then the natural way to present these means is as geometric means. One way to compute these means is with the groupby and combine functions, where we then use the geomean function together with skipmissing since there are some missing observations:
gms = combine(groupby(data, :seq), :AUC => (x -> geomean(skipmissing(x))) => :GM)
simple_table(gms, round_digits = 4)
seq | GM |
R | 16.48 |
T | 17.36 |
As you can see, if we apply pumas_be to this dataset, we obtain the exact same means, 16.48 and 17.36. Note that when we call the pumas_be function we map the sequence column to :seq and the period column to :per:
pumas_be(data, sequence = :seq, period = :per)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 4 |
T | 4 |
Paradigm: Parallel design | ||||
Model: T test (equal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: AUC | ||||
Formulations: Reference(R), Test(T) | ||||
Results(AUC) | Assessment | Criteria | ||
R | Geometric Naive Mean | 16.48 | ||
T | Geometric Naive Mean | 17.36 | ||
Geometric Mean T/R Ratio (%) | 105.3 | |||
Degrees of Freedom | 6 | |||
90% Confidence Interval (%) | [79.39, 139.7] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 19.41 | 0.1923 | ||
In this case, Pumas writes Naive Mean to indicate that no model based mean computation is carried out. All that was done is the geometric mean computation on the data (equivalently, the exponent of the arithmetic mean of the log transformed data).
Indeed in a parallel design, there are no marginal means, only naive means.
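As a small illustration of this equivalence, here is a sketch (the four non-missing R values are copied by hand from the table above) showing that the direct geometric mean and the exponent of the mean of the logs coincide:

```julia
using StatsBase, Statistics

# Non-missing AUC values for formulation R, copied from the table above
x = [17.7, 14.7, 20.55, 13.8]

gm_direct = geomean(x)            # direct geometric mean
gm_via_logs = exp(mean(log.(x)))  # exponent of the arithmetic mean of the logs

gm_direct ≈ gm_via_logs           # the two computations coincide
```

Rounded to two digits, both give the 16.48 reported for R above.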
The GMR is the ratio of the geometric means
Importantly since,
\[ \text{GMR} = \frac{\text{GM}_T}{\text{GM}_R}, \]
you may check that (up to rounding),
\[ \texttt{105.3} = \texttt{100} \times \frac{\texttt{17.36}}{\texttt{16.48}}. \]
GMR_from_naive_means = round(100 * 17.36 / 16.48, digits = 1)
105.3
Hence the output of pumas_be
above is consistent in the sense that our estimates for the geometric means and the estimates for the geometric mean ratio agree. Such consistency between the formulation means and the GMR is obviously important when we report the study.
It would have been problematic and a cause for concern if,
\[ \text{GMR} \neq \frac{\text{GM}_T}{\text{GM}_R}. \]
Yet as we saw, this is not the case here.
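We can also reproduce this consistency check directly from the raw data rather than the rounded means. Here is a sketch, with the non-missing AUC values copied by hand from the parallel dataset above:

```julia
using StatsBase

# Non-missing AUC values, copied from the parallel dataset above
R = [17.7, 14.7, 20.55, 13.8]
T = [15.45, 13.65, 18.75, 22.95]

# The GMR as a percentage: ratio of the two geometric means
GMR_pct = round(100 * geomean(T) / geomean(R), digits = 1)  # 105.3, as reported
```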
3 A problem with naive means for non-parallel designs
When we move onto a non-parallel design where the GMR is a result of a statistical model, problems may occur.
To explore this, let us consider the example \(2 \times 2\) crossover dataset appearing in Park, Kim, and Bae (2020), which we also used in Unit 5:
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)
simple_table(data_2020_PKB, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
3 | TR | 1 | 105.2 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
6 | RT | 1 | 105.3 |
Now to compute the geometric naive means, we can take the following steps:
data2 = @rtransform data_2020_PKB :trt = string(:sequence[:period])
gms = combine(groupby(data2, :trt), :cmax => geomean => :GM)
sort!(gms, :trt)
simple_table(gms, round_digits = 4)
trt | GM |
R | 165.2 |
T | 147.9 |
The first line above creates a derived dataset, data2, with an additional column, :trt, which stands for the treatment or formulation. We then do a similar computation as we did for the parallel design, noting that here we do not need to apply skipmissing. The sort! line just sorts so that R appears before T.
Now if we apply pumas_be to the dataset, you can see that the geometric naive means are 165.2 and 147.9, exactly as we manually computed above.
pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
The GMR is NOT always the ratio of the geometric naive means
As you can see, the output presents the estimated GMR as 87.08 (as a percentage). As we learned in Unit 5, this is a model-based estimate of the GMR. This GMR is consistent with the confidence interval [55.16, 137.5], also obtained as a percentage. For example, you may verify (up to rounding errors) that it is at the “center” of the confidence interval:
correct_center_of_ci = round(100exp((log(137.5 / 100) + log(55.16 / 100)) / 2), digits = 2)
87.09
Above we put quotes around “center” since the GMR is at the center on the log scale, not on the natural scale:

wrong_center_of_ci = (137.5 + 55.16) / 2

96.33

This quantity, 96.33, is meaningless.
Now returning to the GMR, 87.08, if we try to compute the ratio from the two geometric naive means we get a different value:
GMR_from_naive_means = round(100 * 147.9 / 165.2, digits = 2)
89.53
This discrepancy may be a problem when we report the results: the model based estimate of the GMR differs from what we compute with simple arithmetic. This is not surprising, because the GMR resulted from the coefficients of the linear model and not from simple arithmetic computations. This is where the concept of marginal means is useful.
4 Introducing marginal means
Continuing to inspect the pumas_be output above, observe that we present a Geometric Marginal Mean for R at 160.5 and for T at 139.8. If we compute the GMR from these values, then we get:
GMR_from_marginal_means = round(100 * 139.8 / 160.5, digits = 2)
87.1
This quantity exactly agrees with the (model based) GMR estimate presented above (up to rounding). As such, when considering the mean values of the endpoints for the two formulations, it is generally recommended to use the marginal means and not the naive means. Nevertheless, pumas_be always presents the naive means as well, since these are basic summary statistics of the data, not necessarily related to the statistical model used.
It is key to understand that these marginal means are model based. In contrast to naive means that can easily be explained as being computed from the data, the marginal means are a slightly more complex beast.
Most pumas_be outputs present marginal means. The exceptions are parallel designs and the nonparametric paradigm. For a broad overview, see the Quick Run-through of Pumas Bioequivalence Outputs tutorial.
Cases where marginal and naive means agree
For certain datasets the marginal and naive means agree. This happens, in particular, when the number of observations in each formulation-by-period cell is balanced.
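One way to see whether a dataset is balanced in this sense is to count the observations in each formulation-by-period cell. Here is a sketch of such a check (the dataset is re-created so the snippet is self-contained); unequal counts indicate an unbalanced design:

```julia
using DataFrames, DataFramesMeta

# The crossover dataset from above, copied here for self-containment
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)

# Derive the treatment for each row, then count rows per (treatment, period) cell
df = @rtransform data_2020_PKB :trt = string(:sequence[:period])
cell_counts = sort!(combine(groupby(df, [:trt, :period]), nrow => :n), [:trt, :period])
# Here each formulation has 3 observations in period 1 but only 2 in period 2,
# so this dataset is unbalanced
```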
As we did in Unit 5, let us also create a variant of the dataset that is a complete case analysis:
ids_with_both_periods =
    (@rsubset @combine(@groupby(data_2020_PKB, :id), :count = length(:id)) :count == 2).id
data_2020_PKB_complete_case = @rsubset data_2020_PKB :id in ids_with_both_periods
simple_table(data_2020_PKB_complete_case, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
Now in this case you can see that the naive means and the marginal means agree:
pumas_be(data_2020_PKB_complete_case, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 2 | 2 |
TR | 2 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 184.9 | ||
Geometric Naive Mean | 184.9 | |||
T | Geometric Marginal Mean | 161 | ||
Geometric Naive Mean | 161 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.1476 | |||
Period (p-value) | 0.4684 | |||
5 How marginal means are computed
To understand marginal means, let us dive a bit more into the linear model, and also now view it as an ANOVA model.
Getting back to a basic linear model
In Unit 5 we saw that the linear model statistical formula for a crossover study is:
endpoint ~ 1 + sequence + formulation + period + sequence & id_within_sequence
Now for simplicity, let us consider a hypothetical case where we use a simpler model:
endpoint ~ formulation*period
Here formulation*period denotes crossing formulation and period, that is, both main effects plus their interaction. This expands to the model formula:
endpoint ~ 1 + formulation + period + formulation&period
With this simple example we can understand the basics of marginal means.
As there are two levels for formulation (R and T) and two levels for period (1 and 2), the associated mathematical formula for such a model is just,
\[ y = \beta_0 + \beta_1 x^{(\texttt{T})} + \beta_2 x^{(\texttt{2})} + \beta_3x^{(\texttt{T})}x^{(\texttt{2})} + \varepsilon. \]
Here \(x^{(\texttt{T})}\) and \(x^{(\texttt{2})}\) are boolean (\(0\) or \(1\)) variables, similarly to before, and \(x^{(\texttt{T})}x^{(\texttt{2})}\) is the interaction term, which only evaluates to \(1\) when we are in formulation T and period 2. With such a model, the prediction for the four possible cases is done via:
- \(\hat{y}_{\texttt{R}, \texttt{1}} = \hat{\beta}_0\).
- \(\hat{y}_{\texttt{R}, \texttt{2}} = \hat{\beta}_0 + \hat{\beta}_2\).
- \(\hat{y}_{\texttt{T}, \texttt{1}} = \hat{\beta}_0 + \hat{\beta}_1\).
- \(\hat{y}_{\texttt{T}, \texttt{2}} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3\).
This way of representing two categorical variables in a linear model was introduced in Unit 5.
Let us use this model on the following transformation of the data_2020_PKB dataset:
data_simple = @rtransform data_2020_PKB :logcmax = round(log(:cmax), digits = 2)
@rtransform! data_simple :formulation = string(:sequence[:period])
@transform! data_simple :period = string.(:period)
@select! data_simple :formulation :period :logcmax
simple_table(data_simple, round_mode = nothing)
formulation | period | logcmax |
T | 1 | 5.6 |
R | 2 | 6.02 |
T | 1 | 4.79 |
R | 2 | 4.92 |
T | 1 | 4.66 |
R | 1 | 4.51 |
T | 2 | 4.23 |
R | 1 | 5.43 |
T | 2 | 5.71 |
R | 1 | 4.66 |
Here is the fitting of the linear model:
model = lm(@formula(logcmax ~ formulation * period), data_simple)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
logcmax ~ 1 + formulation + period + formulation & period
Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
Coef. Std. Error t Pr(>|t|) Lower 95% Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept) 4.86667 0.387752 12.55 <1e-04 3.91787 5.81546
formulation: T 0.15 0.548365 0.27 0.7936 -1.1918 1.4918
period: 2 0.603333 0.61309 0.98 0.3631 -0.896844 2.10351
formulation: T & period: 2 -0.65 0.867041 -0.75 0.4818 -2.77157 1.47157
────────────────────────────────────────────────────────────────────────────────────────
And here is an interpretation of the estimates for the four possible combinations:
β₀, β₁, β₂, β₃ = coef(model);
ŷ_R_1 = round(β₀, digits = 3)
ŷ_R_2 = round(β₀ + β₂, digits = 3)
ŷ_T_1 = round(β₀ + β₁, digits = 3)
ŷ_T_2 = round(β₀ + β₁ + β₂ + β₃, digits = 3)

ests_GLM = [ŷ_R_1 ŷ_R_2; ŷ_T_1 ŷ_T_2]

function cell_estimates(ests)
    println("formulation \\ period:\t\t", 1, "\t\t\t ", 2)
    println("\t R: \t\t\t\t ", ests[1, 1], "\t\t\t", ests[1, 2])
    println("\t T: \t\t\t\t ", ests[2, 1], "\t\t\t", ests[2, 2])
end

cell_estimates(ests_GLM)
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
And here is a verification of this using the predict function from the GLM package:
pred_df = DataFrame(formulation = ["R", "T", "R", "T"], period = string.([1, 1, 2, 2]))
reshape(round.(predict(model, pred_df), digits = 3), 2, 2) |> cell_estimates
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
Marginal means from the linear model
Now when we consider marginal means from this fit, in our case for formulation (the effect of interest), what we are speaking about is the population average of the above table across both values of period. The key here is that the population average treats the estimates as true population values, and then averages across the other variables (across period in this case). For example, one way to carry this out is to use the mean function with dims = 2, which takes the arithmetic mean of each row:
round.(mean(ests_GLM, dims = 2), digits = 2)
2×1 Matrix{Float64}:
5.17
4.99
Here the first value is the marginal mean for R and is the average of 4.867 and 5.47:
R_marginal_mean_GLM = round((4.867 + 5.47) / 2, digits = 2)
5.17
Similarly, the second value is the marginal mean for T and is the average of 5.017 and 4.97:
T_marginal_mean_GLM = round((5.017 + 4.97) / 2, digits = 2)
4.99
Such computation of marginal means can be carried out since the estimates are based on the coefficients:
\[ \left[ \begin{matrix} \beta_0 & \beta_0 + \beta_2 \\ \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_2 + \beta_3 \end{matrix} \right] \]
So if we compute the mean of the first row we get,
\[ \text{Marginal mean of~}\texttt{R} = \frac{1}{2}\Big(\underbrace{\beta_0}_{\hat{y}_{\texttt{R}, \texttt{1}}} + \underbrace{\beta_0 + \beta_2}_{\hat{y}_{\texttt{R}, \texttt{2}} } \Big) = \beta_0 + \frac{1}{2}\beta_2. \]
Similarly for the second row we get,
\[ \text{Marginal mean of~}\texttt{T} = \frac{1}{2}\Big(\hat{y}_{\texttt{T}, \texttt{1}} + \hat{y}_{\texttt{T}, \texttt{2}} \Big) = \beta_0 + \beta_1 + \frac{1}{2}\big(\beta_2 + \beta_3\big). \]
You can see that implementing these expressions agrees with the above:
round.([β₀ + 0.5β₂, β₀ + β₁ + 0.5(β₂ + β₃)], digits = 2)
2-element Vector{Float64}:
5.17
4.99
An associated two way ANOVA
Let us also consider an alternative way to represent this model: as a two way ANOVA model. With this, we write,
\[ y_{ij} = \mu + s_i + p_j + \gamma_{ij} + \varepsilon, \]
where \(y_{ij}\) is the response (endpoint on log scale) based on the first effect (formulation) being in level \(i\) and the second effect (period) being in level \(j\).
Here \(\mu\) is some fixed overall mean and \(s_i\) and \(p_j\) are the deviations of this mean based on the levels \(i\) and \(j\). In this specific case \(i=1,2\) and \(j=1,2\), but in greater generality we could have had more than two levels for \(s_i\) and/or for \(p_j\). The term \(\gamma_{ij}\) is the interaction term.
When treated as such a two way ANOVA model, we may consider the endpoint data organized as follows:
 | Period 1 (\(j=1\)) | Period 2 (\(j=2\)) |
---|---|---|
R \((i=1)\) | \(y^{(11)}_{1}, y^{(11)}_{2}, \ldots, y^{(11)}_{n_{11}}\) | \(y^{(12)}_{1}, y^{(12)}_{2}, \ldots, y^{(12)}_{n_{12}}\) |
T \((i=2)\) | \(y^{(21)}_{1}, y^{(21)}_{2}, \ldots, y^{(21)}_{n_{21}}\) | \(y^{(22)}_{1}, y^{(22)}_{2}, \ldots, y^{(22)}_{n_{22}}\) |
Here there are \(n_{11}\) observations from R in period 1, \(n_{12}\) observations from R in period 2, and so forth.
To organize the data in this ANOVA form, let us carry out the following:
formulations = sort(unique(data_simple.formulation))
periods = sort(unique(data_simple.period))
@show formulations
@show periods

anova_data = [
    (@rsubset data_simple (:formulation .== form .&& :period .== per)).logcmax
    for form in formulations, per in periods
]

println("\nANOVA form of the data:")
println("formulation \\ period:\t\t", periods[1], "\t\t\t\t\t", periods[2])
println("\t ", formulations[1], "\t\t\t\t", anova_data[1, :])
println("\t ", formulations[2], "\t\t\t\t", anova_data[2, :])
formulations = ["R", "T"]
periods = ["1", "2"]
ANOVA form of the data:
formulation \ period: 1 2
R [[4.51, 5.43, 4.66], [6.02, 4.92]]
T [[5.6, 4.79, 4.66], [4.23, 5.71]]
So with such a data organization, instead of considering the data in the original tabular form, we have data arrays for each combination of formulation and period.
Naive means with ANOVA
Such an organization naturally lends itself to easy computation of the naive means. For example:
R_data = vcat(anova_data[1, :]...)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
4.51 5.43 4.66 6.02 4.92
T_data = vcat(anova_data[2, :]...)'
1×5 adjoint(::Vector{Float64}) with eltype Float64:
5.6 4.79 4.66 4.23 5.71
The naive means are then just arithmetic means over these values:
R_naive_mean = round(mean(R_data), digits = 2)
T_naive_mean = round(mean(T_data), digits = 2)

@show R_naive_mean
@show T_naive_mean;
R_naive_mean = 5.11
T_naive_mean = 5.0
As you can see, these naive means differ from the marginal means that we computed above. Let us repeat the marginal means for comparison:
@show R_marginal_mean_GLM
@show T_marginal_mean_GLM;
R_marginal_mean_GLM = 5.17
T_marginal_mean_GLM = 4.99
ANOVA estimates
Now one nice property of an ANOVA model is that we can get estimates of the various cells (R period 1, R period 2, etc.) by taking the arithmetic mean of the respective cell. For example, just by applying the mean function on each cell in anova_data, we get the ANOVA estimates:
ests_ANOVA = round.(mean.(anova_data), digits = 3)
cell_estimates(ests_ANOVA)
formulation \ period: 1 2
R: 4.867 5.47
T: 5.017 4.97
These ANOVA based estimates are the same as the GLM estimates. Hence we can compute the marginal means from them just as easily:
round.(mean(ests_ANOVA, dims = 2), digits = 2)
2×1 Matrix{Float64}:
5.17
4.99
Why naive means and marginal means agree with balanced data
With the ANOVA based representation, it is easy to see why the naive means and marginal means differ in the example above. Take the R formulation as an example.
The naive mean was computed using R_data:
R_data
1×5 adjoint(::Vector{Float64}) with eltype Float64:
4.51 5.43 4.66 6.02 4.92
Here the first three observations are for period 1 and the other two observations are for period 2. Schematically, this is done via,
\[ 5.11 = \frac{1}{5}\Big(\underbrace{4.51 + 5.43 + 4.66}_{\text{Period 1}} + \underbrace{6.02 + 4.92}_{\text{Period 2}} \Big) \]
In contrast, the marginal mean is the average of the averages. Namely, we first average within Period 1 and within Period 2, and only then compute the overall average:
\[ 5.17 = \frac{1}{2}\Big(\frac{1}{3}(4.51 + 5.43 + 4.66) + \frac{1}{2}(6.02 + 4.92) \Big) \]
We thus see that our marginal and naive means generally do not agree in such an unbalanced data situation.
Now if our data were balanced, for example without the 4.66 observation, the respective values would agree; namely, you can verify that this equality holds:
\[ \frac{1}{4}\Big(\underbrace{4.51 + 5.43}_{\text{Period 1}} + \underbrace{6.02 + 4.92}_{\text{Period 2}} \Big) = \frac{1}{2}\Big(\frac{1}{2}(4.51 + 5.43) + \frac{1}{2}(6.02 + 4.92) \Big). \]
Such a pattern continues in all balanced data situations. This is why the marginal and naive means agree with balanced data.
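This equality is easy to check numerically. Here is a minimal sketch with the R values of the balanced variant (the 4.66 observation dropped):

```julia
using Statistics

period1 = [4.51, 5.43]  # R observations in period 1 (balanced variant)
period2 = [6.02, 4.92]  # R observations in period 2

naive = mean(vcat(period1, period2))            # mean over all values at once
marginal = (mean(period1) + mean(period2)) / 2  # average of per-period averages

naive ≈ marginal  # with equal cell sizes the two means coincide
```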
6 Conclusion
This unit addressed a critical, and often confusing, topic in bioequivalence analysis: the distinction between naive and marginal means. We demonstrated that while naive means—simple geometric averages of the data—are intuitive, they can create a significant inconsistency in crossover studies. Specifically, the ratio of naive means for the test and reference products does not generally match the model-derived geometric mean ratio (GMR), especially when the data is unbalanced.
To provide a complete picture, pumas_be thoughtfully calculates and presents both naive and marginal means in its output. The naive means serve as a valuable, straightforward summary statistic of the raw data, reflecting the simple average observed for each formulation. They are useful for an initial check and data description.
However, the marginal means (or least-squares means) are the model-based estimates. They are adjusted for other effects in the model (like period and sequence) and are therefore inherently consistent with the overall analysis. The key takeaway is a clear guideline for reporting: for a coherent and defensible study report, the marginal means are the correct values to report when discussing the individual performance of each formulation. Crucially, their ratio will precisely equal the GMR, providing a unified and logically sound set of results.
By presenting both, pumas_be equips the analyst with the tools for both initial data exploration (via naive means) and formal, consistent reporting (via marginal means).
7 Unit exercises
Confirming the Discrepancy in an Unbalanced Crossover Study
The central problem this unit addresses is that for unbalanced data, the ratio of naive geometric means does not equal the model-based GMR. Let’s prove this for ourselves using the data_2020_PKB dataset.
a. First, create a new column :trt that indicates the treatment (R or T) for each observation based on its sequence and period.
b. Using groupby and combine, calculate the naive geometric mean of cmax for R and for T across the entire dataset.
c. Calculate the GMR by taking the ratio of the naive geometric mean of T to that of R (and multiply by 100).
d. Now, run pumas_be(data_2020_PKB, endpoint = :cmax). Compare the GMR you calculated in part (c) to the Geometric Mean Ratio reported in the pumas_be output. Confirm that they are different.
Verifying Consistency in a Balanced Crossover Study
This unit claimed that naive and marginal means agree when the data is balanced. Let’s verify this using the data_2020_PKB_complete_case dataset, which represents a balanced subset of the original data.
a. As in Exercise 1, calculate the naive geometric means for R and T for the cmax endpoint, but this time using the data_2020_PKB_complete_case dataset.
b. Run pumas_be(data_2020_PKB_complete_case, endpoint = :cmax).
c. Compare the naive means you calculated in part (a) to the Geometric Marginal Mean values reported in the pumas_be output. Are they the same (up to rounding)?
d. Now, take the ratio of the marginal means from the pumas_be output. Does it match the reported Geometric Mean Ratio? This exercise demonstrates how consistency is achieved in both balanced and unbalanced cases when using the correct means.
Manual Calculation of Marginal Means
To solidify your understanding of how marginal means are derived from a model, let’s manually calculate them using the simplified linear model presented in the tutorial.
a. Using the data_simple DataFrame created in the tutorial, fit the linear model: model = lm(@formula(logcmax ~ formulation * period), data_simple).
b. Extract the model coefficients using coef(model). Let’s call them β₀, β₁, β₂, β₃.
c. Using the formulas derived in the unit, calculate the log-scale marginal mean for formulation R (β₀ + 0.5β₂) and formulation T (β₀ + β₁ + 0.5(β₂ + β₃)).
d. Exponentiate your results from part (c) to convert the log-scale marginal means back to the original scale (geometric marginal means).
e. Compare the values you calculated in (d) to the Geometric Marginal Mean values from the pumas_be output in Exercise 1. They will be very close but not identical. Why do you think there is a small difference? (Hint: Compare the model formula used here to the full model formula for a crossover study mentioned in Unit 5.)
Conceptual Question: The Impact of Imbalance
This question requires no coding, only reasoning based on the unit’s concepts.
Imagine a \(2 \times 2\) crossover study that starts with 20 subjects in the TR sequence and 20 subjects in the RT sequence.
- Scenario A: One subject from the TR sequence drops out after Period 1.
- Scenario B: Ten subjects from the TR sequence drop out after Period 1.
In which scenario (A or B) would you expect the difference between the naive geometric means and the marginal means to be larger? Explain your reasoning in one or two sentences, referencing the concept of data imbalance.