using Bioequivalence # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using SummaryTables
using GLM

A Course on Bioequivalence: Unit 14 - On Type III p-values
This is Unit 14 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In Unit 5, we introduced the linear model as the engine for analyzing the \(2 \times 2\) crossover design. We saw that fitting this model yields coefficients for each factor, and from these, we derive the geometric mean ratio (GMR) and its confidence interval. However, the pumas_be output also includes an ANOVA section with p-values for Formulation, Sequence, and Period. These are not the same as the p-values for the individual coefficients we saw in the model’s coefficient table.
In this unit, we will explore these ANOVA p-values. Specifically, we cover:
- The difference between p-values for individual coefficients and p-values for entire factors.
- A brief overview of the different “types” of ANOVA p-values (tests): Type I, II, and III.
- Why Type III tests are the standard for bioequivalence analysis.
- How to interpret the Type III p-values for Formulation, Sequence, and Period in the context of a BE study. In particular, we claim that the Formulation p-value is only useful for comparing across software packages and is not practical for bioequivalence studies. In contrast, the Sequence and Period p-values may give some indication of the presence of such effects.
We use the following packages, loaded at the top of this unit: Bioequivalence, PharmaDatasets, DataFrames, SummaryTables, and GLM. We have used all of these packages in previous tutorials.
2 P-values in a Linear Model
When we fit a linear model, there are two common categories of p-values we might encounter, and it is crucial to understand the distinction between them: one category is p-values for individual coefficients, and the other is p-values for entire factors (ANOVA).
P-values for Individual Coefficients
When we fit a linear model and look at its coefficient table (coeftable), we get a p-value for each estimated coefficient. Let’s revisit this with a standard \(2 \times 2\) crossover dataset.
As we saw in Unit 5, we can inspect the underlying linear model and its coefficients. The p-values in the Pr(>|t|) column each test a very specific hypothesis: for each row, the null hypothesis \(H_0\) is that the coefficient equals zero, and the alternative \(H_1\) is that it differs from zero. Such p-values therefore check whether an individual coefficient is meaningful on its own.
Let us return to this example dataset from Park, Kim, and Bae (2020):
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)
simple_table(data_2020_PKB, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
3 | TR | 1 | 105.2 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
6 | RT | 1 | 105.3 |
Running pumas_be on this data (as we did in Unit 5), we obtain:
result = pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
Let us also see the coefficient table of the linear model:
simple_table(result.model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.87 | 0.128 | 38.1 | 0.000688 | 4.32 | 5.42 |
sequence: TR | 0.286 | 0.156 | 1.83 | 0.209 | -0.387 | 0.959 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.47 | -0.811 | 0.535 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.468 | -0.534 | 0.812 |
sequence: RT & id_within_sequence: 2 | 0.704 | 0.143 | 4.93 | 0.0388 | 0.0894 | 1.32 |
sequence: TR & id_within_sequence: 2 | -0.297 | 0.143 | -2.08 | 0.173 | -0.911 | 0.318 |
sequence: RT & id_within_sequence: 3 | -0.209 | 0.181 | -1.16 | 0.367 | -0.986 | 0.568 |
sequence: TR & id_within_sequence: 3 | -0.358 | 0.181 | -1.98 | 0.186 | -1.13 | 0.419 |
You can see that each of the \(8\) coefficients has a p-value in the Pr(>|t|) column, most of which happen to be insignificant in this specific example.
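To see concretely what each row’s p-value is built from, here is a minimal sketch using the formulation: T row of the table above. The variable names are illustrative only, and turning t into Pr(>|t|) would additionally require the residual degrees of freedom (2 in this small example):

```julia
# A minimal sketch: each per-coefficient p-value is a two-sided t-test of
# H0: coefficient = 0, built from t = estimate / standard error.
coef_estimate = -0.138   # "formulation: T" estimate, copied from the table above
std_error     = 0.156    # its standard error, copied from the table above
t_statistic   = coef_estimate / std_error   # ≈ -0.884, matching the t column
# Comparing |t_statistic| against a t distribution with the residual
# degrees of freedom yields the Pr(>|t|) entry (here 0.47).
```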
The key takeaway is that these p-values are about individual parameters, often comparing one level of a factor to a baseline level. They do not answer a broader question such as: “Overall, is the sequence factor a significant source of variation in the model?”
For example, let us consider a dataset with 6 sequences and see the first 15 rows of the coefficient table:
be_6_sequence_data = dataset("bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5");
result_6 = pumas_be(be_6_sequence_data)
simple_table(first(result_6.model |> coeftable, 15), halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 8.18 | 0.0563 | 145 | 3.02e-131 | 8.07 | 8.29 |
sequence: RTS | 0.0481 | 0.0585 | 0.823 | 0.412 | -0.0678 | 0.164 |
sequence: SRT | 0.172 | 0.0585 | 2.93 | 0.00407 | 0.0556 | 0.287 |
sequence: STR | -0.1 | 0.0596 | -1.68 | 0.0948 | -0.219 | 0.0177 |
sequence: TRS | 0.187 | 0.061 | 3.06 | 0.00273 | 0.0661 | 0.308 |
sequence: TSR | 0.0529 | 0.0596 | 0.888 | 0.377 | -0.0652 | 0.171 |
formulation: S | 0.339 | 0.0386 | 8.79 | 1.8e-14 | 0.263 | 0.416 |
formulation: T | 0.15 | 0.0386 | 3.9 | 0.000163 | 0.0741 | 0.227 |
period: 2 | 0.053 | 0.0384 | 1.38 | 0.169 | -0.0229 | 0.129 |
period: 3 | -0.0294 | 0.0388 | -0.756 | 0.451 | -0.106 | 0.0476 |
sequence: RST & id_within_sequence: 2 | -0.405 | 0.116 | -3.48 | 0.000702 | -0.635 | -0.175 |
sequence: RTS & id_within_sequence: 2 | 0.0825 | 0.116 | 0.714 | 0.477 | -0.146 | 0.311 |
sequence: SRT & id_within_sequence: 2 | -0.521 | 0.116 | -4.51 | 1.59e-5 | -0.75 | -0.292 |
sequence: STR & id_within_sequence: 2 | -0.125 | 0.115 | -1.09 | 0.279 | -0.353 | 0.103 |
sequence: TRS & id_within_sequence: 2 | 0.378 | 0.117 | 3.24 | 0.00159 | 0.147 | 0.61 |
Now there are 5 coefficients (rows 2 to 6 of the table) associated with sequence, each with its own p-value. So if we seek a single p-value answering whether sequence is significant overall, which one should we use?
To answer that, we need ANOVA.
P-values for Factors (ANOVA)
Analysis of Variance (ANOVA) tests a different kind of hypothesis. Instead of testing individual coefficients, it tests whether an entire factor (which may be represented by one or more coefficients) has a significant effect on the outcome. For example, the ANOVA p-value for sequence tests the null hypothesis:
\(H_0:\) There is no difference in the mean response between any of the levels of sequence.
With such a hypothesis, a “small” p-value provides evidence against the null, and thus evidence that “there is a sequence effect”. In contrast, if the p-value is not small, there is no evidence of such a sequence effect.
Such ANOVA-style p-values for factors are not part of the model field of the pumas_be result; rather, they are computed by pumas_be specifically for the bioequivalence context.
The F-test
The statistical procedure used to test the significance of an entire factor in the ANOVA framework is the F-test. The logic of the F-test is to compare the variability in the data that is explained by a factor against the variability that is unexplained (i.e., the random error or residuals).
The test computes a value known as the F-statistic, which can loosely be understood as a ratio:
\[ F = \frac{\text{Variability between the factor's groups}}{\text{Variability within the groups}}. \]
Let’s think about this in the context of the Sequence factor (again assuming the sequences are TR and RT):
- The variability between groups measures how different the average response is for the TR group versus the RT group.
- The variability within groups measures the random variation among subjects who are in the same sequence group. This is considered the baseline or “error” variability.
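To make the ratio concrete, here is a minimal sketch (with made-up numbers, not data from this course) computing a one-way ANOVA F-statistic directly from between-group and within-group sums of squares:

```julia
# A minimal sketch (hypothetical numbers): the one-way ANOVA F-statistic
# for a two-level factor, computed directly from sums of squares.
using Statistics

tr = [10.0, 12.0, 14.0]   # made-up responses for the TR group
rt = [11.0, 13.0, 15.0]   # made-up responses for the RT group
groups = [tr, rt]

grand_mean = mean(vcat(groups...))

# between-group sum of squares and its degrees of freedom
ss_between = sum(length(g) * (mean(g) - grand_mean)^2 for g in groups)
df_between = length(groups) - 1

# within-group (error) sum of squares and its degrees of freedom
ss_within = sum(sum((x - mean(g))^2 for x in g) for g in groups)
df_within = sum(length, groups) - length(groups)

# F compares explained variability per df to error variability per df
F = (ss_between / df_between) / (ss_within / df_within)   # 0.375 here
```

Here the between-group difference is small relative to the within-group spread, so F is well below 1 and would not be significant.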
Interestingly, the F-test is based on the F distribution, which also appeared in the previous unit in a completely different application context.
If the Sequence factor has no real effect (the null hypothesis is true), then the variability between the TR and RT groups should be about the same as the random variability we see within each group. In this case, the F-statistic will be close to 1.
However, if the Sequence factor does have a significant effect, the variability between the groups will be much larger than the random variability within them. This results in an F-statistic significantly greater than 1.
The p-value is then calculated based on this F-statistic. A large F-statistic leads to a small p-value, providing evidence against the null hypothesis and suggesting that the factor is a significant source of variation.
The “variability” mentioned here is formally quantified using sums of squares (SS). How these sums of squares are calculated, especially in a model with multiple factors like our bioequivalence model, is a critical detail that leads to different types of ANOVA tests.
3 Type I, II, and III Sums of Squares
The ANOVA style p-values for factors are calculated from sums of squares (SS), which measure how much variability in the data is explained by each factor. We do not enter the details of sums of squares here. You can see an introductory example of sums of squares for simple one way ANOVA in Nazarathy and Klok (2021).
It turns out that when there are multiple factors then there are multiple ways of calculating such sums of squares, each leading to different types of tests. These tests have been called type I tests, type II tests, and type III tests over the years (alternatively type I, type II, or type III sums of squares). Do not confuse these terms with “type I” and “type II” errors of hypothesis tests, covered heavily in earlier units of this course.
For those wanting to dive into this topic consider the references, Speed, Hocking, and Hackney (1978), Herr (1986), and Langsrud (2003). For using such sums of squares in the mixed models context, see Kuznetsova, Brockhoff, and Christensen (2017).
In a nutshell, the difference lies in the order and manner in which factors are added to the model.
- Type I (Sequential) SS: The effect of each factor is evaluated sequentially. The SS for the first factor is calculated. Then, the SS for the second factor is calculated after accounting for the first. The SS for the third is calculated after accounting for the first two, and so on. The results depend on the order of the factors in the model formula. This is rarely appropriate for experimental designs where factors are not ordered in a hierarchy of importance.
- Type II (Hierarchical) SS: The SS for a given factor is calculated after accounting for all other factors, except those that contain the given factor (e.g., interactions). For a main effect, its SS is calculated accounting for all other main effects. This method is independent of order for main effects but is primarily used for models without interaction terms.
- Type III (Marginal) SS: The SS for each factor is calculated as if it were the last one to enter the model. It measures the unique contribution of each factor after all other factors have already been included. The results are independent of the order of factors. This is the most appropriate method for BE studies because BE studies can become unbalanced due to subject dropouts. Type III SS is the standard and most reliable method for unbalanced designs.
For these reasons, regulatory agencies and statistical best practices mandate the use of Type III tests for analyzing crossover bioequivalence studies. The ANOVA section in the pumas_be output exclusively reports Type III p-values.
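One way to see the “entered last” idea in code is that a factor’s Type III SS equals the increase in the residual sum of squares when that factor alone is dropped from the otherwise full model. Below is a minimal sketch using plain least squares on a made-up unbalanced design (note that for a main-effects-only model like this, Type II and Type III coincide):

```julia
# A minimal sketch (hypothetical data): Type III SS for a factor as the
# increase in residual SS when that factor is dropped from the full model.
using LinearAlgebra

# residual sum of squares of the least-squares fit of y on X
rss(X, y) = sum(abs2, y - X * (X \ y))

# unbalanced toy design with two two-level factors A and B (0/1 coded)
a = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
y = [1.0, 2.1, 0.9, 3.2, 2.0, 2.2, 3.1]

c = ones(length(y))                          # intercept column
X_full = hcat(c, a, b)                       # intercept + A + B
ss_A = rss(hcat(c, b), y) - rss(X_full, y)   # A assessed "as if entered last"
ss_B = rss(hcat(c, a), y) - rss(X_full, y)   # B assessed "as if entered last"
# Neither quantity depends on whether A or B was written first in the
# model, unlike Type I (sequential) sums of squares.
```

Because each factor is assessed against the model containing all the others, swapping the order of A and B changes nothing, which is exactly why Type III is safe for unbalanced BE data.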
A Note on the Debate Around Type III Tests
While Type III ANOVA is the regulatory and industry standard for bioequivalence, it’s worth knowing that it is a subject of debate in the wider statistical community. The primary criticism centers on how Type III tests handle models with interaction terms. You may also see an interesting critique in Venables (1998).
The hypothesis tested by a Type III test for a main effect is about the marginal means, averaged across the levels of other factors. Critics argue that if there is a significant interaction, the main effect is not constant and its “average” effect might be meaningless or misleading.
In summary, while it is important to be aware of the statistical debate, the use of Type III tests for the standard additive bioequivalence model is well-justified by convention, regulatory expectation, and its robustness.
4 Interpreting the ANOVA Results
The primary conclusion of a BE study rests on the 90% confidence interval of the GMR, not on the p-values in the ANOVA section. Instead, the ANOVA p-values serve as diagnostic checks on the study’s assumptions and conduct.
Specifically, we recommend considering the Sequence and Period p-values. Let us consider this example:
data = dataset(joinpath("bioequivalence", "RT_TR", "PJ2017_3_1.csv"))
result = pumas_be(data)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 17 | 17 |
TR | 15 | 15 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: AUC | ||||
Formulations: Reference(R), Test(T) | ||||
Results(AUC) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 1879 | ||
Geometric Naive Mean | 1887 | |||
T | Geometric Marginal Mean | 1848 | ||
Geometric Naive Mean | 1845 | |||
Geometric Mean T/R Ratio (%) | 98.36 | |||
Degrees of Freedom | 30 | |||
90% Confidence Interval (%) | [94.07, 102.8] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 10.52 | 0.1049 | ||
ANOVA | Formulation (p-value) | 0.5334 | ||
Sequence (p-value) | 0.1818 | |||
Period (p-value) | 0.00235 | |||
Sequence p-value
In this case the Sequence p-value is 0.1818, hence not significant.
What it tests: The null hypothesis is that there is no difference in the average response between the two sequence groups (TR vs. RT).
Interpretation:
- A high p-value (e.g., > 0.05) is the desired outcome. It indicates that the two groups of subjects, randomized to the two sequences, behaved similarly on average across the whole study. This supports the assumption that randomization was successful and that there is no significant carryover effect (where the drug from Period 1 affects the outcome in Period 2).
- A low p-value (e.g., < 0.05) is a red flag. It suggests a systematic difference between the sequence groups. This could be due to a failed randomization or, more commonly, a significant carryover effect. A significant carryover effect can compromise the validity of the crossover design, and the results should be interpreted with extreme caution. Further investigation may be required.
Period p-value
In this case the Period p-value is 0.00235, hence it is significant.
What it tests: The null hypothesis is that there is no difference in the average response between Period 1 and Period 2.
Interpretation:
- A high p-value (e.g., > 0.05) is generally preferred, as it suggests the conditions of the study were stable across time.
- A low p-value (e.g., < 0.05), as we see in this example, indicates a statistically significant period effect. This means that, on average, subject responses were different in Period 2 compared to Period 1, regardless of which formulation they received. This could be due to physiological changes in the subjects, changes in laboratory assay procedures over time, or subjects becoming accustomed to the study procedures.
- Does a significant period effect invalidate the study? No. The crossover design and the linear model are specifically constructed to account for and adjust for period effects. The GMR estimate and its CI are still valid. However, a significant period effect is an important finding that should be noted and, if possible, explained in the study report. It is a key piece of information about the study’s conduct.
5 The Formulation p-value
In the above example the Formulation p-value is 0.5334.
Of all the p-values in the ANOVA section, the one for Formulation is the most likely to be misinterpreted. While it relates to the difference between Test and Reference, it is fundamentally not the test for bioequivalence.
What the F-test for Formulation Actually Tests
The Type III F-test for the Formulation factor evaluates whether there is a statistically significant difference in the mean response between the Test and Reference formulations. The null hypothesis (\(H_0\)) for this specific test is one of exact equality, or no difference:
\[
H_0: \mu_T = \mu_R \quad (\text{or equivalently, on the log-scale, } \log(\mu_T) - \log(\mu_R) = 0)
\]
The alternative hypothesis (\(H_1\)) is simply that a difference of any magnitude exists:
\[
H_1: \mu_T \neq \mu_R \quad (\text{or on the log-scale, } \log(\mu_T) - \log(\mu_R) \neq 0)
\]
In our example, the p-value is 0.5334. This non-low value means we fail to reject the null hypothesis. The standard statistical interpretation is: “There is no statistically significant evidence from this data to conclude that a difference exists between the average effects of the Test and Reference formulations.”
Why This is Not the Test for Bioequivalence
It is a common and critical mistake to interpret a non-significant formulation p-value (e.g., p > 0.05) as proof of bioequivalence. The F-test and the bioequivalence assessment answer fundamentally different questions.
The F-test has the Wrong Null Hypothesis.
The F-test is a test of equality. Bioequivalence is a test of similarity within a margin.
The F-test aims to disprove equality. You start by assuming the drugs are identical (\(H_0: \mu_T = \mu_R\)) and look for evidence that they are different. A high p-value simply means you failed to find that evidence. This could be because the drugs are indeed very similar, or it could be because your study had high variability or too few subjects (low statistical power) to detect a real, and potentially large, difference.
Bioequivalence aims to disprove a lack of equivalence. For BE, we must reject the “null hypothesis of inequivalence.” This hypothesis states that the drug ratio is outside the acceptance limits. The TOST (Two One-Sided Tests) procedure formalizes this by testing two null hypotheses simultaneously (see Unit 3): \[ H_{0-}: \frac{\mu_T}{\mu_R} \le 0.80 \quad (\text{The ratio is too low}) \] \[ H_{0+}: \frac{\mu_T}{\mu_R} \ge 1.25 \quad (\text{The ratio is too high}) \] We must reject both of these null hypotheses to conclude bioequivalence.
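To contrast the two questions numerically, here is a minimal sketch of the 90% CI / TOST decision on the log scale. The log GMR is taken from the example above, while the standard error is back-calculated (approximately) from the reported CI, and the t critical value for 30 degrees of freedom is hard-coded rather than computed from a distribution:

```julia
# A minimal sketch: the bioequivalence decision asks whether the whole
# 90% CI lies inside [log(0.80), log(1.25)], not whether mu_T = mu_R
# can be rejected.
delta  = log(0.9836)   # estimated log GMR (from the example above)
se     = 0.0262        # approximate SE, back-calculated from the reported CI
t_crit = 1.697         # t quantile for a 90% CI with 30 df (hard-coded)

lo = delta - t_crit * se
hi = delta + t_crit * se

# TOST is equivalent to this interval-inclusion check
be_pass = (lo > log(0.80)) && (hi < log(1.25))
# exponentiating maps the interval back to the ratio scale, ≈ [0.941, 1.028]
```

With a wide interval this check can fail even when the Formulation p-value is large, and with a tiny but precisely estimated difference it can pass even when that p-value is small.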
Why present the Formulation p-value?
The Formulation p-value is presented to help verify that pumas_be output agrees with legacy scripts and with the output of other software packages. Since such p-values have appeared historically, pumas_be presents this value for such comparative purposes.
6 Conclusion
This unit clarified the role and meaning of the ANOVA p-values that appear in the pumas_be output. We learned that these are Type III p-values, which are the industry standard because they provide an order-independent assessment of each factor’s contribution, making them suitable for potentially unbalanced designs.
The key takeaway is the practical interpretation of these values in a bioequivalence context:
- The Formulation p-value is informational but is not the test for BE.
- The Sequence p-value is a critical diagnostic for carryover effects; a low value is a warning sign.
- The Period p-value checks for systematic changes over time; a low value is a notable finding that the model adjusts for, but it does not invalidate the BE conclusion.
The primary evidence for bioequivalence always comes from the 90% confidence interval for the geometric mean ratio. The ANOVA tests provide essential supporting evidence and quality control checks on the assumptions of the crossover design.
7 Unit exercises
What is the main difference between the p-values reported for individual coefficients in a linear model and the ANOVA (Type III) p-values for factors such as Sequence or Period?
Why is the Sequence p-value in the ANOVA section important to check in a bioequivalence study? What might a low Sequence p-value indicate?
Why is the Formulation p-value in the ANOVA section NOT used to establish bioequivalence, and what statistical method should be used instead?