using Bioequivalence # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using SummaryTables
using GLM

A Course on Bioequivalence: Unit 14 - On Type III p-values
This is Unit 14 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In Unit 5, we introduced the linear model as the engine for analyzing the \(2 \times 2\) crossover design. We saw that fitting this model yields coefficients for each factor, and from these, we derive the geometric mean ratio (GMR) and its confidence interval. However, the pumas_be output also includes an ANOVA section with p-values for Formulation, Sequence, and Period. These are not the same as the p-values for the individual coefficients we saw in the model’s coefficient table.
In this unit, we will explore these ANOVA p-values. Specifically, we cover:
- The difference between p-values for individual coefficients and p-values for entire factors.
- A brief overview of the different “types” of ANOVA p-values (tests): Type I, II, and III.
- Why Type III tests are the standard for bioequivalence analysis.
- How to interpret the Type III p-values for Formulation, Sequence, and Period in the context of a BE study. In particular, we claim that the Formulation p-value is only useful for comparing across software packages and is not practical for bioequivalence studies. In contrast, the Sequence and Period p-values may give some indication of the presence of such effects.
We use the following packages, loaded at the top of this unit: Bioequivalence, PharmaDatasets, DataFrames, SummaryTables, and GLM. We have used all of these packages in previous tutorials.
2 P-values in a Linear Model
When we fit a linear model, there are two common categories of p-values we might encounter, and it is crucial to understand the distinction between them: one category is p-values for individual coefficients, and the other is p-values for entire factors (ANOVA).
P-values for Individual Coefficients
When we fit a linear model and look at its coefficient table (coeftable), we get a p-value for each estimated coefficient. Let’s revisit this with a standard \(2 \times 2\) crossover dataset.
As we saw in Unit 5, we can inspect the underlying linear model and its coefficients. The p-values in the Pr(>|t|) column each test a very specific hypothesis: for each row, the null hypothesis \(H_0\) is that the coefficient equals zero, and the alternative \(H_1\) is that it differs from zero. Such p-values therefore check whether an individual coefficient is meaningful on its own.
Let us return to this example dataset from Park, Kim, and Bae (2020):
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)
simple_table(data_2020_PKB, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
3 | TR | 1 | 105.2 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
6 | RT | 1 | 105.3 |
Running pumas_be on this data (as we did in Unit 5), we obtain:
result = pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
Let us also see the coefficient table of the linear model:
simple_table(result.model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.87 | 0.128 | 38.1 | 0.000688 | 4.32 | 5.42 |
sequence: TR | 0.286 | 0.156 | 1.83 | 0.209 | -0.387 | 0.959 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.47 | -0.811 | 0.535 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.468 | -0.534 | 0.812 |
sequence: RT & id_within_sequence: 2 | 0.704 | 0.143 | 4.93 | 0.0388 | 0.0894 | 1.32 |
sequence: TR & id_within_sequence: 2 | -0.297 | 0.143 | -2.08 | 0.173 | -0.911 | 0.318 |
sequence: RT & id_within_sequence: 3 | -0.209 | 0.181 | -1.16 | 0.367 | -0.986 | 0.568 |
sequence: TR & id_within_sequence: 3 | -0.358 | 0.181 | -1.98 | 0.186 | -1.13 | 0.419 |
You can see that each of the \(8\) coefficients has a p-value in the Pr(>|t|) column, most of which happen to be insignificant in this specific example.
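To see concretely what each row’s p-value is built from, here is a minimal sketch using the formulation: T row of the table above. The variable names are illustrative only, and turning t into Pr(>|t|) would additionally require the residual degrees of freedom (2 in this small example):

```julia
# A minimal sketch: each per-coefficient p-value is a two-sided t-test of
# H0: coefficient = 0, built from t = estimate / standard error.
coef_estimate = -0.138   # "formulation: T" estimate, copied from the table above
std_error     = 0.156    # its standard error, copied from the table above
t_statistic   = coef_estimate / std_error   # ≈ -0.884, matching the t column
# Comparing |t_statistic| against a t distribution with the residual
# degrees of freedom yields the Pr(>|t|) entry (here 0.47).
```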
The key takeaway is that these p-values are about individual parameters, often comparing one level of a factor to a baseline level. They do not answer a broader question such as: “Overall, is the sequence factor a significant source of variation in the model?”
For example, let us consider a dataset with 6 sequences and see the first 15 rows of the coefficient table:
be_6_sequence_data = dataset("bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5");
result_6 = pumas_be(be_6_sequence_data)
simple_table(first(result_6.model |> coeftable, 15), halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 8.18 | 0.0563 | 145 | 3.02e-131 | 8.07 | 8.29 |
sequence: RTS | 0.0481 | 0.0585 | 0.823 | 0.412 | -0.0678 | 0.164 |
sequence: SRT | 0.172 | 0.0585 | 2.93 | 0.00407 | 0.0556 | 0.287 |
sequence: STR | -0.1 | 0.0596 | -1.68 | 0.0948 | -0.219 | 0.0177 |
sequence: TRS | 0.187 | 0.061 | 3.06 | 0.00273 | 0.0661 | 0.308 |
sequence: TSR | 0.0529 | 0.0596 | 0.888 | 0.377 | -0.0652 | 0.171 |
formulation: S | 0.339 | 0.0386 | 8.79 | 1.8e-14 | 0.263 | 0.416 |
formulation: T | 0.15 | 0.0386 | 3.9 | 0.000163 | 0.0741 | 0.227 |
period: 2 | 0.053 | 0.0384 | 1.38 | 0.169 | -0.0229 | 0.129 |
period: 3 | -0.0294 | 0.0388 | -0.756 | 0.451 | -0.106 | 0.0476 |
sequence: RST & id_within_sequence: 2 | -0.405 | 0.116 | -3.48 | 0.000702 | -0.635 | -0.175 |
sequence: RTS & id_within_sequence: 2 | 0.0825 | 0.116 | 0.714 | 0.477 | -0.146 | 0.311 |
sequence: SRT & id_within_sequence: 2 | -0.521 | 0.116 | -4.51 | 1.59e-5 | -0.75 | -0.292 |
sequence: STR & id_within_sequence: 2 | -0.125 | 0.115 | -1.09 | 0.279 | -0.353 | 0.103 |
sequence: TRS & id_within_sequence: 2 | 0.378 | 0.117 | 3.24 | 0.00159 | 0.147 | 0.61 |
Now there are 5 coefficients (rows 2 to 6 of the table) associated with sequence, each with its own p-value. So if we seek a single p-value answering whether sequence is significant overall, which one should we use?
To answer that, we need ANOVA.
P-values for Factors (ANOVA)
Analysis of Variance (ANOVA) tests a different kind of hypothesis. Instead of testing individual coefficients, it tests whether an entire factor (which may be represented by one or more coefficients) has a significant effect on the outcome. For example, the ANOVA p-value for sequence tests the null hypothesis:
\(H_0:\) There is no difference in the mean response between any of the levels of sequence.
With such a hypothesis, a “small” p-value provides evidence against the null, and thus evidence that “there is a sequence effect”. In contrast, if the p-value is not small, there is no evidence of such a sequence effect.
Such ANOVA-style p-values for factors are not part of the model field of the pumas_be result; rather, they are computed by pumas_be specifically for the bioequivalence context.
The F-test
The statistical procedure used to test the significance of an entire factor in the ANOVA framework is the F-test. The logic of the F-test is to compare the variability in the data that is explained by a factor against the variability that is unexplained (i.e., the random error or residuals).
The test computes a value known as the F-statistic, which can loosely be understood as a ratio:
\[ F = \frac{\text{Variability between the factor's groups}}{\text{Variability within the groups}}. \]
Let’s think about this in the context of the Sequence factor (again assuming the sequences are TR and RT):
- The variability between groups measures how different the average response is for the TR group versus the RT group.
- The variability within groups measures the random variation among subjects who are in the same sequence group. This is considered the baseline or “error” variability.
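To make the ratio concrete, here is a minimal sketch (with made-up numbers, not data from this course) computing a one-way ANOVA F-statistic directly from between-group and within-group sums of squares:

```julia
# A minimal sketch (hypothetical numbers): the one-way ANOVA F-statistic
# for a two-level factor, computed directly from sums of squares.
using Statistics

tr = [10.0, 12.0, 14.0]   # made-up responses for the TR group
rt = [11.0, 13.0, 15.0]   # made-up responses for the RT group
groups = [tr, rt]

grand_mean = mean(vcat(groups...))

# between-group sum of squares and its degrees of freedom
ss_between = sum(length(g) * (mean(g) - grand_mean)^2 for g in groups)
df_between = length(groups) - 1

# within-group (error) sum of squares and its degrees of freedom
ss_within = sum(sum((x - mean(g))^2 for x in g) for g in groups)
df_within = sum(length, groups) - length(groups)

# F compares explained variability per df to error variability per df
F = (ss_between / df_between) / (ss_within / df_within)   # 0.375 here
```

Here the between-group difference is small relative to the within-group spread, so F is well below 1 and would not be significant.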
Interestingly, the F-test is based on the F distribution, which also appeared in the previous unit in a completely different application context.
If the Sequence factor has no real effect (the null hypothesis is true), then the variability between the TR and RT groups should be about the same as the random variability we see within each group. In this case, the F-statistic will be close to 1.
However, if the Sequence factor does have a significant effect, the variability between the groups will be much larger than the random variability within them. This results in an F-statistic significantly greater than 1.
The p-value is then calculated based on this F-statistic. A large F-statistic leads to a small p-value, providing evidence against the null hypothesis and suggesting that the factor is a significant source of variation.
The “variability” mentioned here is formally quantified using sums of squares (SS). How these sums of squares are calculated, especially in a model with multiple factors like our bioequivalence model, is a critical detail that leads to different types of ANOVA tests.
3 Type I, II, and III Sums of Squares
The ANOVA style p-values for factors are calculated from sums of squares (SS), which measure how much variability in the data is explained by each factor. We do not enter the details of sums of squares here. You can see an introductory example of sums of squares for simple one way ANOVA in Nazarathy and Klok (2021).
It turns out that when there are multiple factors then there are multiple ways of calculating such sums of squares, each leading to different types of tests. These tests have been called type I tests, type II tests, and type III tests over the years (alternatively type I, type II, or type III sums of squares). Do not confuse these terms with “type I” and “type II” errors of hypothesis tests, covered heavily in earlier units of this course.
For those wanting to dive into this topic consider the references, Speed, Hocking, and Hackney (1978), Herr (1986), and Langsrud (2003). For using such sums of squares in the mixed models context, see Kuznetsova, Brockhoff, and Christensen (2017).
In a nutshell, the difference lies in the order and manner in which factors are added to the model.
- Type I (Sequential) SS: The effect of each factor is evaluated sequentially. The SS for the first factor is calculated. Then, the SS for the second factor is calculated after accounting for the first. The SS for the third is calculated after accounting for the first two, and so on. The results depend on the order of the factors in the model formula. This is rarely appropriate for experimental designs where factors are not ordered in a hierarchy of importance.
- Type II (Hierarchical) SS: The SS for a given factor is calculated after accounting for all other factors, except those that contain the given factor (e.g., interactions). For a main effect, its SS is calculated accounting for all other main effects. This method is independent of order for main effects but is primarily used for models without interaction terms.
- Type III (Marginal) SS: The SS for each factor is calculated as if it were the last one to enter the model. It measures the unique contribution of each factor after all other factors have already been included. The results are independent of the order of factors. This is the most appropriate method for BE studies because BE studies can become unbalanced due to subject dropouts. Type III SS is the standard and most reliable method for unbalanced designs.
For these reasons, regulatory agencies and statistical best practices mandate the use of Type III tests for analyzing crossover bioequivalence studies. The ANOVA section in the pumas_be output exclusively reports Type III p-values.
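One way to see the “entered last” idea in code is that a factor’s Type III SS equals the increase in the residual sum of squares when that factor alone is dropped from the otherwise full model. Below is a minimal sketch using plain least squares on a made-up unbalanced design (note that for a main-effects-only model like this, Type II and Type III coincide):

```julia
# A minimal sketch (hypothetical data): Type III SS for a factor as the
# increase in residual SS when that factor is dropped from the full model.
using LinearAlgebra

# residual sum of squares of the least-squares fit of y on X
rss(X, y) = sum(abs2, y - X * (X \ y))

# unbalanced toy design with two two-level factors A and B (0/1 coded)
a = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
b = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
y = [1.0, 2.1, 0.9, 3.2, 2.0, 2.2, 3.1]

c = ones(length(y))                          # intercept column
X_full = hcat(c, a, b)                       # intercept + A + B
ss_A = rss(hcat(c, b), y) - rss(X_full, y)   # A assessed "as if entered last"
ss_B = rss(hcat(c, a), y) - rss(X_full, y)   # B assessed "as if entered last"
# Neither quantity depends on whether A or B was written first in the
# model, unlike Type I (sequential) sums of squares.
```

Because each factor is assessed against the model containing all the others, swapping the order of A and B changes nothing, which is exactly why Type III is safe for unbalanced BE data.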
A Note on the Debate Around Type III Tests
While Type III ANOVA is the regulatory and industry standard for bioequivalence, it’s worth knowing that it is a subject of debate in the wider statistical community. The primary criticism centers on how Type III tests handle models with interaction terms. You may also see an interesting critique in Venables (1998).
The hypothesis tested by a Type III test for a main effect is about the marginal means, averaged across the levels of other factors. Critics argue that if there is a significant interaction, the main effect is not constant and its “average” effect might be meaningless or misleading.
In summary, while it is important to be aware of the statistical debate, the use of Type III tests for the standard additive bioequivalence model is well-justified by convention, regulatory expectation, and its robustness.
4 Interpreting the ANOVA Results
The primary conclusion of a BE study rests on the 90% confidence interval of the GMR, not on the p-values in the ANOVA section. Instead, the ANOVA p-values serve as diagnostic checks on the study’s assumptions and conduct.
Specifically, we recommend considering the Sequence and Period p-values. Let us consider this example:
data = dataset(joinpath("bioequivalence", "RT_TR", "PJ2017_3_1.csv"))
result = pumas_be(data)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 17 | 17 |
TR | 15 | 15 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: AUC | ||||
Formulations: Reference(R), Test(T) | ||||
Results(AUC) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 1879 | ||
Geometric Naive Mean | 1887 | |||
T | Geometric Marginal Mean | 1848 | ||
Geometric Naive Mean | 1845 | |||
Geometric Mean T/R Ratio (%) | 98.36 | |||
Degrees of Freedom | 30 | |||
90% Confidence Interval (%) | [94.07, 102.8] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 10.52 | 0.1049 | ||
ANOVA | Formulation (p-value) | 0.5334 | ||
Sequence (p-value) | 0.1818 | |||
Period (p-value) | 0.00235 | |||
Sequence p-value
In this case the Sequence p-value is 0.1818, hence not significant.
What it tests: The null hypothesis is that there is no difference in the average response between the two sequence groups (TR vs. RT).
Interpretation:
- A high p-value (e.g., > 0.05) is the desired outcome. It indicates that the two groups of subjects, randomized to the two sequences, behaved similarly on average across the whole study. This supports the assumption that randomization was successful and that there is no significant carryover effect (where the drug from Period 1 affects the outcome in Period 2).
- A low p-value (e.g., < 0.05) is a red flag. It suggests a systematic difference between the sequence groups. This could be due to a failed randomization or, more commonly, a significant carryover effect. A significant carryover effect can compromise the validity of the crossover design, and the results should be interpreted with extreme caution. Further investigation may be required.
Period p-value
In this case the Period p-value is 0.00235, hence it is significant.
What it tests: The null hypothesis is that there is no difference in the average response between Period 1 and Period 2.
Interpretation:
- A high p-value (e.g., > 0.05) is generally preferred, as it suggests the conditions of the study were stable across time.
- A low p-value (e.g., < 0.05), as we see in this example, indicates a statistically significant period effect. This means that, on average, subject responses were different in Period 2 compared to Period 1, regardless of which formulation they received. This could be due to physiological changes in the subjects, changes in laboratory assay procedures over time, or subjects becoming accustomed to the study procedures.
- Does a significant period effect invalidate the study? No. The crossover design and the linear model are specifically constructed to account for and adjust for period effects. The GMR estimate and its CI are still valid. However, a significant period effect is an important finding that should be noted and, if possible, explained in the study report. It is a key piece of information about the study’s conduct.
5 The Formulation p-value
In the above example the Formulation p-value is 0.5334.
Of all the p-values in the ANOVA section, the one for Formulation is the most likely to be misinterpreted. While it relates to the difference between Test and Reference, it is fundamentally not the test for bioequivalence.
What the F-test for Formulation Actually Tests
The Type III F-test for the Formulation factor evaluates whether there is a statistically significant difference in the mean response between the Test and Reference formulations. The null hypothesis (\(H_0\)) for this specific test is one of exact equality, or no difference:
\[
H_0: \mu_T = \mu_R \quad (\text{or equivalently, on the log-scale, } \log(\mu_T) - \log(\mu_R) = 0)
\]
The alternative hypothesis (\(H_1\)) is simply that a difference of any magnitude exists:
\[
H_1: \mu_T \neq \mu_R \quad (\text{or on the log-scale, } \log(\mu_T) - \log(\mu_R) \neq 0)
\]
In our example, the p-value is 0.5334. This non-low value means we fail to reject the null hypothesis. The standard statistical interpretation is: “There is no statistically significant evidence from this data to conclude that a difference exists between the average effects of the Test and Reference formulations.”
Why This is Not the Test for Bioequivalence
It is a common and critical mistake to interpret a non-significant formulation p-value (e.g., p > 0.05) as proof of bioequivalence. The F-test and the bioequivalence assessment answer fundamentally different questions.
The F-test has the Wrong Null Hypothesis.
The F-test is a test of equality. Bioequivalence is a test of similarity within a margin.
The F-test aims to disprove equality. You start by assuming the drugs are identical (\(H_0: \mu_T = \mu_R\)) and look for evidence that they are different. A high p-value simply means you failed to find that evidence. This could be because the drugs are indeed very similar, or it could be because your study had high variability or too few subjects (low statistical power) to detect a real, and potentially large, difference.
Bioequivalence aims to disprove a lack of equivalence. For BE, we must reject the “null hypothesis of inequivalence.” This hypothesis states that the drug ratio is outside the acceptance limits. The TOST (Two One-Sided Tests) procedure formalizes this by testing two null hypotheses simultaneously (see Unit 3): \[ H_{0-}: \frac{\mu_T}{\mu_R} \le 0.80 \quad (\text{The ratio is too low}) \] \[ H_{0+}: \frac{\mu_T}{\mu_R} \ge 1.25 \quad (\text{The ratio is too high}) \] We must reject both of these null hypotheses to conclude bioequivalence.
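To contrast the two questions numerically, here is a minimal sketch of the 90% CI / TOST decision on the log scale. The log GMR is taken from the example above, while the standard error is back-calculated (approximately) from the reported CI, and the t critical value for 30 degrees of freedom is hard-coded rather than computed from a distribution:

```julia
# A minimal sketch: the bioequivalence decision asks whether the whole
# 90% CI lies inside [log(0.80), log(1.25)], not whether mu_T = mu_R
# can be rejected.
delta  = log(0.9836)   # estimated log GMR (from the example above)
se     = 0.0262        # approximate SE, back-calculated from the reported CI
t_crit = 1.697         # t quantile for a 90% CI with 30 df (hard-coded)

lo = delta - t_crit * se
hi = delta + t_crit * se

# TOST is equivalent to this interval-inclusion check
be_pass = (lo > log(0.80)) && (hi < log(1.25))
# exponentiating maps the interval back to the ratio scale, ≈ [0.941, 1.028]
```

With a wide interval this check can fail even when the Formulation p-value is large, and with a tiny but precisely estimated difference it can pass even when that p-value is small.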
Why present the Formulation p-value?
The Formulation p-value is presented to help verify that pumas_be output agrees with legacy scripts and with the output of other software packages. Since such p-values have appeared historically, pumas_be presents this value for such comparative purposes.
6 Conclusion
This unit clarified the role and meaning of the ANOVA p-values that appear in the pumas_be output. We learned that these are Type III p-values, which are the industry standard because they provide an order-independent assessment of each factor’s contribution, making them suitable for potentially unbalanced designs.
The key takeaway is the practical interpretation of these values in a bioequivalence context:
- The Formulation p-value is informational but is not the test for BE.
- The Sequence p-value is a critical diagnostic for carryover effects; a low value is a warning sign.
- The Period p-value checks for systematic changes over time; a low value is a notable finding that the model adjusts for, but it does not invalidate the BE conclusion.
The primary evidence for bioequivalence always comes from the 90% confidence interval for the geometric mean ratio. The ANOVA tests provide essential supporting evidence and quality control checks on the assumptions of the crossover design.
7 Unit exercises
What is the main difference between the p-values reported for individual coefficients in a linear model and the ANOVA (Type III) p-values for factors such as Sequence or Period?
Why is the Sequence p-value in the ANOVA section important to check in a bioequivalence study? What might a low Sequence p-value indicate?
Why is the Formulation p-value in the ANOVA section NOT used to establish bioequivalence, and what statistical method should be used instead?