using Pumas # available with Pumas products - also reexports Distributions
using NCA # available with Pumas products
using Bioequivalence # available with Pumas products
using BioequivalencePower # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFramesMeta
using SummaryTables
using DataFrames
using AlgebraOfGraphics
using CairoMakie
using Random

A Course on Bioequivalence: Unit 3 - The Two One Sided T-Test on a Parallel Design
This is Unit 3 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In the previous unit we considered log transformations, the log-normal distribution, geometric means, the geometric mean ratio, and the CV. In this unit we finally explore the statistical inference concepts of a hypothesis test and a confidence interval, as used for bioequivalence. In particular we consider:
- The basic statistical hypothesis test.
- The notions of consumer’s risk and producer’s risk in drug development and bioequivalence.
- The two one sided T-test (TOST).
- The relationship between a two one sided T-test (TOST) and a confidence interval based criteria.
- Analysis of these concepts in the setup of a basic parallel design.
In this unit we use the following packages, of which the first 5 are Pumas packages (available only with a Pumas installation).
We have used most of these packages in the previous units. New to this unit is Pumas’ BioequivalencePower
package which provides power analysis for planning of bioequivalence trials.
2 Getting the hypothesis formulation right
Let us understand the basics of hypothesis tests and confidence intervals in the context of bioequivalence.
As a very simplified situation, let us assume we are comparing R and T under very simplified (and unrealistic at this point) assumptions:
- Only a single observation of R and a single observation of T. Denote these observations as \(X_R\) and \(X_T\) respectively. While it is unrealistic to assume a single observation per formulation, we do so for simplicity of this first example.
- The observations are already log-transformed so they are normally distributed. This is not an unrealistic assumption. In fact it is the common case in bioequivalence analysis.
- The (log transformed) observations have a known standard deviation \(\sigma = 0.1\). It is unrealistic to assume that \(\sigma\) (or the CV) is known; we do so for simplicity of this first example.
- The unknown quantities are \(\mu_R\) and \(\mu_T\), which are the means of the (log-transformed) reference and test respectively.
Under these assumptions the difference \(X_T - X_R\) is distributed according to a normal distribution with mean \(\mu^* = \mu_T - \mu_R\) and standard deviation \(\sigma^* = \sqrt{2} \sigma\). We call this difference the point estimate for the mean difference and denote it \(\hat{\mu}^*\).
Note that in a more realistic setting, discussed below, there are multiple observations and \(\sigma\) is not known but is rather estimated.
Classic Hypothesis Formulations
When we carry out hypothesis tests we break up the possibilities in the world into \(H_0\) (the null hypothesis) and \(H_1\) (the alternative hypothesis).
Typically we consider \(H_0\) as the known or default situation. From a scientific perspective, we believe and hope that \(H_1\) is actually the case, and our goal is to properly demonstrate that this alternative hypothesis holds.
Since a decision must be made, we are faced with the following possibilities, where each row is dictated by the reality in the world and each column is based on what we decide. Note that \(H_0\) is sometimes called “negative” while \(H_1\) is sometimes called “positive”.
 | Decision: Retain \(H_0\) | Decision: Reject \(H_0\) |
---|---|---|
Reality: \(H_0\) | ✓ (True Negative) | Type I Error (False Positive) (erroneous rejection of \(H_0\)) |
Reality: \(H_1\) | Type II Error (False Negative) (erroneously failing to reject \(H_0\)) | ✓ (True Positive) |
Now since our decision is based on sampled data, and that data is random, we try to control and understand the probabilities of the two possible errors: type-I errors (false positives) and type-II errors (false negatives). In practice we are often able to limit the probability of a type-I error to be no more than \(\alpha\), with a typical value being \(\alpha = 0.05\) (or \(5\%\)). As for type-II errors, we consider their complement, the probability of a true positive. This probability is called the power of the test.
As a starting point, let us consider classical hypothesis tests (in contrast to two one sided t-tests covered below). This is not the type of hypothesis test we use for bioequivalence, yet it is good to start the discussion with these tests. Let us see why.
The formulation of the hypothesis can be as follows: \[ {\cal T}_{\text{classic}} = \begin{cases} H_{0}:& \mu_T - \mu_R = 0, \\[5pt] H_{1}:& \mu_T - \mu_R \neq 0.\\ \end{cases} \]
Here the null-hypothesis \(H_0\) indicates that \(\mu_T = \mu_R\) (bioequivalence holds) and the alternate hypothesis \(H_1\) indicates that \(\mu_T\) and \(\mu_R\) are different (the formulations are not equivalent). In reality, only one of the two cases holds, either \(H_0\) or \(H_1\). Based on the data we have to make a decision. So we observe the difference \(\hat{\mu}^* = X_T - X_R\) and attempt to make an inferential conclusion about \(\mu^*\).
The key to hypothesis testing is to have a test statistic with a known distribution under \(H_0\). With our very simplified assumptions of a single observation and a known standard deviation, we can take our sampled data and compute a test statistic, \[ Z = \frac{\hat{\mu}^*}{\sigma^*} = \frac{\hat{\mu}^*}{\sqrt{2} \sigma}. \] Under the \(H_0\) assumption the distribution of \(Z\) is standard normal. We then use the decision rule of rejecting \(H_0\) if \(Z < z_{0.025}\) or \(Z > z_{0.975}\), where \(z_{0.025}\) and \(z_{0.975}\) are the respective quantiles of the standard normal distribution. This rule ensures a type-I error of exactly \(5\%\). That is, under \(H_0\), there is exactly a probability of \(0.05\) that \(Z\) falls outside the interval \([z_{0.025}, z_{0.975}]\).
For the purpose of understanding, here is a simulated repetition of this hypothesis test, a million times, under the null hypothesis. We see that we (wrongfully) reject \(H_0\) about \(5\%\) of the time.
Random.seed!(0)

function single_trial_reject(; μ_R = 0.0, μ_T = 0.0, σ = 0.1)
    x_R = rand(Normal(μ_R, σ)) # Sample from the reference
    x_T = rand(Normal(μ_T, σ)) # Sample from the test
    Z = (x_T - x_R) / (√2 * σ) # Test statistic
    return abs(Z) > quantile(Normal(), 0.975)
end

print("Simulation estimation of Type-I error: ")
println(mean([single_trial_reject() for _ = 1:10^6]))
Simulation estimation of Type-I error: 0.04992
As for the type-II error, or its complement which is the power of the test, it depends on the actual difference \(\mu_T - \mu_R\). Thus let us repeat this type of simulation for various values of \(\mu_T\), each time keeping \(\mu_R\) at zero. Whenever \(\mu_T \neq \mu_R\) we are in an \(H_1\) setting. Yet we see that our probability of rejecting (the power) depends on where in \(H_1\) the actual parameter lies:
range_μ_T = setdiff(-0.7:0.1:0.7, 0) # several options in H_1 (assuming μ_R = 0.0)
n_μ_T = length(range_μ_T)
probs = DataFrame(μ_T = range_μ_T, power = zeros(n_μ_T))

for i = 1:nrow(probs)
    μ_T = probs.μ_T[i]
    probs.power[i] = mean([single_trial_reject(; μ_T) for _ = 1:10^6])
end

simple_table(probs)
μ_T | power |
-0.7 | 0.999 |
-0.6 | 0.989 |
-0.5 | 0.943 |
-0.4 | 0.807 |
-0.3 | 0.564 |
-0.2 | 0.293 |
-0.1 | 0.109 |
0.1 | 0.109 |
0.2 | 0.293 |
0.3 | 0.564 |
0.4 | 0.807 |
0.5 | 0.942 |
0.6 | 0.989 |
0.7 | 0.999 |
Obviously for cases where \(\mu_T\) and \(\mu_R\) are farther from each other, we have higher power. This is because it is much easier to detect bigger differences in the means.
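These simulated values can also be checked analytically: under the actual parameter, \(Z\) is normal with mean \(\mu_T/(\sqrt{2}\sigma)\) and unit standard deviation, so the power has a closed form. Here is a minimal sketch of such a check (the helper name analytic_power is ours, introduced only for this illustration):

# Analytic power of the classical Z-test (with μ_R = 0, so μ* = μ_T)
function analytic_power(μ_T; σ = 0.1)
    z_975 = quantile(Normal(), 0.975)
    shift = μ_T / (√2 * σ) # mean of Z under the actual parameter
    # P(Z < -z_975) + P(Z > z_975)
    return cdf(Normal(), -z_975 - shift) + ccdf(Normal(), z_975 - shift)
end

analytic_power(0.3) # ≈ 0.564, agreeing with the simulated table above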
Consumer’s Risk and Producer’s Risk
Understanding the basics of hypothesis tests is important. However, hypothesis formulations such as \({\cal T}_{\text{classic}}\) do not work well for the bioequivalence setting. To see why, let us define two competing risks:
- consumer’s risk: This is the probability of determining that a test drug product is bioequivalent to the reference when it is actually not.
- producer’s risk: This is the probability of determining that a test drug product is not bioequivalent to the reference when it actually is.
How would we map the consumer’s risk and producer’s risk to type-I and type-II errors under the classic hypothesis \({\cal T}_{\text{classic}}\)?
- In this hypothesis formulation, \(H_0\) is equivalence. Hence rejecting \(H_0\) wrongfully (a type-I error) is the producer’s risk.
- Similarly, since \(H_1\) is non-equivalence, wrongfully failing to reject \(H_0\) (a type-II error) is the consumer’s risk.
This is a potential problem for consumer safety because with our statistical decision making, we control the type-I error (keeping it at most \(\alpha\)) but do not directly control the type-II error. This means that with a classic hypothesis formulation we ensure a low producer’s risk but have no direct control over the consumer’s risk! Thus using \({\cal T}_{\text{classic}}\) does not prioritize consumer safety.
This lack of suitability of classic hypothesis formulations (like \({\cal T}_{\text{classic}}\)) for bioequivalence, gives rise to the two-one-sided based formulation which we describe now.
Two One Sided Formulations
The general idea in bioequivalence is to define a notion of “closeness” of the reference and test. That is, to demonstrate bioequivalence we want \(\mu_R\) and \(\mu_T\) to be “close”.
The level of closeness is not a statistical issue but a pharmacological one. On the log scale it is the \(20\%\) difference yielding the well-known \([80\%, 125\%]\) bioequivalence limits.
For our purposes let’s say that there are values \(\delta_- < 0 < \delta_+\) where we say that the drugs are bioequivalent if, \[ \delta_- < \mu_T - \mu_R < \delta_+ \]
For example if we set \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\) then it corresponds to the \([80\%,125\%]\) bounds:
round.(log.([0.8, 1.25]), digits = 5) |> println
[-0.22314, 0.22314]
With \(\delta_-\) and \(\delta_+\) established based on pharmacological considerations (and considering the log scale), we now formulate the hypothesis: \[ {\cal T_{\text{TOST}}} = \begin{cases} H_{0}:& \mu_T - \mu_R \le \delta_- \qquad \text{or} \qquad \delta_+ \le \mu_T - \mu_R \\[5pt] H_{1}:& \qquad \qquad \delta_- < \mu_T - \mu_R < \delta_+\\ \end{cases} \]
This means that the null hypothesis \(H_0\) is that the formulations are not bioequivalent, while the alternative hypothesis \(H_1\) implies that the formulations are bioequivalent.
With this, the relationship between the errors (type-I and type-II) and risks (consumer’s and producer’s) is much more suitable.
- The type-I error is the consumer’s risk.
- The type-II error is the producer’s risk.
Now if we are able to carry out a hypothesis test for \({\cal T_{\text{TOST}}}\) where the type-I error is not more than \(\alpha\), we are able to protect consumers. Then, by attempting to keep the power at reasonable levels (typically above \(80\%\)) we are able to protect producers as a secondary objective. The most significant factor ensuring high power (or low type-II error) is the number of subjects in the experiment. On top of the ethical considerations associated with the number of subjects, this is a key cost factor for the producer. Hence with the \({\cal T_{\text{TOST}}}\) formulation, the tradeoff between producer’s risk and trial cost is left primarily in the hands of the producer, while consumers are guaranteed a cap on their risk.
Using a confidence interval for the decision
The common way to implement a bioequivalence decision for \({\cal T_{\text{TOST}}}\) is based on a confidence interval. A confidence interval for the unknown quantity \(\mu_T-\mu_R\) is comprised of a lower bound \(\hat{L}^*\) and an upper bound \(\hat{U}^*\) such that,
\[ {\mathbb P}(\hat{L}^* \le \mu_T - \mu_R \le \hat{U}^*) = 0.9. \]
Note that \(\hat{L}^*\) and \(\hat{U}^*\) both depend on the observed data.
It turns out that for a bioequivalence hypothesis test such as \({\cal T_{\text{TOST}}}\), where we control the type-I error (consumer’s risk) at \(\alpha\), we need a \(1-2\alpha\) confidence interval. So, for example, in the typical case where \(\alpha = 0.05\) we need a \(90\%\) confidence interval.
The use of a confidence interval is a matter of convenience, yet the interpretation in bioequivalence differs from the classic inferential interpretation. Details follow.
With our very basic assumptions above, we can use basic probability analysis to develop such confidence bounds. This is because the quantity,
\[ \hat{\mu}^* = X_T - X_R \]
has a normal distribution with mean \(\mu^* = \mu_T - \mu_R\) and standard deviation \(\sigma^* = \sqrt{2} \sigma\). Here \(\hat{\mu}^*\) is the point estimate for the difference in means.
An important supporting quantity is \(z_{0.95} \approx 1.645\), which is the \(95\%\) quantile of the standard normal distribution. With this it is easy to derive the symmetric confidence interval,
\[ \hat{L}^* = \hat{\mu}^* - \sigma^* z_{0.95}, \qquad \text{and} \qquad \hat{U}^* = \hat{\mu}^* + \sigma^* z_{0.95}. \]
The rule for determining bioequivalence is to check if \(\delta_- \le \hat{L}^*\) and \(\hat{U}^* \le \delta_+\). Or in other words, \([\hat{L}^*, \hat{U}^*] \subset [\delta_-, \delta_+]\). In particular:
- If \(\delta_- \le \hat{L}^*\) and \(\hat{U}^* \le \delta_+\) then reject \(H_0\) and conclude that T is bioequivalent to R.
- If this doesn’t hold then do not reject \(H_0\), and thus the statistical bioequivalence analysis fails in the sense that we cannot claim that T is bioequivalent to R.
We now see why this conclusion implements the hypothesis formulation \({\cal T_{\text{TOST}}}\).
3 A Two One Sided Test
To implement the hypothesis test \({\cal T_{\text{TOST}}}\) we use two hypothesis tests, each of which is a one-sided hypothesis test.
In particular, we break up the hypothesis test \({\cal T}_{\text{TOST}}\) into two hypothesis tests:
\[ {\cal T}_- = \begin{cases} H_{0-}:& \mu_T - \mu_R \le \delta_-,\\ H_{1-}:& \mu_T - \mu_R > \delta_-,\\ \end{cases} \] and, \[ {\cal T}_+ = \begin{cases} H_{0+}:& \mu_T - \mu_R \ge \delta_+,\\ H_{1+}:& \mu_T - \mu_R < \delta_+.\\ \end{cases} \]
The hypothesis formulation \({\cal T}_{-}\) has \(H_{0-}\) as \(\mu_T \le \mu_R + \delta_{-}\), and since \(\delta_{-} < 0\) this null can be read as “the test formulation is inferior to the reference formulation by at least the margin”. Rejecting \(H_{0-}\) therefore demonstrates non-inferiority of the test.
Similarly, the hypothesis formulation \({\cal T}_{+}\) has \(H_{0+}\) as \(\mu_T \ge \mu_R + \delta_{+}\), and since \(\delta_{+} > 0\) this null can be read as “the test formulation is superior to the reference formulation by at least the margin”. Rejecting \(H_{0+}\) therefore demonstrates non-superiority of the test.
In fact, as a side comment, in different contexts each of these non-inferiority and non-superiority tests is of interest in itself; see FDA (2016).
Importantly, if we reject \(H_{0-}\) in \({\cal T}_-\) and also reject \(H_{0+}\) in \({\cal T}_+\), then the intersection of the two alternative hypotheses is that,
\[ \delta_- < \mu_T - \mu_R < \delta_+. \]
Now since the type-I error in each of the cases is not more than \(\alpha\), due to the structure of the hypothesis, the combined type-I error is not more than \(\alpha\). This is because for each possible null scenario, only one of the two one-sided nulls is true, and rejecting both at the same time has probability at most \(\alpha\). Since both nulls must be rejected to claim equivalence, the overall chance of Type I error is at most \(\alpha\). For an in-depth advanced analysis describing this idea, outside of the scope of this unit, see Berger and Hsu (1996).
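To illustrate this numerically, here is a small self-contained simulation (a sketch under the known-\(\sigma\) assumptions from above, using the confidence interval criterion; the helper name declare_be_at_boundary is ours) estimating the probability of declaring bioequivalence at the worst-case null boundary \(\mu_T - \mu_R = \delta_+\). It comes out at approximately \(\alpha = 0.05\):

Random.seed!(1)

δ₊ = 0.22314 # upper bioequivalence limit
σ_star = √2 * 0.06 # σ* for σ = 0.06
z_95 = quantile(Normal(), 0.95)

function declare_be_at_boundary()
    μ̂ = rand(Normal(δ₊, σ_star)) # point estimate under μ* = δ₊ (boundary of H₀)
    return (μ̂ - σ_star * z_95 > -δ₊) && (μ̂ + σ_star * z_95 < δ₊) # CI ⊂ [δ₋, δ₊]
end

mean([declare_be_at_boundary() for _ = 1:10^6]) # ≈ 0.05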
The test statistics
Each of these is a simpler (and standard) one sided test, decided via the test statistics,
\[ Z_- = \frac{\hat{\mu}^*-\delta_-}{\sigma^*}, \qquad \text{and} \qquad Z_+ = \frac{\hat{\mu}^*-\delta_+}{\sigma^*}. \]
The decision rules are:
- Reject \(H_{0-}\) (of \({\cal T}_-\)): if \(Z_- > z_{0.95}\).
- Reject \(H_{0+}\) (of \({\cal T}_+\)): if \(Z_+ < -z_{0.95} = z_{0.05}\).
These rules are derived based on the extreme values of the null parameter range in each case.
Now recalling the confidence intervals from above, \(\hat{L}^* = \hat{\mu}^* - \sigma^* z_{0.95}\) and \(\hat{U}^* = \hat{\mu}^* + \sigma^* z_{0.95}\), we have,
\[ Z_- > z_{0.95} \quad \iff \quad \hat{\mu}^* > \delta_- + \sigma^* z_{0.95} \quad \iff \quad \hat{L}^* > \delta_- \] and, \[ Z_+ < -z_{0.95} \quad \iff \quad \hat{\mu}^* < \delta_+ - \sigma^* z_{0.95} \quad \iff \quad \hat{U}^* < \delta_+ \]
Thus we see that rejection of both tests is exactly identical to the confidence interval \([\hat{L}^*, \hat{U}^*]\) being contained in \([\delta_-, \delta_+]\).
A simulation example
Here is some example code simulating such a trial. This is not code you would use for an actual bioequivalence analysis; rather, it is code that helps illustrate the key points we discuss in this unit.
First a random trial:
function random_trial(; μ_R = 0.0, μ_T = 0.07, σ = 0.06)
    x_R = rand(Normal(μ_R, σ))
    x_T = rand(Normal(μ_T, σ))
    μ̂ = x_T - x_R
    return μ̂
end
Now, the confidence interval for the trial, and a check of whether the trial passes based on that interval:
function ci_of_trial(μ̂; σ = 0.06)
    z_95 = quantile(Normal(), 0.95)
    return (μ̂ - √(2) * σ * z_95, μ̂ + √(2) * σ * z_95)
end

function ci_passes_trial(ci; δ₊ = 0.22314, δ₋ = -δ₊)
    return δ₋ ≤ ci[1] && ci[2] ≤ δ₊ # true means reject H0 and conclude BE
end
Now, the two one sided hypothesis tests associated with the trial:
function lower_trial_reject(μ̂; σ = 0.06, δ₋ = -0.22314)
    z_95 = quantile(Normal(), 0.95)
    Z₋ = (μ̂ - δ₋) / (√(2) * σ)
    return Z₋ > z_95
end

function upper_trial_reject(μ̂; σ = 0.06, δ₊ = 0.22314)
    z_05 = quantile(Normal(), 0.05)
    Z₊ = (μ̂ - δ₊) / (√(2) * σ)
    return Z₊ < z_05
end
Now this function makes the trial’s decision using the confidence interval approach, and also verifies (with @assert) that the hypothesis test approach is equivalent:
function make_decision(μ̂; σ = 0.06, δ₊ = 0.22314, δ₋ = -δ₊)
    ci = ci_of_trial(μ̂; σ)
    decision = ci_passes_trial(ci; δ₊, δ₋)
    if decision # true
        @assert lower_trial_reject(μ̂; σ, δ₋) && upper_trial_reject(μ̂; σ, δ₊)
    else # false
        @assert lower_trial_reject(μ̂; σ, δ₋) == false ||
                upper_trial_reject(μ̂; σ, δ₊) == false
    end
    return decision
end
Here it is on one arbitrary trial where the decision is to reject \(H_0\):
Random.seed!(0)

μ̂ = random_trial()
@show μ̂
ci = ci_of_trial(μ̂)
@show ci
pass_ci = ci_passes_trial(ci)
@show pass_ci
lower_reject = lower_trial_reject(μ̂)
@show lower_reject
upper_reject = upper_trial_reject(μ̂)
@show upper_reject
decision = make_decision(μ̂)
@show decision;
μ̂ = 0.021457133452423928
ci = (-0.11811332498877689, 0.16102759189362473)
pass_ci = true
lower_reject = true
upper_reject = true
decision = true
And here is another arbitrary one where the decision is not to reject \(H_0\):
Random.seed!(2)

μ̂ = random_trial()
@show μ̂
ci = ci_of_trial(μ̂)
@show ci
pass_ci = ci_passes_trial(ci)
@show pass_ci
lower_reject = lower_trial_reject(μ̂)
@show lower_reject
upper_reject = upper_trial_reject(μ̂)
@show upper_reject
decision = make_decision(μ̂)
@show decision;
μ̂ = 0.17446680750826965
ci = (0.034896349067068844, 0.31403726594947046)
pass_ci = false
lower_reject = true
upper_reject = false
decision = false
Let’s see 10 separate simulated trials:
Random.seed!(0)

sim_trials = [make_decision(random_trial()) for _ = 1:10]
@show sim_trials
@show mean(sim_trials);
sim_trials = Bool[1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
mean(sim_trials) = 0.7
As you can see, in some we reject \(H_0\) (1) and in others we do not (0). The mean of these indicators estimates the probability of rejecting \(H_0\).
Now this code estimates the power as a function of \(\mu_T\), for a given \(\sigma = 0.06\). It is more important to focus on the output table than on the code.
function hypothesis_test_probabilities(σ)
    range_μ_T = sort(union(-0.5:0.05:0.5, [-0.22314, 0.22314]))
    n_μ_T = length(range_μ_T)
    probs = DataFrame(
        μ_T = range_μ_T,
        power = zeros(n_μ_T),
        lower_reject_prob = zeros(n_μ_T),
        upper_reject_prob = zeros(n_μ_T),
    )
    @rtransform!(
        probs,
        :be_case = ((-0.22314 < :μ_T < 0.22314) ? "H₁ (BE)" : "H₀ (Not BE)")
    )
    @select!(probs, :be_case, Not(:be_case))
    for i = 1:nrow(probs)
        μ_T = probs.μ_T[i]
        data = [random_trial(; σ, μ_T) for _ = 1:10^6]
        probs.power[i] = mean(make_decision.(data; σ))
        probs.lower_reject_prob[i] = mean(lower_trial_reject.(data; σ))
        probs.upper_reject_prob[i] = mean(upper_trial_reject.(data; σ))
    end
    return probs
end

simple_table(hypothesis_test_probabilities(0.06))
be_case | μ_T | power | lower_reject_prob | upper_reject_prob |
H₀ (Not BE) | -0.5 | 3.0e-6 | 3.0e-6 | 1 |
H₀ (Not BE) | -0.45 | 6.0e-6 | 6.0e-6 | 1 |
H₀ (Not BE) | -0.4 | 0.0001 | 0.0001 | 1 |
H₀ (Not BE) | -0.35 | 0.0008 | 0.0008 | 1 |
H₀ (Not BE) | -0.3 | 0.00534 | 0.00534 | 1 |
H₀ (Not BE) | -0.25 | 0.0249 | 0.0249 | 1 |
H₀ (Not BE) | -0.223 | 0.0502 | 0.0504 | 1 |
H₁ (BE) | -0.2 | 0.0845 | 0.0849 | 1 |
H₁ (BE) | -0.15 | 0.214 | 0.217 | 0.997 |
H₁ (BE) | -0.1 | 0.408 | 0.423 | 0.985 |
H₁ (BE) | -0.05 | 0.596 | 0.654 | 0.942 |
H₁ (BE) | 0 | 0.675 | 0.837 | 0.838 |
H₁ (BE) | 0.05 | 0.595 | 0.942 | 0.653 |
H₁ (BE) | 0.1 | 0.408 | 0.985 | 0.423 |
H₁ (BE) | 0.15 | 0.214 | 0.997 | 0.217 |
H₁ (BE) | 0.2 | 0.0845 | 1 | 0.0849 |
H₀ (Not BE) | 0.223 | 0.0499 | 1 | 0.05 |
H₀ (Not BE) | 0.25 | 0.0249 | 1 | 0.0249 |
H₀ (Not BE) | 0.3 | 0.00535 | 1 | 0.00536 |
H₀ (Not BE) | 0.35 | 0.000865 | 1 | 0.000865 |
H₀ (Not BE) | 0.4 | 9.8e-5 | 1 | 9.8e-5 |
H₀ (Not BE) | 0.45 | 4.0e-6 | 1 | 4.0e-6 |
H₀ (Not BE) | 0.5 | 1.0e-6 | 1 | 1.0e-6 |
Here are some observations from above:
- Observe that at the limits \(\delta_-\) and \(\delta_+\), the power estimate is approximately \(\alpha = 5\%\).
- Observe that the maximal power is when \(\mu_T = \mu_R\) \((= 0)\), in which case it is about \(67.5\%\).
- Observe that for highly positive \(\mu_T\) the lower test, \({\cal T}_-\) does not have an effect (we reject with probability \(1\)). Similarly, for highly negative \(\mu_T\) the upper test \({\cal T}_+\) rejects with probability \(1\).
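Since the loading block at the top of this unit includes AlgebraOfGraphics and CairoMakie, we can also visualize the estimated power curve. Here is a minimal plotting sketch (standard AlgebraOfGraphics usage; the axis labels and color mapping are our choices for illustration):

probs_006 = hypothesis_test_probabilities(0.06) # re-simulate, or reuse the table above
plt = data(probs_006) * mapping(:μ_T, :power, color = :be_case) * visual(ScatterLines)
draw(plt; axis = (xlabel = "μ_T", ylabel = "estimated power"))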
Now if somehow the variability of the drug products were lower (\(\sigma = 0.04\) instead of \(\sigma = 0.06\)), then the power generally increases:
simple_table(hypothesis_test_probabilities(0.04))
be_case | μ_T | power | lower_reject_prob | upper_reject_prob |
H₀ (Not BE) | -0.5 | 0 | 0 | 1 |
H₀ (Not BE) | -0.45 | 0 | 0 | 1 |
H₀ (Not BE) | -0.4 | 1.0e-6 | 1.0e-6 | 1 |
H₀ (Not BE) | -0.35 | 3.5e-5 | 3.5e-5 | 1 |
H₀ (Not BE) | -0.3 | 0.00135 | 0.00135 | 1 |
H₀ (Not BE) | -0.25 | 0.017 | 0.017 | 1 |
H₀ (Not BE) | -0.223 | 0.0506 | 0.0506 | 1 |
H₁ (BE) | -0.2 | 0.109 | 0.109 | 1 |
H₁ (BE) | -0.15 | 0.363 | 0.363 | 1 |
H₁ (BE) | -0.1 | 0.703 | 0.703 | 1 |
H₁ (BE) | -0.05 | 0.921 | 0.922 | 0.999 |
H₁ (BE) | 0 | 0.978 | 0.989 | 0.989 |
H₁ (BE) | 0.05 | 0.921 | 0.999 | 0.921 |
H₁ (BE) | 0.1 | 0.702 | 1 | 0.702 |
H₁ (BE) | 0.15 | 0.362 | 1 | 0.362 |
H₁ (BE) | 0.2 | 0.108 | 1 | 0.108 |
H₀ (Not BE) | 0.223 | 0.0505 | 1 | 0.0505 |
H₀ (Not BE) | 0.25 | 0.017 | 1 | 0.017 |
H₀ (Not BE) | 0.3 | 0.00136 | 1 | 0.00136 |
H₀ (Not BE) | 0.35 | 3.4e-5 | 1 | 3.4e-5 |
H₀ (Not BE) | 0.4 | 0 | 1 | 0 |
H₀ (Not BE) | 0.45 | 0 | 1 | 0 |
H₀ (Not BE) | 0.5 | 0 | 1 | 0 |
In practice, we obviously do not get to influence \(\sigma\). Instead, we sample more than one subject for each formulation. In fact, we also don’t know \(\sigma\) but rather need to estimate it. This puts us in the realm of the T-test. The good news is that all of the principles from the hypothetical case we considered up to now carry over to the much more realistic setting.
4 A much more realistic case - parallel design
As a data example let us create the same NCA based dataset that we analyzed in the previous two units with a parallel study, this time focusing only on Cmax.
= dataset("nca/crossover_study.csv")
df @subset!(df, :period .== 1)
@rtransform!(df, :sequence = :sequence == "RT" ? "R" : "T")
= read_nca(df; observations = :concentration, group = [:sequence, :period])
pop = run_nca(pop; parameters = [:cmax])
nca_result = nca_result.reportdf; be_analysis_data
Now, let us use the preprocess_be function, as it also automatically log-transforms the data (we’ll also sort by the sequence).
transformed_data = sort(preprocess_be(be_analysis_data, endpoint = :cmax), :sequence)
simple_table(transformed_data)
id | id_within_sequence | sequence | period | formulation | endpoint |
1 | 1 | R | 1 | 'R' | 9.03 |
10 | 2 | R | 1 | 'R' | 8.92 |
2 | 3 | R | 1 | 'R' | 9.08 |
3 | 4 | R | 1 | 'R' | 8.78 |
4 | 5 | R | 1 | 'R' | 8.57 |
5 | 6 | R | 1 | 'R' | 9.08 |
6 | 7 | R | 1 | 'R' | 8.81 |
7 | 8 | R | 1 | 'R' | 9.14 |
8 | 9 | R | 1 | 'R' | 8.69 |
9 | 10 | R | 1 | 'R' | 9.43 |
11 | 1 | T | 1 | 'T' | 8.87 |
12 | 2 | T | 1 | 'T' | 8.68 |
13 | 3 | T | 1 | 'T' | 9.15 |
14 | 4 | T | 1 | 'T' | 9.01 |
15 | 5 | T | 1 | 'T' | 8.65 |
16 | 6 | T | 1 | 'T' | 9.42 |
17 | 7 | T | 1 | 'T' | 8.99 |
18 | 8 | T | 1 | 'T' | 8.76 |
19 | 9 | T | 1 | 'T' | 8.89 |
20 | 10 | T | 1 | 'T' | 8.96 |
Now, similar to the \(Z\), \(Z_-\), and \(Z_+\) statistics that we presented above under the simplified assumptions of a single observation and known variance, we now have T statistics that can be used for confidence intervals and hypothesis tests. They can then be used for the two one sided T-test just as above.
In particular we consider the sample means and sample standard deviations,
x̄_R, s_R = mean(transformed_data.endpoint[1:10]), std(transformed_data.endpoint[1:10])
x̄_T, s_T = mean(transformed_data.endpoint[11:20]), std(transformed_data.endpoint[11:20])
@show x̄_R, s_R
@show x̄_T, s_T;
(x̄_R, s_R) = (8.953467566963166, 0.2510234409844379)
(x̄_T, s_T) = (8.93727699385348, 0.2290728211926693)
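As an aside, the hard-coded row ranges 1:10 and 11:20 above rely on the sorting we performed; here is a sketch of an order-independent alternative using DataFramesMeta’s @by (the column names are taken from the table above):

# Group by formulation instead of indexing rows 1:10 and 11:20
@by(transformed_data, :formulation, :x̄ = mean(:endpoint), :s = std(:endpoint))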
Based on these four summary statistics we can get a lot done! But we need to make some assumptions:
- Normality and independence of the (log-transformed) data.
- The same variability for both R and T.
- The number of observations per formulation is \(n_R\) and \(n_T\), and we denote the observations \(X^R_{i}\) and \(X^T_{i}\) respectively.
A point estimate for the difference in means is,
\[ \hat{\mu}^* = \frac{1}{n_T}\sum_{i=1}^{n_T} X^T_{i} - \frac{1}{n_R}\sum_{i=1}^{n_R} X^R_{i} \]
μ̂ = x̄_T - x̄_R
@show μ̂;
μ̂ = -0.016190573109685857
Based on it, we define the \(T\) statistic:
\[ T = \frac{\hat{\mu}^*}{S_p\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}. \]
If \(\mu_T = \mu_R\) then it follows a T-distribution with \(d = n_R + n_T - 2\) degrees of freedom. When \(d\) is not too small (e.g. \(d > 40\)) the distribution is very similar to a standard normal distribution. Yet, for small \(d\) it has heavier tails.
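To get a feel for these heavier tails, here is a small illustrative sketch comparing the \(95\%\) quantile of the T-distribution against the standard normal for several degrees of freedom (the chosen values of \(d\) are arbitrary):

for d in (5, 18, 40, 200)
    println("d = $d: t quantile = ", round(quantile(TDist(d), 0.95), digits = 4),
            ", normal quantile = ", round(quantile(Normal(), 0.95), digits = 4))
end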
The denominator of the \(T\) statistic has the pooled standard deviation,
\[ S_p = \sqrt{ \frac{(n_R-1)S_R^2 + (n_T-1)S_T^2}{n_R + n_T - 2} }, \]
where each of \(S_R\) and \(S_T\) are standard deviations
\[ S_R = \sqrt{ \frac{1}{n_R-1} \sum_{i=1}^{n_R} (x^R_{i} - \overline{x}_R)^2 } \qquad \text{and} \qquad S_T = \sqrt{ \frac{1}{n_T-1} \sum_{i=1}^{n_T} (x^T_{i} - \overline{x}_T)^2 }. \]
Let’s compute the pooled standard deviation and the T statistic for our data.
s_p = sqrt((9 * s_R^2 + 9 * s_T^2) / 18)
t_statistic = (x̄_T - x̄_R) / (s_p * √(1 / 10 + 1 / 10))
@show s_p
@show t_statistic;
s_p = 0.24029890275741608
t_statistic = -0.15065912350205451
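The hard-coded 9, 9, and 18 above are \(n_R - 1\), \(n_T - 1\), and \(n_R + n_T - 2\) for \(n_R = n_T = 10\). For clarity, here is a general version following the formulas above (the helper name pooled_t is ours, for illustration):

# General pooled standard deviation and T statistic (per the formulas above)
function pooled_t(x̄_R, s_R, n_R, x̄_T, s_T, n_T)
    s_p = sqrt(((n_R - 1) * s_R^2 + (n_T - 1) * s_T^2) / (n_R + n_T - 2)) # pooled SD
    T = (x̄_T - x̄_R) / (s_p * sqrt(1 / n_T + 1 / n_R)) # T statistic
    return s_p, T
end

pooled_t(x̄_R, s_R, 10, x̄_T, s_T, 10) # reproduces s_p and t_statistic above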
Now a \(90\% = 100 \times (1-\alpha)\%\) confidence interval based on the T statistic,
\[ {\mathbb P}(\hat{L}^* \le \mu_T - \mu_R \le \hat{U}^*) = 0.9, \]
is based on,
\[ \hat{L}^* = \hat{\mu}^* - t_{d, 0.95} S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}} , \qquad \text{and} \qquad \hat{U}^* = \hat{\mu}^* + t_{d, 0.95} S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}. \]
Here \(t_{d, 0.95}\) is the \(95\%\) quantile of the T-distribution with \(d\) degrees of freedom.
t_95 = quantile(TDist(18), 0.95)
@show t_95;
t_95 = 1.7340636066175381
Here is the confidence interval computed explicitly:
ci = (μ̂ - t_95 * s_p * √(1 / 10 + 1 / 10), μ̂ + t_95 * s_p * √(1 / 10 + 1 / 10))
(-0.2025416081295061, 0.1701604619101344)
And here we can see it back on the original scale, as a percentage.
100round.(collect(exp.(ci)), digits = 4) |> println
[81.67, 118.55]
Here is the same confidence interval computed via pumas_be (slight differences are due to rounding implementations).
pumas_be(be_analysis_data, endpoint = :cmax)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (equal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 18 | |||
90% Confidence Interval (%) | [81.66, 118.6] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
Considering other output values, the Variability in this simple model is estimated simply via the standard deviation. You can see it is the same as in the pumas_be output. In other, more complex models, variability estimation is more involved.
σ̂ = round(std(transformed_data.endpoint), digits = 3)
CV = round(100 * sigma2geocv(σ̂), digits = 2)

@show CV, σ̂;
(CV, σ̂) = (23.72, 0.234)
Finally, observe that the Geometric Mean T/R Ratio is simply the transformed (exponentiated) point estimate.
round(100exp(μ̂), digits = 2)
98.39
A variant that does not assume equal variance of both treatments
There is a variant of the T test which does not require us to assume that the variability of R and the variability of T are the same. This is sometimes called the Welch–Satterthwaite T test.
The choice of whether to use this variant or the equal-variance test (presented above) can be based on our willingness to make assumptions about the bioequivalence trial before conducting it. In particular, if we have reason to believe that the variability of R and the variability of T are significantly different, then not assuming equal variance would make sense. In any case, this needs to be specified in the study protocol.
From a user point of view, the non-equal variance case estimates an approximate degrees of freedom based on the standard deviations of both R and T. This is done via a formula as follows:
n_R = 10 # the number of reference observations
n_T = 10 # the number of test observations

# formula for the approximate degrees of freedom
v =
    (s_R^2 / n_R + s_T^2 / n_T)^2 /
    ((s_R^2 / n_R)^2 / (n_R - 1) + (s_T^2 / n_T)^2 / (n_T - 1))
17.851353654711797
Then a T-distribution with this value for the degrees of freedom is used. Note that the value will almost always be non-integer.
The confidence interval is then computed as such:
t_095 = quantile(TDist(v), 0.95)
ci =
    (μ̂ - t_095 * sqrt(s_R^2 / 10 + s_T^2 / 10), μ̂ + t_095 * sqrt(s_R^2 / 10 + s_T^2 / 10))
(-0.20262570982810862, 0.1702445636087369)
In this case, to convert back to the non-log scale we carry out:
100round.(collect(exp.(ci)), digits = 4) |> println
[81.66, 118.56]
Observe that in this specific example the confidence interval is essentially the same as in the equal variance case above, yet in general it need not be. In any case, with Pumas, carrying out such a test is done simply by specifying the keyword argument homogeneity = false. Note that by default it is set to true for parallel designs, and hence we omitted it in the equal variance example above. Here is the output:
pumas_be(be_analysis_data, endpoint = :cmax, homogeneity = false)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (unequal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 17.85 | |||
90% Confidence Interval (%) | [81.65, 118.6] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
You can see that the degrees of freedom estimate is 17.85, which up to rounding is exactly what we computed manually above. Note that the actual variability estimate is unaltered.
5 Exploring power for the parallel design
The above bioequivalence hypothesis test passed. Formally what we have done is consider \({\cal T_{\text{TOST}}}\) and reject \(H_0\) using the confidence interval based criteria. From a statistical perspective, this would put us in a position for applying for approval for the test drug. In this case, as producers of the test formulation we have reached our goal, and in parallel consumers of the drug are safe since we have kept a cap on the consumer risk (Type-I error is less than \(5\%\)).
Of course there is also “luck” involved. With a different reality we could have had different observations where \(H_0\) is not rejected and we cannot conclude bioequivalence. Similarly, there is the possibility that the two formulations are actually not bioequivalent (the reality is in \(H_0\)), and in parallel some “bad luck” for the consumers implied that the endpoint observations were close enough. But such “bad luck” is bounded by \(5\%\).
While it is not common to modify \(\alpha\), for pedagogical purposes let’s see what happens when we halve \(\alpha\) from \(5\%\) to \(2.5\%\). Hypothetically this can happen if the regulator decides to be more stringent (cautious) in terms of the consumer’s risk (type-I error). To adjust this with pumas_be, we can use the level keyword argument, which specifies the width of the confidence interval. Note that for type-I error α we need a level of 1 - 2α. (For ease of comparison with the previous example, we keep homogeneity = false):
α = 0.025
pumas_be(be_analysis_data, endpoint = :cmax, homogeneity = false, level = 1 - 2α)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (unequal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 17.85 | |||
95% Confidence Interval (%) | [78.49, 123.4] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
Compare the above 95% confidence interval [78.49, 123.4] with the \(90\%\) confidence interval [81.65, 118.6] we had before. You can see that we “paid a price” for requesting more confidence. That price is the wider interval. As a consequence of the lower bound, we do not reject \(H_0\) and thus see Fail.
So say that after such a failure, and carrying out some additional manufacturing and engineering checks, the drug manufacturer still believes that the two formulations are bioequivalent. In this case it is probably time for a new trial, and we are here to help the manufacturer in planning that trial.
It is very possible that we failed because our previous trial did not control the producer’s risk (type-II error); in other words, the power was not high enough. For example, let’s be more conservative than the results of the previous trial and assume that the Geometric Mean Ratio is at \(96\%\) (in our previous trial the point estimate was \(98.39\%\)). Also, let’s assume that the variability (CV) is at \(25\%\); in the previous trial we estimated it as \(23.73\%\). What kind of type-II error do we expect for such a trial? Alternatively, what is the power (the complement of the type-II error)?
To answer such questions we can use the BioequivalencePower package. In particular, let us use the power function from that package. More specific details on power and sample size studies appear in the sequel of this course. At this point let us just carry out basic usage without going into the details. Here we ask Pumas to compute the power:
power(RatioTOST, ED2Parallel, n = 20, CV = 0.25, peθ₀ = 0.96, α = 0.025)
Computed achieved power: 8.465%
HypothesisTestMetaParams Settings: n = 20 α = 0.025
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
Oh no! The result indicates that the power of the test is only \(0.08465\)! This means that the type-II error is over \(90\%\) and hence we have a very high chance of not being able to show bioequivalence. The producer is certainly at risk!
If we try to adjust our assumptions, for example assuming the geometric mean ratio is at \(1.0\) and the CV is better, we get more power:
power(RatioTOST, ED2Parallel, n = 20, CV = 0.2, peθ₀ = 1.0, α = 0.025)
Computed achieved power: 34.247%
HypothesisTestMetaParams Settings: n = 20 α = 0.025
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.2 θ₀ = 1.0 [θ₁, θ₂] = [0.8, 1.25]
Still, with about \(34\%\) power we are not in a good situation. In fact, it is customary to aim for a producer’s risk of about \(20\%\), or a power of \(80\%\).
Now obviously, the elephant in the room is the non-standard \(\alpha = 0.025\). If we remove that parameter, the power function uses the default \(5\%\) type-I error.
power(RatioTOST, ED2Parallel, n = 20, CV = 0.2, peθ₀ = 1.0)
Computed achieved power: 56.384%
HypothesisTestMetaParams Settings: n = 20 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.2 θ₀ = 1.0 [θ₁, θ₂] = [0.8, 1.25]
With this our power is starting to be (a bit more) sensible, but if we go back to the more conservative CV and geometric mean ratio,
power(RatioTOST, ED2Parallel, n = 20, CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 24.08%
HypothesisTestMetaParams Settings: n = 20 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
we are back at a low power value. Now, in practice we are not able to influence the CV or the actual geometric mean ratio; we can only use these values to try and gauge what kind of power to expect. But what we can do is increase the sample size (at a cost to the producer and a potential ethical concern if the sample size is too big). So what happens if we have \(30\) subjects instead of \(20\)?
power(RatioTOST, ED2Parallel, n = 30, CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 51.929%
HypothesisTestMetaParams Settings: n = 30 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
The power is again getting to be sensible, now at around \(50\%\). How big a sample size is in fact needed? For this, instead of searching manually using power, let’s use the samplesize function from the BioequivalencePower package:
samplesize(RatioTOST, ED2Parallel, CV = 0.25, peθ₀ = 0.96)
Determined sample size: n = 50 achieved power = 80.71%
HypothesisTestMetaParams Settings: α = 0.05 target_power = 0.8
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
Well, it turns out that we need \(n=50\) participants, a randomized \(25\) of which will receive the R treatment while the others receive T. In fact, to be safe, since we are planning a trial, we can also consider what will happen if, say, \(3\) drop out of R and \(2\) drop out of T.
power(RatioTOST, ED2Parallel, n = [22, 23], CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 75.86%
HypothesisTestMetaParams Settings: n = [22, 23] α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
We can see this won’t cause a huge loss of power. Similarly, we can continue the analysis and see what happens if we are wrong about the GMR value and it is at \(0.9\) instead of \(0.96\):
power(RatioTOST, ED2Parallel, n = [22, 23], CV = 0.25, peθ₀ = 0.9)
Computed achieved power: 47.075%
HypothesisTestMetaParams Settings: n = [22, 23] α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.9 [θ₁, θ₂] = [0.8, 1.25]
The power drops, but does not vanish.
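To organize such what-if questions, one can sweep over a grid of assumed GMR and CV values and inspect the achieved power for each. Here is a sketch using the same call pattern as above (the grid values are arbitrary illustrations):

# Sketch: power sensitivity over assumed GMR and CV values, keeping n = [22, 23]
for gmr in (0.90, 0.96, 1.00), cv in (0.20, 0.25, 0.30)
    println("GMR = $gmr, CV = $cv:")
    println(power(RatioTOST, ED2Parallel, n = [22, 23], CV = cv, peθ₀ = gmr))
end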
In summary, this type of bioequivalence planning analysis can take place, all with a view to ensuring that the producer’s risk is controlled. Luckily, the consumer’s risk is already controlled via \(\alpha\).
In future units we’ll focus on other aspects of planning and the BioequivalencePower package, and discuss methodologies for making a decision on the design type and the number of subjects in the trial.
6 Conclusion
This unit explores the statistical foundations of bioequivalence testing, focusing on the Two One-Sided T-Test (TOST) approach used in parallel study designs. We begin by establishing the limitations of classical hypothesis testing for bioequivalence, where the null hypothesis assumes equivalence - a problematic formulation that doesn’t adequately control consumer risk (the probability of incorrectly concluding bioequivalence when it doesn’t exist). Instead, bioequivalence studies use TOST, which reformulates the null hypothesis to assume non-equivalence, requiring researchers to demonstrate that the difference between test and reference formulations falls within predefined acceptance limits (typically \(80-125\%\) on the original scale, or \(\pm 0.223\) on the log scale). This approach involves conducting two one-sided hypothesis tests and is equivalent to demonstrating that a \(90\%\) confidence interval for the ratio of geometric means falls entirely within the bioequivalence limits. The unit demonstrates these concepts using both simplified theoretical examples and realistic data analysis, showing how sample size, variability (CV), and the true geometric mean ratio affect statistical power, and concludes with practical guidance on study planning to ensure adequate power while controlling both consumer and producer risks.
7 Unit exercises
Hypothesis Formulation
Using the Two One-Sided T-Test (TOST) approach, formulate the null and alternative hypotheses for a bioequivalence study comparing drug formulations R and T. Assume the bioequivalence limits are defined as \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\).
- Clearly state the null hypothesis \(H_0\).
- Clearly state the alternative hypothesis \(H_1\).
Type-I and Type-II Error Analysis
Define Type-I and Type-II errors in the context of bioequivalence testing. Additionally, relate each type of error to its corresponding consumer’s risk and producer’s risk.
- Explain Type-I error in your own words.
- Explain Type-II error in your own words.
Statistical Power Calculation
A bioequivalence trial estimates a Geometric Mean Ratio (GMR) of 0.95 with a Coefficient of Variation (CV) of 0.2. If the sample size is determined to be \(n = 20\) and the Type-I error rate \(\alpha\) is set at \(0.05\), compute the power of the test using the power function from the BioequivalencePower package.
- Write the code you used for the computation.
- What is the estimated power?
Confidence Interval Interpretation
After performing a bioequivalence study, you calculate a 90% confidence interval for the difference in means, yielding \([0.15, 0.30]\). Based on these results, determine whether the formulations R and T meet the bioequivalence criteria defined by \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\).
- State your conclusion regarding bioequivalence.
- Justify your conclusion based on the confidence interval.