using Pumas # available with Pumas products - also reexports Distributions
using NCA # available with Pumas products
using Bioequivalence # available with Pumas products
using BioequivalencePower # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFramesMeta
using SummaryTables
using DataFrames
using AlgebraOfGraphics
using CairoMakie
using Random

A Course on Bioequivalence: Unit 3 - The Two One Sided T-Test on a Parallel Design
This is Unit 3 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In the previous unit we considered log transformations, the log-normal distribution, geometric means, the geometric mean ratio, and the CV. In this unit we finally explore the statistical inference concepts of a hypothesis test and a confidence interval, as used for bioequivalence. In particular we consider:
- The basic statistical hypothesis test.
- The notions of consumer’s risk and producer’s risk in drug development and bioequivalence.
- The two one sided T-test (TOST).
- The relationship between a two one sided T-test (TOST) and a confidence interval based criteria.
- Analysis of these concepts in the setup of a basic parallel design.
In this unit we use the following packages, of which the first 5 are Pumas packages (available only with a Pumas installation).
We have used most of these packages in the previous units. New to this unit is Pumas’ BioequivalencePower
package which provides power analysis for planning of bioequivalence trials.
2 Getting the hypothesis formulation right
Let us understand the basics of hypothesis tests and confidence intervals in the context of bioequivalence.
As a very simplified situation, let us assume we are comparing R and T under very simplified (and unrealistic at this point) assumptions:
- Only a single observation of R and a single observation of T. Denote these observations as \(X_R\) and \(X_T\) respectively. While it is unrealistic to assume a single observation per formulation, we do so for simplicity of this first example.
- The observations are already log-transformed so they are normally distributed. This is not an unrealistic assumption. In fact it is the common case in bioequivalence analysis.
- The (log transformed) observations have a known standard deviation \(\sigma = 0.1\). It is unrealistic to assume that \(\sigma\) (or the CV) is known; we do so for simplicity of this first example.
- The unknown quantities are \(\mu_R\) and \(\mu_T\), which are the means of the (log-transformed) reference and test respectively.
Under these assumptions the difference \(X_T - X_R\) is distributed according to a normal distribution with mean \(\mu^* = \mu_T - \mu_R\) and standard deviation \(\sigma^* = \sqrt{2} \sigma\). We call this difference the point estimate for the mean difference and denote it \(\hat{\mu}^*\).
Note that in a more realistic setting, discussed below, there are multiple observations and \(\sigma\) is not known but is rather estimated.
Classic Hypothesis Formulations
When we carry out hypothesis tests we break up the possibilities in the world into \(H_0\) (the null hypothesis) and \(H_1\) (the alternative hypothesis).
Typically we consider \(H_0\) as the known or default situation. From a scientific perspective, we believe and hope that \(H_1\) is actually the case, and our goal is to properly demonstrate that this alternative hypothesis holds.
Since a decision must be made, we are faced with the following possibilities, where each row is dictated by the reality in the world and each column is based on what we decide. Note that \(H_0\) is sometimes called “negative” while \(H_1\) is sometimes called “positive”.
 | Decision: Retain \(H_0\) | Decision: Reject \(H_0\) |
---|---|---|
Reality: \(H_0\) | ✓ (True Negative) | Type I Error (False Positive) (erroneous rejection of \(H_0\)) |
Reality: \(H_1\) | Type II Error (False Negative) (erroneously failing to reject \(H_0\)) | ✓ (True Positive) |
Now since our decision is based on sampled data, and that data is random, we try to control and understand the probabilities of the two possible errors: type-I errors (false positives) and type-II errors (false negatives). In practice we are often able to limit the probability of a type-I error to be no more than \(\alpha\), with a typical value being \(\alpha = 0.05\) (or \(5\%\)). As for type-II errors, we consider their complement, the probability of a true positive. This probability is called the power of the test.
As a starting point, let us consider classical hypothesis tests (in contrast to two one sided t-tests covered below). This is not the type of hypothesis test we use for bioequivalence, yet it is good to start the discussion with these tests. Let us see why.
The formulation of the hypothesis can be as follows: \[ {\cal T}_{\text{classic}} = \begin{cases} H_{0}:& \mu_T - \mu_R = 0, \\[5pt] H_{1}:& \mu_T - \mu_R \neq 0.\\ \end{cases} \]
Here the null-hypothesis \(H_0\) indicates that \(\mu_T = \mu_R\) (bioequivalence holds) and the alternate hypothesis \(H_1\) indicates that \(\mu_T\) and \(\mu_R\) are different (the formulations are not equivalent). In reality, only one of the two cases holds, either \(H_0\) or \(H_1\). Based on the data we have to make a decision. So we observe the difference \(\hat{\mu}^* = X_T - X_R\) and attempt to make an inferential conclusion about \(\mu^*\).
The key to hypothesis testing is to have a test statistic with a known distribution under \(H_0\). With our very simplified assumptions of a single observation and a known standard deviation, we can take our sampled data and compute a test statistic, \[ Z = \frac{\hat{\mu}^*}{\sigma^*} = \frac{\hat{\mu}^*}{\sqrt{2} \sigma}. \] Under the \(H_0\) assumption the distribution of \(Z\) is standard normal. We then use the decision rule of rejecting \(H_0\) if \(Z < z_{0.025}\) or \(Z > z_{0.975}\), where \(z_{0.025}\) and \(z_{0.975}\) are the respective quantiles of the standard normal distribution. This rule ensures a type-I error of exactly \(5\%\). That is, under \(H_0\), there is exactly a probability of \(0.05\) that \(Z\) falls outside the interval \([z_{0.025}, z_{0.975}]\).
For the purpose of understanding, here is a simulated repetition of this hypothesis test, a million times, under the null hypothesis. We see that we (wrongfully) reject \(H_0\) about \(5\%\) of the time.
Random.seed!(0)

function single_trial_reject(; μ_R = 0.0, μ_T = 0.0, σ = 0.1)
    x_R = rand(Normal(μ_R, σ)) # Sample from the reference
    x_T = rand(Normal(μ_T, σ)) # Sample from the test
    Z = (x_T - x_R) / (√2 * σ) # Test statistic
    return abs(Z) > quantile(Normal(), 0.975)
end

print("Simulation estimation of Type-I error: ")
println(mean([single_trial_reject() for _ = 1:10^6]))
Simulation estimation of Type-I error: 0.04992
As for the type-II error, or its complement which is the power of the test, it depends on the actual difference \(\mu_T - \mu_R\). Thus let us repeat this type of simulation for various values of \(\mu_T\), each time keeping \(\mu_R\) at zero. Whenever \(\mu_T \neq \mu_R\) we are in an \(H_1\) setting. Yet we see that our probability of rejecting (the power) depends on where in \(H_1\) the actual parameter lies:
range_μ_T = setdiff(-0.7:0.1:0.7, 0) # several options in H_1 (assuming μ_R = 0.0)
n_μ_T = length(range_μ_T)
probs = DataFrame(μ_T = range_μ_T, power = zeros(n_μ_T))

for i = 1:nrow(probs)
    μ_T = probs.μ_T[i]
    probs.power[i] = mean([single_trial_reject(; μ_T) for _ = 1:10^6])
end

simple_table(probs)
μ_T | power |
-0.7 | 0.999 |
-0.6 | 0.989 |
-0.5 | 0.943 |
-0.4 | 0.807 |
-0.3 | 0.564 |
-0.2 | 0.293 |
-0.1 | 0.109 |
0.1 | 0.109 |
0.2 | 0.293 |
0.3 | 0.564 |
0.4 | 0.807 |
0.5 | 0.942 |
0.6 | 0.989 |
0.7 | 0.999 |
Obviously for cases where \(\mu_T\) and \(\mu_R\) are farther from each other, we have higher power. This is because it is much easier to detect bigger differences in the means.
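These simulated values can also be checked analytically: under the actual parameter, \(Z\) is normal with mean \(\mu_T/(\sqrt{2}\sigma)\) and unit standard deviation, so the power has a closed form. Here is a minimal sketch of such a check (the helper name analytic_power is ours, introduced only for this illustration):

# Analytic power of the classical Z-test (with μ_R = 0, so μ* = μ_T)
function analytic_power(μ_T; σ = 0.1)
    z_975 = quantile(Normal(), 0.975)
    shift = μ_T / (√2 * σ) # mean of Z under the actual parameter
    # P(Z < -z_975) + P(Z > z_975)
    return cdf(Normal(), -z_975 - shift) + ccdf(Normal(), z_975 - shift)
end

analytic_power(0.3) # ≈ 0.564, agreeing with the simulated table above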
Consumer’s Risk and Producer’s Risk
Understanding the basics of hypothesis tests is important. However, hypothesis formulations such as \({\cal T}_{\text{classic}}\) do not work well for the bioequivalence setting. To see why, let us define two competing risks:
- consumer’s risk: This is the probability of determining that a test drug product is bioequivalent to the reference when it is actually not.
- producer’s risk: This is the probability of determining that a test drug product is not bioequivalent to the reference when it actually is.
How would we map the consumer’s risk and producer’s risk to type-I and type-II errors under the classic hypothesis \({\cal T}_{\text{classic}}\)?
- In this hypothesis formulation, \(H_0\) is equivalence. Hence rejecting \(H_0\) wrongfully (a type-I error) is the producer’s risk.
- Similarly, since \(H_1\) is non-equivalence, wrongfully failing to reject \(H_0\) (a type-II error) is the consumer’s risk.
This is a potential problem for consumer safety because with our statistical decision making, we control the type-I error (keeping it at most \(\alpha\)) but do not directly control the type-II error. This means that with a classic hypothesis formulation we ensure a low producer’s risk but have no direct control over the consumer’s risk! Thus using \({\cal T}_{\text{classic}}\) does not prioritize consumer safety.
This lack of suitability of classic hypothesis formulations (like \({\cal T}_{\text{classic}}\)) for bioequivalence, gives rise to the two-one-sided based formulation which we describe now.
Two One Sided Formulations
The general idea in bioequivalence is to define a notion of “closeness” of the reference and test. That is, to demonstrate bioequivalence we want \(\mu_R\) and \(\mu_T\) to be “close”.
The level of closeness is not a statistical issue but a pharmacological one. On the log scale it is the \(20\%\) difference yielding the well-known \([80\%, 125\%]\) bioequivalence limits.
For our purposes let’s say that there are values \(\delta_- < 0 < \delta_+\) where we say that the drugs are bioequivalent if, \[ \delta_- < \mu_T - \mu_R < \delta_+ \]
For example if we set \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\) then it corresponds to the \([80\%,125\%]\) bounds:
round.(log.([0.8, 1.25]), digits = 5) |> println
[-0.22314, 0.22314]
With \(\delta_-\) and \(\delta_+\) established based on pharmacological considerations (and considering the log scale), we now formulate the hypothesis: \[ {\cal T_{\text{TOST}}} = \begin{cases} H_{0}:& \mu_T - \mu_R \le \delta_- \qquad \text{or} \qquad \delta_+ \le \mu_T - \mu_R \\[5pt] H_{1}:& \qquad \qquad \delta_- < \mu_T - \mu_R < \delta_+\\ \end{cases} \]
This means that the null hypothesis \(H_0\) is that the formulations are not bioequivalent, while the alternative hypothesis \(H_1\) implies that the formulations are bioequivalent.
With this, the relationship between the errors (type-I and type-II) and risks (consumer’s and producer’s) is much more suitable.
- The type-I error is the consumer’s risk.
- The type-II error is the producer’s risk.
Now if we are able to carry out a hypothesis test for \({\cal T_{\text{TOST}}}\) where the type-I error is not more than \(\alpha\), we are able to protect consumers. Then, by attempting to keep the power at reasonable levels (typically above \(80\%\)) we are able to protect producers as a secondary objective. The most significant factor ensuring high power (or low type-II error) is the number of subjects in the experiment. On top of the ethical considerations associated with the number of subjects, this is a key cost factor for the producer. Hence with the \({\cal T_{\text{TOST}}}\) formulation, the tradeoff between producer’s risk and trial cost is left primarily in the hands of the producer, while consumers are guaranteed a cap on their risk.
Using a confidence interval for the decision
The common way to implement a bioequivalence decision for \({\cal T_{\text{TOST}}}\) is based on a confidence interval. A confidence interval for the unknown quantity \(\mu_T-\mu_R\) is comprised of a lower bound \(\hat{L}^*\) and an upper bound \(\hat{U}^*\) such that,
\[ {\mathbb P}(\hat{L}^* \le \mu_T - \mu_R \le \hat{U}^*) = 0.9. \]
Note that \(\hat{L}^*\) and \(\hat{U}^*\) both depend on the observed data.
It turns out that for a bioequivalence hypothesis test such as \({\cal T_{\text{TOST}}}\), where we control the type-I error (consumer’s risk) at \(\alpha\), we need a \(1-2\alpha\) confidence interval. So, for example, in the typical case where \(\alpha = 0.05\) we need a \(90\%\) confidence interval.
The use of a confidence interval is a matter of convenience, yet the interpretation in bioequivalence differs from the classic inferential interpretation. Details follow.
With our very basic assumptions above, we can use basic probability analysis to develop such confidence bounds. This is because the quantity,
\[ \hat{\mu}^* = X_T - X_R \]
has a normal distribution with mean \(\mu^* = \mu_T - \mu_R\) and standard deviation \(\sigma^* = \sqrt{2} \sigma\). Here \(\hat{\mu}^*\) is the point estimate for the difference in means.
An important supporting quantity is \(z_{0.95} \approx 1.645\), which is the \(95\%\) quantile of the standard normal distribution. With this it is easy to derive the symmetric confidence interval,
\[ \hat{L}^* = \hat{\mu}^* - \sigma^* z_{0.95}, \qquad \text{and} \qquad \hat{U}^* = \hat{\mu}^* + \sigma^* z_{0.95}. \]
The rule for determining bioequivalence is to check if \(\delta_- \le \hat{L}^*\) and \(\hat{U}^* \le \delta_+\). Or in other words, \([\hat{L}^*, \hat{U}^*] \subset [\delta_-, \delta_+]\). In particular:
- If \(\delta_- \le \hat{L}^*\) and \(\hat{U}^* \le \delta_+\) then reject \(H_0\) and conclude that T is bioequivalent to R.
- If this doesn’t hold then do not reject \(H_0\), and thus the statistical bioequivalence analysis fails in the sense that we cannot claim that T is bioequivalent to R.
We now see why this conclusion implements the hypothesis formulation \({\cal T_{\text{TOST}}}\).
3 A Two One Sided Test
To implement the hypothesis test \({\cal T_{\text{TOST}}}\) we use two hypothesis tests, each of which is a one-sided hypothesis test.
In particular, we break up the hypothesis test \({\cal T}_{\text{TOST}}\) into two hypothesis tests:
\[ {\cal T}_- = \begin{cases} H_{0-}:& \mu_T - \mu_R \le \delta_-,\\ H_{1-}:& \mu_T - \mu_R > \delta_-,\\ \end{cases} \] and, \[ {\cal T}_+ = \begin{cases} H_{0+}:& \mu_T - \mu_R \ge \delta_+,\\ H_{1+}:& \mu_T - \mu_R < \delta_+.\\ \end{cases} \]
The hypothesis formulation \({\cal T}_{-}\) has \(H_{0-}\) as \(\mu_T \le \mu_R + \delta_{-}\), and since \(\delta_{-} < 0\) this null can be read as “the test formulation is inferior to the reference formulation by at least the margin”. Rejecting \(H_{0-}\) therefore demonstrates non-inferiority of the test.
Similarly, the hypothesis formulation \({\cal T}_{+}\) has \(H_{0+}\) as \(\mu_T \ge \mu_R + \delta_{+}\), and since \(\delta_{+} > 0\) this null can be read as “the test formulation is superior to the reference formulation by at least the margin”. Rejecting \(H_{0+}\) therefore demonstrates non-superiority of the test.
In fact, as a side comment, in different contexts each of these non-inferiority and non-superiority tests is of interest in itself; see FDA (2016).
Importantly, if we reject \(H_{0-}\) in \({\cal T}_-\) and also reject \(H_{0+}\) in \({\cal T}_+\), then the intersection of the two alternative hypotheses is that,
\[ \delta_- < \mu_T - \mu_R < \delta_+. \]
Now since the type-I error in each of the cases is not more than \(\alpha\), due to the structure of the hypothesis, the combined type-I error is not more than \(\alpha\). This is because for each possible null scenario, only one of the two one-sided nulls is true, and rejecting both at the same time has probability at most \(\alpha\). Since both nulls must be rejected to claim equivalence, the overall chance of Type I error is at most \(\alpha\). For an in-depth advanced analysis describing this idea, outside of the scope of this unit, see Berger and Hsu (1996).
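To illustrate this numerically, here is a small self-contained simulation (a sketch under the known-\(\sigma\) assumptions from above, using the confidence interval criterion; the helper name declare_be_at_boundary is ours) estimating the probability of declaring bioequivalence at the worst-case null boundary \(\mu_T - \mu_R = \delta_+\). It comes out at approximately \(\alpha = 0.05\):

Random.seed!(1)

δ₊ = 0.22314 # upper bioequivalence limit
σ_star = √2 * 0.06 # σ* for σ = 0.06
z_95 = quantile(Normal(), 0.95)

function declare_be_at_boundary()
    μ̂ = rand(Normal(δ₊, σ_star)) # point estimate under μ* = δ₊ (boundary of H₀)
    return (μ̂ - σ_star * z_95 > -δ₊) && (μ̂ + σ_star * z_95 < δ₊) # CI ⊂ [δ₋, δ₊]
end

mean([declare_be_at_boundary() for _ = 1:10^6]) # ≈ 0.05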
The test statistics
Each of these is a simpler (and standard) one sided test, decided via the test statistics,
\[ Z_- = \frac{\hat{\mu}^*-\delta_-}{\sigma^*}, \qquad \text{and} \qquad Z_+ = \frac{\hat{\mu}^*-\delta_+}{\sigma^*}. \]
The decision rules are:
- Reject \(H_{0-}\) (of \({\cal T}_-\)): if \(Z_- > z_{0.95}\).
- Reject \(H_{0+}\) (of \({\cal T}_+\)): if \(Z_+ < -z_{0.95} = z_{0.05}\).
These rules are derived based on the extreme values of the null parameter range in each case.
Now recalling the confidence intervals from above, \(\hat{L}^* = \hat{\mu}^* - \sigma^* z_{0.95}\) and \(\hat{U}^* = \hat{\mu}^* + \sigma^* z_{0.95}\), we have,
\[ Z_- > z_{0.95} \quad \iff \quad \hat{\mu}^* > \delta_- + \sigma^* z_{0.95} \quad \iff \quad \hat{L}^* > \delta_- \] and, \[ Z_+ < -z_{0.95} \quad \iff \quad \hat{\mu}^* < \delta_+ - \sigma^* z_{0.95} \quad \iff \quad \hat{U}^* < \delta_+ \]
Thus we see that rejection of both tests is exactly identical to the confidence interval \([\hat{L}^*, \hat{U}^*]\) being contained in \([\delta_-, \delta_+]\).
A simulation example
Here is some example code simulating such a trial. This is not code you would use for an actual bioequivalence analysis; rather, it is code that helps illustrate the key points we discuss in this unit.
First a random trial:
function random_trial(; μ_R = 0.0, μ_T = 0.07, σ = 0.06)
    x_R = rand(Normal(μ_R, σ))
    x_T = rand(Normal(μ_T, σ))
    μ̂ = x_T - x_R
    return μ̂
end
Now, the confidence interval for the trial, and a check of whether the trial passes based on that interval:
function ci_of_trial(μ̂; σ = 0.06)
    z_95 = quantile(Normal(), 0.95)
    return (μ̂ - √(2) * σ * z_95, μ̂ + √(2) * σ * z_95)
end

function ci_passes_trial(ci; δ₊ = 0.22314, δ₋ = -δ₊)
    return δ₋ ≤ ci[1] && ci[2] ≤ δ₊ # true means reject H0 and conclude BE
end
Now, the two one sided hypothesis tests associated with the trial:
function lower_trial_reject(μ̂; σ = 0.06, δ₋ = -0.22314)
    z_95 = quantile(Normal(), 0.95)
    Z₋ = (μ̂ - δ₋) / (√(2) * σ)
    return Z₋ > z_95
end

function upper_trial_reject(μ̂; σ = 0.06, δ₊ = 0.22314)
    z_05 = quantile(Normal(), 0.05)
    Z₊ = (μ̂ - δ₊) / (√(2) * σ)
    return Z₊ < z_05
end
Now this function makes the trial’s decision using the confidence interval approach, and also verifies (with @assert) that the hypothesis test approach is equivalent:
function make_decision(μ̂; σ = 0.06, δ₊ = 0.22314, δ₋ = -δ₊)
    ci = ci_of_trial(μ̂; σ)
    decision = ci_passes_trial(ci; δ₊, δ₋)
    if decision # true
        @assert lower_trial_reject(μ̂; σ, δ₋) && upper_trial_reject(μ̂; σ, δ₊)
    else # false
        @assert lower_trial_reject(μ̂; σ, δ₋) == false ||
                upper_trial_reject(μ̂; σ, δ₊) == false
    end
    return decision
end
Here it is on one arbitrary trial where the decision is to reject \(H_0\):
Random.seed!(0)

μ̂ = random_trial()
@show μ̂
ci = ci_of_trial(μ̂)
@show ci
pass_ci = ci_passes_trial(ci)
@show pass_ci
lower_reject = lower_trial_reject(μ̂)
@show lower_reject
upper_reject = upper_trial_reject(μ̂)
@show upper_reject
decision = make_decision(μ̂)
@show decision;
μ̂ = 0.021457133452423928
ci = (-0.11811332498877689, 0.16102759189362473)
pass_ci = true
lower_reject = true
upper_reject = true
decision = true
And here is another arbitrary one where the decision is not to reject \(H_0\):
Random.seed!(2)

μ̂ = random_trial()
@show μ̂
ci = ci_of_trial(μ̂)
@show ci
pass_ci = ci_passes_trial(ci)
@show pass_ci
lower_reject = lower_trial_reject(μ̂)
@show lower_reject
upper_reject = upper_trial_reject(μ̂)
@show upper_reject
decision = make_decision(μ̂)
@show decision;
μ̂ = 0.17446680750826965
ci = (0.034896349067068844, 0.31403726594947046)
pass_ci = false
lower_reject = true
upper_reject = false
decision = false
Let’s see 10 separate simulated trials:
Random.seed!(0)

sim_trials = [make_decision(random_trial()) for _ = 1:10]
@show sim_trials
@show mean(sim_trials);
sim_trials = Bool[1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
mean(sim_trials) = 0.7
As you can see, in some we reject \(H_0\) (1) and in others we do not (0). The mean of these indicators estimates the probability of rejecting \(H_0\).
Now this code estimates the power as a function of \(\mu_T\), for a given \(\sigma = 0.06\). It is more important to focus on the output table than on the code.
function hypothesis_test_probabilities(σ)
    range_μ_T = sort(union(-0.5:0.05:0.5, [-0.22314, 0.22314]))
    n_μ_T = length(range_μ_T)
    probs = DataFrame(
        μ_T = range_μ_T,
        power = zeros(n_μ_T),
        lower_reject_prob = zeros(n_μ_T),
        upper_reject_prob = zeros(n_μ_T),
    )
    @rtransform!(
        probs,
        :be_case = ((-0.22314 < :μ_T < 0.22314) ? "H₁ (BE)" : "H₀ (Not BE)")
    )
    @select!(probs, :be_case, Not(:be_case))
    for i = 1:nrow(probs)
        μ_T = probs.μ_T[i]
        data = [random_trial(; σ, μ_T) for _ = 1:10^6]
        probs.power[i] = mean(make_decision.(data; σ))
        probs.lower_reject_prob[i] = mean(lower_trial_reject.(data; σ))
        probs.upper_reject_prob[i] = mean(upper_trial_reject.(data; σ))
    end
    return probs
end

simple_table(hypothesis_test_probabilities(0.06))
be_case | μ_T | power | lower_reject_prob | upper_reject_prob |
H₀ (Not BE) | -0.5 | 3.0e-6 | 3.0e-6 | 1 |
H₀ (Not BE) | -0.45 | 6.0e-6 | 6.0e-6 | 1 |
H₀ (Not BE) | -0.4 | 0.0001 | 0.0001 | 1 |
H₀ (Not BE) | -0.35 | 0.0008 | 0.0008 | 1 |
H₀ (Not BE) | -0.3 | 0.00534 | 0.00534 | 1 |
H₀ (Not BE) | -0.25 | 0.0249 | 0.0249 | 1 |
H₀ (Not BE) | -0.223 | 0.0502 | 0.0504 | 1 |
H₁ (BE) | -0.2 | 0.0845 | 0.0849 | 1 |
H₁ (BE) | -0.15 | 0.214 | 0.217 | 0.997 |
H₁ (BE) | -0.1 | 0.408 | 0.423 | 0.985 |
H₁ (BE) | -0.05 | 0.596 | 0.654 | 0.942 |
H₁ (BE) | 0 | 0.675 | 0.837 | 0.838 |
H₁ (BE) | 0.05 | 0.595 | 0.942 | 0.653 |
H₁ (BE) | 0.1 | 0.408 | 0.985 | 0.423 |
H₁ (BE) | 0.15 | 0.214 | 0.997 | 0.217 |
H₁ (BE) | 0.2 | 0.0845 | 1 | 0.0849 |
H₀ (Not BE) | 0.223 | 0.0499 | 1 | 0.05 |
H₀ (Not BE) | 0.25 | 0.0249 | 1 | 0.0249 |
H₀ (Not BE) | 0.3 | 0.00535 | 1 | 0.00536 |
H₀ (Not BE) | 0.35 | 0.000865 | 1 | 0.000865 |
H₀ (Not BE) | 0.4 | 9.8e-5 | 1 | 9.8e-5 |
H₀ (Not BE) | 0.45 | 4.0e-6 | 1 | 4.0e-6 |
H₀ (Not BE) | 0.5 | 1.0e-6 | 1 | 1.0e-6 |
Here are some observations from above:
- Observe that at the limits \(\delta_-\) and \(\delta_+\), the power estimate is approximately \(\alpha = 5\%\).
- Observe that the maximal power is when \(\mu_T = \mu_R\) \((= 0)\), in which case it is about \(67.5\%\).
- Observe that for highly positive \(\mu_T\) the lower test, \({\cal T}_-\) does not have an effect (we reject with probability \(1\)). Similarly, for highly negative \(\mu_T\) the upper test \({\cal T}_+\) rejects with probability \(1\).
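Since the loading block at the top of this unit includes AlgebraOfGraphics and CairoMakie, we can also visualize the estimated power curve. Here is a minimal plotting sketch (standard AlgebraOfGraphics usage; the axis labels and color mapping are our choices for illustration):

probs_006 = hypothesis_test_probabilities(0.06) # re-simulate, or reuse the table above
plt = data(probs_006) * mapping(:μ_T, :power, color = :be_case) * visual(ScatterLines)
draw(plt; axis = (xlabel = "μ_T", ylabel = "estimated power"))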
Now if somehow the variability of the drug products were lower (\(\sigma = 0.04\) instead of \(\sigma = 0.06\)), then the power generally increases:
simple_table(hypothesis_test_probabilities(0.04))
be_case | μ_T | power | lower_reject_prob | upper_reject_prob |
H₀ (Not BE) | -0.5 | 0 | 0 | 1 |
H₀ (Not BE) | -0.45 | 0 | 0 | 1 |
H₀ (Not BE) | -0.4 | 1.0e-6 | 1.0e-6 | 1 |
H₀ (Not BE) | -0.35 | 3.5e-5 | 3.5e-5 | 1 |
H₀ (Not BE) | -0.3 | 0.00135 | 0.00135 | 1 |
H₀ (Not BE) | -0.25 | 0.017 | 0.017 | 1 |
H₀ (Not BE) | -0.223 | 0.0506 | 0.0506 | 1 |
H₁ (BE) | -0.2 | 0.109 | 0.109 | 1 |
H₁ (BE) | -0.15 | 0.363 | 0.363 | 1 |
H₁ (BE) | -0.1 | 0.703 | 0.703 | 1 |
H₁ (BE) | -0.05 | 0.921 | 0.922 | 0.999 |
H₁ (BE) | 0 | 0.978 | 0.989 | 0.989 |
H₁ (BE) | 0.05 | 0.921 | 0.999 | 0.921 |
H₁ (BE) | 0.1 | 0.702 | 1 | 0.702 |
H₁ (BE) | 0.15 | 0.362 | 1 | 0.362 |
H₁ (BE) | 0.2 | 0.108 | 1 | 0.108 |
H₀ (Not BE) | 0.223 | 0.0505 | 1 | 0.0505 |
H₀ (Not BE) | 0.25 | 0.017 | 1 | 0.017 |
H₀ (Not BE) | 0.3 | 0.00136 | 1 | 0.00136 |
H₀ (Not BE) | 0.35 | 3.4e-5 | 1 | 3.4e-5 |
H₀ (Not BE) | 0.4 | 0 | 1 | 0 |
H₀ (Not BE) | 0.45 | 0 | 1 | 0 |
H₀ (Not BE) | 0.5 | 0 | 1 | 0 |
In practice, we obviously do not get to influence \(\sigma\). Instead, we sample more than one subject for each formulation. In fact, we also don’t know \(\sigma\) but rather need to estimate it. This puts us in the realm of the T-test. The good news is that all of the principles from the hypothetical case we considered up to now carry over to the much more realistic setting.
4 A much more realistic case - parallel design
As a data example let us create the same NCA based dataset that we analyzed in the previous two units with a parallel study, this time focusing only on Cmax.
= dataset("nca/crossover_study.csv")
df @subset!(df, :period .== 1)
@rtransform!(df, :sequence = :sequence == "RT" ? "R" : "T")
= read_nca(df; observations = :concentration, group = [:sequence, :period])
pop = run_nca(pop; parameters = [:cmax])
nca_result = nca_result.reportdf; be_analysis_data
Now, let us use the preprocess_be function, as it also automatically log-transforms the data (we’ll also sort by the sequence).
transformed_data = sort(preprocess_be(be_analysis_data, endpoint = :cmax), :sequence)
simple_table(transformed_data)
id | id_within_sequence | sequence | period | formulation | endpoint |
1 | 1 | R | 1 | 'R' | 9.03 |
10 | 2 | R | 1 | 'R' | 8.92 |
2 | 3 | R | 1 | 'R' | 9.08 |
3 | 4 | R | 1 | 'R' | 8.78 |
4 | 5 | R | 1 | 'R' | 8.57 |
5 | 6 | R | 1 | 'R' | 9.08 |
6 | 7 | R | 1 | 'R' | 8.81 |
7 | 8 | R | 1 | 'R' | 9.14 |
8 | 9 | R | 1 | 'R' | 8.69 |
9 | 10 | R | 1 | 'R' | 9.43 |
11 | 1 | T | 1 | 'T' | 8.87 |
12 | 2 | T | 1 | 'T' | 8.68 |
13 | 3 | T | 1 | 'T' | 9.15 |
14 | 4 | T | 1 | 'T' | 9.01 |
15 | 5 | T | 1 | 'T' | 8.65 |
16 | 6 | T | 1 | 'T' | 9.42 |
17 | 7 | T | 1 | 'T' | 8.99 |
18 | 8 | T | 1 | 'T' | 8.76 |
19 | 9 | T | 1 | 'T' | 8.89 |
20 | 10 | T | 1 | 'T' | 8.96 |
Now, similar to the \(Z\), \(Z_-\), and \(Z_+\) statistics that we presented above under the simplified assumptions of a single observation and known variance, we now have T statistics that can be used for confidence intervals and hypothesis tests. They can then be used for the two one sided T-test just as above.
In particular we consider the sample means and sample standard deviations,
x̄_R, s_R = mean(transformed_data.endpoint[1:10]), std(transformed_data.endpoint[1:10])
x̄_T, s_T = mean(transformed_data.endpoint[11:20]), std(transformed_data.endpoint[11:20])
@show x̄_R, s_R
@show x̄_T, s_T;
(x̄_R, s_R) = (8.953467566963166, 0.2510234409844379)
(x̄_T, s_T) = (8.93727699385348, 0.2290728211926693)
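As an aside, the hard-coded row ranges 1:10 and 11:20 above rely on the sorting we performed; here is a sketch of an order-independent alternative using DataFramesMeta’s @by (the column names are taken from the table above):

# Group by formulation instead of indexing rows 1:10 and 11:20
@by(transformed_data, :formulation, :x̄ = mean(:endpoint), :s = std(:endpoint))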
Based on these four summary statistics we can get a lot done! But we need to make some assumptions:
- Normality and independence of the (log-transformed) data.
- The same variability for both R and T.
- The number of observations per formulation is \(n_R\) and \(n_T\), and we denote the observations \(X^R_{i}\) and \(X^T_{i}\) respectively.
A point estimate for the difference in means is,
\[ \hat{\mu}^* = \frac{1}{n_T}\sum_{i=1}^{n_T} X^T_{i} - \frac{1}{n_R}\sum_{i=1}^{n_R} X^R_{i} \]
μ̂ = x̄_T - x̄_R
@show μ̂;
μ̂ = -0.016190573109685857
Based on it, we define the \(T\) statistic:
\[ T = \frac{\hat{\mu}^*}{S_p\sqrt{\frac{1}{n_T} + \frac{1}{n_R}}}. \]
If \(\mu_T = \mu_R\) then it follows a T-distribution with \(d = n_R + n_T - 2\) degrees of freedom. When \(d\) is not too small (e.g. \(d > 40\)) the distribution is very similar to a standard normal distribution. Yet, for small \(d\) it has heavier tails.
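To get a feel for these heavier tails, here is a small illustrative sketch comparing the \(95\%\) quantile of the T-distribution against the standard normal for several degrees of freedom (the chosen values of \(d\) are arbitrary):

for d in (5, 18, 40, 200)
    println("d = $d: t quantile = ", round(quantile(TDist(d), 0.95), digits = 4),
            ", normal quantile = ", round(quantile(Normal(), 0.95), digits = 4))
end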
The denominator of the \(T\) statistic has the pooled standard deviation,
\[ S_p = \sqrt{ \frac{(n_R-1)S_R^2 + (n_T-1)S_T^2}{n_R + n_T - 2} }, \]
where each of \(S_R\) and \(S_T\) are standard deviations
\[ S_R = \sqrt{ \frac{1}{n_R-1} \sum_{i=1}^{n_R} (x^R_{i} - \overline{x}_R)^2 } \qquad \text{and} \qquad S_T = \sqrt{ \frac{1}{n_T-1} \sum_{i=1}^{n_T} (x^T_{i} - \overline{x}_T)^2 }. \]
Let’s compute the pooled standard deviation and the T statistic for our data.
s_p = sqrt((9 * s_R^2 + 9 * s_T^2) / 18)
t_statistic = (x̄_T - x̄_R) / (s_p * √(1 / 10 + 1 / 10))
@show s_p
@show t_statistic;
s_p = 0.24029890275741608
t_statistic = -0.15065912350205451
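The hard-coded 9, 9, and 18 above are \(n_R - 1\), \(n_T - 1\), and \(n_R + n_T - 2\) for \(n_R = n_T = 10\). For clarity, here is a general version following the formulas above (the helper name pooled_t is ours, for illustration):

# General pooled standard deviation and T statistic (per the formulas above)
function pooled_t(x̄_R, s_R, n_R, x̄_T, s_T, n_T)
    s_p = sqrt(((n_R - 1) * s_R^2 + (n_T - 1) * s_T^2) / (n_R + n_T - 2)) # pooled SD
    T = (x̄_T - x̄_R) / (s_p * sqrt(1 / n_T + 1 / n_R)) # T statistic
    return s_p, T
end

pooled_t(x̄_R, s_R, 10, x̄_T, s_T, 10) # reproduces s_p and t_statistic above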
Now a \(90\% = 100 \times (1-\alpha)\%\) confidence interval based on the T statistic,
\[ {\mathbb P}(\hat{L}^* \le \mu_T - \mu_R \le \hat{U}^*) = 0.9, \]
is based on,
\[ \hat{L}^* = \hat{\mu}^* - t_{d, 0.95} S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}} , \qquad \text{and} \qquad \hat{U}^* = \hat{\mu}^* + t_{d, 0.95} S_p \sqrt{\frac{1}{n_T} + \frac{1}{n_R}}. \]
Here \(t_{d, 0.95}\) is the \(95\%\) quantile of the T-distribution with \(d\) degrees of freedom.
t_95 = quantile(TDist(18), 0.95)
@show t_95;
t_95 = 1.7340636066175381
Here is the confidence interval computed explicitly:
ci = (μ̂ - t_95 * s_p * √(1 / 10 + 1 / 10), μ̂ + t_95 * s_p * √(1 / 10 + 1 / 10))
(-0.2025416081295061, 0.1701604619101344)
And here we can see it back on the original scale, as a percentage.
100round.(collect(exp.(ci)), digits = 4) |> println
[81.67, 118.55]
Here is the same confidence interval computed via pumas_be (slight differences are due to rounding implementations).
pumas_be(be_analysis_data, endpoint = :cmax)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (equal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 18 | |||
90% Confidence Interval (%) | [81.66, 118.6] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
Considering other output values, the Variability in this simple model is estimated simply via the standard deviation. You can see it is the same as in the pumas_be output. In other, more complex models, variability estimation is more involved.
σ̂ = round(std(transformed_data.endpoint), digits = 3)
CV = round(100 * sigma2geocv(σ̂), digits = 2)

@show CV, σ̂;
(CV, σ̂) = (23.72, 0.234)
Finally, observe that the Geometric Mean T/R Ratio is simply the transformed (exponentiated) point estimate.
round(100exp(μ̂), digits = 2)
98.39
A variant that does not assume equal variance of both treatments
There is a variant of the T test which does not require us to assume that the variability of R and the variability of T are the same. This is sometimes called the Welch–Satterthwaite T test.
The choice of whether to use this variant or the equal-variance test (presented above) can be based on our willingness to make assumptions about the bioequivalence trial before conducting it. In particular, if we have reason to believe that the variability of R and the variability of T are significantly different, then not assuming equal variance would make sense. In any case, this needs to be specified in the study protocol.
From a user point of view, the non-equal variance case estimates an approximate degrees of freedom based on the standard deviations of both R and T. This is done via a formula as follows:
n_R = 10 # the number of reference observations
n_T = 10 # the number of test observations

# formula for the approximate degrees of freedom
v =
    (s_R^2 / n_R + s_T^2 / n_T)^2 /
    ((s_R^2 / n_R)^2 / (n_R - 1) + (s_T^2 / n_T)^2 / (n_T - 1))
17.851353654711797
Then a T-distribution with this value for the degrees of freedom is used. Note that the value will almost always be non-integer.
The confidence interval is then computed as such:
t_095 = quantile(TDist(v), 0.95)
ci =
    (μ̂ - t_095 * sqrt(s_R^2 / 10 + s_T^2 / 10), μ̂ + t_095 * sqrt(s_R^2 / 10 + s_T^2 / 10))
(-0.20262570982810862, 0.1702445636087369)
In this case, to convert back to the non-log scale we carry out:
100round.(collect(exp.(ci)), digits = 4) |> println
[81.66, 118.56]
Observe that in this specific example the confidence interval is essentially the same as in the equal variance case above, yet in general it need not be. In any case, with Pumas, carrying out such a test is done simply by specifying the keyword argument homogeneity = false. Note that by default it is set to true for parallel designs, and hence we omitted it in the equal variance example above. Here is the output:
pumas_be(be_analysis_data, endpoint = :cmax, homogeneity = false)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (unequal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 17.85 | |||
90% Confidence Interval (%) | [81.65, 118.6] | Pass | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
You can see that the degrees of freedom estimate is 17.85, which up to rounding is exactly what we computed manually above. Note that the actual variability estimate is unaltered.
5 Exploring power for the parallel design
The above bioequivalence hypothesis test passed. Formally what we have done is consider \({\cal T_{\text{TOST}}}\) and reject \(H_0\) using the confidence interval based criteria. From a statistical perspective, this would put us in a position for applying for approval for the test drug. In this case, as producers of the test formulation we have reached our goal, and in parallel consumers of the drug are safe since we have kept a cap on the consumer risk (Type-I error is less than \(5\%\)).
Of course there is also “luck” involved. With a different reality we could have had different observations where \(H_0\) is not rejected and we cannot conclude bioequivalence. Similarly, there is the possibility that the two formulations are actually not bioequivalent (the reality is in \(H_0\)), and in parallel some “bad luck” for the consumers implied that the endpoint observations were close enough. But such “bad luck” is bounded by \(5\%\).
While it is not common to modify \(\alpha\), for pedagogical purposes let’s see what happens when we halve \(\alpha\) from \(5\%\) to \(2.5\%\). Hypothetically this can happen if the regulator decides to be more stringent (cautious) in terms of the consumer’s risk (type-I error). To adjust this with pumas_be, we can use the level keyword argument, which specifies the width of the confidence interval. Note that for type-I error α we need a level of 1 - 2α. (For ease of comparison with the previous example, we keep homogeneity = false):
α = 0.025
pumas_be(be_analysis_data, endpoint = :cmax, homogeneity = false, level = 1 - 2α)
Observation Counts | |
Sequence ╲ Period | 1 |
R | 10 |
T | 10 |
Paradigm: Parallel design | ||||
Model: T test (unequal variance) | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Naive Mean | 7735 | ||
T | Geometric Naive Mean | 7610 | ||
Geometric Mean T/R Ratio (%) | 98.39 | |||
Degrees of Freedom | 17.85 | |||
95% Confidence Interval (%) | [78.49, 123.4] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) = 23.73, σ̂ = 0.234 | | |
Compare the above 95% confidence interval [78.49, 123.4] with the \(90\%\) confidence interval [81.65, 118.6] we had before. You can see that we “paid a price” for requesting more confidence. That price is the wider interval. As a consequence of the lower bound, we do not reject \(H_0\) and thus see Fail.
So say that after such a failure, and carrying out some additional manufacturing and engineering checks, the drug manufacturer still believes that the two formulations are bioequivalent. In this case it is probably time for a new trial, and we are here to help the manufacturer in planning that trial.
It is very possible that we failed because our previous trial did not control the producer’s risk (type-II error); in other words, the power was not high enough. For example, let’s be more conservative than the results of the previous trial and assume that the Geometric Mean Ratio is at \(96\%\) (in our previous trial the point estimate was \(98.39\%\)). Also, let’s assume that the variability (CV) is at \(25\%\); in the previous trial we estimated it as \(23.73\%\). What kind of type-II error do we expect for such a trial? Alternatively, what is the power (the complement of the type-II error)?
To answer such questions we can use the BioequivalencePower package. In particular, let us use the power function from that package. More specific details on power and sample size studies appear in the sequel of this course. At this point let us just carry out basic usage without going into the details. Here we ask Pumas to compute the power:
power(RatioTOST, ED2Parallel, n = 20, CV = 0.25, peθ₀ = 0.96, α = 0.025)
Computed achieved power: 8.465%
HypothesisTestMetaParams Settings: n = 20 α = 0.025
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
Oh no! The result indicates that the power of the test is only \(0.08465\)! This means that the type-II error is over \(90\%\) and hence we have a very high chance of not being able to show bioequivalence. The producer is certainly at risk!
If we try to adjust our assumptions, for example assuming the geometric mean ratio is at \(1.0\) and the CV is better, we get more power:
power(RatioTOST, ED2Parallel, n = 20, CV = 0.2, peθ₀ = 1.0, α = 0.025)
Computed achieved power: 34.247%
HypothesisTestMetaParams Settings: n = 20 α = 0.025
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.2 θ₀ = 1.0 [θ₁, θ₂] = [0.8, 1.25]
Still, with about \(34\%\) power we are not in a good situation. In fact, it is customary to aim for a producer’s risk of about \(20\%\), or a power of \(80\%\).
Now obviously, the elephant in the room is the non-standard \(\alpha = 0.025\). If we remove that parameter, the power function uses the default \(5\%\) type-I error.
power(RatioTOST, ED2Parallel, n = 20, CV = 0.2, peθ₀ = 1.0)
Computed achieved power: 56.384%
HypothesisTestMetaParams Settings: n = 20 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.2 θ₀ = 1.0 [θ₁, θ₂] = [0.8, 1.25]
With this our power is starting to be (a bit more) sensible, but if we go back to the more conservative CV and geometric mean ratio,
power(RatioTOST, ED2Parallel, n = 20, CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 24.08%
HypothesisTestMetaParams Settings: n = 20 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
we are back at a low power value. Now, in practice we are not able to influence the CV or the actual geometric mean ratio; we can only use these values to try and gauge what kind of power to expect. But what we can do is increase the sample size (at a cost to the producer and a potential ethical concern if the sample size is too big). So what happens if we have \(30\) subjects instead of \(20\)?
power(RatioTOST, ED2Parallel, n = 30, CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 51.929%
HypothesisTestMetaParams Settings: n = 30 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
The power is again getting to be sensible, now at around \(50\%\). How big a sample size is in fact needed? For this, instead of searching manually using power, let’s use the samplesize function from the BioequivalencePower package:
samplesize(RatioTOST, ED2Parallel, CV = 0.25, peθ₀ = 0.96)
Determined sample size: n = 50 achieved power = 80.71%
HypothesisTestMetaParams Settings: α = 0.05 target_power = 0.8
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
Well, it turns out that we need \(n=50\) participants, a randomized \(25\) of which will receive the R treatment while the others receive T. In fact, to be safe, since we are planning a trial, we can also consider what will happen if, say, \(3\) drop out of R and \(2\) drop out of T.
power(RatioTOST, ED2Parallel, n = [22, 23], CV = 0.25, peθ₀ = 0.96)
Computed achieved power: 75.86%
HypothesisTestMetaParams Settings: n = [22, 23] α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.96 [θ₁, θ₂] = [0.8, 1.25]
We can see this won’t cause a huge loss of power. Similarly, we can continue the analysis and see what happens if we are wrong about the GMR value and it is at \(0.9\) instead of \(0.96\):
power(RatioTOST, ED2Parallel, n = [22, 23], CV = 0.25, peθ₀ = 0.9)
Computed achieved power: 47.075%
HypothesisTestMetaParams Settings: n = [22, 23] α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.25 θ₀ = 0.9 [θ₁, θ₂] = [0.8, 1.25]
The power drops, but does not vanish.
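To organize such what-if questions, one can sweep over a grid of assumed GMR and CV values and inspect the achieved power for each. Here is a sketch using the same call pattern as above (the grid values are arbitrary illustrations):

# Sketch: power sensitivity over assumed GMR and CV values, keeping n = [22, 23]
for gmr in (0.90, 0.96, 1.00), cv in (0.20, 0.25, 0.30)
    println("GMR = $gmr, CV = $cv:")
    println(power(RatioTOST, ED2Parallel, n = [22, 23], CV = cv, peθ₀ = gmr))
end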
In summary, this type of bioequivalence planning analysis can take place, all with a view to ensuring that the producer’s risk is controlled. Luckily, the consumer’s risk is already controlled via \(\alpha\).
In future units we’ll focus on other aspects of planning and the BioequivalencePower package, and discuss methodologies for making a decision on the design type and the number of subjects in the trial.
6 Conclusion
This unit explores the statistical foundations of bioequivalence testing, focusing on the Two One-Sided T-Test (TOST) approach used in parallel study designs. We begin by establishing the limitations of classical hypothesis testing for bioequivalence, where the null hypothesis assumes equivalence - a problematic formulation that doesn’t adequately control consumer risk (the probability of incorrectly concluding bioequivalence when it doesn’t exist). Instead, bioequivalence studies use TOST, which reformulates the null hypothesis to assume non-equivalence, requiring researchers to demonstrate that the difference between test and reference formulations falls within predefined acceptance limits (typically \(80-125\%\) on the original scale, or \(\pm 0.223\) on the log scale). This approach involves conducting two one-sided hypothesis tests and is equivalent to demonstrating that a \(90\%\) confidence interval for the ratio of geometric means falls entirely within the bioequivalence limits. The unit demonstrates these concepts using both simplified theoretical examples and realistic data analysis, showing how sample size, variability (CV), and the true geometric mean ratio affect statistical power, and concludes with practical guidance on study planning to ensure adequate power while controlling both consumer and producer risks.
7 Unit exercises
Hypothesis Formulation
Using the Two One-Sided T-Test (TOST) approach, formulate the null and alternative hypotheses for a bioequivalence study comparing drug formulations R and T. Assume the bioequivalence limits are defined as \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\).
- Clearly state the null hypothesis \(H_0\).
- Clearly state the alternative hypothesis \(H_1\).
Type-I and Type-II Error Analysis
Define Type-I and Type-II errors in the context of bioequivalence testing. Additionally, relate each type of error to its corresponding consumer’s risk and producer’s risk.
- Explain Type-I error in your own words.
- Explain Type-II error in your own words.
Statistical Power Calculation
A bioequivalence trial estimates a Geometric Mean Ratio (GMR) of 0.95 with a Coefficient of Variation (CV) of 0.2. If the sample size is determined to be \(n = 20\) and the Type-I error rate \(\alpha\) is set at \(0.05\), compute the power of the test using the power function from the BioequivalencePower package.
- Write the code you used for the computation.
- What is the estimated power?
Confidence Interval Interpretation
After performing a bioequivalence study, you calculate a 90% confidence interval for the difference in means, yielding \([0.15, 0.30]\). Based on these results, determine whether the formulations R and T meet the bioequivalence criteria defined by \(\delta_- = -0.22314\) and \(\delta_+ = 0.22314\).
- State your conclusion regarding bioequivalence.
- Justify your conclusion based on the confidence interval.