using Pumas # available with Pumas products
using Bioequivalence # available with Pumas products
using BioequivalencePower # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using DataFramesMeta
using StatsBase
using SummaryTables
using CairoMakie

A Course on Bioequivalence: Unit 8 - Basics of Power Analysis
This is Unit 8 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In this unit we cover the basics of power analysis, extending the introduction to power analysis from Unit 3.
Planning of bioequivalence trials is intimately related to statistical power. Loosely, the power is the probability of succeeding with the trial. More formally, it is the probability of rejecting \(H_0\) when the actual reality is within \(H_1\).
The most important levers that we have when planning a trial are:
- The number of subjects in the trial.
- The design.
With power analysis we see how to vary these choices so as to achieve a desired level of power, all under certain assumptions about the underlying attributes of the reference and test product.
We use the following packages in this unit.
We have used all of these packages in previous units. The key package in the current unit is `BioequivalencePower`. You may also consult the documentation for this package for further insights.
2 Revisiting the basics
Recall from Unit 3 that when we carry out bioequivalence analysis, the type of hypothesis test that we are implementing is the following:
\[ {\cal T_{\text{TOST}}} = \begin{cases} H_{0}:& \mu_T - \mu_R \le \delta_- \qquad \text{or} \qquad \delta_+ \le \mu_T - \mu_R \\[5pt] H_{1}:& \qquad \qquad \delta_- < \mu_T - \mu_R < \delta_+\\ \end{cases} \]
Here \(H_0\) implies not being bioequivalent and \(H_1\) implies being bioequivalent.
Typical values for the (non-statistical) limits are \(\delta_- = \log(0.8) \approx -0.22314\) and \(\delta_+ = \log(1.25) \approx 0.22314\).
The hypothesis test is about the population parameters \(\mu_R\) and \(\mu_T\), their difference, or equivalently about the geometric mean ratio
\[ \text{GMR} = e^{\mu_T - \mu_R}. \]
In Unit 2 we saw the GMR as,
\[ \text{GMR} = \frac{\text{geomean}(x_1^T,\ldots,x_{n_T}^T)}{\text{geomean}(x_1^R,\ldots,x^R_{n_R})} = e^{\text{mean}(\log(x_1^T),\ldots,\log(x_{n_T}^T)) ~-~ \text{mean}(\log(x_1^R),\ldots,\log(x^R_{n_R}))}. \]
Yet this form is an estimator of the GMR based on data, whereas when we carry out power analysis we typically do not use data, but rather consider the theoretical population quantity \(e^{\mu_T - \mu_R}\) or the difference \(\mu_T - \mu_R\).
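As a quick numerical illustration of the identity above, the two forms of the estimator agree. Here is a small sketch with made-up AUC values (the `geomean` function comes from StatsBase, which we have already loaded):

```julia
using StatsBase # geomean, mean

# Hypothetical AUC values for test (T) and reference (R) subjects
xT = [19.2, 19.1, 15.9, 24.8]
xR = [15.0, 15.2, 14.9, 15.3]

gmr_ratio = geomean(xT) / geomean(xR)            # ratio of geometric means
gmr_log   = exp(mean(log.(xT)) - mean(log.(xR))) # exp of the log-mean difference

gmr_ratio ≈ gmr_log  # the two forms are mathematically identical
```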
We should also realize that while the hypothesis test \({\cal T_{\text{TOST}}}\) is not about the CV, the variability (in one of its several forms) is another related population quantity that has an effect.
Alpha, Beta, and Power
Recall from Unit 3 that the decision framework has Type-I errors and Type-II errors as follows:
| | Decision: Retain \(H_0\) | Decision: Reject \(H_0\) |
|---|---|---|
| Reality: \(H_0\) | ✓ (True Negative) | Type I Error (False Positive) (erroneous rejection of \(H_0\)) |
| Reality: \(H_1\) | Type II Error (False Negative) (erroneously failing to reject \(H_0\)) | ✓ (True Positive) |
Type-I errors capture the consumer's risk, while Type-II errors capture the producer's risk.
In particular \(\alpha\) is defined as the maximal probability of Type-I error.
Let us also define \(\beta\) as the probability of a Type-II error; its complement is the power.
Hence in summary:
\[ \begin{align*} \alpha &\ge {\mathbb P}(\text{Type I error}) \\ \beta &= {\mathbb P}(\text{Type II error}) \\ \text{power} &= 1 - \beta \end{align*} \]
In general our goal is to plan a study and a statistical study protocol such that:
- For sure \(\alpha = 0.05\) (consumers are guaranteed to be protected), even if our assumptions about certain parameters such as the GMR and CV were off.
- The power is on the order of \(0.8\) (or \(\beta = 0.2\)), based on assumptions about the GMR and CV.
Consumers are the priority by default. We then need to make sure producers have a good chance of success.
A simple hypothesis test example
To understand the interplay of \(\alpha\) and \(\beta\), let us momentarily digress from the complex \({\cal T}_{\text{TOST}}\) hypothesis and consider a simpler case where we only need to decide about the mean of a normal distribution.
In this very simple case, assume the standard deviation is known at \(\sigma = 2\).
In particular, take this as our simple hypothesis formulation:
\[ {\cal T_{\text{simple}}} = \begin{cases} H_{0}:& \mu = 15 \\[5pt] H_{1}:& \mu = 20\\ \end{cases} \]
While we only present this formulation for pedagogic purposes, we can also link it to a hypothetical example. For example, assume we know that molar dose concentrations from two production batches vary, with a mean of \(15\) for batch \(0\) and a mean of \(20\) for batch \(1\). Assume further that we take a single physical pill and, using some measurement, we need to determine if we are in \(H_0\) or \(H_1\).
We use a threshold \(\tau\) and our rule is that if the measured quantity is greater than \(\tau\) we determine \(H_1\), and otherwise we determine \(H_0\).
In this example assume \(\tau = 18.29\) (specifically set so that \(\alpha = 0.05\)).
For such a threshold we can compute \(\alpha\) and \(\beta\) using the `cdf` and `ccdf` functions from the `Distributions` package. The `cdf` function, standing for "cumulative distribution function", tells us the probability that a quantity is less than or equal to a value. The `ccdf` function, with the extra `c` meaning "complementary", is one minus that quantity and thus gives the probability of being greater than a value.
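To make the complementarity concrete, here is a minimal self-contained check (using only the Distributions package):

```julia
using Distributions

d = Normal(15, 2)
x = 18.29

cdf(d, x) + ccdf(d, x)     # = 1, by definition of "complementary"
ccdf(d, quantile(d, 0.95)) # = 0.05: the upper tail beyond the 95th percentile
```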
```julia
μ₀, μ₁, σ = 15, 20, 2
dist₀, dist₁ = Normal(μ₀, σ), Normal(μ₁, σ)
τ = round(quantile(dist₀, 0.95), digits = 3)
α = round(ccdf(dist₀, τ), digits = 3)
β = round(cdf(dist₁, τ), digits = 3)
println("τ: ", τ)
println("Probability of Type I error: ", α)
println("Probability of Type II error: ", β)
println("Power: ", 1 - β)
```
τ: 18.29
Probability of Type I error: 0.05
Probability of Type II error: 0.196
Power: 0.804
As you can see, these values are aligned with the typical target values of \(\alpha = 0.05\) exactly, and the power being at about \(80\%\).
The nice thing about such a simplistic hypothesis is that it is easy to visualize. Here it is:
```julia
grid = 5:0.1:25
fig, ax = lines(grid, pdf.(dist₀, grid), color = :blue)
h₀grid, h₁grid = τ:0.1:25, 5:0.1:τ+0.1
fillx = collect(h₀grid)
filly = pdf.(dist₀, fillx)
poly!(
    ax,
    [fillx; reverse(fillx)],
    [zeros(length(filly)); reverse(filly)],
    color = (:blue, 0.25),
    strokewidth = 0,
)
lines!(ax, grid, pdf.(dist₁, grid), color = :green)
fillx2 = collect(h₁grid)
filly2 = pdf.(dist₁, fillx2)
poly!(
    ax,
    [fillx2; reverse(fillx2)],
    [zeros(length(filly2)); reverse(filly2)],
    color = (:green, 0.25),
    strokewidth = 0,
)
lines!(ax, [τ, τ], [0, 0.25], color = :red, linewidth = 3)
xlims!(ax, 5, 25)
ylims!(ax, 0, 0.25)
text!(ax, L"\beta", position = (17, 0.015), align = (:center, :center), fontsize = 20)
text!(ax, L"\alpha", position = (19, 0.015), align = (:center, :center), fontsize = 20)
text!(ax, L"\tau", position = (τ + 0.25, 0.24), align = (:center, :center), fontsize = 20)
text!(ax, L"H_0", position = (μ₀, 0.21), align = (:center, :center), fontsize = 20)
text!(ax, L"H_1", position = (μ₁, 0.21), align = (:center, :center), fontsize = 20)
ax.xlabel = "x"
ax.ylabel = "Probability density"
fig
```
As you can see, the power is the area under the \(H_1\) density to the right of the green zone.
With this consider what would happen if the value of \(\mu\) under \(H_1\) would change, for example to \(\mu = 21\), or to \(\mu = 19\). What would this do to the power? How would it change from \(80.4\%\)? Let’s calculate it:
```julia
μ₁ = 19
dist₁ = Normal(μ₁, σ)
β = round(cdf(dist₁, τ), digits = 3)
println("Probability of Type II error: ", β)
println("Power: ", 1 - β)
```
Probability of Type II error: 0.361
Power: 0.639
```julia
μ₁ = 21
dist₁ = Normal(μ₁, σ)
β = round(cdf(dist₁, τ), digits = 3)
println("Probability of Type II error: ", β)
println("Power: ", 1 - β)
```
Probability of Type II error: 0.088
Power: 0.912
Understanding a power curve
Such modifications of the actual \(H_1\) scenario (\(\mu\) above) help create power curves. In general, in the bioequivalence context, we can modify other parameters and settings such as:
- CV
- GMR
- \(\alpha\)
- Sample size
- Design
A power curve is a visual representation of the power as a function of such parameters, assumptions, and settings.
Still, outside the bioequivalence context, let us see this for the simple hypothesis case above, where we vary \(\mu_1\) and also consider various alternative values of \(\sigma\).
\[ {\cal T_{\text{simple}}} = \begin{cases} H_{0}:& \mu = 15 \\[5pt] H_{1}:& \mu = \mu_1\\ \end{cases} \]
Here we calculate and plot the curve:
```julia
function simple_power(μ₁, σ; μ₀ = 15, α = 0.05)
    dist₀, dist₁ = Normal(μ₀, σ), Normal(μ₁, σ)
    τ = round(quantile(dist₀, 0.95), digits = 3)
    α = round(ccdf(dist₀, τ), digits = 3)
    β = round(cdf(dist₁, τ), digits = 3)
    return 1 - β
end
```
```julia
μ₁_range = 15:0.1:25

power_curve = [simple_power(μ₁, 2) for μ₁ in μ₁_range]

fig = Figure()
ax = Axis(
    fig[1, 1],
    xlabel = "μ₁",
    ylabel = "Power",
    title = "Power Curve (σ = 2, μ₀ = 15, α = 0.05)",
    xticks = 15:25,
    yticks = 0:0.1:1,
)
lines!(ax, μ₁_range, power_curve, color = :blue, label = "Power")
xlims!(ax, 15, 25)
ylims!(ax, 0, 1)
fig
```
Observe that here the power will depend on \(\mu_1\). For example, in an extreme case where \(\mu_1=15\) (as in \(H_0\)), the power will be \(\alpha\). Then as \(\mu_1\) is farther to the right from \(15\), the power increases. In the quite extreme case of \(\mu_1 = 25\) there is hardly any overlap between the distribution of \(H_0\) and \(H_1\) and thus the power is almost \(100\%\).
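These two endpoints can be verified directly. Below is a self-contained, unrounded variant of the power computation (using only Distributions; `τ` is recomputed from `α` rather than hard-coded):

```julia
using Distributions

# Power of the simple test as a function of the H₁ mean μ₁ (no rounding)
function simple_power_exact(μ₁; μ₀ = 15, σ = 2, α = 0.05)
    τ = quantile(Normal(μ₀, σ), 1 - α) # threshold calibrated so P(Type I error) = α
    return ccdf(Normal(μ₁, σ), τ)      # power = P(measurement > τ | H₁)
end

simple_power_exact(15) # = 0.05: the power degenerates to α when H₁ coincides with H₀
simple_power_exact(25) # ≈ 0.9996: essentially no overlap between H₀ and H₁
```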
Now say we have a few finite scenarios for \(\sigma\), e.g. \(0.75\), \(2\) (as before), or \(3.5\). Then we can compute the power for each of these cases, where in this case we still present it as a function of \(\mu_1\) on the plot.
```julia
σ_levels = [0.75, 2.0, 3.5]
power_curve_low_var_ = [simple_power(μ₁, σ_levels[1]) for μ₁ in μ₁_range]
power_curve_mid_var_ = [simple_power(μ₁, σ_levels[2]) for μ₁ in μ₁_range]
power_curve_high_var_ = [simple_power(μ₁, σ_levels[3]) for μ₁ in μ₁_range]

fig = Figure()
ax = Axis(
    fig[1, 1],
    xlabel = "μ₁",
    ylabel = "Power",
    title = "Power Curve (various σ values, μ₀ = 15, α = 0.05)",
    xticks = 15:25,
    yticks = 0:0.1:1,
)
lines!(ax, μ₁_range, power_curve_low_var_, color = :green, label = "σ = $(σ_levels[1])")
lines!(ax, μ₁_range, power_curve_mid_var_, color = :blue, label = "σ = $(σ_levels[2])")
lines!(ax, μ₁_range, power_curve_high_var_, color = :red, label = "σ = $(σ_levels[3])")
axislegend(ax, position = :rb)
xlims!(ax, 15, 25)
ylims!(ax, 0, 1)
fig
```
It is obvious that lower \(\sigma\) (less variability) increases the power in this case.
Such analysis can go on and on. When applied to more realistic hypothesis tests, such as the T-test, a critical quantity that affects power is the sample size. In classical statistics books you will find power curves for such tests, and before computers such curves were often used to help choose sample sizes. We now see how this approach works for bioequivalence.
You may consider further aspects of power analysis and general hypothesis tests in the context of the Julia language in Chapter 7 of Nazarathy and Klok (2021).
3 Back to the bioequivalence case
In the bioequivalence case a similar analysis holds; the difference is that the \({\cal T_{\text{TOST}}}\) hypothesis formulation is more complicated, and hence the underlying power curves are more complicated. For this we rely on more complex computations under the hood: sometimes analytical, sometimes numerical, and sometimes Monte Carlo simulation.
All these cases are captured by the `BioequivalencePower` package.
A pilot study
As a simple example, assume we are constrained (e.g. by product-specific guidances) to carry out a \(2 \times 2\) crossover study for a new generic product `T`, which we want to compare to `R`.
Assume that before embarking on the complete study, we carry out a small pilot study which yields this AUC data:
```julia
pilot_study = DataFrame(
    id = [1, 1, 2, 2, 3, 3, 4, 4, 10, 10, 11, 11, 12, 12, 13, 13, 14],
    sequence = vcat(fill("TR", 8), fill("RT", 9)),
    period = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1],
    AUC = vcat(
        [19.2, missing, 19.1, 15.9, 24.8, 14.9, 15.5, 15.3],
        [15.0, 15.2, 14.9, missing, 15.3, 15.2, 15.1, 15.3, 15.2],
    ),
)
simple_table(pilot_study)
```
id | sequence | period | AUC |
1 | TR | 1 | 19.2 |
1 | TR | 2 | missing |
2 | TR | 1 | 19.1 |
2 | TR | 2 | 15.9 |
3 | TR | 1 | 24.8 |
3 | TR | 2 | 14.9 |
4 | TR | 1 | 15.5 |
4 | TR | 2 | 15.3 |
10 | RT | 1 | 15 |
10 | RT | 2 | 15.2 |
11 | RT | 1 | 14.9 |
11 | RT | 2 | missing |
12 | RT | 1 | 15.3 |
12 | RT | 2 | 15.2 |
13 | RT | 1 | 15.1 |
13 | RT | 2 | 15.3 |
14 | RT | 1 | 15.2 |
In this case assume we had \(8\) subjects in each sequence (`RT` and `TR`), but due to complications there were multiple dropouts, so only 3 subjects completed both periods in each of the sequences.
When we run `pumas_be` on the data we get the following:
pumas_be(pilot_study)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 5 | 3 |
TR | 4 | 3 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: AUC | ||||
Formulations: Reference(R), Test(T) | ||||
Results(AUC) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 15.21 | ||
Geometric Naive Mean | 15.2 | |||
T | Geometric Marginal Mean | 17.16 | ||
Geometric Naive Mean | 17.48 | |||
Geometric Mean T/R Ratio (%) | 112.9 | |||
Degrees of Freedom | 4 | |||
90% Confidence Interval (%) | [96.61, 131.8] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | 12.68 | |
 | σ̂ | 0.1263 | |
ANOVA | Formulation (p-value) | 0.1724 | ||
Sequence (p-value) | 0.1528 | |||
Period (p-value) | 0.1919 | |||
The fact that this pilot study fails BE (the GMR confidence interval as a percentage is `[96.61, 131.8]`) is not the key point.
The more important point of value from the pilot study is learning that the GMR point estimate is \(112.9\%\) and that the CV estimate is \(12.68\%\). While these are clearly estimates and do not reflect the exact population values, we get an indication of the kind of GMR and CV to expect.
Such preliminary estimates can also be obtained from other historical studies of the approved drug, and/or similar approvals in other jurisdictions, etc.
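The reported CV and \(\hat{\sigma}\) are linked through the standard lognormal relation \(\text{CV} = \sqrt{e^{\sigma^2} - 1}\). As a quick check (assuming this is the relation used in the report), the two numbers in the output above are consistent:

```julia
sigma_hat = 0.1263              # residual SD on the log scale, from the output above
cv = sqrt(exp(sigma_hat^2) - 1) # lognormal relation between σ and the CV

round(100 * cv, digits = 2)     # ≈ 12.68, the reported CV (%)
```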
Planning the pivotal study
Importantly, we now want to plan a \(2\times 2\) trial that has a low enough producer's risk (no more than \(20\%\)) while, as always, keeping the consumer's risk at no more than \(5\%\).
One approach for this is to consider slightly more conservative estimates of the measured GMR and CV from the pilot study. For example we can take the GMR at \(115\%\) and the CV at \(15\%\).
Now with these assumptions let us rely on the `power_analysis` function from `BioequivalencePower`. In specifying `RatioTOST, ED2x2Crossover` as the first two arguments to `power_analysis`, we are indicating that the study is a crossover design and that the input parameters are in terms of GMRs and CVs, as opposed to differences of means and standard deviations.
Importantly, the specifications `CV = 0.15` and `peθ₀ = 1.15` indicate our assumptions about the CV and the GMR. In particular, notice that this `1.15` value, which log-transforms to \(0.1398\), lies within \(H_1\).
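A one-line check confirms that the assumed GMR sits inside the equivalence region on the log scale:

```julia
lo, hi = log(0.8), log(1.25) # the equivalence limits δ₋ and δ₊ on the log scale

log(1.15)                    # ≈ 0.1398
lo < log(1.15) < hi          # true: the assumed GMR lies within H₁
```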
Note that this format of `power_analysis` is motivated by the popular PowerTOST R package. In fact, Pumas's `BioequivalencePower` provides feature parity with PowerTOST as well as a few more options.
power_analysis(RatioTOST, ED2x2Crossover, CV = 0.15, peθ₀ = 1.15)
Note that we see `θ₀ = 114%` instead of `θ₀ = 115%` due to rounding issues.
Let us outline the essence of this `power_analysis` display.
- We see that the analysis was for a target power of \(80\%\) with `Target = 80%`.
- A key result of the analysis is choosing `n = 42` subjects (this means `21` in each sequence). This choice of subject count will yield a power of \(81\%\), as you can see via `Estimated = 81%`. The target power can be updated as an input setting.
- The minimum acceptable power is set (by default) at \(70\%\). This quantity is used by the sensitivity analysis with respect to the GMR and the CV to check how far those quantities can deviate.
- On the top left display we see part of a power curve as a function of the CV. The curve starts at the input CV of `15%`, where the power is `81%`. Yet we see that if the actual CV is greater, the power decreases. This is plotted up until the power hits the minimum acceptable power of `70%`, at a CV of `17.4%`. We are also presented with the fact that this increase of the CV, as a percentage, is a `+16.3%` increase. So it means we have a "margin" of error on our assumed CV of about `16%`.
- On the top right display we see a similar analysis for the GMR. Here we see that we can have a GMR of up to `116.3%` if we wish to stay above the minimum acceptable power.
- Finally, the bottom left display has a discrete plot as it deals with the sample size (in steps of 2 because there are two sequences in this design). We see that we will stay above the minimum acceptable power if we have up to `10` dropouts, reducing the number of subjects from `42` to `32`.
This type of analysis then allows us to get a bit more certainty about our trial planning. It combines power analysis and some robustness analysis on our parameter assumptions.
Such analysis is not the only approach. A different approach is a Bayesian one, where we assume some distribution on the GMR and CV. See the `BioequivalencePower` docs as a starting point. You may also consider Lv et al. (2023).
Still sticking with this robustness analysis, as another example say we now have a target power of \(90\%\) and we certainly don't want to go below \(80\%\). With this we can use the `power_analysis` function via a slightly more complex interface:
```julia
power_analysis(
    RatioTOST{ED2x2Crossover}(CV = 0.15, peθ₀ = 1.15),
    HypothesisTestMetaParams(target_power = 0.9);
    pac = PowerAnalysisConstants(min_power = 0.8),
)
```
You can see that now \(58\) subjects are recommended and that we can sustain dropouts down to \(42\) subjects, consistent with the computation above.
4 More features of the BioequivalencePower package
The two main functions of the package that we use are `power` and `samplesize`.
- `power` - takes in parameters and details of a study, including the number of subjects and the assumptions on the point in \(H_1\) where we lie, and determines the power we shall obtain.
- `samplesize` - helps determine the sample size we need to achieve a target power.
Here are some examples that illustrate what can be done. The implication of each case should be self-evident.
Basic samplesize and power
Here is a sample size calculation for a parallel trial with CV at \(30\%\) and the GMR at \(93\%\).
samplesize(RatioTOST, ED2Parallel, CV = 0.3, peθ₀ = 0.93)
Determined sample size: n = 96 achieved power = 80.171%
HypothesisTestMetaParams Settings: α = 0.05 target_power = 0.8
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.3 θ₀ = 0.93 [θ₁, θ₂] = [0.8, 1.25]
Since the suggestion is `n = 96` subjects, let us verify that the `power` is indeed at that level with this number of subjects:
power(RatioTOST, ED2Parallel, n = 96, CV = 0.3, peθ₀ = 0.93)
Computed achieved power: 80.171%
HypothesisTestMetaParams Settings: n = 96 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.3 θ₀ = 0.93 [θ₁, θ₂] = [0.8, 1.25]
And if we were to have one less subject in each treatment, hence `n = 94`, observe that the power drops below \(80\%\):
power(RatioTOST, ED2Parallel, n = 94, CV = 0.3, peθ₀ = 0.93)
Computed achieved power: 79.414%
HypothesisTestMetaParams Settings: n = 94 α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.3 θ₀ = 0.93 [θ₁, θ₂] = [0.8, 1.25]
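As a sanity check on these numbers, the TOST power for a parallel design can also be approximated from first principles with a Monte Carlo simulation using only the Distributions package. This is a sketch, not the exact algorithm of `BioequivalencePower` (which here computes the power analytically); the lognormal model, the equal group split, and all helper names below are our own assumptions:

```julia
using Distributions, Random

# Monte Carlo approximation of TOST power for a two-group parallel design
function tost_power_parallel(n; cv = 0.3, gmr = 0.93, α = 0.05, reps = 200_000)
    σ  = sqrt(log(1 + cv^2))        # log-scale SD implied by the CV
    Δ  = log(gmr)                   # true log-difference μ_T - μ_R
    df = n - 2                      # residual degrees of freedom
    se = σ * sqrt(4 / n)            # SE of the difference with n/2 subjects per group
    tc = quantile(TDist(df), 1 - α) # critical value for the 90% confidence interval
    lo, hi = log(0.8), log(1.25)    # equivalence limits
    rng = MersenneTwister(0)
    hits = 0
    for _ in 1:reps
        dhat  = Δ + se * randn(rng)                   # simulated point estimate
        sehat = se * sqrt(rand(rng, Chisq(df)) / df)  # simulated estimated SE
        # BE is concluded when the 90% CI lies strictly inside the limits
        hits += (dhat - tc * sehat > lo) && (dhat + tc * sehat < hi)
    end
    return hits / reps
end

tost_power_parallel(96) # ≈ 0.80, in line with the exact 80.171% above
tost_power_parallel(94) # slightly lower, in line with the 79.414% above
```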
Power for unbalanced design
While sample size calculations always give us numbers for balanced designs (equal number of subjects per sequence), we can query the power in unbalanced cases (as we would sometimes get with dropouts). For example:
power(RatioTOST, ED2Parallel, n = [52, 44], CV = 0.3, peθ₀ = 0.93)
Computed achieved power: 79.925%
HypothesisTestMetaParams Settings: n = [52, 44] α = 0.05
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2Parallel} CV = 0.3 θ₀ = 0.93 [θ₁, θ₂] = [0.8, 1.25]
Seeing the samplesize search
By using `verbose = true`, we can see the steps that the `samplesize` function takes to find a suggested sample size. In this case, let us consider a more complicated design (here the first \(2\) stands for the number of formulations, the next \(4\) for the number of sequences, and the final \(4\) for the number of periods):
```julia
samplesize(
    RatioTOST,
    ED2x4x4ReplicateCrossover,
    peθ₀ = 0.85,
    CV = 0.35,
    target_power = 0.75,
    verbose = true,
)
```
[ Info: n = 88: 0.5094811021600534
[ Info: n = 92: 0.524505955896785
[ Info: n = 96: 0.5391723757631878
[ Info: n = 104: 0.5674425265970613
[ Info: n = 116: 0.6072452349329205
[ Info: n = 140: 0.6779087111438776
[ Info: n = 180: 0.7718052680276569
[ Info: n = 160: 0.7283330721661994
[ Info: Note: There is a non-uniform split: [43, 43, 42, 42]
[ Info: n = 170: 0.7508422488260919
[ Info: Note: There is a non-uniform split: [42, 41, 41, 41]
[ Info: n = 165: 0.7397844130016891
[ Info: Note: There is a non-uniform split: [42, 42, 42, 41]
[ Info: n = 167: 0.7442620616723767
[ Info: n = 168: 0.7465159421565855
[ Info: Note: There is a non-uniform split: [43, 42, 42, 42]
[ Info: n = 169: 0.7486745735975423
Determined sample size: n = 172 achieved power = 75.52%
HypothesisTestMetaParams Settings: α = 0.05 target_power = 0.75
Study: BioequivalencePower.RatioTOST{BioequivalencePower.ED2x4x4ReplicateCrossover} CV = 0.35 θ₀ = 0.85 [θ₁, θ₂] = [0.8, 1.25]
You can see that at some points during the search the number of subjects `n` is not divisible by 4, and hence there is a `non-uniform split` message. Pumas then adjusts the search until finding `n = 172`, which is the lowest multiple of 4 with power greater than \(75\%\).
More on the package
Finally, this function provides an overview of most of the features you can find in the `BioequivalencePower` package.
supported_designs_and_studies()
(designs = 15×8 DataFrame
 Row │ DesignType                 DesignString  NumTreatm ⋯
     │ DataType                   String        Union…    ⋯
─────┼─────────────────────────────────────────────────────
   1 │ ED2Parallel                parallel      2         ⋯
   2 │ ED2x2Crossover             2x2           2
   3 │ ED3x3Crossover             3x3           3
   4 │ ED3x6x3Crossover           3x6x3         3
   5 │ ED4x4Crossover             4x4           4         ⋯
   6 │ ED2x2x3ReplicateCrossover  2x2x3         2
   7 │ ED2x2x4ReplicateCrossover  2x2x4         2
   8 │ ED2x4x4ReplicateCrossover  2x4x4         2
   9 │ ED2x3x3PartialReplicate    2x3x3         2         ⋯
  10 │ ED2x4x2Balaam              2x4x2         2
  11 │ ED2x2x2RepeatedCrossover   2x2x2r        2
  12 │ EDPaired                   paired        1
  13 │ EDGeneralCrossover         A general design used for dose p…  ⋯
  14 │ EDGeneralParallel          A general design used for dose p…
  15 │ EDGeneral                  A general design used for dose p…
 6 columns omitted,
 studies = 15×16 DataFrame
 Row │ StudyType                          ED2Parallel  ED2x2Crossover  ED3x3Cr ⋯
     │ Any                                String       String          String  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ DifferenceTOST                     num|mc       num|mc          num|mc  ⋯
   2 │ RatioTOST                          num|mc       num|mc|slmc     num|mc
   3 │ TwoDifferenceTOSTs                 mc           mc              mc
   4 │ TwoRatioTOSTs                      mc           mc              mc
   5 │ FiducialRatioTOST                  mc           mc              -       ⋯
   6 │ NonInferiorityDifferenceOST        num          num             num
   7 │ NonInferiorityRatioOST             num          num             num
   8 │ ExpandingLimitsRatioTOST           -            -               -
   9 │ ReferenceScaledRatioTOST           -            -               -       ⋯
  10 │ NarrowTherapeuticIndexDrugRatioT…  -            -               -
  11 │ DoseProportionalityStudy           -            -               -
  12 │ (BayesianWrapper, DifferenceTOST)  nume         nume            nume
  13 │ (BayesianWrapper, RatioTOST)       nume         nume            nume    ⋯
  14 │ (BayesianWrapper, NonInferiority…  nume         nume            nume
  15 │ (BayesianWrapper, NonInferiority…  nume         nume            nume
 13 columns omitted,
 legend = 5×2 DataFrame
 Row │ Code    Meaning
     │ String  String
─────┼───────────────────────────────
   1 │ num     Numerical
   2 │ anum    Approximate Numerical
   3 │ nume    Numerical Expectation
   4 │ mc      Monte Carlo
   5 │ slmc    Subject Level Monte Carlo,
 methods = 15×8 DataFrame
 Row │ StudyType                          power   samplesize  pvalue  confint ⋯
     │ Any                                String  String      String  String  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ DifferenceTOST                     yes     yes         yes     yes     ⋯
   2 │ RatioTOST                          yes     yes         yes     yes
   3 │ TwoDifferenceTOSTs                 yes     yes
   4 │ TwoRatioTOSTs                      yes     yes
   5 │ FiducialRatioTOST                  yes     yes         yes             ⋯
   6 │ NonInferiorityDifferenceOST        yes     yes
   7 │ NonInferiorityRatioOST             yes     yes
   8 │ ExpandingLimitsRatioTOST           yes     yes
   9 │ ReferenceScaledRatioTOST           yes     yes                         ⋯
  10 │ NarrowTherapeuticIndexDrugRatioT…  yes     yes
  11 │ DoseProportionalityStudy           yes     yes
  12 │ (BayesianWrapper, DifferenceTOST)  yes     yes
  13 │ (BayesianWrapper, RatioTOST)       yes     yes                         ⋯
  14 │ (BayesianWrapper, NonInferiority…  yes     yes
  15 │ (BayesianWrapper, NonInferiority…  yes     yes
 3 columns omitted,)
As you can see, there are many features and variations with the package. Consult the docs for further information.
5 Conclusion
In this unit, we delved into the fundamentals of power analysis, a critical step in planning any bioequivalence trial. We established that power is the probability of correctly concluding bioequivalence when the test and reference products are indeed equivalent (i.e., rejecting H₀ when H₁ is true). The primary goal of power analysis is to determine the necessary sample size to achieve a desired level of power, typically 80% or 90%, given a set of assumptions.
The key levers at our disposal are the study design and the number of subjects. We saw how these, combined with our assumptions about the true geometric mean ratio (GMR) and coefficient of variation (CV), dictate the expected power of a study. These assumptions are often derived from pilot studies or existing literature.
The `BioequivalencePower` package provides a powerful and user-friendly toolkit for these tasks. We explored how the `samplesize` function directly computes the required number of subjects, the `power` function calculates the power for a given study setup, and the comprehensive `power_analysis` function not only determines the sample size but also provides a valuable sensitivity analysis. This sensitivity analysis shows how robust our study plan is to deviations from our initial GMR and CV assumptions, giving us a crucial margin of safety.
Ultimately, a well-executed power analysis ensures that a study is designed efficiently—large enough to have a high probability of success, but not wastefully large—while always maintaining the strict 5% limit on consumer risk (Type I error).
We also mention that many blog posts on BEBAC (2025) can be a highly valuable resource for understanding power analysis for bioequivalence trials.
6 Unit exercises
Sample Size for Different Designs
A pharmaceutical company is planning a study for a new drug formulation. Based on a pilot study, they assume a GMR of 95% (`peθ₀ = 0.95`) and an intra-subject CV of 20% (`CV = 0.20`). They want to achieve at least 80% power.
- If they plan to use a standard 2x2 crossover design (`ED2x2Crossover`), how many subjects will they need in total? Use the `samplesize` function.
- Due to concerns about potential carryover effects, they are also considering a parallel design (`ED2Parallel`). How many subjects would be needed for this design with the same assumptions?
- Briefly explain the difference in the required sample sizes between the crossover and parallel designs.
The Sensitivity of Power
Imagine a 2x2 crossover study has been planned with 36 subjects. The planners assumed a GMR of 1.05 and a CV of 25%.
- Using the `power` function, calculate the expected power for this study under the planned assumptions.
- After the study is complete, the analysis reveals that the true CV was actually closer to 30%. What would the power have been with this higher variability?
- In another scenario, what if the CV was indeed 25%, but the true GMR was 1.10 (further from 1.00 than assumed)? Would the power be higher or lower than the original calculation in (a)? Calculate it to confirm your intuition.
- What if the GMR was 1.02 (closer to 1.00 than assumed)? Calculate the power in this case. This exercise demonstrates how power is a function of the true (but unknown) parameters.
Interpreting a Full Power Analysis Report
Run the following command to generate a power analysis report:
power_analysis(RatioTOST, ED2x2Crossover, CV = 0.18, peθ₀ = 1.08, target_power = 0.90)
Based on the output, answer the following questions:
- What is the recommended total sample size to achieve the target power of 90%?
- The sensitivity analysis for the CV shows a “Crit. val”. What does this value represent in the context of the study’s power?
- If the true GMR turned out to be 1.10, would the study still have power greater than the minimum acceptable level of 70%? (Hint: Look at the GMR sensitivity plot).
- According to the “Sensitivity to n” plot, what is the smallest number of subjects the study could have (due to dropouts) and still maintain the minimum acceptable power of 70%?
The “Cost” of Power
A sponsor is planning a parallel design study and, based on previous data, is confident that the GMR is 0.97 and the CV is 0.30. They are debating how much power to aim for.
- Use the `samplesize` function to determine the required sample size for 80% power.
- Now, calculate the required sample size for 90% power.
- Finally, calculate the required sample size for 95% power.
- Compare the increase in the number of subjects required to go from 80% to 90% power, versus the increase required to go from 90% to 95% power. What does this tell you about the "cost" of achieving very high levels of power in terms of sample size?