A Course on Bioequivalence: Unit 4 - Different experimental designs

Authors

Yoni Nazarathy

Andreas Noack

This is Unit 4 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.

1 Unit overview

In the previous unit we considered statistical inference concepts and focused on the hypothesis test formulation for bioequivalence. In this unit we explore various experimental designs used in bioequivalence trials. In particular we consider:

  • The concept of a multi-period trial and an experimental design.
  • Randomization, blindness, and open-label.
  • The very popular \(2\times 2\) (sometimes called the \(2\times 2\times 2\)) crossover design.
  • An exploration of product specific guidances.
  • Other popular experimental designs including partial replicate, fully replicate, and multiple treatment designs.
  • Considerations for multiple reference (comparator) products and multiple test products.
  • Multiple comparisons and adjustment of \(\alpha\).
  • General statistical considerations for choosing an experimental design.

We use the following packages.

using Pumas # available with Pumas products - also reexports Distributions
using Bioequivalence  # available with Pumas products
using PharmaDatasets # available with Pumas products
using CairoMakie
using AlgebraOfGraphics
using DataFrames
using DataFramesMeta
using SummaryTables
using StatsBase
using Random
using MixedModels

We have used these packages in the previous units. One new package we use in this unit is MixedModels, which we use for random effects models.

2 The concept of an experimental design

Our focus up to now has been on the case where we break the subjects into cohorts of R and T, and then administer the respective dosages to each of the cohorts. This is the very basic parallel design.

However, there are other experimental designs that are more efficient and more powerful, and that enable detecting and estimating additional quantities.

In particular, the parallel design has only a single period in the sense that at the onset of the trial participants are administered a single formulation, and once that treatment is complete, the trial is over. What if we broke up the participants’ trial journey into multiple periods, and in each successive period administered a different formulation to the subject?

As a first basic example of more than one period, assume we have this AUC endpoint dataset arising from a hypothetical bioequivalence trial:

be_data_paired = DataFrame(
    id = vcat(fill.(1:5, 2)...),
    sequence = fill("TR", 10),
    period = repeat(1:2, 5),
    AUC = [21.5, 24.6, 22.5, 28.4, 23.8, 22.1, 19.6, 21.4, 19.9, 21.5],
)
simple_table(be_data_paired)
id sequence period AUC
1 TR 1 21.5
1 TR 2 24.6
2 TR 1 22.5
2 TR 2 28.4
3 TR 1 23.8
3 TR 2 22.1
4 TR 1 19.6
4 TR 2 21.4
5 TR 1 19.9
5 TR 2 21.5

As you can see there are 5 subjects in this trial with ids 1,…,5. In contrast to a parallel trial, each subject appears in two rows, once for period 1 and once for period 2. This is sometimes called a paired design because each observation of T on a subject and the corresponding observation of R on the subject form a pair.

There is also a sequence column. In this case there is only a single sequence and it is TR for all subjects. This implies that each subject is first administered the test treatment T in the first period and is then administered the reference treatment R in the second period.

Let us compare and contrast the design we see with be_data_paired with the simpler parallel design that we considered up to now. For clarity let us create be_data_parallel which has the same number of AUC endpoint measurements.

be_data_parallel = DataFrame(
    id = 1:10,
    group = vcat(fill("T", 5), fill("R", 5)),
    period = 1,
    AUC = [21.1, 21.6, 22.8, 19.6, 23.2, 18.8, 19.9, 21.7, 22.3, 23.2],
)
simple_table(be_data_parallel)
id group period AUC
1 T 1 21.1
2 T 1 21.6
3 T 1 22.8
4 T 1 19.6
5 T 1 23.2
6 R 1 18.8
7 R 1 19.9
8 R 1 21.7
9 R 1 22.3
10 R 1 23.2

Here are some points contrasting the two designs, focusing on the datasets that represent them:

  1. The number of subjects in be_data_paired is half of that of be_data_parallel. This is obvious since in the two-period design yielding be_data_paired we “reuse” a subject and administer R after we administer T.
  2. The duration of the trial for be_data_paired is typically at least twice as long as the trial for be_data_parallel. This is because each subject has to first receive formulation T. Then after the first period we need a washout period giving the body time to clear the active ingredient. Then in the second period the subject receives formulation R. Note that the duration of the washout period can sometimes be extensive and is determined by the half-life of the product in question (a common rule of thumb is a washout of at least around five half-lives).
  3. In case administration of the formulation affects a subject in the long term (or beyond a washout period), be_data_paired will have the second period measurements affected by the first period.
  4. With be_data_parallel, variability between T and R cannot be isolated from variability between subjects. In a parallel design some of the variation we encounter is due to natural differences between subjects. In contrast, with be_data_paired, since each subject is administered both T and R, by considering the difference between formulations on each subject, the measurements are potentially less influenced by the variability between subjects. Such pairing is a simple form of blocking, which is a key concept in experimental design. It is an approach to remove undesirable variability from the measurements.

Point (1) is often an obvious difference and is a plus of multi-period designs. Having fewer subjects is more ethical and reduces costs. In contrast, point (2) is a plus for the one-period parallel design. It is often a key consideration if the needed washout period is long. Point (3) occurs if the washout period is “essentially infinite”. This would be a key consideration in trials that consider outcomes such as curing of a condition. It is less of a consideration in bioequivalence trials. Finally, the most important issue is point (4). From a statistical point of view, isolating subject effects is often critical. This point implies that the power of multi-period designs is often superior to the power of equivalent single period (parallel) designs.
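To make point (4) concrete, here is a minimal sketch using the toy datasets be_data_paired and be_data_parallel from above. For the paired data we form within-subject log-differences, which cancel out each subject’s own level; for the parallel data the T vs R contrast remains entangled with subject-to-subject differences:

# A minimal sketch contrasting the two comparisons (toy data from above).
# Within-subject comparison: each subject's own level cancels out.
paired = @combine(groupby(be_data_paired, :id),
    :log_TR_diff = log(first(:AUC)) - log(last(:AUC))) # T (period 1) minus R (period 2)
std(paired.log_TR_diff)

# Between-group comparison: the T vs R contrast still carries
# natural subject-to-subject differences.
log_T = log.(@rsubset(be_data_parallel, :group == "T").AUC)
log_R = log.(@rsubset(be_data_parallel, :group == "R").AUC)
mean(log_T) - mean(log_R)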

In summary there are several reasons to use more than one period (points (1) and (4) above), as well as some potential complications (points (2) and (3) above). We now consider the most popular kind of multi-period design, the \(2 \times 2\) crossover design.

3 Overarching principles: randomization and blindness/open-label

A crucial principle that underlies all sound experimental designs is randomization.

Randomization is the process of assigning subjects to treatment groups (e.g., T or R in a parallel design). It requires using a formal, probabilistic mechanism, such as a coin flip or, more commonly, a computer-generated random sequence. Without randomization, we risk introducing bias. For instance, an investigator might subconsciously assign healthier-looking subjects to one group, or the first subjects to enroll in a trial might be systematically different from those who enroll later. Randomization breaks any systematic connection between the characteristics of the subjects and the treatment they receive, ensuring that, on average, the groups are comparable in all respects, both known and unknown.

To see an example of randomization, say that we have \(20\) subjects that we want to partition into \(4\) groups. For example both the reference and test formulations with fasting and without. We can use the sample function from StatsBase with the replace=false option. Here is example code:

num_subjects = 20
group_names = ["R fasting", "T fasting", "R fed", "T fed"]
num_groups = length(group_names)
subjects_per_group = num_subjects ÷ num_groups # ÷ \div + [TAB], integer division
subjects = 1:20

Random.seed!(0)
shuffled_subjects = sample(subjects, length(subjects); replace = false)
groups = Iterators.partition(shuffled_subjects, subjects_per_group)

for (i, group) in enumerate(groups)
    println(group_names[i], ": ", sort(group))
end
R fasting: [3, 5, 9, 15, 18]
T fasting: [7, 8, 10, 12, 17]
R fed: [4, 6, 11, 16, 20]
T fed: [1, 2, 13, 14, 19]

Note that the Random.seed!(0) statement ensures the same randomization occurs every time we execute this code. The “seed” 0 can be replaced with any other number and that would imply a different randomization.
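For instance, here is a small sketch with a different (arbitrary) seed, yielding a different but equally valid allocation. It reuses subjects and subjects_per_group from above:

# A sketch: an alternative seed produces an alternative randomization.
Random.seed!(42) # an arbitrary alternative seed
reshuffled = sample(subjects, length(subjects); replace = false)
alt_groups = collect(Iterators.partition(reshuffled, subjects_per_group))
sort(alt_groups[1]) # the "R fasting" group under this alternative randomization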

A related, but distinct, concept is blinding.

Blinding refers to the practice of concealing the treatment assignment from one or more parties involved in the trial. In a single-blind trial, the subjects are unaware of which treatment they are receiving. In a double-blind trial, considered the gold standard in many clinical trials, neither the subjects nor the investigators and their staff know the treatment assignments. This prevents knowledge of the treatment from influencing subjects’ reporting of effects or investigators’ assessment of outcomes. When all parties are aware of the treatment assignments, the trial is called open-label.

While double-blinding is critical for trials with subjective endpoints (like pain relief or mood), bioequivalence studies are typically conducted in an open-label fashion.

This is because the primary endpoints, such as AUC and Cmax, are objective pharmacokinetic measurements derived from laboratory analysis. These values are not susceptible to patient or investigator subjectivity.

However, even in an open-label setting, proper randomization of subjects to their assigned treatments or sequences is non-negotiable, as it remains the fundamental safeguard against bias and ensures the statistical validity of the trial’s conclusions.

4 The \(2 \times 2\) crossover design

The design that yields the be_data_paired data above has the disadvantage that all subjects were first administered T and only later they were administered R. This can cause a bias in the results in case the T treatment affects the R treatment that follows, or in case conditions during the first period are significantly different in comparison to the second period.

To alleviate such potential biases, the \(2 \times 2\) crossover design (sometimes also called \(2 \times 2 \times 2\)) makes use of the notion of a sequence. There are \(2\) sequences, RT and TR. The first sequence implies that subjects receive R in the first period and T in the second period. Similarly, the second sequence is the opposite as subjects receive T in the first period and R in the second. Note that the phrase \(2 \times 2\) stands for “2 sequences” and “2 periods”. Further, when called \(2 \times 2 \times 2\), the additional “2” stands for the number of formulations.

Here is a tabular representation of this design:

Sequence Period 1 Period 2
RT R T
TR T R

As with any trial, the study needs to be randomized. Recruited subjects are allocated either to the RT sequence or the TR sequence. Now, the same benefits as in the paired design that created be_data_paired are achieved. Further, it is possible to test and rule out period effects since in each period we have subjects from both treatments.
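As a small illustration of such an allocation, here is a sketch with 12 hypothetical subjects, reusing the sample-based approach from the randomization section above:

# A sketch: randomly allocate 12 hypothetical subjects to the RT and TR sequences.
Random.seed!(1)
subject_ids = 1:12
shuffled_ids = sample(subject_ids, length(subject_ids); replace = false)
rt_group = sort(shuffled_ids[1:6])  # administered R in period 1, T in period 2
tr_group = sort(shuffled_ids[7:12]) # administered T in period 1, R in period 2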

This design is undoubtedly the most common and most recommended design to use in a bioequivalence study. Parallel designs are typically only recommended in cases where there is a long washout period. Other, more complex, designs described below are typically reserved for the specific cases that we describe.

Here is example endpoint data from a \(2 \times 2\) crossover study. We display only the first \(8\) rows.

be_data = dataset(joinpath("bioequivalence", "RT_TR", "PJ2017_3_1.csv"))
simple_table(first(be_data, 8))
id sequence period AUC Cmax
1 RT 1 2849 499
1 RT 2 2230 436
2 TR 1 2025 438
2 TR 2 2000 361
3 TR 1 2090 535
3 TR 2 1826 558
4 RT 1 2790 733
4 RT 2 2864 416

We can use the detect_design function from the Bioequivalence package. Pumas also uses this function internally to automatically determine the type of design and study paradigm from the data.

detect_design(be_data)
(design = "RT|TR",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'R',
 test_formulations = ['T'],
 replicated = non_replicated,
 supports_rsabe = false,
 crossover = true,)

You can see that the design string is RT|TR, capturing the two sequences. You can further see that num_sequences = 2, num_periods = 2, and num_formulations = 2, all in agreement with the \(2 \times 2 \times 2\) naming convention. Other information returned by detect_design also describes this design. For example, you can see that Pumas identifies that the reference formulation is R, and so forth.
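Since the returned value is a named tuple, its fields can also be queried programmatically. A small sketch (the variable dd is just for illustration):

# A sketch: using detect_design fields programmatically.
dd = detect_design(be_data)
dd.design                                     # "RT|TR"
dd.num_sequences == dd.num_formulations == 2  # true for this 2×2 design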

Now carrying out bioequivalence analysis via pumas_be, and considering AUC by default, we obtain:

pumas_be(be_data)
Observation Counts
Sequence ╲ Period 1 2
RT 17 17
TR 15 15
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 1879
Geometric Naive Mean 1887
T Geometric Marginal Mean 1848
Geometric Naive Mean 1845
Geometric Mean T/R Ratio (%) 98.36
Degrees of Freedom 30
90% Confidence Interval (%) [94.07, 102.8] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 10.52 | 0.1049
ANOVA Formulation (p-value) 0.5334
Sequence (p-value) 0.1818
Period (p-value) 0.00235

We’ll analyze elements of this output in further detail in subsequent units, but let us highlight a few points now.

  • Observe the Observation Counts table at the start of the output. As is evident there were 17 subjects in the RT sequence and 15 subjects in the TR sequence. This imbalance is often not planned but can rather occur due to dropouts.
  • Observe that the Paradigm is Non replicated crossover design. We discuss replication (non replicated, partially replicated, and fully replicated designs) below.
  • Pumas chooses a Linear Model for such a design. We discuss this in the next unit.
  • Observe that we obtain estimates of the Marginal Mean both for R and for T. These are model based estimates, and they may differ from the data (Naive) estimates in case of unbalanced data (as we have here).
  • The Geometric Mean T/R Ratio is 98.36 as a percentage. This is obtained from the marginal means.
round(100 * 1848 / 1879, digits = 2) # value slightly differs from output due to rounding
98.35
  • A model based confidence interval of [94.07, 102.8] is obtained and since it is well contained within the 80–125 margin, we Pass.
  • Note that the model uses 30 degrees of freedom. This quantity is related to the variability used to generate the confidence interval. It is a property of the linear model used (see next unit) and the size of the data.
  • The variability estimate is at a CV of 10.52 as a percentage. More on this estimation is in the next unit. The associated standard deviation estimate is 0.1049.
  • Since a statistical model is used, we are able to investigate if there are any unwanted effects in the model. In particular, Type-III p-values are computed for Formulation, Sequence, and Period. When a p-value is non-small we consider the effect as insignificant and when it is small we consider the effect as significant. In this case, since we see a Period p-value of 0.00235, there may be reason to believe that an (unwanted) period effect is taking place.

Note: p-values are a deceptive quantity. As a rule of thumb, if a p-value is below an order of \(5\%\) (or sometimes \(10\%\)) then it is a fairly clear indication that the underlying factor has an effect. In particular this holds when p-values are very small. For example above, we see that the period almost certainly has an effect because the p-value of 0.00235 is very small. However, if p-values are greater than \(10\%\) (or sometimes \(5\%\)) then their exact value is not really important. So the p-value of 0.1818 for Sequence above should be treated exactly the same as the p-value of 0.5334 for Formulation above. In particular, if there is no effect, the p-value is expected to be uniformly distributed anywhere between \(0\%\) and \(100\%\). Thus we should never speak about “high p-values”. We can rather speak about “very low p-values” (as in the case of 0.00235) or “marginally low p-values” (as would have been the case with a p-value of around \(5\%\) to \(10\%\)).
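To see the uniformity of p-values under no effect in action, here is a small simulation sketch, separate from the bioequivalence workflow itself. It computes two-sided two-sample t-test p-values by hand (using Distributions, which Pumas reexports) when both groups come from the same distribution:

# A sketch: under H0 (no difference) the p-value is uniform on [0, 1].
Random.seed!(3)
function two_sample_pvalue(x, y)
    nx, ny = length(x), length(y)
    pooled_var = ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
    t_stat = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return 2 * ccdf(TDist(nx + ny - 2), abs(t_stat)) # two-sided p-value
end
pvals = [two_sample_pvalue(randn(10), randn(10)) for _ = 1:10_000]
mean(pvals .< 0.05) # ≈ 0.05: a "significant" result occurs only about 5% of the time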

5 Exploring example product specific guidances

Now that we know a bit about parallel and crossover designs, let us carry out a short exploration of FDA’s website for Product-Specific Guidances (PSG) for Generic Drug Development: FDA (2025). As of July 2025 there are 2,288 PSGs listed. To briefly explore a small subset of these PSGs, we look at a few arbitrary products.

For this, since the FDA lists the PSGs alphabetically based on the active ingredient, we consider the first PSG entry for each of the first 8 letters A–H. You can click the URL to see the PDF of each PSG with its various guidelines. Below we summarize key points:

  • Abacavir sulfate: 2 by 2 crossover design on healthy males and non-pregnant, non-lactating females.
  • Bacitracin: This is an ointment. A comparative study is recommended.
  • Cabotegravir: parallel design on males and non-pregnant, non-lactating females.
  • Dabigatran etexilate mesylate: Both a fasting and a fed study is recommended, with a 2-treatment, 2-sequence, 4-period, fully replicated crossover design on normal healthy males and females.
  • Econazole nitrate: This is a cream. The design is a clinical endpoint randomized, double blind, parallel, placebo-controlled study.
  • Famciclovir: 2 by 2 crossover design on healthy males and non-pregnant, non-lactating females.
  • Gabapentin: 2 by 2 crossover design on healthy males and non-pregnant, non-lactating females.
  • Halcinonide: This is a cream. The FDA recommends two studies, a pilot study and a pivotal study.

This gives a taste of the fact that \(2 \times 2\) crossover studies are common. It also motivates looking at fully replicated crossover studies as described below. As is typical for creams and ointments, comparative studies are recommended. In contrast to the bioequivalence studies we cover here, which are based on endpoints, comparative studies typically consider the clinical effects of the product. They are mostly out of the scope of this course.

Note that we present these here only for pedagogic purposes; if conducting a trial for such products you should consult directly with the FDA or the appropriate regulatory agency.

6 Other Designs

So far we have seen the parallel design (single period) and the \(2 \times 2\) crossover design. These are the two most common alternatives. Now that we are aware of the way designs are described based on sequences, periods, and formulations we may consider other designs as well. In particular we may consider designs with more than two sequences, more than two periods, and even more than two formulations.

Let us take the approach of surveying some of the common designs via the example datasets supplied with the PharmaDatasets package. In general, with that package, the datasets function returns a list of all of the datasets. Those datasets associated with bioequivalence endpoints have the prefix "bioequivalence", so we can do some string processing to extract the design type which follows "bioequivalence". This is the list:

unique([split(ds, "/")[2] for ds in filter(contains("bioequivalence"), datasets())])
15-element Vector{SubString{String}}:
 "R_T"
 "RR_RT_TR_TT"
 "RTRT_TRTR"
 "RRT_RTR_TRR"
 "RT_TR"
 "RTTR_TRRT"
 "RTT_TRR"
 "RTR_TRT"
 "RTR_TRR"
 "RR_TT"
 "RST_RTS_SRT_STR_TRS_TSR"
 "RRTT_TTRR"
 "RTRT_RTTR_TRRT_TRTR"
 "RRTT_RTTR_TRRT_TTRR"
 "ADBC_BACD_CBDA_DCAB"

In this case we use strings such as "R_T", "RR_RT_TR_TT", etc. to specify a design. Let us understand how to read this encoding. For example, the datasets with "RT_TR" are for the \(2 \times 2\) crossover designs that we covered above. In all these specifications we typically use R for the reference and T for the test as usual.

As you can see, certain designs have 3 or 4 formulations. This is evident with "RST_RTS_SRT_STR_TRS_TSR" and "ADBC_BACD_CBDA_DCAB". When it comes to such designs with multiple formulations, one approach is to consider the design as a specification comparing a single reference to multiple tests. With this approach, if we consider "RST_RTS_SRT_STR_TRS_TSR", then R is still the reference and S and T are two alternative test formulations.

In the second case, "ADBC_BACD_CBDA_DCAB", formulation A is considered the reference and B, C, and D are considered test formulations. In any case (specific to the Bioequivalence package), we consider the earliest letter in the alphabet as the reference formulation. We discuss more issues associated with multiple formulations below.
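As a small sketch, this encoding convention can be parsed with a few lines of string processing (the variable names here are just for illustration):

# A sketch: reading a design string into sequences and formulations.
design_string = "RST_RTS_SRT_STR_TRS_TSR"
sequences = split(design_string, "_")        # ["RST", "RTS", ...]
formulations = sort(unique(join(sequences))) # ['R', 'S', 'T']
reference = first(formulations)              # earliest letter: 'R'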

As an illustration let us take an arbitrary dataset from each design group available in PharmaDatasets. We then print the design information for each dataset and specify the design as

\[ \text{Num Sequences} \times \text{Num Periods} \times \text{Num Formulations} \]

design_types =
    unique([split(ds, "/")[2] for ds in filter(contains("bioequivalence"), datasets())])
for design_type in design_types
    dataset_name = first(filter(contains("bioequivalence/$design_type"), datasets()))
    be_data = dataset(dataset_name)
    dd = detect_design(be_data)

    println("Dataset name: $dataset_name")
    println(
        "\tDesign: $(dd.design),  $(dd.num_sequences)×$(dd.num_periods)×$(dd.num_formulations)",
    )
end
Dataset name: bioequivalence/R_T/FSL2015_7
    Design: R|T,  2×1×2
Dataset name: bioequivalence/RR_RT_TR_TT/CL2009_9_2_1
    Design: RR|RT|TR|TT,  4×2×2
Dataset name: bioequivalence/RTRT_TRTR/PJ2017_6
    Design: RTRT|TRTR,  2×4×2
Dataset name: bioequivalence/RRT_RTR_TRR/SLTGSF2020_DS07
    Design: RRT|RTR|TRR,  3×3×2
Dataset name: bioequivalence/RT_TR/PJ2017_3_12
    Design: RT|TR,  2×2×2
Dataset name: bioequivalence/RTTR_TRRT/CL2009_9_4_1
    Design: RTTR|TRRT,  2×4×2
Dataset name: bioequivalence/RTT_TRR/SLTGSF2020_DS10
    Design: RTT|TRR,  2×3×2
Dataset name: bioequivalence/RTR_TRT/SLTGSF2020_DS03
    Design: RTR|TRT,  2×3×2
Dataset name: bioequivalence/RTR_TRR/SLTGSF2020_DS22
    Design: RTR|TRR,  2×3×2
Dataset name: bioequivalence/RR_TT/From_PJ2017_4_2
    Design: RR|TT,  2×2×2
Dataset name: bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5
    Design: RST|RTS|SRT|STR|TRS|TSR,  6×3×3
Dataset name: bioequivalence/RRTT_TTRR/SLTGSF2020_DS28
    Design: RRTT|TTRR,  2×4×2
Dataset name: bioequivalence/RTRT_RTTR_TRRT_TRTR/SLTGSF2020_DS23
    Design: RTRT|RTTR|TRRT|TRTR,  4×4×2
Dataset name: bioequivalence/RRTT_RTTR_TRRT_TTRR/SLTGSF2020_DS24
    Design: RRTT|RTTR|TRRT|TTRR,  4×4×2
Dataset name: bioequivalence/ADBC_BACD_CBDA_DCAB/PJ2017_4_6
    Design: ADBC|BACD|CBDA|DCAB,  4×4×4

Let us now consider a few of these designs and their uses.

7 Within subject variability and between subject variability

When we measure a drug’s effect, the results are never identical. This variation, or variability, comes from two main sources: differences between subjects, and differences within a subject. Understanding these sources is crucial for designing an efficient study. The variability of a product, typically quantified via the CV, is a key property that requires estimation, characterizes the nature of the product, and requires consideration when planning a study.

  • Between-subject variability (BSV), or inter-subject variability, describes how different people respond to the same drug. It arises from stable differences between individuals, such as genetics, body weight, metabolism, and organ function. This is why one person might consistently have higher AUC values than another, regardless of the formulation they receive.

  • Within-subject variability (WSV), or intra-subject variability, describes the random, unpredictable fluctuations in a single person’s response. If the same person takes the exact same drug on two different occasions, their AUC values will still differ slightly. This is due to day-to-day physiological changes, minor differences in diet or activity, and measurement error. This variability is an inherent property of the drug and its interaction with a person’s body over time.

A toy example

To get a feel for these two forms of variability let us create a toy model of concentration over time. This is not a well established pharmacokinetic model but rather an illustrative conceptual example.

Assume we have 4 subjects, which always start with a (normalized) concentration of 1.0 at onset. Then assume that at every time instance the concentration decreases when multiplied by a subject specific subject_average_concentration_loss_per_time parameter. These are the parameters:

subject_average_concentration_loss_per_time = [0.78, 0.87, 0.88, 0.91]
num_subjects = length(subject_average_concentration_loss_per_time);

As you can see this constant varies across subjects, ranging from 0.78 for the first subject up to 0.91 for the fourth subject. This can be viewed as some inherent between subject variation.

Let us also assume that at every time step, the concentration for a subject is updated by this inherent factor of the subject multiplied by a slight variation, which is a uniform random quantity between 0.85 and 1.15. These random variations then work together to create the within subject variability.

Hence the simulate_subject! function, which we now create, takes a DataFrame (df) and adds to it the trajectory of such a subject. This continues until the subject leaves the trial, when the concentration drops below 0.05 (5%).

function simulate_subject!(df, id, subj_per_counter)
    concentration = 1.0 # normalized concentration at onset
    time = 0
    while concentration > 0.05 # subject leaves the trial once below 5%
        push!(df, (id, subj_per_counter, time, concentration))
        # subject-specific average loss, perturbed by a uniform factor in [0.85, 1.15]
        concentration *=
            subject_average_concentration_loss_per_time[id] * rand(Uniform(0.85, 1.15))
        time += 1
    end
end;

We can now simulate this (toy) model for all four subjects, where we also assume that each subject is administered the formulation over \(3\) periods:

df = DataFrame(id = Int[], subj_per = Int[], time = Int[], concentration = Float64[])
subj_per_counter = 1
for id = 1:num_subjects
    for _ = 1:3
        simulate_subject!(df, id, subj_per_counter)
        subj_per_counter += 1
    end
end

Note that in this toy example there is only a single formulation, say R. Thus, in a sense, the design of this example is:

Sequence Period 1 Period 2 Period 3
RRR R R R

We now plot the results:

df.id = categorical(df.id)
df.subj_per = categorical(df.subj_per)
plt =
    data(df) *
    mapping(:time, :concentration, group = :subj_per, color = :id) *
    visual(Lines)
draw(
    plt,
    axis = (
        xlabel = "Time",
        ylabel = "Concentration",
        title = "Concentration-Time Profiles",
    ),
)

By summing up each of the profiles along the time axis, we get a rough form of an AUC endpoint (here each time step was \(1\), e.g. one hour). We also carry out a log transformation. Note the composition log ∘ sum, where typing \circ followed by the TAB key yields ∘. Here are these endpoints organized by subject id, where each row is another period.

auc_data = select(
    combine(groupby(df, [:id, :subj_per]), :concentration => log ∘ sum => :auc),
    [:id, :auc],
)
wide_df = DataFrame(
    [g.auc for g in groupby(auc_data, :id)],
    "id: " .* string.(unique(auc_data.id)),
)
simple_table(wide_df)
id: 1 id: 2 id: 3 id: 4
1.33 1.84 1.83 2.26
1.33 1.84 2.19 2.3
1.4 1.82 1.73 2.32

Now just by considering the values above you see that within each subject group (column) there is some variability, which arises from the different replications for the subject. This is the within subject variability. Further, across the various columns there is also variability, which is the between subject variability.

Estimating the WSV and BSV from the toy model data

Without getting into the details at this point, let us use a linear mixed model from the MixedModels package to estimate the WSV and BSV.

The approach we take is sometimes called a one-way random effects ANOVA model. In our case it can be represented as:

\[ \text{AUC}_{ij} = \mu + b_j + w_{ij}, \qquad i = 1,\ldots,3, \qquad j = 1,\ldots,4, \]

where \(i\) is the period index and \(j\) is the subject index. Here \(\mu\) is some overall mean, and it is assumed that \(b_j\) and \(w_{ij}\) are independent (normally distributed) random quantities with a mean of zero. The variability of \(b_j\) is the BSV and the variability of \(w_{ij}\) is the WSV.

A mixed model allows us to estimate these variabilities. We do not discuss the full details at this point:

VarCorr(lmm(@formula(auc ~ 1 + (1 | id)), auc_data, REML = true))
Column Variance Std.Dev
id (Intercept) 0.145379 0.381286
Residual 0.015066 0.122742

The rightmost column of the above table presents estimates of the standard deviations of \(b_j\) and \(w_{ij}\) respectively. Converting these values to CVs we obtain:

BSV = round(sigma2geocv(0.381286), digits = 4)
WSV = round(sigma2geocv(0.122742), digits = 4)
@show BSV
@show WSV;
BSV = 0.3956
WSV = 0.1232
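The sigma2geocv function used above converts a log-scale standard deviation \(\sigma\) to a geometric CV via the standard relation \(\text{CV} = \sqrt{e^{\sigma^2} - 1}\). As a quick sketch, we can verify this relation by hand (manual_geocv is a local helper defined here for illustration):

# A sketch verifying the σ-to-geometric-CV relation used by sigma2geocv.
manual_geocv(σ) = sqrt(exp(σ^2) - 1)
manual_geocv(0.381286) # ≈ 0.3956, matching BSV above
manual_geocv(0.122742) # ≈ 0.1232, matching WSV above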

While this toy model and analysis are not indicative of the type of analysis we would do for actual bioequivalence trials, we still see here that we can estimate the variabilities. In particular, in certain settings it is important to understand the WSV as it indicates the nature of the product in question. For example, when the WSV (as a CV) is greater than \(30\%\), drugs are typically called highly variable.
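Inverting the same relation gives the log-scale threshold corresponding to a \(30\%\) CV. As a sketch (manual_sigma is again a local helper for illustration), note how the result matches the 0.294 value appearing in the reference scaling outputs later in this unit:

# A sketch: the inverse relation σ = sqrt(log(1 + CV²)) at the HVD threshold.
manual_sigma(cv) = sqrt(log(1 + cv^2))
manual_sigma(0.30) # ≈ 0.294, the σ matching a 30% CV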

Design considerations for estimation of variability

A key advantage of crossover designs over parallel designs is that they can account for the large, systematic differences between subjects. By comparing the test and reference formulations within the same person, we effectively remove much of the between-subject variability from the analysis.

However, a standard 2×2 crossover design only allows us to estimate a single, combined measure of the within-subject variability. It cannot distinguish whether the test formulation is inherently more or less variable than the reference formulation.

To estimate the within-subject variability for each formulation separately, we need to administer the same formulation to a subject more than once. This is the primary motivation for replicate designs. Such designs are also useful for drugs that are classified as highly variable drugs (HVD), where there is high within subject variability. They are also useful for narrow therapeutic index drugs (NTID). Both of these topics are covered in detail in further units.

8 Replicate designs

A replicate design is a design where a formulation is administered more than once to a subject. There are two categories here:

  • A fully replicate design has both (or all) formulations replicated.
  • A partial replicate design has only one of the formulations replicated. Note that in this case, it is typically the reference formulation which is replicated.

Fully replicate design

This is a common fully replicate design:

Sequence Period 1 Period 2 Period 3 Period 4
TRTR T R T R
RTRT R T R T

Observe that in each of the sequences TRTR and RTRT, we administer both formulations R and T. Since there are two measurements of each formulation per subject, with this design we can estimate the within subject variability both for R and for T.

In fact, this design is recommended for NTID products as in FDA (2012), as well as for HVD products as in FDA (2011).

With four periods there is obviously more of a chance of encountering dropouts, since subjects are required to participate in the study for longer durations. Nevertheless, such designs generally require a smaller study size than two period designs.

A different variation of this \(2 \times 4 \times 2\) design assigns formulations in a slightly different order than above:

Sequence Period 1 Period 2 Period 3 Period 4
RTTR R T T R
TRRT T R R T

Observe that no matter what, in both of these variations, subjects from each sequence receive both formulations twice, and each period exhibits both formulations.

For illustration here is pumas_be applied to both of these variants of the \(2 \times 4 \times 2\) design (click the tabs to switch between the designs):

be_data = dataset("bioequivalence/RTRT_TRTR/SLTGSF2020_DS12")
pumas_be(be_data, endpoint = :PK)
Observation Counts
Sequence ╲ Period 1 2 3 4
RTRT 38 38 36 37
TRTR 39 38 34 38
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: Standard ABE
Endpoint: PK
Formulations: Reference(R), Test(T)
Results(PK) Assessment Criteria
R Geometric Marginal Mean 0.7645
Geometric Naive Mean 0.7588
T Geometric Marginal Mean 0.9105
Geometric Naive Mean 0.883
Geometric Mean T/R Ratio (%) 119.1
Degrees of Freedom 72.68
90% Confidence Interval (%) [89.13, 159.2] Fail CI ⊆ [80, 125]
Variability CVᵣ (%) | σ̂ᵣ 221.5 | 1.3327
CVₜ (%) | σ̂ₜ 288.9 | 1.495
Variability Ratio (%) 112.2
ANOVA Formulation (p-value) 0.3184
Sequence (p-value) 0.2292
Period (p-value) 0
be_data = dataset("bioequivalence/RTTR_TRRT/CL2009_9_4_1")
pumas_be(be_data, endpoint = :AUC)
Observation Counts
Sequence ╲ Period 1 2 3 4
RTTR 5 5 5 5
TRRT 5 5 4 4
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 65.41
Geometric Naive Mean 65.1
T Geometric Marginal Mean 76.33
Geometric Naive Mean 76.73
Geometric Mean T/R Ratio (%) 116.7
Degrees of Freedom 19.57
90% Confidence Interval (%) [98.19, 138.8] Fail CI ⊆ [80, 125]
Variability CVᵣ (%) | σ̂ᵣ 39.62 | 0.3818
CVₜ (%) | σ̂ₜ 32.0 | 0.3122
Variability Ratio (%) 81.77
ANOVA Formulation (p-value) 0.1386
Sequence (p-value) 0.2507
Period (p-value) 0.02269

As you can see, in both cases the output now presents estimates of CVᵣ and CVₜ for the within subject variability of the reference and test product respectively. Such estimates are not attainable without replication. Also note that the model is a Mixed model (unequal variance). We discuss these issues in future units.

The FDA_HighlyVariable and FDA_NarrowTherapeuticIndex specifications

Future units are focused on the topics of highly variable drugs, narrow therapeutic index drugs, and reference scaling approaches. At this point let us just appreciate that we often (but not always) use such designs with reference scaling, which is an approach to bioequivalence used for highly variable drugs and narrow therapeutic index drugs.

At this point, appreciate that to carry out such an analysis with pumas_be, we specify FDA_HighlyVariable or FDA_NarrowTherapeuticIndex as the second argument. Here is what the output looks like (click the tabs to switch between FDA_HighlyVariable and FDA_NarrowTherapeuticIndex).

be_data = dataset("bioequivalence/RTRT_TRTR/SLTGSF2020_DS12")
pumas_be(be_data, endpoint = :PK, FDA_HighlyVariable)
Observation Counts
Sequence ╲ Period 1 2 3 4
RTRT 38 38 36 37
TRTR 39 38 34 38
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: FDA RSABE for HV
Endpoint: PK
Formulations: Reference(R), Test(T)
Results(PK) Assessment Criteria
R Geometric Marginal Mean 0.7645
Geometric Naive Mean 0.7588
T Geometric Marginal Mean 0.9105
Geometric Naive Mean 0.883
Geometric Mean T/R Ratio (%) 119.1 Pass GMR ∈ [80, 125]
Degrees of Freedom 72.68
90% Confidence Interval (%) [89.13, 159.2]
Variability CVᵣ (%) | σ̂ᵣ 221.5 | 1.3327 OK for RS CVᵣ ≥ Minimal CVᵣ
CVₜ (%) | σ̂ₜ 288.9 | 1.495
Variability Ratio (%) 112.2
ANOVA Formulation (p-value) 0.3184
Sequence (p-value) 0.2292
Period (p-value) 0
Reference Scaling Params Reference Scaling Constant 0.7967
Minimal CVᵣ for Reference Scaling (%) 30.0 | 0.294
Reference Scaling Analysis Geometric Mean T/R Ratio (%) 131.4
Standard Error (Log Scale) 0.1777
90% Confidence Interval (%) [97.69, 176.7]
Degrees of Freedom 71
Howe's Approx RSABE Stat (95%) -0.9468 Pass ≤ 0
be_data = dataset("bioequivalence/RTTR_TRRT/CL2009_9_4_1")
pumas_be(be_data, endpoint = :AUC, FDA_NarrowTherapeuticIndex)
Observation Counts
Sequence ╲ Period 1 2 3 4
RTTR 5 5 5 5
TRRT 5 5 4 4
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: FDA RSABE for NTI
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 65.41
Geometric Naive Mean 65.1
T Geometric Marginal Mean 76.33
Geometric Naive Mean 76.73
Geometric Mean T/R Ratio (%) 116.7
Degrees of Freedom 19.57
90% Confidence Interval (%) [98.19, 138.8] Fail CI ⊆ [80, 125]
Variability CVᵣ (%) | σ̂ᵣ 39.62 | 0.3818
CVₜ (%) | σ̂ₜ 32.0 | 0.3122
Variability Ratio (%) 81.77
ANOVA Formulation (p-value) 0.1386
Sequence (p-value) 0.2507
Period (p-value) 0.02269
Reference Scaling Params Reference Scaling Constant 1.11
Reference Scaling Analysis Geometric Mean T/R Ratio (%) 118.4
Standard Error (Log Scale) 0.0649
90% Confidence Interval (%) [104.8, 134]
Degrees of Freedom 7
Howe's Approx RSABE Stat (95%) -0.03577 Fail ≤ 0
Variability Ratio Quantile (95%) 1.592 Pass ≤ 2.5

We’ll analyze these approaches and the complete output arising from such specifications in future units.

Partial replicate design

Sometimes, our key motivation for a replicate design is estimating the within subject variability of the reference, while the within subject variability of the test product is not as critical.

For this we can use a partial replicate design. Here, for example, is such a design:

Sequence Period 1 Period 2 Period 3
RRT R R T
RTR R T R
TRR T R R

Here is pumas_be applied to data from such a design.

be_data = dataset("bioequivalence/RRT_RTR_TRR/SLTGSF2020_DS04")
pumas_be(be_data, endpoint = :PK)
Observation Counts
Sequence ╲ Period 1 2 3
RRT 17 17 17
RTR 17 17 17
TRR 17 17 17
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: Standard ABE
Endpoint: PK
Formulations: Reference(R), Test(T)
Results(PK) Assessment Criteria
R Geometric Marginal Mean 41.58
Geometric Naive Mean 41.58
T Geometric Marginal Mean 57.06
Geometric Naive Mean 57.06
Geometric Mean T/R Ratio (%) 137.2
Degrees of Freedom 48.95
90% Confidence Interval (%) [118.7, 158.6] Fail CI ⊆ [80, 125]
Variability CVᵣ (%) | σ̂ᵣ 61.96 | 0.57
ANOVA Formulation (p-value) 0.0006
Sequence (p-value) 0.1543
Period (p-value) 0.8618

As you can see the estimated variability is CVᵣ, which stands for the within subject variability of the reference R.

Again, we can use FDA_HighlyVariable for reference scaling; details are in future units.

pumas_be(be_data, FDA_HighlyVariable, endpoint = :PK)
Observation Counts
Sequence ╲ Period 1 2 3
RRT 17 17 17
RTR 17 17 17
TRR 17 17 17
Paradigm: Replicated crossover that supports reference scaling
Model: Mixed model (unequal variance)
Criteria: FDA RSABE for HV
Endpoint: PK
Formulations: Reference(R), Test(T)
Results(PK) Assessment Criteria
R Geometric Marginal Mean 41.58
Geometric Naive Mean 41.58
T Geometric Marginal Mean 57.06
Geometric Naive Mean 57.06
Geometric Mean T/R Ratio (%) 137.2 Fail GMR ∈ [80, 125]
Degrees of Freedom 48.95
90% Confidence Interval (%) [118.7, 158.6]
Variability CVᵣ (%) | σ̂ᵣ 61.96 | 0.57 OK for RS CVᵣ ≥ Minimal CVᵣ
ANOVA Formulation (p-value) 0.0006
Sequence (p-value) 0.1543
Period (p-value) 0.8618
Reference Scaling Params Reference Scaling Constant 0.7967
Minimal CVᵣ for Reference Scaling (%) 30.0 | 0.294
Reference Scaling Analysis Geometric Mean T/R Ratio (%) 137.2
Standard Error (Log Scale) 0.0866
90% Confidence Interval (%) [118.7, 158.7]
Degrees of Freedom 48
Howe's Approx RSABE Stat (95%) -0.02774 Pass ≤ 0

Note that we cannot use FDA_NarrowTherapeuticIndex as it requires a fully replicate design. Also we can only use FDA_HighlyVariable with certain partial replicate designs. For example, look at the supports_rsabe field for these two designs:

be_data = dataset("bioequivalence/RRT_RTR_TRR/SLTGSF2020_DS04")
detect_design(be_data)
(design = "RRT|RTR|TRR",
 num_sequences = 3,
 num_periods = 3,
 num_formulations = 2,
 reference_formulation = 'R',
 test_formulations = ['T'],
 replicated = partially_replicated,
 supports_rsabe = true,
 crossover = true,)
be_data = dataset("bioequivalence/RTR_TRR/SLTGSF2020_DS22")
detect_design(be_data)
(design = "RTR|TRR",
 num_sequences = 2,
 num_periods = 3,
 num_formulations = 2,
 reference_formulation = 'R',
 test_formulations = ['T'],
 replicated = partially_replicated,
 supports_rsabe = false,
 crossover = true,)

Repeated (replicated) parallel design

Another type of replicated design is a parallel design of the form:

Sequence Period 1 Period 2
RR R R
TT T T
be_data = dataset(joinpath("bioequivalence", "RR_TT", "From_PJ2017_4_2"));
design = detect_design(be_data)
(design = "RR|TT",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'R',
 test_formulations = ['T'],
 replicated = fully_replicated,
 supports_rsabe = false,
 crossover = false,)

While this design does not have some of the benefits of crossover designs, it does allow us to estimate the within subject variability both for R and for T:

pumas_be(be_data, endpoint = :PK)
Observation Counts
Sequence ╲ Period 1 2
RR 34 35
TT 38 38
Paradigm: Parallel design
Model: T test (equal variance)
Criteria: Standard ABE
Endpoint: PK
Formulations: Reference(R), Test(T)
Results(PK) Assessment Criteria
R Geometric Naive Mean 3.72
T Geometric Naive Mean 3.891
Geometric Mean T/R Ratio (%) 104.6
Degrees of Freedom 143
90% Confidence Interval (%) [95.25, 114.9] Pass CI ⊆ [80, 125]
Variability CVᵣ (%) | σ̂ᵣ 13.29 | 0.1323
CVₜ (%) | σ̂ₜ 8.14 | 0.0813

As you can see both CVᵣ and CVₜ are estimated. A drawback, obviously, is that subject effects cannot be factored out, since a subject is administered either R or T.

9 Multiple formulations (more than 2)

Some studies have more than two formulations. We encode these situations as 1RnT, nR1T, and nRnT (these are not industry-standard encodings but are useful for the discussion here). Here are the types of cases:

  • Multiple test formulations (1RnT): In this case there is a single reference and multiple tests (e.g. n=2 or n=3). As an example, this can be a situation where we consider multiple dosage forms of the test formulation and want to show that they are all bioequivalent to the reference product.
  • Multiple reference formulations (nR1T): In this case we want to compare a single test product to two or more references. For example in one trial we want to compare the test product to both a European reference and a US reference (n=2).
  • Combinations of test and reference (nRnT): In this case we want to have more than one test product and more than one reference product. For example there are two dosage levels (low and high) and a test and reference is tested on each dosage (n=2, and thus there are 4 formulations in total).

Often, such scenarios are carried out in a single bioequivalence trial where the design accounts for the multiple formulations. For example, in the 1RnT case the reference is administered as one formulation, and each of the test products is another formulation. In the nR1T case the test is administered as one formulation, and each of the reference products is another formulation. Finally in the nRnT case, typically with n=2, there would be four formulations, two of which are reference formulations and the other two are the matching test formulations.

Three formulations design

In the case of three formulations, R, S, and T, one popular design is:

Sequence Period 1 Period 2 Period 3
RST R S T
RTS R T S
SRT S R T
STR S T R
TRS T R S
TSR T S R

Observe that there are \(6\) possible permutations of the three formulations R, S, and T, and this design uses each possible permutation as a sequence.

  • If we consider this design for the 1RnT case then it is sensible to encode R as the reference formulation while the test formulations are S and T.
  • If we consider this design for the nR1T case then it is sensible to encode T as the test formulation while the reference formulations are R and S.

Four formulations design

In the case of four formulations, denoted A, B, C, and D, with A the reference, a common design is the following:

Sequence Period 1 Period 2 Period 3 Period 4
ADBC A D B C
BACD B A C D
CBDA C B D A
DCAB D C A B

This is a Williams design, also known as a balanced Latin square crossover design.

This design ensures that (see the code sketch after this list):

  • Each formulation appears exactly once in each period and each sequence. This balances period and sequence effects.
  • Each formulation both precedes and follows every other formulation the same number of times across all sequences. This distributes potential first-order carryover effects evenly.
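Both properties can be checked directly in code. Here is a small sketch using the four sequences above (countmap comes from StatsBase):

# A sketch verifying the Williams design balance properties.
williams_sequences = ["ADBC", "BACD", "CBDA", "DCAB"]

# Property 1: each formulation appears exactly once in each period.
all(allunique(getindex.(williams_sequences, p)) for p in 1:4) # true

# Property 2: every ordered (previous, next) pair occurs equally often.
carryover_pairs = [(s[p], s[p+1]) for s in williams_sequences for p in 1:3]
countmap(carryover_pairs) # all 12 ordered pairs, each appearing exactly once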

There can be multiple interpretations:

  • If we consider this design for the 1RnT case then it is sensible to encode A as the reference formulation while the test formulations are B, C, and D.
  • If we consider this design for the nR1T case then it is sensible to encode A as the test formulation while the reference formulations are B, C, and D.
  • If we consider this design for the nRnT case, then we can consider A as a reference compared to the test B, and C as a reference compared to the test D.

Multiple comparisons

In each case where there are multiple formulations there are also multiple comparisons. This means that we carry out multiple hypothesis tests. Each such hypothesis test considers a different pair of formulations and has its own null hypothesis \(H_0\) (non-bioequivalence) and alternative hypothesis \(H_1\) (bioequivalence holds).

As an example for n=2:

  • 1RnT: We compare the single reference formulation (R) to each of the test formulations (S and T). So we have:

\[ {\cal T}_{\texttt{comparison 1}} = \begin{cases} H_0: & \texttt{R vs T}~\text{not bioequivalent},\\ H_1: & \texttt{R vs T}~\text{bioequivalent}. \end{cases} \]

\[ {\cal T}_{\texttt{comparison 2}} = \begin{cases} H_0: & \texttt{R vs S}~\text{not bioequivalent},\\ H_1: & \texttt{R vs S}~\text{bioequivalent}. \end{cases} \]

  • nR1T: We compare the single test formulation (T) to multiple reference formulations (R and S). So we have:

\[ {\cal T}_{\texttt{comparison 1}} = \begin{cases} H_0: & \texttt{R vs T}~\text{not bioequivalent},\\ H_1: & \texttt{R vs T}~\text{bioequivalent}. \end{cases} \]

\[ {\cal T}_{\texttt{comparison 2}} = \begin{cases} H_0: & \texttt{S vs T}~\text{not bioequivalent},\\ H_1: & \texttt{S vs T}~\text{bioequivalent}. \end{cases} \]

  • nRnT: We compare reference A to test B and reference C to test D. Hence for each reference formulation there is an \(H_0\) and an \(H_1\). So we have:

\[ {\cal T}_{\texttt{comparison 1}} = \begin{cases} H_0: & \texttt{A vs B}~\text{not bioequivalent},\\ H_1: & \texttt{A vs B}~\text{bioequivalent}. \end{cases} \]

\[ {\cal T}_{\texttt{comparison 2}} = \begin{cases} H_0: & \texttt{C vs D}~\text{not bioequivalent},\\ H_1: & \texttt{C vs D}~\text{bioequivalent}. \end{cases} \]

Alpha (\(\alpha\)) adjustment

Recalling that hypothesis tests control the Type-I error via \(\alpha\), if we carry out multiple comparisons, each with its own \(\alpha\), there is a possibility of increasing the probability of Type-I error.

As a hypothetical example assume we are in the 1RnT case with \(n=10\).

This is hypothetical because we will rarely try more than three test formulations in a single trial. In this case, say bioequivalence does not hold for any of the test formulations. That is, \(H_0\) holds for all 10 comparisons. Still, let us see the probability of wrongfully rejecting at least one of the hypotheses and declaring one of the formulations bioequivalent:

  • Under \(H_0\) there is a probability of \(\alpha\) to reject \(H_0\) wrongfully for a specific test formulation.
  • Thus there is a probability of \(1-\alpha\) of not making a Type-I error for that test formulation.
  • If we assume the tests are independent, then there is a probability of \((1-\alpha)^{10}\) of not making any Type-I error.
  • Hence the probability of making at least one Type-I error is \(1-(1-\alpha)^{10}\).

Let us see this for the typical \(\alpha = 5\%\):

α = 0.05
consumer_risk = 1 - (1 - α)^10
0.4012630607616213

Hence the consumer’s risk from that trial is inflated from \(5\%\) to about \(40\%\)!

Now in practice we won’t carry out n=10 comparisons but rather n=2 or n=3. Still in that case:

consumer_risk_2_comparisons = round(1 - (1 - α)^2, digits = 4)
consumer_risk_3_comparisons = round(1 - (1 - α)^3, digits = 4);
@show consumer_risk_2_comparisons
@show consumer_risk_3_comparisons;
consumer_risk_2_comparisons = 0.0975
consumer_risk_3_comparisons = 0.1426

Hence we are still at risk of inflating the probability of some Type-I error to either about \(9.8\%\) or about \(14.3\%\).

A simple approach for mitigating this increase in the consumer’s risk is the Bonferroni correction, where we adjust \(\alpha\) by dividing it by the number of comparisons.

See how this approach works for n=2, n=3, and (the hypothetical case of) n=10:

consumer_risk_2_comparisons_adjusted = round(1 - (1 - α / 2)^2, digits = 6)
consumer_risk_3_comparisons_adjusted = round(1 - (1 - α / 3)^3, digits = 6)
consumer_risk_10_comparisons_adjusted = round(1 - (1 - α / 10)^10, digits = 6)
@show consumer_risk_2_comparisons_adjusted
@show consumer_risk_3_comparisons_adjusted
@show consumer_risk_10_comparisons_adjusted;
consumer_risk_2_comparisons_adjusted = 0.049375
consumer_risk_3_comparisons_adjusted = 0.049171
consumer_risk_10_comparisons_adjusted = 0.04889

So, you can see that by adjusting alpha we are able to keep the consumer risk roughly at \(5\%\).

Note however that such adjustments to \(\alpha\) reduce the power of each of the tests. Hence while we are able to keep the consumer’s risk roughly fixed, we end up increasing the producer’s risk. More on this in future units where we discuss power analysis.

The ICH guidance specifically suggests when to adjust \(\alpha\) and when not to. For this consider Section 2.2.5 in International council for harmonisation of technical requirements for pharmaceuticals for human use (2024):

For the case of nR1T (this is called “Multiple Comparator Products” in that guidance) it is recommended not to carry out an adjustment:

In studies with multiple comparator products, multiplicity correction, i.e., alpha adjustment, is not needed because comparator products are considered independent and region-specific. Decisions will be made independently about a test product relative to a single comparator product within a single jurisdiction.

However, in the case of 1RnT (“multiple test products”) it is suggested:

The need to apply multiplicity correction in pivotal trials depends on the underlying objectives of the trial:

  a. If the objective is to achieve BE for all test formulations vs. the comparator product, no alpha adjustment is needed.
  b. If the objective is to show BE for any of the test formulations, multiplicity (alpha) adjustment may be needed.

In any case, as with any design choice and statistical methodology choice:

The objective of the trial and method for multiplicity correction should be pre-specified in the study protocol.

Models and data use

In each of the cases 1RnT, nR1T, and nRnT we have a choice of whether to use a single statistical model based on all the collected data of the design, or to break up the data into separate models. As an example, consider 1RnT with n=2: we can compare the reference formulation to each of the test formulations based on a single model, or we can separate the data and consider two separate models.

That is, we can have one model that generates all the comparisons (in linear models these are sometimes called contrasts), or we can create a different model for each comparison. The one model approach has the benefit of using all data together and thus potentially enables better estimates and higher statistical power. However, this can also be a drawback, because comparisons between two formulations are then also influenced by a third (or even fourth) formulation that is not relevant for that comparison.

While there are several approaches, our general suggested approach here is:

  • 1RnT: Use a single model comprised of all data. Pumas’ pumas_be function supports this.
  • nR1T: Use separate models, with each model using only the data relevant to one comparison, converting it to a \(2 \times 2\) crossover design. Then carry out the analysis for each filtered dataset separately.
  • nRnT: Similarly, separate out the models, with each model converting the data to a \(2 \times 2\) design, and then carry out the analysis for each filtered dataset separately.

Let us now see some examples.

A 1RnT example with n=2

We’ll use the bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5 dataset from the PharmaDatasets package.

This data is from Example 4.5 of Patterson and Jones (2017). Note that in that book it is treated as an nR1T case, in contrast to our example below. Let’s first load the dataset:

be_data = dataset("bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5");

Now let’s apply pumas_be, which considers the study as 1RnT by default. Here, following the ICH guidance above, let’s assume we are in case (b) where the objective is “to show BE for any of the test formulations”. In this case let’s carry out \(\alpha\) adjustment. In pumas_be this is done via the level keyword argument (not to be confused with the alpha and level_y keyword arguments, which have other uses).

With a Bonferroni correction, since we carry out two comparisons, we want to halve \(\alpha\) from \(5\%\) to \(2.5\%\). So the hypothesis tests are to be carried out with \(\alpha = 0.025\).

However, the level keyword is for the width of the confidence intervals. As you can recall from Unit 3, the idea of carrying out a TOST is based on a \(100\times(1-2\alpha)\%\) confidence interval. Hence the confidence interval width we want is \(95\%\) (in contrast to the default \(90\%\)). Thus we set level = 0.95.
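In code, the relationship between the adjusted \(\alpha\) and the confidence level is simply:

# A sketch: Bonferroni-adjusted α and the corresponding TOST confidence level.
num_comparisons = 2
α_adjusted = 0.05 / num_comparisons # 0.025
ci_level = 1 - 2 * α_adjusted       # 0.95, passed below as level = 0.95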

Let us see it both for AUC and Cmax (click on the tabs to switch):

pumas_be(be_data, level = 0.95)
Observation Counts
Sequence ╲ Period 1 2 3
RST 8 8 8
RTS 11 11 11
SRT 11 11 11
STR 10 10 10
TRS 10 10 10
TSR 10 10 10
Paradigm: 3 formulations
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(S and T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 3808
Geometric Naive Mean 3824
S Geometric Marginal Mean 5346
Geometric Naive Mean 5412
Geometric Mean S/R Ratio (%) 140.4
Degrees of Freedom 114
95% Confidence Interval (%) [130.1, 151.5] Fail CI ⊆ [80, 125]
T Geometric Marginal Mean 4426
Geometric Naive Mean 4426
Geometric Mean T/R Ratio (%) 116.2
Degrees of Freedom 114
95% Confidence Interval (%) [107.7, 125.5] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 21.22 | 0.2099
pumas_be(be_data, level = 0.95, endpoint = :Cmax)
Observation Counts
Sequence ╲ Period 1 2 3
RST 9 9 8
RTS 11 11 11
SRT 11 11 11
STR 10 10 10
TRS 11 11 10
TSR 10 10 10
Paradigm: 3 formulations
Model: Linear model
Criteria: Standard ABE
Endpoint: Cmax
Formulations: Reference(R), Test(S and T)
Results(Cmax) Assessment Criteria
R Geometric Marginal Mean 832.4
Geometric Naive Mean 837.5
S Geometric Marginal Mean 1327
Geometric Naive Mean 1340
Geometric Mean S/R Ratio (%) 159.4
Degrees of Freedom 118
95% Confidence Interval (%) [143.6, 176.9] Fail CI ⊆ [80, 125]
T Geometric Marginal Mean 1082
Geometric Naive Mean 1078
Geometric Mean T/R Ratio (%) 129.9
Degrees of Freedom 118
95% Confidence Interval (%) [117.1, 144.2] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 29.69 | 0.2907

You can see that Pumas detects that the Paradigm is 3 formulations and then assigns the tests and reference accordingly. For example for AUC, the estimated S/R GMR is \(140.4\%\) and the confidence interval (notice it is 95%) is [130.1, 151.5], which falls outside of the desired [80, 125] range. We will revisit more details of the output and the model in future units.

An nR1T example with n=2

Again using the bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5 dataset, let us now consider it as nR1T (multiple references). Following the approach outlined above, let us transform the dataset into separate \(2 \times 2\) datasets, data_ST_2by2 and data_RT_2by2:

be_data = dataset("bioequivalence/RST_RTS_SRT_STR_TRS_TSR/PJ2017_4_5")
# The formulation given in a period is the character at that position of
# the sequence string (e.g. sequence "RST", period 2 ⇒ 'S').
@rtransform! be_data :formulation = :sequence[:period];

# Keep only the S and T observations, drop R from the sequence strings, and
# renumber each subject's periods as positions within the stripped sequence.
data_ST = @rsubset be_data :formulation in ['S', 'T'];
data_ST_2by2 = @rtransform data_ST :sequence = replace(:sequence, "R" => "");
@rtransform! data_ST_2by2 :period = findfirst(==(:formulation), :sequence);

# Likewise for the R and T observations, dropping S.
data_RT = @rsubset be_data :formulation in ['R', 'T'];
data_RT_2by2 = @rtransform data_RT :sequence = replace(:sequence, "S" => "");
@rtransform! data_RT_2by2 :period = findfirst(==(:formulation), :sequence);
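As a quick sanity check (the @assert lines are our own, not part of the workflow), stripping a letter should leave exactly two two-letter sequences in each dataset:

@assert sort(unique(data_ST_2by2.sequence)) == ["ST", "TS"]
@assert sort(unique(data_RT_2by2.sequence)) == ["RT", "TR"]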

We can inspect the design of the datasets we formed with this transformation:

detect_design(data_ST_2by2)
(design = "RT|TR",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'S',
 test_formulations = ['T'],
 replicated = non_replicated,
 supports_rsabe = false,
 crossover = true,)
detect_design(data_RT_2by2)
(design = "RT|TR",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'R',
 test_formulations = ['T'],
 replicated = non_replicated,
 supports_rsabe = false,
 crossover = true,)

Note that in both cases the design string is RT|TR (and does not use S). This is just due to the letters R and T being a standard way to specify the design inside Pumas.

Now we can carry out the two comparisons; in each case we present AUC followed by Cmax:

Note that at the time of writing, pumas_be’s output labels the reference formulation S as R for the S versus T comparison; wherever R appears in that output it should be read as S:

pumas_be(data_ST_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 29 29
TR 31 31
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 5339
Geometric Naive Mean 5412
T Geometric Marginal Mean 4459
Geometric Naive Mean 4426
Geometric Mean T/R Ratio (%) 83.52
Degrees of Freedom 56
90% Confidence Interval (%) [78.84, 88.48] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 18.73 | 0.1857
ANOVA Formulation (p-value) 0
Sequence (p-value) 0.4313
Period (p-value) 0.174
pumas_be(data_ST_2by2, endpoint = :Cmax)
Observation Counts
Sequence ╲ Period 1 2
RT 30 29
TR 32 31
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: Cmax
Formulations: Reference(R), Test(T)
Results(Cmax) Assessment Criteria
R Geometric Marginal Mean 1323
Geometric Naive Mean 1340
T Geometric Marginal Mean 1091
Geometric Naive Mean 1078
Geometric Mean T/R Ratio (%) 82.52
Degrees of Freedom 58
90% Confidence Interval (%) [76.17, 89.41] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 26.7 | 0.2624
ANOVA Formulation (p-value) 0.00018
Sequence (p-value) 0.8896
Period (p-value) 0.5804
pumas_be(data_RT_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 30 30
TR 30 30
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 3815
Geometric Naive Mean 3824
T Geometric Marginal Mean 4430
Geometric Naive Mean 4426
Geometric Mean T/R Ratio (%) 116.1
Degrees of Freedom 57
90% Confidence Interval (%) [108.7, 124] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 21.63 | 0.2139
ANOVA Formulation (p-value) 0.00036
Sequence (p-value) 0.4096
Period (p-value) 0.5645
pumas_be(data_RT_2by2, endpoint = :Cmax)
Observation Counts
Sequence ╲ Period 1 2
RT 31 30
TR 31 31
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: Cmax
Formulations: Reference(R), Test(T)
Results(Cmax) Assessment Criteria
R Geometric Marginal Mean 837.5
Geometric Naive Mean 837.5
T Geometric Marginal Mean 1083
Geometric Naive Mean 1078
Geometric Mean T/R Ratio (%) 129.4
Degrees of Freedom 59
90% Confidence Interval (%) [118.8, 140.8] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 28.56 | 0.28
ANOVA Formulation (p-value) 0
Sequence (p-value) 0.2598
Period (p-value) 0.07524

A 1RnT example with n=3

We’ll use the bioequivalence/ADBC_BACD_CBDA_DCAB/PJ2017_4_6 dataset from the PharmaDatasets package. This data is from Example 4.6 of Patterson and Jones (2017). Note that in that book it is treated as an nRnT case, in contrast to our analysis now, which considers it as having multiple tests.
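Before modeling, we can confirm the design as we did earlier; for this dataset we expect detect_design to report four sequences, four periods, and four formulations:

detect_design(dataset("bioequivalence/ADBC_BACD_CBDA_DCAB/PJ2017_4_6"))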

be_data = dataset("bioequivalence/ADBC_BACD_CBDA_DCAB/PJ2017_4_6")
pumas_be(be_data)
Observation Counts
Sequence ╲ Period 1 2 3 4
ADBC 7 7 7 7
BACD 7 7 7 7
CBDA 7 7 7 7
DCAB 7 7 7 7
Paradigm: 4 formulations
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(A), Test(B, C and D)
Results(AUC) Assessment Criteria
A Geometric Marginal Mean 367.9
Geometric Naive Mean 367.9
B Geometric Marginal Mean 369.7
Geometric Naive Mean 369.7
Geometric Mean B/A Ratio (%) 100.5
Degrees of Freedom 78
90% Confidence Interval (%) [94.7, 106.6] Pass CI ⊆ [80, 125]
C Geometric Marginal Mean 2985
Geometric Naive Mean 2985
Geometric Mean C/A Ratio (%) 811.4
Degrees of Freedom 78
90% Confidence Interval (%) [764.8, 860.8] Fail CI ⊆ [80, 125]
D Geometric Marginal Mean 2879
Geometric Naive Mean 2879
Geometric Mean D/A Ratio (%) 782.6
Degrees of Freedom 78
90% Confidence Interval (%) [737.6, 830.2] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 13.35 | 0.1329

You may observe that in this example formulations C and D fail, with GMR estimates and confidence intervals far outside the desired range. In this dataset that is because C and D were administered at dosage amounts 8 times higher than A and B. To account for this we transform the data, dividing the AUC by 8 whenever the formulation is C or D. This is done with the following code using @rtransform from the DataFramesMeta package.

# Divide AUC by 8 whenever the formulation given in that period is C or D
transformed_data =
    @rtransform(be_data, :AUC = :sequence[:period] in ['C', 'D'] ? :AUC / 8 : :AUC);
be_result = pumas_be(transformed_data)
Observation Counts
Sequence ╲ Period 1 2 3 4
ADBC 7 7 7 7
BACD 7 7 7 7
CBDA 7 7 7 7
DCAB 7 7 7 7
Paradigm: 4 formulations
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(A), Test(B, C and D)
Results(AUC) Assessment Criteria
A Geometric Marginal Mean 367.9
Geometric Naive Mean 367.9
B Geometric Marginal Mean 369.7
Geometric Naive Mean 369.7
Geometric Mean B/A Ratio (%) 100.5
Degrees of Freedom 78
90% Confidence Interval (%) [94.7, 106.6] Pass CI ⊆ [80, 125]
C Geometric Marginal Mean 373.2
Geometric Naive Mean 373.2
Geometric Mean C/A Ratio (%) 101.4
Degrees of Freedom 78
90% Confidence Interval (%) [95.6, 107.6] Pass CI ⊆ [80, 125]
D Geometric Marginal Mean 359.9
Geometric Naive Mean 359.9
Geometric Mean D/A Ratio (%) 97.82
Degrees of Freedom 78
90% Confidence Interval (%) [92.2, 103.8] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 13.35 | 0.1329

As you can see, in this case all formulations now pass. Such a transformation only needs to be done when the experimental setup merits it.
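If this kind of dose normalization recurs, it can be wrapped in a small helper. Here is a minimal sketch (dose_normalize is our own name, not a Bioequivalence function):

function dose_normalize(df, formulations, factor; endpoint = :AUC)
    # Divide the endpoint by `factor` whenever the formulation given in a
    # period (the character at that position of the sequence string) is
    # one of `formulations`.
    return @rtransform df $endpoint =
        :sequence[:period] in formulations ? $endpoint / factor : $endpoint
end;

# Equivalent to the transformation above:
dose_normalize(be_data, ['C', 'D'], 8)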

An nR1T example with n=3

Let us now consider the same data as nR1T. For this, let us create three different datasets, each comparing the test D to one of the references A, B, and C.

We’ll continue working with transformed_data from above. Let us create a data processing helper function make_2by2by2_from_4by4by4 that converts the \(4 \times 4 \times 4\) design into a \(2 \times 2 \times 2\) design with the specified formulations:

function make_2by2by2_from_4by4by4(original_df, formulation_1, formulation_2)
    df = deepcopy(original_df)
    # All formulation letters appearing across the sequence strings.
    all_formulations = unique(vcat(unique.(df.sequence)...))
    formulations_to_drop = setdiff(all_formulations, [formulation_1, formulation_2])
    # The formulation given in a period is the character at that position
    # of the sequence string.
    @rtransform! df :formulation = :sequence[:period]
    # Keep only the two formulations of interest, strip the other two
    # letters from the sequences, and renumber the periods as positions
    # within the stripped sequence.
    @rsubset! df :formulation in [formulation_1, formulation_2]
    @rtransform! df :sequence =
        replace(:sequence, formulations_to_drop[1] => "", formulations_to_drop[2] => "")
    @rtransform! df :period = findfirst(==(:formulation), :sequence)
    return df
end;

We now use this function for the three comparisons we consider. Here are the results for AUC only:

data_AD_2by2 = make_2by2by2_from_4by4by4(transformed_data, 'A', 'D')
pumas_be(data_AD_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 367.9
Geometric Naive Mean 367.9
T Geometric Marginal Mean 359.9
Geometric Naive Mean 359.9
Geometric Mean T/R Ratio (%) 97.82
Degrees of Freedom 26
90% Confidence Interval (%) [92.41, 103.5] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 12.52 | 0.1247
ANOVA Formulation (p-value) 0.514
Sequence (p-value) 2.0e-5
Period (p-value) 0.00891
data_BD_2by2 = make_2by2by2_from_4by4by4(transformed_data, 'B', 'D')
pumas_be(data_BD_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 369.7
Geometric Naive Mean 369.7
T Geometric Marginal Mean 359.9
Geometric Naive Mean 359.9
Geometric Mean T/R Ratio (%) 97.36
Degrees of Freedom 26
90% Confidence Interval (%) [90.17, 105.1] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 16.96 | 0.1684
ANOVA Formulation (p-value) 0.5576
Sequence (p-value) 0.05751
Period (p-value) 0.01014
data_CD_2by2 = make_2by2by2_from_4by4by4(transformed_data, 'C', 'D')
pumas_be(data_CD_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 373.2
Geometric Naive Mean 373.2
T Geometric Marginal Mean 359.9
Geometric Naive Mean 359.9
Geometric Mean T/R Ratio (%) 96.45
Degrees of Freedom 26
90% Confidence Interval (%) [92.43, 100.6] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 9.37 | 0.0935
ANOVA Formulation (p-value) 0.1599
Sequence (p-value) 1.0e-5
Period (p-value) 0.0053
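Note that these three comparisons were each run at the default level = 0.90. If the study objective required an \(\alpha\) adjustment as in the 1RnT example above, a Bonferroni correction would divide \(\alpha\) by three; a sketch of the adjusted call for the A versus D comparison:

α = 0.05
level_adj = 1 - 2 * (α / 3)  # ≈ 0.9667, i.e. roughly a 96.7% confidence interval
pumas_be(data_AD_2by2, level = level_adj)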

An nRnT example with n=2

Similarly, if we wanted to treat the bioequivalence/ADBC_BACD_CBDA_DCAB/PJ2017_4_6 dataset as in Example 4.6 of Patterson and Jones (2017), we are in the nRnT case, where we compare A and B at one dosage level and then compare C and D at a different dosage level.

We again take the approach of converting the design to a \(2 \times 2 \times 2\) design using our make_2by2by2_from_4by4by4 function from above. We keep working with transformed_data; note that the dose normalization leaves A and B untouched and divides C and D by the same factor, so neither within-pair comparison is affected:

data_AB_2by2 = make_2by2by2_from_4by4by4(transformed_data, 'A', 'B')
detect_design(data_AB_2by2)
(design = "RT|TR",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'A',
 test_formulations = ['B'],
 replicated = non_replicated,
 supports_rsabe = false,
 crossover = true,)
data_CD_2by2 = make_2by2by2_from_4by4by4(transformed_data, 'C', 'D')
detect_design(data_CD_2by2)
(design = "RT|TR",
 num_sequences = 2,
 num_periods = 2,
 num_formulations = 2,
 reference_formulation = 'C',
 test_formulations = ['D'],
 replicated = non_replicated,
 supports_rsabe = false,
 crossover = true,)

Here are the comparisons, for both AUC and Cmax, in each case:

pumas_be(data_AB_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 367.9
Geometric Naive Mean 367.9
T Geometric Marginal Mean 369.7
Geometric Naive Mean 369.7
Geometric Mean T/R Ratio (%) 100.5
Degrees of Freedom 26
90% Confidence Interval (%) [92.84, 108.7] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 17.45 | 0.1732
ANOVA Formulation (p-value) 0.9201
Sequence (p-value) 0.3733
Period (p-value) 0.1069
pumas_be(data_AB_2by2, endpoint = :Cmax)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: Cmax
Formulations: Reference(R), Test(T)
Results(Cmax) Assessment Criteria
R Geometric Marginal Mean 73.17
Geometric Naive Mean 73.17
T Geometric Marginal Mean 70.62
Geometric Naive Mean 70.62
Geometric Mean T/R Ratio (%) 96.51
Degrees of Freedom 26
90% Confidence Interval (%) [87.15, 106.9] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 22.68 | 0.224
ANOVA Formulation (p-value) 0.5586
Sequence (p-value) 0.9181
Period (p-value) 0.1429
pumas_be(data_CD_2by2)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: AUC
Formulations: Reference(R), Test(T)
Results(AUC) Assessment Criteria
R Geometric Marginal Mean 373.2
Geometric Naive Mean 373.2
T Geometric Marginal Mean 359.9
Geometric Naive Mean 359.9
Geometric Mean T/R Ratio (%) 96.45
Degrees of Freedom 26
90% Confidence Interval (%) [92.43, 100.6] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 9.37 | 0.0935
ANOVA Formulation (p-value) 0.1599
Sequence (p-value) 1.0e-5
Period (p-value) 0.0053
pumas_be(data_CD_2by2, endpoint = :Cmax)
Observation Counts
Sequence ╲ Period 1 2
RT 14 14
TR 14 14
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: Cmax
Formulations: Reference(R), Test(T)
Results(Cmax) Assessment Criteria
R Geometric Marginal Mean 636
Geometric Naive Mean 636
T Geometric Marginal Mean 617.1
Geometric Naive Mean 617.1
Geometric Mean T/R Ratio (%) 97.04
Degrees of Freedom 26
90% Confidence Interval (%) [91.11, 103.4] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 13.9 | 0.1383
ANOVA Formulation (p-value) 0.4232
Sequence (p-value) 0.00134
Period (p-value) 0.669

10 Conclusion

This unit provided an overview of the most common experimental designs in bioequivalence testing, moving beyond the simple parallel design to more statistically powerful multi-period approaches.

The main focus was the \(2 \times 2\) crossover design, which is the standard for most bioequivalence studies. Its strength lies in controlling for variability between subjects, which generally provides more precise results with a smaller number of participants.

We also covered replicate designs, which are necessary for highly variable drugs (HVDs) and narrow therapeutic index drugs (NTIDs). By administering a drug more than once to the same subject, these designs allow us to measure the within-subject variability, a key requirement for certain regulatory pathways.

Finally, we looked at how to handle studies with multiple test or reference formulations. This raised the important statistical consideration of multiple comparisons and when it might be necessary to adjust the significance level (\(\alpha\)) to maintain the integrity of the results.

Understanding these different designs is crucial because the design choice determines the correct statistical model for the analysis. In the next unit, we will explore the linear and mixed-effects models that Pumas uses to analyze data from these various experimental setups.

11 Unit exercises

  1. Parallel vs. Crossover Design

    Consider a bioequivalence trial comparing formulations \(R\) (reference) and \(T\) (test).

    1. Describe a simple parallel design and a \(2 \times 2\) crossover design in terms of subject allocation and treatment sequence.
    2. Discuss one statistical advantage and one logistical disadvantage of using a crossover design instead of a parallel design for bioequivalence testing.
  2. Exploring Randomization: In the section on randomization, we assigned 20 subjects to 4 groups.

    1. Modify the code to randomize 30 subjects into 3 groups: “R”, “T low dose”, and “T high dose”.
    2. Run the code twice, once with Random.seed!(0) and once with Random.seed!(1). What do you observe about the group assignments? Why is this the case?
  3. Understanding Variability with the Toy Model: The toy model in the “Within subject variability and between subject variability” section helps illustrate BSV and WSV.

    1. Manipulating BSV: The array subject_average_concentration_loss_per_time controls the differences between subjects. Make the values in this array more similar to each other (e.g., [0.84, 0.85, 0.86, 0.87]). Rerun the simulation and the mixed model analysis (VarCorr(lmm(...))). How does the estimated standard deviation for id (which corresponds to BSV) change?
    2. Manipulating WSV: The random component rand(Uniform(0.85, 1.15)) controls the variability within a single subject’s repeated measurements. Change this to be narrower, for example, rand(Uniform(0.95, 1.05)). Rerun the simulation and the mixed model analysis. How does the estimated standard deviation for the Residual (which corresponds to WSV) change?
    3. Based on your experiments, if you were designing a study for a drug known to have very high BSV but low WSV, would you prefer a parallel or a crossover design? Why?
  4. The Impact of Alpha Correction: In the 1RnT example with three formulations, we applied a Bonferroni correction by setting level = 0.95.

    1. Rerun the analysis for AUC from that section without the correction, using the default level = 0.90.
    2. Compare the 90% confidence intervals for the T/R and S/R comparisons with the 95% confidence intervals from the original example. Are they wider or narrower?
    3. Could failing to apply an alpha correction in a multiple-comparison scenario lead to a “Pass” conclusion when a corrected analysis would “Fail”? Explain your reasoning based on the confidence intervals.
  5. Analyzing Different Comparisons: In the section on the four-formulation Williams design (ADBC_BACD_CBDA_DCAB), we compared test D to references A, B, and C.

    1. Using the make_2by2by2_from_4by4by4 helper function and the transformed_data, write the code to perform a bioequivalence analysis comparing formulation A (as reference) to formulation C (as test).
    2. Run pumas_be on the resulting dataset. What is the GMR and 90% confidence interval? Do formulations A and C pass the bioequivalence criteria?
  6. Investigating a Parallel Design: The list of available designs included "R_T", which corresponds to a parallel design.

    1. Find a dataset with this design using the datasets() function from PharmaDatasets.
    2. Load the data and run detect_design. Confirm that it is a parallel design with 1 period and 2 sequences.
    3. Run pumas_be on this dataset. Look at the “Variability” section of the output. What CV is reported? Is it the overall CV, CVᵣ, or CVₜ? Why can’t a parallel design estimate within-subject variability?

References

FDA. 2011. “Draft Guidance on Progesterone. Recommended Apr 2010; Revised Feb 2011.” https://www.accessdata.fda.gov/drugsatfda_docs/psg/Progesterone_caps_19781_RC02-11.pdf.
———. 2012. “Draft Guidance on Warfarin Sodium. Recommended Dec 2012.” https://www.accessdata.fda.gov/drugsatfda_docs/psg/Warfarin_Sodium_tab_09218_RC12-12.pdf.
———. 2025. “Product-Specific Guidances for Generic Drug Development.” https://www.accessdata.fda.gov/scripts/cder/psg/index.cfm.
ICH. 2024. “M13A: Bioequivalence for Immediate Release Solid Oral Dosage Forms. July 2024.” International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. https://database.ich.org/sites/default/files/ICH_M13A_Step4_Final_Guideline_2024_0723.pdf.
Patterson, Scott D, and Byron Jones. 2017. Bioequivalence and Statistics in Clinical Pharmacology. Chapman & Hall/CRC. https://www.routledge.com/Bioequivalence-and-Statistics-in-Clinical-Pharmacology/Patterson-Jones/p/book/9780367782443.
