using Pumas # available with Pumas products - also reexports Distributions
using NCA # available with Pumas products
using Bioequivalence # available with Pumas products
using BioequivalencePower # available with Pumas products
using PharmaDatasets # available with Pumas products
using GLM
using DataFramesMeta
using SummaryTables
using DataFrames
using AlgebraOfGraphics
using CairoMakie
using Random
using LinearAlgebra
using MixedModels

A Course on Bioequivalence: Unit 5 - The Linear Model
This is Unit 5 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In the previous unit we considered various designs. In this unit we consider analysis of the \(2 \times 2\) crossover design using a linear model. We also consider other designs that use the linear model. In particular we consider:
- Basics of linear models with categorical variables.
- The linear model applied to the \(2 \times 2\) cross over design.
- Interpretation of the variability.
- Extensions for multiple formulations.
- Regulatory specifications and examples, also considering older SAS code specifications.
- A brief exploration of mixed models for the \(2 \times 2\) crossover design.
We use the following packages.
We have used most of these packages in previous units. One package new to this unit is the GLM package, standing for “Generalized Linear Models”. As a Pumas Bioequivalence user, most of the time you do not need to use GLM explicitly, yet for our understanding we will also carry out model fits directly with GLM. Also, while we have already used the MixedModels package previously, we make more extensive use of it in this unit. We also briefly use the built-in LinearAlgebra package, which was not used in previous units.
Note that the unit also presents a few snippets of SAS code for the purpose of understanding how it appears in some regulatory guidelines.
2 Linear Models Overview
Linear models are a fundamental tool in statistics, used to quantify the relationship between a response variable and one or more predictor variables.
In the context of bioequivalence, the response variable is the endpoint on the log scale and the predictor variables are various elements of the design. Before we see how the linear model works for crossover designs, let us consider some basics of linear models.
Basic Linear Regression
The simplest linear model is simple linear regression, where we predict a response variable \(y\) based on one predictor \(x\).
For example (outside the scope of a standard bioequivalence trial), we may consider the relationship between an endpoint value and body weight.
So in this example, \(y\) can be the AUC (for example, in nanogram hours per milliliter) and \(x\) can be the body weight (in kilograms). For this, the model is \[ y = \beta_0 + \beta_1 x + \varepsilon, \] where the parameters of the model are \(\beta_0\), the intercept, and \(\beta_1\), the slope.
Here, \(\varepsilon\) signifies random error around the regression line. We model this error as having a normal distribution with \(0\) mean and a variance \(\sigma^2\). Hence, this \(\sigma\) (or \(\sigma^2\)) is also an unknown parameter.
Now, when we have data of weights \(x_i\) and respective AUC values \(y_i\) for \(i=1,\ldots,n\), we can use these datapoints to estimate \(\beta_0\), \(\beta_1\), and \(\sigma^2\).
The standard way to estimate parameters is via maximum likelihood estimation.
We briefly discussed such an estimation approach in Unit 2 in the context of the log-normal distribution. Now, for basic linear regression, it is carried out by considering the likelihood function.
Here is the expression of the likelihood function (we present it here only as an illustration): \[ L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right). \] It turns out that maximizing this expression is equivalent to:
- Solving a least squares minimization problem to estimate \(\beta_0\) and \(\beta_1\), where the estimators are denoted \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
- Estimating \(\sigma^2\) by calculating the average squared residual (or, more precisely, the maximum likelihood estimator of the variance):
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2. \]
Note: In standard regression output, an unbiased estimator divides by \(n-2\) instead of \(n\). (Here \(2\) is the number of regression coefficients estimated).
We do not cover the basic ideas of least squares here, and you may proceed with the material even without an understanding of least squares. Yet if you are interested to learn more about least squares, one possible source is Section 2.3 in Liquet, Moka, and Nazarathy (2024). Another possible source is Section 8.1 in Nazarathy and Klok (2021).
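That said, for readers who want a concrete picture, here is a minimal sketch (with made-up x and y values, not course data; all variable names are ours for illustration only) of the closed-form least squares estimates and of the two variance estimators discussed above:

# A minimal illustration of least squares estimation (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 3.7, 5.2, 5.8]
n = length(x)

x̄, ȳ = sum(x) / n, sum(y) / n
β̂₁ = sum((x .- x̄) .* (y .- ȳ)) / sum((x .- x̄) .^ 2) # least squares slope
β̂₀ = ȳ - β̂₁ * x̄ # least squares intercept

residuals = y .- β̂₀ .- β̂₁ .* x
σ̂²_mle = sum(abs2, residuals) / n # maximum likelihood estimator (divide by n)
σ̂²_unbiased = sum(abs2, residuals) / (n - 2) # unbiased estimator (divide by n - 2)

println((β̂₀ = β̂₀, β̂₁ = β̂₁, σ̂²_mle = σ̂²_mle, σ̂²_unbiased = σ̂²_unbiased))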
Importantly, given some arbitrary weight value \(x\), we can predict the AUC value we expect for that \(x\) via,
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \]
Let us consider a small example with \(n=5\) datapoints. The GLM package provides the lm function (standing for linear model). A statistical model is specified via @formula(AUC ~ 1 + weight), where AUC and weight are columns in a dataframe. The 1 + in the formula stands for the intercept (note that even without this we fit an intercept unless we write 0 +).
df = DataFrame(
    weight = [60, 65, 70, 75, 80], # x values
    AUC = [125.5, 119.1, 137.0, 128.5, 132.2], # y values
)

# Fit simple linear regression model: AUC ~ weight
model = lm(@formula(AUC ~ 1 + weight), df)

# present the coefficient table neatly
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 96.5 | 29.4 | 3.28 | 0.0464 | 2.85 | 190 |
weight | 0.456 | 0.418 | 1.09 | 0.356 | -0.876 | 1.79 |
The result above is a fitted model. In the output you see the estimated coefficients in the leftmost column. There is also additional statistical information, including p-values for considering if the coefficients are meaningful in the model.
Specifically, when a p-value is close to 0 (e.g. less than \(5\%\)) it implies that the estimated coefficient is most likely meaningful within the model. Similarly there are confidence intervals for the estimated parameters. More on this follows below.
For more clarity, here are the estimated regression line parameters, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) as well as the estimated standard deviation of \(\varepsilon\) which we denote via \(\hat{\sigma}\).
β̂₀, β̂₁ = round.(coef(model), digits = 2)
println("Estimates of regression line parameters: β̂₀ = $(β̂₀), β̂₁ = $(β̂₁)")

# The estimation of σ̂²
rss = deviance(model) # Residual sum of squares
df_resid = dof_residual(model) # n - 2
σ̂ = round(sqrt(rss / df_resid), digits = 2)
println("σ̂: ", σ̂)
Estimates of regression line parameters: β̂₀ = 96.54, β̂₁ = 0.46
σ̂: 6.62
The estimated value of \(\beta_1\) at about \(0.46\) implies that an increase of the weight by a single kilogram implies an expected increase of the AUC by \(0.46\) ng·h/mL.
The estimated value of \(\hat{\sigma}\) is an indication of the spread of the AUC values around the regression line. To see this further, let us also observe the data and the plotted line visually.
In this case let us use the predict function applied to model:

x_points = range(50, stop = 100, length = 100)
pred = predict(model, DataFrame(weight = x_points))

plt =
    data(df) * mapping(:weight, :AUC) * visual(Scatter) +
    data(DataFrame(weight = x_points, auc_pred = pred)) *
    mapping(:weight, :auc_pred) *
    visual(Lines)
draw(
    plt;
    axis = (
        xlabel = "X: Weight (kg)",
        ylabel = "Y: AUC (ng·h/mL)",
        title = "Linear Regression: AUC (Y) vs. Weight (X)",
        limits = ((50, 100), (100, 150)),
    ),
)
Multiple Linear Regression
The simple linear regression case is quite easily extended to the case where, in place of a single predictor variable, there are multiple predictor variables.
For example (still outside the scope of a standard bioequivalence trial), assume that AUC is not only predicted by the subject’s weight, but also by the height and age. In this case the model can be denoted as,
\[ y = \beta_0 + \beta_1 x^{(1)} + \beta_2 x^{(2)} + \beta_3 x^{(3)} + \varepsilon, \]
where now \(x^{(1)}\) is the weight (kg), \(x^{(2)}\) is the height (cm), and \(x^{(3)}\) is the age (years). Similarly there could have been more variables.
In this case, for each subject \(i=1,\ldots,n\), the data contains the response \(y_i\) and the predictors \(x^{(1)}_i, x^{(2)}_i, x^{(3)}_i\).
With this, the same ideas of maximum likelihood estimation through least squares can yield estimates, \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3\), as well as \(\hat{\sigma}^2\).
Now again, for some specific \(x^{(1)}, x^{(2)}, x^{(3)}\) values, we can predict the AUC as,
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x^{(1)} + \hat{\beta}_2 x^{(2)} + \hat{\beta}_3 x^{(3)}. \]
Let us see this via the GLM package, where we provide data points for all three predictors (weight, height, and age). As you can see in the code below, we use @formula(AUC ~ 1 + weight + height + age) inside the lm function:

df = DataFrame(
    weight = [60, 65, 70, 75, 80], # kg
    height = [165, 170, 172, 168, 175], # cm
    age = [25, 25, 27, 32, 31], # years
    AUC = [126, 130, 134, 131, 136],
)

model = lm(@formula(AUC ~ 1 + weight + height + age), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | -682.0 | 390 | -1.75 | 0.331 | -5642.0 | 4279 |
weight | -4.67 | 2.77 | -1.69 | 0.341 | -39.8 | 30.5 |
height | 5.33 | 2.65 | 2.01 | 0.294 | -28.3 | 39 |
age | 8.33 | 4.77 | 1.75 | 0.331 | -52.3 | 68.9 |
The output of the fitted model above now has four rows, with one row for every fitted coefficient. In this particular case, the p-values for all coefficients are not small (e.g. none are less than \(5\%\)). This indicates that for each of those predictors there is no clear evidence of an effect of the predictor on the output (AUC). This does not mean that there is no such effect, but only indicates that we do not have evidence of such an effect.
While in practice we will not use this model due to the non-low p-values, for demonstration, let us still consider the estimated parameters:
β̂₀, β̂₁, β̂₂, β̂₃ = round.(coef(model), digits = 2)
println("Estimates of regression line parameters:")
println("β̂₀ (intercept) = $(β̂₀)")
println("β̂₁ (weight) = $(β̂₁)")
println("β̂₂ (height) = $(β̂₂)")
println("β̂₃ (age) = $(β̂₃)")
rss = deviance(model)
df_resid = dof_residual(model)
σ̂ = round(sqrt(rss / df_resid), digits = 2)
println("σ̂: ", σ̂)
Estimates of regression line parameters:
β̂₀ (intercept) = -681.93
β̂₁ (weight) = -4.67
β̂₂ (height) = 5.33
β̂₃ (age) = 8.33
σ̂: 0.73
Categorical Variables in Linear Models
Often, and certainly in bioequivalence studies, we need to model a categorical predictor such as formulation (Test vs. Reference), period, sequence, and others.
This is done by means of multiple linear regression where each categorical variable is represented by multiple boolean (or indicator) variables.
As an example assume we want to predict AUC as a function of the predictor formulation, and there are three formulations, R, S, and T. The “trick” for handling such a case is to create a linear model,
\[ y = \beta_0 + \beta_1 x^{(\texttt{S})} + \beta_2 x^{(\texttt{T})} + \varepsilon. \]
Now the variables \(x^{(\texttt{S})}\) and \(x^{(\texttt{T})}\) are associated with the formulation predictor as follows:
- If the formulation is R, then both variables are \(0\).
- If the formulation is S, then \(x^{(\texttt{S})} = 1\) and \(x^{(\texttt{T})} = 0\).
- If the formulation is T, then \(x^{(\texttt{S})} = 0\) and \(x^{(\texttt{T})} = 1\).
Hence with categorical predictors, the \(x\) variables are set as boolean (also sometimes called indicator variables).
Such an approach allows us to incorporate categorical variables in the model. With this:
- In the case of the R formulation, \(\hat{y} = \hat{\beta}_0\).
- In the case of the S formulation, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\).
- In the case of the T formulation, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_2\).
In the same way, if there are more categorical predictors, we can create more categorical variables for this purpose.
Observe that a categorical predictor with \(k\) levels (e.g. \(k=3\) in the example above) requires \(k-1\) indicator variables (e.g. \(2\) in this example).
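As a sketch of this “trick” done by hand (the names df_manual, x_S, x_T, and model_manual are ours, and the values simply mirror the GLM example that follows), we can build the two indicator columns ourselves and fit on them directly; the estimated coefficients should match those of the categorical fit below:

# Manually building indicator (boolean) variables for a 3-level factor (illustrative).
df_manual = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4],
)
df_manual.x_S = Float64.(df_manual.formulation .== "S") # 1.0 when formulation is S, else 0.0
df_manual.x_T = Float64.(df_manual.formulation .== "T") # 1.0 when formulation is T, else 0.0

# Fitting on the indicator columns mirrors what the categorical fit does internally.
model_manual = lm(@formula(AUC ~ 1 + x_S + x_T), df_manual)
coef(model_manual)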
Here is again an example using the GLM package:
df = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4],
)

model = lm(@formula(AUC ~ 1 + formulation), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 120 | 0.555 | 216 | 2.2e-7 | 118 | 121 |
formulation: S | 5.7 | 0.784 | 7.27 | 0.00537 | 3.2 | 8.2 |
formulation: T | 11.9 | 0.784 | 15.2 | 0.000614 | 9.45 | 14.4 |
Let us again use the coefficients to interpret the effects:
# Show fitted coefficients
β̂₀, β̂₁, β̂₂ = round.(coef(model), digits = 2)
println("Estimates of regression line parameters:")
println("β̂₀ (intercept, R) = $(β̂₀)")
println("β̂₁ (formulation: S) = $(β̂₁), formulation S effect (β̂₀ + β̂₁) = $(β̂₀ + β̂₁)")
println("β̂₂ (formulation: T) = $(β̂₂), formulation T effect (β̂₀ + β̂₂) = $(β̂₀ + β̂₂)")
# Estimate of σ̂² (residual variance)
rss = deviance(model)
df_resid = dof_residual(model)
σ̂ = round(sqrt(rss / df_resid), digits = 2)
println("σ̂: ", σ̂)
Estimates of regression line parameters:
β̂₀ (intercept, R) = 119.6
β̂₁ (formulation: S) = 5.7, formulation S effect (β̂₀ + β̂₁) = 125.3
β̂₂ (formulation: T) = 11.95, formulation T effect (β̂₀ + β̂₂) = 131.54999999999998
σ̂: 0.78
We can also see these values directly with the predict function:

newdata = DataFrame(formulation = ["R", "S", "T"])
auc_pred = predict(model, newdata)
println("Predicted AUC for formulations R, S, T: ", round.(auc_pred, digits = 2))
Predicted AUC for formulations R, S, T: [119.6, 125.3, 131.55]
More than one categorical variable
Here is another example, now with more than one categorical variable. Assume there is also a location variable which indicates the physical location where the trial took place.

df = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    location = ["loc_A", "loc_A", "loc_A", "loc_B", "loc_B", "loc_B"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4],
)

model = lm(@formula(AUC ~ 1 + formulation + location), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 120 | 0.775 | 154 | 4.21e-5 | 116 | 123 |
formulation: S | 5.7 | 0.95 | 6 | 0.0266 | 1.61 | 9.79 |
formulation: T | 11.9 | 0.95 | 12.6 | 0.00625 | 7.86 | 16 |
location: loc_B | 0.167 | 0.775 | 0.215 | 0.85 | -3.17 | 3.5 |
In this case, an associated linear model is:
\[ y = \beta_0 + \beta_1 x^{(\texttt{S})} + \beta_2 x^{(\texttt{T})} + \beta_3 x^{(\texttt{loc\_B})} + \varepsilon. \]
In such a model, we can interpret the estimate for the endpoint as follows:

- R, loc_A: \(\hat{y} = \hat{\beta}_0\).
- S, loc_A: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\).
- T, loc_A: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_2\).
- R, loc_B: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_3\).
- S, loc_B: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_3\).
- T, loc_B: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_2 + \hat{\beta}_3\).
To see this, let's again use the predict function and compare with a manual computation where we use the coef function to extract the coefficients. As you can see, we obtain the same values:
β̂₀, β̂₁, β̂₂, β̂₃ = coef(model)
simple_table(
    DataFrame(
        formulation = df.formulation,
        location = df.location,
        manual = round.(
            [β̂₀, β̂₀ + β̂₁, β̂₀ + β̂₂, β̂₀ + β̂₃, β̂₀ + β̂₁ + β̂₃, β̂₀ + β̂₂ + β̂₃],
            digits = 2,
        ),
        prediction = round.(predict(model, df), digits = 2),
    ),
    round_mode = nothing,
)
formulation | location | manual | prediction |
R | loc_A | 119.52 | 119.52 |
S | loc_A | 125.22 | 125.22 |
T | loc_A | 131.47 | 131.47 |
R | loc_B | 119.68 | 119.68 |
S | loc_B | 125.38 | 125.38 |
T | loc_B | 131.63 | 131.63 |
Relationship to ANOVA
When a linear model includes only a single categorical predictor, as in the example with the formulation variable, the model is mathematically equivalent to a classical one-way analysis of variance (ANOVA). If there are two categorical predictors it is equivalent to a classical two-way analysis of variance (ANOVA). See Section 7.3 in Nazarathy and Klok (2021) for an elementary introduction to ANOVA (with the Julia language).
ANOVA is a statistical procedure traditionally used to test whether the means of several groups are equal. In the linear model formulation, the group assignments are represented by indicator variables, and fitting the model amounts to estimating the mean of each group and testing for differences among them.
We also report ANOVA-style (Type III) p-values for various quantities in the model; more on this in the sequel.
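To make the ANOVA connection concrete, here is a small sketch (the names df_anova, m0, and m1 are ours for illustration) of the classical one-way ANOVA F-test, obtained by comparing the intercept-only model with the model that has one mean per formulation group:

# One-way ANOVA as a comparison of two nested linear models (illustrative).
df_anova = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4],
)
m0 = lm(@formula(AUC ~ 1), df_anova) # grand mean only
m1 = lm(@formula(AUC ~ 1 + formulation), df_anova) # one mean per group

# Classical F statistic from the residual sums of squares of the two fits.
rss0, rss1 = deviance(m0), deviance(m1)
df_num = dof_residual(m0) - dof_residual(m1) # number of groups minus 1
df_den = dof_residual(m1) # n minus number of groups
F = ((rss0 - rss1) / df_num) / (rss1 / df_den)
p_value = 1 - cdf(FDist(df_num, df_den), F)
println("F = $(round(F, digits = 2)), p-value = $(round(p_value, digits = 4))")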
Nesting or Multilevel (Hierarchical) Models
Another element of linear models that we may consider is nesting of variables.
This is a case where one variable can be considered differently based on another variable.
Do not confuse this with a different topic (not discussed here) called “nested models”.
To see our basic type of use of nesting, consider this data where the formulation is either R or T, and within each formulation we have an identifier of the subject, id_within_formulation, which is either 1, 2, or 3.
df = DataFrame(
    formulation = ["R", "R", "R", "R", "R", "R", "T", "T", "T", "T"],
    id_within_formulation = ["1", "1", "2", "2", "3", "3", "1", "1", "2", "3"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4, 129.3, 130.5, 132.1, 128.8],
)
simple_table(df, round_mode = nothing)
formulation | id_within_formulation | AUC |
R | 1 | 120.0 |
R | 1 | 125.5 |
R | 2 | 130.7 |
R | 2 | 119.2 |
R | 3 | 125.1 |
R | 3 | 132.4 |
T | 1 | 129.3 |
T | 1 | 130.5 |
T | 2 | 132.1 |
T | 3 | 128.8 |
Observe that our meaning with this is that a subject with id 1 within formulation R is a different subject than the subject with id 1 within formulation T. Similarly for the other ids.
Now if we were to naively specify a linear model formula such as AUC ~ formulation + id_within_formulation, then we get:
model = lm(@formula(AUC ~ formulation + id_within_formulation), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 124 | 2.81 | 44 | 9.16e-9 | 117 | 131 |
formulation: T | 5.12 | 3.08 | 1.66 | 0.147 | -2.41 | 12.7 |
id_within_formulation: 2 | 1.86 | 3.63 | 0.513 | 0.626 | -7.01 | 10.7 |
id_within_formulation: 3 | 3.3 | 3.63 | 0.908 | 0.399 | -5.58 | 12.2 |
The problem with this approach is that it did not separate id_within_formulation across formulations.
Instead, what we want to do is have id_within_formulation nested within formulation. For this we use & in the statistical formula; more specifically we introduce a term such as:
formulation & id_within_formulation
This term means that we want to consider id_within_formulation separately for each of the formulations. Note that in other languages such as SAS this would appear as id_within_formulation(formulation).
With such a nesting specification, this is what the estimated model looks like now:
model = lm(@formula(AUC ~ formulation + formulation & id_within_formulation), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 123 | 3.68 | 33.3 | 4.84e-6 | 113 | 133 |
formulation: T | 7.15 | 5.21 | 1.37 | 0.242 | -7.32 | 21.6 |
formulation: R & id_within_formulation: 2 | 2.2 | 5.21 | 0.422 | 0.695 | -12.3 | 16.7 |
formulation: T & id_within_formulation: 2 | 2.2 | 6.38 | 0.345 | 0.748 | -15.5 | 19.9 |
formulation: R & id_within_formulation: 3 | 6 | 5.21 | 1.15 | 0.314 | -8.47 | 20.5 |
formulation: T & id_within_formulation: 3 | -1.1 | 6.38 | -0.172 | 0.872 | -18.8 | 16.6 |
Let us understand the meaning of the coefficients fit for this linear model. For this, consider the following representation of the linear model with a total of 6 estimated coefficients:
As before, when working with categorical variables, each of these \(x\) variables is set to either \(0\) or \(1\) depending on the case of the data. For example, \(x^{(\texttt{R\&3})} = 1\) only for data points where formulation is R and id_within_formulation is 3.
With these settings, the coefficients have the following meaning:

- R, 1: \(\hat{y} = \hat{\beta}_0\).
- R, 2: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_2\).
- R, 3: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_4\).
- T, 1: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\).
- T, 2: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_3\).
- T, 3: \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_5\).
We can verify this by comparing the predict function with a manual computation. Note that unique simply removes repeated rows from the dataframe.
β̂₀, β̂₁, β̂₂, β̂₃, β̂₄, β̂₅ = coef(model)
df_predict = unique(
    DataFrame(
        formulation = df.formulation,
        id_within_formulation = df.id_within_formulation,
        prediction = round.(predict(model, df), digits = 2),
    ),
)
df_predict.manual =
    round.(
        [β̂₀, β̂₀ + β̂₂, β̂₀ + β̂₄, β̂₀ + β̂₁, β̂₀ + β̂₁ + β̂₃, β̂₀ + β̂₁ + β̂₅],
        digits = 2,
    )
simple_table(df_predict, round_mode = nothing)
formulation | id_within_formulation | prediction | manual |
R | 1 | 122.75 | 122.75 |
R | 2 | 124.95 | 124.95 |
R | 3 | 128.75 | 128.75 |
T | 1 | 129.9 | 129.9 |
T | 2 | 132.1 | 132.1 |
T | 3 | 128.8 | 128.8 |
Note that in principle we could have had identifier numbers for all subjects, not repeating them within each formulation. However, when it comes to considering type III p-values, a discussion of a future unit, our implementation requires separating the ids into id_within_formulation as above.
3 Inference with the linear model
The statistical assumptions of a linear model enable us to carry out statistical inference, including confidence intervals, hypothesis tests, and estimation of certain quantities which we call contrasts.
In general a statistical model makes use of specific statistical assumptions. As such, it differs from a machine learning model which is typically focused on finding an input-output relationship between \(x\) and \(y\) with minimal statistical assumptions. See for example, Chapter 2 of Liquet, Moka, and Nazarathy (2024) where linear models are presented as machine learning models.
A theoretical example
As a statistical model, the benefit of making statistical assumptions is that we know the distributional properties of the estimators \(\hat{\beta}\).
To get a feel for this, let us adapt Listing 8.5 from Nazarathy and Klok (2021). Note that this example and code does not have practical bioequivalence value, but is rather here for pedagogic purposes.
Here assume the simple linear regression case where the actual population \(\beta_0\) is \(2.0\), the actual population \(\beta_1\) is \(1.5\), and the standard deviation of \(\varepsilon\) is \(\sigma = 2.5\). Assume further that we consider a sample of n=10 points over \(x_1 = 1, x_2 = 2\), and up to \(x_{10} = 10\).
Now to get a feel for the distribution of the estimators, let us simulate this sampling many times, e.g. N=10_000 times. This snippet of code simulates ten thousand samples and for each it records a pair of estimators \((\hat{\beta}_0, \hat{\beta}_1)\) in the array ests:
β₀, β₁, σ = 2.0, 1.5, 2.5
n, N = 10, 10_000

function coef_est()
    x_vals = 1:n
    y_vals = β₀ .+ β₁ * x_vals + rand(Normal(0, σ), n)
    data = DataFrame(X = x_vals, Y = y_vals)
    model = lm(@formula(Y ~ X), data)
    return coef(model)
end

ests = [coef_est() for _ = 1:N];
Here is a scatter plot of the ten thousand simulation repetitions which can give a hint at the distribution of the estimators:
fig = Figure()
ax = Axis(
    fig[1, 1];
    xlabel = L"\hat{\beta}_0",
    ylabel = L"\hat{\beta}_1",
    xlabelsize = 20,
    ylabelsize = 20,
)
scatter!(ax, first.(ests), last.(ests); color = :blue, markersize = 3)
fig
Part of the beauty of linear models is that we know the exact properties of the distributions of these estimators. For example, while the details are not critical, based on \(\sigma\) and the (design) values of \(x\), we know the covariance matrix of the estimator, \(\Sigma\).
See for example Section 8.2 of Nazarathy and Klok (2021) for more details:
x_bar = mean(1:n)
s_xx, s_x2 = sum((x - x_bar)^2 for x = 1:n), sum(x^2 for x = 1:n)
var_0, var_1 = σ^2 * s_x2 / (n * s_xx), σ^2 / s_xx
cov_01 = -σ^2 * x_bar / s_xx
Σ = [
    var_0 cov_01
    cov_01 var_1
]
2×2 Matrix{Float64}:
2.91667 -0.416667
-0.416667 0.0757576
These covariance values (including the variances), together with the fact that the estimators follow a (multivariate/bivariate) normal distribution, can even be translated into probabilistic information.
For example, the following code uses a bit of linear algebra and probability distributions (the details can be omitted) to find an ellipse in the \((\hat{\beta}_0, \hat{\beta}_1)\) space which covers \(100\times(1-\alpha)\) percent of the estimates:
α = 0.05
A = cholesky(Σ).L
r = quantile(Rayleigh(), 1 - α)
is_in_ellipse(β̂) = norm(inv(A) * (β̂ - [β₀, β₁])) ≤ r;
We can then apply the is_in_ellipse function on every estimate to see if it is in the ellipse or not. You can see the proportion is close to \(1-\alpha\).
est_in = is_in_ellipse.(ests)
println("Proportion of points inside ellipse: ", mean(est_in))
Proportion of points inside ellipse: 0.9493
Here is a plot presenting this ellipse with the points inside the ellipse and those not. Also since the estimators are unbiased the center of the ellipse is at the actual \((\beta_0, \beta_1)\) point:
fig = Figure()
ax = Axis(
    fig[1, 1];
    xlabel = L"\hat{\beta}_0",
    ylabel = L"\hat{\beta}_1",
    xlabelsize = 20,
    ylabelsize = 20,
)

in_ellipse = is_in_ellipse.(ests)
ellipse = [r * A * [cos(t), sin(t)] .+ [β₀, β₁] for t in range(0, 2π; length = 500)]
xs_in, ys_in = first.(ests[in_ellipse]), last.(ests[in_ellipse])
xs_out, ys_out = first.(ests[.!in_ellipse]), last.(ests[.!in_ellipse])

scatter!(ax, xs_in, ys_in; color = :green, markersize = 3)
scatter!(ax, xs_out, ys_out; color = :blue, markersize = 3)
lines!(ax, first.(ellipse), last.(ellipse); color = :red, linewidth = 2)
scatter!(ax, [β₀], [β₁]; color = :red, markersize = 8)
fig
The point of all this is that we have a lot of distributional information in the linear model. We know the properties of the estimators, and based on that, statistical inference theory is developed.
While this distributional information relies on knowing \(\sigma\), in the realistic case where we do not know \(\sigma\) and estimate it based on the data, the resulting marginal distribution of each \(\hat{\beta}_i\) is closely related to a T-distribution. This allows us to carry out inference.
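As a quick illustrative check of this last point (reusing β₀, β₁, σ, n, and N from the simulation above; the helper name studentized_slope is ours), we can verify that the studentized slope estimate behaves like a T-distribution with \(n-2\) degrees of freedom:

# Studentized slope estimates from repeated simulated samples (illustrative).
function studentized_slope()
    x_vals = 1:n
    y_vals = β₀ .+ β₁ * x_vals + rand(Normal(0, σ), n)
    m = lm(@formula(Y ~ X), DataFrame(X = x_vals, Y = y_vals))
    return (coef(m)[2] - β₁) / stderror(m)[2]
end

t_samples = [studentized_slope() for _ = 1:N]

# The empirical tail probability should be close to the T-distribution tail.
println("Empirical P(T > 2):   ", mean(t_samples .> 2))
println("Theoretical P(T > 2): ", round(1 - cdf(TDist(n - 2), 2), digits = 4))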
Basic parameter inference
Let us return to the simple linear regression example from the top of this unit, and now consider confidence intervals and hypothesis tests. These ideas also extend to more complex linear models that involve multiple estimated coefficients. Here is the data example and model fitting once more:
df = DataFrame(weight = [60, 65, 70, 75, 80], AUC = [125.5, 119.1, 137.0, 128.5, 132.2])
model = lm(@formula(AUC ~ 1 + weight), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 96.5 | 29.4 | 3.28 | 0.0464 | 2.85 | 190 |
weight | 0.456 | 0.418 | 1.09 | 0.356 | -0.876 | 1.79 |
In addition to the estimated coefficients in the first column, we are also presented with additional information, with each row relating to each individual coefficient in isolation.
The second numerical column, Std. Error, presents model-based estimates of the standard error associated with each coefficient.
These estimates are related to \(\hat{\sigma}\) (not appearing here) and to quantities associated with the \(x\) values. The specific formulas used to compute these standard errors are not as important.
The GLM function stderror can retrieve these estimates:
round.(stderror(model), digits = 3)
2-element Vector{Float64}:
29.439
0.418
To understand the interpretation of these standard errors, consider both the estimate \(\hat{\beta}_i\) and standard error \(S_{\hat{\beta}_i}\) as a ratio,
\[ T_i = \frac{\hat{\beta}_i}{S_{\hat{\beta}_i}}. \]
The key is now to consider a case where the actual unknown coefficient is zero, namely, \(\beta_i = 0\). If this is the case, \(T_i\) follows a T-distribution with \(n-p\) degrees of freedom, where \(p\) is the number of estimated coefficients (e.g. \(p=2\) in simple linear regression).
For our specific data, here are those T-values, also appearing via the third numerical column, t, in the table:
round.(coef(model) ./ stderror(model), digits = 2)
2-element Vector{Float64}:
3.28
1.09
Each such value can be viewed as the statistic of a specific hypothesis test about a parameter within the model. For example, for the second row, weight, there is the hypothesis test:
\[ {\cal T}_{\beta_1} = \begin{cases} H_0: & \beta_1 = 0 \qquad (\text{no effect of weight on AUC}),\\ H_1: & \beta_1 \neq 0 \qquad (\text{an effect of weight on AUC}). \end{cases} \]
In this hypothesis test, rejecting \(H_0\) implies determining that there is an effect of weight on AUC.
To see a manual calculation of the p-value, we can use the cdf function from the Distributions package together with TDist, which stands for a T-distribution.
Since this is a two sided hypothesis, based on T_statistic, we obtain the p-value via 2*(1-cdf(TDist(n-p), T_statistic)). You can see this value agrees with the table above:
n = 5
p = 2
T_statistic = coef(model)[2] / stderror(model)[2] # [2] is for weight
p_value = 2 * (1 - cdf(TDist(n - p), T_statistic))
round(p_value, digits = 3)
0.356
Such a non-low p-value then implies that we do not reject \(H_0\) and hence we do not have evidence that weight affects AUC. If the p-value had been low, as it is for the (Intercept) row, then we would reject \(H_0\) and conclude that the coefficient is meaningful within the model.
In a similar way, we can obtain the final columns of the table, which in the weight case present a \(95\%\) confidence interval for the estimated quantity \(\beta_1\). To reproduce these values, note the use of the \(97.5\%\) quantile of the T-distribution:
β̂₁ = coef(model)[2]
ci_radius = stderror(model)[2] * quantile(TDist(n - p), 0.975)
ci = [β̂₁ - ci_radius, β̂₁ + ci_radius]
round.(ci, digits = 3) |> println
[-0.876, 1.788]
You can see that this manually calculated CI is the same as that appearing in the coefficient table.
Note that we can also use the confint function, supported by GLM directly. Observe these values agree with the coefficient table:
confint(model)
2×2 Matrix{Float64}:
2.85315 190.227
-0.875607 1.78761
Importantly, if we want a different confidence level we can use the level= keyword argument in confint. For example, if we were to use the output of a linear model in the context of a TOST bioequivalence test, the most common way is to use a \(100\times(1-2\alpha)\) confidence interval, i.e. \(90\%\). In this case:
confint(model, level = 0.9)
2×2 Matrix{Float64}:
27.2602 165.82
-0.5287 1.4407
Again we stress that the above example is not a typical bioequivalence example. Yet the ideas and principles follow.
The estimated covariance matrix of the coefficient estimates
Just like we computed the theoretical covariance matrix of the coefficient estimates above, based on data, we also have an estimated covariance matrix. This is sometimes also called a variance-covariance matrix of the coefficient estimates.
With the GLM package we can obtain this matrix via the vcov function:
vcov(model)
2×2 Matrix{Float64}:
866.633 -12.2554
-12.2554 0.175077
As a verification, if we consider the diagonal entries of the matrix, via diag, and take their square roots, we obtain the standard errors again:
round.(sqrt.(diag(vcov(model))), digits = 3)
2-element Vector{Float64}:
29.439
0.418
Contrasts
While inference on individual parameters like \(\beta_1\) is useful, we are often interested in linear combinations of several parameters. Such a linear combination is called a contrast.
Consider the parameter \(\beta = (\beta_0, \beta_1, \dots, \beta_p)\). Now take a contrast coefficient vector \(c = (c_0, c_1, \dots, c_p)\). The contrast itself is the scalar value: \[ L = \sum_{i=0}^p c_i \beta_i. \]
This is simply a linear combination of the various coefficients. Note that \(c_i\) values may be positive, negative, or zero.
The estimate of the contrast is naturally \[ \hat{L} = \sum_{i=0}^p c_i \hat{\beta}_i. \]
It turns out, that just like we can do inference for individual coefficients using a T-distribution, we can also do inference for contrasts.
The standard error for a contrast with coefficient vector \(c\) is given by, \[ \text{SE} = \sqrt{c^\top \hat{\Sigma} \, c}, \] where \(\hat{\Sigma}\) is the estimated variance covariance matrix (as you would get from vcov), and \(c^\top\) is a row vector of the coefficients while \(c\) is a column vector. Hence \(c^\top \hat{\Sigma} \, c\) is a matrix multiplication (row times matrix times column) which results in a single number (scalar).
With this, we can form a T-statistic to test the null hypothesis \(H_0: L=0\) vs. the alternative hypothesis \(H_1: L \neq 0\). That T-statistic is: \[ T = \frac{\hat{L}}{\text{SE}} \] Just as it did for a single coefficient, this statistic follows a T-distribution with \(n-p\) degrees of freedom, where \(n\) is the total number of observations and \(p\) is the number of parameters in the model.
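Before the worked example, here is a small sketch of a generic helper that packages these formulas (the name contrast_ci is ours and is not part of the Bioequivalence or GLM API; it assumes a model fitted with lm as above):

# Estimate, standard error, and confidence interval for a contrast c'β (illustrative helper).
function contrast_ci(model, c::AbstractVector; level = 0.90)
    L̂ = c' * coef(model) # contrast estimate
    se = sqrt(c' * vcov(model) * c) # contrast standard error
    t = quantile(TDist(dof_residual(model)), 1 - (1 - level) / 2)
    return (estimate = L̂, lower = L̂ - t * se, upper = L̂ + t * se)
end

For instance, with the three-formulation model fitted just below, contrast_ci(model, [0, -1, 1]) should reproduce the \(90\%\) interval that we compute step by step in what follows.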
As a simple example, let us revisit an example from above where there are three formulations R, S, and T, with some modified values in comparison to above:
df = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = log.([90.0, 127.9, 125.7, 90.2, 122.1, 119.8]), # observe the log transformation
)
simple_table(df)
formulation | AUC |
R | 4.5 |
S | 4.85 |
T | 4.83 |
R | 4.5 |
S | 4.8 |
T | 4.79 |
As above, let us fit a model for it, where we know that formulation is a categorical predictor with \(3\) separate levels (hence we expect \(2\) coefficients for it):

model = lm(@formula(AUC ~ 1 + formulation), df)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.5 | 0.0193 | 233 | 1.74e-7 | 4.44 | 4.56 |
formulation: S | 0.327 | 0.0273 | 12 | 0.00125 | 0.24 | 0.414 |
formulation: T | 0.309 | 0.0273 | 11.3 | 0.00148 | 0.222 | 0.396 |
Now hypothetically say we are only interested in carrying out a bioequivalence comparison between S and T. As you recall from above, the estimator for the effect of S is \(\beta_0 + \beta_1\) while that of T is \(\beta_0 + \beta_2\). Thus our quantity of interest is,
\[ L = (\beta_0 + \beta_2) - (\beta_0 + \beta_1) = \beta_2 - \beta_1. \] Positioned as a contrast, we can set \(c = (0, -1, 1)\) and thus,
\[ L = \sum_{i=0}^2 c_i \beta_i = \beta_2 - \beta_1. \]
Now recall the concept of a TOST test from Unit 3. The general idea was to consider a \(100 \times(1-2\alpha)\) percent confidence interval and consider if it lies within \(\log(0.8)\) and \(\log(1.25)\).
For this let us estimate the contrast:
c = [0, -1, 1];
L̂ = c' * coef(model)
-0.01818364421195323
Presenting this back on the natural scale (from the log scale), the estimate of the GMR is:
exp(L̂)
0.9819806807302752
If we wanted to obtain a \(90\%\) confidence interval for it, we can use the formula for the standard error of the contrast:
Σ̂ = vcov(model)
SE = sqrt(c' * Σ̂ * c)
0.027294111055944363
Now we can construct the confidence interval:

n = 6
p = 3
ci_radius = SE * quantile(TDist(n - p), 0.95)
log_scale_ci = [L̂ - ci_radius, L̂ + ci_radius]
log_scale_ci |> println
[-0.08241660715643286, 0.04604931873252639]
And presenting this confidence interval back on the natural scale we get:
percent_ci = round.(100exp.(log_scale_ci), digits = 2)
percent_ci |> println
[92.09, 104.71]
Thus if this was part of a BE trial we would conclude the treatments are BE since the CI is a subset of \([80, 125]\).
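For completeness, here is a one-line sketch of the inclusion check implied by that TOST decision (it assumes the percent_ci vector computed just above and the standard \(80\%\)–\(125\%\) bounds):

# Explicit CI-inclusion check on the percent scale (illustrative).
is_be = 80 ≤ percent_ci[1] && percent_ci[2] ≤ 125
println("Conclude bioequivalence: ", is_be)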
In practice, we do not typically formulate such contrasts ourselves, but rather let the Bioequivalence package do the work for us.
Further, often the contrast in question is quite simple, as it is based on a single coefficient. Details follow. Still, understanding the concept of contrasts is important.
4 Linear model for the \(2 \times 2\) crossover design
The above concepts are now used for our linear model of a \(2 \times 2\) crossover design. Let us explore this on an example dataset and also consider the subtle point of whether or not to drop subjects that do not complete both periods.
Two subtle variants of an example dataset
To explore this let us consider an example dataset appearing in Park, Kim, and Bae (2020):
data_2020_PKB = DataFrame(
    id = [1, 1, 2, 2, 3, 4, 4, 5, 5, 6],
    sequence = ["TR", "TR", "TR", "TR", "TR", "RT", "RT", "RT", "RT", "RT"],
    period = [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    cmax = [269.3, 410.4, 120.2, 137.3, 105.2, 90.9, 68.9, 228.3, 301.5, 105.3],
)
simple_table(data_2020_PKB, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
3 | TR | 1 | 105.2 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
6 | RT | 1 | 105.3 |
While this dataset is unrealistically small, it serves as a good example for understanding the analysis.
Observe that there are only \(6\) subjects, half of which were allocated to the TR sequence and the other half to the RT sequence. Observe also that subjects \(3\) and \(6\) dropped out after the first period, so for subject \(3\) we only have the endpoint result for the T formulation and for subject \(6\) we only have the endpoint result for the R formulation.
Let us also consider a variant of this dataset which removes subjects \(3\) and \(6\) since they did not complete both periods. Such filtering of the dataset can be done more generally as follows:
ids_with_both_periods =
    (@rsubset @combine(@groupby(data_2020_PKB, :id), :count = length(:id)) :count == 2).id
data_2020_PKB_complete_case = @rsubset data_2020_PKB :id in ids_with_both_periods
simple_table(data_2020_PKB_complete_case, round_mode = nothing)
id | sequence | period | cmax |
1 | TR | 1 | 269.3 |
1 | TR | 2 | 410.4 |
2 | TR | 1 | 120.2 |
2 | TR | 2 | 137.3 |
4 | RT | 1 | 90.9 |
4 | RT | 2 | 68.9 |
5 | RT | 1 | 228.3 |
5 | RT | 2 | 301.5 |
Now let’s apply Pumas on both variants of the dataset. First for data_2020_PKB:

be_result = pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
And now for data_2020_PKB_complete_case:

be_result_complete_case = pumas_be(data_2020_PKB_complete_case, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 2 | 2 |
TR | 2 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 184.9 | ||
Geometric Naive Mean | 184.9 | |||
T | Geometric Marginal Mean | 161 | ||
Geometric Naive Mean | 161 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.1476 | |||
Period (p-value) | 0.4684 | |||
Observe that the majority of the output values of pumas_be are exactly the same for both of the above cases.
The one notable difference is in the marginal and naive means for the geometric means of R and T. We discuss these differences when we take a deep dive to understand marginal means in a future unit.
The reason that both outputs are very similar is that when using a Linear model (as we do here), we only use subjects that have evaluable data for both the test and comparator products. That is, the linear model inherently implements a complete case analysis.
This practice is in agreement with Section 2.2.3.2 of International council for harmonisation of technical requirements for pharmaceuticals for human use (2024), which states:
The primary statistical analyses should include all data for all subjects who provide evaluable data for both the test and comparator products.
It is also in agreement with the EMA (2010) document, where Section 4.1.8 states:
… subjects in a crossover trial who do not provide evaluable data for both of the test and reference products (or who fail to provide evaluable data for the single period in a parallel group trial) should not be included.
The same is echoed in EMA (2015) in Section 8.1:
The question of whether to use fixed or random effects is not important for the standard two period, two sequence (2×2) crossover trial. In section 4.1.8 of the guideline it is stated that “subjects in a crossover trial who do not provide evaluable data for both of the test and reference products should not be included. Provided this is followed the confidence intervals for the formulation effect will be the same regardless of whether fixed or random effects are used.”
This aspect of using a fixed effects model (Linear model, as we do here) or a mixed effects model (Mixed model, as is used in the sequel) is also discussed further below. Note that since some guidances rely on SAS code, there is sometimes a discussion about using SAS’s PROC GLM or alternatively SAS’s PROC MIXED. In fact, the FDA guidance FDA (2022) states in Section II.B.2 (Missing Data and Intercurrent Events):
Statistical methods for handling missing data include complete case analysis, available case analysis, weighting methods, imputation, and model-based approaches. For example, in a two-way crossover study, a complete case analysis could be a general linear model as implemented in SAS PROC GLM, which removes all subjects with any missing observations for any variables included in the GLM model (i.e., removes subjects missing one or both periods). An available case analysis could be done using SAS PROC MIXED, which uses all observed data (e.g., in a two-way crossover study, uses all subjects with one or two complete periods of data).
Exploring the linear model used
Since pumas_be uses a Linear model for the analysis, let us understand the components of this model. For now let us stick with the data data_2020_PKB, and return to the smaller data_2020_PKB_complete_case dataset later.
Note that part of the process that pumas_be uses to analyze the data involves internally activating the preprocess_be function.
This function makes slight modifications to the data, including applying a log transformation, removing missing values, and other steps. To see the modified data used in the analysis, you can use the data field of the result of pumas_be.
In our presentation here, let us consider only a subset of the fields in a certain order, [:endpoint, :sequence, :formulation, :period, :id_within_sequence], and then inspect that data:
data_2020_PKB_processed = select(
    be_result.data,
    [:endpoint, :sequence, :formulation, :period, :id_within_sequence],
)
simple_table(data_2020_PKB_processed)
endpoint | sequence | formulation | period | id_within_sequence |
5.6 | TR | 'T' | 1 | 1 |
6.02 | TR | 'R' | 2 | 1 |
4.79 | TR | 'T' | 1 | 2 |
4.92 | TR | 'R' | 2 | 2 |
4.66 | TR | 'T' | 1 | 3 |
4.51 | RT | 'R' | 1 | 1 |
4.23 | RT | 'T' | 2 | 1 |
5.43 | RT | 'R' | 1 | 2 |
5.71 | RT | 'T' | 2 | 2 |
4.66 | RT | 'R' | 1 | 3 |
The linear model we use represents the endpoint as a function of the factors:

- sequence - The categorical variable of the sequence used, with levels RT and TR.
- formulation - The categorical variable of the formulation used, with levels R and T.
- period - The categorical variable of the period, with levels 1 and 2.
- id_within_sequence (nested within sequence) - This is a categorical variable associated with the subject. In our analysis we nest this categorical variable within each sequence.
We can now inspect attributes of the linear model by considering the model field of the result of pumas_be.
Namely, be_result.model contains details of the statistical model (this holds both for the case of linear models and other models used by pumas_be).
In particular, we can use the formula function from GLM to see the statistical formula used for the model (println helps format cleaner output here):
be_result.model |> formula |> println
endpoint ~ 1 + sequence + formulation + period + sequence & id_within_sequence
The formula above shows us that the linear model creates an estimate of the endpoint based on the categorical variables we described above. The term sequence & id_within_sequence implies nesting of id_within_sequence in sequence. We describe the rationale for this term and the others below.
We can also use the coeftable function from GLM to see a summary of the estimated parameters (coefficients) as well as their associated p-values and other information (simple_table helps format cleaner output here):
simple_table(be_result.model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.87 | 0.128 | 38.1 | 0.000688 | 4.32 | 5.42 |
sequence: TR | 0.286 | 0.156 | 1.83 | 0.209 | -0.387 | 0.959 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.47 | -0.811 | 0.535 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.468 | -0.534 | 0.812 |
sequence: RT & id_within_sequence: 2 | 0.704 | 0.143 | 4.93 | 0.0388 | 0.0894 | 1.32 |
sequence: TR & id_within_sequence: 2 | -0.297 | 0.143 | -2.08 | 0.173 | -0.911 | 0.318 |
sequence: RT & id_within_sequence: 3 | -0.209 | 0.181 | -1.16 | 0.367 | -0.986 | 0.568 |
sequence: TR & id_within_sequence: 3 | -0.358 | 0.181 | -1.98 | 0.186 | -1.13 | 0.419 |
Since sequence, formulation, and period each have two levels, they each have a single coefficient estimated. The rationale for this is based on the same approach for categorical variables as presented above.
As for id_within_sequence nested within sequence, it is nesting just as presented above.
The rationale for the linear model
The key to this linear model is the following:

- sequence effects - in case there happen to be any - although we hope there aren’t any.
- formulation effects - this is the most important point. Capturing any effect that the formulation has on the endpoint. A significant effect beyond \(20\%\) (or any other threshold) would imply not being bioequivalent.
- period effects - in case there happen to be any - although we hope there aren’t any.
- id_within_sequence (nested within sequence) effects. These effects are expected, as subjects are all different. The purpose of these effects is to be able to separate out subject effects.
Interpreting the results
Once a linear model is fit, for bioequivalence purposes, we can consider interpretation of the results as composed of two parts:

- Part I: Ensure the model to experiment fit is sensible. Do this by making sure that the experiment does not seem to be influenced strongly by the sequence, period, or other factors.
- Part II: Consider the confidence interval for the GMR to see if it is contained within the bounds (typically \(80\%\) – \(125\%\)).

To explore this, let’s reproduce the output for data_2020_PKB_complete_case for Cmax here:
be_result_complete_case = pumas_be(data_2020_PKB_complete_case, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 2 | 2 |
TR | 2 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 184.9 | ||
Geometric Naive Mean | 184.9 | |||
T | Geometric Marginal Mean | 161 | ||
Geometric Naive Mean | 161 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.1476 | |||
Period (p-value) | 0.4684 | |||
- Part I of the interpretation is to see that sequence and period are not significant. Indeed the p-values are not small. More on such type-III p-values is in a future unit. With this, we get some confidence that the model is sound for further analysis. However, if we had strong evidence of an effect of one or two of these factors (e.g. the period p-value being less than \(5\%\)), then we would further investigate if there are any reasons to believe that strong period effects are present. As you can see, in this case the p-values are 0.1476 and 0.4684 respectively.
- Part II of the interpretation is the key part: we consider the confidence interval for the GMR, namely \([55.16, 137.5]\) in our case here. By comparing it to the \([80, 125]\) bounds we are carrying out a two one-sided T-test (TOST). This test is on the contrast for the difference between the test and reference, and since our confidence interval is a \(90\%\) confidence interval, the comparison is equivalent to a TOST with \(\alpha = 0.05\). As you can see, in this case we Fail as we are not able to reject \(H_0\) of the TOST, and hence we may not conclude bioequivalence. (With such an unrealistically small data example this is not unexpected.)
Let us reproduce the [55.16, 137.5] CI result, centered around the GMR estimate 87.08, directly from the linear model that is used under the hood. Let’s look at the output of the linear model:
simple_table(be_result_complete_case.model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.97 | 0.156 | 31.8 | 0.000989 | 4.3 | 5.64 |
sequence: TR | 0.361 | 0.156 | 2.31 | 0.148 | -0.312 | 1.03 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.47 | -0.811 | 0.535 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.468 | -0.534 | 0.812 |
sequence: RT & id_within_sequence: 2 | 0.599 | 0.111 | 5.42 | 0.0324 | 0.123 | 1.08 |
sequence: TR & id_within_sequence: 2 | -0.475 | 0.111 | -4.3 | 0.0501 | -0.951 | 0.000429 |
As you can see, the formulation effect is in the third row, with a coefficient estimate of -0.138, or more exactly:

β_test = coef(be_result_complete_case.model)[3]
-0.13832798631103677
Now, estimating the difference in means between T and R is a simple (degenerate) form of a contrast. This contrast is just the coefficient for the test.
Hence the estimate of the contrast is just this -0.138 value. To consider it as a GMR, we exponentiate and (after rounding) get the GMR:
round(100exp(β_test), digits = 2)
87.08
Observe that this value exactly agrees with the pumas_be output.
Now to get a confidence interval for this coefficient, we can use the confint function with level = 0.9. This differs from the \(95\%\) confidence intervals that appear by default with the coeftable function. Note that we can use level = 0.9 with coeftable as well, but we are only interested in a \(90\%\) CI for a single coefficient, the third one:
log_scale_ci = confint(be_result_complete_case.model, level = 0.9)[3, :]
log_scale_ci |> println
[-0.5950139887763114, 0.31835801615423787]
This is a confidence interval for the contrast \(\mu_T - \mu_R\). To see it also explicitly as a contrast, you can see we get the same value (only with very minor differences due to a different numerical computation):
c = [0, 0, 1, 0, 0, 0]
L̂ = c' * coef(be_result_complete_case.model)
SE = sqrt(c' * vcov(be_result_complete_case.model) * c)
n = 8 # Observations in the model
p = 6 # Parameters in the linear model
ci_radius = SE * quantile(TDist(n - p), 0.95)
log_scale_ci = [L̂ - ci_radius, L̂ + ci_radius]
log_scale_ci |> println
[-0.5950139887763111, 0.31835801615423753]
Finally we can transform this confidence interval on the log scale to the natural scale with exp:
round.(100exp.(log_scale_ci), digits = 2) |> println
[55.16, 137.49]
As you can see, this value exactly agrees with the CI output of pumas_be.
5 On fixed and random effects
An alternative to the linear model is a mixed effects model (also known as a random effects model) where we treat each subject’s influence as a random component. We dive in greater detail into random effects models when we study replicate designs and reference scaling in the sequel. At this point let us briefly illustrate this model.
Note that once we start to speak about random effects models, we can also call the linear models above fixed effects models.
The basic random effects model that we consider has this formula:
fml = @formula(endpoint ~ 1 + sequence + formulation + period + (1 | id))
println(fml)
endpoint ~ 1 + sequence + formulation + period + :(1 | id)
Here the expression (1 | id) indicates that id is a random effect. This implies that the model does not only involve residual randomness as in the linear model, but also involves a random component per subject. We deal with this model and its variants in greater detail in future units. Very loosely, the model is,
\[ y = \beta_0 + \sum_{i=1}^p \beta_i x^{(i)} + \zeta + \varepsilon. \]
Here most of the model elements are like the fixed effects models, but there is also a random effect component \(\zeta\). This means that every observation of \(y\) doesn’t only depend on the fixed effects, \(\beta_0 + \sum_{i=1}^p \beta_i x^{(i)}\) and the residual noise \(\varepsilon\), but also depends on a random effects component \(\zeta\).
In terms of the @formula, the 1 + sequence + formulation + period component matches the \(\beta_0 + \sum_{i=1}^p \beta_i x^{(i)}\) part, while the (1 | id) component matches the \(\zeta\) part.
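To build some intuition for the \(\zeta\) term, here is a purely illustrative simulation (all names and values are made up and are not course data) in which each subject receives a single draw of \(\zeta\) that is shared across both of that subject's periods:

# Illustrative simulation of a per-subject random effect shared across periods.
Random.seed!(42)
n_subjects = 4
σ_ζ, σ_ε = 0.3, 0.1
ζ = rand(Normal(0, σ_ζ), n_subjects) # one random effect per subject
sim = DataFrame(
    id = repeat(1:n_subjects, inner = 2), # each subject observed in two periods
    period = repeat([1, 2], n_subjects),
)
μ, β_period = 4.8, 0.1 # hypothetical fixed effects
sim.endpoint =
    μ .+ β_period .* (sim.period .== 2) .+ ζ[sim.id] .+ rand(Normal(0, σ_ε), nrow(sim))
simple_table(sim)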
Comparing the random effects and fixed effects models
To analyze the model, let us use the data field from the previous result. Namely, be_result_complete_case.data has the endpoint data on the log scale, as well as other needed fields. In particular, our dataset for the mixed model is:

data_2020_PKB_complete_case_processed =
    select(be_result_complete_case.data, [:endpoint, :sequence, :formulation, :period, :id])
simple_table(data_2020_PKB_complete_case_processed)
endpoint | sequence | formulation | period | id |
5.6 | TR | 'T' | 1 | 1 |
6.02 | TR | 'R' | 2 | 1 |
4.79 | TR | 'T' | 1 | 2 |
4.92 | TR | 'R' | 2 | 2 |
4.51 | RT | 'R' | 1 | 4 |
4.23 | RT | 'T' | 2 | 4 |
5.43 | RT | 'R' | 1 | 5 |
5.71 | RT | 'T' | 2 | 5 |
Now we can fit this model with the lmm function from the MixedModels package. We get into further details, such as the REML = true option, in future units. The result of the fit, model, is amenable to coeftable, similar to the fixed effects linear models used above:

model = lmm(fml, data_2020_PKB_complete_case_processed; REML = true)
simple_table(model |> coeftable, halign = :left)
Name | Coef. | Std. Error | z | Pr(>|z|) |
(Intercept) | 4.97 | 0.552 | 9 | 2.2e-19 |
sequence: TR | 0.361 | 0.765 | 0.471 | 0.637 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.376 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.375 |
As this is a mixed model, T-distributions are no longer used, so you can see the p-values are in terms of a normal distribution (Z). The fixed effects are fit for the (Intercept), the sequence effect, the formulation effect, and the period. However, the \(\zeta\) component, (1 | id), is a random effect, so the subject does not appear in the coefficient table.
To compare, look at the output of the fixed effects model we had before:
simple_table(be_result_complete_case.model |> coeftable, halign = :left)
Name | Coef. | Std. Error | t | Pr(>|t|) | Lower 95% | Upper 95% |
(Intercept) | 4.97 | 0.156 | 31.8 | 0.000989 | 4.3 | 5.64 |
sequence: TR | 0.361 | 0.156 | 2.31 | 0.148 | -0.312 | 1.03 |
formulation: T | -0.138 | 0.156 | -0.884 | 0.47 | -0.811 | 0.535 |
period: 2 | 0.139 | 0.156 | 0.888 | 0.468 | -0.534 | 0.812 |
sequence: RT & id_within_sequence: 2 | 0.599 | 0.111 | 5.42 | 0.0324 | 0.123 | 1.08 |
sequence: TR & id_within_sequence: 2 | -0.475 | 0.111 | -4.3 | 0.0501 | -0.951 | 0.000429 |
You can see that the first four values, 4.97
, 0.361
, -0.138
, 0.139
, appear in both the random effects model and the fixed effects model. So in this sense the models are the same. As for the subject effects, the fixed effects model captures these via the 0.599 and -0.475 coefficient estimates. In contrast, the random effects model does not estimate such coefficients.
Instead, the random effects model estimates the variance (or standard deviation), not just of \(\varepsilon\), but also of \(\zeta\). For this we can apply the VarCorr
function on the model to get the following:
VarCorr(model)
Column | Variance | Std.Dev | |
id | (Intercept) | 0.560657 | 0.748770 |
Residual | 0.048922 | 0.221183 |
You can see that the estimated standard deviation of the residual, 0.221183
, agrees with the estimate for the Variability
appearing in the pumas_be output above. Here is also this value computed directly:
round.(
    sqrt(
        deviance(be_result_complete_case.model) /
        dof_residual(be_result_complete_case.model),
    ),
    digits = 6,
)
0.221183
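For comparison, the residual standard deviation estimated by the mixed model can also be read off directly; one way, assuming the σ property accessor of the MixedModels fit, is:
round(model.σ, digits = 6)   # residual (within-subject) standard deviation from the mixed model fit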
As for the 0.748770 estimated standard deviation of the subject (id), this is the variability of \(\zeta\), which is not estimated in the fixed effects model.
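Although the per-subject effects are not estimated as fixed coefficients, their conditional modes (the so-called BLUPs, one \(\zeta\) value per subject) can still be inspected, for example via the ranef function of MixedModels:
only(ranef(model))   # conditional modes (BLUPs) of the subject random effects - predictions, not fixed-effect estimates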
Note: At the moment pumas_be
does not use a random effects model for \(2 \times 2\) cross over designs. So if you must use such a model, you can use the MixedModels
package as we did here. However, as you can see, the key results (the coefficient estimates and their standard errors) are identical for both models when we use the complete case data (as recommended by the EMA and ICH guidelines).
6 Comparing with SAS Code
Since some regulatory documents refer to SAS code, let us also consider SAS code for the \(2 \times 2\) design.
Basic SAS DATA step
This is one way to set up the data in SAS. We create datasets named data_2020_PKB and data_2020_PKB_complete_case, just like in Pumas. You may consult the SAS documentation for more information on how exactly this code works.
DATA data_2020_PKB;
INPUT SUBJ SEQ $ PRD CMAX;
IF (SEQ='TR' AND PRD=1) OR (SEQ='RT' AND PRD=2) THEN TRT='T';
ELSE TRT='R';
LNCMAX=LOG(CMAX);
DATALINES;
1 TR 1 269.3
1 TR 2 410.4
2 TR 1 120.2
2 TR 2 137.3
3 TR 1 105.2
4 RT 1 90.9
4 RT 2 68.9
5 RT 1 228.3
5 RT 2 301.5
6 RT 1 105.3
;
RUN;
DATA data_2020_PKB_complete_case;
SET data_2020_PKB;
WHERE SUBJ NOT IN (3, 6);
RUN;
After you run the code above in SAS, you have datasets data_2020_PKB
and data_2020_PKB_complete_case
in your workspace and you can then carry out the analysis. However, since SAS is not specifically designed for bioequivalence analysis, its output is less tailored to bioequivalence reporting than that of Pumas. Nevertheless, let us compare these outputs.
We use both PROC GLM
for the linear model and PROC MIXED
for the mixed effects model.
PROC GLM on data_2020_PKB
The linear model is fit via SAS using SAS’s PROC GLM
. In our case, here is the code for PROC GLM
applied to data_2020_PKB
:
PROC GLM DATA=data_2020_PKB;
CLASS SUBJ SEQ PRD TRT;
MODEL LNCMAX = SEQ SUBJ(SEQ) PRD TRT;
RANDOM SUBJ(SEQ) / TEST;
LSMEANS TRT / DIFF=CONTROL("R") CL ALPHA=0.1;
RUN;
Key parts of the output are as follows. We mark key values that have equivalent pumas_be
outputs with light green:
These results are equivalent to the Pumas results which we repeat below for ease of comparison. Here are the key observations:
- The degrees of freedom next to Error and under DF are 2.
- The estimate of the variability appearing as Root MSE is at 0.221183. Observe that this is equivalent to a CV of 22.39 percent, as you can verify with the sigma2geocv function (see the short check after this list).
- The (Type III) p-values for SEQ (sequence), PRD (period), and TRT (formulation) are 0.2088, 0.4684, and 0.4698 respectively, in agreement with the pumas_be output. More on this in a future unit.
- The marginal means for R and T, under LSMEAN, appear on the log scale. The values, 5.07825945 and 4.93993146, exponentiated, agree with the marginal means of pumas_be:
round.(exp.([5.07825945, 4.93993146]), digits = 1) |> println
[160.5, 139.8]
- The point estimate for the geometric mean ratio (GMR) on the log scale is -0.138328. When converted back to the natural scale and represented as a percentage, it again agrees with the pumas_be output:
round(100exp(-0.138328), digits = 2)
87.08
- Finally, and most importantly, the \(90\%\) confidence interval on the log scale is [-0.595006, 0.318351]. Again, when converted to the natural scale and represented as a percentage, it agrees with the pumas_be output (up to rounding):
round.(100exp.([-0.595006, 0.318351]), digits = 2) |> println
[55.16, 137.49]
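Here is the short check mentioned in the list above: converting the Root MSE value (the residual standard deviation on the log scale) to a geometric CV percentage with sigma2geocv.
round(100 * sigma2geocv(0.221183), digits = 2)   # Root MSE expressed as a geometric CV in percent
This returns the 22.39 percent appearing in the pumas_be Variability row.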
Here is the pumas_be
output for ease of comparison:
be_result = pumas_be(data_2020_PKB, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 3 | 2 |
TR | 3 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 160.5 | ||
Geometric Naive Mean | 165.2 | |||
T | Geometric Marginal Mean | 139.8 | ||
Geometric Naive Mean | 147.9 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.2088 | |||
Period (p-value) | 0.4684 | |||
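To tie this table back to the linear model, the confidence interval can be roughly reconstructed from the formulation estimate (-0.138328, from the SAS LSMEANS difference above), its standard error (0.156, from the fixed effects coefficient table earlier), and the t quantile with the 2 residual degrees of freedom; since we plug in rounded values, small rounding differences are expected.
est, se, df = -0.138328, 0.156, 2            # rounded values taken from the outputs above
t = quantile(TDist(df), 0.95)                # 95th percentile of the t distribution with 2 degrees of freedom
round.(100 .* exp.([est - t * se, est + t * se]), digits = 1)
The result is close to the [55.16, 137.5] interval reported above; the small discrepancy comes from using the rounded standard error.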
PROC GLM on data_2020_PKB_complete_case
Now let us also consider the output if we apply it to data_2020_PKB_complete_case
instead. This is the case with subjects 3
and 6
removed. To do this change PROC GLM DATA=data_2020_PKB
to PROC GLM DATA=data_2020_PKB_complete_case
in the SAS code above.
Key parts of the output are as follows. We now mark key values that changed in comparison to the previous SAS output with light yellow:
Here are some observations, where we compare both to the SAS output above and the respective pumas_be
output which we present below.
- Now, as there are two fewer observations in the model, the Model degrees of freedom dropped from 7 to 5.
- The SEQ (sequence) Type III p-value changed and is now at 0.1476. This is in agreement with the pumas_be output for this case.
- The marginal means are also modified and (on the log scale) are now at 5.21993029 for R and 5.08160230 for T. These agree with the pumas_be marginal means:
round.(exp.([5.21993029, 5.08160230]), digits = 1) |> println
[184.9, 161.0]
Here is the pumas_be
output for ease of comparison:
be_result = pumas_be(data_2020_PKB_complete_case, endpoint = :cmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 2 | 2 |
TR | 2 | 2 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: cmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(cmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 184.9 | ||
Geometric Naive Mean | 184.9 | |||
T | Geometric Marginal Mean | 161 | ||
Geometric Naive Mean | 161 | |||
Geometric Mean T/R Ratio (%) | 87.08 | |||
Degrees of Freedom | 2 | |||
90% Confidence Interval (%) | [55.16, 137.5] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 22.39 | 0.2212 | ||
ANOVA | Formulation (p-value) | 0.4698 | ||
Sequence (p-value) | 0.1476 | |||
Period (p-value) | 0.4684 | |||
PROC MIXED
Let us also consider SAS's PROC MIXED. First let us use the data_2020_PKB_complete_case dataset, in which case the linear model (PROC GLM and pumas_be in this case) and the mixed effects model (PROC MIXED, and also the Julia MixedModels fit presented above) yield identical key results.
The code is:
PROC MIXED DATA=data_2020_PKB_complete_case;
CLASS SUBJ SEQ PRD TRT;
MODEL LNCMAX = SEQ PRD TRT;
RANDOM SUBJ(SEQ);
ESTIMATE 'T VS R' TRT -1 1 / CL ALPHA=0.1;
RUN;
Key parts of the output are as follows. We mark key values that have equivalent pumas_be
outputs with light green. We also mark fields that require some discussion with light blue:
Here are some observations:
- The T VS R Estimate of -0.1383 is equivalent to the above. This is the GMR on the log scale.
- Similarly, the Lower and Upper bounds of -0.5950 and 0.3184 are equivalent to the above.
- The Type III p-values for the period (PRD) and formulation (TRT) are also identical to the above, at 0.4684 and 0.4698 respectively.
- To get the variability, we consider the estimated Residual variance, which appears in Covariance Parameter Estimates and is at a value of 0.04892. Observe that the square root of this value is the standard deviation appearing above:
round(100sigma2geocv(sqrt(0.04892)), digits = 2), round(sqrt(0.04892), digits = 4)
(22.39, 0.2212)
- As for the Type III p-value of the sequence at 0.6838, it is different for the mixed model and does not agree with the pumas_be output (or the PROC GLM output).
- Finally, observe the estimated variability (variance) for SUBJ(SEQ) at 0.5607. Its square root closely agrees with the estimated standard deviation of the MixedModels fit above for id, which is at 0.748770:
round(sqrt(0.5607), digits = 5)
0.7488
Finally consider the data_2020_PKB
dataset, which includes subjects \(3\) and \(6\) with partial observations (only the first period and not the second). Here we see different numerical results because PROC MIXED uses data from all subjects, including those with partial observations. We highlight those results in light red:
Here are the key values on the natural scale. As you can see, the values differ somewhat from the linear model, due to the additional partial observations of subjects \(3\) and \(6\).
GMR = round(100exp(-0.1429), digits = 2)
CI = round.(100exp.([-0.5860, 0.3002]), digits = 2)
CV = round(100sigma2geocv(sqrt(0.04701)), digits = 2)
@show GMR
@show CI
@show CV;
GMR = 86.68
CI = [55.65, 135.01]
CV = 21.94
To obtain this output with Pumas we cannot use pumas_be
at this point, but rather need to construct the model via MixedModels
. We discuss this in a future unit.
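As a brief preview of that future unit, here is a minimal sketch of the recipe: take a MixedModels fit, pull the formulation: T estimate and its standard error from the coefficient table, and convert them to a GMR and a 90% interval with a normal quantile. For illustration we apply it to the complete-case mixed model fitted earlier; note that SAS's t-based intervals and degrees-of-freedom choices are among the details deferred to the future unit, so this normal-quantile interval is not expected to match the SAS interval exactly.
ct = coeftable(model)                               # the complete-case mixed model fitted earlier
i = findfirst(==("formulation: T"), ct.rownms)      # row name assumed to match the coefficient table shown above
est, se = ct.cols[1][i], ct.cols[2][i]
z = quantile(Normal(), 0.95)                        # normal quantile, since mixed models report z statistics
gmr = round(100 * exp(est), digits = 2)
ci = round.(100 .* exp.([est - z * se, est + z * se]), digits = 2)
gmr, ci
The same steps, applied to a mixed model fitted on the full data_2020_PKB dataset once it is processed in the same way, are the starting point for reproducing the PROC MIXED style analysis above.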
7 Conclusion
This unit shifted our focus from experimental design to the statistical engine driving bioequivalence assessment: the linear model. We established how this fundamental tool is used to analyze data from the standard \(2 \times 2\) crossover design. The core of the unit involved breaking down the model’s components, starting with the basics of linear regression and progressing to the essential technique of representing formulation
, period
, and sequence
as categorical variables.
Building on these fundamentals, we constructed the specific linear model that accounts for the fixed effects of sequence
, period
, and formulation
, as well as the nested effect of subject
within sequence
. We demonstrated how to interpret the model’s output, showing that the geometric mean ratio (GMR) and its \(90\%\) confidence interval are derived directly from the coefficient for the formulation
effect. The unit also introduced mixed-effects models as an alternative, clarifying that for a complete-case \(2 \times 2\) study, the key results are identical to the fixed-effects approach. Finally, we drew direct parallels to industry-standard SAS procedures like PROC GLM
and PROC MIXED
, solidifying the concepts from a practical and regulatory perspective.
With a solid understanding of how linear models are used to estimate the average treatment effect, we are now equipped to delve deeper into another critical aspect of the analysis: variability. The next unit will focus on the intra-subject coefficient of variation (CV), a key metric derived from the linear model’s error term. We will explore how it is calculated, its significance in study design and power, and its crucial role in the analysis of highly variable drugs (HVDs), which often require more advanced statistical approaches.
8 Unit exercises
Understanding Categorical Variables in GLM
In the section “Categorical Variables in Linear Models”, we used a simple dataset to model AUC based on formulation. Let's explore the meaning of the coefficients.
df = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = [120.0, 125.5, 130.7, 119.2, 125.1, 132.4],
)
model = lm(@formula(AUC ~ 1 + formulation), df)
β̂ = coef(model)
- Based on the output, what is the reference formulation? What do the coefficients β̂[1], β̂[2], and β̂[3] represent in terms of the formulations?
- Using only the β̂ array, manually calculate the predicted AUC for formulation R, formulation S, and formulation T.
- Verify your manual calculations by using the predict function on a new DataFrame containing one of each formulation.
predict(model, DataFrame(formulation = ["R", "S", "T"]))
Reproducing pumas_be Results Manually
The core of this unit is understanding the linear model behind pumas_be for a \(2 \times 2\) crossover. Let's use the data_2020_PKB_complete_case dataset to do this from scratch.
- Using the processed data from be_result_complete_case.data, fit a linear model using GLM.jl's lm function. The formula should be the same as the one used by pumas_be, which you can find in the unit.
- Extract the coefficient corresponding to the formulation: T effect. Use this coefficient to calculate the Geometric Mean Ratio (GMR) as a percentage. Does it match the pumas_be output?
- Use the confint function on your fitted model with level=0.9 to get the 90% confidence interval for the formulation: T coefficient.
- Convert this log-scale confidence interval to the natural scale (as percentages) and confirm that it matches the 90% CI reported by pumas_be.
Calculating Intra-Subject CV
The pumas_be output reports the Variability (CV%), which is the intra-subject coefficient of variation. This value is derived from the residual error of the linear model.
- Using the linear model you fit in Exercise 2, find the residual degrees of freedom using dof_residual(model).
- Find the residual sum of squares (also called deviance) using deviance(model).
- The Mean Squared Error (MSE) is the deviance divided by the degrees of freedom. The residual standard deviation is sqrt(MSE). Calculate this value.
- The Bioequivalence package provides the function sigma2geocv(σ) to convert a standard deviation on the log scale to a CV. Use this function on the residual standard deviation you calculated in part (c). Does it match the Variability (CV%) from the pumas_be output for the data_2020_PKB_complete_case dataset?
Fixed vs. Mixed Models with Missing Data
This unit highlighted that for a \(2 \times 2\) crossover, the linear model (PROC GLM) uses a “complete case” analysis, while a mixed model (PROC MIXED) can use all available data. Let's demonstrate this.
- Remind yourself of the pumas_be results for data_2020_PKB (with dropouts) and data_2020_PKB_complete_case. Confirm that the GMR and 90% CI are identical. Why is this?
- Now, fit a mixed-effects model on both datasets using the lmm function from MixedModels.jl as shown in the unit (@formula(endpoint ~ 1 + sequence + formulation + period + (1 | id))).
- For each of the two mixed models, extract the formulation: T coefficient and its standard error from the coeftable. Manually calculate the 90% confidence interval on the log scale (Hint: use quantile(Normal(), 0.95) for the Z-score).
- Compare the GMR and 90% CI from the two mixed models. Are they still identical? Explain why the mixed model gives different results for the two datasets while the fixed model did not.
Working with Contrasts
In the “Contrasts” section, we created a model for three formulations (R, S, T) and manually calculated the CI for the T vs S comparison. Let's use that same model to test a different hypothesis.
df = DataFrame(
    formulation = ["R", "S", "T", "R", "S", "T"],
    AUC = log.([90.0, 127.9, 125.7, 90.2, 122.1, 119.8]),
)
model = lm(@formula(AUC ~ 1 + formulation), df)
- The estimated effect for S is β̂₀ + β̂₁ and for R is just β̂₀. The difference is β̂₁. Define a contrast vector c that isolates only the β̂₁ coefficient.
- Use your contrast vector c, coef(model), and vcov(model) to calculate the 90% CI for the S vs R comparison on the log scale.
- Convert the CI to the natural scale (as percentages). Would you conclude S and R are bioequivalent based on this result?
Investigating the vcov Matrix
The variance-covariance matrix, obtained with vcov(model), is fundamental to all inference.
- Get the vcov matrix for the linear model from Exercise 2 (data_2020_PKB_complete_case).
- The diagonal elements of this matrix are the variances of the coefficient estimates. Extract the diagonal using diag(vcov(model)).
- Take the square root of these diagonal elements. Compare the resulting values to the Std. Error column from the coeftable(model) output. What do you notice?