A Course on Bioequivalence: Unit 2 - Log transformed data, geometric means, coefficient of variation, and the log-normal distribution

Authors

Yoni Nazarathy

Andreas Noack

This is Unit 2 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.

1 Unit overview

In the previous unit we learned what a bioequivalence study is. In this unit we lay some of the basic groundwork regarding quantities and tools of interest. In particular we consider:

  • Log transformed data
  • The log-normal distribution
  • Geometric means
  • The geometric mean ratio (GMR)
  • Bounds on the GMR
  • Coefficient of variation
  • Maximum likelihood estimation for normal and log-normal data

In this unit we use the following packages, of which the first 4 are Pumas packages (available only with a Pumas installation).

using Pumas # available with Pumas products - also reexports Distributions
using NCA # available with Pumas products
using Bioequivalence  # available with Pumas products
using PharmaDatasets # available with Pumas products
using AlgebraOfGraphics
using CairoMakie
using SummaryTables
using DataFramesMeta
using StatsBase

We have used some of these packages in the previous unit. New to this unit is the StatsBase package, which provides additional basic statistical functionality (not available in the built-in Statistics package). Similarly, the Distributions package provides support for all common probability distributions. While we don’t explicitly have a using Distributions statement, that package is re-exported by Pumas.

Warning

This unit contains more equations and algebra than other units. If you find the notation overwhelming, do not let that deter you from advancing. Keep in mind that some of the mathematical details are only there for further insight and may be skipped on a first reading.

2 From Normal to Log-Normal

Central to statistics is the normal distribution (or Gaussian distribution). Mathematically it is a probability distribution with density,

\[ f_N(u) = \frac{1}{\sqrt{2 \pi\sigma^2}} e^{-\frac{(u-\mu)^2}{2\sigma^2}}. \]

The parameters \(\mu\) and \(\sigma^2\) are the mean and the variance respectively.

A random value distributed according to this distribution can take any value, positive or negative. Practically, it is almost certain to fall in the interval \([\mu - 3 \sigma, \mu + 3 \sigma]\).
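As a quick numerical check of this rule (a minimal sketch using the Normal distribution from Distributions, re-exported by Pumas; the values of \(\mu\) and \(\sigma\) are arbitrary):

μ, σ = 1.5, 0.7 # arbitrary illustrative parameter values
d = Normal(μ, σ)
# probability of falling within μ ± 3σ (≈ 0.9973 for any μ and σ)
@show cdf(d, μ + 3σ) - cdf(d, μ - 3σ);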

Many statistical inference models and procedures assume that the data (or some component of the data) is distributed according to a normal distribution, often with \(\mu\) and \(\sigma^2\) unknown, and estimated. This includes many of the bioequivalence procedures.

In a statistical model, one almost always makes assumptions about distributions. However, certain quantities, such as the endpoints Cmax and AUC, are not distributed according to a normal distribution. That is, for such endpoints, assuming a normal distribution would typically be a highly invalid assumption.

For example, let us look at the R and T data from an arbitrary dataset (this happens to be a dataset of a four period fully replicate design, yet we simply separate out T and R). Here we fit kernel density estimates to the data to get a rough feel for the shape of the distribution. The subtle wiggles in the curves are simply an artifact of estimating the distribution. Importantly, the distributions only take on positive values and are very asymmetric.

df = dataset(joinpath("bioequivalence", "RTTR_TRRT", "SLTGSF2020_DS16"))
@rtransform!(df, :treatment = :sequence[:period])
specs =
    data(df) * mapping(:PK, color = :treatment) * AlgebraOfGraphics.density(bandwidth = 0.5)
draw(specs, axis = (; limits = ((-1, 5), nothing)))

Note that PK just stands for an arbitrary endpoint like Cmax or AUC. Also note that we plot the R and T data separately.

It is clear from the above that the distributions are only of positive values, and that they are asymmetric. In particular these distributions are far from being normal distributions.

As a remedy for the situation, we carry out a log transformation of the endpoint data.

Remember that the logarithm function takes strictly positive inputs and returns values that may be positive, negative, or zero. With such a log transformation, the shape of the distribution changes and becomes much more similar to a normal distribution. Note that the base of the logarithm we use is not of great importance because logarithms of different bases agree up to a constant factor. Still, it is common to use the natural logarithm. Here we just use Julia’s log function, which is the natural logarithm, sometimes denoted in mathematics via \(\ln\), but in our case denoted via \(\log\).

@rtransform!(df, :logPK = log(:PK))
specs =
    data(df) *
    mapping(:logPK, color = :treatment) *
    AlgebraOfGraphics.density(bandwidth = 0.5)
draw(specs, axis = (; limits = ((-3, 3), nothing)))
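As an aside on the choice of base discussed above, here is a small check, using only functions from Julia Base, that logarithms in different bases agree up to a constant factor:

x = 2.5 # an arbitrary positive value
@show log(x) # natural logarithm (base e)
@show log10(x) # base 10 logarithm
@show log(x) / log(10); # same as log10(x): the bases differ by the factor log(10)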

Mathematically this transformation can be viewed as a relationship between two random quantities (random variables), \(X_{L}\) and \(X_N\), where \(X_{L}\) represents the original endpoint and \(X_N\) represents the log transformed endpoint. Specifically,

\[ X_N = \log(X_L), \qquad \qquad \text{or similarly} \qquad \qquad e^{X_N} = X_L. \]

Here it is \(X_N\), the transformed data, which is assumed to have a normal distribution with density as stated above, determined by its mean \(\mu\) and standard deviation \(\sigma\). However, \(X_L\), the original observed endpoint values, are distributed differently; they have the probability density function,

\[ f_L(u) = \frac{1}{u\sqrt{2 \pi\sigma^2}} e^{-\frac{(\log(u)-\mu)^2}{2\sigma^2}}, \qquad \text{for positive}~u. \]

This is the log-normal distribution.

Hence \(X_L\) denotes “log-normal” while \(X_N\) denotes “normal”. Note that the derivation of the density \(f_L(u)\) from the density \(f_N(u)\) relies on some probability theory and elementary calculus. We omit the details.
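Still, as a sanity check, we can compare the formula above with the density of the LogNormal distribution from Distributions (which is parameterized by the same \(\mu\) and \(\sigma\)). This is a minimal sketch with arbitrary values:

μ, σ = 0.4, 1.2
u = 2.5 # an arbitrary positive point at which to evaluate the density
f_manual = exp(-(log(u) - μ)^2 / (2σ^2)) / (u * sqrt(2π * σ^2))
@show f_manual ≈ pdf(LogNormal(μ, σ), u); # true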

As an illustration of the relationship between \(X_N\) (normal) and \(X_L\) (lognormal), consider this numerical experiment on synthetic data.

μ, σ = 0.4, 1.2
X_N_points = rand(Normal(μ, σ), 10^6)
X_L_points = exp.(X_N_points)

df = DataFrame(
    value = vcat(X_N_points, X_L_points),
    variable = vcat(
        fill("X_N_points", length(X_N_points)),
        fill("X_L_points", length(X_L_points)),
    ),
)

plt =
    data(df) *
    mapping(:value, color = :variable) *
    AlgebraOfGraphics.density(npoints = 10^4, bandwidth = 0.1)
draw(plt, axis = (; limits = ((-3, 8), nothing)))

You can see the distribution of the normal random values (\(X_N\)) and compare it with the transformed log-normal values (\(X_L\)). Note that in an endpoint data bioequivalence setting we go the other way around: we convert from \(X_L\) to \(X_N\).

Note that by transforming from \(X_N\) to \(X_L\) we get a different mean and variance. That is, for the random quantity \(X_L\), the parameters \(\mu\) and \(\sigma^2\) are no longer the mean and variance, but rather just parameters of the distribution. This is evident from the sample means and sample standard deviations:

@show mean(X_N_points), std(X_N_points)
@show mean(X_L_points), std(X_L_points);
(mean(X_N_points), std(X_N_points)) = (0.4003250966728072, 1.2000377044416142)
(mean(X_L_points), std(X_L_points)) = (3.0673471602338482, 5.533093186638885)

Mathematically, one can compute the mean and variance of the log-normal distribution; in particular, these formulas hold:

\[ {\mathbb E}[X_L] = e^{\mu + \frac{\sigma^2}{2}}, \qquad \text{and} \qquad \text{Var}(X_L) = (e^{\sigma^2}-1)e^{2\mu + \sigma^2}. \]

You can see that these theoretical values are very much near the estimated values (from the synthetic data):

@show exp(μ + σ^2 / 2)
@show √((exp(σ^2) - 1) * exp(2μ + σ^2));
exp(μ + σ ^ 2 / 2) = 3.0648542032930024
√((exp(σ ^ 2) - 1) * exp(2μ + σ ^ 2)) = 5.500278427964335

Hence if we have endpoint data distributed as \(X_L\) (log-normal), we can transform it to \(X_N\) (normal) and carry out statistical inference to estimate \(\mu\) and \(\sigma\), with the estimates denoted \(\hat{\mu}\) and \(\hat{\sigma}\). Then, if we wanted to, we could get back to the log-normal distribution with these formulas.
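For instance, continuing with the synthetic data from above, here is a minimal sketch where \(\hat{\mu}\) and \(\hat{\sigma}\) are simply the sample mean and sample standard deviation on the log scale:

μ̂, σ̂ = mean(log.(X_L_points)), std(log.(X_L_points))
@show μ̂, σ̂ # close to the true values 0.4 and 1.2
@show exp(μ̂ + σ̂^2 / 2); # estimated mean of X_L, compare with the values above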

In practice, it is common to represent results using geometric means and the coefficient of variation. We outline this approach now.

3 Geometric Means, Arithmetic Means, and Logs

In general, when we have a collection of values, \(u_1,\ldots,u_n\), there are many ways to summarize these values via a measure of centrality (or average). In our case, these two ways are important:

\[ \text{mean}(u_1,\ldots,u_n) = \frac{1}{n} \sum_{i=1}^n u_i, \qquad \qquad \text{geomean}(u_1,\ldots,u_n) = \sqrt[n]{\prod_{i=1}^n u_i}. \]

Note that “mean” here is just short for the arithmetic mean and “geomean” is short for the geometric mean (the big symbol \(\sum\) is for summation and the big symbol \(\prod\) is for product).

Each of these means is suitable for its own context. Arithmetic means, the most popular, are good for additive quantities and geometric means are good for multiplicative quantities. (Note also that with geometric means we typically assume that \(u_i > 0\) while for arithmetic means there is no such restriction.)

Here is an example illustrating the use of each of these means:

println("Arithmetic mean is good for totals:")
weights = [10, 20]
@show mean(weights)
total_weight = mean(weights) * 2
@show total_weight
@show sum(weights)

println("\nGeometric mean is good for rates:")
rates_of_growth_per_year = [1.1, 1.2]
@show geomean(rates_of_growth_per_year)
percentage_growth_after_two_years = geomean(rates_of_growth_per_year)^2
@show percentage_growth_after_two_years
@show prod(rates_of_growth_per_year);
Arithmetic mean is good for totals:
mean(weights) = 15.0
total_weight = 30.0
sum(weights) = 30

Geometric mean is good for rates:
geomean(rates_of_growth_per_year) = 1.1489125293076057
percentage_growth_after_two_years = 1.3199999999999998
prod(rates_of_growth_per_year) = 1.32

Now in our case, we have endpoint data (e.g. AUC). Let’s denote it as,

\[ \text{endpoint data} = \{x_1,\ldots,x_n\}. \]

When we transform the data with a log transformation, we have the data, \[ \text{log transformed endpoint data} = \{\log(x_1),\ldots, \log(x_n)\}. \]

Now look at the arithmetic mean of the log transformed endpoint data:

\[ \begin{align*} \text{mean}(\log(x_1),\ldots, \log(x_n)) &= \frac{1}{n} \sum_{i=1}^n \log(x_i) \\ &= \frac{1}{n} \log \Big(\prod_{i=1}^n x_i \Big)\\ &= \log \Big(\sqrt[n]{\prod_{i=1}^n x_i} \Big)\\ &= \log \Big(\text{geomean}(x_1,\ldots,x_n)\Big). \end{align*} \]

So we see:

The arithmetic mean of the log transformed data is the log of the geometric mean of the original data.

Similarly we have,

\[ e^{\text{mean}(\log(x_1),\ldots, \log(x_n))} = \text{geomean}(x_1,\ldots,x_n). \]

Hence we also see:

The geometric mean of the original data is the exponential of the arithmetic mean of the log transformed data.
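Here is a quick numeric check of this identity using geomean from StatsBase:

x = [1.5, 2.0, 8.0] # arbitrary positive values
@show exp(mean(log.(x))) # exponential of the arithmetic mean of the logs
@show geomean(x); # the geometric mean - identical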

Now from a statistical perspective, because the log transformed data of endpoints such as AUC and Cmax can often be (more) safely assumed to follow a normal distribution, we almost always carry out the bioequivalence statistical analysis on the log transformed data. Yet when presenting the results and summarizing the data, we want to speak about the original data.

For the original data, we then use the geometric mean as the description. This means that we exponentiate a quantity computed on the log scale. In certain cases this quantity is a direct arithmetic mean, but in other cases it can be a statistical point estimate, a confidence interval boundary, or similar (more on this in the sequel).

To see this, let us create the same NCA based dataset that we analyzed in unit 1 as a parallel study, this time focusing only on Cmax.

df = dataset("nca/crossover_study.csv")

@subset!(df, :period .== 1)
@rtransform!(df, :sequence = :sequence == "RT" ? "R" : "T")

pop = read_nca(df; observations = :concentration, group = [:sequence, :period])
nca_result = run_nca(pop; parameters = [:cmax])
be_analysis_data = nca_result.reportdf;

Here is the dataset.

simple_table(be_analysis_data)
id sequence period cmax
1 R 1 8390
2 R 1 8790
3 R 1 6495
4 R 1 5288
5 R 1 8814
6 R 1 6672
7 R 1 9337
8 R 1 5928
9 R 1 12401
10 R 1 7495
11 T 1 7103
12 T 1 5877
13 T 1 9401
14 T 1 8203
15 T 1 5719
16 T 1 12288
17 T 1 8042
18 T 1 6346
19 T 1 7239
20 T 1 7800

Now let us repeat the application of pumas_be (this same output appeared in unit 1):

pumas_be(be_analysis_data, endpoint = :cmax)
Observation Counts
Sequence ╲ Period 1
R 10
T 10
Paradigm: Parallel design
Model: T test (equal variance)
Criteria: Standard ABE
Endpoint: cmax
Formulations: Reference(R), Test(T)
Results(cmax) Assessment Criteria
R Geometric Naive Mean 7735
T Geometric Naive Mean 7610
Geometric Mean T/R Ratio (%) 98.39
Degrees of Freedom 18
90% Confidence Interval (%) [81.66, 118.6] Pass CI ⊆ [80, 125]
Variability CV (%) | σ̂ 23.73 | 0.234

As you can see, we report a Geometric Naive Mean of 7735 and 7610 for the reference and test respectively (the phrase Naive simply indicates it is based on the data and not on some model estimate, as may occur with more complex models discussed in the sequel).

In the case of this simple model, let us verify that these values agree with the arithmetic means of the log transformed data. What pumas_be does is first log transform the data and compute the arithmetic means. These are the values (not visible as output):

arithmetic_mean_log_R = mean(log.(be_analysis_data.cmax[1:10]))
arithmetic_mean_log_T = mean(log.(be_analysis_data.cmax[11:20]))

@show arithmetic_mean_log_R
@show arithmetic_mean_log_T;
arithmetic_mean_log_R = 8.953467566963164
arithmetic_mean_log_T = 8.93727699385348

Then when presenting the summary, the model exponentiates the arithmetic means, and as described above presents them as geometric means.

@show exp(arithmetic_mean_log_R)
@show exp(arithmetic_mean_log_T);
exp(arithmetic_mean_log_R) = 7734.6658858159
exp(arithmetic_mean_log_T) = 7610.445525253932

This is a common bioequivalence pattern, where we work with the log transformed data internally for statistical computations, and then present results in the original scale.

4 The Geometric Mean Ratio (GMR)

At the end of the day, our goal with bioequivalence is to compare the test and the reference. In the common approach we wish to compare means, and this is called average bioequivalence. Now, comparing quantities may be done in two common ways: one way is taking the difference and seeing if it is greater or less than \(0\); another way is taking the ratio and seeing if it is greater or less than \(1\).

For geometric means, the ratio is natural, and the ratio of the two geometric means is called the Geometric Mean T/R Ratio (or GMR - not to be confused with the “Geometric mean of R”). This value in percent is ideally as close to 100% as possible since 100% means that both geometric means are the same.

But observe also,

\[ \text{GMR} = \frac{\text{geomean}(x_1^T,\ldots,x_{n_T}^T)}{\text{geomean}(x_1^R,\ldots,x^R_{n_R})} = e^{\text{mean}(\log(x_1^T),\ldots,\log(x_{n_T}^T)) ~-~ \text{mean}(\log(x_1^R),\ldots,\log(x^R_{n_R}))}. \]

So,

\[ \log(\text{GMR}) = \text{mean}(\log(x_1^T),\ldots,\log(x_{n_T}^T)) ~-~ \text{mean}(\log(x_1^R),\ldots,\log(x^R_{n_R})). \]

Here \(x_i^T\) for \(i=1,\ldots,n_T\) are the test observations and similarly \(x_i^R\) for \(i=1,\ldots,n_R\) are the reference observations.
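As a quick check, we can verify this identity directly on the Cmax data from above (recall that rows 1 to 10 are the reference observations and rows 11 to 20 are the test observations):

gm_R = geomean(be_analysis_data.cmax[1:10])
gm_T = geomean(be_analysis_data.cmax[11:20])
@show gm_T / gm_R # the GMR, ≈ 0.9839
# the same value via the difference of arithmetic means on the log scale
@show exp(mean(log.(be_analysis_data.cmax[11:20])) - mean(log.(be_analysis_data.cmax[1:10])));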

Now indeed, if we consider the inner workings of the model, we can inspect the statistical model via,

be_out = pumas_be(be_analysis_data, endpoint = :cmax)
be_out.model[1][1]
# This use of [1][1] is not done typically,
# but is rather a way of seeing the model output in this case.
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -0.0161906
    95% confidence interval: (-0.242, 0.2096)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8819

Details:
    number of observations:   [10,10]
    t-statistic:              -0.15065912350205451
    degrees of freedom:       18
    empirical standard error: 0.1074649362968388

The full details of this model are not typically important, which is why pumas_be only presents a neater table output as a summary.

Importantly, you can see a point estimate in the model with the line point estimate: -0.0161906.

Let us also compute this same value manually,

round(arithmetic_mean_log_T - arithmetic_mean_log_R, digits = 7)
-0.0161906

And here it is back in the original (geometric mean ratio) scale, presented as a rounded percentage, just like in the output of pumas_be:

round(100exp(arithmetic_mean_log_T - arithmetic_mean_log_R), digits = 2)
98.39

These same transformations also take place for quantities such as confidence interval bounds. We focus more deeply on confidence intervals in the sequel, yet we already consider them here in the context of log transformations.

Let us inspect an output of the pumas_be computation using the result field:

simple_table(be_out.result)
Parameter δ SE lnLB lnUB trt_naiveGM ref_naiveGM GMR LB UB CV
T - R -0.0162 0.108 -0.203 0.17 7610 7735 0.984 0.817 1.19 0.237

In particular, you may see that lnLB = -0.203 and lnUB = 0.17.

These are bounds of a 90% confidence interval (as a side note observe that the 90% interval is narrower than the 95% confidence interval appearing in the model output above). As you can see, we can retrieve the 90% confidence interval on the original scale by exponentiation:

println(round.(100exp.([only(be_out.result.lnLB), only(be_out.result.lnUB)]), digits = 2))
[81.66, 118.55]

Compare this result with the pumas_be output above.

5 Bounds on the GMR

Much of the bioequivalence analysis that we do is average bioequivalence, where we want the geometric mean of R and the geometric mean of T to be close. For this we need to define what we mean by “close”.

Closeness is sometimes characterized by a percentage such as \(20\%\) or \(10\%\), where we want that the difference will “not be more than 20%”, or “not be more than 10%”, etc.

Let’s consider the most common value, \(20\%\), and see how it can be used with the geometric mean ratio (GMR). For this let’s call \(\text{GM}_T\) the geometric mean of the test formulation and \(\text{GM}_R\) the geometric mean of the reference formulation. Hence,

\[ \text{GMR} = \frac{\text{GM}_T}{\text{GM}_R}. \]

Now one natural way (not used today) to quantify a maximal difference of \(20\%\) would be to require that,

\[ 0.8\text{GM}_R \le \text{GM}_T \le 1.2\text{GM}_R, \]

and by dividing the inequality by \(\text{GM}_R\) we see that this is equivalent to,

\[ 0.8 \le \text{GMR} \le 1.2. \]

While this may seem natural, observe what happens if we take logarithms as required for statistical analysis. In particular \(\log(0.8) \approx -0.223144\) and \(\log(1.2) \approx 0.182322\). Hence requiring this from the GMR is equivalent to,

\[ -0.223144 \le \text{mean}(\log(x_1^T),\ldots,\log(x_{n_T}^T)) ~-~ \text{mean}(\log(x_1^R),\ldots,\log(x^R_{n_R})) \le 0.182322. \]

As you can see we get a criterion that is not symmetric on the log scale.

The center of the interval \([-0.223144, 0.182322]\) is at about \(-0.020411\) and one would expect this center to be the most probable case (in fact, once you visit the concept of power you may see it is the point with maximal power). Taking the exponential of this center we get \(0.979796\), which means that with such bounds, the “most powerful center” is at about \(97.98\%\) and not at \(100\%\)!
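This center value is straightforward to check:

center = (log(0.8) + log(1.2)) / 2
@show center # ≈ -0.020411
@show exp(center); # ≈ 0.979796, i.e. about 97.98%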

Due to this asymmetry on the log scale, while early bioequivalence specifications used this approach, the approach recommended today is a bound of the form,

\[ 0.8 \le \text{GMR} \le 1.25. \]

Observe that \(1/0.8 = 1.25\).

While these bounds are not symmetric on the non-log-transformed scale, they still can be interpreted as a 20% bound since \(0.8 \le \text{GMR} \le 1.25\) is equivalent to, \[ 0.8\text{GM}_R \le \text{GM}_T \qquad \text{and} \qquad 0.8\text{GM}_T \le \text{GM}_R. \]

Importantly, they imply that \[ -0.223144 \le \text{mean}(\log(x_1^T),\ldots,\log(x_{n_T}^T)) ~-~ \text{mean}(\log(x_1^R),\ldots,\log(x^R_{n_R})) \le 0.223144, \] which is centered at \(0\). This puts \(\text{GMR} = 1\), as the central point from a statistical perspective.
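The symmetry on the log scale is easy to verify numerically:

@show log(0.8) # ≈ -0.223144
@show log(1.25) # ≈ 0.223144
@show log(0.8) + log(1.25); # ≈ 0 (up to floating point) since 1.25 = 1/0.8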

Finally note that if we want the more stringent \(10\%\) difference and not the \(20\%\) difference the bounds are \[ 0.9 \le \text{GMR} \le 1.1111, \]

where we observe that \(1/0.9 = 1.1\bar{1}\). On the log scale these bounds translate to

\[ -0.105361 \le \log(\text{GMR}) \le 0.105361. \]

All of these above values are common in bioequivalence analysis. We finally mention that in the sequel, where we expand beyond average bioequivalence to reference scaling methods, we’ll see that the concepts of bounds can also take variability of the product into account.

6 Coefficient of variation

The variability of a distribution is very naturally described by the variance or, alternatively, by its square root, the standard deviation. The standard deviation is useful because it is in the same units as the quantity measured, and in the case of a normal distribution, we know to expect almost all observations to be within \(\pm 3\) standard deviations.

However, it is sometimes convenient to use a unit free, scaled version of the standard deviation which is called the coefficient of variation. In particular if the mean (this is the mathematical mean or expectation) of a random quantity \(X\), is \({\mathbb E}[X]\) and the variance is \(\text{Var}(X)\), then the coefficient of variation is,

\[ \text{CV}(X) = \frac{\sqrt{\text{Var}(X)}}{{\mathbb E}[X]} = \frac{\text{standard deviation}}{\text{mean}}. \]

The usefulness of using the coefficient of variation instead of the standard deviation is that it is comparable across scales. Pharmacology experts thus get a sense of what are high CV values and what are low ones, independently of the units used.
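As a small example, here is the sample coefficient of variation of a few of the reference Cmax values from above, computed directly and via StatsBase’s variation function:

x = [8390, 8790, 6495, 5288, 8814] # a few of the R Cmax values from above
@show std(x) / mean(x) # ≈ 0.21
@show variation(x); # the same quantity via StatsBase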

As a general rule, in bioequivalence analysis we can expect CV values to range from 0.05 to 1.5, with values greater than 0.3 typically interpreted as exhibiting a high level of variability.

Now if we return to the log-normal distribution and its formulas for the mean and variance, we may compute the CV in terms of the parameters \(\mu\) and \(\sigma\). In fact, in this case the CV is sometimes called the geometric CV, yet we just call it “CV” in most cases.

In particular for the log normal distribution using the formulas for the mean and variance from above,

\[ \text{CV}(X) = \frac{\sqrt{\text{Var}(X)}}{{\mathbb E}[X]}= \frac{\sqrt{(e^{\sigma^2}-1)e^{2\mu + \sigma^2}}}{e^{\mu + \frac{\sigma^2}{2}}} = \sqrt{e^{\sigma^2}-1}. \]

Interestingly we see that the CV does not depend on \(\mu\) but only on \(\sigma\). This is a desirable attribute. We can then easily transform from \(\sigma\) to \(\text{CV}\):

\[ \text{CV} =\sqrt{e^{\sigma^2}-1}. \]

Going the other way around, from \(\text{CV}\) to \(\sigma\), we can solve the above equation to obtain,

\[ \sqrt{\log(\text{CV}^2 + 1)} = \sigma. \]
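As a minimal sketch, these conversions can be implemented directly (the helper names below are hypothetical; the actual package functions are introduced next):

cv_from_sigma(σ) = sqrt(exp(σ^2) - 1) # hypothetical helper, cf. sigma2geocv
sigma_from_cv(cv) = sqrt(log(cv^2 + 1)) # hypothetical helper, cf. geocv2sigma
@show cv_from_sigma(0.234) # ≈ 0.2372
@show sigma_from_cv(0.2373); # ≈ 0.234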

As a convenience, these transformations are also available with the Bioequivalence package via the geocv2sigma and sigma2geocv functions. For example, in the output for the parallel design, the last row is `Variability CV (%) | σ̂   23.73 | 0.234`. This means that the CV is estimated at 23.73% and the corresponding estimate for \(\sigma\) is 0.234. You can see that these values match as follows:

round(geocv2sigma(23.73 / 100), digits = 3)
0.234
100round(sigma2geocv(0.234), digits = 4)
23.72

Note the very minor differences due to rounding.

7 Maximum likelihood estimation for the log-normal distribution

Statistical inference, both in bioequivalence and more generally, often requires estimating the parameters of the underlying distribution. Here let’s discuss estimation for the log-normal distribution. A fundamental estimation method is maximum likelihood estimation (MLE), which chooses the parameter values that make the observed data most probable under the assumed model.

Recall, if our endpoint data for a treatment (T or R), denoted \(x_1, \ldots, x_n\), are assumed to be independent and log-normally distributed, then their logarithms \(\log(x_i)\) are independent and normally distributed with mean \(\mu\) and variance \(\sigma^2\).

In MLE, we begin by writing down the likelihood function, which expresses the probability (density) of observing the actual data as a function of the distribution’s parameters. For the log-normal case, this means finding the likelihood of observing our endpoint data given unknown parameters \(\mu\) and \(\sigma^2\), which fully determine the shape and location of the distribution. The goal is to choose the values of \(\mu\) and \(\sigma^2\) that maximize this likelihood, making the observed data as “likely” as possible under the model. This approach leads directly to concrete formulas for estimating these parameters.

The likelihood function for the observed data, as a function of \(\mu\) and \(\sigma^2\), is given by: \[ L(\mu, \sigma^2\,|\,x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{x_i \sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\log x_i - \mu)^2}{2\sigma^2}\right). \]

With maximum likelihood we now have an optimization problem of finding the \(\mu\) and \(\sigma^2\) that maximize this likelihood function. In certain cases, this optimization is solved numerically, but in other cases, such as here, we can find a closed form solution.

In any case, it is common to consider the log-likelihood, which is the logarithm of the likelihood function. We can optimize the log-likelihood instead of the likelihood and the optimizers will be the same. It is denoted via \(\ell(\mu, \sigma^2)\), and is,

\[ \ell(\mu, \sigma^2) = -n \log(\sigma) - n \log(\sqrt{2\pi}) - \sum_{i=1}^n \log x_i - \frac{1}{2\sigma^2} \sum_{i=1}^n (\log x_i - \mu)^2. \]

Now with some calculus, it is possible to take derivatives of the log-likelihood, equate them to zero, and solve for \(\mu\) and \(\sigma^2\). It is also possible to check second order conditions to verify it is a maximum (we omit these details). Importantly, after such a calculation, the MLEs for the parameters are:

The parameter \(\mu\) is estimated as the arithmetic mean of the log-transformed data:

\[ \hat\mu = \frac{1}{n} \sum_{i=1}^n \log x_i \]

The parameter \(\sigma^2\) is estimated as the sample variance of the log-transformed data:

\[ \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n (\log x_i - \hat\mu)^2 \]

(Note: For unbiasedness, sometimes \(1/(n-1)\) is used in place of \(1/n\).)

These estimators are direct consequences of the normal likelihood for the transformed data.
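Here is a minimal sketch of these estimators applied to synthetic log-normal data:

x = exp.(rand(Normal(0.4, 1.2), 10^5)) # synthetic log-normal sample
μ̂ = mean(log.(x))
σ̂² = mean((log.(x) .- μ̂) .^ 2) # the 1/n MLE version; var(log.(x)) would use 1/(n-1)
@show μ̂, σ̂²; # close to μ = 0.4 and σ² = 1.44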

Geometric mean and the MLE

Recall from the previous section that the geometric mean of \(x_1,\ldots,x_n\) is related to \(\hat\mu\):

\[ \text{Geometric Mean} = \left( \prod_{i=1}^n x_i \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^n \log x_i \right) = \exp(\hat\mu). \]

Thus, the MLE for the location parameter in the log-normal model corresponds to the logarithm of the geometric mean of the data.

This tight relationship further underscores why geometric means, and not arithmetic means, are the sensible notion of “average” for log-normally distributed quantities such as pharmacokinetic endpoints.

8 Conclusion

This unit of the bioequivalence analysis course introduces foundational statistical concepts essential for bioequivalence studies, specifically focusing on the use of log-transformed data, the log-normal distribution, geometric means, geometric mean ratios (GMR), and the coefficient of variation (CV).

It explains why pharmacokinetic endpoints like Cmax and AUC are better modeled by log-normal rather than normal distributions and demonstrates how log transformation allows for more appropriate statistical analysis. The unit details how geometric means (derived from log-transformed arithmetic means) provide a meaningful central tendency for multiplicative data, and clarifies the interpretation and calculation of the GMR and its confidence intervals, which are central to average bioequivalence assessment. In addition, it covers the significance of bounds for the GMR (such as the equivalence range 0.8–1.25), and introduces the CV as a unit-free measure of variability, highlighting its role and calculation in the context of log-normal data. Throughout, practical examples using Pumas and relevant packages reinforce the theoretical discussion by analyzing both real and synthetic datasets.

9 Unit exercises

  1. Log-Transformation and Normality

    Using the provided pharmacokinetic endpoint dataset (e.g., be_analysis_data for Cmax), log-transform the data for both Test and Reference treatments. Create histogram or kernel density plots for:

    1. the original data, and
    2. the log-transformed data.

    Visually assess and comment on the distributional shape before and after transformation. Is the log-transformed data approximately normal? Note that in standard bioequivalence practice, such “checking” of normality is not carried out.

  2. Geometric Mean Calculation

    Given the following data for a Test group: \(x = [6200, 7800, 9900, 8500, 7400, 6700, 8000, 9200, 8800, 7600]\)

    1. Compute the arithmetic mean of the original data.
    2. Compute the geometric mean (you may use Julia’s geomean() or compute it manually).
    3. Explain in which context each mean is most meaningful for pharmacokinetic endpoints like Cmax.
  3. MLE for Log-Normal Parameters

    Implement the MLE estimators for \(\mu\) and \(\sigma^2\) for the log-normal distribution using the data from Exercise 2. Provide the code and final estimated values of \(\hat{\mu}\) and \(\hat{\sigma}^2\).

  4. GMR and Log Difference

    Let the sample means of log-transformed Cmax values for Test and Reference be \(\bar{x}_T = 8.948\) and \(\bar{x}_R = 8.970\), respectively.

    1. Compute the geometric mean ratio (GMR) of Test to Reference.
    2. Express the GMR as a percentage.
    3. What does this value mean in a bioequivalence analysis?
  5. CV and \(\sigma\), Conversion Practice

    For a log-normal endpoint, suppose the estimated \(\sigma\) from the log-transformed data is \(0.21\).

    1. Calculate the coefficient of variation (CV) as a percentage.
    2. If the estimated CV was 36%, what would be the corresponding \(\sigma\)? (You may use the formulas or Julia functions provided in the unit.)
  6. Equivalence Bounds and Interpretation

    Explain why the commonly used bioequivalence bounds for GMR are \([0.8, 1.25]\), rather than \([0.8, 1.2]\) as might seem natural. Show by computation (in Julia or by calculator) that \(\log(0.8)\) and \(\log(1.25)\) are equidistant from zero, and interpret the practical meaning of this in the context of statistical analysis on the log scale.
