A Course on Bioequivalence: Unit 9 - Nonparametric Analysis

Authors

Yoni Nazarathy

Andreas Noack

This is Unit 9 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.

1 Unit overview

In Unit 5, we established the linear model as the workhorse for analyzing pharmacokinetic (PK) endpoints like AUC and Cmax. These models, known as parametric models, rely on the assumption that the data (or the log-transformed data) follows a specific distribution, typically the log-normal distribution.

However, not all PK endpoints fit this mold. In particular, a popular endpoint, the time to maximum concentration (Tmax), often violates these distributional assumptions.

When we carry out bioequivalence analysis for such an endpoint we often need to resort to a nonparametric test that does not rely on an explicit distributional assumption. In this unit, we cover the most common such test, the Wilcoxon Signed-Rank Test, and we see how to use it with Pumas.

In particular we consider:

  • The reasons Tmax analysis deviates from AUC and Cmax analysis.
  • A simple application of pumas_be for such an endpoint.
  • An outline of the ideas of the Wilcoxon Signed-Rank Test (or model).

We use the following packages.

using Pumas # available with Pumas products
using NCA # available with Pumas products
using Bioequivalence  # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using DataFramesMeta
using StatsBase
using CairoMakie
using SummaryTables
using AlgebraOfGraphics
using Random

We have used all of these packages in previous units.

2 Why Tmax needs a different approach

The statistical methods we used for AUC and Cmax, namely the linear model on log-transformed data, are powerful and precise. However, their validity hinges on the assumption that the log-transformed endpoint follows a normal distribution; see Unit 2.

Yet when the endpoint of interest is Tmax, the time of maximal concentration, there are several challenges to this assumption:

  • Discrete Nature: Tmax is not a continuous measurement. Its value is determined by the study’s sampling schedule. For example, if blood samples are taken at 0, 1, 2, 4, and 8 hours, Tmax can only be one of these values. This discreteness makes a continuous distribution like the normal distribution a poor fit.
  • Skewed Distribution: The distribution of Tmax values across subjects is often highly skewed and irregular. Unlike AUC and Cmax, applying a log transformation does not reliably “fix” this issue.

To see this, consider the following dataset (where we present the first 10 rows only):

PJ2017_3_16_data = dataset("bioequivalence/RT_TR/PJ2017_3_16")
simple_table(first(PJ2017_3_16_data, 10))
id sequence period Tmax
1 RT 1 0.5
1 RT 2 0.5
2 TR 1 1
2 TR 2 1
3 TR 1 0.5
3 TR 2 0.5
4 RT 1 0.5
4 RT 2 1
5 RT 1 1.5
5 RT 2 0.25

Let us look at the distribution of the Tmax endpoint. As you can see below, it takes on only a few possible values:

counts = combine(groupby(PJ2017_3_16_data, :Tmax), nrow => :Count)
sort!(counts, :Tmax)
simple_table(counts)
Tmax Count
0.25 5
0.5 24
1 21
1.5 11
2 2
4.02 1

We can also plot the distribution of these values:

plt = data(counts) * mapping(:Tmax, :Count) * visual(BarPlot)
draw(plt)

With a discrete distribution such as this, even after a log transformation it is not valid to assume that the data is normally distributed.

Another example, directly carrying out NCA

As another example, let us carry out NCA directly on the concentration data, similarly to what we did in Unit 1. In this case we take the nca/crossover_study.csv dataset and, for purposes of illustration, only consider the first period, thus treating it as a parallel design. We then plot the concentration-time profiles, focusing on the first 10 hours.

raw_data = dataset("nca/crossover_study.csv")
@subset!(raw_data, :period .== 1) # We only consider the first period and thus treat it as a parallel study
@rtransform!(raw_data, :sequence = :sequence == "RT" ? "R" : "T")
plt = data(raw_data) * mapping(:time, :concentration, color = :sequence) * visual(Lines)
draw(
    plt,
    scales(Color = (; palette = ["R" => :green, "T" => :red]));
    axis = (
        xlabel = "Time (hrs)",
        ylabel = "Concentration",
        title = "Concentration-Time Profiles (focusing on first 10 hours)",
        limits = ((0, 10), nothing),
    ),
)

Looking at the above plot, it should be evident that Tmax only attains a small number of values. Here we compute it via the run_nca function:

raw_data = dataset("nca/crossover_study.csv")
pop = read_nca(raw_data; observations = :concentration, group = [:sequence, :period])
nca_result = run_nca(pop; parameters = [:tmax])
be_analysis_data = nca_result.reportdf

And indeed, if we now count unique values we see the following:

be_analysis_data = nca_result.reportdf
counts = combine(groupby(be_analysis_data, :tmax), nrow => :Count)
sort!(counts, :tmax)
simple_table(counts)
tmax Count
0.75 2
1 6
2 27
4 5

Thus, in summary, we require a different approach for bioequivalence analysis of such an endpoint.

Beware: Linear models still “appear to work” even when not suitable

Going back to PJ2017_3_16_data, suppose we insist on using the linear model bioequivalence approach, even though the endpoint is Tmax. In this case we can use the nonparametric = false option with pumas_be:

pumas_be(PJ2017_3_16_data, nonparametric = false, endpoint = :Tmax)
Observation Counts
Sequence ╲ Period 1 2
RT 17 17
TR 15 15
Paradigm: Non replicated crossover design
Model: Linear model
Criteria: Standard ABE
Endpoint: Tmax
Formulations: Reference(R), Test(T)
Results(Tmax) Assessment Criteria
R Geometric Marginal Mean 0.7529
Geometric Naive Mean 0.7535
T Geometric Marginal Mean 0.7957
Geometric Naive Mean 0.7967
Geometric Mean T/R Ratio (%) 105.7
Degrees of Freedom 30
90% Confidence Interval (%) [81.7, 136.7] Fail CI ⊆ [80, 125]
Variability CV (%) | σ̂ 66.52 | 0.6053
ANOVA Formulation (p-value) 0.7181
Sequence (p-value) 0.8245
Period (p-value) 0.9511

While in this case BE fails, with different data it could have passed!

Yet the result does not let us know that we are using the “wrong model”. That is, none of the numerical outputs indicate that we are carrying out a conceptual error.

A naive (and wrong) approach would be to consider the Formulation, Sequence, and Period p-values and rely on them not being small as an indication of adequate model fit. But remember that these p-values are computed under the normal distribution modelling assumptions, and it is exactly those assumptions that are violated.
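
A more telling check is to look at the data itself. As a rough illustrative sketch (not part of the pumas_be output), a normal Q-Q plot of the log-transformed endpoint, made with CairoMakie's qqnorm, shows a stair-step pattern far from the straight line expected of normally distributed data:

# Illustrative only: the discreteness of Tmax produces a stair-step Q-Q plot
qqnorm(
    log.(PJ2017_3_16_data.Tmax);
    axis = (xlabel = "Theoretical quantiles", ylabel = "log(Tmax) sample quantiles"),
)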

In the statistics and data world there is a common saying that “all models are wrong, but some are useful”. However, for the reasons described above, in this case our model is not just “wrong”, it is most probably “very wrong”. As such it is not useful, and using it can be harmful. Hence we need an alternative, and there are better models for this kind of data.

3 Carrying out a nonparametric test

The suitable alternative is the nonparametric approach.

In this case we apply pumas_be to the endpoint data without nonparametric = false. While false is the default option in most cases, when pumas_be sees that the endpoint name is :Tmax it carries out a nonparametric analysis. (If the endpoint name differs from :Tmax and we still want such an analysis, use nonparametric = true to explicitly invoke a nonparametric test.)

Here it is:

pumas_be(PJ2017_3_16_data, endpoint = :Tmax)
Observation Counts
Sequence ╲ Period 1 2
RT 17 17
TR 15 15
Paradigm: Nonparametric analysis
Model: Signed rank test
Criteria: Standard ABE
Endpoint: Tmax
Formulations: Reference(R), Test(T)
Results(Tmax) Assessment Criteria
T 90% Confidence Interval (%) [81.64, 141.5] Fail CI ⊆ [80, 125]

As you can see in the output, the paradigm is Nonparametric analysis and the associated model is Signed rank test. The output then just consists of a \(90\%\) confidence interval, [81.64, 141.5]. This confidence interval is based on very minimal assumptions about the data, and no explicit distributional assumptions.

The TOST-like approach is still used, and this confidence interval indicates Fail in this case since it is not contained in [80, 125].
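
In code terms, the decision reduces to a simple containment check. Here is an illustrative sketch using the interval printed above:

ci_lower, ci_upper = 81.64, 141.5                    # the 90% CI (%) reported above
passes_be = (ci_lower >= 80) && (ci_upper <= 125)    # false, hence Fail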

Interestingly, the nonparametric case does not have many of the outputs we see with linear and mixed models.

Supported designs

The nonparametric approach works for any design where a subject is administered both the R and the T formulation. It is most commonly used for a \(2 \times 2\) crossover design.

Under the hood in pumas_be, the data still undergoes a log transformation (unless logtransformed = true is used). Yet in principle this does not matter with a nonparametric approach.

In such a \(2 \times 2\) crossover design, the data is taken as paired, where a subject’s R and a subject’s T are compared. In that sense the fact that there are two sequences (RT and TR) is irrelevant. All that matters is the pairing of the effect of R and T per subject.

If the design is more complex, as in the case of a partial replicate or full replicate design, the mean of the (log-transformed) data is used for each formulation within a subject. So, for example, a subject that has two R periods and one T period would have the R data be the mean of the two R periods.
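
For instance, with hypothetical log-scale values for a subject dosed twice with R and once with T, the pairing would use the mean per formulation:

subject_R = [0.40, 0.52]       # hypothetical log-transformed endpoint, two R periods
subject_T = [0.61]             # one T period
paired_R = mean(subject_R)     # 0.46 is used as this subject's R value
paired_T = mean(subject_T)     # 0.61 is used as this subject's T value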

To see the type of transformation that takes place under the hood, consider this code:

data_preprocessed = preprocess_be(PJ2017_3_16_data, endpoint = :Tmax)
data_paired = combine(
    groupby(
        subset(data_preprocessed, :formulation => ByRow(in(['T', 'R']))),
        [:id, :formulation],
    ),
    :endpoint => mean,
    renamecols = false,
)
data_paired = unstack(data_paired, :formulation, :endpoint)
dropmissing!(data_paired)
simple_table(data_paired)
id R T
1 -0.693 -0.693
2 0 0
3 -0.693 -0.693
4 -0.693 0
5 0.405 -1.39
6 -0.693 0
7 -1.39 0
8 0 -0.693
9 -1.39 0.405
10 0 0.405
11 -0.693 0
12 0 0
15 0.405 -0.693
16 0.405 0.693
17 0.405 0
18 -0.693 0
19 0.405 -0.693
20 -0.693 0
21 -0.693 -0.693
22 1.39 0.693
23 -0.693 -0.693
24 0 -0.693
25 0 -1.39
26 -1.39 -0.693
27 0 -0.693
28 0 0.405
29 0 0.405
30 0 -0.693
31 -0.693 0.405
34 -0.693 0.405
35 0 -0.693
36 -0.693 0

Data such as this is then fed into the nonparametric model.

Note that there are 32 observations here as subjects [13, 14, 32, 33] are missing.

num_observations = nrow(data_paired)
32
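
As a quick sanity check (a small sketch), we can confirm which subject ids were dropped by dropmissing! by comparing the paired data with the original dataset:

# Subject ids present in the original data but not in the paired data
setdiff(unique(PJ2017_3_16_data.id), data_paired.id)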

It is also important to notice whether there are ties (and sometimes to count them). A tie occurs when two or more subjects have exactly the same difference. Here is one way to check for ties:

D = data_paired.R - data_paired.T
@show length(D)
@show length(unique(D))
has_ties = length(D) > length(unique(D))
@show has_ties;
length(D) = 32
length(unique(D)) = 13
has_ties = true

In the bioequivalence context with Tmax data, we are often bound to have ties since the number of possible values for Tmax is typically small.
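
If we also want to count the ties, one simple sketch is to tabulate the differences with countmap from StatsBase (loaded above) and keep only the values that occur more than once:

# Tabulate each difference value; a value occurring more than once is a tie
tie_table = filter(p -> p.second > 1, countmap(D))
@show tie_table
@show sum(values(tie_table));  # number of observations involved in some tie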

Information from the model field

Sometimes, also for exploratory purposes, we may be interested in more information. For this we can inspect the model field of the output of pumas_be. For example:

result = pumas_be(PJ2017_3_16_data, endpoint = :Tmax)
result.model[1][1]
Approximate Wilcoxon signed rank test
-------------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          0.0
    95% confidence interval: (-0.2027, 0.3466)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8774

Details:
    number of observations:      32
    Wilcoxon rank-sum statistic: 182.0
    rank sums:                   [182.0, 169.0]
    adjustment for ties:         1848.0
    normal approximation (μ, σ): (6.5, 38.8812)

We see this is an Approximate Wilcoxon signed rank test and we see some interesting information under Details.

The information dealing with the 95% confidence interval and the p-value is not as important for us, since the confidence interval for bioequivalence is typically a \(90\%\) confidence interval and not a \(95\%\) one. Similarly, the p-value is for a hypothesis test where \(H_0\) is equality of R and T, whereas in the bioequivalence context the null hypothesis is the opposite (inequivalence).

Information that might be important is the Wilcoxon rank-sum statistic, rank sums, adjustment for ties, and normal approximation (μ, σ). For example, in certain cases you may be required to report these values to the agency.

You can also see that from the model’s perspective there are 32 observations, in agreement with the data transformations we carried out above.
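
If you need these quantities programmatically, and assuming the stored object follows the HypothesisTests.jl interface (which its printed form suggests), the standard accessors of that package should work. This is only a sketch; note that HypothesisTests is not among the packages loaded at the top of this unit, so we load it explicitly:

using HypothesisTests       # assumption: the model object is a HypothesisTests.jl test
signed_rank_test = result.model[1][1]
pvalue(signed_rank_test)    # the two-sided p-value shown in the printed summary
confint(signed_rank_test)   # the (default 95%) confidence interval on the log scale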

Approximate vs. Exact tests

You can see above that the model used an Approximate test. This means that an asymptotic normal approximation was used to compute the reported confidence interval.

In general, the heuristic choice of whether to use an approximate or an exact analysis is an implementation detail and is influenced by the number of ties and the number of observations.

To see this, let us create no_tie_data, which for all intents and purposes is identical to the original data. We simply add negligible noise to the Tmax values (normally distributed with mean \(0\) and standard deviation \(10^{-8}\)). Now when we run pumas_be on this data, observe that the results are effectively identical:

Random.seed!(0)
no_tie_data = copy(PJ2017_3_16_data)
no_tie_data.Tmax = no_tie_data.Tmax + rand(Normal(0, 1e-8), length(no_tie_data.Tmax));

result = pumas_be(no_tie_data, endpoint = :Tmax)
Observation Counts
Sequence ╲ Period 1 2
RT 17 17
TR 15 15
Paradigm: Nonparametric analysis
Model: Signed rank test
Criteria: Standard ABE
Endpoint: Tmax
Formulations: Reference(R), Test(T)
Results(Tmax) Assessment Criteria
T 90% Confidence Interval (%) [81.64, 141.5] Fail CI ⊆ [80, 125]

As you can see, the confidence interval of [81.64, 141.5] exactly agrees with the one above (to this level of rounding).

Now if we consider the model we see that an Exact Wilcoxon signed rank test was chosen.

result.model[1][1]
Exact Wilcoxon signed rank test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          2.6595e-8
    95% confidence interval: (-0.2027, 0.3466)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.5361

Details:
    number of observations:      32
    Wilcoxon rank-sum statistic: 298.0
    rank sums:                   [298.0, 230.0]
    adjustment for ties:         0.0

You can see that normal approximation (μ, σ) is no longer present, and since we eliminated all ties, the adjustment for ties is 0.0.

4 About the Wilcoxon Signed-Rank Test

Let us now present a very general overview of the Signed-Rank Test. For further details, see Hollander, Wolfe, and Chicken (2013), a general resource on classical nonparametric statistics. We also recommend the notes by Geyer (2003).

The phrase “nonparametric” aims to distinguish this approach from approaches that rely on a specific probability distribution (e.g. the normal distribution).

Nonparametric methods are often called “distribution-free” because they make far fewer assumptions about the data’s distribution. Instead of estimating parameters like the mean, they often work with simpler properties of the data, such as the order of the values (ranks) or their signs (positive or negative). This makes them very robust and useful when the assumptions of parametric tests are violated, which is precisely the case with Tmax.

The trade-off is that if the data does meet the assumptions for a parametric test, the nonparametric equivalent will generally have less statistical power (i.e., it’s slightly less likely to detect a true difference if one exists).

A minimal set of assumptions

While nonparametric tests are “distribution-free,” they are not “assumption-free.” The Wilcoxon Signed-Rank Test, when applied in the context of bioequivalence, relies on a minimal but important set of assumptions for its validity:

  1. Paired Data: The data must consist of paired observations. These are the test (T) and reference (R) measurements for the same subject. pumas_be handles this by computing the difference in Tmax for each subject.
  2. Independence of Differences: The differences calculated for each pair (i.e., for each subject) must be independent of the differences from other subjects. This is typically satisfied by the study design.
  3. Symmetric Distribution of Differences: This is the most critical assumption. The test assumes that the distribution of the differences is symmetric around its median.
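
As a rough visual check of assumption 3 (an illustrative sketch, not a formal test), we can plot a histogram of the per-subject differences computed from the paired data above and eyeball whether it is roughly symmetric around its median:

# Differences of the (log-transformed) paired data from the preprocessing step above
D = data_paired.R - data_paired.T
hist(D; bins = 15, axis = (xlabel = "Difference (log scale)", ylabel = "Count"))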

A statistic based on ranks

A statistic, sometimes called a test statistic, is a value calculated from the data. For example, we have seen the \(T\) statistic in the case of the T-test.

For the signed rank test, our test statistic is based on ranks. The idea is to use only order information and not explicit magnitudes. With this, the specific distribution of the data does not affect the distribution of the statistic.
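
To see that ranks carry only order information, consider this small sketch with hypothetical values. Applying a strictly increasing transformation (here exp) changes the magnitudes but leaves the ranks unchanged; we use tiedrank from StatsBase (loaded above):

x = [0.3, -1.2, 2.5, 0.0, -0.4]    # hypothetical values
tiedrank(x) == tiedrank(exp.(x))   # true: a monotone transformation preserves ranks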

In general, for any statistical test (or model) if we are able to

  1. compute the statistic (quite) easily; and
  2. work with the distribution of the statistic under some null hypothesis,

then we can carry out hypothesis tests and confidence intervals.

In the case of the signed rank test (not in the context of bioequivalence), we assume there is a center for the differences, \(\mu\), which we can consider as the median. Then the null hypothesis \(H_0\) is,

\[ H_0: \mu = 0. \]

Now if our data (of differences) is \(D_1,\ldots,D_n\) then we denote the test statistic as \(W^+\) and calculate it via this procedure:

  1. Discard Zeros: Remove any values with \(D_i = 0\). Let the remaining number of subjects be \(\tilde{n} \le n\), the data is \(\tilde{D}_1,\ldots,\tilde{D}_{\tilde{n}}\).
  2. Take Absolute Values: Calculate the absolute value of each non-zero difference, \(|\tilde{D}_i|\).
  3. Rank the Absolute Values: Rank these \(\tilde{n}\) absolute values from smallest (rank 1) to largest (rank \(\tilde{n}\)). If there are ties, each tied value receives the average of the ranks they would have occupied. Let’s call the rank of \(|\tilde{D}_i|\) as \(\text{Rank}(|\tilde{D}_i|)\).
  4. Create test statistic(s): Here there are a few alternative test statistics, all of which are related.

One test statistic, \(W\), is the sum of the signed ranks where originally positive values get a positive sign and originally negative values get a negative sign.

\[ W = \sum_{i=1}^{\tilde{n}} \text{Sign}(\tilde{D_i}) ~~ \text{Rank}(|\tilde{D}_i|) \]

Another option is to separate \(W\) into \(W_-\) and \(W_+\), where \(W_-\) is the sum of the ranks that had a negative sign and \(W_+\) is the sum of the ranks that had a positive sign. We then obviously have,

\[ W = W_+ ~~-~~ W_-. \]

In our implementation, the Wilcoxon rank-sum statistic is \(W_+\), but we also get \(W_-\) reported.
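
To make the procedure concrete, here is a tiny worked sketch on hypothetical differences (not data from this unit), again using tiedrank from StatsBase so that ties would receive average ranks:

D_example = [0.4, -0.1, 0.3, -0.2, 0.0]   # hypothetical differences
D̃ = filter(!iszero, D_example)            # step 1: discard zeros (4 values remain)
r = tiedrank(abs.(D̃))                     # steps 2-3: rank the absolute values
W = sum(sign.(D̃) .* r)                    # signed-rank sum: 4 - 1 + 3 - 2 = 4
W₊ = sum(r[D̃ .> 0])                       # ranks with positive sign: 4 + 3 = 7
W₋ = sum(r[D̃ .< 0])                       # ranks with negative sign: 1 + 2 = 3
W == W₊ - W₋                              # true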

In any case, the strength of the approach is that under \(H_0\) we know the distribution of each of these test statistics. With some smart computation we can have an exact distribution. An alternative is to use an approximate distribution. In the output above, we saw both cases with Approximate Wilcoxon signed rank test and Exact Wilcoxon signed rank test.

Back to the bioequivalence context

In our context, \(H_0\) is obviously inequivalence, so the approach taken is a TOST approach. Yet, just like we converted a T-test into a TOST (see Unit 3), here we converted the signed rank test and again require a \(90\%\) confidence interval.

To see an actual computation, let us return to the tie-free dataset, no_tie_data, which we used above, and preprocess it in the same manner as before. Our purpose is to avoid the complications that ties add to this discussion.

data_preprocessed = preprocess_be(no_tie_data, endpoint = :Tmax)
data_paired = combine(
    groupby(
        subset(data_preprocessed, :formulation => ByRow(in(['T', 'R']))),
        [:id, :formulation],
    ),
    :endpoint => mean,
    renamecols = false,
)
data_paired = unstack(data_paired, :formulation, :endpoint)
dropmissing!(data_paired)

D̃_data = data_paired.T - data_paired.R;
ñ = length(D̃_data)
32

Now D̃_data has 32 distinct values. Let us compute the test statistics. As a simple device we use Julia’s sortperm function; strictly speaking it returns the permutation that sorts the absolute differences rather than the ranks themselves (the exact ranks would be invperm(sortperm(abs.(D̃_data))), or ordinalrank from StatsBase), which is why the values computed below only approximate those reported by the model:

ranks = sortperm(abs.(D̃_data))
W₊ = sum(ranks[i] * (D̃_data[i] > 0) for i = 1:ñ)
W₋ = sum(ranks[i] * (D̃_data[i] < 0) for i = 1:ñ)
[W₊, W₋] |> println
[297, 231]

You can see that these values, 297 and 231, are very close to the rank sums: [298.0, 230.0] appearing in the Exact Wilcoxon signed rank test. While our implementation (in pumas_be) calls \(W_+\) the Wilcoxon rank-sum statistic (298, or 297 in our approximation), let us now consider \(W = W_+ ~~-~~ W_-\):

W = W₊ - W₋
66

You can reason that if there is no difference between R and T, then the values D̃_data should be “centered” around \(0\). You would then expect the distribution of \(W\) to be centered around \(0\) as well. This is exactly the distribution used for the model and its confidence interval. We won’t cover further details about computing this distribution here; we recommend consulting Geyer (2003).
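
To give some feel for this, here is an illustrative simulation sketch (this is not how the distribution is computed in pumas_be): under \(H_0\) each rank is equally likely to carry a positive or a negative sign, so we can approximate the null distribution of \(W\) by repeatedly attaching random signs to the ranks of the absolute differences:

Random.seed!(1)
abs_ranks = tiedrank(abs.(D̃_data))                              # ranks of the absolute differences
W_null = [sum(rand([-1, 1], ñ) .* abs_ranks) for _ = 1:10_000]  # random sign flips under H₀
hist(W_null; bins = 40, axis = (xlabel = "W under H₀", ylabel = "Count"))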

5 Conclusion

In this unit, we addressed a critical exception to the standard bioequivalence analysis workflow: the endpoint Tmax. We established that due to its discrete and often skewed nature, Tmax violates the core normality assumption of the linear models used for AUC and Cmax. Applying a standard linear model in this situation, even if it produces a result, is conceptually flawed and can lead to unreliable conclusions. The correct approach is to use a nonparametric method, which does not rely on a specific distributional assumption.

The industry-standard solution is the Wilcoxon Signed-Rank Test, which pumas_be automatically applies for the Tmax endpoint. This test works by ranking the magnitudes of the differences between the test and reference formulations within each subject, providing a robust assessment based on order rather than magnitude. While this test is “distribution-free,” it is not “assumption-free,” relying on the key assumption that the distribution of differences is symmetric. By understanding when and why to use this nonparametric approach, you can ensure your bioequivalence analyses are statistically sound.

6 Unit exercises

  1. Exploring Tmax Data Distribution

    The fundamental reason for using a nonparametric test for Tmax is its distribution. Let’s explore this using the PJ2017_3_16_data dataset.

    1. Create a DataFrame that shows each unique Tmax value and how many times it appears in the dataset.
    2. Using CairoMakie or AlgebraOfGraphics, create a bar plot of the Tmax counts you calculated in part (a).
    3. Based on your table and plot, write a sentence or two explaining why assuming this data follows a continuous, symmetric distribution like the normal distribution would be inappropriate.
  2. Forcing the “Wrong” Model

    It’s important to see the difference in results when applying an inappropriate model.

    1. Run pumas_be on the PJ2017_3_16_data with endpoint = :Tmax. Note the 90% CI and the BE conclusion. This will use the default nonparametric test.
    2. Now, run pumas_be on the same data, but force a parametric analysis by setting nonparametric = false. Note the new 90% CI and conclusion.
    3. Compare the confidence intervals from the two analyses. Which result should you trust, and why?
  3. Manual Data Preparation for Paired Analysis

    The Wilcoxon Signed-Rank Test requires paired data. pumas_be prepares this automatically. Let’s replicate the key steps to understand the process.

    1. Starting with PJ2017_3_16_data, follow the steps in the unit to create the data_paired DataFrame which has columns for id, R, and T.
    2. How many rows are in your final data_paired DataFrame? Compare this number to the number of observations reported when you inspect the result.model from the pumas_be output in Exercise 2a. Do they match?
    3. In your data_paired DataFrame, calculate how many subjects have the exact same Tmax value for both R and T. This is the number of ties.
  4. Approximating the Rank-Sum Statistic

    The core of the Wilcoxon test is the statistic derived from ranks. Let’s calculate it manually on a simplified dataset.

    1. Use the data_paired DataFrame from Exercise 3. For this exercise, create a new DataFrame that removes subjects where Tmax for R and T were tied.
    2. Calculate the difference for each subject: D = T - R.
    3. Take the absolute value of these differences and find their ranks. (Hint: sortperm(abs.(D)) will give you the indices that sort the data, which can be used to determine the ranks).
    4. The W+ statistic is the sum of the ranks that came from a positive difference (D > 0). Calculate this value. How does it compare to the rank sums reported in the result.model details from pumas_be? (Note: It may not match exactly because the official test has a more sophisticated way of handling ranks).
  5. Exact vs. Approximate Tests

    The presence of ties influences the type of test used.

    1. Run pumas_be on the original PJ2017_3_16_data and inspect result.model[1][1]. Does it report an “Exact” or “Approximate” test? Note the adjustment for ties.
    2. Now, create the no_tie_data as shown in the unit by adding a very small amount of random noise to Tmax.
    3. Run pumas_be on no_tie_data and inspect its model. Is it “Exact” or “Approximate”? What is the adjustment for ties now? Explain the change.

References

Geyer, Charles J. 2003. “Stat 5102 Notes: Nonparametric Tests and Confidence Intervals.” 2003. https://www.stat.umn.edu/geyer/old03/5102/notes/rank.pdf.
Hollander, Myles, Douglas A. Wolfe, and Eric Chicken. 2013. Nonparametric Statistical Methods. John Wiley & Sons.
