using Pumas # available with Pumas products
using NCA # available with Pumas products
using Bioequivalence # available with Pumas products
using PharmaDatasets # available with Pumas products
using DataFrames
using DataFramesMeta
using StatsBase
using CairoMakie
using SummaryTables
using AlgebraOfGraphics
using Random

A Course on Bioequivalence: Unit 9 - Nonparametric Analysis
This is Unit 9 of a complete bioequivalence analysis course using Pumas. There are 15 units in the course in total.
1 Unit overview
In Unit 5, we established the linear model as the workhorse for analyzing pharmacokinetic (PK) endpoints like AUC and Cmax. These models, known as parametric models, rely on the assumption that the data (or the log-transformed data) follows a specific distribution, typically the log-normal distribution.
However, not all PK endpoints fit this mold. In particular a popular endpoint, the time to maximum concentration, Tmax, often violates these distributional assumptions.
When we carry out bioequivalence analysis for such an endpoint we often need to resort to a nonparametric test that does not rely on an explicit distributional assumption. In this unit, we cover the most common such test, the Wilcoxon Signed-Rank Test, and we see how to use it with Pumas.
In particular we consider:
- The reasons Tmax analysis deviates from AUC and Cmax analysis.
- A simple application of pumas_be for such an endpoint.
- An outline of the ideas of the Wilcoxon Signed-Rank Test (or model).
We use the following packages, which are loaded at the top of this unit. We have used all of these packages in previous units.
2 Why Tmax needs a different approach
The statistical methods we used for AUC and Cmax, namely the linear model on log-transformed data, are powerful and precise. However, their validity hinges on the assumption that the log transformed endpoint follows a normal distribution; see Unit 2.
Yet when the endpoint of interest is Tmax, the time of maximal concentration, there are several challenges to this assumption:
- Discrete Nature: Tmax is not a continuous measurement. Its value is determined by the study’s sampling schedule. For example, if blood samples are taken at 0, 1, 2, 4, and 8 hours, Tmax can only be one of these values. This discreteness makes a continuous distribution like the normal distribution a poor fit.
- Skewed Distribution: The distribution of Tmax values across subjects is often highly skewed and irregular. Unlike AUC and Cmax, applying a log transformation does not reliably “fix” this issue.
To see this, consider the following dataset (we present the first 10 rows only):
PJ2017_3_16_data = dataset("bioequivalence/RT_TR/PJ2017_3_16")
simple_table(first(PJ2017_3_16_data, 10))
id | sequence | period | Tmax |
1 | RT | 1 | 0.5 |
1 | RT | 2 | 0.5 |
2 | TR | 1 | 1 |
2 | TR | 2 | 1 |
3 | TR | 1 | 0.5 |
3 | TR | 2 | 0.5 |
4 | RT | 1 | 0.5 |
4 | RT | 2 | 1 |
5 | RT | 1 | 1.5 |
5 | RT | 2 | 0.25 |
Let us look at the distribution of the Tmax endpoint. As you can see it takes on only a few possible values:
counts = combine(groupby(PJ2017_3_16_data, :Tmax), nrow => :Count)
sort!(counts, :Tmax)
simple_table(counts)
Tmax | Count |
0.25 | 5 |
0.5 | 24 |
1 | 21 |
1.5 | 11 |
2 | 2 |
4.02 | 1 |
And if we want to see the distribution of these values, here it is:
plt = data(counts) * mapping(:Tmax, :Count) * visual(BarPlot)
draw(plt)
With a discrete distribution such as this, even after a log transformation, it would not be valid to assume that the data is normally distributed.
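To make this concrete, note that the log transformation is one-to-one, so it cannot reduce this discreteness; the transformed endpoint attains exactly as many distinct values as the original. Here is a small illustrative check (our own sketch, not part of the standard workflow):
# A log transformation is monotone and one-to-one, so it does not change
# the number of distinct values the endpoint can take:
log_tmax = log.(PJ2017_3_16_data.Tmax)
@show length(unique(PJ2017_3_16_data.Tmax))
@show length(unique(log_tmax));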
Another example, directly carrying out NCA
As another example, let us carry out NCA directly on the concentration data, similarly to what we did in Unit 1. In this case we take the nca/crossover_study.csv dataset and, for purposes of illustration, only consider the first period, thus treating it as a parallel design. We then plot the concentration-time profiles, focusing on the first 10 hours.
raw_data = dataset("nca/crossover_study.csv")
@subset!(raw_data, :period .== 1) # We only consider the first period and thus treat it as a parallel study
@rtransform!(raw_data, :sequence = :sequence == "RT" ? "R" : "T")
plt = data(raw_data) * mapping(:time, :concentration, color = :sequence) * visual(Lines)
draw(
    plt,
    scales(Color = (; palette = ["R" => :green, "T" => :red]));
    axis = (
        xlabel = "Time (hrs)",
        ylabel = "Concentration",
        title = "Concentration-Time Profiles (focusing on first 10 hours)",
        limits = ((0, 10), nothing),
    ),
)
Looking at the above plot, it should be evident that Tmax only attains a small number of values. Here we compute it via the run_nca function:
raw_data = dataset("nca/crossover_study.csv")
pop = read_nca(raw_data; observations = :concentration, group = [:sequence, :period])
nca_result = run_nca(pop; parameters = [:tmax])
be_analysis_data = nca_result.reportdf
And indeed, if we now count unique values we see the following:
be_analysis_data = nca_result.reportdf
counts = combine(groupby(be_analysis_data, :tmax), nrow => :Count)
sort!(counts, :tmax)
simple_table(counts)
tmax | Count |
0.75 | 2 |
1 | 6 |
2 | 27 |
4 | 5 |
Thus in summary we require a different approach for bioequivalence analysis.
Beware: Linear models still “appear to work” even when not suitable
Going back to the PJ2017_3_16_data, say we insisted on using the linear model bioequivalence approach - even though the endpoint is Tmax. In this case we can use the nonparametric = false option with pumas_be:
pumas_be(PJ2017_3_16_data, nonparametric = false, endpoint = :Tmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 17 | 17 |
TR | 15 | 15 |
Paradigm: Non replicated crossover design | ||||
Model: Linear model | ||||
Criteria: Standard ABE | ||||
Endpoint: Tmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(Tmax) | Assessment | Criteria | ||
R | Geometric Marginal Mean | 0.7529 | ||
Geometric Naive Mean | 0.7535 | |||
T | Geometric Marginal Mean | 0.7957 | ||
Geometric Naive Mean | 0.7967 | |||
Geometric Mean T/R Ratio (%) | 105.7 | |||
Degrees of Freedom | 30 | |||
90% Confidence Interval (%) | [81.7, 136.7] | Fail | CI ⊆ [80, 125] | |
Variability | CV (%) | σ̂ | 66.52 | 0.6053 | ||
ANOVA | Formulation (p-value) | 0.7181 | ||
Sequence (p-value) | 0.8245 | |||
Period (p-value) | 0.9511 | |||
While in this case BE fails, with different data it could have passed!
Yet the result does not let us know that we are using the “wrong model”. That is, none of the numerical outputs indicate that we are carrying out a conceptual error.
A wrong, naive approach would be to consider the Formulation, Sequence, and Period p-values, and rely on these p-values being non-small as an indication of model fit. But remember that these p-values are computed based on the normal distribution modelling assumptions - and it is those assumptions that are violated.
In the statistics and data world it is a common belief that all models are wrong, but some are useful. However, due to the reasons described above, in this case, our model is not just “wrong”, it is most probably “very wrong”. As such it is not useful and using it can be harmful. Hence we need an alternative and there are better models for this kind of data.
3 Carrying out a nonparametric test
The suitable alternative is the nonparametric approach.
In this case we apply pumas_be to the endpoint data without nonparametric = false. While false is the default option in most cases, when pumas_be sees that the endpoint name is :Tmax it decides to carry out a nonparametric analysis. (If the endpoint name differs from :Tmax and we still want such an analysis, just use nonparametric = true to explicitly invoke a nonparametric test.)
Here it is:
pumas_be(PJ2017_3_16_data, endpoint = :Tmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 17 | 17 |
TR | 15 | 15 |
Paradigm: Nonparametric analysis | ||||
Model: Signed rank test | ||||
Criteria: Standard ABE | ||||
Endpoint: Tmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(Tmax) | Assessment | Criteria | ||
T | 90% Confidence Interval (%) | [81.64, 141.5] | Fail | CI ⊆ [80, 125] |
As you can see in the output, the paradigm is Nonparametric analysis and the associated model is Signed rank test. The output then just consists of a \(90\%\) confidence interval, [81.64, 141.5]. This confidence interval is based on very minimal assumptions about the data, and no explicit distributional assumptions.
Now the TOST-like approach is still used, and this confidence interval indicates Fail in this case as it is not within [80, 125].
Interestingly, the nonparametric case does not have many of the outputs we see with linear and mixed models.
Supported designs
The nonparametric approach works for any design where a subject is administered both the R and the T formulation. It is most commonly used for a \(2 \times 2\) crossover design.
Under the hood in pumas_be, the data still undergoes a log transformation (unless logtransformed = true is used). Yet in principle this does not matter with a nonparametric approach.
In such a \(2 \times 2\) crossover design, the data is taken as paired, where a subject's R and T measurements are compared. In that sense the fact that there are two sequences (RT and TR) is irrelevant. All that matters is the pairing of the effect of R and T per subject.
If the design is more complex, as in the case of a partial replicate or fully replicated design, the mean of the (log-transformed) data is used for each formulation within a subject. So, for example, a subject that has two R periods and one T period would have the R data be the mean of the two R periods (see the toy sketch below).
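As a small toy illustration of this averaging rule (hypothetical numbers of our own, not pumas_be's internal code), consider a single subject with two R periods and one T period; grouping by subject and formulation and taking the mean collapses the two R periods into one value:
# Toy example: subject 1 has two R periods whose (log-scale) endpoint values get averaged
toy = DataFrame(
    id = [1, 1, 1],
    formulation = ['R', 'R', 'T'],
    endpoint = [log(1.0), log(2.0), log(1.5)],
)
combine(groupby(toy, [:id, :formulation]), :endpoint => mean; renamecols = false)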
To see the type of transformation that takes place under the hood, consider this code:
data_preprocessed = preprocess_be(PJ2017_3_16_data, endpoint = :Tmax)
data_paired = combine(
    groupby(
        subset(data_preprocessed, :formulation => ByRow(∈(['T', 'R']))),
        [:id, :formulation],
    ),
    :endpoint => mean,
    renamecols = false,
)
data_paired = unstack(data_paired, :formulation, :endpoint)
dropmissing!(data_paired)
simple_table(data_paired)
id | R | T |
1 | -0.693 | -0.693 |
2 | 0 | 0 |
3 | -0.693 | -0.693 |
4 | -0.693 | 0 |
5 | 0.405 | -1.39 |
6 | -0.693 | 0 |
7 | -1.39 | 0 |
8 | 0 | -0.693 |
9 | -1.39 | 0.405 |
10 | 0 | 0.405 |
11 | -0.693 | 0 |
12 | 0 | 0 |
15 | 0.405 | -0.693 |
16 | 0.405 | 0.693 |
17 | 0.405 | 0 |
18 | -0.693 | 0 |
19 | 0.405 | -0.693 |
20 | -0.693 | 0 |
21 | -0.693 | -0.693 |
22 | 1.39 | 0.693 |
23 | -0.693 | -0.693 |
24 | 0 | -0.693 |
25 | 0 | -1.39 |
26 | -1.39 | -0.693 |
27 | 0 | -0.693 |
28 | 0 | 0.405 |
29 | 0 | 0.405 |
30 | 0 | -0.693 |
31 | -0.693 | 0.405 |
34 | -0.693 | 0.405 |
35 | 0 | -0.693 |
36 | -0.693 | 0 |
Data such as this is then fed into the nonparametric model.
Note that there are 32 observations here as subjects [13, 14, 32, 33] are missing.
num_observations = nrow(data_paired)
32
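As a quick check (a small sketch of our own), we can list exactly which subject IDs were dropped by comparing against the original dataset:
# Subject IDs present in the original data but missing from the paired table
setdiff(unique(PJ2017_3_16_data.id), data_paired.id)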
It is also important to notice if there are ties (and sometimes to count them). A tie occurs when two or more subjects have exactly the same difference. Here is one way to see if there are ties:
D = data_paired.R - data_paired.T
@show length(D)
@show length(unique(D))
has_ties = length(D) > length(unique(D))
@show has_ties;
length(D) = 32
length(unique(D)) = 13
has_ties = true
In the bioequivalence context with Tmax data, we are often bound to have ties since the number of possible values for Tmax is typically small.
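Since StatsBase is already loaded, its countmap function offers one quick way (a small sketch of our own, not part of the pumas_be workflow) to see exactly which difference values are shared by more than one subject:
# Count how many subjects share each difference value; entries with a count above 1 are ties
difference_counts = countmap(D)
tied_values = filter(p -> p.second > 1, difference_counts)
@show tied_values;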
Information from the model field
Sometimes, also for exploratory purposes, we may be interested in more information. For this we can inspect the model field of the output of pumas_be. For example:
result = pumas_be(PJ2017_3_16_data, endpoint = :Tmax)
result.model[1][1]
Approximate Wilcoxon signed rank test
-------------------------------------
Population details:
parameter of interest: Location parameter (pseudomedian)
value under h_0: 0
point estimate: 0.0
95% confidence interval: (-0.2027, 0.3466)
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.8774
Details:
number of observations: 32
Wilcoxon rank-sum statistic: 182.0
rank sums: [182.0, 169.0]
adjustment for ties: 1848.0
normal approximation (μ, σ): (6.5, 38.8812)
We see this is an Approximate Wilcoxon signed rank test and we see some interesting information under Details.
Some of the information, dealing with the 95% confidence and the p-value, is not as important for us since our confidence interval for bioequivalence is typically a \(90\%\) confidence interval and not a \(95\%\) one. Similarly, the p-value is for a hypothesis test where \(H_0\) is equality of R and T, in contrast to the opposite case, which is what we want in the context of bioequivalence.
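The printed summary suggests that the stored object behaves like a HypothesisTests.jl test. Assuming that is the case (an assumption on our part), one could query it for an interval at a different level via the generic confint function (available through the loaded statistics packages) and exponentiate to map the log-scale interval back to the percent (T/R ratio) scale. This is only an exploratory sketch, not necessarily how pumas_be derives its reported interval:
# Exploratory sketch (assumes result.model[1][1] supports the confint interface):
test = result.model[1][1]
ci_log_90 = confint(test; level = 0.90)  # 90% interval on the log scale
ci_pct_90 = exp.(ci_log_90) .* 100       # back-transformed to the percent (T/R) scale
@show ci_log_90 ci_pct_90;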
Information that might be important is the Wilcoxon rank-sum statistic, rank sums, adjustment for ties, and normal approximation (μ, σ). For example, in certain cases you may be required to report these values to the agency.
You can also see that from the model's perspective there are 32 observations, in agreement with the data transformations we carried out above.
Approximate vs. Exact tests
You can see above that the model used an Approximate test. This means that an asymptotic normal approximation was used to compute the confidence interval reported as a result.
In general, the heuristic choice of whether to use an approximate or an exact analysis is an implementation detail and is influenced by the number of ties and the number of observations.
To see this, let us create no_tie_data, which for all intents and purposes is identical to the original data. We simply add negligible noise to the Tmax values (normally distributed with mean \(0\) and standard deviation \(10^{-8}\)). Now when we run pumas_be on this data, observe that the results are effectively identical:
Random.seed!(0)
no_tie_data = copy(PJ2017_3_16_data)
no_tie_data.Tmax = no_tie_data.Tmax + rand(Normal(0, 1e-8), length(no_tie_data.Tmax));
result = pumas_be(no_tie_data, endpoint = :Tmax)
Observation Counts | ||
Sequence ╲ Period | 1 | 2 |
RT | 17 | 17 |
TR | 15 | 15 |
Paradigm: Nonparametric analysis | ||||
Model: Signed rank test | ||||
Criteria: Standard ABE | ||||
Endpoint: Tmax | ||||
Formulations: Reference(R), Test(T) | ||||
Results(Tmax) | Assessment | Criteria | ||
T | 90% Confidence Interval (%) | [81.64, 141.5] | Fail | CI ⊆ [80, 125] |
As you can see, the confidence interval of [81.64, 141.5] exactly agrees with the one above (to this level of rounding).
Now if we consider the model, we see that an Exact Wilcoxon signed rank test was chosen.
result.model[1][1]
Exact Wilcoxon signed rank test
-------------------------------
Population details:
parameter of interest: Location parameter (pseudomedian)
value under h_0: 0
point estimate: 2.6595e-8
95% confidence interval: (-0.2027, 0.3466)
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.5361
Details:
number of observations: 32
Wilcoxon rank-sum statistic: 298.0
rank sums: [298.0, 230.0]
adjustment for ties: 0.0
You can see that normal approximation (μ, σ) is no longer present, and also, since we excluded all ties, we get adjustment for ties as 0.0.
4 About the Wilcoxon Signed-Rank Test
Let us now present a very general overview of the Signed-Rank test. For further details, see Hollander, Wolfe, and Chicken (2013), a general resource on classical nonparametric statistics. We also recommend the notes of Geyer (2003).
The phrase “nonparametric” aims to distinguish this approach from approaches that rely on a specific probability distribution (e.g. the normal distribution).
Nonparametric methods are often called “distribution-free” because they make far fewer assumptions about the data’s distribution. Instead of estimating parameters like the mean, they often work with simpler properties of the data, such as the order of the values (ranks) or their signs (positive or negative). This makes them incredibly robust and useful when the assumptions of parametric tests are violated, which is precisely the case with Tmax.
The trade-off is that if the data does meet the assumptions for a parametric test, the nonparametric equivalent will generally have less statistical power (i.e., it’s slightly less likely to detect a true difference if one exists).
A minimal set of assumptions
While nonparametric tests are “distribution-free,” they are not “assumption-free.” The Wilcoxon Signed-Rank Test, when applied in the context of bioequivalence, relies on a minimal but important set of assumptions for its validity:
- Paired Data: The data must consist of paired observations. These are the test (T) and reference (R) measurements for the same subject. pumas_be handles this by computing the difference in Tmax for each subject.
- Independence of Differences: The differences calculated for each pair (i.e., for each subject) must be independent of the differences from other subjects. This is typically satisfied by the study design.
- Symmetric Distribution of Differences: This is the most critical assumption. The test assumes that the distribution of the differences is symmetric around its median (an informal visual check of this assumption is sketched after this list).
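Below is an informal visual check of the symmetry assumption (a sketch of our own, not a formal test), using the paired log-scale data_paired table we constructed earlier:
# Informal check: the within-subject differences should look roughly symmetric around their center
differences = data_paired.T - data_paired.R
hist(differences; bins = 10, axis = (xlabel = "T - R (log scale)", ylabel = "Count"))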
A statistic based on ranks
A statistic, sometimes called a test statistic, is a value calculated from the data. For example we have seen \(T\) statistics in the case of the T-test.
For the signed rank test, our test statistic is based on ranks. The idea is to use only order information and not explicit magnitudes. With this, the specific distribution of the data does not affect the distribution of the statistic.
In general, for any statistical test (or model) if we are able to
- compute the statistic (quite) easily; and
- work with the distribution of the statistic under some null hypothesis,
then we can carry out hypothesis tests and confidence intervals.
In the case of the signed rank test (not in the context of bioequivalence), we assume there is a center for the differences, \(\mu\), which we can consider as the median. Then the null hypothesis \(H_0\) is,
\[ H_0: \mu = 0. \]
Now if our data (of differences) is \(D_1,\ldots,D_n\) then we denote the test statistic as \(W^+\) and calculate it via this procedure:
- Discard Zeros: Remove any values with \(D_i = 0\). Let the remaining number of subjects be \(\tilde{n} \le n\), the data is \(\tilde{D}_1,\ldots,\tilde{D}_{\tilde{n}}\).
- Take Absolute Values: Calculate the absolute value of each non-zero difference, \(|\tilde{D}_i|\).
- Rank the Absolute Values: Rank these \(\tilde{n}\) absolute values from smallest (rank 1) to largest (rank \(\tilde{n}\)). If there are ties, each tied value receives the average of the ranks they would have occupied. Let’s call the rank of \(|\tilde{D}_i|\) as \(\text{Rank}(|\tilde{D}_i|)\).
- Create test statistic(s): Here there are a few alternative test statistics, all of which are related.
One test statistic, \(W\), is the sum of the signed ranks where originally positive values get a positive sign and originally negative values get a negative sign.
\[ W = \sum_{i=1}^{\tilde{n}} \text{Sign}(\tilde{D_i}) ~~ \text{Rank}(|\tilde{D}_i|) \]
Another option is to separate \(W\) into \(W_-\) and \(W_+\), where \(W_-\) is the sum of the ranks that had a negative sign and \(W_+\) is the sum of the ranks that had a positive sign. Since the negatively signed ranks enter \(W\) with a minus sign, we then obviously have,
\[ W = W_+ ~~-~~ W_-. \]
In our implementation, the Wilcoxon rank-sum statistic is \(W_+\), but we also get \(W_-\) reported.
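To make the procedure concrete, here is a small toy sketch of our own (hypothetical differences, not our study data) following the four steps above; it uses tiedrank from StatsBase so that tied absolute values receive average ranks:
# Toy illustration of the signed-rank procedure on hypothetical differences
D_toy = [0.0, -0.2, 0.5, 0.5, -0.9, 1.3]
D_nonzero = filter(!iszero, D_toy)   # step 1: discard zero differences
r = tiedrank(abs.(D_nonzero))        # steps 2-3: rank the absolute values, averaging ties
W_plus = sum(r[D_nonzero .> 0])      # step 4: sum of ranks with a positive sign
W_minus = sum(r[D_nonzero .< 0])
W = W_plus - W_minus
@show r W_plus W_minus W;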
In any case, the strength of the approach is that under \(H_0\) we know the distribution of each of these test statistics. With some smart computation we can obtain an exact distribution. An alternative is to use an approximate distribution. In the output above, we saw both cases with Approximate Wilcoxon signed rank test and Exact Wilcoxon signed rank test.
Back to the bioequivalence context
In our context, \(H_0\) is obviously inequivalence, so the approach taken is a TOST approach. Yet, just like we converted a T-test into a TOST (see Unit 3), here we converted the signed rank test and again require a \(90\%\) confidence interval.
To see an actual computation, let us return to the tie-free dataset, no_tie_data, which we used above, and preprocess it in the same manner as before. Our purpose is to avoid the complications that ties add to this discussion.
data_preprocessed = preprocess_be(no_tie_data, endpoint = :Tmax)
data_paired = combine(
    groupby(
        subset(data_preprocessed, :formulation => ByRow(∈(['T', 'R']))),
        [:id, :formulation],
    ),
    :endpoint => mean,
    renamecols = false,
)
data_paired = unstack(data_paired, :formulation, :endpoint)
dropmissing!(data_paired)
D̃_data = data_paired.T - data_paired.R;
ñ = length(D̃_data)
32
Now D̃_data has 32 distinct values. Let us now compute the test statistics. We use Julia's sortperm function as a quick way to obtain (approximate) ranks; strictly speaking, sortperm returns the sorting permutation rather than the ranks themselves, which is why the values below differ slightly from those reported by the test:
ranks = sortperm(abs.(D̃_data))
W₊ = sum(ranks[i] * (D̃_data[i] > 0) for i = 1:ñ)
W₋ = sum(ranks[i] * (D̃_data[i] < 0) for i = 1:ñ)
[W₊, W₋] |> println
[297, 231]
You can see that these values, 297 and 231, are very close to the rank sums: [298.0, 230.0] appearing in the Exact Wilcoxon signed rank test. While our implementation (in pumas_be) calls \(W_+\) the Wilcoxon rank-sum statistic (298, or 297 in our rough computation), let us ponder about \(W = W_+ - W_-\):
W = W₊ - W₋
66
You can reason that if there is no difference between R and T, then the values in D̃_data should be “centered” around \(0\). You would then expect the distribution of \(W\) to be centered around \(0\). This is exactly the distribution used for the model and the confidence interval. We won't cover further details about computing this distribution here; we recommend consulting Geyer (2003) for further details.
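To build intuition for why \(W\) is centered around \(0\) under \(H_0\), here is a small simulation sketch of our own (not part of the pumas_be machinery): under the null hypothesis each difference is equally likely to be positive or negative, so we can attach random signs to the observed ranks and look at the resulting distribution of \(W\):
# Simulation sketch: distribution of W under H₀, obtained by attaching random signs to the ranks
Random.seed!(1)
r = tiedrank(abs.(D̃_data))                              # ranks of the absolute differences
W_sim = [sum(rand([-1, 1], ñ) .* r) for _ = 1:10_000]   # signed-rank sums with random signs
@show mean(W_sim)  # expected to be close to 0
@show std(W_sim);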
5 Conclusion
In this unit, we addressed a critical exception to the standard bioequivalence analysis workflow: the endpoint Tmax. We established that due to its discrete and often skewed nature, Tmax violates the core normality assumption of the linear models used for AUC and Cmax. Applying a standard linear model in this situation, even if it produces a result, is conceptually flawed and can lead to unreliable conclusions. The correct approach is to use a nonparametric method, which does not rely on a specific distributional assumption.
The industry-standard solution is the Wilcoxon Signed-Rank Test, which pumas_be automatically applies for the Tmax endpoint. This test works by ranking the magnitudes of the differences between the test and reference formulations within each subject, providing a robust assessment based on order rather than magnitude. While this test is “distribution-free,” it is not “assumption-free,” relying on the key assumption that the distribution of differences is symmetric. By understanding when and why to use this nonparametric approach, you can ensure your bioequivalence analyses are statistically sound.
6 Unit exercises
Exploring Tmax Data Distribution
The fundamental reason for using a nonparametric test for Tmax is its distribution. Let's explore this using the PJ2017_3_16_data dataset.
- Create a DataFrame that shows each unique Tmax value and how many times it appears in the dataset.
- Using CairoMakie or AlgebraOfGraphics, create a bar plot of the Tmax counts you calculated in part (a).
- Based on your table and plot, write a sentence or two explaining why assuming this data follows a continuous, symmetric distribution like the normal distribution would be inappropriate.
Forcing the “Wrong” Model
It’s important to see the difference in results when applying an inappropriate model.
- Run pumas_be on the PJ2017_3_16_data with endpoint = :Tmax. Note the 90% CI and the BE conclusion. This will use the default nonparametric test.
- Now, run pumas_be on the same data, but force a parametric analysis by setting nonparametric = false. Note the new 90% CI and conclusion.
- Compare the confidence intervals from the two analyses. Which result should you trust, and why?
Manual Data Preparation for Paired Analysis
The Wilcoxon Signed-Rank Test requires paired data. pumas_be prepares this automatically. Let's replicate the key steps to understand the process.
- Starting with PJ2017_3_16_data, follow the steps in the unit to create the data_paired DataFrame which has columns for id, R, and T.
- How many rows are in your final data_paired DataFrame? Compare this number to the number of observations reported when you inspect the result.model from the pumas_be output in Exercise 2a. Do they match?
- In your data_paired DataFrame, calculate how many subjects have the exact same Tmax value for both R and T. This is the number of ties.
Approximating the Rank-Sum Statistic
The core of the Wilcoxon test is the statistic derived from ranks. Let’s calculate it manually on a simplified dataset.
- Use the data_paired DataFrame from Exercise 3. For this exercise, create a new DataFrame that removes subjects where Tmax for R and T were tied.
- Calculate the difference for each subject: D = T - R.
- Take the absolute value of these differences and find their ranks. (Hint: sortperm(abs.(D)) will give you the indices that sort the data, which can be used to determine the ranks.)
- The W+ statistic is the sum of the ranks that came from a positive difference (D > 0). Calculate this value. How does it compare to the rank sums reported in the result.model details from pumas_be? (Note: It may not match exactly because the official test has a more sophisticated way of handling ranks.)
Exact vs. Approximate Tests
The presence of ties influences the type of test used.
- Run pumas_be on the original PJ2017_3_16_data and inspect result.model[1][1]. Does it report an “Exact” or “Approximate” test? Note the adjustment for ties.
- Now, create the no_tie_data as shown in the unit by adding a very small amount of random noise to Tmax.
- Run pumas_be on no_tie_data and inspect its model. Is it “Exact” or “Approximate”? What is the adjustment for ties now? Explain the change.