using Pumas
using Copulas
using Distributions
Multivariate Covariate Simulations in Pumas with Copulas.jl
1 Introduction
When performing simulations as part of a pharmacometric workflow it is always useful to be able to construct flexible covariate distributions.
One method is to sample subjects from a subject bank. This is a database of subjects with covariate values associated to them. These can be simulated by someone else or they can be real subejcts. To generate a new dataset we can then sample subjects until we have enough, and we will have new subjects with preserved correlation structure between covariates.
Another method is to parametrically simulate subjects. If all covariates can fit into a multivariate normal (Gaussian) framework this is trivial. However, once we have distributions that are only positive, have heavy tails, are truncated at some values, it becomes much more complicated to simulate. We will not go into the theory in this tutorial, but one alternative that is still parametric such that we can simulate new unique subjects is Copulas. Copulas work by connecting continuous marginal distributions using Sklar’s theorem (wikipedia link) into on connected Copula that is able to provide correlation structure between the variables we want to sample.
2 Practical example
Here, we will use Copulas as implemented in the Copulas.jl package as described in Laverny and Jimenez (2024). This package implements Copulas that are consistent with the Distributions package that we also use in Pumas for randoms effects, errors distributions on observed quantities, and so on. First, we load the required packages.
Then, we need to define distributions. One nice distribution that is defined on the positive real line is the Gamma
distribution (wikipedia link). Here, we parameterize two marginal distributions for height and weight in males where the mean height is 175 cm with a standard deviation of 10, and the mean weight is 70 Kg with a standard deviation of 15.
= Gamma((175 / 10)^2, 10^2 / 175)
height_dist_male = Gamma((70 / 15)^2, 15^2 / 70) weight_dist_male
Distributions.Gamma{Float64}(α=21.777777777777782, θ=3.2142857142857144)
and for females
= Gamma((165 / 8)^2, 8^2 / 165)
height_dist_female = Gamma((65 / 12)^2, 12^2 / 65) weight_dist_female
Distributions.Gamma{Float64}(α=29.340277777777782, θ=2.2153846153846155)
Then, we can construct a GaussianCopula
type to represent the correlation structure.
= GaussianCopula([1 0.7; 0.7 1])
copula_male = GaussianCopula([1 0.6; 0.6 1]) copula_female
Copulas.GaussianCopula{2, Matrix{Float64}}(
Σ: [1.0 0.6; 0.6 1.0]
)
The final step in specifying the Copula is to use Sklar’s Theorem to combine the previous distributions into one connected object that we can use to simulate covariates from.
= SklarDist(copula_male, (height_dist_male, weight_dist_male))
joint_dist_male = SklarDist(copula_female, (height_dist_female, weight_dist_female)) joint_dist_female
Copulas.SklarDist{Copulas.GaussianCopula{2, Matrix{Float64}}, Tuple{Distributions.Gamma{Float64}, Distributions.Gamma{Float64}}}(
C: Copulas.GaussianCopula{2, Matrix{Float64}}(
Σ: [1.0 0.6; 0.6 1.0]
)
m: (Distributions.Gamma{Float64}(α=425.390625, θ=0.3878787878787879), Distributions.Gamma{Float64}(α=29.340277777777782, θ=2.2153846153846155))
)
Then, we can sample from each Copula using the usual rand
function:
= rand(joint_dist_female) height, weight
2-element Vector{Float64}:
166.70693112756675
72.37553638897992
It will return the sampled height and weight for a sampled female subject.
3 Simulating a Population
Finally, we can do the actual simulation. We defined one Copula for males and one for females, so the only remaining component is to define the probability of being either one of the two categories in the SEX
covariates. This is controlled by ismale_prob
below.
# Define the population size and sex distribution
= 100
population_size = 0.5 # Equal probability of male and female
ismale_prob
# Create an array of virtual subjects
= map(1:population_size) do i
pop # Sample sex
= rand(Bernoulli(ismale_prob))
ismale
# Sample height and weight based on sex, using the copula distributions
= rand(ismale ? joint_dist_male : joint_dist_female)
height, weight
# Create a virtual subject with the sampled values
Subject(id = i, covariates = (; height, weight, ismale))
end
Population
Subjects: 100
Covariates: height, weight, ismale
Observations:
Lastly, we can check the covariates sampled as part of the population creation in terms of their means but also their correlations. The whole point here was that our subjects should have correlated covariates not just useful levels and dispersion as measured independently per covariate.
combine(
groupby(DataFrame(pop), "ismale"),
"height" .=> [mean, std],
"weight" .=> [mean, std],
"weight", "height"] => cor,
[ )
Row | ismale | height_mean | height_std | weight_mean | weight_std | weight_height_cor |
---|---|---|---|---|---|---|
Bool? | Float64 | Float64 | Float64 | Float64 | Float64 | |
1 | false | 165.045 | 8.1262 | 62.5283 | 10.9554 | 0.64906 |
2 | true | 173.597 | 8.87583 | 69.2692 | 15.5331 | 0.685818 |
We conclude that we got the means, standard deviations and correlations that we expected.
4 Conclusion
In this small tutorial, we showed how to simulate multivariate, parametric covariate distributions. This can be a great alternative to generating covariate samples from resampling when there are not enough candidates in the “subject bank” or if the method is not useful for some other reason. We saw how to specify the Copulay through the marginal distributions, how to specify correlations, how to sample and how to put it all together into a final Population
. In future tutorials, we will explain the underlying concepts of Copulas and provide workflows on the use of copulas for covariate sampling and population creation for simple to complex trial designs.