Simulations, Random Number Generator, and Reproducibility

Author

Jose Storopoli

Several Pumas’ functions use random number generators. Unless specifically controlled the results of these functions will change from run to run. Additionally, when functions calling random number generators are run in parallel, the results can be even more unpredictable. This behavior can be problematic when we want to reproduce the results of a simulation.

In this tutorial, we will learn how to set the seed of the random number generator, and feed it to a Pumas function, to ensure reproducibility.

First let’s recap the simulation in Generating and Simulating Populations:

using Pumas

model = @model begin
    @param begin
        tvcl  RealDomain(; lower = 0)
        tvvc  RealDomain(; lower = 0)
        tvka  RealDomain(; lower = 0)
        Ω  PSDDomain(3)
        σ_prop  RealDomain(; lower = 0)
    end

    @random begin
        η ~ MvNormal(Ω)
    end

    @covariates Wt

    @pre begin
        CL = tvcl * (Wt / 70)^0.75 * exp(η[1])
        Vc = tvvc * (Wt / 70) * exp(η[2])
        Ka = tvka * exp(η[3])
    end

    @dynamics begin
        Depot' = -Ka * Depot
        Central' = Ka * Depot - Central * CL / Vc
    end

    @derived begin
        conc = @. Central / Vc
        dv ~ @. Normal(conc, conc * σ_prop)
    end

end
PumasModel
  Parameters: tvcl, tvvc, tvka, Ω, σ_prop
  Random effects: η
  Covariates: Wt
  Dynamical variables: Depot, Central
  Derived: conc, dv
  Observed: conc, dv

As before, we’ll set the values of the parameter estimates:

fixeffs =
    (; tvcl = 0.4, tvvc = 20, tvka = 1.1, Ω = Diagonal([0.04, 0.04, 0.04]), σ_prop = 0.2)
(tvcl = 0.4,
 tvvc = 20,
 tvka = 1.1,
 Ω = [0.04 0.0 0.0; 0.0 0.04 0.0; 0.0 0.0 0.04],
 σ_prop = 0.2,)

And a DosageRegimen:

ev = DosageRegimen(100; time = 0)
1×10 DataFrame
Row time cmt amt evid ii addl rate duration ss route
Float64 Int64 Float64 Int8 Float64 Int64 Float64 Float64 Int8 NCA.Route
1 0.0 1 100.0 1 0.0 0 0.0 0.0 0 NullRoute

Finally, we’ll build a Population:

population = map(
    i -> Subject(;
        id = i,
        events = ev,
        covariates = (; Wt = 70),
        observations = (; dv = nothing),
    ),
    1:5,
)
Population
  Subjects: 5
  Covariates: Wt
  Observations: dv
map function

The map function applies a function to each element of an iterable. In this case, we are applying the Subject constructor to each element of the range 1:5. This is the same as writing:

population = [
    Subject(id = 1, events = ev, covariates = (; Wt = 70), observations = (; dv = nothing)),
    Subject(id = 2, events = ev, covariates = (; Wt = 70), observations = (; dv = nothing)),
    Subject(id = 3, events = ev, covariates = (; Wt = 70), observations = (; dv = nothing)),
    Subject(id = 4, events = ev, covariates = (; Wt = 70), observations = (; dv = nothing)),
    Subject(id = 5, events = ev, covariates = (; Wt = 70), observations = (; dv = nothing)),
]

If you want to know more about Julia syntax and the map function, check our Julia Syntax tutorial.

Now we are ready to simulate things with simobs. However, before we do that, let’s explain what a random number generator is.

1 Random Number Generator

A random number generator (RNG) is a function that generates a sequence of numbers that is not predictable. These numbers are called random numbers.

In Pumas, a number of functions use random number generators under the hood. Most notably the simobs function.

Pumas functions that use random number generators

This a non-exhaustive list of Pumas functions that use random number generators:

  • simobs
  • inspect when NPDEs are calculated with the nsim keyword argument
  • fit when using any Bayesian or SAEM estimation method
  • infer when using Bootstrap or SIR methods
  • any Pumas Bayesian cross-validation method
  • vpc

At first glance, it may appear that computers can generate random numbers spontaneously. However, it’s crucial to understand that computers, by their very nature, are deterministic machines. This means that their operations can be predicted and replicated given the same initial conditions and inputs.

In the context of generating random numbers, computers employ what are known as Pseudo-Random Number Generators (PRNGs). PRNGs are algorithms that use mathematical formulas or pre-calculated tables to produce sequences of numbers that mimic the properties of truly random numbers. The key characteristic of a PRNG is that it starts with an initial value, known as a “seed”. Given the same seed, a PRNG will always generate the same sequence of numbers.

This predictability, rather than being a limitation, is extremely useful for applications requiring reproducibility, such as simulations in scientific research. By using the same seed, researchers can generate the same sequence of ‘random’ numbers, ensuring that experiments can be replicated and verified by others. Thus, while the numbers generated by a PRNG may not be truly random in a mathematical sense, they provide a valuable tool for scientific and other applications where reproducibility is essential.

Pumas’ default random number generator

Pumas under the hood uses the Pumas.default_rng() for generating random numbers. Which is the same as Random.default_rng() in Julia:

using Random
Random.default_rng === Pumas.default_rng
true

All functions that expose an rng keyword argument, use the Pumas.default_rng() as the default value. Under the hood Pumas.default_rng() is a Random.TaskLocalRNG object:

typeof(Pumas.default_rng())
TaskLocalRNG

Under the hood Random.TaskLocalRNG wraps a random number generator and makes it task-local. Random.TaskLocalRNG is a random number generator that is deterministically defined by the parent task at the time of its creation. This means that this random number generator can be used safely in parallel computations.

There are several types of random number generators available in Julia:

  • Random.Xoshiro: default in Julia and in Pumas.
  • Random.MersenneTwister: default in R and Python.
  • Random.RandomDevice: uses the system’s random number generator, e.g. /dev/urandom in UNIX/Linux-based systems.

In the past, Julia used the Random.MersenneTwister as the default random number generator. However, the Random.Xoshiro uses less memory and is faster than the Random.MersenneTwister. As a side note, the Random.RandomDevice uses the system’s random number generator and is useful for cryptographic applications.

The way we deterministically define the sequence from a pseudo random number generator is by setting a seed value.

2 Setting a Seed

A seed is a number that initializes the random number generator. By setting the seed value, we ensure that the sequence of random numbers generated will be reproducible.

In Julia, the rand function is a random number generator. It generates a uniformly distributed random number between 0 and 1.

rand()
0.5345431431295281

Have a look at this example. Let’s generate a sequence of 3 random numbers with the seed 123.

Random.seed!(123)
rand(3)
3-element Vector{Float64}:
 0.521213795535383
 0.5868067574533484
 0.8908786980927811

Now let’s generate the same sequence of random numbers with the same seed:

Random.seed!(123)
rand(3)
3-element Vector{Float64}:
 0.521213795535383
 0.5868067574533484
 0.8908786980927811

As you can see, the sequence of random numbers is the same.

Note that if we initially set a seed value, but do not reset the random number generator between calls to rand, we will get different sequences of random numbers between calls because we are sampling further down the sequence of pseudo-random numbers:

Random.seed!(123)
rand(3)
3-element Vector{Float64}:
 0.521213795535383
 0.5868067574533484
 0.8908786980927811
rand(3)
3-element Vector{Float64}:
 0.040820522471272214
 0.5366036197527164
 0.454170253148516

However, we get the same behavior of this sequence of two calls to rand every time we run the code. This is due to the fact that every time that we call a PRNG in Julia, it advances its internal state. Setting a seed resets the internal state of the PRNG to a certain value.

3 Reproducibility

Knowing that setting a seed ensures that the random number generator will always generate the same sequence of numbers, there are two main ways to ensure reproducibility in Pumas. They both involve setting the seed of the random number generator:

  1. Local reproducibility: Set the seed of the random number generator inside a function call.
  2. Global reproducibility: Set the seed of the random number generator globally.
Best Practices for Reproducibility

It is best practice to only use global reproducibility in the finalized analysis script.

Local reproducibility is useful for some interactive work, and debugging purposes. However, it can introduce spurious correlations in the pseudorandom numbers which can bias analyses.

When you set the seed of the random number generator inside a function call, you ensure that the function call will always generate the same sequence of numbers. This is useful when you want to ensure reproducibility in a specific function call.

When you set the seed of the random number generator globally, you ensure that all function calls that use the random number generator will generate the same sequence of numbers.

3.1 Local Reproducibility: Setting the Seed Inside a Function Call

To set the seed of the random number generator inside a function call, you need to pass the rng keyword argument to the function call. Note that the rng value should be a PRNG that has the seed set.

PRNGs in Julia

As we’ve seen, Julia has several types of PRNGs. To pass a seed to a PRNG, you need to instantiate the PRNG with the seed value. For example, to set the seed of the Random.Xoshiro PRNG to 123, you would do:

Random.Xoshiro(123)

Analogously, to set the seed of the Random.MersenneTwister PRNG to 123, you would do:

Random.MersenneTwister(123)

Here’s an example of how to set the seed of the random number generator inside a simobs function call:

using DataFramesMeta
sims = simobs(model, population, fixeffs; rng = Random.Xoshiro(123), obstimes = 1:5)
df_sims = DataFrame(sims)
@chain df_sims begin
    groupby(:time)
    @combine :mean_conc = mean(:conc)
end
6×2 DataFrame
Row time mean_conc
Float64 Float64?
1 0.0 missing
2 1.0 3.4357
3 2.0 4.52999
4 3.0 4.84612
5 4.0 4.89696
6 5.0 4.85515

If we run this code several times, we will always get the same sequence of simulated observations because the random sequence of values from rng is reset at each function call by explicitly specifying the seed value:

sims = simobs(model, population, fixeffs; rng = Random.Xoshiro(123), obstimes = 1:5)
df_sims = DataFrame(sims)
@chain df_sims begin
    groupby(:time)
    @combine :mean_conc = mean(:conc)
end
6×2 DataFrame
Row time mean_conc
Float64 Float64?
1 0.0 missing
2 1.0 3.4357
3 2.0 4.52999
4 3.0 4.84612
5 4.0 4.89696
6 5.0 4.85515
@chain macro

The @chain macro is a part of the DataFramesMeta package. Similar to R’s pipe operator %>%, it allows you to chain operations on a DataFrame.

If you want to learn more about the @chain macro, check our DataFramesMeta tutorial.

3.2 Global Reproducibility: Setting the Seed Globally

To set the seed of the random number generator globally, you need to instantiate a PRNG with the seed value outside of function calls. You will need to pass this PRNG to the rng keyword argument of all Pumas’ functions that accept it. Calls to the sequence of random values in rng will proceed down the list of random values without resetting because there is no reapplication of the seed value.

Here’s an example of how to set the seed of the random number generator globally:

rng = Random.Xoshiro(123)

sims = simobs(model, population, fixeffs; rng = rng, obstimes = 1:5)
df_sims = DataFrame(sims)
@chain df_sims begin
    groupby(:time)
    @combine :mean_conc = mean(:conc)
end
6×2 DataFrame
Row time mean_conc
Float64 Float64?
1 0.0 missing
2 1.0 3.4357
3 2.0 4.52999
4 3.0 4.84612
5 4.0 4.89696
6 5.0 4.85515

If we run the same simobs function call that uses the random number generator again, we won’t get the same sequence of simulated observations:

sims = simobs(model, population, fixeffs; rng = rng, obstimes = 1:5)
df_sims = DataFrame(sims)
@chain df_sims begin
    groupby(:time)
    @combine :mean_conc = mean(:conc)
end
6×2 DataFrame
Row time mean_conc
Float64 Float64?
1 0.0 missing
2 1.0 3.94973
3 2.0 4.7593
4 3.0 4.86983
5 4.0 4.81396
6 5.0 4.71799

As you can see we have different results. However, the script that sets the seed globally will always generate the same results. So it is up to you what kind of reproducibility you want to achieve: global or local.

Functions that don’t expose an rng keyword argument

You might come across functions outside of Pumas that, despite using random number generators under the hood, don’t expose an rng keyword argument. In this case, you can use the Random.seed! function to set the seed of the random number generator globally. This will ensure that all function calls that use the random number generator will generate the same sequence of numbers.

At the top of your script, after the using statement, you can set the seed of the random number generator globally:

using Random
Random.seed!(123) # or any other number

4 Random Covariates

In our population creation above, we set the covariate Wt to 70. This was a simple example, but in practice we would often like to set the covariate Wt to a random value from some distribution of possible values. This can also be done in a reproducible way. For a uniform random distribution we can do this using the rand function and setting a seed value. Other random distributions are possible, such as a normal distribution from the randn function.

Every function that uses a random number generator in Julia, as in Pumas, accepts rng either as a positional or keyword argument. For example, if you want to set the covariate Wt to a random value, first create a rand_covariates function that uses the random number generator:

function rand_covariates(rng)
    Wt = rand(rng, 50:100)
    return (; Wt)
end
rand_covariates (generic function with 1 method)

The rand_covariates function above takes a random number generator rng as an argument. It then generates a random integer value for the covariate Wt between 50 and 100 using the random number generator rng and outputs a NamedTuple with the covariate Wt.

Now, we just need to incorporate the rand_covariates function into the Subject constructor:

rng = Random.Xoshiro(123)
population = map(
    i -> Subject(;
        id = i,
        events = ev,
        covariates = rand_covariates(rng),
        observations = (; dv = nothing),
    ),
    1:5,
)
Population
  Subjects: 5
  Covariates: Wt
  Observations: dv

Note that we are not setting the seed of the random number generator inside the rand_covariates function. Otherwise, we would always get the same sequence of random values for the covariate Wt. And all subjects would have the same value for the covariate Wt. Here we don’t need local reproducibility. Instead we need global reproducibility.

Reproducibility across Pumas versions

It is important to note that the reproducibility of simulations across different versions of Pumas is not guaranteed.

5 Conclusion

In this tutorial, we learned how to set the seed of the random number generator, and feed it to a Pumas function, to ensure reproducibility. We also learned that there are two main ways to ensure reproducibility in Pumas:

  1. Set the seed of the random number generator inside a function call.
  2. Set the seed of the random number generator globally.

We also learned that every function that uses a random number generator in Julia, as in Pumas, accepts rng either as a positional or keyword argument. This is useful when we want to set random covariates in a reproducible way.

As a final note, it is best practice to only use global reproducibility in the finalized analysis script. Local reproducibility can be especially helpful when you are doing interactive work and you want to ensure that a code snippet will always return the same thing even if you add code before it, but should typically be removed from the final result.