Introduction to Optimal Design

Authors

Mohamed Tarek

Jose Storopoli

Joel Owen

Optimal design, particularly within the context of pharmacometrics and statistics, is about strategically planning experiments or studies to get the most informative data possible, using the least resources. This approach is especially important in fields like drug development, where experiments can be costly and time-consuming.

1 Key Principles

  1. Objective: Before you can design an optimal experiment, you need to be clear about what you’re trying to achieve. Which parameters are you trying to estimate with good precision?
  2. Decision variables: These are the variables that you can control in your experiment to achieve your objective, e.g. sample times, group sizes or dose amounts. Your objective and decision variables will guide your design.
  3. Constraints: These are the limits within which you must operate, such as number of samples, time windows or ethical considerations. An optimal design finds the best solution within these constraints.

1.1 Expected Fisher Information

The most common criterion for optimality when designing experiments uses a concept known as the expected Fisher information. Consider a single subject study with experimental design \(d\) and covariates \(x\). The design \(d\) includes the controllable covariates of the subject, e.g. dose amount and sample times. The covariates \(x\) include the uncontrollable covariates, assumed fixed, e.g. age and sex.

Additionally, consider a model with a single parameter \(\theta\) and a marginal likelihood function \(L(\theta; x, d, y)\) where \(y\) is the unobserved response of the subject. The expected Fisher information \(I(\theta)\) is given by the expected value of the second derivative of \(-\log L\), where the differentiation is with respect to the parameter \(\theta\) and the expectation is with respect to the response \(y\): \[ I(\theta; x, d) = E_y \Bigg[ -\frac{\partial ^2 \log L(\theta; x, d, y)}{\partial \theta^2} \Bigg] \]

This quantity gives a theoretical measure of information that your experiment gives you about your parameter \(\theta\).
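To make this concrete, below is a minimal sketch (not Pumas functionality) that approximates \(I(\theta)\) for a toy model, a Normal with known standard deviation and unknown mean \(\theta\), by averaging the second derivative of \(-\log L\) over simulated responses; the ForwardDiff and Statistics packages are assumed to be available.

using ForwardDiff, Statistics, Random

Random.seed!(1)
θtrue, σ, n = 1.0, 2.0, 10
# -log L for n iid Normal(θ, σ²) observations y
negloglik(θ, y) = sum(0.5 * log(2π * σ^2) .+ (y .- θ) .^ 2 ./ (2σ^2))
# E_y[-∂²(log L)/∂θ²] approximated by Monte Carlo over simulated responses;
# for this toy model the second derivative is constant in y, so the average is exact
fisher = mean(1:10_000) do _
    y = θtrue .+ σ .* randn(n)
    ForwardDiff.derivative(t -> ForwardDiff.derivative(θ -> negloglik(θ, y), t), θtrue)
end
fisher  # ≈ n / σ² = 2.5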

Note

A design \(d\) that maximizes \(I(\theta; x, d)\) for given values of \((\theta, x)\) while satisfying all the design constraints (if any) is the optimal design according to this optimality criterion.

When \(\theta\) is made up of \(M > 1\) parameters instead of a single parameter, the expected Fisher information becomes an \(M \times M\) matrix: \[ I(\theta; x, d) = E_y \Bigg[ -\frac{\partial^2 \log L(\theta; x, d, y)}{\partial \theta \cdot \partial \theta^T} \Bigg] \] This matrix is known as the expected Fisher information matrix (FIM). When comparing 2 designs \(d_1\) and \(d_2\) given the above matrix, there are multiple ways to summarize the matrix using a single scalar objective function value. These summarization methods are commonly known as optimality criteria and will be discussed below.

When considering \(N\) subjects with covariates \(x = \{ x_i: i = 1 \dots N \}\), the Fisher information is additive with respect to the subjects: \[ I(\theta; x, d) = \sum_{i=1}^{N} I(\theta; x_i, d) \]
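For example, with 2 subjects whose individual expected FIMs are already known, the population FIM is simply their sum. A toy 2×2 illustration with made-up numbers:

# hypothetical per-subject expected FIMs
fim_subj1 = [4.0 1.0; 1.0 2.0]
fim_subj2 = [3.0 0.5; 0.5 1.5]
# the population FIM is the elementwise sum
fim_pop = fim_subj1 + fim_subj2  # [7.0 1.5; 1.5 3.5]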

The interpretation of the expected FIM is as follows:

  • Diagonal Element (\(I_{ii}\)): This represents the information about the parameter \(\theta_i\), quantifying how much information the data provides about this parameter. It reflects the sensitivity of the likelihood function to changes in this parameter.

  • Off-Diagonal Element (\(I_{ij}\) where \(i \neq j\)): This element represents the interaction term between the estimates of parameters \(\theta_i\) and \(\theta_j\), indicating how changes in one parameter estimate might affect the estimate of another. If these off-diagonal elements are non-zero, it suggests that there is some level of correlation between the parameters in how they are informed by the data.

1.1.1 Example

To give a concrete example, if you’re estimating two parameters, say \(\mu\) (mean) and \(\sigma^2\) (variance), from a normal distribution:

  • A diagonal element, such as \(I_{\mu\mu}\), would quantify how sensitive the likelihood is to changes in \(\mu\), directly relating to the precision with which \(\mu\) can be estimated from the data.
  • An off-diagonal element, like \(I_{\mu\sigma^2}\), would indicate how the estimate of \(\mu\) might change with changes in \(\sigma^2\) and vice versa, revealing any interdependence in their estimation from the data.

Here’s how this particular FIM would be represented:

\[ I(\theta) = \begin{bmatrix} I_{\mu\mu} & I_{\mu\sigma^2} \\ I_{\sigma^2\mu} & I_{\sigma^2\sigma^2} \end{bmatrix} \]

Where:

  • \(I_{\mu \mu}\) is the information about the mean, reflecting how much the data tells us about \(\mu\).
  • \(I_{\sigma^2 \sigma^2}\) is the information about the variance, reflecting how much the data informs us about \(\sigma^2\).
  • \(I_{\mu \sigma^2}\) and \(I_{\sigma^2 \mu}\) represent the interaction between the estimates of \(\mu\) and \(\sigma^2\), indicating the degree to which the estimation of one parameter is influenced by the estimation of the other.

In a typical maximum likelihood setting for a Normal distribution, the Fisher information elements for \(\mu\) and \(\sigma^2\) are as follows, assuming a sample of \(n\) independent observations:

  • \(I_{\mu\mu} = \frac{n}{\sigma^2}\)
  • \(I_{\sigma^2\sigma^2} = \frac{n}{2\sigma^4}\)

The off-diagonal elements, \(I_{\mu\sigma^2}\) and \(I_{\sigma^2\mu}\), are zero, reflecting the independence of the estimates of \(\mu\) and \(\sigma^2\) for independent, normally distributed observations.
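These closed-form elements can be checked numerically. Below is a hedged sketch (again assuming the ForwardDiff and Statistics packages are available; this is not Pumas functionality) that averages the Hessian of \(-\log L\) over simulated datasets and recovers the matrix above:

using ForwardDiff, Statistics, Random

Random.seed!(1)
μ, σ², n = 1.0, 4.0, 50
# -log L of n iid Normal(μ, σ²) observations y, with θ = [μ, σ²]
negloglik(θ, y) = sum(0.5 * log(2π * θ[2]) .+ (y .- θ[1]) .^ 2 ./ (2 * θ[2]))
# E_y[Hessian of -log L] approximated by Monte Carlo over simulated datasets
fim = mean(1:10_000) do _
    y = μ .+ sqrt(σ²) .* randn(n)
    ForwardDiff.hessian(θ -> negloglik(θ, y), [μ, σ²])
end
fim  # ≈ [n/σ² 0; 0 n/(2σ⁴)] = [12.5 0.0; 0.0 1.5625]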

1.1.2 Random Effects and First Order Approximation

To compute the expected FIM, one needs the Hessian of \(-\log L\) with respect to the parameters \(\theta\), where \(L\) is the marginal likelihood. In nonlinear mixed effects (NLME) models, computing \(\log L\) requires marginalizing the random effects first. In Pumas, the first order (FO) approximation method is used to marginalize the random effects and approximate the marginal likelihood. The parameters \(\theta\) in optimal design therefore refer only to the population-level parameters.

1.1.3 Cramer-Rao Lower Bound for a Single Parameter

The Cramer-Rao (CR) lower bound further highlights the importance of the expected Fisher information. The CR lower bound for single-parameter models is: \[ \text{Var}_y [ \hat{\theta}(x, d, y) ] \geq \frac{1}{I(\theta; x, d)} \] where \(\text{Var}_y[\hat{\theta}(x, d, y)]\) is the variance of the unbiased maximum likelihood estimator \(\hat{\theta}(x, d, y)\) assuming the true parameter value is \(\theta\). Moving forward, \(\hat{\theta}(x, d, y)\) will be shortened to \(\hat{\theta}\), and \(I(\theta; x, d)\) will be shortened to \(I(\theta)\).

Note

The above expression for the CR lower bound is only correct for scalar \(\theta\). You cannot divide by a matrix!

Lower Bound Minimization

Choosing an experimental design \(d\) that minimizes \(1 / I(\theta)\) (i.e., maximizes \(I(\theta)\)) only minimizes a lower bound on the variance \(\text{Var}_y [ \hat{\theta} ]\), not the variance itself. This does not guarantee that \(\text{Var}_y [ \hat{\theta} ]\) is itself minimized because, in theory, the lower bound can be loose. However, in practice, minimizing the lower bound \(1 / I(\theta)\) tends to be a good surrogate objective for minimizing \(\text{Var}_y [ \hat{\theta} ]\), which is the quantity we actually care about.

Importance of Simulations

Given that in NLME models, the marginal likelihood (and by extension \(I(\theta)\)) is approximated using a first order approximation, the computed \(1 / I(\theta)\) is only an approximation of the actual lower bound. While this approximate lower bound may be a good surrogate when optimizing the experiment design, using simulation to evaluate the final design is usually recommended. In addition, running a simulation to evaluate the design can be useful to confirm that the model will be estimable using the software intended for the eventual data analysis with the future experimental data.

1.1.4 Cramer-Rao Lower Bound for a Vector of Parameters

One generalization of the scalar CR lower bound for multiple parameters, i.e. vector \(\theta\), is: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] which is true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\). \(M\) is the number of parameters in \(\theta\). \(\text{Var}_y[\hat{\theta}]\) is the \(M \times M\) covariance matrix of the unbiased maximum likelihood estimate \(\hat{\theta}\). The proof of this lower bound is shown in Section 4.

When restricting \(w\) so that \(w^T \cdot w = 1\), i.e. \(w\) is a unit vector, the generalized CR lower bound simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \] When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we recover the scalar CR lower bound: \[ \text{Var}_y [ \hat{\theta}_i ] \geq \frac{1}{I(\theta)_{i,i}} \]

A tighter and more commonly used generalization of the scalar CR lower bound, originally presented by Rao in 1945 and later shown to be a special case of the Hammersley-Chapman-Robbins lower bound, is: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \] This is also true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\).

Note

Note that the division by \(w^T \cdot w\) on both sides is not necessary but is done for consistency with the previous lower bound.

This improved lower bound can also be equivalently written as: \[ \text{Var}_y[\hat{\theta}] - I(\theta)^{-1} \succeq 0 \] where \(A \succeq 0\) means that \(A\) is a positive semi-definite matrix.

The second generalization is a tighter lower bound when \(\theta\) is a vector because: \[ \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] for all \(w\) with \(w^T \cdot w > 0\), which follows from the Cauchy-Schwarz inequality: \[ (u^T \cdot v)^2 \leq (u^T \cdot u) \cdot (v^T \cdot v) \] by substituting: \[ \begin{aligned} u & = I(\theta)^{1/2} \cdot w \\ v & = I(\theta)^{-1/2} \cdot w \end{aligned} \] where \(I(\theta)^{1/2}\) is the symmetric positive definite square root of \(I(\theta)\) and \(I(\theta)^{-1/2}\) is its inverse.

The relationship between the 2 bounds is therefore given by: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] If \(w^T \cdot w\) is assumed to be 1, this simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq w^T \cdot I(\theta)^{-1} \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \]

When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we get: \[ \text{Var}_y[\hat{\theta}_i] \geq (I(\theta)^{-1})_{i,i} \geq \frac{1}{I(\theta)_{i,i}} \]

If \(\theta\) is a scalar, or more generally if \(w\) is an eigenvector of \(I(\theta)\), the 2 lower bounds are equal.
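The ordering of the 2 bounds can be checked numerically for any positive definite matrix. A small illustration with a hypothetical 2×2 expected FIM, using the LinearAlgebra standard library:

using LinearAlgebra

fim = [4.0 1.0; 1.0 2.0]   # hypothetical positive definite expected FIM
w = normalize([1.0, 1.0])  # a unit weight vector

tight = w' * inv(fim) * w   # second (tighter) generalization ≈ 0.2857
loose = 1 / (w' * fim * w)  # first (looser) generalization = 0.25
tight ≥ loose               # true; equality holds when w is an eigenvector of fim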

1.2 Criteria for Optimality

When optimizing the design to maximize a measure of information gain for multi-parameter models, there are multiple choices of scalar objective functions, all of which are functions of the expected FIM. All of the scalar objective functions are aggregate functions of one of the 2 lower bounds presented above, used as surrogate objectives for minimizing the variances of the parameters’ estimates. Each optimality criterion has a different name. Let’s explore some of the most commonly used criteria.

Local Optimal Design

To define any of these objective functions, the expected FIM, \(I(\theta)\), is typically only computed for a user-provided value of \(\theta = \theta_0\) (taken from literature), which is assumed to be the true value of \(\theta\). This approach is often called local optimal design.

1.2.1 D-Optimality

D-Optimal design maximizes the (log) determinant of the expected FIM. By doing so, a D-optimal design aims to minimize the volume of the confidence ellipsoid around the estimated parameters.

Note

In Pumas, the negative of the log determinant of the expected FIM is minimized which is equivalent to maximizing the determinant of the expected FIM.

Application: Ideal for experiments where the goal is to estimate parameters and all their weighted combinations with maximum average precision in the presence of correlations.

1.2.2 T-Optimality

T-Optimality maximizes the trace (the sum of the diagonal elements) of the expected FIM, \(\sum_i I_{i,i}\). This criterion indirectly minimizes the average \(\text{Var}_y [ \hat{\theta}_i ]\) by maximizing the sum of the denominators of the loose CR lower bounds \(1 / I_{i,i}\), thereby pushing those bounds down.

Note

In Pumas, the negative of the trace of the expected FIM is minimized which is equivalent to maximizing the trace.

Application: Useful when the objective is to minimize the average variance across all parameter estimates or when we would like to use different weights on parameters by incorporating weights in the objective.

1.2.3 A-Optimality

A-Optimality minimizes the trace (the sum of the diagonal elements) of the inverse of the expected FIM, \(\sum_i (I^{-1})_{i,i}\). Similar to T-optimality, this criterion indirectly minimizes the average \(\text{Var}_y [ \hat{\theta}_i ]\) by minimizing the sum of the tighter lower bounds, \((I^{-1})_{i,i}\).

Application: Similar to T-optimality.
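For reference, all 3 criteria can be written as short functions of an expected FIM. The following is a minimal sketch mirroring the definitions above, not the OptimalDesign implementation:

using LinearAlgebra

d_objective(fim) = -logdet(fim)  # minimized in D-optimality
t_objective(fim) = -tr(fim)      # minimized in T-optimality
a_objective(fim) = tr(inv(fim))  # minimized in A-optimality

fim = [4.0 1.0; 1.0 2.0]         # hypothetical 2×2 expected FIM
d_objective(fim), t_objective(fim), a_objective(fim)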

1.2.4 Using Weights and Parameter Scales

Both T-Optimality and A-Optimality can be extended using weights on parameters. In T-Optimality, the weighted objective to be maximized is \(\sum_i p_i \cdot I_{i,i}\), where \(p_i\) is the weight of parameter \(\theta_i\). In A-Optimality, the weighted objective to be minimized is \(\sum_i p_i \cdot (I^{-1})_{i,i}\).

Using weights can help us achieve any of the following goals:

  1. Weights allow us to vary the level of importance of different parameters. For example, we typically care more about the clearance parameter than the inter-individual variability (IIV) parameters. More important parameters should get higher weights than less important parameters. Parameters which we do not care about can be given a weight of 0.
  2. Weights can also be used to make the objective function value invariant to the scales of the parameters \(\theta\). Some parameters may be in the 100s while other parameters may be in the 0.1s. The optimal design objective function values are sensitive to the scales of the parameters.
  3. The product of importance weights and scaling weights can be used to achieve both goals 1 and 2 simultaneously.

In T-Optimality, to make the objective function invariant to the parameters’ scales, one can use the weights \(p_i = \theta_i^2\). The justification is shown below.

Using the loose CR lower bound and dividing both sides by \(\theta_i^2\), we get: \[ \begin{aligned} \text{Var}_y [ \hat{\theta}_i ] & \geq \frac{1}{I(\theta)_{i,i}} \\ \frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{1}{\theta_i^2 \cdot I(\theta)_{i,i}} \end{aligned} \] If \(\hat{\theta} \approx \theta\), \(\text{Var}_y [ \hat{\theta}_i ] / \theta_i^2\) will be approximately equal to the square of the relative standard error estimate of \(\hat{\theta}_i\), which is invariant to the parameters’ scales. T-Optimality tries to indirectly minimize the LHS of the inequality by maximizing the sum of the denominators of the RHS terms for all \(i\): \[ \sum_i \theta_i^2 \cdot I(\theta)_{i,i} \] Setting \(p_i = \theta_i^2\) in T-Optimality therefore makes the objective function invariant to the parameters’ scales.

In A-Optimality, weights \(p_i = 1 / \theta_i^2\) can be used to achieve invariance to the parameters’ scales. This also follows from the tight CR lower bound and assuming \(\hat{\theta} \approx \theta\). \[ \begin{aligned} \text{Var}_y [ \hat{\theta}_i ] & \geq (I(\theta)^{-1})_{i,i} \\ \frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{(I(\theta)^{-1})_{i,i}}{\theta_i^2} \end{aligned} \]
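Continuing the sketch of criteria above, the scale-invariant weighted objectives can be written as follows, where θvec denotes a vector of the (nonzero) parameter values in the same order as the FIM rows. This is an illustration of the weighting schemes, not the OptimalDesign internals:

# p_i = θ_i²; negative of the weighted trace, to be minimized
weighted_t(fim, θvec) = -sum(θvec .^ 2 .* diag(fim))
# p_i = 1 / θ_i²; weighted trace of the inverse, to be minimized
weighted_a(fim, θvec) = sum(diag(inv(fim)) ./ θvec .^ 2)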

Zero Parameters

Using weights to make the objective function value invariant to the parameters’ scales assumes that none of the parameters is exactly 0. For parameters with value 0 or very close to 0, using the above weighting scheme can give spurious results.

2 Example in Pumas

Let’s dive into an example using the OptimalDesign package in Pumas.

2.1 Loading OptimalDesign

The first step is to load the OptimalDesign package using:

using OptimalDesign
OptimalDesign and Pumas

By loading OptimalDesign, Pumas will be automatically loaded as well. If you are doing an optimal design workflow, you don’t need to load Pumas separately.

2.2 Model Definition

After loading OptimalDesign, you should define a Pumas model.

model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end

    @random begin
        η ~ MvNormal(Ω)
    end

    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end

    @dynamics Depots1Central1

    @derived begin
        dv ~ @. Normal(Central / Vc, σ)
    end
end
PumasModel
  Parameters: tvcl, tvvc, tvka, Ω, σ
  Random effects: η
  Covariates: 
  Dynamical system variables: Depot, Central
  Dynamical system type: Closed form
  Derived: dv
  Observed: dv

As you can see, this is a simple one-compartment model with first-order absorption.

2.3 Define a Subject/Population for Optimal Design

To perform optimal design, one needs to specify the subject(s) for which optimal sample times will be identified. A Subject in Pumas typically includes:

  • Covariates, possibly time-varying
  • Dosing regimen
  • Sample times
  • Observation names, e.g. dv
  • Observation values

All of the above information is required when using a Subject for fitting a model. However, for the purpose of optimal design, while the same Subject data structure is used, some of its fields become optional and are ignored if present. For example, sample times are optimized in optimal design, so they are not required as input. Observation values are also assumed to be missing in optimal design since presumably no experiment has been conducted yet.

There are 2 ways to define a Subject or Population for optimal design, discussed below.

2.3.1 Manual Definition

The following is an example where a Subject is defined manually using the Subject constructor.

subject = Subject(; events = DosageRegimen(100), observations = (; dv = nothing))
Subject
  ID: 1
  Events: 1
  Observations: dv: (n=0)

To define a population manually, you can use the map function to define multiple subjects programmatically. The following defines a population of 10 identical subjects with IDs from 1 to 10.

population = map(
    i ->
        Subject(; id = i, events = DosageRegimen(100), observations = (; dv = nothing)),
    1:10,
)
Population
  Subjects: 10
  Observations: dv

or define a vector of subjects manually:

population = [
    Subject(; id = 1, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 2, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 3, events = DosageRegimen(100), observations = (; dv = nothing)),
]
Population
  Subjects: 3
  Observations: dv

Note that you can pass any covariate to the Subject constructor using the covariates keyword argument, e.g. covariates = (; AGE = 30).

Notice how the sample times were not specified in the above Subject constructors. Also notice that the observation values were set to nothing. However, it is important to use the correct observation name, dv, when constructing the subject, matching the observed variables’ names in the model’s @derived block.

Note

If an observation name is missing from the observations keyword argument but present in the model, sample times for this response variable will not be optimized. This has applications when we require a separate design for PK sample times vs PD sample times for each subject. This will be presented in a future tutorial.

2.3.2 Read from a DataFrame

Besides the manual definition, one can also conveniently define a Population from a DataFrame using read_pumas. The DataFrame may be one of the standard datasets available in the PharmaDatasets package, or may be read from a CSV file.

When reading subjects’ data from a DataFrame using read_pumas and then doing optimal design, only the observations names, dosing information and covariates are taken into account. The observation values and sample time points are ignored. If time-varying covariates are present in the data, the time points in the data will be used to define the time-varying covariates only but new sample times will be selected for the optimal design.

To give an example, we’ll be using the Warfarin PK dataset from the PharmaDatasets package.

using PharmaDatasets
using DataFrames
using DataFramesMeta
df = dataset("warfarin.csv")
first(df, 5)
5×6 DataFrame
 Row │ id     time     dv         amt      evid   cmt
     │ Int64  Float64  Float64?   Int64?   Int64  Int64
─────┼──────────────────────────────────────────────────
   1 │     1      0.0  missing         70      1      1
   2 │     1      0.5   9.9973   missing       0      1
   3 │     1      1.0   7.3587   missing       0      1
   4 │     1      2.0   7.4868   missing       0      1
   5 │     1      6.0   9.6466   missing       0      1

The dataset contains the following columns:

  • :id: the subject ID.
  • :time: the time of the observation.
  • :dv: the observed drug concentration.
  • :amt: the dose administered.
  • :evid: the event ID, indicating whether the observation is a dose or a concentration measurement.
  • :cmt: the compartment ID, indicating the compartment where the dose was administered.

Let’s check the summary statistics of the dataset:

describe(df, :mean, :std, :min, :max, :nuniqueall)
6×6 DataFrame
 Row │ variable  mean       std        min      max     nuniqueall
     │ Symbol    Float64    Float64    Real     Real    Int64
─────┼─────────────────────────────────────────────────────────────
   1 │ id        16.5        9.24916   1         32             32
   2 │ time      29.0556    39.3667    0.0      120.0            9
   3 │ dv         4.68351    2.63184   0.28399   12.184        256
   4 │ amt       70.0        0.0       70        70              1
   5 │ evid       0.111111   0.314817  0          1              2
   6 │ cmt        1.0        0.0       1          1              1

We have 32 subjects, each with 8 measurements and 1 dose event, which we can verify for the first subject:

@chain df begin
    @rsubset :id == 1
    @by :evid :nrow = length(:time)
end
2×2 DataFrame
 Row │ evid   nrow
     │ Int64  Int64
─────┼──────────────
   1 │     0      8
   2 │     1      1

With the data loaded, we can now proceed to parse it into a Population:

population = read_pumas(
    df;
    id = :id,
    time = :time,
    observations = [:dv],
    amt = :amt,
    evid = :evid,
    cmt = :cmt,
)
Population
  Subjects: 32
  Observations: dv

2.4 Specify the Population Parameters

Currently, only local optimal design is available, so the user must specify the values of the population parameters to use. These parameter values will be assumed to be the true parameter values when computing the expected FIM.

Population Parameters

Population parameters are the parameters of the model that are shared across the population. These include the so-called typical values of the parameters, e.g. tvcl, tvvc, and tvka, along with the IIV parameter Ω and the residual variability parameter σ in the Warfarin model.

params = (;
    tvcl = 0.15,
    tvvc = 8,
    tvka = 1,
    Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
    σ = sqrt(0.01),
)
(tvcl = 0.15,
 tvvc = 8,
 tvka = 1,
 Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
 σ = 0.1,)

2.5 Time Window Specification

In OptimalDesign, one needs to specify the time windows in which optimal samples will be chosen. These time windows should represent clinical hours to increase the probability that the design found by OptimalDesign is actually feasible to implement in practice.

In OptimalDesign, users can express time intervals as regular floating point number bounds, i.e. Float64. This would be typical for designing future clinical trials. Alternatively, for convenience in managing clinical scheduling in a hospital or clinic setting, the Dates standard Julia package can be used to specify the time bounds for optimization using specific dates and clock times via the DateTime type. If the time bounds are passed in as Float64, the optimal design’s times will be in Float64 format; if the time bounds are passed in using DateTime, the optimal design’s times will be in DateTime.

2.5.1 Floating Point Time Windows

Let us define the start and end times of a time window via t0 and tend:

t0, tend = 0.0, 16.0
(0.0, 16.0)
Defining Multiple Variables in a Single Line

In Julia, you can define multiple variables in the same line of code, where the variable names and values are separated by commas. The following 2 code snippets are equivalent:

t0, tend = 0.0, 16.0
t0 = 0.0
tend = 16.0

The time unit assumed for t0 and tend matches the time unit assumed in the model. Hence if your model assumes “hour” is the time unit then the time bounds above represent hour 0 and hour 16.

To define time bounds, OptimalDesign makes use of intervals in Julia. One can define a time window that starts at t0 and ends at tend using:

t0 .. tend
0.0 .. 16.0

Arithmetic on intervals is also supported to increment the whole interval by a fixed number. However, parentheses must be used carefully. The following is the correct syntax to increment the start and finish bounds by 1 day, assuming the model’s unit is “hour”:

(t0 .. tend) + 24
24.0 .. 40.0

The following syntax parses differently, incrementing only the finish bound by a day.

t0 .. tend + 24
0.0 .. 40.0

2.5.2 DateTime Time Windows

We’ll now cover Julia’s DateTime API and how to use it in OptimalDesign. Let us define the start and end times of a time window via t0 and tend:

t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am
2024-01-25T09:00:00
tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm
2024-01-25T21:00:00

The DateTime constructor’s inputs are of the form: (Year, Month, Day, Hour, Minute, Second, Millisecond). Dropping arguments from the end sets their values to 0, e.g. the Minute, Second, and Millisecond values in t0 and tend above are 0.
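For example, specifying the time only down to the minute:

DateTime(2024, 1, 25, 9, 30) # 25 Jan 2024 - 9:30 am; seconds and milliseconds default to 0
2024-01-25T09:30:00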

2.5.2.1 DateTime Arithmetic

One can use arithmetic functions to increment dates and times as follows:

  • Next second: t0 + Second(1).
  • Next minute: t0 + Minute(1).
  • Next hour: t0 + Hour(1).
  • Next day: t0 + Day(1).
  • Next week: t0 + Day(7).
  • Next month: t0 + Month(1).
  • Next year: t0 + Year(1).

Multiplication with integers and subtraction are also supported.
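For example, using the t0 and tend values defined above:

t0 + 3 * Hour(8) # add 24 hours in 3 steps of 8 hours
2024-01-26T09:00:00
tend - Hour(12) # subtract 12 hours
2024-01-25T09:00:00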

2.5.2.2 Intervals

Similar to floating point intervals, one can define an interval where t0 and tend are DateTime objects.

t0 .. tend
2024-01-25T09:00:00 .. 2024-01-25T21:00:00

The interval can then be incremented by 1 day as follows:

(t0 .. tend) + Day(1)
2024-01-26T09:00:00 .. 2024-01-26T21:00:00

The following syntax parses differently, incrementing only the finish bound by a day.

t0 .. tend + Day(1)
2024-01-25T09:00:00 .. 2024-01-26T21:00:00

2.6 Decision Definition

The next step in OptimalDesign is to define the design structure and requirements that make up the decision variables to be optimized. The term decision is used in this context to refer to the design that is yet to be optimized.

2.6.1 Decision Structure

Currently, there are 2 dimensions, with 2 choices each, along which one can vary the structure of the optimal sample times in OptimalDesign. These are orthogonal dimensions giving rise to 4 different combinations.

2.6.1.1 Subject vs Population Decision

In OptimalDesign, a decision can be:

  • A subject decision, or
  • A population decision.

In the first case, a single subject is used and the sample times for this individual subject are optimized. In this case, the output optimal sample times will be a vector of sample times (i.e. Float64 numbers or DateTime objects).

In the second case, when a population of more than 1 subject is used, separate sample times will be optimized for each subject. In this case, the output optimal sample times will be a vector of vectors of sample times. The outer vector has the same length as the population and the inner vectors are the sample times per subject.

Currently, dividing subjects into groups where each group shares the sample times is not supported except in the trivial case where all the subjects in a group share the same covariates and dosing regimen. In this case, the subject_multiples keyword argument (discussed later) can be used to specify how many times (i.e. the group size) each unique subject should be counted when computing the expected FIM.

Note

Increasing subject_multiples for all of the subjects in a population by the same ratio will not change the optimal designs and will only change the expected FIM at the optimal design. However, changing subject_multiples by different ratios for different subjects will change the optimal design.

Note

Using the same subject 2 times in a Population means that there are 2 independent copies of this subject. Each copy will have its own optimal sample times, even though the 2 copies are identical in covariates and dosing regimen. The reason for this is that when sample times are sparse, it can be advantageous to observe different copies at different times to have good overall coverage of all parts of the dynamics when combining the subjects’ data together in the population analysis.

2.6.1.2 Pre-specified vs Optimized Number of Samples Per Time Window

Another dimension along which one can vary the decision structure is by either:

  • Pre-specifying the number of samples per time window per subject, or
  • Only specifying the total number of samples for all time windows per subject.

In the second case, the optimization algorithm determines how many samples should be in each time window to optimize the objective function. In this tutorial, only the first case is covered.

2.6.2 Floating Point Decisions

Once you define the time bounds, the next step is to define the decision. The decision function is used to specify the decision to be optimized.

Dictionaries

Julia has a built-in dictionary type called Dict. It is a collection of key-value pairs where each key is unique. For more information on dictionaries, see the Julia documentation and the Julia Syntax Pumas tutorial.

The following dictionary can be used to associate a number of samples (2) with each time window. Assuming the unit is “hour”, the time windows below refer to

  • Hour 0 to hour 16 of day 1, and
  • Hour 0 to hour 16 of day 2.
t0, tend = 0.0, 16.0
bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + 24 => 2)
Dict{ClosedInterval{Float64}, Int64} with 2 entries:
  0.0 .. 16.0  => 2
  24.0 .. 40.0 => 2

After defining the time bounds and number of samples per time window, a decision can be constructed for a single Subject using:

subject = population[1]
subj_dec = decision(
    model,
    subject,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);

or for a whole Population using:

pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);

The minimum_offset keyword argument can be optionally specified to enforce a minimum duration between any 2 samples. The default value of minimum_offset is 1e-4.

Some additional options which can be specified as keyword arguments include (a sketch combining a few of them follows this list):

  • rng: the random number generator used to generate a random initial set of sample times. The default value is Random.GLOBAL_RNG.
  • init: The initial sample times. The default value is nothing. If nothing is input, random sample times are used to initialize the optimization. The initial sample times for a subject are a vector of sample times; for a population, a vector of vectors of sample times. The type of sample times (Float64 or DateTime) should match that in the bounds.
  • subject_multiples: The number of times each subject is counted. This is a number if only a single subject is passed in the second argument, or a vector if a population is used. The default value is 1 for an individual subject or a vector of ones for a population.
  • time_multiples: a vector of integers used to replicate each sample multiple times. If not specified, each sample will be counted only once. This is useful when multiple sub-samples can be taken from each sample giving multiple measurements at the same sample time.
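Below is a sketch combining a few of these options with the earlier model, population, params, and bounds; the specific values are illustrative only.

using Random
dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
    rng = Random.MersenneTwister(123),                # reproducible random initialization
    subject_multiples = fill(2, length(population)),  # count each subject twice in the FIM
);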

2.6.3 DateTime Decisions

Since models in Pumas do not specify any time unit, when DateTime decisions are used, a model_time_unit must be passed as an additional keyword argument in order to specify the unit of time assumed in the model definition and dynamics, e.g. model_time_unit = Hour(1). In this case, minimum_offset should also be defined using a time period, e.g. Minute(15).

t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am
tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm
date_bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + Day(1) => 2)
date_pop_dec = decision(
    model,
    population,
    params;
    bounds = date_bounds,
    minimum_offset = Minute(15),
    model_time_unit = Hour(1),
);

2.7 Optimizing the Design

In OptimalDesign, a number of types of optimal design are supported. The optimality type can be specified using the keyword argument optimality which can take the following values:

  • :aoptimal: minimizes the trace of the inverse of the expected information matrix.
  • :doptimal: maximizes the (log) determinant of the expected information matrix. In the optimization, the negative of the log determinant is minimized instead.
  • :toptimal: maximizes the trace of the expected information matrix. In the optimization, the negative of the trace is minimized instead.

To perform the sample time optimization using the previously constructed decision pop_dec, run the following line, where :doptimal can be replaced with :aoptimal or :toptimal.

pop_result = design(pop_dec; optimality = :doptimal)
┌ Warning: Samples 3 and 4 of subject 23 closer than minimum offset of 0.25.
└ @ OptimalDesign ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/OptimalDesign/9srk5/src/optimaldesign/observationtimes.jl:920
[ Info: Looking for a feasible solution.
[ Info: Feasible solution found.
[ Info: Optimizing.

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit https://github.com/coin-or/Ipopt
******************************************************************************
Optimal schedule:
  Time points per subject: 4
  Subjects: 32
  Initial objective: -71.9777910601971
  Optimal objective: -77.08595941067784
  Size of FIM: (10, 10)

To perform the optimization for the subject decision subj_dec, run the following line instead. verbose is an option for the design function that can be set to true or false. If true, the optimization progress will be displayed. Its default value is false.

subj_result = design(subj_dec; optimality = :doptimal, verbose = true)
┌ Warning: Samples 1 and 2 of subject 1 closer than minimum offset of 0.25.
└ @ OptimalDesign ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/OptimalDesign/9srk5/src/optimaldesign/observationtimes.jl:920
[ Info: Looking for a feasible solution.
Running HiGHS 1.7.2 (git hash: 5ce7a2753): Copyright (c) 2024 HiGHS under MIT licence terms
Coefficient ranges:
  Matrix [1e+00, 1e+00]
  Cost   [1e+00, 1e+00]
  Bound  [1e-07, 4e+01]
  RHS    [2e-01, 3e+01]
Solving LP without presolve, or with basis, or unconstrained
Using EKK dual simplex solver - serial
  Iteration        Objective     Infeasibilities num(sum)
          0    -1.0659208859e-05 Pr: 1(0.189192) 0s
          2     1.8919186585e-01 Pr: 0(0) 0s
Model   status      : Optimal
Simplex   iterations: 2
Objective value     :  1.8919186585e-01
HiGHS run time      :          0.00
[ Info: Feasible solution found.
[ Info: Optimizing.
This is Ipopt version 3.14.16, running with linear solver MUMPS 5.7.3.

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        6
Number of nonzeros in Lagrangian Hessian.............:        0

Total number of variables............................:        4
                     variables with only lower bounds:        0
                variables with lower and upper bounds:        4
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        3
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        3

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0 -3.4871474e+01 1.89e-01 8.35e+00   0.0 0.00e+00    -  0.00e+00 0.00e+00   0
   1 -3.8182363e+01 0.00e+00 6.99e-01   0.8 5.98e+00    -  1.13e-01 1.00e+00f  1
   2 -3.8207289e+01 0.00e+00 4.94e-02  -0.8 1.22e-01    -  1.00e+00 1.00e+00f  1
   3 -3.8218060e+01 0.00e+00 5.55e-02  -2.7 7.45e-02    -  9.94e-01 1.00e+00f  1
   4 -3.8229749e+01 0.00e+00 5.94e-02  -4.0 7.34e-02    -  1.00e+00 1.00e+00f  1
   5 -3.8243154e+01 0.00e+00 6.83e-02  -5.6 7.55e-02    -  1.00e+00 1.00e+00f  1
   6 -3.8274064e+01 0.00e+00 8.88e-02  -6.8 1.84e-01    -  1.00e+00 1.00e+00f  1
   7 -3.8316691e+01 0.00e+00 1.13e-01  -8.2 1.84e-01    -  1.00e+00 1.00e+00f  1
   8 -3.8381138e+01 0.00e+00 1.46e-01  -9.8 2.14e-01    -  1.00e+00 1.00e+00f  1
   9 -3.8486158e+01 0.00e+00 1.95e-01 -11.0 2.71e-01    -  1.00e+00 1.00e+00f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  10 -3.8669911e+01 0.00e+00 2.64e-01 -11.0 3.56e-01    -  1.00e+00 1.00e+00f  1
  11 -3.9003243e+01 0.00e+00 3.45e-01 -11.0 4.81e-01    -  1.00e+00 1.00e+00f  1
  12 -3.9565703e+01 0.00e+00 3.91e-01 -11.0 6.44e-01    -  1.00e+00 1.00e+00f  1
  13 -4.0311297e+01 0.00e+00 4.29e-01 -11.0 8.34e-01    -  1.00e+00 1.00e+00f  1
  14 -4.1228214e+01 0.00e+00 1.44e-01  -9.0 2.80e+05    -  2.09e-05 4.74e-06f  2
  15 -4.1249214e+01 0.00e+00 9.14e-01 -11.0 1.84e+00    -  1.00e+00 1.00e+00f  1
  16 -4.1728801e+01 0.00e+00 5.07e-01 -11.0 1.22e+00    -  1.00e+00 1.00e+00f  1
  17 -4.1748855e+01 0.00e+00 5.08e-01 -11.0 1.43e+00    -  1.00e+00 8.28e-02f  1
  18 -4.1749073e+01 0.00e+00 5.08e-01 -11.0 9.06e+00    -  1.00e+00 1.19e-04f  1
  19 -4.1749088e+01 0.00e+00 4.89e-01 -11.0 3.65e+00    -  1.00e+00 1.82e-05f  1
iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
  20 -4.2409915e+01 0.00e+00 2.14e-01 -11.0 1.00e+01    -  1.00e+00 5.33e-01f  1
  21 -4.2412803e+01 0.00e+00 2.23e-01 -10.4 1.95e+01    -  1.00e+00 2.73e-03f  1
  22 -4.2412847e+01 0.00e+00 2.35e-01  -7.3 1.97e+01    -  5.85e-01 2.71e-05f  1
  23 -4.2413032e+01 0.00e+00 2.23e-01  -8.2 1.42e+00    -  1.00e+00 1.47e-03f  1
  24 -4.2426810e+01 0.00e+00 1.02e-01  -9.2 8.72e-02    -  1.00e+00 1.00e+00f  1
  25 -4.2428461e+01 0.00e+00 2.79e-02 -10.3 2.97e-02    -  1.00e+00 1.00e+00f  1
  26 -4.2428620e+01 0.00e+00 2.70e-03 -11.0 6.54e-03    -  1.00e+00 1.00e+00f  1
  27 -4.2428621e+01 0.00e+00 8.23e-05 -11.0 6.74e-04    -  1.00e+00 1.00e+00f  1
  28 -4.2428621e+01 0.00e+00 2.26e-07 -11.0 1.96e-05    -  1.00e+00 1.00e+00h  1

Number of Iterations....: 28

                                   (scaled)                 (unscaled)
Objective...............:  -4.2428621213201318e+01   -4.2428621213201318e+01
Dual infeasibility......:   2.2551172117557782e-07    2.2551172117557782e-07
Constraint violation....:   0.0000000000000000e+00    0.0000000000000000e+00
Variable bound violation:   3.9987164512922391e-07    3.9987164512922391e-07
Complementarity.........:   1.0000000000000001e-11    1.0000000000000001e-11
Overall NLP error.......:   2.2551172117557782e-07    2.2551172117557782e-07


Number of objective function evaluations             = 32
Number of objective gradient evaluations             = 29
Number of equality constraint evaluations            = 0
Number of inequality constraint evaluations          = 32
Number of equality constraint Jacobian evaluations   = 0
Number of inequality constraint Jacobian evaluations = 1
Number of Lagrangian Hessian evaluations             = 0
Total seconds in IPOPT                               = 28.985

EXIT: Optimal Solution Found.
Optimal schedule:
  Time points per subject: 4
  Subjects: 1
  Initial objective: -37.41445869156619
  Optimal objective: -42.42862121320132
  Size of FIM: (10, 10)

The following additional options can be specified using keyword arguments:

  • time_limit: time limit in seconds, default value is 1200 seconds (20 minutes).
  • nlp_tol: tolerance for the optimization solver, default value is 1e-5.
  • auto_initialize: auto-initializes the optimizer to a feasible design, default value is true.

The output of design can be queried in multiple ways using:

  • optimaltimes(pop_result): returns the optimal sample times for each subject.
  • initvalue(pop_result): returns the initial objective function value.
  • optimalvalue(pop_result): returns the optimal objective function value.
  • defficiency(pop_result): returns (det(FIM_2) / det(FIM_1))^(1/M) where FIM_2 is the Fisher information matrix (FIM) at the optimal design, FIM_1 is the FIM at the initial design, and M is the number of model parameters. This metric makes sense to use only when D-optimality is the optimality criterion.

Let’s query the optimal sample times for the population design pop_result:

optimaltimes(pop_result)
32-element Vector{Vector{Float64}}:
 [0.3810372576787533, 1.9603905991486463, 23.999999862240035, 40.00000039589277]
 [0.38103392783481577, 1.9604347073168455, 23.999999862240035, 40.000000395892776]
 [0.38103863739849214, 1.9603676164858244, 23.999999862240035, 40.00000039589277]
 [0.3810389909054834, 1.9603920061263924, 23.99999986224004, 40.00000039589277]
 [0.38103623611377, 1.960405475045552, 23.999999862240035, 40.00000039589277]
 [0.3810393854989022, 1.9603920420235714, 23.999999862240035, 40.00000039589277]
 [0.3810422953069568, 1.9603149931503359, 23.99999986224003, 40.00000039589277]
 [0.3810200560512011, 1.9606983921516647, 23.999999862240024, 40.000000395892805]
 [0.3810370777997231, 1.9604033147897997, 23.99999986224004, 40.00000039589277]
 [0.3810358147814479, 1.96052759813349, 23.999999862240028, 40.00000039589279]
 ⋮
 [0.3810550051125516, 1.960163053175076, 23.999999862240035, 40.000000395892755]
 [0.3810295663180821, 1.9604897342799048, 23.99999986224003, 40.00000039589278]
 [0.38103392401284003, 1.9604437228521738, 23.999999862240035, 40.000000395892776]
 [0.38103547218703127, 1.9604216137260408, 23.999999862240035, 40.00000039589277]
 [0.3810471305601967, 1.960561132431024, 23.999999862240056, 40.00000039589276]
 [0.38105573340711063, 1.9603524650984474, 23.999999862240035, 40.00000039589276]
 [0.38103722793294426, 1.9604103575635246, 23.999999862240035, 40.00000039589277]
 [0.3810399209138152, 1.960377566853783, 23.999999862240035, 40.00000039589277]
 [0.38106851837702943, 1.9601452821968544, 23.99999986224005, 40.00000039589274]

To get the first subject’s optimal times, use:

optimaltimes(pop_result)[1]
4-element Vector{Float64}:
  0.3810372576787533
  1.9603905991486463
 23.999999862240035
 40.00000039589277

Notice that there are 4 sample times per subject, where the first 2 sample times are correctly placed in the first time window and the last 2 are in the second time window.

Binning the Optimal Sample Times

In some cases, it may make sense to post-process the per-subject optimal designs by binning the optimal sample times. Binning the optimal sample times makes it easier to communicate the result of optimal design to the clinic. For example, we can have 2 bins: 9-10 am and 3-4 pm, and ask the clinic to collect 10 samples in each time bin from all subjects combined, where each subject is sampled once in each time bin. This makes it easier to implement a design that is close to the mathematically optimal design. Time bins can be visually selected by plotting a histogram of the optimal sample times for all subjects combined. To combine all the sample times for all subjects in a single sorted vector, you can use:

all_times = sort(reduce(vcat, optimaltimes(pop_result)))
128-element Vector{Float64}:
  0.3810200560512011
  0.3810295663180821
  0.38103392401284003
  0.38103392783481577
  0.38103547218703127
  0.3810358147814479
  0.38103623611377
  0.3810369441155763
  0.3810370777997231
  0.38103720029296567
  ⋮
 40.00000039589277
 40.00000039589277
 40.00000039589277
 40.00000039589277
 40.000000395892776
 40.000000395892776
 40.00000039589278
 40.00000039589279
 40.000000395892805

To plot a histogram, you can run:

using CairoMakie
fig = Figure()
maxT = all_times[end] + 1
ax = Axis(fig[1, 1]; xticks = 0.0:2.0:maxT)
hist!(ax, all_times)
fig
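After visually selecting bins from the histogram, one way to sanity-check them is to count the optimal sample times falling in each bin. A small sketch, where the bin edges are illustrative and would be read off the plot:

# illustrative bin edges (in hours) chosen by inspecting the histogram
bins = [(0.0, 1.0), (1.0, 3.0), (23.0, 25.0), (39.0, 41.0)]
counts = [count(t -> lo ≤ t < hi, all_times) for (lo, hi) in bins]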

Now let’s query the optimal objective value:

optimalvalue(pop_result)
-77.08595941067784

Finally, let’s query the design efficiency:

defficiency(pop_result)
1.6666520188678398

2.8 Modifying the Design and Evaluating the Objective Function Value

In some cases, one may want to modify the optimal design in a custom way to impose additional constraints besides the supported time windows’ constraint. In such cases, one can manually evaluate the objective function value using the following routine:

# copy the optimal design
new_pop_times = copy.(optimaltimes(pop_result))
# make a change to the design
new_pop_times[1][1] = new_pop_times[1][1] + 0.1
# create a new decision with the new design passed to the `init` keyword argument
new_pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    init = new_pop_times,
);
# evaluate the objective function value at `new_pop_times`
new_pop_result =
    design(new_pop_dec; optimality = :doptimal, auto_initialize = false, time_limit = 1e-8)
# get the initial objective function value
initvalue(new_pop_result)
[ Info: Optimizing.
-77.08522472865653

Notice how the time limit was set to a very small value to force termination at iteration 0.

2.9 Optimal Cramer-Rao Lower Bound on the Standard Errors

One benefit of the optimization process is the ability to assess the expected standard errors of the model parameters. As discussed above, the Cramer-Rao bound gives lower bounds on the standard errors via the FIM evaluated at the optimal sample times. To get the expected FIM at the optimal sample times, you can use:

optimalfim = pop_result.optimalfim
10×10 Symmetric{Float64, Matrix{Float64}}:
 23058.9       -55.9287     -9.47757  …   127.252      6.51013      40.6359
   -55.9287     28.3817     -8.03798       -0.25446   -6.69217      -5.53359
    -9.47757    -8.03798  4495.32         883.997     55.8224       39.7138
   856.559     -71.3421     -4.10585        0.468732   0.0166898    73.0627
 -2648.71       42.6282     64.2718        47.6952    -0.141467   -272.549
    83.2394     -2.06136  -223.14     …    23.8679     1.70977     -19.3797
 -1133.02     -163.722    -100.832       -210.558      0.299778    585.323
   127.252      -0.25446   883.997       2545.39      -7.24621      82.9182
     6.51013    -6.69217    55.8224        -7.24621   43.7887        3.77636
    40.6359     -5.53359    39.7138        82.9182     3.77636    6418.76

The expected FIM is a symmetric matrix so the returned pop_result.optimalfim above is of type Symmetric, a matrix type in Julia.

The order of parameters here is the same as the order they appear in the model.

To get the Cramer-Rao (CR) lower bound on the parameters’ standard errors given the optimal design, you can use:

sqrt.(diag(inv(optimalfim)))
10-element Vector{Float64}:
 0.006681912606368064
 0.20060343386977755
 0.015738918225125218
 0.018451796796853603
 0.006872280239217231
 0.03702172496321876
 0.00528903929262707
 0.020609696143178924
 0.15568962237057749
 0.01249999999981819

One can further divide the CR lower bounds by the parameter values to get the CR lower bounds on the parameters’ relative standard errors given the optimal design.

3 Conclusion

In this tutorial, we introduced the concept of optimal design in the context of pharmacometrics and statistics. We discussed the Fisher Information Matrix and its role in evaluating the efficiency of experimental designs, as well as the different criteria for optimality. We also demonstrated how to perform model-based optimal design of experiments in Pumas using the OptimalDesign package.

4 Appendix

In this section, the proof of the loose generalization of the scalar Cramer-Rao lower bound for multiple parameters is presented. Shortening \(L(\theta; x, d, y)\) to \(L(\theta; y)\), the Cauchy-Schwarz inequality for random variables is given by: \[ \begin{aligned} \text{Cov}_y \Bigg[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]^2 & \leq \text{Var}_y[w^T \cdot \hat{\theta}] \cdot \text{Var}_y\Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] \\ \text{Var}_y[w^T \cdot \hat{\theta}] & \geq \frac{\text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]^2}{\text{Var}_y\Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]} \end{aligned} \]

The following identities will be used in the remaining steps of the proof:

  1. \(E_y[\hat{\theta}] = \theta\) because \(\hat{\theta}\) is assumed to be an unbiased estimator of \(\theta\).
  2. \(E_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\) which implies that \(E_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\).
  3. \(\text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = E_y \Big[ -\frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T } \Big]\)

The last 2 identities are well known results in statistics about the mean and covariance of the score function \(\frac{\partial \log L(\theta; y)}{\partial \theta}\).

Breaking down the numerator on the RHS of the inequality without the square: \[ \begin{multlined} \text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = \\ E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] - E_y[w^T \cdot \hat{\theta}] \cdot \overbrace{E_y \Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]}^{0} \end{multlined} \]

The expectation term can be written as an integral: \[ \begin{aligned} E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \cdot L(\theta; y) dy \\ & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial L(\theta; y)}{\partial \theta} dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} \int w^T \cdot \hat{\theta} \cdot L(\theta; y) dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} E_y[w^T \cdot \hat{\theta}] \\ & = w^T \cdot \frac{\partial (w^T \theta)}{\partial \theta} \\ & = w^T w \end{aligned} \]

The denominator can be written as \[ \begin{aligned} \text{Var}_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] & = w^T \cdot \text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] \cdot w \\ & = w^T \cdot E_y\Big[ - \frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T} \Big] \cdot w \\ & = w^T \cdot I(\theta) \cdot w \end{aligned} \]

Re-writing the Cauchy-Schwarz inequality for random variables using the simplified expressions:

\[ \begin{aligned} \text{Var}_y [ w^T \cdot \hat{\theta} ] & = w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{(w^T \cdot w)^2}{w^T \cdot I(\theta) \cdot w} \\ & \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \end{aligned} \]

This completes the proof.