`using OptimalDesign`

# Introduction to Optimal Design

**Optimal design**, particularly within the context of pharmacometrics and statistics, is about strategically planning experiments or studies to get the most informative data possible, using the least resources. This approach is especially important in fields like drug development, where experiments can be costly and time-consuming.

## 1 Key Principles

**Objective**: Before you can design an optimal experiment, you need to be clear about what you’re trying to achieve. Which parameters are you trying to estimate with good precision?

**Decision variables**: These are the variables that you can control in your experiment to achieve your objective, e.g. sample times, group sizes or dose amounts. Your objective and decision variables will guide your design.

**Constraints**: These are the limits within which you must operate, such as number of samples, time windows or ethical considerations. An optimal design finds the best solution within these constraints.

### 1.1 Expected Fisher Information

The most common criterion for optimality when designing experiments uses a concept known as the expected Fisher information. Consider a single subject study with experimental design \(d\) and covariates \(x\). The design \(d\) includes the controllable covariates of the subject, e.g. dose amount and sample times. The covariates \(x\) include the uncontrollable covariates, assumed fixed, e.g. age and sex.

Additionally, consider a model with a single parameter \(\theta\) and a marginal likelihood function \(L(\theta; x, d, y)\) where \(y\) is the unobserved response of the subject. The expected Fisher information \(I(\theta)\) is given by the expected value of the second derivative of \(-\log L\), where the differentiation is with respect to the parameter \(\theta\) and the expectation is with respect to the response \(y\): \[ I(\theta; x, d) = E_y \Bigg[ -\frac{\partial ^2 \log L(\theta; x, d, y)}{\partial \theta^2} \Bigg] \]

This quantity gives a theoretical measure of information that your experiment gives you about your parameter \(\theta\).

A design \(d\) that maximizes \(I(\theta; x, d)\) for given values of \((\theta, x)\) while satisfying all the design constraints (if any) is the optimal design according to this optimality criterion.

When \(\theta\) is made up of \(M > 1\) parameters instead of a single parameter, the expected Fisher information becomes an \(M \times M\) matrix: \[
I(\theta; x, d) = E_y \Bigg[ -\frac{\partial^2 \log L(\theta; x, d, y)}{\partial \theta \cdot \partial \theta^T} \Bigg]
\] This matrix is known as the expected **Fisher information matrix** (FIM). When comparing 2 designs \(d_1\) and \(d_2\) given the above matrix, there are multiple ways to *summarize* the matrix using a single scalar objective function value. These summarization methods are commonly known as *optimality criteria* and will be discussed below.

When considering \(N\) subjects with covariates \(x = \{ x_i: i = 1 \dots N \}\), the Fisher information is additive with respect to the subjects: \[ I(\theta; x, d) = \sum_{i=1}^{N} I(\theta; x_i, d) \]

The interpretation of the expected FIM is as follows:

- **Diagonal Element (\(I_{ii}\))**: This represents the information about the parameter \(\theta_i\), quantifying how much information the data provides about this parameter. It reflects the sensitivity of the likelihood function to changes in this parameter.
- **Off-Diagonal Element (\(I_{ij}\) where \(i \neq j\))**: This element represents the interaction term between the estimates of parameters \(\theta_i\) and \(\theta_j\), indicating how changes in one parameter estimate might affect the estimate of another. If these off-diagonal elements are non-zero, it suggests that there is some level of correlation between the parameters in how they are informed by the data.

#### 1.1.1 Example

To give a concrete example, if you’re estimating two parameters, say \(\mu\) (mean) and \(\sigma^2\) (variance), from a normal distribution:

- A diagonal element, such as \(I_{\mu\mu}\), would quantify how sensitive the likelihood is to changes in \(\mu\), directly relating to the precision with which \(\mu\) can be estimated from the data.
- An off-diagonal element, like \(I_{\mu\sigma^2}\), would indicate how the estimate of \(\mu\) might change with changes in \(\sigma^2\) and vice versa, revealing any interdependence in their estimation from the data.

Here’s how this particular FIM would be represented:

\[ I(\theta) = \begin{bmatrix} I_{\mu\mu} & I_{\mu\sigma^2} \\ I_{\sigma^2\mu} & I_{\sigma^2\sigma^2} \end{bmatrix} \]

Where:

- \(I_{\mu \mu}\) is the information about the mean, reflecting how much the data tells us about \(\mu\).
- \(I_{\sigma^2 \sigma^2}\) is the information about the variance, reflecting how much the data informs us about \(\sigma^2\).
- \(I_{\mu \sigma^2}\) and \(I_{\sigma^2 \mu}\) represent the interaction between the estimates of \(\mu\) and \(\sigma^2\), indicating the degree to which the estimation of one parameter is influenced by the estimation of the other.

In a typical setting for a Normal distribution with maximum likelihood estimation, the Fisher Information elements for \(\mu\) and \(\sigma^2\) could be defined as follows, assuming a sample size of \(n\) and that the parameters \(\mu\) and \(\sigma^2\) are estimated independently:

- \(I_{\mu\mu} = \frac{n}{\sigma^2}\)
- \(I_{\sigma^2\sigma^2} = \frac{n}{2\sigma^4}\)

The off-diagonal elements, \(I_{\mu\sigma^2}\) and \(I_{\sigma^2\mu}\), are zero, reflecting the independence of the estimates of \(\mu\) and \(\sigma^2\) under the assumption of Normal distribution and independent observations.
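To make the formulas above concrete, here is a minimal, self-contained Julia sketch using only the standard library. The helper names `negloglik` and `fd_hessian` are ours for illustration, not part of Pumas or `OptimalDesign`. It estimates the expected FIM by averaging a finite-difference Hessian of \(-\log L\) over simulated datasets and compares it against the analytic \(n / \sigma^2\) and \(n / (2\sigma^4)\):

```julia
using Random, Statistics

# Negative log-likelihood of n iid Normal observations y, with θ = [μ, σ²]
negloglik(θ, y) = length(y) / 2 * log(2π * θ[2]) + sum(abs2, y .- θ[1]) / (2 * θ[2])

# Central finite-difference Hessian of f at θ
function fd_hessian(f, θ; h = 1e-4)
    M = length(θ)
    H = zeros(M, M)
    pert(i, si, j, sj) = (v = float.(θ); v[i] += si * h; v[j] += sj * h; v)
    for i in 1:M, j in 1:M
        H[i, j] = (f(pert(i, 1, j, 1)) - f(pert(i, 1, j, -1)) -
                   f(pert(i, -1, j, 1)) + f(pert(i, -1, j, -1))) / (4 * h^2)
    end
    return H
end

Random.seed!(1234)
μ, s2, n, reps = 1.0, 4.0, 50, 5_000

# Monte Carlo estimate of the expected FIM: average the Hessian of -log L
# over simulated datasets y ~ Normal(μ, √s2)
I_hat = mean(1:reps) do _
    y = μ .+ sqrt(s2) .* randn(n)
    fd_hessian(θ -> negloglik(θ, y), [μ, s2])
end

I_analytic = [n/s2 0.0; 0.0 n/(2 * s2^2)]
```

With \(n = 50\) and \(\sigma^2 = 4\), the Monte Carlo average should be close to the analytic diagonal \((12.5, 1.5625)\) with near-zero off-diagonal entries.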

#### 1.1.2 Random Effects and First Order Approximation

To compute the expected FIM, one needs the Hessian of \(-\log L\) with respect to the parameters \(\theta\) where \(L\) is the *marginal* likelihood. In nonlinear mixed effects (NLME) models, to compute \(\log L\), one would need to marginalize the random effects first. In Pumas, the first order (FO) approximation method is used to marginalize the random effects and approximate the marginal likelihood. The parameters \(\theta\) in optimal design therefore only refers to the population-level parameters.
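As a sketch of what the FO approximation does (a standard presentation; the exact implementation details in Pumas may differ), write the conditional model as \(y = f(\theta, \eta) + \varepsilon\) with \(\eta \sim \mathcal{N}(0, \Omega)\) and \(\varepsilon \sim \mathcal{N}(0, \Sigma)\). Linearizing \(f\) around \(\eta = 0\) gives: \[ y \approx f(\theta, 0) + J \cdot \eta + \varepsilon, \qquad J = \frac{\partial f(\theta, \eta)}{\partial \eta} \Bigg|_{\eta = 0} \] so the marginal distribution of \(y\) is approximately \(\mathcal{N}(f(\theta, 0), \, J \Omega J^T + \Sigma)\), and this Gaussian approximation of the marginal likelihood is what enters the expected FIM computation.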

#### 1.1.3 Cramer-Rao Lower Bound for a Single Parameter

The Cramer-Rao (CR) lower bound further highlights the importance of the expected Fisher information. The CR lower bound for single-parameter models is: \[ \text{Var}_y [ \hat{\theta}(x, d, y) ] \geq \frac{1}{I(\theta; x, d)} \] where \(\text{Var}_y[\hat{\theta}(x, d, y)]\) is the variance of the unbiased maximum likelihood estimator \(\hat{\theta}(x, d, y)\) assuming the true parameter value is \(\theta\). Moving forward, \(\hat{\theta}(x, d, y)\) will be shortened to \(\hat{\theta}\), and \(I(\theta; x, d)\) will be shortened to \(I(\theta)\).

The above expression for the CR lower bound is only correct for scalar \(\theta\). You cannot divide by a matrix!

Choosing an experimental design \(d\) that minimizes \(1 / I(\theta)\) (i.e., maximizes \(I(\theta)\)), only minimizes a lower bound on the variance \(\text{Var}_y [ \hat{\theta} ]\), not the variance itself. This does not guarantee that \(\text{Var}_y [ \hat{\theta} ]\) is itself minimized because in theory the lower bound can be loose. However in practice, minimizing the lower bound \(1 / I(\theta)\) tends to be a good surrogate objective for minimizing \(\text{Var}_y [ \hat{\theta} ]\), which is the quantity we actually care about minimizing.

Given that in NLME models, the marginal likelihood (and by extension \(I(\theta)\)) is approximated using a first order approximation, the computed \(1 / I(\theta)\) is only an approximation of the actual lower bound. While this approximate lower bound may be a good surrogate when optimizing the experiment design, **using simulation to evaluate the final design is usually recommended**. In addition, running a simulation to evaluate the design can be useful to confirm that the model will be estimable using the software intended for the eventual data analysis with the future experimental data.

#### 1.1.4 Cramer-Rao Lower Bound for a Vector of Parameters

**One generalization** of the scalar CR lower bound for multiple parameters, i.e. vector \(\theta\), is: \[
\frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w}
\] which is true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\). \(M\) is the number of parameters in \(\theta\). \(\text{Var}_y[\hat{\theta}]\) is the \(M \times M\) covariance matrix of the unbiased maximum likelihood estimate \(\hat{\theta}\). The proof of this lower bound is shown in Section 4.

When restricting \(w\) so that \(w^T \cdot w = 1\), i.e. \(w\) is a unit vector, the generalized CR lower bound simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \] When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we recover the scalar CR lower bound: \[ \text{Var}_y [ \hat{\theta}_i ] \geq \frac{1}{I(\theta)_{i,i}} \]

**Another, better and more commonly used generalization** of the scalar CR lower bound, originally presented by Rao in 1945 and later shown to be a special case of the Hammersley-Chapman-Robbins lower bound, is: \[
\frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w}
\] This is also true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\).

Note that the division by \(w^T \cdot w\) on both sides is not necessary but is done for consistency with the previous lower bound.

This improved lower bound can also be equivalently written as: \[ \text{Var}_y[\hat{\theta}] - I(\theta)^{-1} \succeq 0 \] where \(A \succeq 0\) means that \(A\) is a positive semi-definite matrix.

The second generalization is a tighter lower bound when \(\theta\) is a vector because: \[ \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] for all \(w\), assuming \(w^T \cdot w > 0\), which follows from the Cauchy-Schwarz inequality: \[ (u^T \cdot v)^2 \leq (u^T \cdot u) \cdot (v^T \cdot v) \] by substituting: \[ \begin{aligned} u & = I(\theta)^{1/2} \cdot w \\ v & = I(\theta)^{-1/2} \cdot w \end{aligned} \] where \(I(\theta)^{1/2}\) is the symmetric positive definite square root of \(I(\theta)\) and \(I(\theta)^{-1/2}\) is its inverse.

The relationship between the 2 bounds is therefore given by: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] If \(w^T \cdot w\) is assumed to be 1, this simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq w^T \cdot I(\theta)^{-1} \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \]

When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we get: \[ \text{Var}_y[\hat{\theta}_i] \geq (I(\theta)^{-1})_{i,i} \geq \frac{1}{I(\theta)_{i,i}} \]

If \(\theta\) is a scalar, or more generally if \(w\) is an eigenvector of \(I(\theta)\), the 2 lower bounds are equal.
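The ordering of the two generalizations can be checked numerically. The following sketch uses plain Julia with the standard `LinearAlgebra` library; `I_fim` is an arbitrary positive definite stand-in for an expected FIM, not output from `OptimalDesign`. It evaluates both bounds for a random weight vector and verifies that they coincide when \(w\) is an eigenvector:

```julia
using LinearAlgebra, Random

Random.seed!(42)

# A random symmetric positive definite matrix standing in for an expected FIM
A = randn(4, 4)
I_fim = A' * A + 4I          # the `+ 4I` shift guarantees positive definiteness

w = randn(4)

loose_bound = (w' * w) / (w' * I_fim * w)        # first generalization
tight_bound = (w' * inv(I_fim) * w) / (w' * w)   # Rao's generalization

# For an eigenvector of the FIM, the two bounds coincide (both equal 1/λ)
v = eigen(Symmetric(I_fim)).vectors[:, 1]
loose_at_v = (v' * v) / (v' * I_fim * v)
tight_at_v = (v' * inv(I_fim) * v) / (v' * v)
```

For a generic \(w\), `tight_bound` strictly dominates `loose_bound`, consistent with the Cauchy-Schwarz argument above.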

### 1.2 Criteria for Optimality

When optimizing the design to maximize a measure of information gain for multi-parameter models, there are multiple choices of scalar objective functions, which are functions of the expected FIM. All of the scalar objective functions are aggregate functions of one of the 2 lower bounds presented above, used as surrogate objectives for minimizing the variances of the parameters’ estimates. Each optimality criterion has a different name. Let’s explore some of the most commonly used criteria.

To define any of these objective functions, the expected FIM, \(I(\theta)\), is typically only computed for a user-provided value of \(\theta = \theta_0\) (taken from literature), which is assumed to be the true value of \(\theta\). This approach is often called *local optimal design*.

#### 1.2.1 D-Optimality

D-Optimal design *maximizes* the (log) determinant of the expected FIM. By doing so, a D-optimal design aims to minimize the volume of the confidence ellipsoid around the estimated parameters.

In Pumas, the **negative** of the log determinant of the expected FIM is *minimized*, which is equivalent to *maximizing* the determinant of the expected FIM.

**Application**: Ideal for experiments where the goal is to estimate parameters and all their weighted combinations with maximum average precision in the presence of correlations.

#### 1.2.2 T-Optimality

T-Optimality *maximizes* the trace (the sum of the diagonal elements) of the expected FIM, \(\sum_i I_{i,i}\). This criterion indirectly *minimizes* the average \(\text{Var}_y [ \hat{\theta}_i ]\) by *maximizing* the sum of the denominators of the loose CR lower bounds, which indirectly *minimizes* the loose CR lower bounds, \(1 / I_{i,i}\).

In Pumas, the *negative* of the trace of the expected FIM is *minimized*, which is equivalent to *maximizing* the trace.

**Application**: Useful when the objective is to minimize the average variance across all parameter estimates or when we would like to use different weights on parameters by incorporating weights in the objective.

#### 1.2.3 A-Optimality

A-Optimality *minimizes* the trace (the sum of the diagonal elements) of the inverse of the expected FIM, \(\sum_i (I^{-1})_{i,i}\). Similar to T-optimality, this criterion indirectly *minimizes* the average \(\text{Var}_y [ \hat{\theta}_i ]\) by *minimizing* the sum of the tighter lower bounds, \((I^{-1})_{i,i}\).

**Application**: Similar to T-optimality.
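The three criteria reduce to simple matrix computations. As a hypothetical illustration (the matrix below is made up for demonstration; Pumas computes the expected FIM for you), the objective function values minimized under D-, T-, and A-optimality can be evaluated with Julia's standard `LinearAlgebra` library:

```julia
using LinearAlgebra

# A hypothetical 3 × 3 expected FIM (illustrative numbers only)
I_fim = [4.0 0.5 0.0;
         0.5 2.0 0.3;
         0.0 0.3 1.0]

d_objective = -logdet(I_fim)   # D-optimality: minimize the negative log determinant
t_objective = -tr(I_fim)       # T-optimality: minimize the negative trace
a_objective = tr(inv(I_fim))   # A-optimality: minimize the trace of the inverse
```

Lower values of each objective correspond to more informative designs under the respective criterion.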

#### 1.2.4 Using Weights and Parameter Scales

Both T-Optimality and A-Optimality can be extended using weights on parameters. In T-Optimality, the weighted objective to be *maximized* is \(\sum_i p_i \cdot I_{i,i}\), where \(p_i\) is the weight of parameter \(\theta_i\). In A-Optimality, the weighted objective to be *minimized* is \(\sum_i p_i \cdot (I^{-1})_{i,i}\).

Using weights can help us achieve any of the following goals:

- Weights allow us to vary the level of importance of different parameters. For example, we typically care more about the clearance parameter than the inter-individual variability (IIV) parameters. More important parameters should get higher weights than less important parameters. Parameters which we do not care about can be given a weight of 0.
- Weights can also be used to make the objective function value invariant to the scales of the parameters, \(\hat{\theta}\). Some parameters may be in the 100s while other parameters may be in the 0.1s. The optimal design objective function values are sensitive to the scales of the parameters.
- The product of importance weights and scaling weights can be used to achieve both goals 1 and 2 simultaneously.

**In T-Optimality**, to make the objective function invariant to the parameters’ scales, one can use the weights \(p_i = \theta_i^2\). The justification is shown below.

Using the loose CR lower bound and dividing both sides by \(\theta_i^2\), we get: \[
\begin{aligned}
\text{Var}_y [ \hat{\theta}_i ] & \geq \frac{1}{I(\theta)_{i,i}} \\
\frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{1}{\theta_i^2 \cdot I(\theta)_{i,i}}
\end{aligned}
\] If \(\hat{\theta} \approx \theta\), \(\text{Var}_y [ \hat{\theta}_i ] / \theta_i^2\) will be approximately equal to the square of the relative standard error estimate of \(\hat{\theta}_i\), which is invariant to the parameters’ scales. T-Optimality tries to indirectly *minimize* the LHS of the inequality by *maximizing* the sum of the denominators of the RHS terms for all \(i\): \[ \sum_i \theta_i^2 \cdot I(\theta)_{i,i} \] Setting \(p_i = \theta_i^2\) in T-Optimality therefore makes the objective function invariant to the parameters’ scales.

**In A-Optimality**, weights \(p_i = 1 / \theta_i^2\) can be used to achieve invariance to the parameters’ scales. This also follows from the tight CR lower bound and assuming \(\hat{\theta} \approx \theta\). \[
\begin{aligned}
\text{Var}_y [ \hat{\theta}_i ] & \geq (I(\theta)^{-1})_{i,i} \\
\frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{(I(\theta)^{-1})_{i,i}}{\theta_i^2}
\end{aligned}
\]

Using weights to make the objective function value invariant to the parameters’ scales assumes that none of the parameters is exactly 0. For parameters with value 0 or very close to 0, using the above weighting scheme can give spurious results.
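The claimed invariance can be verified numerically. In the sketch below (plain Julia; `I_fim` and the scaling vector `c` are made-up illustrative values), rescaling the parameters by \(c\) transforms the FIM as \(D^{-1} I D^{-1}\) with \(D = \operatorname{diag}(c)\), and the weighted T- and A-objectives come out unchanged:

```julia
using LinearAlgebra

θ = [0.15, 8.0, 1.0]           # parameters on very different scales
I_fim = [40.0 1.0 0.5;
         1.0 0.2 0.1;
         0.5 0.1 2.0]          # hypothetical expected FIM (illustrative only)

# Rescale the parameters by c; the FIM transforms as D⁻¹ I D⁻¹ with D = Diagonal(c)
c = [100.0, 0.5, 10.0]
D = Diagonal(c)
θ_scaled = c .* θ
I_scaled = inv(D) * I_fim * inv(D)

# Weighted objectives: T with pᵢ = θᵢ², A with pᵢ = 1/θᵢ²
t_weighted(θ, I) = sum(θ[i]^2 * I[i, i] for i in eachindex(θ))
a_weighted(θ, I) = (Iinv = inv(I); sum(Iinv[i, i] / θ[i]^2 for i in eachindex(θ)))
```

Evaluating `t_weighted(θ, I_fim)` against `t_weighted(θ_scaled, I_scaled)` (and similarly for `a_weighted`) shows the two agree, confirming the scale invariance argued above.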

## 2 Example in Pumas

Let’s dive into an example using the `OptimalDesign` package in Pumas.

### 2.1 Loading `OptimalDesign`

The first step is to load the `OptimalDesign` package using:

```
using OptimalDesign
```

By loading `OptimalDesign`, `Pumas` will be automatically loaded as well. If you are doing an optimal design workflow, you don’t need to load `Pumas` separately.

### 2.2 Model Definition

After loading `OptimalDesign`

, you should define a `Pumas`

model.

```
model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @random begin
        η ~ MvNormal(Ω)
    end
    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end
    @dynamics Depots1Central1
    @derived begin
        dv ~ @. Normal(Central / Vc, σ)
    end
end
```

```
PumasModel
Parameters: tvcl, tvvc, tvka, Ω, σ
Random effects: η
Covariates:
Dynamical system variables: Depot, Central
Dynamical system type: Closed form
Derived: dv
Observed: dv
```

As you can see, this is a simple one-compartment model with first-order absorption.

### 2.3 Define a `Subject`/`Population` for Optimal Design

To perform optimal design, one needs to specify the subject(s) for which optimal sample times will be identified. A `Subject` in `Pumas` typically includes:

- Covariates, possibly time-varying
- Dosing regimen
- Sample times
- Observation names, e.g. `dv`
- Observation values

All of the above information is required when using a `Subject` for fitting a model. However, for the purpose of optimal design, while the same `Subject` data structure is used, some of its fields become optional and are ignored if present. For example, sample times are optimized in optimal design, so they are not required as input. Observation values are also assumed to be missing in optimal design since presumably no experiment has been conducted yet.

There are 2 ways to define a `Subject` or `Population` for optimal design, discussed below.

#### 2.3.1 Manual Definition

The following is an example where a `Subject` is defined manually using the `Subject` constructor.

`subject = Subject(; events = DosageRegimen(100), observations = (; dv = nothing))`

```
Subject
ID: 1
Events: 1
Observations: dv: (n=0)
```

To define a population manually, you can use the `map` function to define multiple subjects programmatically. The following defines a population of 10 identical subjects with IDs going from 1 to 10.

```
population = map(
    i -> Subject(; id = i, events = DosageRegimen(100), observations = (; dv = nothing)),
    1:10,
)
```

```
Population
Subjects: 10
Observations: dv
```

or define a vector of subjects manually:

```
population = [
    Subject(; id = 1, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 2, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 3, events = DosageRegimen(100), observations = (; dv = nothing)),
]
```

```
Population
Subjects: 3
Observations: dv
```

Note that you can pass any covariate to the `Subject` constructor using the `covariates` keyword argument, e.g. `covariates = (; AGE = 30)`.

Notice how the sample times were not specified in the above `Subject` constructors. Also notice that the observation values were set to `nothing`. However, it is important to use the correct observation name, `dv`, when constructing the subject to be consistent with the model, matching the observed variables’ names in the `@derived` block.

If an observation name is missing from the `observations` keyword argument but present in the model, sample times for this response variable will not be optimized. This has applications when we require a separate design for PK sample times vs PD sample times for each subject. This will be presented in a future tutorial.

#### 2.3.2 Read from a `DataFrame`

Besides the manual definition, one can also conveniently define a `Population` from a `DataFrame` using `read_pumas`. The `DataFrame` may be one of the standard datasets available in the `PharmaDatasets` package, or may be read from a CSV file.

When reading subjects’ data from a `DataFrame` using `read_pumas` and then doing optimal design, only the observation *names*, dosing information and covariates are taken into account. The observation *values* and sample time points are ignored. If time-varying covariates are present in the data, the time points in the data will be used to define the time-varying covariates only, but new sample times will be selected for the optimal design.

To give an example, we’ll be using the Warfarin PK dataset from the `PharmaDatasets` package.

```
using PharmaDatasets
using DataFrames
using DataFramesMeta
df = dataset("warfarin.csv")
first(df, 5)
```

| Row | id | time | dv | amt | evid | cmt |
|---|---|---|---|---|---|---|
|  | Int64 | Float64 | Float64? | Int64? | Int64 | Int64 |
| 1 | 1 | 0.0 | missing | 70 | 1 | 1 |
| 2 | 1 | 0.5 | 9.9973 | missing | 0 | 1 |
| 3 | 1 | 1.0 | 7.3587 | missing | 0 | 1 |
| 4 | 1 | 2.0 | 7.4868 | missing | 0 | 1 |
| 5 | 1 | 6.0 | 9.6466 | missing | 0 | 1 |

The dataset contains the following columns:

- `:id`: the subject ID.
- `:time`: the time of the observation.
- `:dv`: the observed drug concentration.
- `:amt`: the dose administered.
- `:evid`: the event ID, indicating whether the observation is a dose or a concentration measurement.
- `:cmt`: the compartment ID, indicating the compartment where the dose was administered.

Let’s check the summary statistics of the dataset:

`describe(df, :mean, :std, :min, :max, :nuniqueall)`

| Row | variable | mean | std | min | max | nuniqueall |
|---|---|---|---|---|---|---|
|  | Symbol | Float64 | Float64 | Real | Real | Int64 |
| 1 | id | 16.5 | 9.24916 | 1 | 32 | 32 |
| 2 | time | 29.0556 | 39.3667 | 0.0 | 120.0 | 9 |
| 3 | dv | 4.68351 | 2.63184 | 0.28399 | 12.184 | 256 |
| 4 | amt | 70.0 | 0.0 | 70 | 70 | 1 |
| 5 | evid | 0.111111 | 0.314817 | 0 | 1 | 2 |
| 6 | cmt | 1.0 | 0.0 | 1 | 1 | 1 |

We have 32 subjects each with 8 measurements and 1 dose event.

```
@chain df begin
    @rsubset :id == 1
    @by :evid :nrow = length(:time)
end
```

| Row | evid | nrow |
|---|---|---|
|  | Int64 | Int64 |
| 1 | 0 | 8 |
| 2 | 1 | 1 |

With the data loaded, we can now proceed to parse it into a `Population`:

```
population = read_pumas(
    df;
    id = :id,
    time = :time,
    observations = [:dv],
    amt = :amt,
    evid = :evid,
    cmt = :cmt,
)
```

```
Population
Subjects: 32
Observations: dv
```

### 2.4 Specify the Population Parameters

Currently, only *local optimal design* is available, so the user must specify the values of the population parameters to use. These parameter values will be assumed to be the true parameter values when computing the expected FIM.

Population parameters are the parameters of the model that are shared across the population. These include the so-called typical values of the parameters, e.g. `tvcl`, `tvvc`, and `tvka`, along with the IIV parameter `Ω` and the residual variability parameter `σ` in the Warfarin model.

```
params = (;
    tvcl = 0.15,
    tvvc = 8,
    tvka = 1,
    Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
    σ = sqrt(0.01),
)
```

```
(tvcl = 0.15,
tvvc = 8,
tvka = 1,
Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
σ = 0.1,)
```

### 2.5 Time Window Specification

In `OptimalDesign`, one needs to specify the time windows in which optimal samples will be chosen. These time windows should represent clinical hours to increase the probability that the design found by `OptimalDesign` is actually feasible to implement in practice.

In `OptimalDesign`, users can express time intervals as regular floating point number bounds, i.e. `Float64`. This would be typical for designing future clinical trials. Alternatively, for convenience in managing clinical scheduling in a hospital or clinic setting, the `Dates` standard Julia package is available to easily specify the time bounds for optimization using specific dates and clock times via the `DateTime` type. If the time bounds are passed in as `Float64`, the optimal design’s times will be in `Float64` format. Alternatively, if the time bounds are passed in using `DateTime`, the optimal design’s times will be in `DateTime` format.

#### 2.5.1 Floating Point Time Windows

Let us define the start and end times of a time window via `t0` and `tend`:

`t0, tend = 0.0, 16.0`

`(0.0, 16.0)`

In Julia, you can define multiple variables in the same line of code where the variable names and values are separated by commas. The following 2 code snippets are equivalent:

`t0, tend = 0.0, 16.0`

```
t0 = 0.0
tend = 16.0
```

The time unit assumed for `t0` and `tend` matches the time unit assumed in the model. Hence, if your model assumes “hour” is the time unit, then the time bounds above represent hour 0 and hour 16.

To define time bounds, `OptimalDesign` makes use of intervals in Julia. One can define a time window that starts at `t0` and ends at `tend` using:

`t0 .. tend`

`0.0 .. 16.0`

Arithmetic on intervals is also supported to increment the whole interval by a fixed number. However, parentheses must be used carefully. The following is the correct syntax to increment the start and finish bounds by 1 day, assuming the model’s unit is “hour”:

`(t0 .. tend) + 24`

`24.0 .. 40.0`

The following syntax parses differently, incrementing only the finish bound by a day:

`t0 .. tend + 24`

`0.0 .. 40.0`

#### 2.5.2 `DateTime` Time Windows

We’ll now cover Julia’s `DateTime` API and how to use it in `OptimalDesign`. Let us define the start and end times of a time window via `t0` and `tend`:

`t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am`

`2024-01-25T09:00:00`

`tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm`

`2024-01-25T21:00:00`

The `DateTime` constructor’s inputs are of the form (`Year`, `Month`, `Day`, `Hour`, `Minute`, `Second`, `Millisecond`). Dropping arguments from the end assigns their values to `0`, e.g. the `Minute`, `Second`, and `Millisecond` values in `t0` and `tend` above are `0`.

##### 2.5.2.1 `DateTime` Arithmetic

One can use arithmetic functions to increment dates and times as follows:

- Next second: `t0 + Second(1)`.
- Next minute: `t0 + Minute(1)`.
- Next hour: `t0 + Hour(1)`.
- Next day: `t0 + Day(1)`.
- Next week: `t0 + Day(7)`.
- Next month: `t0 + Month(1)`.
- Next year: `t0 + Year(1)`.

Multiplication with integers and subtraction are also supported.
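For example, using only Julia’s standard `Dates` library:

```julia
using Dates

t0 = DateTime(2024, 1, 25, 9)   # 25 Jan 2024, 9:00 am

t0 + Hour(1)                    # next hour
t0 + Day(7)                     # next week
t0 + 3 * Hour(2)                # periods can be multiplied by integers
t0 - Minute(30)                 # and subtracted
```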

##### 2.5.2.2 Intervals

Similar to floating point intervals, one can define an interval where `t0` and `tend` are `DateTime` objects.

`t0 .. tend`

`2024-01-25T09:00:00 .. 2024-01-25T21:00:00`

The interval can then be incremented by 1 day as follows:

`(t0 .. tend) + Day(1)`

`2024-01-26T09:00:00 .. 2024-01-26T21:00:00`

The following syntax parses differently, incrementing only the finish bound by a day:

`t0 .. tend + Day(1)`

`2024-01-25T09:00:00 .. 2024-01-26T21:00:00`

### 2.6 Decision Definition

The next step in `OptimalDesign` is to define the design structure and requirements that make up the decision variables to be optimized. The term *decision* is used in this context to refer to the design that is yet to be optimized.

#### 2.6.1 Decision Structure

Currently, there are 2 dimensions, with 2 choices each, along which one can vary the structure of the optimal sample times in `OptimalDesign`. These are orthogonal dimensions giving rise to 4 different combinations.

##### 2.6.1.1 `Subject` vs `Population` Decision

In `OptimalDesign`, a decision can be:

- A subject decision, or
- A population decision.

In the first case, a single subject is used and the sample times for this individual subject are optimized. In this case, the output optimal sample times will be a vector of sample times (i.e. `Float64` numbers or `DateTime` objects).

In the second case, when a population of more than 1 subject is used, separate sample times will be optimized for each subject. In this case, the output optimal sample times will be a vector of vectors of sample times. The outer vector has the same length as the population and the inner vectors are the sample times per subject.

Currently, dividing subjects into groups where each group shares the sample times is not supported, except in the trivial case where all the subjects in a group share the same covariates and dosing regimen. In this case, the `subject_multiples` keyword argument (discussed later) can be used to specify how many times (i.e. the group size) each unique subject should be counted when computing the expected FIM.

Increasing `subject_multiples` for all of the subjects in a population by the same ratio will not change the optimal designs and will only change the expected FIM at the optimal design. However, changing `subject_multiples` by different ratios for different subjects will change the optimal design.

Using the same subject 2 times in a `Population` means that there are 2 independent copies of this subject. Each copy will have its own optimal sample times, even though the 2 copies are identical in covariates and dosing regimen. The reason for this is that when sample times are sparse, it can be advantageous to observe different copies at different times to have good overall coverage of all parts of the dynamics when combining the subjects’ data together in the population analysis.

##### 2.6.1.2 Pre-specified vs Optimized Number of Samples Per Time Window

Another dimension along which one can vary the decision structure is by either:

- Pre-specifying the number of samples per time window per subject, or
- Only specifying the total number of samples for all time windows per subject.

In the second case, the optimization algorithm determines how many samples should be in each time window to optimize the objective function. In this tutorial, only the first case is covered.

#### 2.6.2 Floating Point Decisions

Once you define the time bounds, the next step is to define the decision. The `decision` function is used to specify the decision to be optimized.

Julia has a built-in dictionary type called `Dict`. It is a collection of key-value pairs where each key is unique. For more information on dictionaries, see the Julia documentation and the Julia Syntax Pumas tutorial.

The following dictionary can be used to associate a number of samples (2) with each time window. Assuming the unit is “hour”, the time windows below refer to

- Hour 0 to hour 16 of day 1, and
- Hour 0 to hour 16 of day 2.

```
t0, tend = 0.0, 16.0
bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + 24 => 2)
```

```
Dict{ClosedInterval{Float64}, Int64} with 2 entries:
0.0 .. 16.0 => 2
24.0 .. 40.0 => 2
```

After defining the time bounds and number of samples per time window, a `decision` can be constructed for a single `Subject` using:

```
subject = population[1]
subj_dec = decision(
    model,
    subject,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);
```

or for a whole `Population` using:

```
pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);
```

The `minimum_offset` keyword argument can be optionally specified to enforce a minimum duration between any 2 samples. The default value of `minimum_offset` is `1e-4`.

Some additional options which can be specified as keyword arguments include:

- `rng`: the random number generator used to generate a random initial set of sample times. The default value is `Random.GLOBAL_RNG`.
- `init`: the initial sample times. The default value is `nothing`. If `nothing` is input, random sample times are used to initialize the optimization. The initial sample times for a subject are a vector of sample times; the initial sample times for a population are a vector of vectors of sample times. The type of the sample times (`Float64` or `DateTime`) should match that in the `bounds`.
- `subject_multiples`: the number of times each subject is counted. This is a number if only a single subject is passed as the second argument, or a vector if a population is used. The default value is `1` for an individual subject or a vector of ones for a population.
- `time_multiples`: a vector of integers used to count each sample multiple times. If not specified, each sample is counted only once. This is useful when multiple sub-samples can be taken from each sample, giving multiple measurements at the same sample time.
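As an illustrative (untested) sketch, some of these options can be combined in a single `decision` call, assuming `model`, `population`, `params`, and `bounds` are defined as earlier in this tutorial; the seed and multiples below are hypothetical values:

```
using Random

custom_pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
    # reproducible random initial sample times (hypothetical seed)
    rng = Random.MersenneTwister(123),
    # count every subject twice when accumulating the expected information
    subject_multiples = fill(2, length(population)),
);
```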

#### 2.6.3 `DateTime` Decisions

Since models in `Pumas` do not specify any time unit, when `DateTime` decisions are used, a `model_time_unit` must be passed as an additional keyword argument in order to specify the unit of time assumed in the model definition and dynamics, e.g. `model_time_unit = Hour(1)`. In this case, `minimum_offset` should also be defined using a time period object, e.g. `Minute(15)`.

```
t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am
tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm
date_bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + Day(1) => 2)
date_pop_dec = decision(
    model,
    population,
    params;
    bounds = date_bounds,
    minimum_offset = Minute(15),
    model_time_unit = Hour(1),
);
```

### 2.7 Optimizing the Design

In `OptimalDesign`, a number of types of optimal design are supported. The optimality type can be specified using the keyword argument `optimality`, which can take the following values:

- `:aoptimal`: minimizes the trace of the inverse of the expected information matrix.
- `:doptimal`: maximizes the (log) determinant of the expected information matrix. In the optimization, the negative log determinant is minimized instead.
- `:toptimal`: maximizes the trace of the expected information matrix. In the optimization, the negative of the trace is minimized instead.
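To make the three criteria concrete, here is a minimal sketch in plain Julia (independent of `OptimalDesign`) evaluating each minimized objective on a small hypothetical information matrix:

```
using LinearAlgebra

FIM = [4.0 1.0; 1.0 3.0]  # hypothetical 2x2 expected information matrix

aoptimal_objective = tr(inv(FIM))  # A-optimality: trace of the inverse
doptimal_objective = -logdet(FIM)  # D-optimality: negative log determinant
toptimal_objective = -tr(FIM)      # T-optimality: negative trace
```

Smaller values of each objective correspond to a more informative design under the respective criterion.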

To perform the sample time optimization using the previously constructed decision `pop_dec`, run the following line, where `:doptimal` can be replaced with `:aoptimal` or `:toptimal`.

`pop_result = design(pop_dec; optimality = :doptimal)`

```
[ Info: Looking for a feasible solution.
[ Info: Feasible solution found.
[ Info: Optimizing.
Presolve 320 (-32) rows, 256 (0) columns and 640 (-64) elements
Perturbing problem by 0.001% of 1 - largest nonzero change 5.5246371e-05 ( 0.0055246371%) - largest zero change 5.4929341e-05
0 Obj 0.061982079 Primal inf 1023.0274 (192)
81 Obj 0.096826196 Primal inf 138.39655 (60)
138 Obj 0.53525003
Optimal - objective value 0.43285876
After Postsolve, objective 0.43285876, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective 0.4328587606 - 138 iterations time 0.002, Presolve 0.00
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit https://github.com/coin-or/Ipopt
******************************************************************************
```

```
Optimal schedule:
Time points: 4
Subjects: 32
Initial objective: -71.2241621575998
Optimal objective: -77.08595942659673
Size of FIM: (10, 10)
```

To perform the optimization for the subject decision `subj_dec`, run the following line instead. `verbose` is an option for the `design` function that can be set to `true` or `false`. If `true`, the optimization progress will be displayed. Its default value is `false`.

`subj_result = design(subj_dec; optimality = :doptimal, verbose = true)`

```
[ Info: Looking for a feasible solution.
[ Info: Feasible solution found.
[ Info: Optimizing.
Presolve 10 (-1) rows, 8 (0) columns and 20 (-2) elements
Perturbing problem by 0.001% of 1 - largest nonzero change 4.4133758e-05 ( 0.0044133758%) - largest zero change 4.2811508e-05
0 Obj 0.0018711403 Primal inf 34.364217 (6)
4 Obj 0.0030840463
Optimal - objective value 0
After Postsolve, objective 0, infeasibilities - dual 0 (0), primal 0 (0)
Optimal objective 0 - 4 iterations time 0.002, Presolve 0.00
This is Ipopt version 3.14.13, running with linear solver MUMPS 5.6.1.
Number of nonzeros in equality constraint Jacobian...: 0
Number of nonzeros in inequality constraint Jacobian.: 6
Number of nonzeros in Lagrangian Hessian.............: 0
Total number of variables............................: 4
variables with only lower bounds: 0
variables with lower and upper bounds: 4
variables with only upper bounds: 0
Total number of equality constraints.................: 0
Total number of inequality constraints...............: 3
inequality constraints with only lower bounds: 0
inequality constraints with lower and upper bounds: 0
inequality constraints with only upper bounds: 3
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
0 -3.8095391e+01 0.00e+00 6.05e-01 0.0 0.00e+00 - 0.00e+00 0.00e+00 0
1 -3.8097030e+01 0.00e+00 2.73e-02 -5.0 2.95e-02 - 9.88e-01 1.00e+00f 1
2 -3.8125341e+01 0.00e+00 8.37e-03 -2.3 6.10e-01 - 9.82e-01 1.00e+00f 1
3 -3.8160639e+01 0.00e+00 1.14e-02 -3.2 1.26e+00 - 1.00e+00 1.00e+00f 1
4 -3.8321660e+01 0.00e+00 8.52e-02 -3.8 1.72e+00 - 9.96e-01 1.00e+00f 1
5 -3.8564586e+01 0.00e+00 2.88e-01 -3.7 1.21e+01 - 1.00e+00 6.54e-02f 2
6 -3.8751732e+01 0.00e+00 2.88e-01 -2.0 8.58e-01 - 1.00e+00 1.00e+00f 1
7 -4.0034820e+01 0.00e+00 5.45e-01 -2.7 2.53e+00 - 1.00e+00 1.00e+00f 1
8 -4.0137679e+01 0.00e+00 8.93e-01 -1.1 1.51e+02 - 1.00e+00 4.35e-03f 2
9 -4.0275817e+01 0.00e+00 5.99e-01 -1.1 4.26e-01 - 9.22e-01 1.00e+00f 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
10 -4.0604230e+01 0.00e+00 5.72e-01 -1.7 1.24e+00 - 1.00e+00 1.00e+00f 1
11 -4.1685004e+01 0.00e+00 1.99e-01 -1.3 3.19e+01 - 1.00e+00 1.84e-01f 1
12 -4.2079535e+01 0.00e+00 4.43e-01 -1.3 1.39e+01 - 5.98e-01 4.35e-01f 1
13 -4.2191363e+01 0.00e+00 1.96e+00 -2.6 7.49e-01 - 9.98e-01 7.82e-01f 1
14 -4.2287405e+01 0.00e+00 3.62e-01 -2.0 2.92e-01 - 1.00e+00 1.00e+00f 1
15 -4.2367853e+01 0.00e+00 2.78e-01 -3.6 1.70e-01 - 1.00e+00 9.93e-01f 1
16 -4.2417243e+01 0.00e+00 2.34e-01 -5.2 3.89e-01 - 1.00e+00 9.12e-01f 1
17 -4.2427795e+01 0.00e+00 4.27e-02 -5.1 1.10e-01 - 1.00e+00 1.00e+00f 1
18 -4.2428607e+01 0.00e+00 4.50e-03 -6.8 2.46e-02 - 1.00e+00 9.93e-01f 1
19 -4.2428621e+01 0.00e+00 6.69e-04 -8.9 3.95e-03 - 1.00e+00 1.00e+00f 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
20 -4.2428621e+01 0.00e+00 2.16e-04 -11.0 1.52e-04 - 1.00e+00 1.00e+00f 1
21 -4.2428621e+01 0.00e+00 1.44e-06 -11.0 1.34e-04 - 1.00e+00 1.00e+00h 1
Number of Iterations....: 21
(scaled) (unscaled)
Objective...............: -4.2428621213200756e+01 -4.2428621213200756e+01
Dual infeasibility......: 1.4409275527066390e-06 1.4409275527066390e-06
Constraint violation....: 0.0000000000000000e+00 0.0000000000000000e+00
Variable bound violation: 3.9987163091836919e-07 3.9987163091836919e-07
Complementarity.........: 1.0001224572486519e-11 1.0001224572486519e-11
Overall NLP error.......: 1.4409275527066390e-06 1.4409275527066390e-06
Number of objective function evaluations = 27
Number of objective gradient evaluations = 22
Number of equality constraint evaluations = 0
Number of inequality constraint evaluations = 27
Number of equality constraint Jacobian evaluations = 0
Number of inequality constraint Jacobian evaluations = 1
Number of Lagrangian Hessian evaluations = 0
Total seconds in IPOPT = 59.488
EXIT: Optimal Solution Found.
```

```
Optimal schedule:
Time points: 4
Subjects: 1
Initial objective: -38.09539124319573
Optimal objective: -42.428621213200756
Size of FIM: (10, 10)
```

The following additional options can be specified using keyword arguments:

- `time_limit`: time limit in seconds; the default value is `1200` seconds (20 minutes).
- `nlp_tol`: tolerance for the optimization solver; the default value is `1e-5`.
- `auto_initialize`: auto-initializes the optimizer to a feasible design; the default value is `true`.

The output of `design` can be queried in multiple ways using:

- `optimaltimes(pop_result)`: returns the optimal sample times for each subject.
- `initvalue(pop_result)`: returns the initial objective function value.
- `optimalvalue(pop_result)`: returns the optimal objective function value.
- `defficiency(pop_result)`: returns `(det(FIM_2) / det(FIM_1))^(1/M)`, where `FIM_2` is the Fisher information matrix (FIM) at the optimal design, `FIM_1` is the FIM at the initial design, and `M` is the number of model parameters. This metric makes sense to use only when D-optimality is the optimality criterion.
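As a quick illustration of the D-efficiency formula, here is a hedged sketch in plain Julia using two hypothetical diagonal FIMs (not the tutorial's actual matrices):

```
using LinearAlgebra

FIM_1 = [2.0 0.0; 0.0 2.0]  # hypothetical FIM at the initial design
FIM_2 = [4.0 0.0; 0.0 4.0]  # hypothetical FIM at the optimal design
M = 2                       # number of model parameters

defficiency_value = (det(FIM_2) / det(FIM_1))^(1 / M)  # = 2.0 in this toy example
```

A value above 1 indicates that the optimal design is expected to be more informative per parameter than the initial design.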

Let’s query the optimal sample times for the population design `pop_result`:

`optimaltimes(pop_result)`

```
32-element Vector{Vector{Float64}}:
[0.38105232815557555, 1.9602808548550907, 23.999999862240134, 40.000000395892634]
[0.3810569748120521, 1.9602544229450516, 23.999999862240134, 40.00000039589263]
[0.3810555864560761, 1.9601809640522705, 23.99999986224013, 40.00000039589263]
[0.3810544384594627, 1.9602941470027446, 23.999999862240134, 40.000000395892634]
[0.3810511717720558, 1.9602799877661232, 23.999999862240134, 40.000000395892634]
[0.38105317618000834, 1.9602609452364468, 23.99999986224013, 40.000000395892634]
[0.3810506457736531, 1.9602803679640093, 23.999999862240134, 40.000000395892634]
[0.38105122089809734, 1.9602774587508671, 23.99999986224013, 40.000000395892634]
[0.3810534660650994, 1.9602627019724774, 23.999999862240127, 40.000000395892634]
[0.38105275081467044, 1.9602602666679505, 23.99999986224013, 40.000000395892634]
⋮
[0.3810513874257813, 1.9602672399438983, 23.999999862240134, 40.000000395892634]
[0.38105191246414877, 1.9602770014404096, 23.99999986224013, 40.000000395892634]
[0.38104270813521346, 1.9602537752671811, 23.999999862240138, 40.00000039589262]
[0.38105539238888275, 1.960238876742562, 23.99999986224013, 40.00000039589263]
[0.38105463143847496, 1.960265837862734, 23.999999862240134, 40.00000039589263]
[0.38105355401182517, 1.9602842903992255, 23.999999862240127, 40.000000395892634]
[0.3810543805736101, 1.960270948417552, 23.999999862240134, 40.000000395892634]
[0.3810523063496214, 1.960288038090397, 23.99999986224013, 40.00000039589263]
[0.3810607028046549, 1.960265177352722, 23.999999862240134, 40.00000039589263]
```

To get the first subject’s optimal times, use:

`optimaltimes(pop_result)[1]`

```
4-element Vector{Float64}:
0.38105232815557555
1.9602808548550907
23.999999862240134
40.000000395892634
```

Notice that there are 4 sample times per subject, where the first 2 sample times are correctly placed in the first time window and the second 2 sample times are in the second time window.

In some cases, it may make sense to post-process the per-subject optimal designs by binning the optimal sample times. Binning the optimal sample times makes it easier to communicate the result of optimal design to the clinic. For example, we can have 2 bins: 9-10 am and 3-4 pm, and ask the clinic to collect 10 samples in each time bin from all subjects combined where each subject is sampled from once in each time bin. This makes it easier to implement a design that is close to the mathematically optimal design. Time bins can be visually selected by plotting a histogram of the optimal sample times for all subjects combined. To combine all the sample times for all subjects in a single sorted vector, you can use:
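As a hedged sketch of such post-processing, the pooled sample times can be counted per bin in plain Julia; the times and bin edges below are hypothetical toy values, not the tutorial's actual output:

```
# pooled optimal sample times from all subjects (toy data)
all_times = [0.38, 0.38, 1.96, 1.96, 24.0, 24.0, 40.0, 40.0]
# hypothetical bin edges chosen from a histogram of the pooled times
bin_edges = [0.0, 1.0, 12.0, 30.0, 41.0]

# count how many pooled sample times fall in each bin [lo, hi)
counts = [count(t -> lo <= t < hi, all_times)
          for (lo, hi) in zip(bin_edges[1:end-1], bin_edges[2:end])]
# counts == [2, 2, 2, 2] for this toy data
```

The per-bin counts can then be translated into a sampling instruction per time bin for the clinic.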

`all_times = sort(reduce(vcat, optimaltimes(pop_result)))`

```
128-element Vector{Float64}:
0.38104270813521346
0.38104707611689653
0.3810500366594194
0.3810506457736531
0.38105094630935055
0.3810511717720558
0.38105122089809734
0.38105136221967173
0.3810513734169819
0.3810513874257813
⋮
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
40.000000395892634
```

To plot a histogram, you can run:

```
using CairoMakie
fig = Figure()
maxT = all_times[end] + 1
ax = Axis(fig[1, 1]; xticks = 0.0:2.0:maxT)
hist!(ax, all_times)
fig
```

Now let’s query the optimal objective value:

`optimalvalue(pop_result)`

`-77.08595942659673`

Finally, let’s query the design efficiency:

`defficiency(pop_result)`

`1.7971098343760468`

### 2.8 Modifying the Design and Evaluating the Objective Function Value

In some cases, one may want to modify the optimal design in a custom way to impose additional constraints beside the supported time windows’ constraint. In such cases, one can manually evaluate the objective function value using the following routine:

```
# copy the optimal design
new_pop_times = copy.(optimaltimes(pop_result))
# make a change to the design
new_pop_times[1][1] = new_pop_times[1][1] + 0.1
# create a new decision with the new design passed to the `init` keyword argument
new_pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    init = new_pop_times,
);
# evaluate the objective function value at `new_pop_times`
new_pop_result =
    design(new_pop_dec; optimality = :doptimal, auto_initialize = false, time_limit = 1e-8)
# get the initial objective function value
initvalue(new_pop_result)
```

`[ Info: Optimizing.`

`-77.08522458644009`

Notice how the time limit was set to a very small value to force termination at iteration 0.

### 2.9 Optimal Cramer-Rao Lower Bound on the Standard Errors

One benefit of the optimization process is the ability to assess the expected standard errors of the model parameters. As discussed above, the Cramer-Rao bound gives lower bounds on the standard errors by evaluating the FIM at the optimal sample times. To get the expected FIM at the optimal sample times, you can use:

`optimalfim = pop_result.optimalfim`

```
10×10 Symmetric{Float64, Matrix{Float64}}:
23058.7 -55.9232 -9.48401 … 127.247 6.51013 40.6347
-55.9232 28.3815 -8.03756 -0.254189 -6.69216 -5.53398
-9.48401 -8.03756 4495.46 883.984 55.8204 39.7162
856.532 -71.3415 -4.10611 0.468823 0.0166925 73.0659
-2648.57 42.6247 64.2741 47.6985 -0.141491 -272.578
83.2455 -2.06151 -223.137 … 23.8706 1.7099 -19.3811
-1133.12 -163.719 -100.837 -210.575 0.299831 585.393
127.247 -0.254189 883.984 2545.38 -7.24685 82.925
6.51013 -6.69216 55.8204 -7.24685 43.7887 3.77643
40.6347 -5.53398 39.7162 82.925 3.77643 6418.77
```

The expected FIM is a symmetric matrix, so the returned `pop_result.optimalfim` above is of type `Symmetric`, a matrix type in Julia.

The order of parameters here is the same as the order in which they appear in the model.

To get the Cramer-Rao (CR) lower bound on the parameters’ standard errors given the optimal design, you can use:

`sqrt.(diag(inv(optimalfim)))`

```
10-element Vector{Float64}:
0.00668193257106166
0.20060405450278607
0.015738626638131453
0.018451807457342623
0.00687229599423481
0.03702172932779622
0.0052890658747032325
0.020609708673145474
0.15568953630902185
0.012499999999988543
```

One can further divide the CR lower bounds by the parameter values to get the CR lower bounds on the parameters’ relative standard errors given the optimal design.
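As a minimal sketch of this last step (with hypothetical numbers, not the tutorial's actual parameter values):

```
# hypothetical CR lower bounds on the standard errors
se_lower_bounds = [0.0067, 0.2006, 0.0157]
# hypothetical parameter values, in the same order as the FIM
param_values = [0.134, 4.0, 0.52]

# CR lower bounds on the relative standard errors
rse_lower_bounds = se_lower_bounds ./ abs.(param_values)
```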

## 3 Conclusion

In this tutorial, we introduced the concept of optimal design in the context of pharmacometrics and statistics. We discussed the Fisher Information Matrix and its role in evaluating the efficiency of experimental designs, as well as the different criteria for optimality. We also demonstrated how to perform model-based optimal design of experiments in Pumas using the `OptimalDesign`

package.

## 4 Appendix

In this section, the proof of the loose generalization of the scalar Cramer-Rao lower bound for multiple parameters is presented. Shortening \(L(\theta; x, d, y)\) to \(L(\theta; y)\), the Cauchy-Schwarz inequality for random variables is given by: \[ \begin{aligned} \text{Cov}_y \Bigg[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]^2 & \leq \text{Var}_y[w^T \cdot \hat{\theta}] \cdot \text{Var}_y\Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] \\ \text{Var}_y[w^T \cdot \hat{\theta}] & \geq \frac{\text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]^2}{\text{Var}_y\Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]} \end{aligned} \]

The following identities will be used in the remaining steps of the proof:

- \(E_y[\hat{\theta}] = \theta\) because \(\hat{\theta}\) is assumed to be an unbiased estimator of \(\theta\).
- \(E_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\) which implies that \(E_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\).
- \(\text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = E_y \Big[ -\frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T } \Big]\)

The last 2 identities are well known results in statistics about the mean and covariance of the score function \(\frac{\partial \log L(\theta; y)}{\partial \theta}\).
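For completeness, the zero-mean score identity follows from differentiating the normalization condition \(\int L(\theta; y) dy = 1\) under the integral sign: \[ E_y \Bigg[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] = \int \frac{\partial \log L(\theta; y)}{\partial \theta} \cdot L(\theta; y) dy = \int \frac{\partial L(\theta; y)}{\partial \theta} dy = \frac{\partial}{\partial \theta} \int L(\theta; y) dy = \frac{\partial}{\partial \theta} 1 = 0 \]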

Breaking down the numerator on the RHS of the inequality without the square: \[ \begin{multlined} \text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = \\ E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] - E_y[w^T \cdot \hat{\theta}] \cdot \overbrace{E_y \Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]}^{0} \end{multlined} \]

The expectation term can be written as an integral: \[ \begin{aligned} E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \cdot L(\theta; y) dy \\ & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial L(\theta; y)}{\partial \theta} dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} \int w^T \cdot \hat{\theta} \cdot L(\theta; y) dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} E_y[w^T \cdot \hat{\theta}] \\ & = w^T \cdot \frac{\partial (w^T \theta)}{\partial \theta} \\ & = w^T w \end{aligned} \]

The denominator can be written as \[ \begin{aligned} \text{Var}_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] & = w^T \cdot \text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] \cdot w \\ & = w^T \cdot E_y\Big[ - \frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T} \Big] \cdot w \\ & = w^T \cdot I(\theta) \cdot w \end{aligned} \]

Re-writing the Cauchy-Schwarz inequality for random variables using the simplified expressions:

\[ \begin{aligned} \text{Var}_y [ w^T \cdot \hat{\theta} ] & = w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{(w^T \cdot w)^2}{w^T \cdot I(\theta) \cdot w} \\ & \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \end{aligned} \]

This completes the proof.