Introduction to Optimal Design
Optimal design, particularly within the context of pharmacometrics and statistics, is about strategically planning experiments or studies to obtain the most informative data possible while using the fewest resources. This approach is especially important in fields like drug development, where experiments can be costly and time-consuming.
1 Key Principles
- Objective: Before you can design an optimal experiment, you need to be clear about what you’re trying to achieve. Which parameters are you trying to estimate with good precision?
- Decision variables: These are the variables that you can control in your experiment to achieve your objective, e.g. sample times, group sizes or dose amounts. Your objective and decision variables will guide your design.
- Constraints: These are the limits within which you must operate, such as number of samples, time windows or ethical considerations. An optimal design finds the best solution within these constraints.
1.1 Expected Fisher Information
The most common criterion for optimality when designing experiments uses a concept known as the expected Fisher information. Consider a single subject study with experimental design \(d\) and covariates \(x\). The design \(d\) includes the controllable covariates of the subject, e.g. dose amount and sample times. The covariates \(x\) include the uncontrollable covariates, assumed fixed, e.g. age and sex.
Additionally, consider a model with a single parameter \(\theta\) and a marginal likelihood function \(L(\theta; x, d, y)\) where \(y\) is the unobserved response of the subject. The expected Fisher information \(I(\theta)\) is given by the expected value of the second derivative of \(-\log L\), where the differentiation is with respect to the parameter \(\theta\) and the expectation is with respect to the response \(y\): \[ I(\theta; x, d) = E_y \Bigg[ -\frac{\partial ^2 \log L(\theta; x, d, y)}{\partial \theta^2} \Bigg] \]
This quantity gives a theoretical measure of information that your experiment gives you about your parameter \(\theta\).
A design \(d\) that maximizes \(I(\theta; x, d)\) for given values of \((\theta, x)\) while satisfying all the design constraints (if any) is the optimal design according to this optimality criterion.
When \(\theta\) is made up of \(M > 1\) parameters instead of a single parameter, the expected Fisher information becomes an \(M \times M\) matrix: \[ I(\theta; x, d) = E_y \Bigg[ -\frac{\partial^2 \log L(\theta; x, d, y)}{\partial \theta \cdot \partial \theta^T} \Bigg] \] This matrix is known as the expected Fisher information matrix (FIM). When comparing 2 designs \(d_1\) and \(d_2\) given the above matrix, there are multiple ways to summarize the matrix using a single scalar objective function value. These summarization methods are commonly known as optimality criteria and will be discussed below.
When considering \(N\) subjects with covariates \(x = \{ x_i: i = 1 \dots N \}\), the Fisher information is additive with respect to the subjects: \[ I(\theta; x, d) = \sum_{i=1}^{N} I(\theta; x_i, d) \]
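To make the definition concrete, here is a minimal sketch in plain Julia (not the Pumas API) that estimates \(I(\theta)\) by Monte Carlo for a toy model \(y \sim \text{Normal}(\theta, 1)\), where the exact Fisher information is 1:

```julia
using ForwardDiff, Random, Statistics

# -log L(θ; y) for the toy model y ~ Normal(θ, 1), up to an additive constant
neg_loglik(θ, y) = 0.5 * (y - θ)^2

# Monte Carlo estimate of E_y[-∂²/∂θ² log L(θ; y)], simulating responses at θ
function expected_fisher(θ; nsim = 10_000, rng = Random.default_rng())
    ys = randn(rng, nsim) .+ θ
    mean(
        ForwardDiff.derivative(t -> ForwardDiff.derivative(s -> neg_loglik(s, y), t), θ)
        for y in ys
    )
end

expected_fisher(0.5) # ≈ 1.0 for this model, for any θ
```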
The interpretation of the expected FIM is as follows:
- Diagonal element (\(I_{ii}\)): This represents the information about the parameter \(\theta_i\), quantifying how much information the data provides about this parameter. It reflects the sensitivity of the likelihood function to changes in this parameter.
- Off-diagonal element (\(I_{ij}\) where \(i \neq j\)): This element represents the interaction term between the estimates of parameters \(\theta_i\) and \(\theta_j\), indicating how changes in one parameter estimate might affect the estimate of another. If these off-diagonal elements are non-zero, it suggests that there is some level of correlation between the parameters in how they are informed by the data.
1.1.1 Example
To give a concrete example, if you’re estimating two parameters, say \(\mu\) (mean) and \(\sigma^2\) (variance), from a normal distribution:
- A diagonal element, such as \(I_{\mu\mu}\), would quantify how sensitive the likelihood is to changes in \(\mu\), directly relating to the precision with which \(\mu\) can be estimated from the data.
- An off-diagonal element, like \(I_{\mu\sigma^2}\), would indicate how the estimate of \(\mu\) might change with changes in \(\sigma^2\) and vice versa, revealing any interdependence in their estimation from the data.
Here’s how this particular FIM would be represented:
\[ I(\theta) = \begin{bmatrix} I_{\mu\mu} & I_{\mu\sigma^2} \\ I_{\sigma^2\mu} & I_{\sigma^2\sigma^2} \end{bmatrix} \]
Where:
- \(I_{\mu \mu}\) is the information about the mean, reflecting how much the data tells us about \(\mu\).
- \(I_{\sigma^2 \sigma^2}\) is the information about the variance, reflecting how much the data informs us about \(\sigma^2\).
- \(I_{\mu \sigma^2}\) and \(I_{\sigma^2 \mu}\) represent the interaction between the estimates of \(\mu\) and \(\sigma^2\), indicating the degree to which the estimation of one parameter is influenced by the estimation of the other.
For a Normal distribution with independent observations and maximum likelihood estimation, the Fisher information elements for \(\mu\) and \(\sigma^2\) given a sample of size \(n\) are:
- \(I_{\mu\mu} = \frac{n}{\sigma^2}\)
- \(I_{\sigma^2\sigma^2} = \frac{n}{2\sigma^4}\)
The off-diagonal elements, \(I_{\mu\sigma^2}\) and \(I_{\sigma^2\mu}\), are zero, reflecting that the estimates of \(\mu\) and \(\sigma^2\) are uncorrelated under a Normal distribution with independent observations.
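These formulas can be checked numerically in plain Julia (independent of OptimalDesign): averaging the Hessian of \(-\log L\) over simulated datasets recovers \(n/\sigma^2\) and \(n/(2\sigma^4)\) with vanishing off-diagonal elements.

```julia
using ForwardDiff, Random, Statistics

# -log L for n iid observations y from Normal(μ, σ²), parameterized as p = [μ, σ²]
nll(p, y) = sum(0.5 * log(2π * p[2]) + (yi - p[1])^2 / (2 * p[2]) for yi in y)

rng = Random.MersenneTwister(1)
μ, σ², n = 1.0, 4.0, 50
fim = mean(
    begin
        # simulate a fresh dataset, then take the Hessian at the true parameters
        y = sqrt(σ²) .* randn(rng, n) .+ μ
        ForwardDiff.hessian(p -> nll(p, y), [μ, σ²])
    end for _ in 1:5_000
)
# fim ≈ [n/σ² 0; 0 n/(2σ⁴)] = [12.5 0; 0 1.5625]
```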
1.1.2 Random Effects and First Order Approximation
To compute the expected FIM, one needs the Hessian of \(-\log L\) with respect to the parameters \(\theta\) where \(L\) is the marginal likelihood. In nonlinear mixed effects (NLME) models, to compute \(\log L\), one would need to marginalize the random effects first. In Pumas, the first order (FO) approximation method is used to marginalize the random effects and approximate the marginal likelihood. The parameters \(\theta\) in optimal design therefore only refers to the population-level parameters.
1.1.3 Cramer-Rao Lower Bound for a Single Parameter
The Cramer-Rao (CR) lower bound further highlights the importance of the expected Fisher information. The CR lower bound for single-parameter models is: \[ \text{Var}_y [ \hat{\theta}(x, d, y) ] \geq \frac{1}{I(\theta; x, d)} \] where \(\text{Var}_y[\hat{\theta}(x, d, y)]\) is the variance of the unbiased maximum likelihood estimator \(\hat{\theta}(x, d, y)\) assuming the true parameter value is \(\theta\). Moving forward, \(\hat{\theta}(x, d, y)\) will be shortened to \(\hat{\theta}\), and \(I(\theta; x, d)\) will be shortened to \(I(\theta)\).
The above expression for the CR lower bound is only correct for scalar \(\theta\). You cannot divide by a matrix!
Choosing an experimental design \(d\) that minimizes \(1 / I(\theta)\) (i.e., maximizes \(I(\theta)\)), only minimizes a lower bound on the variance \(\text{Var}_y [ \hat{\theta} ]\), not the variance itself. This does not guarantee that \(\text{Var}_y [ \hat{\theta} ]\) is itself minimized because in theory the lower bound can be loose. However in practice, minimizing the lower bound \(1 / I(\theta)\) tends to be a good surrogate objective for minimizing \(\text{Var}_y [ \hat{\theta} ]\), which is the quantity we actually care about minimizing.
Given that in NLME models, the marginal likelihood (and by extension \(I(\theta)\)) is approximated using a first order approximation, the computed \(1 / I(\theta)\) is only an approximation of the actual lower bound. While this approximate lower bound may be a good surrogate when optimizing the experiment design, using simulation to evaluate the final design is usually recommended. In addition, running a simulation to evaluate the design can be useful to confirm that the model will be estimable using the software intended for the eventual data analysis with the future experimental data.
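As an illustration, a simulation-based check could look roughly like the following hedged sketch, which assumes the usual Pumas simulate-then-refit workflow; the sample times and the FOCE estimation method here are illustrative choices, and `model`, `population` and `params` are as defined in the example later in this tutorial.

```julia
# Hedged sketch: simulate one trial at candidate sample times, refit, inspect SEs
obstimes = [0.5, 2.0, 24.0, 40.0]                   # illustrative candidate design
sims = simobs(model, population, params; obstimes)  # simulate future observations
fpm = fit(model, Subject.(sims), params, FOCE())    # refit with the intended method
coeftable(infer(fpm))                               # inspect the standard errors
```

Repeating this over many simulated trials gives an empirical picture of parameter uncertainty under the design, which can be compared against the CR lower bounds.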
1.1.4 Cramer-Rao Lower Bound for a Vector of Parameters
One generalization of the scalar CR lower bound for multiple parameters, i.e. vector \(\theta\), is: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] which is true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\). \(M\) is the number of parameters in \(\theta\). \(\text{Var}_y[\hat{\theta}]\) is the \(M \times M\) covariance matrix of the unbiased maximum likelihood estimate \(\hat{\theta}\). The proof of this lower bound is shown in Section 4.
When restricting \(w\) so that \(w^T \cdot w = 1\), i.e. \(w\) is a unit vector, the generalized CR lower bound simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \] When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we recover the scalar CR lower bound: \[ \text{Var}_y [ \hat{\theta}_i ] \geq \frac{1}{I(\theta)_{i,i}} \]
Another, tighter and more commonly used generalization of the scalar CR lower bound, originally presented by Rao in 1945 and later shown to be a special case of the Hammersley-Chapman-Robbins lower bound, is: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \] This is also true for all weight vectors \(w \in \mathbb{R}^M\) except the all-zero \(w = [0, \dots, 0]\).
Note that the division by \(w^T \cdot w\) on both sides is not necessary but is done for consistency with the previous lower bound.
This improved lower bound can also be equivalently written as: \[ \text{Var}_y[\hat{\theta}] - I(\theta)^{-1} \succeq 0 \] where \(A \succeq 0\) means that \(A\) is a positive semi-definite matrix.
The second generalization is a tighter lower bound when \(\theta\) is a vector because: \[ \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] for all \(w\), assuming \(w^T \cdot w > 0\), which follows from the Cauchy-Schwarz inequality: \[ (u^T \cdot v)^2 \leq (u^T \cdot u) \cdot (v^T \cdot v) \] by substituting: \[ \begin{aligned} u & = I(\theta)^{1/2} \cdot w \\ v & = I(\theta)^{-1/2} \cdot w \end{aligned} \] where \(I(\theta)^{1/2}\) is the symmetric positive definite square root of \(I(\theta)\) and \(I(\theta)^{-1/2}\) is its inverse.
The relationship between the 2 bounds is therefore given by: \[ \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot I(\theta)^{-1} \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \] If \(w^T \cdot w\) is assumed to be 1, this simplifies to: \[ w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq w^T \cdot I(\theta)^{-1} \cdot w \geq \frac{1}{w^T \cdot I(\theta) \cdot w} \]
When \(w\) is 1 for a single parameter \(i\) and 0 for all other parameters, we get: \[ \text{Var}_y[\hat{\theta}_i] \geq (I(\theta)^{-1})_{i,i} \geq \frac{1}{I(\theta)_{i,i}} \]
If \(\theta\) is a scalar, or more generally if \(w\) is an eigenvector of \(I(\theta)\), the 2 lower bounds are equal.
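The gap between the two bounds is easy to see numerically for an expected FIM with correlated parameters:

```julia
using LinearAlgebra

fim = [2.0 0.8; 0.8 1.0]  # an expected FIM with correlated parameters
diag(inv(fim))            # tighter lower bounds: ≈ [0.735, 1.471]
1 ./ diag(fim)            # looser lower bounds:    [0.5, 1.0]
```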
1.2 Criteria for Optimality
When optimizing the design to maximize a measure of information gain for multi-parameter models, there are multiple choices of scalar objective functions, all of which are functions of the expected FIM. Each of these scalar objective functions aggregates one of the 2 lower bounds presented above, used as a surrogate objective for minimizing the variances of the parameters' estimates. Each optimality criterion has a different name. Let's explore some of the most commonly used criteria.
To define any of these objective functions, the expected FIM, \(I(\theta)\), is typically only computed for a user-provided value of \(\theta = \theta_0\) (taken from literature), which is assumed to be the true value of \(\theta\). This approach is often called local optimal design.
1.2.1 D-Optimality
D-Optimal design maximizes the (log) determinant of the expected FIM. By doing so, a D-optimal design aims to minimize the volume of the confidence ellipsoid around the estimated parameters.
In Pumas, the negative of the log determinant of the expected FIM is minimized, which is equivalent to maximizing the determinant of the expected FIM.
Application: Ideal for experiments where the goal is to estimate parameters and all their weighted combinations with maximum average precision in the presence of correlations.
1.2.2 T-Optimality
T-Optimality maximizes the trace (the sum of the diagonal elements) of the expected FIM, \(\sum_i I_{i,i}\). This criterion indirectly minimizes the average \(\text{Var}_y [ \hat{\theta}_i ]\) by maximizing the sum of the denominators of the loose CR lower bounds, which indirectly minimizes the loose CR lower bounds \(1 / I_{i,i}\).
In Pumas, the negative of the trace of the expected FIM is minimized, which is equivalent to maximizing the trace.
Application: Useful when the objective is to minimize the average variance across all parameter estimates or when we would like to use different weights on parameters by incorporating weights in the objective.
1.2.3 A-Optimality
A-Optimality minimizes the trace (the sum of the diagonal elements) of the inverse of the expected FIM, \(\sum_i (I^{-1})_{i,i}\). Similar to T-optimality, this criterion indirectly minimizes the average \(\text{Var}_y [ \hat{\theta}_i ]\) by minimizing the sum of the tighter lower bounds, \((I^{-1})_{i,i}\).
Application: Similar to T-optimality.
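Written out as code, the three criteria differ only in how they scalarize the expected FIM. A minimal sketch in plain Julia, using the minimization sign convention described above:

```julia
using LinearAlgebra

d_objective(fim) = -logdet(fim)  # D-optimality: minimize the negative log determinant
t_objective(fim) = -tr(fim)      # T-optimality: minimize the negative trace
a_objective(fim) = tr(inv(fim))  # A-optimality: minimize the trace of the inverse

fim = [2.0 0.8; 0.8 1.0]
d_objective(fim), t_objective(fim), a_objective(fim) # (-0.307..., -3.0, 2.206...)
```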
1.2.4 Using Weights and Parameter Scales
Both T-Optimality and A-Optimality can be extended using weights on parameters. In T-Optimality, the weighted objective to be maximized is \(\sum_i p_i \cdot I_{i,i}\), where \(p_i\) is the weight of parameter \(\theta_i\). In A-Optimality, the weighted objective to be minimized is \(\sum_i p_i \cdot (I^{-1})_{i,i}\).
Using weights can help us achieve any of the following goals:
- Weights allow us to vary the level of importance of different parameters. For example, we typically care more about the clearance parameter than the inter-individual variability (IIV) parameters. More important parameters should get higher weights than less important parameters. Parameters which we do not care about can be given a weight of 0.
- Weights can also be used to make the objective function value invariant to the scales of the parameters. Some parameters may be in the 100s while other parameters may be in the 0.1s, and the optimal design objective function values are sensitive to these scales.
- The product of importance weights and scaling weights can be used to achieve both goals 1 and 2 simultaneously.
In T-Optimality, to make the objective function invariant to the parameters’ scales, one can use the weights \(p_i = \theta_i^2\). The justification is shown below.
Using the loose CR lower bound and dividing both sides by \(\theta_i^2\), we get: \[ \begin{aligned} \text{Var}_y [ \hat{\theta}_i ] & \geq \frac{1}{I(\theta)_{i,i}} \\ \frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{1}{\theta_i^2 \cdot I(\theta)_{i,i}} \end{aligned} \] If \(\hat{\theta} \approx \theta\), \(\text{Var}_y [ \hat{\theta}_i ] / \theta_i^2\) will be approximately equal to the square of the relative standard error estimate of \(\hat{\theta}_i\), which is invariant to the parameters’ scales. T-Optimality tries to indirectly minimize the LHS of the inequality by maximizing the sum of the denominators of the RHS terms for all \(i\): \[ \sum_i \theta_i^2 \cdot I(\theta)_{i,i} \] Setting \(p_i = \theta_i^2\) in T-Optimality therefore makes the objective function invariant to the parameters’ scales.
In A-Optimality, weights \(p_i = 1 / \theta_i^2\) can be used to achieve invariance to the parameters’ scales. This also follows from the tight CR lower bound and assuming \(\hat{\theta} \approx \theta\). \[ \begin{aligned} \text{Var}_y [ \hat{\theta}_i ] & \geq (I(\theta)^{-1})_{i,i} \\ \frac{\text{Var}_y [ \hat{\theta}_i ]}{\theta_i^2} & \geq \frac{(I(\theta)^{-1})_{i,i}}{\theta_i^2} \end{aligned} \]
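In code, the weighted objectives (again in minimization form) with these scale-invariance weights might look like this plain-Julia sketch, not the OptimalDesign API:

```julia
using LinearAlgebra

# Weighted T-optimality with p_i = θ_i²: minimize -Σ_i θ_i² ⋅ I_ii
weighted_t(fim, θ) = -sum(abs2.(θ) .* diag(fim))

# Weighted A-optimality with p_i = 1/θ_i²: minimize Σ_i (I⁻¹)_ii / θ_i²
weighted_a(fim, θ) = sum(diag(inv(fim)) ./ abs2.(θ))
```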
Using weights to make the objective function value invariant to the parameters’ scales assumes that none of the parameters is exactly 0. For parameters with value 0 or very close to 0, using the above weighting scheme can give spurious results.
2 Example in Pumas
Let's dive into an example using the OptimalDesign package in Pumas.
2.1 Loading OptimalDesign
The first step is to load the OptimalDesign package using:

```julia
using OptimalDesign
```

By loading OptimalDesign, Pumas will be automatically loaded as well. If you are doing an optimal design workflow, you don't need to load Pumas separately.
2.2 Model Definition
After loading OptimalDesign, you should define a Pumas model.
```julia
model = @model begin
    @param begin
        tvcl ∈ RealDomain(; lower = 0)
        tvvc ∈ RealDomain(; lower = 0)
        tvka ∈ RealDomain(; lower = 0)
        Ω ∈ PSDDomain(3)
        σ ∈ RealDomain(; lower = 0)
    end
    @random begin
        η ~ MvNormal(Ω)
    end
    @pre begin
        CL = tvcl * exp(η[1])
        Vc = tvvc * exp(η[2])
        Ka = tvka * exp(η[3])
    end
    @dynamics Depots1Central1
    @derived begin
        dv ~ @. Normal(Central / Vc, σ)
    end
end
```
PumasModel
Parameters: tvcl, tvvc, tvka, Ω, σ
Random effects: η
Covariates:
Dynamical system variables: Depot, Central
Dynamical system type: Closed form
Derived: dv
Observed: dv
As you can see, this is a simple one-compartment model with first-order absorption.
2.3 Define a Subject/Population for Optimal Design
To perform optimal design, one needs to specify the subject(s) for which optimal sample times will be identified. A Subject in Pumas typically includes:
- Covariates, possibly time-varying
- Dosing regimen
- Sample times
- Observation names, e.g. dv
- Observation values
All of the above information is required when using a Subject for fitting a model. However, for the purpose of optimal design, while the same Subject data structure is used, some of its fields become optional and are ignored if present. For example, sample times are optimized in optimal design, so they are not required as input. Observation values are also assumed to be missing in optimal design since presumably no experiment has been conducted yet.
There are 2 ways to define a Subject or Population for optimal design, discussed below.
2.3.1 Manual Definition
The following is an example where a Subject is defined manually using the Subject constructor.
```julia
subject = Subject(; events = DosageRegimen(100), observations = (; dv = nothing))
```
Subject
ID: 1
Events: 1
Observations: dv: (n=0)
To define a population manually, you can use the map function to define multiple subjects programmatically. The following defines a population of 10 identical subjects with IDs going from 1 to 10.
```julia
population = map(
    i -> Subject(; id = i, events = DosageRegimen(100), observations = (; dv = nothing)),
    1:10,
)
```
Population
Subjects: 10
Observations: dv
or define a vector of subjects manually:
```julia
population = [
    Subject(; id = 1, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 2, events = DosageRegimen(100), observations = (; dv = nothing)),
    Subject(; id = 3, events = DosageRegimen(100), observations = (; dv = nothing)),
]
```
Population
Subjects: 3
Observations: dv
Note that you can pass any covariate to the Subject constructor using the covariates keyword argument, e.g. covariates = (; AGE = 30).
Notice how the sample times were not specified in the above Subject constructors. Also notice that the observation values were set to nothing. However, it is important to use the correct observation name, dv, when constructing the subject to be consistent with the model, matching the observed variables' names in the @derived block.
If an observation name is missing from the observations keyword argument but present in the model, sample times for this response variable will not be optimized. This has applications when we require a separate design for PK sample times vs PD sample times for each subject. This will be presented in a future tutorial.
2.3.2 Read from a DataFrame
Besides the manual definition, one can also conveniently define a Population from a DataFrame using read_pumas. The DataFrame may be one of the standard datasets available in the PharmaDatasets package, or may be read from a CSV file.
When reading subjects' data from a DataFrame using read_pumas and then doing optimal design, only the observation names, dosing information and covariates are taken into account. The observation values and sample time points are ignored. If time-varying covariates are present in the data, the time points in the data will be used to define the time-varying covariates only, but new sample times will be selected for the optimal design.
To give an example, we'll be using the Warfarin PK dataset from the PharmaDatasets package.
```julia
using PharmaDatasets
using DataFrames
using DataFramesMeta

df = dataset("warfarin.csv")
first(df, 5)
```
| Row | id | time | dv | amt | evid | cmt |
|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64? | Int64? | Int64 | Int64 |
| 1 | 1 | 0.0 | missing | 70 | 1 | 1 |
| 2 | 1 | 0.5 | 9.9973 | missing | 0 | 1 |
| 3 | 1 | 1.0 | 7.3587 | missing | 0 | 1 |
| 4 | 1 | 2.0 | 7.4868 | missing | 0 | 1 |
| 5 | 1 | 6.0 | 9.6466 | missing | 0 | 1 |
The dataset contains the following columns:
- :id: the subject ID.
- :time: the time of the observation.
- :dv: the observed drug concentration.
- :amt: the dose administered.
- :evid: the event ID, indicating whether the observation is a dose or a concentration measurement.
- :cmt: the compartment ID, indicating the compartment where the dose was administered.
Let’s check the summary statistics of the dataset:
```julia
describe(df, :mean, :std, :min, :max, :nuniqueall)
```
| Row | variable | mean | std | min | max | nuniqueall |
|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Real | Real | Int64 |
| 1 | id | 16.5 | 9.24916 | 1 | 32 | 32 |
| 2 | time | 29.0556 | 39.3667 | 0.0 | 120.0 | 9 |
| 3 | dv | 4.68351 | 2.63184 | 0.28399 | 12.184 | 256 |
| 4 | amt | 70.0 | 0.0 | 70 | 70 | 1 |
| 5 | evid | 0.111111 | 0.314817 | 0 | 1 | 2 |
| 6 | cmt | 1.0 | 0.0 | 1 | 1 | 1 |
We have 32 subjects each with 8 measurements and 1 dose event.
```julia
@chain df begin
    @rsubset :id == 1
    @by :evid :nrow = length(:time)
end
```
| Row | evid | nrow |
|---|---|---|
| | Int64 | Int64 |
| 1 | 0 | 8 |
| 2 | 1 | 1 |
With the data loaded, we can now proceed to parse it into a Population:
```julia
population = read_pumas(
    df;
    id = :id,
    time = :time,
    observations = [:dv],
    amt = :amt,
    evid = :evid,
    cmt = :cmt,
)
```
Population
Subjects: 32
Observations: dv
2.4 Specify the Population Parameters
Currently, only local optimal design is available, so the user must specify the values of the population parameters to use. These parameter values will be assumed to be the true parameter values when computing the expected FIM.
Population parameters are the parameters of the model that are shared across the population. These include the so-called typical values of the parameters, e.g. tvcl, tvvc and tvka, along with the IIV parameter Ω and the residual variability parameter σ in the Warfarin model.
```julia
params = (;
    tvcl = 0.15,
    tvvc = 8,
    tvka = 1,
    Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
    σ = sqrt(0.01),
)
```
(tvcl = 0.15,
tvvc = 8,
tvka = 1,
Ω = [0.07 0.0 0.0; 0.0 0.02 0.0; 0.0 0.0 0.6],
σ = 0.1,)
2.5 Time Window Specification
In OptimalDesign, one needs to specify the time windows in which optimal samples will be chosen. These time windows should represent clinical hours to increase the probability that the design found by OptimalDesign is actually feasible to implement in practice.
In OptimalDesign, users can express time intervals as regular floating point number bounds, i.e. Float64. This would be typical for designing future clinical trials. Alternatively, for convenience in managing clinical scheduling in a hospital or clinic setting, the Dates standard Julia package is available to easily specify the time bounds for optimization using specific dates and clock times via the DateTime type. If the time bounds are passed in as Float64, the optimal design's times will be in Float64 format. Alternatively, if the time bounds are passed in using DateTime, the optimal design's times will be in DateTime format.
2.5.1 Floating Point Time Windows
Let us define the start and end times of a time window via t0 and tend:

```julia
t0, tend = 0.0, 16.0
```
(0.0, 16.0)
In Julia, you can define multiple variables in the same line of code, where the variable names and values are separated by commas. The following 2 code snippets are equivalent:

```julia
t0, tend = 0.0, 16.0
```

```julia
t0 = 0.0
tend = 16.0
```
The time unit assumed for t0 and tend matches the time unit assumed in the model. Hence, if your model assumes "hour" is the time unit, then the time bounds above represent hour 0 and hour 16.
To define time bounds, OptimalDesign makes use of intervals in Julia. One can define a time window that starts at t0 and ends at tend using:

```julia
t0 .. tend
```
0.0 .. 16.0
Arithmetic on intervals is also supported to increment the whole interval by a fixed number. However, parentheses must be used carefully. The following is the correct syntax to increment the start and finish bounds by 1 day, assuming the model’s unit is “hour”:
```julia
(t0 .. tend) + 24
```
24.0 .. 40.0
The following syntax parses differently, incrementing only the finish bound by a day:

```julia
t0 .. tend + 24
```
0.0 .. 40.0
2.5.2 DateTime Time Windows
We'll now cover Julia's DateTime API and how to use it in OptimalDesign. Let us define the start and end times of a time window via t0 and tend:

```julia
t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am
```
2024-01-25T09:00:00
```julia
tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm
```
2024-01-25T21:00:00
The DateTime constructor's inputs are of the form: (Year, Month, Day, Hour, Minute, Second, Millisecond). Dropping arguments from the end assigns their value to 0, e.g. the Minute, Second and Millisecond values in t0 and tend above are 0.
2.5.2.1 DateTime Arithmetic
One can use arithmetic functions to increment dates and times as follows:
- Next second: t0 + Second(1).
- Next minute: t0 + Minute(1).
- Next hour: t0 + Hour(1).
- Next day: t0 + Day(1).
- Next week: t0 + Day(7).
- Next month: t0 + Month(1).
- Next year: t0 + Year(1).
Multiplication with integers and subtraction are also supported.
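For example, using the t0 defined above:

```julia
t0 - Hour(2)     # 2024-01-25T07:00:00, two hours earlier
t0 + 3 * Day(1)  # 2024-01-28T09:00:00, three days later
```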
2.5.2.2 Intervals
Similar to floating point intervals, one can define an interval where t0 and tend are DateTime objects:

```julia
t0 .. tend
```
2024-01-25T09:00:00 .. 2024-01-25T21:00:00
The interval can then be incremented by 1 day as follows:

```julia
(t0 .. tend) + Day(1)
```
2024-01-26T09:00:00 .. 2024-01-26T21:00:00
The following syntax parses differently, incrementing only the finish bound by a day:

```julia
t0 .. tend + Day(1)
```
2024-01-25T09:00:00 .. 2024-01-26T21:00:00
2.6 Decision Definition
The next step in OptimalDesign is to define the design structure and requirements that make up the decision variables to be optimized. The term decision is used in this context to refer to the design that is yet to be optimized.
2.6.1 Decision Structure
Currently, there are 2 dimensions, with 2 choices each, along which one can vary the structure of the optimal sample times in OptimalDesign. These are orthogonal dimensions giving rise to 4 different combinations.
2.6.1.1 Subject vs Population Decision
In OptimalDesign, a decision can be:
- A subject decision, or
- A population decision.
In the first case, a single subject is used and the sample times for this individual subject are optimized. In this case, the output optimal sample times will be a vector of sample times (i.e. Float64 numbers or DateTime objects).
In the second case, when a population of more than 1 subject is used, separate sample times will be optimized for each subject. In this case, the output optimal sample times will be a vector of vectors of sample times. The outer vector has the same length as the population and the inner vectors are the sample times per subject.
Currently, dividing subjects into groups where each group shares the sample times is not supported, except in the trivial case where all the subjects in a group share the same covariates and dosing regimen. In this case, the subject_multiples keyword argument (discussed later) can be used to specify how many times (i.e. the group size) each unique subject should be counted when computing the expected FIM.
Increasing subject_multiples for all of the subjects in a population by the same ratio will not change the optimal design and will only change the expected FIM at the optimal design. However, changing subject_multiples by different ratios for different subjects will change the optimal design.
Using the same subject 2 times in a Population means that there are 2 independent copies of this subject. Each copy will have its own optimal sample times, even though the 2 copies are identical in covariates and dosing regimen. The reason for this is that when sample times are sparse, it can be advantageous to observe different copies at different times to have good overall coverage of all parts of the dynamics when combining the subjects' data together in the population analysis.
2.6.1.2 Pre-specified vs Optimized Number of Samples Per Time Window
Another dimension along which one can vary the decision structure is by either:
- Pre-specifying the number of samples per time window per subject, or
- Only specifying the total number of samples for all time windows per subject.
In the second case, the optimization algorithm determines how many samples should be in each time window to optimize the objective function. In this tutorial, only the first case is covered.
2.6.2 Floating Point Decisions
Once you define the time bounds, the next step is to define the decision. The decision function is used to specify the decision to be optimized.
Julia has a built-in dictionary type called Dict. It is a collection of key-value pairs where each key is unique. For more information on dictionaries, see the Julia documentation and the Julia Syntax Pumas tutorial.
The following dictionary can be used to associate a number of samples (2) with each time window. Assuming the unit is "hour", the time windows below refer to
- Hour 0 to hour 16 of day 1, and
- Hour 0 to hour 16 of day 2.
```julia
t0, tend = 0.0, 16.0
bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + 24 => 2)
```
Dict{ClosedInterval{Float64}, Int64} with 2 entries:
0.0 .. 16.0 => 2
24.0 .. 40.0 => 2
After defining the time bounds and number of samples per time window, a decision can be constructed for a single Subject using:
```julia
subject = population[1]
subj_dec = decision(
    model,
    subject,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);
```
or for a whole Population using:
```julia
pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    minimum_offset = 0.25,
);
```
The minimum_offset keyword argument can be optionally specified to enforce a minimum duration between any 2 samples. The default value of minimum_offset is 1e-4.
Some additional options which can be specified as keyword arguments include:
- rng: the random number generator used to generate random initial sample times. The default value is Random.GLOBAL_RNG.
- init: the initial sample times. The default value is nothing. If nothing is input, random sample times are used to initialize the optimization. The initial sample times for a subject are a vector of sample times; for a population, a vector of vectors of sample times. The type of the sample times (Float64 or DateTime) should match that in the bounds.
- subject_multiples: the number of times each subject is counted. This is a number if only a single subject is passed in the second argument, or a vector if a population is used. The default value is 1 for an individual subject or a vector of ones for a population.
- time_multiples: a vector of integers used to replicate each sample multiple times. If not specified, each sample will be counted only once. This is useful when multiple sub-samples can be taken from each sample, giving multiple measurements at the same sample time. A hedged sketch using the last two options is shown below.
2.6.3 DateTime Decisions
Since models in Pumas do not specify any time unit, when DateTime decisions are used, a model_time_unit must be passed as an additional keyword argument in order to specify the unit of time assumed in the model definition and dynamics, e.g. model_time_unit = Hour(1). In this case, minimum_offset should also be defined using a time period, e.g. Minute(15), rather than a plain number.
```julia
t0 = DateTime(2024, 1, 25, 9) # 25 Jan 2024 - 9:00 am
tend = DateTime(2024, 1, 25, 21) # 25 Jan 2024 - 9:00 pm
date_bounds = Dict((t0 .. tend) => 2, (t0 .. tend) + Day(1) => 2)
date_pop_dec = decision(
    model,
    population,
    params;
    bounds = date_bounds,
    minimum_offset = Minute(15),
    model_time_unit = Hour(1),
);
```
2.7 Optimizing the Design
In OptimalDesign, a number of types of optimal design are supported. The optimality type can be specified using the keyword argument optimality, which can take the following values:
- :aoptimal: minimizes the trace of the inverse of the expected information matrix.
- :doptimal: maximizes the (log) determinant of the expected information matrix. In the optimization, the negative of the log determinant is minimized instead.
- :toptimal: maximizes the trace of the expected information matrix. In the optimization, the negative of the trace is minimized instead.
To perform the sample time optimization using the previously constructed decision pop_dec, run the following line, where :doptimal can be replaced with :aoptimal or :toptimal.
```julia
pop_result = design(pop_dec; optimality = :doptimal)
```
┌ Warning: Samples 3 and 4 of subject 23 closer than minimum offset of 0.25.
└ @ OptimalDesign ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/OptimalDesign/9srk5/src/optimaldesign/observationtimes.jl:920
[ Info: Looking for a feasible solution.
[ Info: Feasible solution found.
[ Info: Optimizing.
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit https://github.com/coin-or/Ipopt
******************************************************************************
Optimal schedule:
Time points per subject: 4
Subjects: 32
Initial objective: -71.9777910601971
Optimal objective: -77.08595941067784
Size of FIM: (10, 10)
To perform the optimization for the subject decision subj_dec, run the following line instead. verbose is an option for the design function that can be set to true or false. If true, the optimization progress will be displayed. Its default value is false.
```julia
subj_result = design(subj_dec; optimality = :doptimal, verbose = true)
```
┌ Warning: Samples 1 and 2 of subject 1 closer than minimum offset of 0.25.
└ @ OptimalDesign ~/_work/PumasTutorials.jl/PumasTutorials.jl/custom_julia_depot/packages/OptimalDesign/9srk5/src/optimaldesign/observationtimes.jl:920
[ Info: Looking for a feasible solution.
Running HiGHS 1.7.2 (git hash: 5ce7a2753): Copyright (c) 2024 HiGHS under MIT licence terms
Coefficient ranges:
Matrix [1e+00, 1e+00]
Cost [1e+00, 1e+00]
Bound [1e-07, 4e+01]
RHS [2e-01, 3e+01]
Solving LP without presolve, or with basis, or unconstrained
Using EKK dual simplex solver - serial
Iteration Objective Infeasibilities num(sum)
0 -1.0659208859e-05 Pr: 1(0.189192) 0s
2 1.8919186585e-01 Pr: 0(0) 0s
Model status : Optimal
Simplex iterations: 2
Objective value : 1.8919186585e-01
HiGHS run time : 0.00
[ Info: Feasible solution found.
[ Info: Optimizing.
This is Ipopt version 3.14.16, running with linear solver MUMPS 5.7.3.
Number of nonzeros in equality constraint Jacobian...: 0
Number of nonzeros in inequality constraint Jacobian.: 6
Number of nonzeros in Lagrangian Hessian.............: 0
Total number of variables............................: 4
variables with only lower bounds: 0
variables with lower and upper bounds: 4
variables with only upper bounds: 0
Total number of equality constraints.................: 0
Total number of inequality constraints...............: 3
inequality constraints with only lower bounds: 0
inequality constraints with lower and upper bounds: 0
inequality constraints with only upper bounds: 3
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
0 -3.4871474e+01 1.89e-01 8.35e+00 0.0 0.00e+00 - 0.00e+00 0.00e+00 0
1 -3.8182363e+01 0.00e+00 6.99e-01 0.8 5.98e+00 - 1.13e-01 1.00e+00f 1
2 -3.8207289e+01 0.00e+00 4.94e-02 -0.8 1.22e-01 - 1.00e+00 1.00e+00f 1
3 -3.8218060e+01 0.00e+00 5.55e-02 -2.7 7.45e-02 - 9.94e-01 1.00e+00f 1
4 -3.8229749e+01 0.00e+00 5.94e-02 -4.0 7.34e-02 - 1.00e+00 1.00e+00f 1
5 -3.8243154e+01 0.00e+00 6.83e-02 -5.6 7.55e-02 - 1.00e+00 1.00e+00f 1
6 -3.8274064e+01 0.00e+00 8.88e-02 -6.8 1.84e-01 - 1.00e+00 1.00e+00f 1
7 -3.8316691e+01 0.00e+00 1.13e-01 -8.2 1.84e-01 - 1.00e+00 1.00e+00f 1
8 -3.8381138e+01 0.00e+00 1.46e-01 -9.8 2.14e-01 - 1.00e+00 1.00e+00f 1
9 -3.8486158e+01 0.00e+00 1.95e-01 -11.0 2.71e-01 - 1.00e+00 1.00e+00f 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
10 -3.8669911e+01 0.00e+00 2.64e-01 -11.0 3.56e-01 - 1.00e+00 1.00e+00f 1
11 -3.9003243e+01 0.00e+00 3.45e-01 -11.0 4.81e-01 - 1.00e+00 1.00e+00f 1
12 -3.9565703e+01 0.00e+00 3.91e-01 -11.0 6.44e-01 - 1.00e+00 1.00e+00f 1
13 -4.0311297e+01 0.00e+00 4.29e-01 -11.0 8.34e-01 - 1.00e+00 1.00e+00f 1
14 -4.1228214e+01 0.00e+00 1.44e-01 -9.0 2.80e+05 - 2.09e-05 4.74e-06f 2
15 -4.1249214e+01 0.00e+00 9.14e-01 -11.0 1.84e+00 - 1.00e+00 1.00e+00f 1
16 -4.1728801e+01 0.00e+00 5.07e-01 -11.0 1.22e+00 - 1.00e+00 1.00e+00f 1
17 -4.1748855e+01 0.00e+00 5.08e-01 -11.0 1.43e+00 - 1.00e+00 8.28e-02f 1
18 -4.1749073e+01 0.00e+00 5.08e-01 -11.0 9.06e+00 - 1.00e+00 1.19e-04f 1
19 -4.1749088e+01 0.00e+00 4.89e-01 -11.0 3.65e+00 - 1.00e+00 1.82e-05f 1
iter objective inf_pr inf_du lg(mu) ||d|| lg(rg) alpha_du alpha_pr ls
20 -4.2409915e+01 0.00e+00 2.14e-01 -11.0 1.00e+01 - 1.00e+00 5.33e-01f 1
21 -4.2412803e+01 0.00e+00 2.23e-01 -10.4 1.95e+01 - 1.00e+00 2.73e-03f 1
22 -4.2412847e+01 0.00e+00 2.35e-01 -7.3 1.97e+01 - 5.85e-01 2.71e-05f 1
23 -4.2413032e+01 0.00e+00 2.23e-01 -8.2 1.42e+00 - 1.00e+00 1.47e-03f 1
24 -4.2426810e+01 0.00e+00 1.02e-01 -9.2 8.72e-02 - 1.00e+00 1.00e+00f 1
25 -4.2428461e+01 0.00e+00 2.79e-02 -10.3 2.97e-02 - 1.00e+00 1.00e+00f 1
26 -4.2428620e+01 0.00e+00 2.70e-03 -11.0 6.54e-03 - 1.00e+00 1.00e+00f 1
27 -4.2428621e+01 0.00e+00 8.23e-05 -11.0 6.74e-04 - 1.00e+00 1.00e+00f 1
28 -4.2428621e+01 0.00e+00 2.26e-07 -11.0 1.96e-05 - 1.00e+00 1.00e+00h 1
Number of Iterations....: 28
(scaled) (unscaled)
Objective...............: -4.2428621213201318e+01 -4.2428621213201318e+01
Dual infeasibility......: 2.2551172117557782e-07 2.2551172117557782e-07
Constraint violation....: 0.0000000000000000e+00 0.0000000000000000e+00
Variable bound violation: 3.9987164512922391e-07 3.9987164512922391e-07
Complementarity.........: 1.0000000000000001e-11 1.0000000000000001e-11
Overall NLP error.......: 2.2551172117557782e-07 2.2551172117557782e-07
Number of objective function evaluations = 32
Number of objective gradient evaluations = 29
Number of equality constraint evaluations = 0
Number of inequality constraint evaluations = 32
Number of equality constraint Jacobian evaluations = 0
Number of inequality constraint Jacobian evaluations = 1
Number of Lagrangian Hessian evaluations = 0
Total seconds in IPOPT = 28.985
EXIT: Optimal Solution Found.
Optimal schedule:
Time points per subject: 4
Subjects: 1
Initial objective: -37.41445869156619
Optimal objective: -42.42862121320132
Size of FIM: (10, 10)
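The design function also accepts a few solver-control options, listed next. For example, to tighten the solver tolerance and cap the run time (both values here are arbitrary):

```julia
# Hedged example using the solver-control keywords described below
pop_result_tight = design(pop_dec; optimality = :doptimal, nlp_tol = 1e-7, time_limit = 600)
```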
The following additional options can be specified using keyword arguments:
- time_limit: time limit in seconds; the default value is 1200 seconds (20 minutes).
- nlp_tol: tolerance for the optimization solver; the default value is 1e-5.
- auto_initialize: auto-initializes the optimizer to a feasible design; the default value is true.
The output of design can be queried in multiple ways using:
- optimaltimes(pop_result): returns the optimal sample times for each subject.
- initvalue(pop_result): returns the initial objective function value.
- optimalvalue(pop_result): returns the optimal objective function value.
- defficiency(pop_result): returns (det(FIM_2) / det(FIM_1))^(1/M) where FIM_2 is the Fisher information matrix (FIM) at the optimal design, FIM_1 is the FIM at the initial design, and M is the number of model parameters. This metric makes sense to use only when D-optimality is the optimality criterion.
Let's query the optimal sample times for the population design pop_result:

```julia
optimaltimes(pop_result)
```
32-element Vector{Vector{Float64}}:
[0.3810372576787533, 1.9603905991486463, 23.999999862240035, 40.00000039589277]
[0.38103392783481577, 1.9604347073168455, 23.999999862240035, 40.000000395892776]
[0.38103863739849214, 1.9603676164858244, 23.999999862240035, 40.00000039589277]
[0.3810389909054834, 1.9603920061263924, 23.99999986224004, 40.00000039589277]
[0.38103623611377, 1.960405475045552, 23.999999862240035, 40.00000039589277]
[0.3810393854989022, 1.9603920420235714, 23.999999862240035, 40.00000039589277]
[0.3810422953069568, 1.9603149931503359, 23.99999986224003, 40.00000039589277]
[0.3810200560512011, 1.9606983921516647, 23.999999862240024, 40.000000395892805]
[0.3810370777997231, 1.9604033147897997, 23.99999986224004, 40.00000039589277]
[0.3810358147814479, 1.96052759813349, 23.999999862240028, 40.00000039589279]
⋮
[0.3810550051125516, 1.960163053175076, 23.999999862240035, 40.000000395892755]
[0.3810295663180821, 1.9604897342799048, 23.99999986224003, 40.00000039589278]
[0.38103392401284003, 1.9604437228521738, 23.999999862240035, 40.000000395892776]
[0.38103547218703127, 1.9604216137260408, 23.999999862240035, 40.00000039589277]
[0.3810471305601967, 1.960561132431024, 23.999999862240056, 40.00000039589276]
[0.38105573340711063, 1.9603524650984474, 23.999999862240035, 40.00000039589276]
[0.38103722793294426, 1.9604103575635246, 23.999999862240035, 40.00000039589277]
[0.3810399209138152, 1.960377566853783, 23.999999862240035, 40.00000039589277]
[0.38106851837702943, 1.9601452821968544, 23.99999986224005, 40.00000039589274]
To get the first subject's optimal times, use:

```julia
optimaltimes(pop_result)[1]
```
4-element Vector{Float64}:
0.3810372576787533
1.9603905991486463
23.999999862240035
40.00000039589277
Notice that there are 4 sample times per subject, where the first 2 sample times are correctly placed in the first time window and the last 2 are in the second time window.
In some cases, it may make sense to post-process the per-subject optimal designs by binning the optimal sample times. Binning the optimal sample times makes it easier to communicate the result of optimal design to the clinic. For example, we can have 2 bins, 9-10 am and 3-4 pm, and ask the clinic to collect 10 samples in each time bin from all subjects combined, with each subject sampled once in each time bin. This makes it easier to implement a design that is close to the mathematically optimal design. Time bins can be visually selected by plotting a histogram of the optimal sample times for all subjects combined. To combine all the sample times for all subjects in a single sorted vector, you can use:
```julia
all_times = sort(reduce(vcat, optimaltimes(pop_result)))
```
128-element Vector{Float64}:
0.3810200560512011
0.3810295663180821
0.38103392401284003
0.38103392783481577
0.38103547218703127
0.3810358147814479
0.38103623611377
0.3810369441155763
0.3810370777997231
0.38103720029296567
⋮
40.00000039589277
40.00000039589277
40.00000039589277
40.00000039589277
40.000000395892776
40.000000395892776
40.00000039589278
40.00000039589279
40.000000395892805
To plot a histogram, you can run:
```julia
using CairoMakie

fig = Figure()
maxT = all_times[end] + 1
ax = Axis(fig[1, 1]; xticks = 0.0:2.0:maxT)
hist!(ax, all_times)
fig
```
Now let’s query the optimal objective value:
```julia
optimalvalue(pop_result)
```
-77.08595941067784
Finally, let’s query the design efficiency:
```julia
defficiency(pop_result)
```
1.6666520188678398
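Since the D-optimal objective is the negative log determinant of the expected FIM, this value can be reproduced from the reported objective values using the defficiency formula above with \(M = 10\) parameters:

```julia
# (det(FIM_2) / det(FIM_1))^(1/M) = exp((initvalue - optimalvalue) / M) for D-optimality
exp((initvalue(pop_result) - optimalvalue(pop_result)) / 10)
# 1.6666520188678398
```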
2.8 Modifying the Design and Evaluating the Objective Function Value
In some cases, one may want to modify the optimal design in a custom way to impose additional constraints besides the supported time window constraints. In such cases, one can manually evaluate the objective function value using the following routine:
```julia
# copy the optimal design
new_pop_times = copy.(optimaltimes(pop_result))
# make a change to the design
new_pop_times[1][1] = new_pop_times[1][1] + 0.1
# create a new decision with the new design passed to the `init` keyword argument
new_pop_dec = decision(
    model,
    population,
    params;
    type = :observation_times,
    bounds = bounds,
    init = new_pop_times,
);
# evaluate the objective function value at `new_pop_times`
new_pop_result =
    design(new_pop_dec; optimality = :doptimal, auto_initialize = false, time_limit = 1e-8)
# get the initial objective function value
initvalue(new_pop_result)
```
[ Info: Optimizing.
-77.08522472865653
Notice how the time limit was set to a very small value to force termination at iteration 0.
2.9 Optimal Cramer-Rao Lower Bound on the Standard Errors
One benefit of the optimization process is that it lets us assess the expected standard errors of the model parameters. As discussed above, we can get lower bounds on the standard errors via the Cramer-Rao methods, which evaluate the FIM at the optimal sample times. To get the expected FIM at the optimal sample times, you can use:
```julia
optimalfim = pop_result.optimalfim
```
10×10 Symmetric{Float64, Matrix{Float64}}:
23058.9 -55.9287 -9.47757 … 127.252 6.51013 40.6359
-55.9287 28.3817 -8.03798 -0.25446 -6.69217 -5.53359
-9.47757 -8.03798 4495.32 883.997 55.8224 39.7138
856.559 -71.3421 -4.10585 0.468732 0.0166898 73.0627
-2648.71 42.6282 64.2718 47.6952 -0.141467 -272.549
83.2394 -2.06136 -223.14 … 23.8679 1.70977 -19.3797
-1133.02 -163.722 -100.832 -210.558 0.299778 585.323
127.252 -0.25446 883.997 2545.39 -7.24621 82.9182
6.51013 -6.69217 55.8224 -7.24621 43.7887 3.77636
40.6359 -5.53359 39.7138 82.9182 3.77636 6418.76
The expected FIM is a symmetric matrix, so the returned pop_result.optimalfim above is of type Symmetric, a matrix type in Julia.
The order of parameters here is the same as the order they appear in the model.
To get the Cramer-Rao (CR) lower bound on the parameters’ standard errors given the optimal design, you can use:
```julia
sqrt.(diag(inv(optimalfim)))
```
10-element Vector{Float64}:
0.006681912606368064
0.20060343386977755
0.015738918225125218
0.018451796796853603
0.006872280239217231
0.03702172496321876
0.00528903929262707
0.020609696143178924
0.15568962237057749
0.01249999999981819
One can further divide the CR lower bounds by the parameter values to get the CR lower bounds on the parameters’ relative standard errors given the optimal design.
3 Conclusion
In this tutorial, we introduced the concept of optimal design in the context of pharmacometrics and statistics. We discussed the Fisher Information Matrix and its role in evaluating the efficiency of experimental designs, as well as the different criteria for optimality. We also demonstrated how to perform model-based optimal design of experiments in Pumas using the OptimalDesign
package.
4 Appendix
In this section, the proof of the loose generalization of the scalar Cramer-Rao lower bound for multiple parameters is presented. Shortening \(L(\theta; x, d, y)\) to \(L(\theta; y)\), the Cauchy-Schwarz inequality for random variables is given by: \[ \begin{aligned} \text{Cov}_y \Bigg[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]^2 & \leq \text{Var}_y[w^T \cdot \hat{\theta}] \cdot \text{Var}_y\Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] \\ \text{Var}_y[w^T \cdot \hat{\theta}] & \geq \frac{\text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]^2}{\text{Var}_y\Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big]} \end{aligned} \]
The following identities will be used in the remaining steps of the proof:
- \(E_y[\hat{\theta}] = \theta\) because \(\hat{\theta}\) is assumed to be an unbiased estimator of \(\theta\).
- \(E_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\) which implies that \(E_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = 0\).
- \(\text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = E_y \Big[ -\frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T } \Big]\)
The last 2 identities are well known results in statistics about the mean and covariance of the score function \(\frac{\partial \log L(\theta; y)}{\partial \theta}\).
Breaking down the numerator on the RHS of the inequality without the square: \[ \begin{multlined} \text{Cov}_y \Big[ w^T \cdot \hat{\theta}, w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] = \\ E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] - E_y[w^T \cdot \hat{\theta}] \cdot \overbrace{E_y \Bigg[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg]}^{0} \end{multlined} \]
The expectation term can be written as an integral: \[ \begin{aligned} E_y \Bigg[ w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Bigg] & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \cdot L(\theta; y) dy \\ & = \int w^T \cdot \hat{\theta} \cdot w^T \cdot \frac{\partial L(\theta; y)}{\partial \theta} dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} \int w^T \cdot \hat{\theta} \cdot L(\theta; y) dy \\ & = w^T \cdot \frac{\partial}{\partial \theta} E_y[w^T \cdot \hat{\theta}] \\ & = w^T \cdot \frac{\partial (w^T \theta)}{\partial \theta} \\ & = w^T w \end{aligned} \]
The denominator can be written as \[ \begin{aligned} \text{Var}_y \Big[ w^T \cdot \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] & = w^T \cdot \text{Var}_y \Big[ \frac{\partial \log L(\theta; y)}{\partial \theta} \Big] \cdot w \\ & = w^T \cdot E_y\Big[ - \frac{\partial^2 \log L(\theta; y)}{\partial \theta \cdot \partial \theta^T} \Big] \cdot w \\ & = w^T \cdot I(\theta) \cdot w \end{aligned} \]
Re-writing the Cauchy-Schwarz inequality for random variables using the simplified expressions:
\[ \begin{aligned} \text{Var}_y [ w^T \cdot \hat{\theta} ] & = w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w \geq \frac{(w^T \cdot w)^2}{w^T \cdot I(\theta) \cdot w} \\ & \frac{w^T \cdot \text{Var}_y[\hat{\theta}] \cdot w}{w^T \cdot w} \geq \frac{w^T \cdot w}{w^T \cdot I(\theta) \cdot w} \end{aligned} \]
This completes the proof.