Like learning with a teacher who provides correct answers
Key Components
Input Variables (\(x\)): Features or predictors
Output Variables (\(y\)): Target or labels
Training Data: Pairs of (\(x\), \(y\))
Model: The learned mapping function
Types of Supervised Learning
Classification
Predicts categorical outputs
Example: Drug response (responding/non-responding)
Models: kNN, Decision Trees
Regression
Predicts continuous values
Example: Drug concentration over time
Models: Gradient Boosting
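The two task types above can be sketched with scikit-learn; the tiny datasets below are invented purely for illustration (a single biomarker level for classification, time-after-dose for regression).

```python
# Hypothetical illustration: a kNN classifier and a gradient-boosting
# regressor, each trained on a small made-up dataset.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingRegressor

# Classification: predict responder (1) vs non-responder (0)
# from a single (assumed) biomarker level.
X_cls = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y_cls = [0, 0, 0, 1, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3).fit(X_cls, y_cls)
print(clf.predict([[3.2]]))  # → [1]: the 3 nearest neighbors are responders

# Regression: predict drug concentration from time after dosing.
X_reg = [[0], [1], [2], [4], [8]]
y_reg = [10.0, 7.4, 5.5, 3.0, 0.9]
reg = GradientBoostingRegressor(random_state=0).fit(X_reg, y_reg)
print(reg.predict([[3.0]]))  # a continuous predicted concentration
```

The classifier outputs one of a fixed set of labels, while the regressor outputs an arbitrary real number; this is the essential difference between the two task types.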
Common Applications in Drug Development
Patient response prediction
Drug-target interaction prediction
Toxicity classification
Dose-response modeling
Data-Driven Workflow
Model Evaluation
Strategies for Model Evaluation
To evaluate the performance of a predictive model, or compare different models, we need to assess how well the model performs on unseen data.
A model is said to generalize well if it performs well on new, independent datasets.
Overfitting: Happens when a model learns the training data too well, including its noise and outliers. This results in high accuracy on the training set but poor performance on unseen data.
Evaluating and comparing models by their performance on the training data is therefore misleading, since it rewards overfitting.
To detect and prevent overfitting, there are two common strategies for model evaluation.
The specific performance metrics depend on the type of problem (classification or regression) and the goals of the analysis; the general evaluation strategies, however, are the same.
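Overfitting is easy to demonstrate on synthetic data (everything below is assumed/made up): an unconstrained decision tree memorizes its training set, including the deliberately injected label noise, so its training score says nothing about performance on unseen data.

```python
# Sketch: training accuracy vs held-out accuracy for an overfit model.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
y[rng.random(200) < 0.2] ^= 1  # flip ~20% of labels: irreducible noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # 1.0: the tree memorizes the noise too
print(tree.score(X_te, y_te))  # noticeably lower on held-out data
```

The gap between the two scores is the signature of overfitting, and it is only visible because part of the data was held out.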
There are two main strategies to evaluate model performance:
Training-Validation-Test Split: Split the dataset into a training set, a validation set and a test set.
The model is trained on the training set.
Its hyper-parameters are fine-tuned using the validation set.
Its generalization performance is evaluated using the test set.
If no fine-tuning is done, the validation set can be omitted and the dataset is simply split into a training set and a test set.
The splitting strategy should reflect the intended use of the model.
Example: If the model is to be used to extrapolate a patient’s response to new dose levels, the test set should contain dose levels outside the range of dose levels in the training set.
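A minimal sketch of the three-way split, applying scikit-learn's `train_test_split` twice; the 60/20/20 proportions are an arbitrary illustrative choice.

```python
# Split data into training (60%), validation (20%), and test (20%) sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features
y = np.arange(100)                 # placeholder targets

# First split off the test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 × 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Note that a plain random split like this would not be appropriate for the dose-extrapolation example above; there, the test set must be constructed deliberately from dose levels outside the training range.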
Cross-validation: The dataset is divided into k subsets (folds). The model is trained on k-1 folds and validated/tested on the remaining fold. This process is repeated k times, with each fold used as the validation set once. The performance metrics are averaged over the k iterations.
Common choices for k are 5 to 10.
Leave-one-out cross-validation (LOOCV) is a special case where k equals the number of data points.
Cross-validation is more computationally intensive but provides a more robust estimate of model performance.
Cross-validation can be used to fine-tune hyper-parameters in the model.
If no fine-tuning is done, cross-validation can be used to estimate the generalization/test performance of the model.
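The k-fold procedure above, and LOOCV as the k = n special case, can be sketched with scikit-learn's `cross_val_score` on a synthetic dataset (the data and the kNN model are assumptions for illustration):

```python
# 5-fold cross-validation and leave-one-out cross-validation (LOOCV).
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=5)

scores = cross_val_score(model, X, y, cv=5)  # one score per fold
print(scores.mean())                         # averaged estimate of test performance

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # k = n = 50
print(loo_scores.mean())
```

Each of the k fits trains on k−1 folds and scores on the remaining fold, so every data point is used for validation exactly once; averaging the fold scores gives the robust performance estimate described above.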