Supervised Learning

Authors:
Abdelwahed Khamis, Mohamed Tarek

What is Supervised Learning?

  • Learning from labeled training data
  • Goal: Learn a mapping from inputs to outputs
  • System learns from examples with known answers
  • Like learning with a teacher who provides correct answers

Key Components

  • Input Variables (\(x\)): Features or predictors
  • Output Variables (\(y\)): Target or labels
  • Training Data: Pairs of (\(x\), \(y\))
  • Model: The learned mapping function
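As a minimal sketch of these components (pure Python, with made-up numbers), the training data is a set of (\(x\), \(y\)) pairs and the "model" is a mapping learned from them; here the mapping is the simplest possible one, a line through the origin fitted by least squares:

```python
# Training data: pairs of input features (x) and known outputs (y).
# All numbers are illustrative, not from any real dataset.
X = [1.0, 2.0, 3.0, 4.0]   # input variable, e.g. dose
y = [2.1, 3.9, 6.2, 8.1]   # output variable, e.g. measured response

# The "model" is the learned mapping from x to y.
# Here: fit y ~ a*x by least squares (closed-form slope).
a = sum(xi * yi for xi, yi in zip(X, y)) / sum(xi * xi for xi in X)

def model(x):
    """Learned mapping: predict y for a (possibly unseen) input x."""
    return a * x

print(round(a, 2))           # slope learned from the labeled examples
print(round(model(5.0), 2))  # prediction for an input not in the training data
```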

Types of Supervised Learning

  1. Classification
    • Predicts categorical outputs
    • Example: Drug response (responding/non-responding)
    • Models: kNN, Decision Trees
  2. Regression
    • Predicts continuous values
    • Example: Drug concentration over time
    • Models: Gradient Boosting
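To make the classification case concrete, here is a toy 1-nearest-neighbour classifier in pure Python (the simplest kNN, with k = 1). The features, labels, and function name are illustrative assumptions, not from the source:

```python
import math

# Toy training data: (dose, biomarker) feature pairs with a known
# response label. Values are illustrative only.
X_train = [(10.0, 1.2), (12.0, 1.5), (50.0, 4.1), (55.0, 3.8)]
y_train = ["non-responding", "non-responding", "responding", "responding"]

def predict_1nn(x):
    """Classify x with the label of its nearest training point (1-NN)."""
    distances = [math.dist(x, xi) for xi in X_train]
    nearest = distances.index(min(distances))
    return y_train[nearest]

print(predict_1nn((52.0, 4.0)))  # -> responding (nearest high-dose example)
```

A regression model has the same shape, except the prediction is a continuous value (e.g. the average response of the k nearest neighbours) rather than a category.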

Common Applications in Drug Development

  • Patient response prediction
  • Drug-target interaction prediction
  • Toxicity classification
  • Dose-response modeling

Data-Driven Workflow

Model Evaluation

Strategies for Model Evaluation

  • To evaluate the performance of a predictive model, or compare different models, we need to assess how well the model performs on unseen data.
  • A model is said to generalize well if it performs well on new, independent datasets.
  • Overfitting: Happens when a model learns the training data too well, including its noise and outliers. This results in high accuracy on the training set but poor performance on unseen data.
  • Evaluating or comparing models using their performance on the training data is therefore prone to favoring overfitted models.
  • To detect and prevent overfitting, there are two common strategies for model evaluation.
  • The specific performance metrics used to evaluate the model depend on the type of problem (classification or regression) and the goals of the analysis; however, the general strategies for evaluating the model are the same.
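Overfitting in its most extreme form can be sketched in a few lines: a model that memorizes the training data scores perfectly on it but fails on unseen data, while a simpler model that captures the underlying rule generalizes better. The data-generating rule and model names below are invented for illustration:

```python
import random

random.seed(0)

def make_data(n):
    """Toy labeled data: label is 1 when x > 0.5, flipped 10% of the time (noise)."""
    xs = [random.random() for _ in range(n)]
    ys = [(1 if x > 0.5 else 0) if random.random() > 0.1 else (0 if x > 0.5 else 1)
          for x in xs]
    return list(zip(xs, ys))

train, test = make_data(50), make_data(50)

# Overfitted "model": memorizes every training point, noise included.
lookup = dict(train)
def memorizer(x):
    return lookup.get(x, 0)  # unseen inputs fall back to a default guess

# Simpler model: learns only the underlying rule, not the noise.
def threshold_model(x):
    return 1 if x > 0.5 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train))        # 1.0 -- perfect on training data
print(accuracy(memorizer, test))         # poor on unseen data
print(accuracy(threshold_model, test))   # much better generalization
```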

Strategies for Model Evaluation

There are two main strategies to evaluate model performance:

  1. Training-Validation-Test Split: Split the dataset into a training set, a validation set and a test set.
    • The model is trained on the training set.
    • Its hyper-parameters are fine-tuned using the validation set.
    • Its generalization performance is evaluated using the test set.
    • If no fine-tuning is done, the validation set can be omitted and the dataset is simply split into a training set and a test set.
    • The splitting strategy should reflect the intended use of the model.
    • Example: If the model is to be used to extrapolate a patient’s response to new dose levels, the test set should contain dose levels outside the range of dose levels in the training set.
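The two splitting choices above can be sketched in pure Python. The dose-response numbers are made up; the point is the contrast between a random split (interpolation within the observed range) and a split that holds out the highest doses (extrapolation, matching the example):

```python
import random

random.seed(42)

# Illustrative dataset: (dose, response) pairs with some noise.
data = [(dose, 2.0 * dose + random.gauss(0, 1)) for dose in range(1, 21)]

# Random 80/20 split: appropriate when the model will be used to
# interpolate within the observed dose range.
shuffled = data[:]
random.shuffle(shuffled)
split = int(0.8 * len(shuffled))
train, test = shuffled[:split], shuffled[split:]

# Extrapolation split: hold out the highest doses, mirroring the intended
# use of predicting responses at dose levels outside the training range.
data_sorted = sorted(data)
train_extrap, test_extrap = data_sorted[:16], data_sorted[16:]

print(len(train), len(test))  # 16 4
print(max(x for x, _ in train_extrap) < min(x for x, _ in test_extrap))  # True
```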

Strategies for Model Evaluation

  2. Cross-validation: The dataset is divided into k subsets (folds). The model is trained on k-1 folds and validated/tested on the remaining fold. This process is repeated k times, with each fold used as the validation set once. The performance metrics are averaged over the k iterations.
    • Common choices for k are 5 to 10.
    • Leave-one-out cross-validation (LOOCV) is a special case where k equals the number of data points.
    • Cross-validation is more computationally intensive but provides a more robust estimate of model performance.
    • Cross-validation can be used to fine-tune hyper-parameters in the model.
    • If no fine-tuning is done, cross-validation can be used to estimate the generalization/test performance of the model.
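The k-fold procedure above can be written out directly. This sketch uses a simple least-squares line as the model and mean squared error as the metric; both choices, and the function name, are illustrative assumptions:

```python
import random

random.seed(1)

# Illustrative data: noisy linear dose-response pairs.
data = [(x, 3.0 * x + random.gauss(0, 0.5)) for x in range(20)]

def kfold_mse(data, k=5):
    """Average held-out mean squared error of a least-squares line over k folds."""
    shuffled = data[:]
    random.shuffle(shuffled)
    fold_size = len(shuffled) // k
    errors = []
    for i in range(k):
        # Fold i is held out; the other k-1 folds form the training set.
        held_out = shuffled[i * fold_size:(i + 1) * fold_size]
        rest = shuffled[:i * fold_size] + shuffled[(i + 1) * fold_size:]
        # Fit y = a*x + b by ordinary least squares on the training folds.
        n = len(rest)
        mx = sum(x for x, _ in rest) / n
        my = sum(y for _, y in rest) / n
        a = (sum((x - mx) * (y - my) for x, y in rest)
             / sum((x - mx) ** 2 for x, _ in rest))
        b = my - a * mx
        # Evaluate on the held-out fold only.
        mse = sum((a * x + b - y) ** 2 for x, y in held_out) / len(held_out)
        errors.append(mse)
    # Average the metric over the k iterations.
    return sum(errors) / k

print(round(kfold_mse(data), 3))  # small, since the data is nearly linear
```

Setting k equal to `len(data)` turns this into leave-one-out cross-validation (LOOCV).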

Cross-validation
