# Bring packages to environment
import Pkg
Pkg.activate(".")
Pkg.instantiate()
using KernelFunctions, PharmaDatasets, Statistics
using MLJ, LinearAlgebra, CSV, DataFrames
using Random, Distances, Clustering, MultivariateStats, GLMakie
Random.seed!(378); # Set random seed for reproducibility
# Load models
KMedoids = @load KMedoids pkg = Clustering
EvoTreeClassifier = @load EvoTreeClassifier pkg = EvoTrees
knn = @load KNNClassifier pkg = NearestNeighborModels
ContinuousEncoder = @load ContinuousEncoder pkg = MLJModels
KernelPCA = @load KernelPCA pkg = MultivariateStats
PCA = @load PCA pkg = MultivariateStats

Machine learning fundamentals
1 Introduction
In the context of the first unit of the “AI for drug development” course, this tutorial provides a didactic use of real data and a real study to introduce machine learning (ML) fundamentals, rather than to build robust clinical tools. The discussion is aimed at professionals and academics from healthcare, especially pharmacometrics.
Furthermore, the workflow we present here assumes no prior knowledge about AI. The focus of each section is as follows:
- We begin by summarizing the context of the study below.
- In Section 2, common data-processing steps are presented, such as imputation, one-hot encoding, and standardization.
- In Section 3, the distinction between supervised and unsupervised applications is explained. Some unsupervised models are also discussed there, focusing on dimensionality reduction (through principal component analysis) and clustering (including K-medoids and K-means).
- In Section 4, we present decision trees, ensembles, random forests, and gradient-boosted trees. Also in that section, options for model evaluation in binary classification are detailed. Lastly, K-nearest neighbors is explained, and references for further study are listed.
In the case study analyzed, the goal is to determine the main predictors of response to a treatment for critical limb ischemia (CLI), an advanced-stage disease characterized by blockage of the arteries in a limb. The authors of “Characteristics of responders to autologous bone marrow cell therapy for no-option critical limb ischemia” (Madaric et al. 2016) made their dataset available as a spreadsheet containing multiple features about the 62 patients who initially enrolled in the study. The endpoints of interest in the study are:
- Primary endpoints:
  - Responder: surviving patients with limb salvage and wound healing.
  - Non-responder: patients requiring major limb amputation or those with no signs of wound healing.
- Secondary endpoints:
  - Mortality.
  - Amputation-free survival.
  - Major limb amputation.
  - Change in tcpO2.
  - Rutherford category.
  - Pain scale after BMC transplantation.
From the point of view of ML, this leads to a classification problem: there is a finite number of discrete possible outcomes. Further, the problem is specifically one of binary classification: either a patient responded to the treatment or not. In contrast, a regression problem has a continuous outcome, such as forecasting energy demand or traffic congestion. Other high-level groupings of ML algorithms can be found in (Brownlee 2023).
It’s important to point out that a dataset with 62 data points is extremely small for ML applications; in general, even datasets with hundreds of points are considered small. Still, the main concepts can be illustrated, although the specific performance values obtained should be read as didactic rather than definitive.
First, some Julia packages will be necessary. The MLJ.jl package will be used throughout the tutorial, but the ML concepts and techniques discussed here are tool-agnostic.
2 Data processing
ML projects tend to follow very similar steps (Mehreen 2022). To start the analysis, the file is read and some preprocessing is done. This includes converting occurrences of empty strings ("") into the value missing. In the data overview, the nmissing column indicates the number of missing values for that dataset column.
# Read spreadsheet data
df = CSV.read(dataset("CLIstemCell", String), DataFrame; missingstring = ["", "#NA!"])
describe(df)[:, Not(7)] |> println; # Print overview
47×6 DataFrame
Row │ variable mean min median max nmissi ⋯
│ Symbol Float64 Real Float64 Real Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ PatientID 31.5 1 31.5 62 ⋯
2 │ HealedAmputFreeSurvival 0.532258 0.0 1.0 1.0
3 │ Survival 0.887097 0.0 1.0 1.0
4 │ LimbSalvage 0.709091 0.0 1.0 1.0
5 │ AmputFreeSurvival 0.629032 0.0 1.0 1.0 ⋯
6 │ HealedLimbSalvage 0.6 0.0 1.0 1.0
7 │ tcpO2 13.9194 0.0 13.5 33.0
8 │ tcpO2 at 6M 29.65 1 30.0 67
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
41 │ sICAM-1 744.629 256.36 622.97 3220.08 ⋯
42 │ sICAM-3 126.915 0.0 105.89 399.11
43 │ sEselectin 140.433 46.61 128.04 670.85
44 │ sPselectin 803.58 79.24 640.11 4043.14
45 │ sPECAM-1 418.588 219.14 379.9 832.93 ⋯
46 │ VEGF 170.164 15.675 68.239 709.564
47 │ ApplicationRoute 0.516129 0.0 1.0 1.0
1 column and 32 rows omitted
2.1 Imputation
Now, the dataset contains only missings and numerical values of types Float64 and Int64. The next data-processing step is called imputation: assigning a value to every missing. For this, the MLJ.jl package has the FillImputer() utility, which, by default, fills each column with its median or mode, depending on the type of data. This is a very simple approach to imputation, and many alternatives exist (Airbyte 2024).
# Impute missing values by median and mode
machImp = machine(FillImputer(), df) # Attach model to data
MLJ.fit!(machImp, verbosity = 0)
impDF = MLJ.transform(machImp, df) # Perform imputation
describe(impDF)[:, Not(7)] |> println; # Overview
47×6 DataFrame
Row │ variable mean min median max nmissi ⋯
│ Symbol Float64 Real Float64 Real Int64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ PatientID 31.5 1 31.5 62 ⋯
2 │ HealedAmputFreeSurvival 0.532258 0.0 1.0 1.0
3 │ Survival 0.887097 0.0 1.0 1.0
4 │ LimbSalvage 0.741935 0.0 1.0 1.0
5 │ AmputFreeSurvival 0.629032 0.0 1.0 1.0 ⋯
6 │ HealedLimbSalvage 0.645161 0.0 1.0 1.0
7 │ tcpO2 13.9194 0.0 13.5 33.0
8 │ tcpO2 at 6M 29.7742 1 30.0 67
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
41 │ sICAM-1 711.271 256.36 622.97 3220.08 ⋯
42 │ sICAM-3 121.15 0.0 105.89 399.11
43 │ sEselectin 137.435 46.61 128.04 670.85
44 │ sPselectin 758.757 79.24 640.11 4043.14
45 │ sPECAM-1 407.98 219.14 379.9 832.93 ⋯
46 │ VEGF 152.081 15.675 68.239 709.564
47 │ ApplicationRoute 0.516129 0.0 1.0 1.0
1 column and 32 rows omitted
Note that, although the nmissing column now contains only zeros, the means and medians did not change much from the previous step. This is a good sign, indicating that the imputation process didn't have a significant impact on the original data distribution. As an exploration step, the histograms in Figure 1 can now be plotted to get more familiar with the data.
# Activate plotting backend
GLMakie.activate!()
fig = Figure(; fontsize = 18) # Create figure to draw in
# Axis and plot for each histogram
ax = Axis(fig[1, 1]; title = "Age (y)")
hist!(ax, df[:, "Age"])
ax = Axis(fig[1, 2]; title = "Cholesterol (mmol/L)")
hist!(ax, df[:, "Cholesterol"])
ax = Axis(fig[2, 1]; title = "Leu (10⁹/L)")
hist!(ax, df[:, "Leu"])
ax = Axis(fig[2, 2]; title = "Creatinine (μmol/L)")
hist!(ax, df[:, "Creatinine"])
save("./case1_hists.png", fig) # Save PNG
2.2 Coercing data types
Another data-processing step, necessary for MLJ.jl, is managing the columns’ “scientific data types”, such as Continuous, Count, Binary, etc. They are used to clearly define the list of ML models compatible with the inputs and outputs of an application.
In the current case, the inputs are “coerced” into Continuous, and the outputs into Binary (whether or not the patient responded to the treatment). In the following, compatible models are also listed to showcase the diversity of ML models.
# Subset of columns used as inputs
colIDs = vcat([7, 10, 13], 16:32, [35, 38], 41:46)
inputs = impDF[:, colIDs] # 62x28
# Primary endpoint as binary prediction target
outputs = collect(impDF[:, "HealedAmputFreeSurvival"])
# Change scientific type of column "Rutherford"
inputs = coerce(inputs, Symbol("Rutherford") => OrderedFactor)
# Coerce data types of remaining columns
machCont = machine(ContinuousEncoder(; one_hot_ordered_factors = true), inputs)
fit!(machCont, verbosity = 0)
coercedInputs = MLJ.transform(machCont, inputs) # 62x29
outputs = coerce(outputs, Binary) # Coerce target into binary
# List of available models in MLJ.jl
models(matching(coercedInputs, outputs)) .|> println;
(name = AdaBoostClassifier, package_name = MLJScikitLearnInterface, ... )
(name = AdaBoostStumpClassifier, package_name = DecisionTree, ... )
(name = BaggingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = BayesianLDA, package_name = MLJScikitLearnInterface, ... )
(name = BayesianLDA, package_name = MultivariateStats, ... )
(name = BayesianQDA, package_name = MLJScikitLearnInterface, ... )
(name = BayesianSubspaceLDA, package_name = MultivariateStats, ... )
(name = CatBoostClassifier, package_name = CatBoost, ... )
(name = ConstantClassifier, package_name = MLJModels, ... )
(name = DecisionTreeClassifier, package_name = BetaML, ... )
(name = DecisionTreeClassifier, package_name = DecisionTree, ... )
(name = DeterministicConstantClassifier, package_name = MLJModels, ... )
(name = DummyClassifier, package_name = MLJScikitLearnInterface, ... )
(name = EvoTreeClassifier, package_name = EvoTrees, ... )
(name = ExtraTreesClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GaussianNBClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GaussianNBClassifier, package_name = NaiveBayes, ... )
(name = GaussianProcessClassifier, package_name = MLJScikitLearnInterface, ... )
(name = GradientBoostingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = HistGradientBoostingClassifier, package_name = MLJScikitLearnInterface, ... )
(name = KNNClassifier, package_name = NearestNeighborModels, ... )
(name = KNeighborsClassifier, package_name = MLJScikitLearnInterface, ... )
(name = KernelPerceptronClassifier, package_name = BetaML, ... )
(name = LDA, package_name = MultivariateStats, ... )
(name = LGBMClassifier, package_name = LightGBM, ... )
(name = LinearBinaryClassifier, package_name = GLM, ... )
(name = LinearSVC, package_name = LIBSVM, ... )
(name = LogisticCVClassifier, package_name = MLJScikitLearnInterface, ... )
(name = LogisticClassifier, package_name = MLJLinearModels, ... )
(name = LogisticClassifier, package_name = MLJScikitLearnInterface, ... )
(name = MultinomialClassifier, package_name = MLJLinearModels, ... )
(name = NeuralNetworkBinaryClassifier, package_name = MLJFlux, ... )
(name = NeuralNetworkClassifier, package_name = BetaML, ... )
(name = NeuralNetworkClassifier, package_name = MLJFlux, ... )
(name = NuSVC, package_name = LIBSVM, ... )
(name = PassiveAggressiveClassifier, package_name = MLJScikitLearnInterface, ... )
(name = PegasosClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = BetaML, ... )
(name = PerceptronClassifier, package_name = MLJScikitLearnInterface, ... )
(name = ProbabilisticNuSVC, package_name = LIBSVM, ... )
(name = ProbabilisticSGDClassifier, package_name = MLJScikitLearnInterface, ... )
(name = ProbabilisticSVC, package_name = LIBSVM, ... )
(name = RandomForestClassifier, package_name = BetaML, ... )
(name = RandomForestClassifier, package_name = DecisionTree, ... )
(name = RandomForestClassifier, package_name = MLJScikitLearnInterface, ... )
(name = RidgeCVClassifier, package_name = MLJScikitLearnInterface, ... )
(name = RidgeClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SGDClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SRRegressor, package_name = SymbolicRegression, ... )
(name = SVC, package_name = LIBSVM, ... )
(name = SVMClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SVMLinearClassifier, package_name = MLJScikitLearnInterface, ... )
(name = SVMNuClassifier, package_name = MLJScikitLearnInterface, ... )
(name = StableForestClassifier, package_name = SIRUS, ... )
(name = StableRulesClassifier, package_name = SIRUS, ... )
(name = SubspaceLDA, package_name = MultivariateStats, ... )
(name = XGBoostClassifier, package_name = XGBoost, ... )
When it comes to which features to use, colIDs includes almost all of them, excluding most secondary endpoints. Also not included were snapshots after baseline. Snapshots are features that were measured multiple times at different moments: right before treatment (baseline), and 6 and 12 months after treatment, namely tcpO2, ABI (ankle-brachial index), wound size, and the Rutherford, pain, and quality-of-life scales.
Since the context of the ML models developed in this tutorial is to predict whether a new patient suffering from CLI would benefit from the proposed treatment, only the baseline values of snapshot variables are used. However, one can imagine using such models 6 months after treatment, to have a more precise prediction of how the patient's condition will develop in the following 6 months. Including the 6-month snapshots is left as an exercise for the reader.
Moving on, in the previous code, besides coercing, another data-processing step performed is one-hot encoding, enabled by ContinuousEncoder(; one_hot_ordered_factors = true). For a variable that can only assume a finite set of possible discrete values (called “categorical”), this process replaces the variable with multiple binary variables.
For example, in the current case, there are features that include a rating on the Rutherford scale for critical limb ischemia, which measures the severity of the disease in 6 discrete levels. One of these features is called “Rutherford at 6M” (column 11), referring to measurements of this scale at 6 months after treatment.
oneHotDF = impDF[:, [11, 13]] # "Rutherford at 6M" and "Pain" columns (62x2)
oneHotDF = coerce(oneHotDF, Symbol("Rutherford at 6M") => OrderedFactor)
machOneHot = machine(ContinuousEncoder(; one_hot_ordered_factors = true), oneHotDF)
fit!(machOneHot, verbosity = 0)
first(oneHotDF, 10) |> println # Overview before one-hot encoding
println()
oneHotDF = MLJ.transform(machOneHot, oneHotDF) # 62x7 DataFrame
first(oneHotDF, 10) |> println; # Overview after one-hot encoding
10×2 DataFrame
Row │ Rutherford at 6M Pain
│ Categorical… Float64
─────┼───────────────────────────
1 │ 3 0.0
2 │ 5 4.0
3 │ 5 7.0
4 │ 5 5.0
5 │ 2 0.0
6 │ 1 3.0
7 │ 5 6.0
8 │ 3 0.0
9 │ 5 5.0
10 │ 5 6.0
10×7 DataFrame
Row │ Rutherford at 6M__1 Rutherford at 6M__2 Rutherford at 6M__3 Rutherfo ⋯
│ Float64 Float64 Float64 Float64 ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ 0.0 0.0 1.0 ⋯
2 │ 0.0 0.0 0.0
3 │ 0.0 0.0 0.0
4 │ 0.0 0.0 0.0
5 │ 0.0 1.0 0.0 ⋯
6 │ 1.0 0.0 0.0
7 │ 0.0 0.0 0.0
8 │ 0.0 0.0 1.0
9 │ 0.0 0.0 0.0 ⋯
10 │ 0.0 0.0 0.0
4 columns omitted
One-hot encoding turns this categorical column of 6 levels into 6 binary variables: “Rutherford at 6M__i”, for i from 1 to the number of unique values of the categorical variable. Across these new binary variables, each patient has 0 in all of them except one, which is 1 and indicates the value in the original categorical column.
Lastly, it’s worth pointing out that, in the variable coercedInputs, which will be used in this tutorial’s pipeline, the Rutherford measurements at baseline (beginning of study) were used. And, since only patients in critical condition participated in the study, all values are 5 or 6. Therefore, only two new binary variables were created: “Rutherford__5” and “Rutherford__6”.
2.3 Standardization
Afterwards, each column is standardized, which means subtracting the mean and dividing by the standard deviation.
\[ \bar{x_{i}} = \frac{x_{i} - \mathbb{E}[x]}{\sqrt{Var[x]}} \]
As a result, all inputs have zero mean and unit standard deviation. This alleviates some numerical problems that may show up during training. An alternative is quantile normalization (Bolstad et al. 2003), which can be achieved through MLJ’s UnivariateDiscretizer.
# Standardization
machStd = machine(Standardizer(), coercedInputs)
# Columns' mean and std calculated through 'fitting'
MLJ.fit!(machStd, verbosity = 0)
# Use established mean and std to transform data
stdInputs = MLJ.transform(machStd, coercedInputs)
# Check transformation
stdCols = eachcol(stdInputs)
DataFrame(
    Feature = names(stdInputs),
    Min = minimum.(stdCols),
    Mean = mean.(stdCols),
    Max = maximum.(stdCols),
    SD = std.(stdCols),
) |> println;
29×5 DataFrame
Row │ Feature Min Mean Max SD
│ String Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────
1 │ tcpO2 -1.42178 -5.49236e-17 1.94898 1.0
2 │ Rutherford__5.0 -3.03031 7.52087e-17 0.324676 1.0
3 │ Rutherford__6.0 -0.324676 -2.68602e-18 3.03031 1.0
4 │ Pain -2.00892 -6.46884e-17 2.22758 1.0
5 │ Sex -0.381784 5.37205e-18 2.57704 1.0
6 │ Age -2.39604 4.67368e-16 2.27203 1.0
7 │ ArterialHypertension -1.83665 3.58136e-18 0.535689 1.0
8 │ DiabetesMellitus -1.38596 1.14604e-16 0.709883 1.0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
23 │ Wound size base -0.762213 -2.14882e-17 3.62204 1.0
24 │ sICAM-1 -1.12333 4.92438e-17 6.19511 1.0
25 │ sICAM-3 -1.85742 -5.30937e-16 4.26157 1.0
26 │ sEselectin -1.07723 -3.95293e-16 6.32657 1.0
27 │ sPselectin -1.27739 -2.05033e-16 6.17412 1.0
28 │ sPECAM-1 -1.45862 -6.81355e-16 3.28237 1.0
29 │ VEGF -0.770142 1.45045e-16 3.14753 1.0
14 rows omitted
3 Unsupervised learning
In ML, each data point is usually an \((input, label)\) pair. When a model processes the input value, it’s expected that it will output the \(label\) value, or something at least close to it.
This rationale defines supervised ML applications, ones with a label associated with each data point. Common examples of pairs are \((image, category)\), or, for the current problem, \((patient \: data, treatment \: response)\).
On the other hand, the ML subfield of unsupervised learning builds models for data without labels. Its main uses are to gain insights about the data and learn patterns, instead of making predictions.
3.1 Dimensionality reduction
Within unsupervised learning, dimensionality reduction (Jacob Murel 2024) can be used to reduce the number of dimensions of the data. This is useful to (Géron 2019):
- Avoid the curse of dimensionality.
- Compress the information in the data.
- Visualize the dataset.
- Speed up training.
A great example of an unsupervised learning technique is Principal Component Analysis (Jaadi 2024). It projects the data points to a space of lower dimension, searching for orthogonal axes that account for as much of the data’s variance as possible. Figure 2 illustrates the information retention in PCA, as more principal components are included.
matIn = Matrix(stdInputs) # Inputs as matrix
pca = MultivariateStats.fit(MultivariateStats.PCA, matIn'; maxoutdim = 31)
valsP = eigvals(pca) # Eigenvalues
fig = Figure(; size = (800, 500), fontsize = 18)
ax = Axis(
    fig[1, 1],
    xlabel = "Number of principal components",
    ylabel = "Percentage of variance",
    yticks = 0:10:100,
    xticks = 1:2:31,
    ytickformat = "{:d}%",
)
cumulativePercentage = cumsum(valsP ./ sum(valsP)) .* 100
lines!(ax, cumulativePercentage)
save("./eigVar.png", fig)
In the current case, the patient vectors from the previous step can be mathematically interpreted as points in a 29-dimensional space. In the following, PCA is used to project the points to a 3-dimensional space, producing the scatter plot of the data shown in Figure 3.
machPCAviz = machine(PCA(; maxoutdim = 3), stdInputs)
MLJ.fit!(machPCAviz, verbosity = 0) # Compute principal components
# Project data to lower dimension
lowInputsViz = MLJ.transform(machPCAviz, stdInputs);
A limitation of standard PCA is the assumption of linearity between features. This problem is solved in Kernel PCA (Thompson 2018), which first uses a ‘kernel function’ to map the data to a mathematical space of higher dimension, usually recovering the linear separability of the points.
As a side example, the data could be projected from 29 to 10 dimensions using a squared exponential kernel. Further, the mean squared reconstruction error is calculated to be 0.84, indicating that the dimension reduction is representing the data well.
machKPCA = machine(
    KernelPCA(; maxoutdim = 10, kernel = (x, y) -> SqExponentialKernel()(x, y)),
    stdInputs,
)
MLJ.fit!(machKPCA, verbosity = 0)
KPCA_low_inputs = MLJ.transform(machKPCA, stdInputs)
# Reconstruction
reconst = inverse_transform(machKPCA, KPCA_low_inputs) |> DataFrame |> Matrix
# Mean squared reconstruction error
reconstError = (Matrix(stdInputs) .- reconst) .^ 2 |> mean;
However, for the current case, standard PCA leads to a lower mean squared reconstruction error of 0.28. Therefore, the PCA projection will be used for the remainder of the tutorial.
machPCA = machine(PCA(; maxoutdim = 10), stdInputs)
MLJ.fit!(machPCA, verbosity = 0)
lowInputs = MLJ.transform(machPCA, stdInputs)
# Reconstruction
reconstPCA = inverse_transform(machPCA, lowInputs) |> DataFrame |> Matrix
# Mean reconstruction error
PCAreconstError = (Matrix(stdInputs) .- reconstPCA) .^ 2 |> mean;
3.2 K-medoids
It’s important to remember that supervised and unsupervised techniques are not antagonistic. Each has its uses, pros, and cons. In fact, they can be combined. For the rest of this tutorial, the previous 10-dimensional PCA projection (lowInputs) is used as the input to the models.
Now the unsupervised learning technique K-medoids will be used (H 2024). Its goal is clustering: splitting the data according to groupings of the points. The following steps are required:
- The number of clusters, K, is specified.
- K data points are chosen as the centers (medoids) of each cluster.
- The remaining points are assigned to the cluster of the closest center.
- The K medoids are adjusted to minimize the sum of distances from them to all points assigned to their cluster.
K-means is another common clustering algorithm. The main difference to k-medoids is that the centers of clusters don’t need to be data points, being instead defined by the average of all points in a cluster.
machKM = machine(KMedoids(; k = 2), lowInputs)
MLJ.fit!(machKM, verbosity = 0)
# Predict point assignments
KMpredictions = MLJ.levelcode.(MLJ.predict(machKM, lowInputs));
3.3 Tangent: hyperparameters
In K-medoids, K is a hyperparameter: a parameter that the model does not learn during training. Rather, this type of parameter is manually set by the user when setting up the model, before training.
From the study’s context of predicting whether patients will respond to the treatment, here the number K of clusters is set to 2. However, the parameter K in clustering techniques is often not known beforehand.
In fact, a common use of clustering is to determine K itself. This translates to revealing whether the data points are organized in groups, which tend to be unrecognizable by manual data analysis, especially in high-dimensional spaces. A simple heuristic is sketched below.
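As an illustrative sketch (not part of the original analysis), the “elbow” heuristic fits a clustering model for several candidate values of K and inspects the total within-cluster cost. The variable names below are hypothetical, and the 10-dimensional PCA projection lowInputs from above is assumed as input.
# Elbow heuristic sketch: total within-cluster cost for several K values
Xclust = permutedims(Matrix(lowInputs)) # Clustering.jl expects observations as columns
elbowCosts = [kmeans(Xclust, k).totalcost for k in 2:8]
foreach((k, c) -> println("K = $k: total cost = $(round(c; digits = 2))"), 2:8, elbowCosts)
# A cost decrease that flattens out after some K suggests that many natural groups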
To determine hyperparameters more generally, optimization techniques can be applied. For this, the dataset is usually split between training and validation sets, and the optimization tends to follow the general procedure in Figure 4. The Tuning Models page in MLJ.jl’s documentation includes examples of strategies.
4 Supervised learning
One of the most popular supervised learning algorithms is the decision tree. It works by repeatedly splitting the dataset, one feature at a time, by establishing thresholds on the features.
For example, a node may split the patients between those who are younger than 60 years old and those who are not. And, among the younger ones, another node may perform a further split between those with a cholesterol level lower than 4 mmol/L and those without, and so on. Figure 5 illustrates such a decision tree.
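As a minimal, hedged sketch (not part of the main pipeline), a shallow decision tree could be fitted to these two features with MLJ. It assumes the DecisionTree.jl interface package is installed in the environment, and the names DecTree, treeInputs, and machTree are hypothetical.
# Fit a shallow decision tree on the two features from the example above
DecTree = @load DecisionTreeClassifier pkg = DecisionTree verbosity = 0
treeInputs = coerce(inputs[:, [:Age, :Cholesterol]], Count => Continuous)
machTree = machine(DecTree(; max_depth = 2), treeInputs, outputs)
MLJ.fit!(machTree, verbosity = 0)
println("Training accuracy: ", accuracy(MLJ.predict_mode(machTree, treeInputs), outputs))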
And, simplifying the problem to only these 2 features, Figure 6 contains a scatter plot showing the splits of the data and the resulting classifications of patients. The splitting continues until the number of samples to which the node refers is too small, or the samples are concentrated in a specific value or interval of the corresponding feature.
GLMakie.activate!() # Plotting backend
# Indices of patients in each category
lowYoung = inputs.Age .< 60 .&& inputs.Cholesterol .< 4
highYoung = inputs.Age .< 60 .&& inputs.Cholesterol .>= 4
older = inputs.Age .>= 60
fig = Figure(; fontsize = 18) # Figure to draw in
ax = Axis(fig[1, 1])
vlines!(ax, 60; color = :brown) # Age split
age = inputs[:, :Age]
# Function for relative position in x axis
ageRel(x) = (x - minimum(age)) / (maximum(age) - minimum(age))
# Cholesterol split
hlines!(ax, 4; xmin = ageRel(38), xmax = ageRel(60.4), color = :blue)
# Scatter plots
plot1 = scatter!( # Non-responders on 1st level
    ax,
    inputs[older, "Age"],
    inputs[older, "Cholesterol"];
    color = :green,
    marker = :diamond,
    markersize = 13,
)
plot2 = scatter!( # Non-responders on 2nd level
    ax,
    inputs[highYoung, "Age"],
    inputs[highYoung, "Cholesterol"];
    color = :purple,
    marker = :xcross,
    markersize = 13,
)
plot3 = scatter!( # Responders
    ax,
    inputs[lowYoung, "Age"],
    inputs[lowYoung, "Cholesterol"];
    color = :red,
    marker = :pentagon,
    markersize = 13,
)
Legend( # Labels for splits
    fig[1, 2],
    [plot1, plot2, plot3],
    ["NR 1st level", "NR 2nd level", "Responders"],
)
# Save PNG
save("./decisionTreeScatter.png", fig)
Decision trees are very popular in ML, since they are interpretable, versatile, and scalable. However, they can be too sensitive to small or superficial changes in the dataset. To overcome this, they can be used in ensembles, that combine multiple simpler models into a single more robust learner (Brownlee 2024).
For decision trees, common ensembles are random forests. They contain multiple trees, each one trained on a random subset of the data. Each tree also uses only a random subset of the features, to avoid building many similar trees in case a few features have high predictive power. Usually, random forests use a voting system among trees for classification, and average the trees’ outputs for regression.
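As a hedged side example (not in the original study code), a random forest could be evaluated on the projected inputs with the same CV-based evaluate call used elsewhere in this tutorial. It assumes the DecisionTree.jl interface package is available, and Forest and forestModel are hypothetical names.
# Random forest evaluated with 5-fold cross-validation
Forest = @load RandomForestClassifier pkg = DecisionTree verbosity = 0
forestModel = Forest(; n_trees = 100, max_depth = 4)
MLJ.evaluate(forestModel, lowInputs, outputs;
    resampling = CV(; nfolds = 5), measure = Accuracy(), verbosity = 0)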
4.1 Gradient boosted trees
Another type of ensemble is called boosting (Winston 2014): training the weak learners sequentially, each one focusing on cases that posed a challenge to the previous model.
In gradient boosting (Starmer 2019) specifically, the first model \(M_1\) is trained to map features \(X\) to labels \(Y\), so \(M_1: X \rightarrow Y\) as usual. Then, the second model \(M_2\) is instead trained to predict the residuals of the first one (\(r = Y - M_1(X)\)), so \(M_{2}: X \rightarrow r\). A third model does the same with respect to the residuals of the first two combined, and so on. Lastly, the final prediction made by the ensemble is the sum of all these intermediate outputs.
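The toy sketch below illustrates this residual-stacking idea on a made-up regression target. It is purely conceptual: EvoTrees fits decision trees to gradients of a loss, while here the “weak learner” simply predicts the mean residual, and the function name toyboost is hypothetical.
# Conceptual sketch of boosting on residuals (illustrative only)
function toyboost(y; nrounds = 5, lr = 0.5)
    pred = zeros(length(y))   # Running ensemble prediction
    for _ in 1:nrounds
        r = y .- pred         # Residuals of the current ensemble
        pred .+= lr * mean(r) # Stack the weak learner's correction
        println("mean absolute residual: ", round(mean(abs.(r)); digits = 3))
    end
    return pred
end
toyboost([2.0, 4.0, 6.0, 8.0]);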
Such a procedure equates to stacking consecutive corrections onto the initial prediction. This ensemble is used below as the base model in MLJ.jl’s TunedModel utility, which provides workflows for hyperparameter optimization.
GBoostModel = EvoTreeClassifier() # Base model
# Target hyperparameters and their ranges
paramRanges = [
    range(GBoostModel, :max_depth, values = 3:6),
    range(GBoostModel, :nrounds, values = 200:10:300),
]
# Model wrapper for optimization
selfTuningGBoost = TunedModel(
    model = GBoostModel,
    resampling = CV(),
    tuning = Grid(),
    range = paramRanges,
    measure = Accuracy(),
    train_best = false,
)
machGBoost = machine(selfTuningGBoost, lowInputs, outputs)
MLJ.fit!(machGBoost, verbosity = 0)
# Evaluate model with best hyperparameters
bestGBoost = fitted_params(machGBoost).best_model
MLJ.evaluate(bestGBoost, lowInputs, outputs; measure = Accuracy(), verbosity = 0)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌────────────┬──────────────┬─────────────┐
│ measure │ operation │ measurement │
├────────────┼──────────────┼─────────────┤
│ Accuracy() │ predict_mode │ 0.742 │
└────────────┴──────────────┴─────────────┘
┌────────────────────────────────────┬─────────┐
│ per_fold │ 1.96*SE │
├────────────────────────────────────┼─────────┤
│ [0.727, 0.727, 0.6, 0.9, 0.7, 0.8] │ 0.0883 │
└────────────────────────────────────┴─────────┘
In the previous code block, evaluate returns a PerformanceEvaluation object, which produced the output shown. The ‘per_fold’ field contains the value of the measure in the validation split of each data fold. The measure used was accuracy: the rate of correct predictions. The ‘1.96*SE’ field is 1.96 times the standard error of the per-fold values, which can be useful for defining confidence intervals on the validation performance and for comparing models. The ‘measurement’ field is the average of the ‘per_fold’ values, weighted by the validation split sizes. Lastly, it’s worth noting the variation that data splitting causes in the validation accuracy.
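As a rough sketch, a fold-level uncertainty can be recomputed from the per-fold accuracies shown above. Note that MLJ’s internal ‘1.96*SE’ calculation may differ slightly (for example, in how folds or observations are weighted), so this is only an approximation.
# Approximate aggregate accuracy and 1.96*SE from the per-fold values above
perFold = [0.727, 0.727, 0.6, 0.9, 0.7, 0.8]
se = std(perFold) / sqrt(length(perFold))
println(round(mean(perFold); digits = 3), " ± ", round(1.96 * se; digits = 3))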
Going back to the TunedModel definition, another setting is the resampling strategy, resampling = CV(). Back in Section 3.3, the reason for splitting the data between training and validation was to avoid overfitting (Google 2024): adjusting the model too tightly to the data distribution.
It’s important to avoid such a phenomenon because the data includes a (hopefully not dominant) random component, which is not informative of the underlying process the data represents. And, if the model gets to the point of ‘memorizing’ the training data, it won’t generalize well, resulting in a considerable performance drop when facing new cases.
Hence, it’s standard practice in ML to split the data between training and validation (and sometimes also testing), so one tracks not only the performance improvement during training, but also the validation behavior. There should be an initial phase during which both training and validation metrics improve. Afterwards, however, validation performance starts dropping, indicating overfitting. Figure 7 shows the performance trends in the training and validation splits for gradient-boosted ensembles of different sizes.
# Split data
trainIndices, valIndices = partition(shuffle(MersenneTwister(30), 1:nrow(lowInputs)), 0.8)
# Range for number of trees in ensembles
r = 1:3:500
# Store accuracies in both splits for each ensemble size
accs = Dict("train" => zeros(length(r)), "val" => zeros(length(r)))
for (i, nrounds) in enumerate(r)
    # Instantiate and fit current ensemble
    overModel = EvoTreeClassifier(; nrounds)
    overM = machine(overModel, lowInputs[trainIndices, :], outputs[trainIndices])
    MLJ.fit!(overM; verbosity = 0)
    cmOver = MLJ.ConfusionMatrix(; levels = [1, 2])
    # Accuracy of fitted model in training split
    predictionsOver = MLJ.predict_mode(overM, lowInputs[trainIndices, :])
    cmOver = cmOver(MLJ.levelcode.(predictionsOver), MLJ.levelcode.(outputs[trainIndices]))
    accs["train"][i] = Accuracy()(cmOver)
    # Accuracy of fitted model in validation split
    cmOver = MLJ.ConfusionMatrix(; levels = [1, 2])
    predictionsOver = MLJ.predict_mode(overM, lowInputs[valIndices, :])
    cmOver = cmOver(MLJ.levelcode.(predictionsOver), MLJ.levelcode.(outputs[valIndices]))
    accs["val"][i] = Accuracy()(cmOver)
end
# Plotting
fig = Figure(; fontsize = 18)
ax = Axis(fig[1, 1], xlabel = "Number of trees", ylabel = "Accuracy")
lTrain = lines!(r, accs["train"])
lVal = lines!(r, accs["val"])
Legend( # Labels for splits
    fig[1, 2],
    [lTrain, lVal],
    ["Training", "Validation"],
)
save("./overfit.png", fig)
The y axis indicates the accuracy: the proportion of predictions that are correct. The higher this value, the better, of course. And the x axis indicates the number of trees included in the ensemble. Increasing this number expands the model’s capacity (Ian Goodfellow 2016): the diversity of mappings it can represent. If it’s low, the model underfits the training data and doesn’t perform as well as it could. If it’s too high, the model has enough flexibility to fit even the noise in the training set. Here, overfitting is identified by the validation performance reaching a peak, and then diminishing again.
Further, aside from manually splitting the dataset (“holdout validation”), a common strategy in the context of hyperparameter search is cross-validation (CV), which Figure 8 illustrates:
- Split the data in \(k\) ‘folds’ of equal size.
- Iterate on the \(k\) folds:
- Assign the fold for validation.
- Set the model hyperparameters.
- Create a new instance of the model.
- Train the current model instance on the training folds.
- Validate the current model instance on the validation fold.
- Log the performance.
- Use a strategy to choose the hyperparameters according to the validation errors.
K-fold cross-validation is sometimes also used without changing hyperparameters between iterations. That’s useful to get a broader perspective on the model’s performance, averaging out the impact of which data points are used in training. This is done in MLJ’s evaluate() function.
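For illustration, the loop below is a hand-rolled sketch of 5-fold CV for a fixed model, mirroring what evaluate() with CV() automates. The fold construction and variable names are hypothetical, not the splits MLJ produced above.
# Hand-rolled 5-fold cross-validation sketch (illustrative only)
rows = shuffle(MersenneTwister(1), 1:nrow(lowInputs))
folds = collect(Iterators.partition(rows, ceil(Int, length(rows) / 5)))
foldAccs = map(1:5) do i
    valRows = folds[i]                 # Current validation fold
    trainRows = setdiff(rows, valRows) # Remaining folds for training
    m = machine(EvoTreeClassifier(), lowInputs[trainRows, :], outputs[trainRows])
    MLJ.fit!(m, verbosity = 0)
    accuracy(MLJ.predict_mode(m, lowInputs[valRows, :]), outputs[valRows])
end
println("Per-fold accuracy: ", round.(foldAccs; digits = 3))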
Furthermore, regarding the number of folds K (Kohavi 2001), values from 5 to 10 are common. Higher values increase the computational cost of the procedure, but reduce the bias when estimating the model’s performance. Lower values will use less data for training, so small models may be adequate, but large models may overfit.
Going back to the gradient-boosted TunedModel, after resampling = CV(), there is also the tuning = Grid() setting, specifying the strategy used to update the hyperparameters across iterations. Grid is the simplest option in MLJ.jl, simply testing a certain number of equally-spaced values. This strategy quickly becomes underwhelming in demanding applications, since it’s mostly a trial-and-error exploration of a few possibilities instead of an efficient search algorithm (Run:ai 2024).
Those values are specified by the next setting, range = paramRanges. It specifies the parameters to be optimized, and what range of values should be considered. Here, the max_depth parameter was optimized in the range 3:6, controlling the maximum depth allowed for the decision trees that underlie the gradient-boosting ensemble. Additionally, the parameter nrounds was optimized in the range 200:10:300, determining the number of trees.
After fitting the TunedModel, a model with the optimal hyperparameters can be extracted and used. Below, such a model is fitted to the training split of manually partitioned data. Later on, it will be used to make predictions on the validation set.
# Manual data split
trainIndices, valIndices = partition(1:nrow(lowInputs), 0.8)
# Bind model to training split
machBest = machine(bestGBoost, lowInputs[trainIndices, :], outputs[trainIndices])
MLJ.fit!(machBest, verbosity = 0);
4.2 Model evaluation
In ML, there are usually numerous options to perform any step in the project, from data preparation to performance evaluation. As an example, there are various metrics for the current case of binary classification.
Despite this, the confusion matrix provides the basis for a handful of useful options. After making predictions with the model, it arranges in a matrix how many cases fall into each of the following categories:
- True positive (TP): a positive case (responder) that was correctly classified as positive.
- True negative (TN): a negative case (non-responder) that was correctly classified as negative.
- False positive (FP, type I error): a negative case that was misclassified as positive.
- False negative (FN, type II error): a positive case that was misclassified as negative.
                Ground truth
 Predicted |    0    |    1
     0     |    4    |    2
     1     |    1    |    5
From these basic definitions, the following binary classification metrics can be calculated to prioritize different aspects of performance:
- Precision, or positive predictive value (PPV): the fraction of positive classifications that are correct.
\[\frac{TP}{TP + FP}\]
- Recall, sensitivity or true positive rate: ratio of positive cases that are correctly classified.
\[\frac{TP}{TP + FN}\]
- F1 score: harmonic mean between precision and recall. It is only high if both previous metrics are high.
\[2 * \frac{(Precision * Recall)}{Precision + Recall}\]
- Accuracy: fraction of classifications that are correct.
\[\frac{TP + TN}{TP + TN + FP + FN}\]
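As a hypothetical worked example (the counts below are illustrative, not taken from a fitted model): with \(TP = 6\), \(TN = 3\), \(FP = 2\), and \(FN = 1\) over 12 cases,
\[Precision = \frac{6}{6 + 2} = 0.75, \quad Recall = \frac{6}{6 + 1} \approx 0.86, \quad F1 \approx 0.80, \quad Accuracy = \frac{6 + 3}{12} = 0.75\]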
predictionsGBoost = predict_mode(machBest, lowInputs[valIndices, :])
# Confusion matrix [TN FN; FP TP]
cm = MLJ.ConfusionMatrix(; levels = [1, 2])
cmGBoost = cm(MLJ.levelcode.(predictionsGBoost), MLJ.levelcode.(outputs[valIndices]))
# Metrics
prec = ppv(cmGBoost)
rec = recall(cmGBoost)
f = FScore()(cmGBoost)
acc = Accuracy()(cmGBoost);
The following table presents the validation metrics associated with the previous model:
 Precision | 0.75
 Recall    | 0.86
 F1        | 0.80
 Accuracy  | 0.75
4.3 K-nearest neighbors pipeline
The next model to be used is K-nearest neighbors (elastic 2024). It is a supervised approach that compares the input against the K nearest data points. In regression problems, it then averages the labels of these neighbors; in classification, it uses the labels of these nearby points in a voting system to determine the output. Here, the parameter K defines how many neighboring points are considered, and its default is 5.
Such a model will be included in an MLJ pipeline, a functionality that enables very concise definitions of entire ML workflows. The code below recreates the procedures done so far, but now in a pipeline. First, functions are defined for the coercing and one-hot encoding steps. Then, these steps are included in a pipeline alongside imputation, standardization, PCA, and an instance of a K-nearest neighbors model.
# One-hot encode "Rutherford" and coerce other columns to 'continuous'
ordFacCoerce = inp -> coerce(inp, Symbol("Rutherford") => OrderedFactor)
contCoerce = ContinuousEncoder(; one_hot_ordered_factors = true)
# Pipeline model
modelPipe =
    FillImputer |>
    ordFacCoerce |>
    contCoerce |>
    Standardizer |>
    PCA(; maxoutdim = 10) |>
    knn(; K = 8)
machPipe = machine(modelPipe, df[:, colIDs], outputs)
fit!(machPipe, verbosity = 0)
# Shuffle rows to avoid one-hot encode problem with
# category imbalance in baseline "Rutherford".
shuffleRows = shuffle(MersenneTwister(9), 1:nrow(df))
MLJ.evaluate(
    modelPipe,
    df[shuffleRows, colIDs],
    outputs[shuffleRows];
    measure = [Accuracy(), cross_entropy],
    resampling = CV(; nfolds = 5),
    verbosity = 0,
)
PerformanceEvaluation object with these fields:
model, measure, operation,
measurement, per_fold, per_observation,
fitted_params_per_fold, report_per_fold,
train_test_rows, resampling, repeats
Extract:
┌───┬──────────────────────┬──────────────┬─────────────┐
│ │ measure │ operation │ measurement │
├───┼──────────────────────┼──────────────┼─────────────┤
│ A │ Accuracy() │ predict_mode │ 0.645 │
│ B │ LogLoss( │ predict │ 0.609 │
│ │ tol = 2.22045e-16) │ │ │
└───┴──────────────────────┴──────────────┴─────────────┘
┌───┬─────────────────────────────────────┬─────────┐
│ │ per_fold │ 1.96*SE │
├───┼─────────────────────────────────────┼─────────┤
│ A │ [0.846, 0.385, 0.917, 0.75, 0.333] │ 0.264 │
│ B │ [0.528, 0.875, 0.424, 0.456, 0.746] │ 0.192 │
└───┴─────────────────────────────────────┴─────────┘
The A |> B operator in Julia is called piping: it uses the output of expression A as the input to expression B, clearly indicating information flow. MLJ.jl also overloads this operator for models, so chaining models with |> composes them into a pipeline.
In the evaluate call in the last code block, the metric cross_entropy was provided as well. It refers to binary cross-entropy (Godoy 2018), a measure of dissimilarity between two distributions. After fitting, the outputs of the model should follow the same distribution as the original data. Therefore, binary cross-entropy can be used as an objective to be minimized when training binary classification models, leading to the same results as maximizing the log-likelihood (Draelos 2019). In the following formula, \(N\) is the number of binary variables being compared; and variable \(i\) assumes the value \(y_i\) with probability \(p(y_i)\).
\[CE = -\frac{1}{N} \sum^{N}_{i = 1} \left( y_i \cdot \log[p(y_i)] + (1 - y_i) \cdot \log[1 - p(y_i)] \right)\]
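A two-line sketch makes the behavior of this loss concrete for a single prediction (the values are chosen arbitrarily):
# Binary cross-entropy for one case: small when confident and correct, large when confident and wrong
bce(y, p) = -(y * log(p) + (1 - y) * log(1 - p))
println(bce(1, 0.9)) # ≈ 0.105
println(bce(1, 0.1)) # ≈ 2.303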
The model’s “machine” can then be saved to disk with MLJ.save("pipeModel.jld2", machPipe). It can be loaded later, for example when starting a new session, to continue the project, to build a model catalogue, etc. Combined with pipelines, in case a new file with additional data arrives, the model can be loaded with machLoaded = machine("./pipeModel.jld2") and used to quickly reapply the entire workflow discussed so far with pipePredictions = MLJ.predict(machLoaded, df[:, colIDs]), directly on raw data that still has missing values and isn’t preprocessed.
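Putting those calls together (exactly the calls quoted above):
# Save the fitted machine, reload it, and predict directly on the raw data
MLJ.save("pipeModel.jld2", machPipe)
machLoaded = machine("./pipeModel.jld2")
pipePredictions = MLJ.predict(machLoaded, df[:, colIDs]);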
Furthermore, with this model, the output for each case is a probability distribution across the two possibilities; here, this refers to the chance of the patient responding to the treatment. Therefore, to assess the model’s performance, the area under the receiver operating characteristic curve (AUC) can be applied.
In short, when the model predicts a certain chance that a patient will respond to the treatment, a threshold (60%, 70%, etc.) needs to be established, above which the model is interpreted as considering the patient a responder. However, the threshold level impacts the metrics in the confusion matrix (Section 4.2), so the best value for a given context must be chosen.
The receiver operating characteristic (ROC) curve is commonly used for this. It plots the true positive rate (TPR, i.e. recall or sensitivity) on the y axis, and the false positive rate (FPR) on the x axis. For each threshold level, the model’s predictions result in an (FPR, TPR) pair, representing a point in the plot and generating a curve.
A model that randomly guesses the outcome (equivalent to a coin flip) is graphically represented by a 45° line. Ideally, the curve that results from the model isn’t only above such a straight line, but also approaches the top-left corner, of unitary TPR and null FPR. This can be measured through the area under the curve (AUC), which would then approach 1.
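In MLJ, this area can also be computed directly from the probabilistic predictions; a one-line sketch, assuming pipePredictions as defined above:
# Area under the ROC curve from the pipeline's probabilistic predictions
println("AUC = ", auc(pipePredictions, outputs))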
Finally, for the previous pipeline model, the ROC is included in Figure 9. The AUC is 0.83, and the threshold levels used were [0.2, 0.4, 0.6, 0.8, 1.0].
fig = Figure(; fontsize = 18) # Create figure to draw in
axROC = Axis(
    fig[1, 1];
    title = "Receiver operating characteristic (ROC) curve",
    xlabel = "False positive rate",
    ylabel = "True positive rate",
    xticks = 0:0.2:1,
    yticks = 0:0.2:1,
)
# MLJ function for ROC plot data
rocD, tpr, thresholds = roc_curve(pipePredictions, outputs)
l45 = lines!(axROC, [0, 1], [0, 1]; linestyle = :dash) # 45° line
lModel = lines!(axROC, rocD, tpr) # Model
Legend(fig[1, 2], [l45, lModel], ["Random guess", "kNN"])
save("./roc.png", fig) # Save PNG