Dimensionality Reduction
Clustering
Input
Center data
For each feature \(j = 1, \ldots, d\):
Compute mean: \(\mu_j = \text{mean}(X[:, j])\)
Center feature: \(\tilde{X}[:, j] = X[:, j] - \mu_j\)
Compute covariance matrix
\[ \Sigma = \frac{1}{n - 1} \tilde{X}^T \tilde{X} \quad \text{(a } d \times d \text{ matrix)} \]
Find eigenvectors and eigenvalues
\[ (\Sigma - \lambda I) \cdot v = 0 \]
Select components
Project data \[ Z = \overbrace{\tilde{X}}^{n \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \quad \text{(an } n \times k \text{ matrix)} \]
Output
\[ Z \in \mathbb{R}^{n \times k} \]
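The pipeline above (center the data, form the covariance matrix, eigendecompose, project) can be sketched with only Julia's standard library. This is a minimal sketch: the data, the sizes `n`, `d`, `k`, and the seed are made up for illustration.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n, d, k = 100, 5, 2
X = randn(n, d)                    # synthetic data, n × d

# Center each feature
μ = mean(X, dims = 1)              # 1 × d matrix of feature means
Xc = X .- μ                        # centered data (X̃ in the notation above)

# Covariance matrix, d × d
Σ = (Xc' * Xc) / (n - 1)

# Eigendecomposition; sort eigenvalues in descending order
F = eigen(Symmetric(Σ))
order = sortperm(F.values, rev = true)
λ, V = F.values[order], F.vectors[:, order]

# Project onto the top k principal components
Z = Xc * V[:, 1:k]                 # n × k
```

Note that `eigen` returns eigenvalues in ascending order, hence the explicit re-sort so that `V[:, 1]` is the direction of largest variance.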
Bonus Exercise
Prove that the variance of the projected data along an eigenvector \(v\) is equal to the corresponding eigenvalue \(\lambda\).
Project new point \(x\)
\[ \overbrace{z}^{1 \times k} = \overbrace{(x - \mu)}^{1 \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \]
Lossy reconstruction
\[ \overbrace{x_{\text{reconstructed}}}^{1 \times d} = \overbrace{z}^{1 \times k} \cdot \overbrace{V[:, 1:k]^T}^{k \times d} + \overbrace{\mu}^{1 \times d} \]
Reconstruction error
\[ \|x - x_{\text{reconstructed}}\| \]
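The projection and lossy reconstruction of a new point can be checked numerically. A minimal self-contained sketch (synthetic data and seed are made up; `Xc` is the centered training data):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n, d, k = 100, 5, 2
X = randn(n, d)
μ = mean(X, dims = 1)
Xc = X .- μ                          # centered data
F = eigen(Symmetric((Xc' * Xc) / (n - 1)))
V = F.vectors[:, sortperm(F.values, rev = true)]

# Project a new point, then reconstruct it (lossily) from z
x = randn(1, d)                      # 1 × d
z = (x .- μ) * V[:, 1:k]             # 1 × k
x_rec = z * V[:, 1:k]' .+ μ          # 1 × d
err = norm(x - x_rec)                # reconstruction error

# Using all d components makes the reconstruction exact
x_exact = ((x .- μ) * V) * V' .+ μ
```

With `k < d` the error is generally positive; with `k = d` the orthogonal matrix `V` makes the round trip exact.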
PCA caveats
```julia
using MLJ, DataFrames, Random, MLJMultivariateStatsInterface

Random.seed!(123)

# Create sample data
X = DataFrame(randn(100, 5), :auto)

# Load and fit PCA
PCA = @load PCA pkg = MultivariateStats
pca = machine(PCA(maxoutdim = 3), X)
fit!(pca)

# Transform data and get variance explained
Z = MLJ.transform(pca, X)
var_ratio = report(pca).principalvars ./ sum(report(pca).principalvars)
println("Variance explained: ", round.(var_ratio, digits = 3))
println("\nFirst 5 rows of transformed data:")
println(first(Z, 5))
```

PCA is a greedy algorithm: it extracts one direction of maximum variance at a time.
Finding the direction \(v\) of maximum variance can be formulated as an optimization problem. \[ \max_{||v|| = 1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1} v^T \Sigma v \]
This is a non-convex optimization problem, but it has an analytic solution!
The solution to this optimization problem is exactly the eigenvector \(v_1\) corresponding to the largest eigenvalue \(\lambda_1\) of \(\Sigma\).
The eigenvector \(v_1\) is unique up to sign when \(\lambda_1\) is a simple eigenvalue, but it is non-unique when several directions attain the same maximal variance (a repeated eigenvalue).
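This claim can be sanity-checked numerically: the Rayleigh quotient \(v^T \Sigma v\) of the top eigenvector equals \(\lambda_1\), and no random unit vector exceeds it. A sketch on made-up data (seed, sizes, and the helper name `rayleigh` are illustrative):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(2)
X = randn(200, 4)
Xc = X .- mean(X, dims = 1)
Σ = (Xc' * Xc) / (size(X, 1) - 1)

# Top eigenpair of Σ
F = eigen(Symmetric(Σ))
λ1 = maximum(F.values)
v1 = F.vectors[:, argmax(F.values)]

# Variance of the projection along a unit vector v
rayleigh(v) = v' * Σ * v

var_v1 = rayleigh(v1)                                        # attains λ1
best_random = maximum(rayleigh(normalize(randn(4))) for _ in 1:1000)
```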
Bonus Exercise
A distance or dissimilarity measure is a function that quantifies how different two data points are.
Distance function \(d\) should satisfy the following properties:
- Non-negativity: \(d(x, y) \ge 0\)
- Identity: \(d(x, y) = 0\) if and only if \(x = y\)
- Symmetry: \(d(x, y) = d(y, x)\)
- Triangle inequality: \(d(x, z) \le d(x, y) + d(y, z)\)
Dissimilarity measures may not satisfy all these properties.
| Distance/Dissimilarity Measure | Suitable for |
|---|---|
| Euclidean distance | Continuous data |
| Manhattan distance | Continuous data |
| Cosine distance | Continuous data |
| Earth mover’s distance (EMD) | Point clouds and distributions |
| Kullback-Leibler (KL) divergence | Distributions |
| Dynamic time warping (DTW) | Time series |
| Embedding-based distances | Text, images, graphs |
| Graph edit distance | Graphs |
Instead of doing the inner product of the input data points \(x_i^T x_j\) in the Gram matrix, one can use any positive semi-definite kernel function \(k(x_i, x_j)\) to measure similarity.
The kernel function implicitly maps the input data points to a high-dimensional space, where the inner product is computed. Examples:
- Linear kernel: \(k(x_i, x_j) = x_i^T x_j\)
- Polynomial kernel: \(k(x_i, x_j) = (x_i^T x_j + c)^p\)
- RBF (Gaussian) kernel: \(k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)\)
There are many other kernel functions.
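As a sketch of the idea, the plain Gram matrix of inner products can be replaced by a kernel matrix; any valid kernel yields a symmetric positive semi-definite matrix. The data, seed, `γ`, and the helper name `rbf` below are made up for illustration (the RBF kernel is one standard choice):

```julia
using LinearAlgebra, Random

Random.seed!(3)
X = randn(50, 3)                     # synthetic data, one point per row

# RBF (Gaussian) kernel: k(x, y) = exp(-γ ‖x − y‖²)
γ = 0.5
rbf(x, y) = exp(-γ * sum(abs2, x - y))

# Kernel (Gram) matrix replacing the plain inner products xᵢᵀxⱼ
K = [rbf(X[i, :], X[j, :]) for i in 1:50, j in 1:50]

# A valid kernel matrix is symmetric positive semi-definite
λmin = minimum(eigvals(Symmetric(K)))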
Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster analysis (or clustering, data segmentation, …)
Unsupervised learning: no predefined classes
Applications: exploratory data analysis, outlier detection, image segmentation, detailed model diagnostics stratifying by the cluster, generating pseudo-labels for self-supervised learning
Partitioning criteria
Separation of clusters
Dissimilarity measure
Clustering space
We want to classify patients based on their tumor sizes to identify different risk categories.
The dataset contains tumor size measurements (in cm) for a group of patients.
Suppose we decide to group patients into three clusters (k = 3).
| Patient | Tumor size (cm) |
|---|---|
| 1 | 0.45 |
| 2 | 0.70 |
| 3 | 1.00 |
| 4 | 1.38 |
| 5 | 2.14 |
| 6 | 2.50 |
| 7 | 3.00 |
| 8 | 3.50 |
| 9 | 4.00 |
| 10 | 4.50 |
| 11 | 5.00 |
Visually, we can easily identify the clusters, as the data is simple in this case.
The process stops when a convergence criterion is met (e.g., no change in cluster assignments or means).
Initialize the cluster means \(m_j\) randomly for \(j = 1, \ldots, k\).
Repeat until convergence:
- Assignment step: assign each point \(x_i\) to the cluster whose mean \(m_j\) is nearest.
- Update step: recompute each mean \(m_j\) as the average of the points currently assigned to cluster \(j\).

Return the final cluster means and assignments.
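The algorithm above can be sketched in a few lines on the 1-D tumor-size data from the table. This is a minimal illustration, not a production implementation; the function name `kmeans1d` and the seed are made up, and the result depends on the random initialization.

```julia
using Statistics, Random

# Tumor sizes (cm) from the table above
sizes = [0.45, 0.70, 1.00, 1.38, 2.14, 2.50, 3.00, 3.50, 4.00, 4.50, 5.00]

function kmeans1d(x, k; maxiter = 100, rng = MersenneTwister(42))
    # Initialize means with k distinct random data points
    m = sort(x[randperm(rng, length(x))[1:k]])
    labels = zeros(Int, length(x))
    for _ in 1:maxiter
        # Assignment step: nearest mean
        new_labels = [argmin(abs.(xi .- m)) for xi in x]
        new_labels == labels && break        # converged: assignments unchanged
        labels = new_labels
        # Update step: mean of the assigned points
        for j in 1:k
            members = x[labels .== j]
            isempty(members) || (m[j] = mean(members))
        end
    end
    return m, labels
end

means, labels = kmeans1d(sizes, 3)
```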
One stopping criterion is when the within-cluster sum of squares (WCSS) stops decreasing significantly: \[ \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||^2 \]
This criterion tries to make the resulting \(k\) clusters as compact and as separate as possible.
Bonus Exercise
Prove the above formula.
Cohesion: Measures how closely related the objects in a cluster are. High cohesion means that the points within a cluster are close to each other. Common metrics:
- Within-cluster sum of squares (WCSS)
- Average pairwise distance within a cluster
Bonus Exercise
Separation: Measures how distinct or well-separated a cluster is from other clusters. High separation means that the clusters are far apart from each other. Common metrics:
- Between-cluster sum of squares (BCSS)
- Distances between cluster centroids
Bonus Exercise
Prove the above equivalence.
\[ \text{Total Variance} = \text{WCSS} + \text{BCSS} \]
Bonus Exercise
Prove the above result.
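The decomposition can also be checked numerically. A minimal sketch using only the standard library; the data and the cluster assignment below are made up for illustration:

```julia
using Statistics, Random

Random.seed!(4)
x = randn(30)                                # synthetic 1-D data
labels = rand(1:3, 30)                       # arbitrary assignment into k = 3 clusters

xbar = mean(x)                               # overall mean
total = sum((xi - xbar)^2 for xi in x)       # total sum of squares

clusters = [x[labels .== j] for j in 1:3]
# Within-cluster: squared distances to each cluster mean
wcss = sum(sum((xi - mean(c))^2 for xi in c) for c in clusters if !isempty(c))
# Between-cluster: cluster sizes times squared distances of means to the overall mean
bcss = sum(length(c) * (mean(c) - xbar)^2 for c in clusters if !isempty(c))
```

Up to floating-point error, `total == wcss + bcss` regardless of how the points are assigned.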
Interpret the plot: look for the "elbow", the value of \(k\) beyond which WCSS decreases only marginally.
Advantages:
- Simple and fast to compute; easy to explain.

Limitations:
- The elbow is often ambiguous, and WCSS always decreases as \(k\) grows, so the choice remains heuristic.
The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points \(x_i\).
For the \(i^\text{th}\) point, let
- \(a_i\) = the average distance from \(x_i\) to the other points in its own cluster (cohesion),
- \(b_i\) = the smallest average distance from \(x_i\) to the points of any other cluster (separation).
The silhouette coefficient for a point \(x_i\) is: \[ s_i = \frac{(b_i - a_i)}{\max(a_i, b_i)} \]
Intuitively, it is a measure of how much closer \(x_i\) is to its own cluster compared to the nearest other cluster.
The value of \(s_i\) lies between -1 and 1; the closer it is to 1, the better \(x_i\) fits its assigned cluster.
Choose \(k\) with the highest average silhouette score.
Advantage:
- Combines cohesion and separation in a single score, and the average silhouette can be compared across different values of \(k\).

Limitation:
- Requires all pairwise distances, which is \(O(n^2)\) and expensive for large datasets.
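The silhouette computation can be sketched directly from the definitions of \(a_i\) and \(b_i\). A minimal self-contained example on two well-separated made-up 1-D clusters (data, seed, and the function name `silhouette` are illustrative):

```julia
using Statistics, Random

Random.seed!(5)
x = vcat(randn(20), randn(20) .+ 6.0)        # two well-separated groups
labels = vcat(fill(1, 20), fill(2, 20))

function silhouette(x, labels, i)
    own = labels[i]
    # a_i: average distance to the other points in the same cluster
    a = mean(abs(x[i] - x[j]) for j in eachindex(x) if labels[j] == own && j != i)
    # b_i: smallest average distance to the points of any other cluster
    b = minimum(mean(abs(x[i] - x[j]) for j in eachindex(x) if labels[j] == c)
                for c in unique(labels) if c != own)
    return (b - a) / max(a, b)
end

scores = [silhouette(x, labels, i) for i in eachindex(x)]
avg_score = mean(scores)
```

Because the two groups are far apart relative to their spread, the average silhouette is high here; with overlapping clusters it would drop toward 0.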