Unsupervised Learning

Authors:
Abdelwahed Khamis, Mohamed Tarek

What is Unsupervised Learning?

  • Learning from unlabeled data
  • Goal: Find hidden patterns or structures in data
  • No “teacher” providing correct answers

Example Tasks

  1. Dimensionality Reduction
    • Reduce data complexity
    • Preserve important information
    • Methods: PCA, t-SNE
  2. Clustering
    • Group similar data points together
    • Example: Patient stratification
    • Methods: k-means, hierarchical clustering

Common Applications in Drug Development

  • Patient subgroup identification
  • Drug response patterns
  • Drug-drug similarity analysis

Dimension Reduction

Principal Component Analysis (PCA)

  • One of the most popular and powerful tools in data science
  • Helps us make sense of complex data by simplifying it while keeping the most interesting information.
  • Think of it like looking at a complex 3D object from different angles to understand its true shape
  • PCA helps us find the best angles to view our data.
  • A dimension reduction technique that transforms data into a new coordinate system
  • Finds directions of maximum variance (principal components)

Common Uses of PCA

  • Reduce computational complexity, as a preprocessing step for ML models (e.g., clustering using k-means)
  • Remove noise and redundancy
  • Visualize high-dimensional data in a low dimensional space
  • Feature extraction and selection
  • Data compression

PCA Algorithm

Input

  • \(X \in \mathbb{R}^{n \times d}\): Data matrix with \(n\) samples and \(d\) features

Center data

For each feature \(j = 1, \ldots, d\):

  • Compute means: \(\mu_j = \text{mean}(X[:, j])\)

  • Center features: \(\tilde{X}[:, j] = X[:, j] - \mu_j\)

Compute covariance matrix

\[ \Sigma = \frac{1}{n - 1} \tilde{X}^T \tilde{X} \quad \text{(a } d \times d \text{ matrix)} \]

PCA Algorithm

Find eigenvectors and eigenvalues

\[ (\Sigma - \lambda I) \cdot v = 0 \]

  • Sort the eigenvalues in descending order.
  • Denote the sorted eigenvalues by \(\lambda = [\lambda_1, \lambda_2, \dots, \lambda_d ]\).
  • Denote the eigenvectors of the sorted eigenvalues by \(V = [v_1 \, v_2 \, \dots \, v_d]\).

Select components

  • Select the top \(k\) eigenvalues and their associated eigenvectors
  • Based on desired dimensionality or explained variance percentage
  • Explained variance percentage is \(\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i}\)

PCA Algorithm

Project data \[ Z = \overbrace{\tilde{X}}^{n \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \quad \text{(an } n \times k \text{ matrix)} \]

Output

\[ Z \in \mathbb{R}^{n \times k} \]

  • Transformed data in reduced \(k\)-dimensional space
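
Below is a minimal from-scratch sketch of the algorithm above in Julia, using only standard-library functionality; the data matrix and the choice of \(k\) are illustrative.

using LinearAlgebra, Statistics, Random

Random.seed!(123)
X = randn(100, 5)                    # n × d data matrix (illustrative)
k = 2                                # desired output dimensionality

# Center each feature
μ = mean(X, dims=1)                  # 1 × d vector of feature means
Xc = X .- μ

# Covariance matrix (d × d)
Σ = (Xc' * Xc) ./ (size(X, 1) - 1)

# Eigen decomposition, sorted by descending eigenvalue
eig = eigen(Symmetric(Σ))
order = sortperm(eig.values, rev=true)
λ = eig.values[order]
V = eig.vectors[:, order]

# Explained variance percentage of the top k components
explained = sum(λ[1:k]) / sum(λ)

# Project the centered data into the k-dimensional space
Z = Xc * V[:, 1:k]                   # n × k

println("Explained variance for k = $k: ", round(explained, digits=3))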

Bonus Exercise

Prove that the variance of the projected data along an eigenvector \(v\) is equal to the corresponding eigenvalue \(\lambda\).

PCA Algorithm

Project new point \(x\)

\[ \overbrace{z}^{1 \times k} = \overbrace{(x - \mu)}^{1 \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \]

  • \(z\) is the projected point in the new (lower dimensional) coordinate system

Lossy reconstruction

\[ \overbrace{x_{\text{reconstructed}}}^{1 \times d} = \overbrace{z}^{1 \times k} \cdot \overbrace{V[:, 1:k]^T}^{k \times d} + \overbrace{\mu}^{1 \times d} \]

  • \(x_{\text{reconstructed}}\) is the projected point mapped back to the original coordinate system of \(x\)

PCA Algorithm

Reconstruction error

\[ ||x - x_{\text{reconstructed}}|| \]

  • Reconstruction error is a common model evaluation metric.
  • The error is a decreasing function of \(k\).
  • Can be used to compare different dimension reduction methods for a given \(k\).
  • Model comparison should be done using unseen test data.
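
The projection of a new point, its lossy reconstruction, and the reconstruction error can be sketched as follows (self-contained, with illustrative data); note how the error shrinks to zero as \(k\) approaches \(d\).

using LinearAlgebra, Statistics, Random

Random.seed!(1)
X = randn(200, 5)
μ = mean(X, dims=1)
Xc = X .- μ
Σ = (Xc' * Xc) ./ (size(X, 1) - 1)
eig = eigen(Symmetric(Σ))
V = eig.vectors[:, sortperm(eig.values, rev=true)]

x = randn(1, 5)                          # a "new" 1 × d point
for k in 1:5
    z = (x .- μ) * V[:, 1:k]             # 1 × k projection
    x_rec = z * V[:, 1:k]' .+ μ          # 1 × d lossy reconstruction
    println("k = $k  reconstruction error = ", round(norm(x .- x_rec), digits=4))
end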

PCA Algorithm

PCA caveats

  • PCA is sensitive to variance differences between the variables. Variables with a higher variance will have more pull.
  • If the goal is only to remove correlated (dependent) variables, the variables should be standardized/scaled appropriately first.
  • If the transformed variables \(Z\) will be used in linear regression, PCA does not necessarily give us the most predictive features.

PCA Animation

Basic Implementation

using MLJ, DataFrames, Random, MLJMultivariateStatsInterface
Random.seed!(123)

# Create sample data
X = DataFrame(randn(100, 5), :auto)

# Load and fit PCA
PCA = @load PCA pkg=MultivariateStats
pca = machine(PCA(maxoutdim=3), X)
fit!(pca)

# Transform data and compute the proportion of total variance explained by each retained component
Z = MLJ.transform(pca, X)
var_ratio = report(pca).principalvars ./ report(pca).tvar

println("Variance explained: ", round.(var_ratio, digits=3))
println("\nFirst 5 rows of transformed data:")
println(first(Z, 5))

Why the Eigenvectors?

  • PCA is a greedy algorithm that tries to:
    1. Find the direction of maximum variance in the data.
    2. Find the direction that is orthogonal to the previously found directions and has maximum variance in the data.
    3. Repeat step 2 until we have \(k\) directions.
  • Finding the direction \(v\) of maximum variance can be formulated as an optimization problem. \[ \max_{||v|| = 1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1} v^T \Sigma v \]
  • This is a non-convex optimization problem, but it has an analytic solution!

Why the Eigenvectors?

\[ \max_{||v|| = 1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1} v^T \Sigma v \]

  • The solution to this optimization problem is exactly the eigenvector \(v_1\) corresponding to the largest eigenvalue \(\lambda_1\) of \(\Sigma\).

  • The eigenvector \(v_1\) is sometimes unique up to a sign, but can also be non-unique if multiple directions have the same variance.

Bonus Exercise

  1. Prove that any eigenvector is at best unique up to a sign.
  2. Show an example where the eigenvector is not unique, with more than a sign difference.

Why the Eigenvectors?

  • To find the next direction of maximum variance, we can formulate a similar optimization problem with the additional constraint that the new direction is orthogonal to the previously found direction. \[ \max_{||v|| = 1, v \perp v_1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1, v \perp v_1} v^T \Sigma v \]
  • The solution to this optimization problem is exactly the eigenvector \(v_2\) corresponding to the second largest eigenvalue \(\lambda_2\) of \(\Sigma\).
  • One can find all principal components by solving a series of such optimization problems.
  • Alternatively, one can find all eigenvalues and eigenvectors of \(\Sigma\) using any eigenvalue decomposition algorithm.
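
A quick numerical check (not a proof) that projecting onto the leading eigenvector attains the variance \(\lambda_1\) and beats a random unit direction; the data is illustrative.

using LinearAlgebra, Statistics, Random

Random.seed!(42)
X = randn(500, 4) * [1.0 0.5 0.0 0.0; 0.0 1.0 0.3 0.0; 0.0 0.0 1.0 0.2; 0.0 0.0 0.0 1.0]
Xc = X .- mean(X, dims=1)
Σ = (Xc' * Xc) ./ (size(X, 1) - 1)
eig = eigen(Symmetric(Σ))
λ1 = maximum(eig.values)
v1 = eig.vectors[:, argmax(eig.values)]

println("Variance along v1:              ", round(var(Xc * v1), digits=4))
println("λ1:                             ", round(λ1, digits=4))
println("Variance along a random unit v: ", round(var(Xc * normalize(randn(4))), digits=4))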

Multidimensional Scaling (MDS)

Multidimensional Scaling (MDS)

  • Multidimensional Scaling (MDS) is a technique used for dimensionality reduction and data visualization.
  • The goal is to embed the high-dimensional input data \(\{x_i\}_{i=1}^n\) into the lower-dimensional embedded data \(\{y_i\}_{i=1}^n\), where \(n\) is the number of data points.
  • The dimensionality of the input and embedding spaces is denoted by \(d\) and \(p \leq d\), respectively, i.e., \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}^p\).
  • Additionally, we denote the collection of points by \(X := [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}\) and \(Y := [y_1, \ldots, y_n] \in \mathbb{R}^{p \times n}\).

Classical MDS

  • Also known as Torgerson Scaling, Torgerson-Gower Scaling or principal coordinates analysis (not to be confused with PCA).
  • The goal of classical MDS is to preserve the similarity of the data points in the embedding space as it was in the input space.
  • One way to measure similarity between 2 points is to calculate the inner product. \[ s_{ij} = x_i^T x_j \]
  • Classical MDS tries to minimize the following objective: \[ \min_{Y} \sum_{i,j} (x_i^T x_j - y_i^T y_j)^2 = \min_{Y} ||X^T X - Y^T Y||_F^2 = \min_{Y} \text{tr}\left( (X^T X - Y^T Y)^2 \right) \] where \(||\cdot||_F\) is the Frobenius norm.
  • \(X^T X\) and \(Y^T Y\) are called Gram matrices.

Classical MDS

  • A non-unique solution to the above optimization problem is given by the top \(p\) eigenvectors of the Gram matrix \(X^T X\).
  • Let \(X^T X = V \Lambda V^T\) be the eigenvalue decomposition of \(X^T X\), where \(\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)\) is the diagonal matrix of eigenvalues sorted in descending order, and \(V = [v_1, \ldots, v_n]\) is the matrix of corresponding eigenvectors.
  • An optimal solution is given by: \[ Y = \Lambda[1:p, 1:p]^{1/2} \cdot V[:, 1:p]^T \]
  • This is the same solution as PCA, assuming the data is centered.
  • The solution to classical MDS is unique up to orthogonal transformations (e.g., rotation, reflection).
  • Rotating or reflecting all the points in the embedding space does not change the inner products between the individual points.

Generalized Classical MDS

  • Let’s center the input data first by subtracting the mean: \[ \tilde{x_i} \leftarrow x_i - \bar{x} \] \[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \]
  • In matrix form, this can be written as: \[ \tilde{X} = X - \bar{x} \cdot 1_n^T = X - \frac{1}{n} X \cdot 1_n \cdot 1_n^T = X \cdot \left( I - \frac{1}{n} 1_n \cdot 1_n^T \right) = X H \] where \(1_n\) is the vector of all ones of size \(n\) and \(H = I - \frac{1}{n} 1_n \cdot 1_n^T\) is called the centering matrix.

Generalized Classical MDS

  • The centered Gram matrix is given by: \[ \tilde{X}^T \tilde{X} = H X^T X H \]
  • This is equivalent to double centering the distance matrix \(D\) of the input data, where \(D_{ij} = ||x_i - x_j||^2\). \[ \tilde{X}^T \tilde{X} = -\frac{1}{2} H D H \]
  • Only requiring a distance matrix allows us to generalize classical MDS to work with any distance/dissimilarity measure, not just the Euclidean distance.
  • If \(D\) is not a proper distance matrix, the centered Gram matrix may not be positive semi-definite, and some eigenvalues may be negative.
  • In this case, we drop any negative eigenvalues and their corresponding eigenvectors when computing the embedding.
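
A minimal sketch of (generalized) classical MDS starting from a squared-distance matrix, following the double-centering recipe above; the data and the choice of \(p\) are illustrative.

using LinearAlgebra, Random

Random.seed!(0)
n, d, p = 50, 5, 2
X = randn(d, n)                                             # d × n, points as columns (slide convention)

D = [sum(abs2, X[:, i] - X[:, j]) for i in 1:n, j in 1:n]   # squared Euclidean distances
H = I - fill(1 / n, n, n)                                   # centering matrix H = I - (1/n)·1·1ᵀ
G = -0.5 * H * D * H                                        # double-centered Gram matrix

eig = eigen(Symmetric(G))
order = sortperm(eig.values, rev=true)
λ = max.(eig.values[order][1:p], 0.0)                       # drop negative eigenvalues if D is not Euclidean
V = eig.vectors[:, order][:, 1:p]

Y = Diagonal(sqrt.(λ)) * V'                                 # p × n embedding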

Distance and Dissimilarity Measures

Distance and Dissimilarity Measures

  • A distance or dissimilarity measure is a function that quantifies how different two data points are.
  • Distance function \(d\) should satisfy the following properties:
    1. Non-negativity: \(d(x, y) \geq 0\) for all \(x, y\)
    2. Identity: \(d(x, y) = 0\) if and only if \(x = y\)
    3. Symmetry: \(d(x, y) = d(y, x)\) for all \(x, y\)
    4. Triangle inequality: \(d(x, z) \leq d(x, y) + d(y, z)\) for all \(x, y, z\)
  • Dissimilarity measures may not satisfy all these properties.

Common Distance/Dissimilarity Measures

Distance/Dissimilarity Measure       Suitable for
--------------------------------     ------------------------------
Euclidean distance                   Continuous data
Manhattan distance                   Continuous data
Cosine distance                      Continuous data
Earth mover’s distance (EMD)         Point clouds and distributions
Kullback-Leibler (KL) divergence     Distributions
Dynamic time warping (DTW)           Time series
Embedding-based distances            Text, images, graphs
Graph edit distance                  Graphs
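
A few of the measures in the table can be computed with the Distances.jl package (assumed to be installed); the function names below come from that package.

using Distances

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
println("Euclidean: ", euclidean(x, y))
println("Manhattan: ", cityblock(x, y))
println("Cosine distance: ", cosine_dist(x, y))

p, q = [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]          # probability vectors
println("KL divergence: ", kl_divergence(p, q))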

Kernel MDS

  • Instead of doing the inner product of the input data points \(x_i^T x_j\) in the Gram matrix, one can use any positive semi-definite kernel function \(k(x_i, x_j)\) to measure similarity.
  • The kernel function implicitly maps the input data points to a high-dimensional space, where the inner product is computed. Examples:
    • Linear kernel \(k(x_i, x_j) = x_i^T x_j\) performs the inner product in the original space, classical MDS.
    • Polynomial kernel \(k(x_i, x_j) = (x_i^T x_j + c)^d\) is equivalent to mapping the data to a higher-dimensional polynomial feature space and then performing the inner product in that space.
    • Radial basis function (RBF) or Gaussian kernel \(k(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)\) is equivalent to mapping each point \(x_i\) to an infinite-dimensional feature space and then computing the inner product in that space.
  • There are many other kernel functions.

Kernel MDS

  • When using a kernel function, the Gram matrix is known as the kernel matrix: \[ K_{ij} = k(x_i, x_j) \]
  • The centered Gram matrix is given by: \[ \tilde{K} = H K H \]
  • This MDS variant is known as kernel MDS.
  • When solved via an eigenvalue decomposition, kernel MDS is equivalent to kernel PCA.
  • The idea of replacing the inner product with a kernel function is known as the kernel trick. It is widely used in many machine learning algorithms, e.g., kernel PCA, kernel SVM, Gaussian processes.
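
A sketch of kernel MDS with an RBF kernel: build the kernel matrix, double-center it, and embed using the top eigenvectors, exactly as in classical MDS. The value of γ and the data are illustrative.

using LinearAlgebra, Random

Random.seed!(0)
n, p, γ = 100, 2, 0.5
X = randn(3, n)                                     # 3 × n input points (columns)

rbf(xi, xj) = exp(-γ * sum(abs2, xi - xj))
K = [rbf(X[:, i], X[:, j]) for i in 1:n, j in 1:n]  # kernel matrix K_ij = k(x_i, x_j)

H = I - fill(1 / n, n, n)
Kc = H * K * H                                      # centered kernel matrix

eig = eigen(Symmetric(Kc))
order = sortperm(eig.values, rev=true)
λ = max.(eig.values[order][1:p], 0.0)
V = eig.vectors[:, order][:, 1:p]
Y = Diagonal(sqrt.(λ)) * V'                         # p × n embedding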

Metric MDS

  • Instead of defining similarity using the inner product, one can use a more flexible distance/dissimilarity function between data points.
  • The goal of metric MDS is to preserve the pairwise distances between data points in the embedding space as they were in the input space.
  • Metric MDS tries to minimize the following objective: \[ \min_{Y} \left( \frac{\sum_{i = 1}^{n - 1} \sum_{j = i + 1}^n (d(x_i, x_j) - \delta(y_i, y_j))^2}{\sum_{i = 1}^{n - 1} \sum_{j = i + 1}^n d(x_i, x_j)^2} \right)^{\frac{1}{2}} \] where \(d\) is the distance function in the input space and \(\delta\) is the distance function in the embedding space.
  • The cost function is usually referred to as the stress function.

Metric MDS

  • If we remove the normalization term in the denominator, this is known as raw stress. \[ \min_{Y} \left( \sum_{i = 1}^{n - 1} \sum_{j = i + 1}^n (d(x_i, x_j) - \delta(y_i, y_j))^2 \right)^{\frac{1}{2}} \]
  • These optimization problems are non-convex and do not have an analytic solution.
  • They can be solved using iterative optimization algorithms, e.g., gradient descent.
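
A rough sketch of metric MDS that minimizes the squared raw stress with plain gradient descent; the learning rate, iteration count, and data are arbitrary illustrative choices (minimizing the squared raw stress has the same minimizer as the raw stress itself).

using LinearAlgebra, Random

Random.seed!(0)
n, p = 30, 2
X = randn(5, n)                                          # high-dimensional input points (columns)
D = [norm(X[:, i] - X[:, j]) for i in 1:n, j in 1:n]     # input-space distances d(x_i, x_j)

Y = randn(p, n)                                          # random initial embedding
η = 0.005                                                # learning rate
for _ in 1:1000
    grad = zeros(p, n)
    for i in 1:n, j in 1:n
        i == j && continue
        δ = norm(Y[:, i] - Y[:, j])
        δ == 0 && continue
        grad[:, i] .+= 2 * (δ - D[i, j]) .* (Y[:, i] - Y[:, j]) ./ δ
    end
    Y .-= η .* grad                                      # gradient step on the squared raw stress
end

stress = sqrt(sum((D[i, j] - norm(Y[:, i] - Y[:, j]))^2 for i in 1:n-1 for j in i+1:n))
println("Raw stress after optimization: ", round(stress, digits=3))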

Clustering

What is Clustering?

What is Clustering?

  • Cluster: a collection of data objects
    • Similar (or related) to one another within the same group
    • Dissimilar (or unrelated) to the objects in other groups
  • Cluster analysis (or clustering, data segmentation, …)
    • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Unsupervised learning: no predefined classes
  • Applications: exploratory data analysis, outlier detection, image segmentation, detailed model diagnostics stratified by cluster, generating pseudo-labels for self-supervised learning

Considerations in Clustering

  • Partitioning criteria
    • Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
  • Separation of clusters
    • Exclusive (e.g., one subject belongs to only one sub-group) vs. non-exclusive (e.g., one subject may belong to more than one sub-group)
  • Dissimilarity measure
    • How to measure the similarity or dissimilarity between data objects
  • Clustering space
    • Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

k-means

  • We want to classify patients based on their tumor sizes to identify different risk categories.
  • The dataset contains tumor size measurements (in cm) for a group of patients.
  • Suppose we decide to group patients into three clusters (k = 3).
    • Small Tumors (Low Risk)
    • Medium Tumors (Moderate Risk)
    • Large Tumors (High Risk)

k-means

Patient    Tumor size (cm)
-------    ---------------
1          0.45
2          0.70
3          1.00
4          1.38
5          2.14
6          2.50
7          3.00
8          3.50
9          4.00
10         4.50
11         5.00

k-means

Visually, we can easily identify the clusters, since the data is simple in this case.

k-means

  1. Select the number of clusters you want to identify in your data. This is the k in k-means clustering.

k-means

  2. Randomly select k data points as the initial cluster centers.

k-means

  3. Measure the distance between the first point and each of the k clusters.
  4. Assign the first point to the nearest cluster.

k-means

  5. Repeat steps 3 and 4 for all data points.

k-means

The process stops when a convergence criterion is met (e.g., no change in cluster assignments or means).

k-means

  1. Initialize the cluster means \(m_j\) randomly for \(j = 1, \ldots, k\).
  2. Repeat until convergence:
    • Assign each data point \(x_i\) to the nearest cluster \(C_j\). \[ \text{assignment}(x_i) = \arg \min_j \|x_i - m_j\|^2 \]
    • Recalculate each cluster mean: \[ m_j = \frac{1}{|C_j|} \sum_{x_i\in C_j} x_i \]
    • Check for convergence; stop when the convergence criterion is met.
  3. Return the final cluster means and assignments.
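
A from-scratch sketch of this loop, run on the one-dimensional tumor-size data from the earlier table; the convergence criterion here is simply that the assignments stop changing.

using Statistics, Random

function my_kmeans(x, k; maxiter=100)
    m = x[randperm(length(x))[1:k]]              # 1. initialize means at k random data points
    assignments = zeros(Int, length(x))
    for _ in 1:maxiter
        # 2a. assign each point to the nearest cluster mean
        new_assignments = [argmin([(xi - mj)^2 for mj in m]) for xi in x]
        new_assignments == assignments && break  # 2c. converged: assignments unchanged
        assignments = new_assignments
        for j in 1:k                             # 2b. recompute each cluster mean
            members = x[assignments .== j]
            isempty(members) || (m[j] = mean(members))
        end
    end
    return m, assignments                        # 3. final means and assignments
end

Random.seed!(123)
tumor_sizes = [0.45, 0.70, 1.00, 1.38, 2.14, 2.50, 3.00, 3.50, 4.00, 4.50, 5.00]
means, assignments = my_kmeans(tumor_sizes, 3)
println("Cluster means: ", round.(means, digits=2))
println("Assignments:   ", assignments)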

Stopping Criteria

  • One stopping criterion is when the within-cluster sum of squares (WCSS) stops decreasing significantly: \[ \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||^2 \]
    • \(x\): a point in a cluster \(C_i\)
    • \(m_i\): the mean of a cluster \(C_i\)
  • This criterion tries to make the resulting k clusters as compact and as separate as possible

Strengths and Limitations of k-means

Strengths

  • Efficient: \(O(tkn)\), where \(n\) is number of objects, \(k\) is number of clusters, and \(t\) is number of iterations; usually, \(k, t \ll n\)

Limitations

  • Often terminates at a local optimum
  • Applicable only to objects in a continuous space
  • Need to specify k, the number of clusters, in advance
  • Not suitable for discovering clusters with non-convex shapes, or clusters of very different sizes or densities
  • Sensitive to noisy data and outliers

Alternative: k-medoids

  • Similar to k-means but uses actual data points as cluster centers (medoids)
  • The mean is the point that minimizes the sum of squared Euclidean distances to all points in the cluster. \[ m_i = \arg \min_{y \in \mathbb{R}^d} \sum_{x_k \in C_i} ||y - x_k||^2 = \frac{1}{|C_i|} \sum_{x_k \in C_i} x_k \]

Bonus Exercise

Prove the above formula.

  • The medoid is the data point that minimizes the sum of distances to all points in the cluster \[ \text{medoid}(C_i) = \arg \min_{y \in C_i} \sum_{x_k \in C_i} d(y, x_k) \]

Alternative: k-medoids

  • More robust to noise and outliers, but computationally more expensive
  • Only requires a distance function, so can be used with non-numeric data
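
A small sketch of the medoid definition: the medoid of a cluster is the member with the smallest total distance to the other members, under any distance function (the Clustering.jl package also provides a ready-made k-medoids implementation).

# Medoid of a cluster `C` under a distance function `dist`
medoid(C, dist) = C[argmin([sum(dist(y, x) for x in C) for y in C])]

cluster = [0.45, 0.70, 1.00, 1.38, 2.14]         # illustrative 1-D cluster
println("Medoid: ", medoid(cluster, (a, b) -> abs(a - b)))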

Cluster Evaluation Metrics

  • Cohesion: Measures how closely related the objects in a cluster are. High cohesion means that the points within a cluster are close to each other. Common metrics:
    1. Within-cluster sum of squares (WCSS), sometimes called the (moment of) inertia \[ \begin{aligned} \text{WCSS} &= \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||^2 \\ &= \sum_{i=1}^{k} \frac{1}{2|C_i|} \sum_{x, y \in C_i} ||x - y||^2 \end{aligned} \]

    Bonus Exercise

    1. Prove the above equivalence.
    2. Prove that WCSS is a decreasing function of \(k\).

Cluster Evaluation Metrics

  • Cohesion: Measures how closely related the objects in a cluster are. High cohesion means that the points within a cluster are close to each other. Common metrics:
    1. Total intra-cluster distance \[ \text{Intra-cluster distance} = \sum_{i=1}^{k} \sum_{x, y \in C_i} d(x, y) \]
    This is useful for non-Euclidean distances.

Cluster Evaluation Metrics

  • Separation: Measures how distinct or well-separated a cluster is from other clusters. High separation means that the clusters are far apart from each other. Common metrics:
    1. Between-cluster sum of squares (BCSS) \[ \begin{aligned} \text{BCSS} &= \sum_{i=1}^{k} |C_i| \cdot ||m_i - m||^2 \\ &= \frac{1}{2n} \sum_{i=1}^{k} \sum_{j=1}^{k} |C_i| \cdot |C_j| \cdot ||m_i - m_j||^2 \end{aligned} \] where \(m\) is the overall mean of the data and \(n\) is the total number of data points.

    Bonus Exercise

    Prove the above equivalence.

Cluster Evaluation Metrics

  • Separation: Measures how distinct or well-separated a cluster is from other clusters. High separation means that the clusters are far apart from each other. Common metrics:
    1. Between-cluster sum of distances (BCSD) \[ \text{BCSD} = \sum_{i=1}^{k} \sum_{j=1}^{k} |C_i| \cdot |C_j| \cdot d(m_i, m_j) \] where \(m_i\) and \(m_j\) are the medoids of clusters \(C_i\) and \(C_j\), respectively, and \(d\) is the chosen distance metric.

Total Variance

  • For continuous data, the total variance is the sum of cohesion and separation:

\[ \text{Total Variance} = \text{WCSS} + \text{BCSS} \]

Bonus Exercise

Prove the above result.

  • Total variance is constant for a given dataset, regardless of the clustering. \[ \text{Total Variance} = \sum_{i=1}^{n} ||x_i - m||^2 \] where \(m\) is the overall mean of the data. The total variance is also known as total sum of squares (TSS).
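
A quick numerical check (not a proof) of the decomposition, using random data and an arbitrary cluster assignment:

using Statistics, Random

Random.seed!(7)
X = randn(60, 2)                          # n × d data
a = rand(1:3, 60)                         # an arbitrary assignment into k = 3 clusters
m = mean(X, dims=1)                       # overall mean

cluster(j) = X[a .== j, :]
WCSS = sum(sum(abs2, cluster(j) .- mean(cluster(j), dims=1)) for j in 1:3)
BCSS = sum(size(cluster(j), 1) * sum(abs2, mean(cluster(j), dims=1) .- m) for j in 1:3)
TSS  = sum(abs2, X .- m)

println("TSS ≈ WCSS + BCSS: ", TSS ≈ WCSS + BCSS)   # prints true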

Clustering as an Optimization Problem

  • The clustering problem can be formulated as an optimization problem: \[ \min_{C_1, \ldots, C_k} \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||^2 \]
  • Given that \(\text{WCSS} = \text{TSS} - \text{BCSS}\) and TSS is constant, this is equivalent to: \[ \max_{C_1, \ldots, C_k} \text{BCSS} = \sum_{i=1}^{k} |C_i| \cdot ||m_i - m||^2 \]
  • This is a combinatorial optimization problem and is NP-hard in general.
  • The k-means algorithm is a heuristic to solve this optimization problem to a local optimum.

Elbow Method for Selecting k in k-means

  • Run k-means clustering for a range of \(k\) values (e.g., 1 to 10).
  • For each \(k\), calculate the WCSS.
  • Plot \(k\) vs WCSS.
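
A sketch of this procedure using the kmeans function from the Clustering.jl package (assumed installed), which expects a d × n matrix and reports the WCSS as totalcost; here k ranges over 2 to 10 and the data is illustrative.

using Clustering, Random

Random.seed!(123)
X = randn(2, 300)                          # d × n data (Clustering.jl convention)

ks = 2:10
wcss = [kmeans(X, k).totalcost for k in ks]
for (k, w) in zip(ks, wcss)
    println("k = $k  WCSS = ", round(w, digits=1))
end
# Plot ks versus wcss (e.g., with Plots.jl) and look for the "elbow".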

Elbow Method for Selecting k in k-means

  • Interpret the plot:
    • WCSS decreases as \(k\) increases, but the rate of decrease slows.
    • The “elbow” point—where the curve bends—suggests the optimal \(k\).
    • Adding more clusters beyond this point yields diminishing returns.

Elbow Method for Selecting k in k-means

  • Advantages:
    • Simple and intuitive visual method
  • Limitations:
    • The “elbow” is sometimes ambiguous or not clearly visible

Silhouette Score/Coefficient for Selecting k in k-means

  • The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points \(x_i\).
  • For the \(i^\text{th}\) point, let
    • \(a_i\) be its average distance to all other points in its cluster \[ a_i = \frac{1}{|C_i| - 1} \sum_{y \in C_i} d(x_i, y) \] where \(C_i\) here denotes the cluster containing \(x_i\)
    • \(b_i\) be the minimum (over clusters) of its average distance to all the points in the other clusters \[ b_i = \min_{k \neq \text{assignment}(x_i)} \frac{1}{|C_k|} \sum_{y \in C_k} d(x_i, y) \]

Silhouette Score for Selecting k in k-means

  • The silhouette coefficient for a point \(x_i\) is: \[ s_i = \frac{(b_i - a_i)}{\max(a_i, b_i)} \]
  • Intuitively, it is a measure of how much closer \(x_i\) is to its own cluster compared to the nearest other cluster.
  • The value of \(s_i\) is between -1 and 1; the closer it is to 1, the better the clustering result.
  • Choose \(k\) with the highest average silhouette score.
  • Advantage:
    • Provides a quantitative measure of clustering quality
  • Limitation:
    • Can be computationally intensive for large datasets
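
A from-scratch sketch of the average silhouette score following the definitions above, applied to choose k for an illustrative dataset clustered with Clustering.jl's kmeans (assumed installed):

using Clustering, LinearAlgebra, Statistics, Random

# Average silhouette score; `X` is n × d, `assignments` holds cluster labels,
# and `dist` is a distance function.
function silhouette_score(X, assignments, dist)
    n = size(X, 1)
    s = zeros(n)
    for i in 1:n
        own    = [j for j in 1:n if assignments[j] == assignments[i] && j != i]
        others = setdiff(unique(assignments), [assignments[i]])
        (isempty(own) || isempty(others)) && continue      # leave s_i = 0 in degenerate cases
        a = mean(dist(X[i, :], X[j, :]) for j in own)
        b = minimum(mean(dist(X[i, :], X[j, :]) for j in findall(==(c), assignments))
                    for c in others)
        s[i] = (b - a) / max(a, b)
    end
    return mean(s)
end

Random.seed!(123)
X = randn(100, 2)                                          # n × d data
euclid(u, v) = norm(u - v)

scores = [silhouette_score(X, kmeans(Matrix(X'), k).assignments, euclid) for k in 2:6]
println("Best k (highest average silhouette): ", (2:6)[argmax(scores)])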