Dimensionality Reduction
Clustering
Input
Center data
For each feature \(j = 1, \ldots, d\):
Compute mean: \(\mu_j = \text{mean}(X[:, j])\)
Center feature: \(\tilde{X}[:, j] = X[:, j] - \mu_j\)
Compute covariance matrix
\[ \Sigma = \frac{1}{n - 1} \tilde{X}^T \tilde{X} \quad \text{(a } d \times d \text{ matrix)} \]
Find eigenvectors and eigenvalues
\[ (\Sigma - \lambda I) \cdot v = 0 \]
Select components
Project data \[ Z = \overbrace{\tilde{X}}^{n \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \quad \text{(an } n \times k \text{ matrix)} \]
Output
\[ Z \in \mathbb{R}^{n \times k} \]
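The pipeline above (center the data, form the covariance matrix, eigendecompose, project) can be sketched with only Julia's standard library. This is a minimal sketch: the data, the sizes `n`, `d`, `k`, and the seed are made up for illustration.

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n, d, k = 100, 5, 2
X = randn(n, d)                    # synthetic data, n × d

# Center each feature
μ = mean(X, dims = 1)              # 1 × d matrix of feature means
Xc = X .- μ                        # centered data (X̃ in the notation above)

# Covariance matrix, d × d
Σ = (Xc' * Xc) / (n - 1)

# Eigendecomposition; sort eigenvalues in descending order
F = eigen(Symmetric(Σ))
order = sortperm(F.values, rev = true)
λ, V = F.values[order], F.vectors[:, order]

# Project onto the top k principal components
Z = Xc * V[:, 1:k]                 # n × k
```

Note that `eigen` returns eigenvalues in ascending order, hence the explicit re-sort so that `V[:, 1]` is the direction of largest variance.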
Bonus Exercise
Prove that the variance of the projected data along an eigenvector \(v\) is equal to the corresponding eigenvalue \(\lambda\).
Project new point \(x\)
\[ \overbrace{z}^{1 \times k} = \overbrace{(x - \mu)}^{1 \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \]
Lossy reconstruction
\[ \overbrace{x_{\text{reconstructed}}}^{1 \times d} = \overbrace{z}^{1 \times k} \cdot \overbrace{V[:, 1:k]^T}^{k \times d} + \overbrace{\mu}^{1 \times d} \]
Reconstruction error
\[ \|x - x_{\text{reconstructed}}\| \]
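The projection and lossy reconstruction of a new point can be checked numerically. A minimal self-contained sketch (synthetic data and seed are made up; `Xc` is the centered training data):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(1)
n, d, k = 100, 5, 2
X = randn(n, d)
μ = mean(X, dims = 1)
Xc = X .- μ                          # centered data
F = eigen(Symmetric((Xc' * Xc) / (n - 1)))
V = F.vectors[:, sortperm(F.values, rev = true)]

# Project a new point, then reconstruct it (lossily) from z
x = randn(1, d)                      # 1 × d
z = (x .- μ) * V[:, 1:k]             # 1 × k
x_rec = z * V[:, 1:k]' .+ μ          # 1 × d
err = norm(x - x_rec)                # reconstruction error

# Using all d components makes the reconstruction exact
x_exact = ((x .- μ) * V) * V' .+ μ
```

With `k < d` the error is generally positive; with `k = d` the orthogonal matrix `V` makes the round trip exact.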
PCA caveats
```julia
using MLJ, DataFrames, Random, MLJMultivariateStatsInterface

Random.seed!(123)

# Create sample data
X = DataFrame(randn(100, 5), :auto)

# Load and fit PCA
PCA = @load PCA pkg = MultivariateStats
pca = machine(PCA(maxoutdim = 3), X)
fit!(pca)

# Transform data and get variance explained
Z = MLJ.transform(pca, X)
var_ratio = report(pca).principalvars ./ sum(report(pca).principalvars)
println("Variance explained: ", round.(var_ratio, digits = 3))
println("\nFirst 5 rows of transformed data:")
println(first(Z, 5))
```

PCA is a greedy algorithm: it extracts one direction of maximum variance at a time.
Finding the direction \(v\) of maximum variance can be formulated as an optimization problem. \[ \max_{||v|| = 1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1} v^T \Sigma v \]
This is a non-convex optimization problem, but it has an analytic solution!
The solution to this optimization problem is exactly the eigenvector \(v_1\) corresponding to the largest eigenvalue \(\lambda_1\) of \(\Sigma\).
The eigenvector \(v_1\) is unique up to sign when \(\lambda_1\) is a simple eigenvalue, but it is non-unique when several directions attain the same maximal variance (a repeated eigenvalue).
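This claim can be sanity-checked numerically: the Rayleigh quotient \(v^T \Sigma v\) of the top eigenvector equals \(\lambda_1\), and no random unit vector exceeds it. A sketch on made-up data (seed, sizes, and the helper name `rayleigh` are illustrative):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(2)
X = randn(200, 4)
Xc = X .- mean(X, dims = 1)
Σ = (Xc' * Xc) / (size(X, 1) - 1)

# Top eigenpair of Σ
F = eigen(Symmetric(Σ))
λ1 = maximum(F.values)
v1 = F.vectors[:, argmax(F.values)]

# Variance of the projection along a unit vector v
rayleigh(v) = v' * Σ * v

var_v1 = rayleigh(v1)                                        # attains λ1
best_random = maximum(rayleigh(normalize(randn(4))) for _ in 1:1000)
```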
Bonus Exercise
A distance or dissimilarity measure is a function that quantifies how different two data points are.
Distance function \(d\) should satisfy the following properties:
- Non-negativity: \(d(x, y) \ge 0\)
- Identity: \(d(x, y) = 0\) if and only if \(x = y\)
- Symmetry: \(d(x, y) = d(y, x)\)
- Triangle inequality: \(d(x, z) \le d(x, y) + d(y, z)\)
Dissimilarity measures may not satisfy all these properties.
| Distance/Dissimilarity Measure | Suitable for |
|---|---|
| Euclidean distance | Continuous data |
| Manhattan distance | Continuous data |
| Cosine distance | Continuous data |
| Earth mover’s distance (EMD) | Point clouds and distributions |
| Kullback-Leibler (KL) divergence | Distributions |
| Dynamic time warping (DTW) | Time series |
| Embedding-based distances | Text, images, graphs |
| Graph edit distance | Graphs |
Instead of doing the inner product of the input data points \(x_i^T x_j\) in the Gram matrix, one can use any positive semi-definite kernel function \(k(x_i, x_j)\) to measure similarity.
The kernel function implicitly maps the input data points to a high-dimensional space, where the inner product is computed. Examples:
- Linear kernel: \(k(x_i, x_j) = x_i^T x_j\)
- Polynomial kernel: \(k(x_i, x_j) = (x_i^T x_j + c)^p\)
- RBF (Gaussian) kernel: \(k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)\)
There are many other kernel functions.
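As a sketch of the idea, the plain Gram matrix of inner products can be replaced by a kernel matrix; any valid kernel yields a symmetric positive semi-definite matrix. The data, seed, `γ`, and the helper name `rbf` below are made up for illustration (the RBF kernel is one standard choice):

```julia
using LinearAlgebra, Random

Random.seed!(3)
X = randn(50, 3)                     # synthetic data, one point per row

# RBF (Gaussian) kernel: k(x, y) = exp(-γ ‖x − y‖²)
γ = 0.5
rbf(x, y) = exp(-γ * sum(abs2, x - y))

# Kernel (Gram) matrix replacing the plain inner products xᵢᵀxⱼ
K = [rbf(X[i, :], X[j, :]) for i in 1:50, j in 1:50]

# A valid kernel matrix is symmetric positive semi-definite
λmin = minimum(eigvals(Symmetric(K)))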
Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Cluster analysis (or clustering, data segmentation, …)
Unsupervised learning: no predefined classes
Applications: exploratory data analysis, outlier detection, image segmentation, detailed model diagnostics stratifying by the cluster, generating pseudo-labels for self-supervised learning
Partitioning criteria
Separation of clusters
Dissimilarity measure
Clustering space
We want to classify patients based on their tumor sizes to identify different risk categories.
The dataset contains tumor size measurements (in cm) for a group of patients.
Suppose we decide to group patients into three clusters (k = 3).
| Patient | Tumor size (cm) |
|---|---|
| 1 | 0.45 |
| 2 | 0.70 |
| 3 | 1.00 |
| 4 | 1.38 |
| 5 | 2.14 |
| 6 | 2.50 |
| 7 | 3.00 |
| 8 | 3.50 |
| 9 | 4.00 |
| 10 | 4.50 |
| 11 | 5.00 |
Visually, we can easily identify the clusters, as the data is simple in this case.
The process stops when a convergence criterion is met (e.g., no change in cluster assignments or means).
Initialize the cluster means \(m_j\) randomly for \(j = 1, \ldots, k\).
Repeat until convergence:
- Assignment step: assign each point \(x_i\) to the cluster whose mean \(m_j\) is nearest.
- Update step: recompute each mean \(m_j\) as the average of the points currently assigned to cluster \(j\).

Return the final cluster means and assignments.
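The algorithm above can be sketched in a few lines on the 1-D tumor-size data from the table. This is a minimal illustration, not a production implementation; the function name `kmeans1d` and the seed are made up, and the result depends on the random initialization.

```julia
using Statistics, Random

# Tumor sizes (cm) from the table above
sizes = [0.45, 0.70, 1.00, 1.38, 2.14, 2.50, 3.00, 3.50, 4.00, 4.50, 5.00]

function kmeans1d(x, k; maxiter = 100, rng = MersenneTwister(42))
    # Initialize means with k distinct random data points
    m = sort(x[randperm(rng, length(x))[1:k]])
    labels = zeros(Int, length(x))
    for _ in 1:maxiter
        # Assignment step: nearest mean
        new_labels = [argmin(abs.(xi .- m)) for xi in x]
        new_labels == labels && break        # converged: assignments unchanged
        labels = new_labels
        # Update step: mean of the assigned points
        for j in 1:k
            members = x[labels .== j]
            isempty(members) || (m[j] = mean(members))
        end
    end
    return m, labels
end

means, labels = kmeans1d(sizes, 3)
```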
One stopping criterion is when the within-cluster sum of squares (WCSS) stops decreasing significantly: \[ \text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||^2 \]
This criterion tries to make the resulting \(k\) clusters as compact and as separate as possible.
Bonus Exercise
Prove the above formula.
Cohesion: Measures how closely related the objects in a cluster are. High cohesion means that the points within a cluster are close to each other. Common metrics:
- Within-cluster sum of squares (WCSS)
- Average pairwise distance within a cluster
Bonus Exercise
Separation: Measures how distinct or well-separated a cluster is from other clusters. High separation means that the clusters are far apart from each other. Common metrics:
- Between-cluster sum of squares (BCSS)
- Distances between cluster centroids
Bonus Exercise
Prove the above equivalence.
\[ \text{Total Variance} = \text{WCSS} + \text{BCSS} \]
Bonus Exercise
Prove the above result.
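The decomposition can also be checked numerically. A minimal sketch using only the standard library; the data and the cluster assignment below are made up for illustration:

```julia
using Statistics, Random

Random.seed!(4)
x = randn(30)                                # synthetic 1-D data
labels = rand(1:3, 30)                       # arbitrary assignment into k = 3 clusters

xbar = mean(x)                               # overall mean
total = sum((xi - xbar)^2 for xi in x)       # total sum of squares

clusters = [x[labels .== j] for j in 1:3]
# Within-cluster: squared distances to each cluster mean
wcss = sum(sum((xi - mean(c))^2 for xi in c) for c in clusters if !isempty(c))
# Between-cluster: cluster sizes times squared distances of means to the overall mean
bcss = sum(length(c) * (mean(c) - xbar)^2 for c in clusters if !isempty(c))
```

Up to floating-point error, `total == wcss + bcss` regardless of how the points are assigned.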
Interpret the plot: look for the "elbow", the value of \(k\) beyond which WCSS decreases only marginally.
Advantages:
- Simple and fast to compute; easy to explain.

Limitations:
- The elbow is often ambiguous, and WCSS always decreases as \(k\) grows, so the choice remains heuristic.
The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points \(x_i\).
For the \(i^\text{th}\) point, let
- \(a_i\) = the average distance from \(x_i\) to the other points in its own cluster (cohesion),
- \(b_i\) = the smallest average distance from \(x_i\) to the points of any other cluster (separation).
The silhouette coefficient for a point \(x_i\) is: \[ s_i = \frac{(b_i - a_i)}{\max(a_i, b_i)} \]
Intuitively, it is a measure of how much closer \(x_i\) is to its own cluster compared to the nearest other cluster.
The value of \(s_i\) lies between -1 and 1; the closer it is to 1, the better \(x_i\) fits its assigned cluster.
Choose \(k\) with the highest average silhouette score.
Advantage:
- Combines cohesion and separation in a single score, and the average silhouette can be compared across different values of \(k\).

Limitation:
- Requires all pairwise distances, which is \(O(n^2)\) and expensive for large datasets.
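The silhouette computation can be sketched directly from the definitions of \(a_i\) and \(b_i\). A minimal self-contained example on two well-separated made-up 1-D clusters (data, seed, and the function name `silhouette` are illustrative):

```julia
using Statistics, Random

Random.seed!(5)
x = vcat(randn(20), randn(20) .+ 6.0)        # two well-separated groups
labels = vcat(fill(1, 20), fill(2, 20))

function silhouette(x, labels, i)
    own = labels[i]
    # a_i: average distance to the other points in the same cluster
    a = mean(abs(x[i] - x[j]) for j in eachindex(x) if labels[j] == own && j != i)
    # b_i: smallest average distance to the points of any other cluster
    b = minimum(mean(abs(x[i] - x[j]) for j in eachindex(x) if labels[j] == c)
                for c in unique(labels) if c != own)
    return (b - a) / max(a, b)
end

scores = [silhouette(x, labels, i) for i in eachindex(x)]
avg_score = mean(scores)
```

Because the two groups are far apart relative to their spread, the average silhouette is high here; with overlapping clusters it would drop toward 0.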