Input: data matrix \(X \in \mathbb{R}^{n \times d}\) (\(n\) observations, \(d\) features) and the number of components \(k \le d\)
Center data
For each feature \(j = 1, \ldots, d\):
Compute the mean: \(\mu_j = \text{mean}(X[:, j])\)
Center the feature: \(\tilde{X}[:, j] = X[:, j] - \mu_j\)
Compute covariance matrix
\[ \Sigma = \frac{1}{n - 1} \tilde{X}^T \tilde{X} \quad \text{(a } d \times d \text{ matrix)} \]
Find eigenvectors and eigenvalues
\[ (\Sigma - \lambda I) \cdot v = 0 \]
Select components: sort the eigenvalues in decreasing order \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d\), collect the corresponding eigenvectors as the columns of \(V\), and keep the first \(k\) columns
Project data \[ Z = \overbrace{\tilde{X}}^{n \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \quad \text{(an } n \times k \text{ matrix)} \]
Output
\[ Z \in \mathbb{R}^{n \times k} \]
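To make the steps above concrete, here is a minimal from-scratch sketch in Julia, using only the standard LinearAlgebra and Statistics libraries (the function name `pca_fit` is ours, not from any package):

```julia
using LinearAlgebra, Statistics

# Minimal from-scratch PCA following the steps above.
# X is n × d with observations in rows; k is the number of components to keep.
function pca_fit(X::AbstractMatrix, k::Int)
    μ = mean(X, dims=1)                  # 1 × d vector of feature means
    X̃ = X .- μ                           # center each feature
    Σ = (X̃' * X̃) ./ (size(X, 1) - 1)     # d × d covariance matrix
    E = eigen(Symmetric(Σ))              # eigenvalues come back in ascending order
    order = sortperm(E.values, rev=true) # indices of eigenvalues, largest first
    V = E.vectors[:, order[1:k]]         # top-k eigenvectors as columns (d × k)
    λ = E.values[order[1:k]]             # the k largest eigenvalues
    return μ, V, λ
end

X = randn(100, 5)
μ, V, λ = pca_fit(X, 2)
Z = (X .- μ) * V                         # n × k projected data
```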
Bonus Exercise
Prove that the variance of the projected data along an eigenvector \(v\) is equal to the corresponding eigenvalue \(\lambda\).
Project new point \(x\)
\[ \overbrace{z}^{1 \times k} = \overbrace{(x - \mu)}^{1 \times d} \cdot \overbrace{V[:, 1:k]}^{d \times k} \]
Lossy reconstruction
\[ \overbrace{x_{\text{reconstructed}}}^{1 \times d} = \overbrace{z}^{1 \times k} \cdot \overbrace{V[:, 1:k]^T}^{k \times d} + \overbrace{\mu}^{1 \times d} \]
Reconstruction error
\[ \|x - x_{\text{reconstructed}}\| \]
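Continuing the sketch above (reusing the `μ` and `V` returned by the hypothetical `pca_fit`), projecting, reconstructing, and scoring a new point looks like this:

```julia
# μ and V come from the pca_fit sketch above; norm is from LinearAlgebra.
x = randn(1, 5)                    # a new 1 × d point
z = (x .- μ) * V                   # 1 × k projection
x_reconstructed = z * V' .+ μ      # 1 × d lossy reconstruction
err = norm(x - x_reconstructed)    # reconstruction error ‖x − x_reconstructed‖
```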
PCA caveats
using MLJ, DataFrames, Random, MLJMultivariateStatsInterface
Random.seed!(123)
# Create sample data
X = DataFrame(randn(100, 5), :auto)
# Load and fit PCA
PCA = @load PCA pkg=MultivariateStats
pca = machine(PCA(maxoutdim=3), X)
fit!(pca)
# Transform data and get variance explained
Z = MLJ.transform(pca, X)
# Proportion of the total variance captured by each retained component
var_ratio = report(pca).principalvars ./ report(pca).tvar
println("Variance explained: ", round.(var_ratio, digits=3))
println("\nFirst 3 rows of transformed data:")
println(first(Z, 3))
\[ \max_{||v|| = 1} \text{Var}(\tilde{X} \cdot v) = \max_{||v|| = 1} v^T \Sigma v \]
The solution to this optimization problem is exactly the eigenvector \(v_1\) corresponding to the largest eigenvalue \(\lambda_1\) of \(\Sigma\).
The eigenvector \(v_1\) is unique only up to sign, and even that fails when the largest eigenvalue is repeated, i.e. when several orthogonal directions attain the same maximal variance.
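As a quick numerical illustration (not a proof), we can compare \(v_1^T \Sigma v_1\) against \(v^T \Sigma v\) for many random unit vectors; none should exceed \(\lambda_1\):

```julia
using LinearAlgebra, Statistics, Random
Random.seed!(1)

# Numerical check: the top eigenvector maximizes v' * Σ * v over unit vectors.
X = randn(500, 4)
X̃ = X .- mean(X, dims=1)
Σ = (X̃' * X̃) ./ (size(X, 1) - 1)
E = eigen(Symmetric(Σ))
v1 = E.vectors[:, argmax(E.values)]   # eigenvector of the largest eigenvalue

best_random = maximum(1:10_000) do _
    v = normalize(randn(4))           # a random unit vector
    v' * Σ * v
end
@show v1' * Σ * v1                    # equals λ₁
@show best_random                     # never exceeds λ₁
```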
Bonus Exercise
| Distance/Dissimilarity Measure | Suitable for |
|---|---|
| Euclidean distance | Continuous data |
| Manhattan distance | Continuous data |
| Cosine distance | Continuous data |
| Earth mover’s distance (EMD) | Point clouds and distributions |
| Kullback-Leibler (KL) divergence | Distributions |
| Dynamic time warping (DTW) | Time series |
| Embedding-based distances | Text, images, graphs |
| Graph edit distance | Graphs |
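For the continuous-data measures in the table, a short sketch assuming the Distances.jl package (any direct implementation of the formulas works just as well):

```julia
using Distances   # assumes the Distances.jl package is installed

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
euclidean(x, y)     # √Σ(xᵢ − yᵢ)² ≈ 3.742
cityblock(x, y)     # Manhattan distance Σ|xᵢ − yᵢ| = 6.0
cosine_dist(x, y)   # 1 − cosine similarity = 0.0 here, since y = 2x
```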
| Patient | Tumor size (cm) |
|---|---|
| 1 | 0.45 |
| 2 | 0.70 |
| 3 | 1.00 |
| 4 | 1.38 |
| 5 | 2.14 |
| 6 | 2.50 |
| 7 | 3.00 |
| 8 | 3.50 |
| 9 | 4.00 |
| 10 | 4.50 |
| 11 | 5.00 |
Since the data here are one-dimensional and well separated, we can easily identify the clusters visually.
The process stops when a convergence criterion is met (e.g., no change in cluster assignments or means).
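A sketch of running k-means on the tumor sizes above, assuming the Clustering.jl package (its `kmeans` expects a matrix with observations as columns):

```julia
using Clustering   # assumes the Clustering.jl package is installed

# Tumor sizes from the table above as a 1 × 11 matrix (features × observations).
sizes = [0.45 0.70 1.00 1.38 2.14 2.50 3.00 3.50 4.00 4.50 5.00]
result = kmeans(sizes, 2)      # partition the 11 patients into k = 2 clusters
@show result.centers           # 1 × 2 matrix of cluster means
@show assignments(result)      # cluster index (1 or 2) for each patient
```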
Bonus Exercise
Prove the above formula.
Bonus Exercise
Prove the above equivalence.
\[ \text{Total Variance} = \text{WCSS} + \text{BCSS} \]
Bonus Exercise
Prove the above result.
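A numerical sanity check of the decomposition (not a proof), using the tumor sizes above with a hypothetical two-cluster assignment and the sum-of-squares form of each term:

```julia
using Statistics

x = [0.45, 0.70, 1.00, 1.38, 2.14, 2.50, 3.00, 3.50, 4.00, 4.50, 5.00]
labels = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]   # hypothetical cluster assignment

μ = mean(x)
total = sum((x .- μ) .^ 2)                    # total sum of squares
wcss  = sum(sum((x[labels .== c] .- mean(x[labels .== c])) .^ 2) for c in 1:2)
bcss  = sum(count(labels .== c) * (mean(x[labels .== c]) - μ)^2 for c in 1:2)
@show total ≈ wcss + bcss                     # true: Total = WCSS + BCSS
```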