Dimensionality Reduction

What is Dimensionality Reduction?

Modern datasets can be very large. Some have instances described by hundreds or even thousands of variables, and that high dimensionality poses challenges because of the data's inherent complexity.

NOTE

You can think of attributes (variables) as dimensions of the data: each attribute is an axis, as in a 3D space, and each instance is a point whose coordinates are its attribute values.

These challenges include overfitting when training a model, as well as difficulty visualizing and understanding the data. The underlying reason is that data becomes sparser as the number of dimensions grows, which makes it harder for models to find patterns and make good inferences.
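One way to see this sparsity is to watch distances lose contrast as dimensions are added. The snippet below is a minimal sketch on synthetic uniform data; the helper name `distance_contrast`, the sample size, and the chosen dimensions are illustrative assumptions, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def distance_contrast(n_points: int, n_dims: int) -> float:
    """Relative gap between the farthest and nearest point from a random query."""
    points = rng.uniform(size=(n_points, n_dims))  # synthetic data in the unit cube
    query = rng.uniform(size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# Hypothetical dimensionalities chosen only for illustration
contrasts = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the nearest and farthest points end up at almost the same distance, which is one reason neighbour-based and pattern-finding methods struggle in high-dimensional spaces.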

In short, dimensionality reduction is a fundamental technique for simplifying complex data. It reduces computational cost, can improve predictions, and generally makes this kind of data easier to work with.

It works by extracting the essential information from the data and representing it in a more compact form that preserves as much of the original structure as possible. Done well, it also discards redundancy and noise.

There are different techniques to perform dimensionality reduction besides PCA:

t-SNE: a non-linear technique, good for visualizing the data and exploring its structure, especially for identifying hidden patterns or groups.
LDA: a supervised technique that takes class labels into account. It maximizes the separation between classes while minimizing the dispersion within them.
Autoencoders: a deep learning technique in which a neural network learns a compact representation of the data by reconstructing its own input. Very flexible.
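For a feel of how two of these differ in use, here is a minimal sketch with scikit-learn on the wine dataset. The perplexity value and random seed are assumptions picked for illustration. An autoencoder is left out because it would need a deep learning framework.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features, 3 classes
X_std = StandardScaler().fit_transform(X)

# t-SNE is unsupervised: it only sees the features, never the labels.
# perplexity=30 is an assumed setting; it must stay below the number of samples.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_std)

# LDA is supervised: it needs y, and yields at most n_classes - 1 components.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)

print("t-SNE embedding:", X_tsne.shape)
print("LDA projection: ", X_lda.shape)
```

Note that scikit-learn's t-SNE has no `transform` for unseen data, so it suits exploration more than a reusable preprocessing step, while the fitted LDA can project new samples.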

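Whichever technique you choose, preprocessing (such as scaling the features) is an important first step, and the reduced representation should be evaluated afterwards, for example by checking how much variance is retained or how a downstream model performs.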

Simpler alternatives

Before reaching for PCA or its siblings, two lighter strategies are worth considering:

Manual feature selection: picking attributes by hand. Common, but it does not scale to many features.
Hierarchical clustering: especially for clustering problems, grouping the data into smaller clusters to reduce complexity.

If these aren’t sufficient, then move on to PCA and the techniques above.

WARNING

Dimensionality reduction has real costs: you lose information by definition (hopefully noise, not signal), and can overfit if you tune k on the same data you train on. Pick the technique by goal — PCA for linear structure, t-SNE for visualization, LDA when you have labels, autoencoders for non-linear cases.
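One way to avoid tuning k on the data you later evaluate on is to put the reduction inside a pipeline and let cross-validation pick k on the training split only. This is a sketch under assumptions: the classifier, the candidate values of k, and the split sizes are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and PCA are refit inside every cross-validation fold,
# so no information from the validation fold leaks into them.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values of k are assumptions for this example
search = GridSearchCV(pipe, {"pca__n_components": [2, 4, 6, 8]}, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["pca__n_components"]
test_accuracy = search.score(X_test, y_test)  # held-out data, never used for tuning
print("Best k:", best_k)
print("Held-out accuracy:", round(test_accuracy, 3))
```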

Where it’s used

Dimensionality reduction is frequently used in computer vision to extract features from images and enable a more efficient analysis and classification, since images are usually represented as high-dimensional vectors where each pixel corresponds to a dimension.
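As a small illustration of the pixels-as-dimensions idea, the sketch below applies PCA to scikit-learn's 8x8 digit images. The 95% variance target is an assumption chosen for the example. The pixels are left unscaled here because they already share one intensity scale.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 images, each 8x8 pixels flattened to 64 features

# A float in (0, 1) asks PCA to keep just enough components
# to explain that fraction of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Map the compressed vectors back to pixel space to inspect what was lost
X_restored = pca.inverse_transform(X_reduced)

print("Original shape:", X.shape)
print("Reduced shape: ", X_reduced.shape)
print("Variance kept: ", pca.explained_variance_ratio_.sum())
```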

In NLP, texts are also represented as high-dimensional vectors such as embeddings, and techniques like PCA or autoencoders can be used to compress them for faster retrieval or visualization.

In other areas like bioinformatics and finance, which usually work with high-dimensional data, dimensionality reduction is also valuable to improve analysis.

Principal Component Analysis (PCA)

PCA works by finding new axes, each a linear combination of the original features, that maximize the variance of the data along them. Given a 50-column dataset, for example, PCA can compress it down to two of these axes; how much of the variance survives depends on how strongly the original columns are correlated.

The steps for implementing PCA are the following:

Data Standardization: rescale every attribute to a mean of 0 and a standard deviation of 1, since PCA is highly sensitive to scale. Standardization does not neutralize outliers, so handle those separately.
Covariance Matrix or SVD: compute either the covariance matrix (capturing how variables change together) or apply SVD directly to the standardized data. Both routes lead to the same components.
Calculate Eigenvalues and Eigenvectors: decompose the covariance matrix. Its eigenvectors are the principal components, and each eigenvalue is the variance along its eigenvector.
Select Components: select the top k eigenvectors, those with the largest eigenvalues, so that they capture most of the variance in the data.
Project the Data: stack the selected k eigenvectors as the columns of a projection matrix and multiply the standardized data by it to obtain the reduced representation.
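The five steps above can be sketched by hand with NumPy, taking the covariance-matrix route. This is an illustrative walk-through on the wine dataset with k = 2 assumed, not how a production library implements PCA.

```python
import numpy as np
from sklearn.datasets import load_wine

X, _ = load_wine(return_X_y=True)  # n = 178 samples, m = 13 features

# 1. Standardize: mean 0 and standard deviation 1 per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the features (m x m)
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2  # assumed number of components for this example
V_k = eigvecs[:, :k]  # m x k projection matrix

# 5. Project the standardized data onto the k components
X_proj = X_std @ V_k  # n x k

explained = eigvals[:k] / eigvals.sum()
print("Projected shape:", X_proj.shape)
print("Explained variance ratio:", explained)
```

The sign of each eigenvector is arbitrary, so the projected coordinates may be mirrored compared with another implementation without changing the geometry.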

NOTE

What are components? New axes built from the eigenvectors of the covariance matrix, ordered by variance. If your data is an elongated cloud of points, PC1 is the long axis of the cloud, PC2 the next-longest perpendicular axis, and so on.

In practice, scikit-learn handles all five steps, with StandardScaler covering standardization and PCA the rest:

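from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)  # 13 features: alcohol, magnesium, proline, ...
X_std = StandardScaler().fit_transform(X)  # step 1: mean 0, standard deviation 1 per feature

pca = PCA(n_components=2)  # keep the top k = 2 components
X_pca = pca.fit_transform(X_std)  # steps 2-5: decompose, select, and project

print("Explained variance per PC:", pca.explained_variance_ratio_)  # fraction of variance per component
print("Loadings:\n", pca.components_)  # one row per component, one column per original feature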

Two components capture ~55% of the variance of the 13 original features, and the loadings tell you which original features each component leans on (e.g., PC1 is weighted heavily on total phenols and flavanoids).
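If 55% is not enough for your use case, one common heuristic is to look at the cumulative explained variance and pick the smallest k that crosses a target. The sketch below assumes an 80% target purely for illustration; the right threshold depends on the task.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Fit with all components so the full variance spectrum is available
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.80  # assumed threshold for this example
# searchsorted gives the first index whose cumulative value reaches the target
k_target = int(np.searchsorted(cumulative, target) + 1)

for i, c in enumerate(cumulative, start=1):
    print(f"First {i:>2} components: {c:.3f}")
print("Smallest k reaching the target:", k_target)
```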

So, in summary, the outcome of PCA is a projection in a reduced feature space:

$$\mathrm{Projection}_{\mathrm{PCA}}(X) = X \cdot V_{k}$$

Where:

$X$ is the original (standardized) data matrix with dimensions $n \times m$: $n$ instances and $m$ attributes.
$V_{k}$ is the $m \times k$ matrix whose columns are the $k$ eigenvectors of $V$ with the largest eigenvalues, where $V$ is the full matrix of eigenvectors of the covariance matrix.

The result of the multiplication is an $n \times k$ matrix: each row holds the coordinates of one original instance in the reduced space spanned by the $k$ principal components.
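To make the shapes concrete, the following sketch builds $V_{k}$ through the SVD route and checks that $X \cdot V_{k}$ agrees with scikit-learn's output. The comparison uses absolute values because the sign of each component is arbitrary. The choice of k = 2 and the tolerance are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # n x m = 178 x 13
k = 2  # assumed number of components

# SVD route: X_std = U @ diag(S) @ Vt, where the rows of Vt are the
# principal directions, already ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
V_k = Vt[:k].T  # m x k

X_proj = X_std @ V_k  # (n x m) @ (m x k) -> n x k

# Reference result from scikit-learn on the same standardized data
X_sklearn = PCA(n_components=k).fit_transform(X_std)

# Component signs are arbitrary, so compare magnitudes only
same_up_to_sign = np.allclose(np.abs(X_proj), np.abs(X_sklearn), atol=1e-6)
print("Shapes:", X_std.shape, "@", V_k.shape, "->", X_proj.shape)
print("Matches scikit-learn up to sign:", same_up_to_sign)
```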

PCA can be used not only for dimensionality reduction, but also for visualization and feature extraction.

WARNING

The resulting matrix is not as interpretable as the original data — columns are linear combinations of every original feature, not “age” or “income”. Plot the loadings (eigenvector weights) as a heatmap to see which originals each component leans on.
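A loadings heatmap along those lines could look like the sketch below, which assumes matplotlib is available. The colormap, figure size, and color limits are arbitrary presentation choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_std = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=2).fit(X_std)

# One row per component, one column per original feature
loadings = pca.components_

fig, ax = plt.subplots(figsize=(10, 2.5))
# Eigenvector weights lie in [-1, 1], so a symmetric diverging colormap fits
image = ax.imshow(loadings, cmap="coolwarm", vmin=-1, vmax=1, aspect="auto")
ax.set_xticks(range(len(wine.feature_names)))
ax.set_xticklabels(wine.feature_names, rotation=45, ha="right")
ax.set_yticks(range(loadings.shape[0]))
ax.set_yticklabels([f"PC{i + 1}" for i in range(loadings.shape[0])])
fig.colorbar(image, ax=ax, label="loading")
fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or store the figure. Cells with strong colors mark the original features that dominate a component.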

Concepts touched on here that deserve their own leaf — and would be reused well beyond dimensionality reduction:

Eigenvalues and eigenvectors — show up in PCA, spectral clustering, PageRank, Markov chains.
SVD — low-rank approximation, recommender systems, least squares, latent semantic analysis.
Orthogonality and projection — PCA, least squares, Fourier analysis, Gram-Schmidt.
Covariance matrix — multivariate Gaussians, Mahalanobis distance, Kalman filters.
Curse of dimensionality — kNN, kernel methods, sampling in high dimensions.
