
Dimensionality Reduction

Reduce high-dimensional data while preserving important information

Dimensionality reduction transforms high-dimensional data into fewer dimensions while preserving essential patterns and structure. Use it for visualization, noise reduction, feature extraction, or speeding up downstream models.

🎓 Learn About Dimensionality Reduction

New to dimensionality reduction? Visit our Dimensionality Reduction Concepts Guide to learn about the curse of dimensionality, evaluation metrics (Explained Variance, Trustworthiness), and when to use these techniques for your data.

Available Models

We support the following dimensionality reduction techniques:

Linear Methods

  • PCA - Principal Component Analysis for variance-based reduction
  • Truncated SVD - SVD for sparse matrices and text data
  • Factor Analysis - Statistical method for latent factors
  • ICA - Independent Component Analysis for source separation
  • NMF - Non-negative matrix factorization for parts-based representations
  • LDA - Linear Discriminant Analysis (supervised)
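All linear methods follow the same fit/transform pattern. A minimal sketch with PCA, assuming a scikit-learn backend (the platform's actual implementation may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # synthetic data: 200 samples, 10 features

pca = PCA(n_components=3, random_state=0)
X_reduced = pca.fit_transform(X)      # project onto the top 3 components

print(X_reduced.shape)                    # (200, 3)
print(pca.explained_variance_ratio_)      # variance captured per component
```

The components are ordered by explained variance, so the first column of `X_reduced` always captures the most structure.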

Non-Linear Manifold Methods

  • t-SNE - Powerful visualization for 2D/3D embeddings
  • UMAP - Fast, preserves both local and global structure
  • Isomap - Geodesic distance preservation
  • LLE - Locally Linear Embedding for local relationships
  • MDS - Multidimensional Scaling for distance preservation
  • Spectral Embedding - Graph-based embedding
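Manifold methods are typically used to produce a 2D embedding for plotting. A sketch with t-SNE on synthetic clustered data, again assuming scikit-learn:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated blobs in 20 dimensions
X = np.vstack([rng.normal(0, 1, (50, 20)),
               rng.normal(8, 1, (50, 20))])

# perplexity must be smaller than the number of samples
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (100, 2)
```

Note that t-SNE has no `transform` for new data; the embedding exists only for the samples it was fitted on.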

Kernel Methods

  • Kernel PCA - Kernel-based extension of PCA for non-linear structure

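Kernel PCA can unfold structure that plain PCA cannot. A sketch on concentric circles, assuming scikit-learn's `KernelPCA`:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# concentric circles: not linearly separable, so plain PCA cannot unfold them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# gamma=10 is an illustrative choice for this toy data, not a recommendation
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (200, 2)
```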
Common Configuration

Feature Configuration

Feature Columns (required) Select which columns to use for dimensionality reduction. Include all relevant numerical features that contribute to the patterns you want to preserve.

Number of Components (required) Target number of dimensions. Common choices:

  • 2-3: For visualization
  • 10-50: For feature extraction before modeling
  • Enough for ~80-95% explained variance: use a scree or cumulative-variance plot to decide
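In scikit-learn (assumed backend), PCA can pick the component count for you: passing a float as `n_components` keeps just enough components to reach that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# multiply by a random mixing matrix so the features are correlated
X = rng.normal(size=(300, 30)) @ rng.normal(size=(30, 30))

# a float n_components means "keep enough components for 95% of the variance"
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)
print(X_red.shape[1], round(float(pca.explained_variance_ratio_.sum()), 3))
```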

Hyperparameter Tuning

Some models support hyperparameter tuning:

  • Grid Search: Systematic exploration
  • Random Search: Faster approximate search
  • Bayesian Search: Intelligent optimization

Scoring Metrics:

  • Explained Variance: For linear methods (higher is better)
  • Trustworthiness: For manifold methods (higher is better, 0-1)
  • Reconstruction Error: How well data can be reconstructed (lower is better)

Understanding Dimensionality Reduction Metrics

Explained Variance Ratio

Proportion of variance explained by each component (linear methods).

  • Use for: PCA, Factor Analysis, Truncated SVD
  • Interpretation: Cumulative sum should be high (typically 80-95% for a good reduction)
  • Cumulative plot: Shows how many components needed for desired variance
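The cumulative plot boils down to a running sum of the ratios. A sketch (scikit-learn assumed) that finds how many components reach 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# scale each feature differently so the variance spectrum decays
X = rng.normal(size=(200, 20)) * np.linspace(5, 0.1, 20)

ratios = PCA().fit(X).explained_variance_ratio_
cumulative = np.cumsum(ratios)

# first index where the running sum crosses 0.90, converted to a count
n_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"{n_90} components reach 90% of the variance")
```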

Trustworthiness

Measures whether nearest neighbors in high-D remain nearest in low-D (0-1, higher is better).

  • Use for: t-SNE, UMAP, manifold methods
  • Good: >0.9 (excellent), 0.8-0.9 (good)
  • Interpretation: How well local structure is preserved
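scikit-learn (assumed backend) ships this metric as `sklearn.manifold.trustworthiness`; it works with any embedding, not just manifold methods:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
emb = PCA(n_components=2).fit_transform(X)

# fraction of low-D nearest neighbors that were also neighbors in high-D
score = trustworthiness(X, emb, n_neighbors=5)
print(round(float(score), 3))  # in [0, 1]; higher is better
```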

Reconstruction Error

Error when reconstructing original data from reduced representation.

  • Lower is better: Smaller error = better preservation
  • Use for: All methods that support inverse_transform
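For methods with an `inverse_transform` (scikit-learn naming assumed), reconstruction error is just the mean squared difference between the original data and its round trip through the reduced space:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 12))

pca = PCA(n_components=5).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))  # back to original space

mse = float(np.mean((X - X_hat) ** 2))
print(round(mse, 4))  # lower = better preservation
```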

Stress (for MDS)

Measure of discrepancy between distances (lower is better).

  • Good: <0.1 (excellent), 0.1-0.2 (good)
  • Poor: >0.2 (poor fit)
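The thresholds above refer to Kruskal's normalized stress-1. Note that scikit-learn's `MDS.stress_` is the raw (unnormalized) sum of squared distance errors, so a sketch computing stress-1 directly from the pairwise distances (scikit-learn assumed):

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
emb = MDS(n_components=2, random_state=0).fit_transform(X)

# Kruskal stress-1: normalized discrepancy between high-D and low-D distances
D_high = pairwise_distances(X)
D_low = pairwise_distances(emb)
stress1 = float(np.sqrt(((D_high - D_low) ** 2).sum() / (D_high ** 2).sum()))
print(round(stress1, 3))  # lower is better
```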

Choosing the Right Model

Quick Start Guide

  1. Start with PCA: Fast baseline, interpretable
  2. Try UMAP: If non-linear structure expected
  3. Use t-SNE: For beautiful 2D visualizations
  4. Go supervised (LDA): If you have labels and want separation

By Goal

Visualization (2D/3D):

  • Best: t-SNE, UMAP
  • Fast: PCA
  • With labels: LDA

Feature Extraction (before modeling):

  • Best: PCA, UMAP
  • With labels: LDA
  • Text data: Truncated SVD
  • Non-negative: NMF
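For text, the usual recipe is TF-IDF followed by Truncated SVD (latent semantic analysis). A sketch with scikit-learn (assumed backend) on a tiny hypothetical corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
X_tfidf = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix

# TruncatedSVD works directly on sparse input (unlike PCA, which centers it)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X_tfidf)
print(X_lsa.shape)  # (4, 2)
```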

Data Compression:

  • Best: PCA, Truncated SVD
  • Non-negative: NMF

Noise Reduction:

  • Best: PCA, Factor Analysis
  • Signal separation: ICA

By Data Type

Dense numerical:

  • PCA, UMAP, t-SNE, Kernel PCA

Sparse (text):

  • Truncated SVD, NMF

Images:

  • PCA, NMF

Time series / signals:

  • ICA, PCA

With labels:

  • LDA

Non-negative:

  • NMF

By Data Size

Small (<1k samples):

  • Any method

Medium (1k-10k):

  • PCA, UMAP, LDA, Truncated SVD

Large (>10k):

  • PCA, UMAP, Truncated SVD
  • Avoid: t-SNE, MDS, LLE

By Requirements

Need inference on new data:

  • Yes: PCA, UMAP, Truncated SVD, LDA, Kernel PCA, Isomap, ICA, NMF, Factor Analysis
  • No: t-SNE, LLE, MDS, Spectral Embedding

Need interpretability:

  • High: PCA, LDA, NMF, Factor Analysis
  • Medium: Truncated SVD, ICA
  • Low: t-SNE, UMAP, Kernel PCA

Need speed:

  • Fastest: PCA, Truncated SVD
  • Fast: UMAP, LDA
  • Slow: t-SNE, MDS, LLE

Best Practices

  1. Scale your features - Essential for distance-based methods (PCA, t-SNE, UMAP)
  2. Start with PCA - Fast baseline to understand your data
  3. Check explained variance - Plot cumulative variance to choose n_components
  4. Try multiple methods - Different methods reveal different aspects
  5. Tune hyperparameters - Especially for t-SNE (perplexity, learning_rate) and UMAP (n_neighbors, min_dist)
  6. Validate results - Use downstream task performance or visualization quality
  7. Use appropriate metrics - Explained variance for linear, trustworthiness for manifold methods
  8. Consider data size - Large datasets need scalable methods (PCA, UMAP)
  9. Match method to goal - Visualization vs. feature extraction need different approaches
  10. Reproducibility - Always set random_state for stochastic methods
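Practices 1 and 2 can be combined in one pipeline. A sketch (scikit-learn assumed) showing why scaling matters: without it, the largest-scale feature dominates the first component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# synthetic features whose scales differ by orders of magnitude
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 10000.0])

# without scaling: the huge third feature swallows the first component
first_raw = PCA(n_components=2).fit(X).explained_variance_ratio_[0]

# with scaling: variance is shared across features
pipe = make_pipeline(StandardScaler(), PCA(n_components=2, random_state=0))
pipe.fit(X)
first_scaled = pipe[-1].explained_variance_ratio_[0]

print(round(float(first_raw), 3), round(float(first_scaled), 3))
```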

Common Pitfalls

  • Not scaling data: Distance-based methods are sensitive to feature scales
  • Too few components: Missing important variance/structure
  • Too many components: Including noise
  • Wrong method for goal: e.g. t-SNE for feature extraction (it cannot transform new, unseen data)
  • Ignoring explained variance: Not checking how much information is preserved
  • Over-interpreting t-SNE: Distances between clusters are not meaningful
  • Default hyperparameters: t-SNE and UMAP benefit greatly from tuning
  • Using on small data: Manifold methods need sufficient samples

Tips for Better Results

For PCA:

  • Plot scree plot (explained variance)
  • Check component loadings for interpretation
  • Scale features to same range

For t-SNE:

  • Run multiple times with different perplexities (5, 30, 50, 100)
  • Increase iterations if plot still changing
  • Try different random seeds
  • Don't over-interpret global structure
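A perplexity sweep is a simple loop. A sketch (scikit-learn assumed; small toy data, so only two perplexity values here):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# three blobs in 5 dimensions, 30 samples each
X = np.vstack([rng.normal(i * 6, 1, (30, 5)) for i in range(3)])

embeddings = {}
for perplexity in (5, 30):  # must stay below the sample count (90)
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, random_state=0
    ).fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Comparing the resulting plots side by side shows which structures are stable across perplexities and which are artifacts of one setting.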

For UMAP:

  • Tune n_neighbors (local vs. global trade-off)
  • Tune min_dist (clumpy vs. spread out)
  • Works well with >2 components for feature extraction
  • Much faster than t-SNE

For LDA:

  • Need sufficient samples per class
  • Works best with normally distributed classes
  • Max components = n_classes - 1
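The n_classes - 1 cap in action, sketched with scikit-learn's `LinearDiscriminantAnalysis` on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes, 4 features

# max components = n_classes - 1 = 2, regardless of the 4 input features
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # (150, 2)
```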

Next Steps

Ready to reduce? Head to the Training page and:

  1. Select your dataset
  2. Scale your features if needed
  3. Choose a method based on this guide
  4. Start with 2 components for visualization
  5. Evaluate with appropriate metrics
  6. Tune hyperparameters for better results
  7. Use reduced data for visualization or downstream tasks
