Documentation (English)

Matrix Factorization (SVD)

Matrix factorization using scipy's sparse SVD for collaborative filtering. Learns latent factors for users and items from rating patterns.

When to use:

  • Have explicit ratings (1-5 stars, thumbs up/down)
  • Want to predict ratings for unrated items
  • Need accurate collaborative filtering
  • Have enough user-item interactions

Strengths: Excellent for rating prediction, handles sparsity well, scalable, learns meaningful latent factors.

Weaknesses: Requires explicit ratings, cold start for new users/items, can overfit with too many factors.

How it Works

Matrix Factorization decomposes the user-item rating matrix into two lower-dimensional matrices: user factors and item factors. Each user and item is represented by a vector of latent features (factors) that capture hidden patterns in preferences.

The model learns these factors by minimizing the error between actual and predicted ratings. Once trained, it predicts ratings by computing the dot product of user and item factor vectors.

Key Concept: If a user and item have similar latent factors, the predicted rating will be high, suggesting a good recommendation.
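The decomposition described above can be sketched with scipy's truncated sparse SVD (`scipy.sparse.linalg.svds`). The tiny matrix and factor count below are illustrative only, not the component's defaults or internals:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Tiny hypothetical user-item rating matrix (3 users x 4 items; 0 = unrated).
R = csr_matrix(np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
]))

# Truncated SVD: R is approximated by U * diag(s) * Vt with k latent factors.
k = 2
U, s, Vt = svds(R, k=k)

user_factors = U * s    # fold singular values into the user side, (n_users, k)
item_factors = Vt.T     # (n_items, k)

# Predicted rating = dot product of user and item factor vectors.
pred = user_factors @ item_factors.T
print(pred.round(2))
```

Each row of `user_factors` and `item_factors` is one entity's latent vector; the full prediction matrix is just their product.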

Parameters

Feature Configuration

Feature Columns (required) List of columns to use; must include user_id and item_id, and optionally rating.

User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.

Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.

Rating Column (optional) Name of the column containing ratings. If provided, used for explicit feedback (e.g., 1-5 stars). If not provided, presence of interaction indicates implicit positive feedback.
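Assuming an interaction log in the column layout described above, it can be assembled into the sparse rating matrix the factorization consumes. The identifiers and ratings here are hypothetical, and the index mapping is a sketch rather than the component's actual preprocessing:

```python
from scipy.sparse import coo_matrix

# Hypothetical interaction log: (user_id, item_id, rating) triplets.
rows = [
    ("u1", "i1", 5.0), ("u1", "i3", 3.0),
    ("u2", "i1", 4.0), ("u2", "i2", 2.0),
    ("u3", "i3", 1.0),
]

# Map arbitrary string identifiers to contiguous matrix indices.
users = sorted({r[0] for r in rows})
items = sorted({r[1] for r in rows})
u_idx = {u: i for i, u in enumerate(users)}
i_idx = {it: j for j, it in enumerate(items)}

# Build the sparse user-item matrix from the triplets.
R = coo_matrix(
    ([r[2] for r in rows],
     ([u_idx[r[0]] for r in rows], [i_idx[r[1]] for r in rows])),
    shape=(len(users), len(items)),
).tocsr()
print(R.toarray())
```

Without a rating column, the same construction works with a constant value of 1.0 per interaction (implicit positive feedback).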

Model-Specific Parameters

Number of Factors (default: 100) Number of latent factors to learn for users and items. Controls model capacity.

  • 20-50: Simpler model, less overfitting, faster
  • 50-100: Good balance (default)
  • 100-200: Captures more nuanced patterns, may overfit
  • 200+: For very large datasets with complex patterns

Number of Epochs (default: 20) Number of training iterations through the data.

  • 10-20: Usually sufficient (default)
  • 20-50: For better convergence on complex data
  • 50+: Rarely needed, risk of overfitting

Learning Rate (default: 0.005) Step size for gradient descent optimization. Controls how fast the model learns.

  • 0.001-0.003: Conservative, slow but stable convergence
  • 0.005-0.01: Balanced (default range)
  • 0.01-0.05: Aggressive, faster but may overshoot
  • Start higher and decrease if training is unstable

Regularization (default: 0.02) L2 regularization term to prevent overfitting. Penalizes large factor values.

  • 0.0-0.01: Light regularization (may overfit)
  • 0.01-0.05: Standard regularization (default range)
  • 0.05-0.1: Heavy regularization (underfitting risk)
  • Increase if model overfits training data
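The learning rate and regularization interact in every training step. A minimal sketch of one SGD update on a single rating illustrates both (all values hypothetical; the component's actual training loop may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, lr, reg = 8, 0.005, 0.02   # factors, learning rate, regularization

# One observed rating and the current (hypothetical) factor vectors.
r_ui = 4.0
p_u = rng.normal(0.0, 0.1, n_factors)   # user factor vector
q_i = rng.normal(0.0, 0.1, n_factors)   # item factor vector

# One SGD step on the regularized squared error
#   (r_ui - p_u . q_i)^2 + reg * (||p_u||^2 + ||q_i||^2)
# lr scales the step; reg shrinks the factors toward zero.
err = r_ui - p_u @ q_i
p_new = p_u + lr * (err * q_i - reg * p_u)
q_new = q_i + lr * (err * p_u - reg * q_i)
```

A larger `lr` makes each step bigger (faster but riskier); a larger `reg` pulls the factor values down harder, trading training fit for generalization.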

Top-K Recommendations (default: 10) Number of items to recommend for each user.

  • 5-10: Focused recommendations
  • 10-20: Standard recommendation lists
  • 20-50: For exploration
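Top-K selection can be sketched as ranking a user's predicted scores and skipping items they have already rated (the scores and indices below are hypothetical):

```python
import numpy as np

# Hypothetical predicted scores for one user over 6 items.
scores = np.array([3.2, 4.8, 1.0, 4.1, 2.5, 4.9])
already_rated = {0, 5}   # item indices the user has already rated
k = 3

# Rank items by score (descending) and drop already-rated ones.
candidates = [int(i) for i in np.argsort(scores)[::-1] if i not in already_rated]
top_k = candidates[:k]
print(top_k)
```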

Configuration Tips

Dataset Size Considerations

  • Small (<10k interactions): Use 20-50 factors, higher regularization (0.05)
  • Medium (10k-1M): Use 50-100 factors, standard regularization (0.02)
  • Large (>1M): Use 100-200 factors, experiment with regularization

Parameter Tuning Guidance

  1. Start with defaults: 100 factors, 0.005 learning rate, 0.02 regularization
  2. Adjust factors: If underfitting (high train/test error), increase factors
  3. Adjust regularization: If overfitting (low train error, high test error), increase regularization
  4. Adjust learning rate: If loss doesn't decrease, reduce learning rate; if too slow, increase
  5. Monitor metrics: Watch RMSE for rating prediction, Precision@K for recommendations
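The two metrics named in step 5 can be computed as follows (held-out ratings and recommendation lists are hypothetical):

```python
import numpy as np

# Hypothetical held-out ratings vs. model predictions.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.8, 3.4, 4.5, 2.5])

# RMSE: rating-prediction quality (lower is better).
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k

# Precision@3: recommendation-list quality (higher is better).
p_at_3 = precision_at_k(["i5", "i2", "i9"], {"i2", "i9", "i7"}, k=3)
```

Track both on a held-out split: RMSE tells you whether the predicted rating values are accurate, Precision@K whether the ranked lists surface relevant items.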

When to Choose This Over Alternatives

  • vs. sklearn SVD: Choose this for explicit ratings and rating prediction
  • vs. Item-Based KNN: Choose this for better accuracy on sparse data
  • vs. User-Based KNN: Choose this for scalability and rating prediction
  • vs. Content-Based: Choose this when you have sufficient interaction data
  • vs. Hybrid: Choose this when you don't have item content features

Common Issues and Solutions

Cold Start Problem

Issue: The model cannot make predictions for new users or new items that have no interaction history. Solution:

  • Use item/user averages for new entities
  • Combine with content-based features (use Hybrid model instead)
  • Collect initial preferences through onboarding
  • Fall back to popularity-based recommendations
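A popularity fallback (the last bullet above) might be sketched like this, with hypothetical scores and item identifiers:

```python
# Hypothetical per-user model scores and a global popularity ranking.
model_scores = {"u1": {"i1": 4.2, "i2": 3.1, "i3": 4.8}}
popular_items = ["i3", "i1", "i2"]

def recommend(user_id, k=2):
    # Known user: rank by predicted score; unseen user: popularity fallback.
    scores = model_scores.get(user_id)
    if scores is None:
        return popular_items[:k]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The same guard works for new items by restricting the candidate set to items the model has factors for.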

Sparsity Issues

Issue: Most user-item pairs have no rating, making learning difficult. Solution:

  • Use implicit feedback (clicks, views) in addition to ratings
  • Apply regularization (default 0.02 is usually good)
  • Consider using more factors to capture patterns
  • Combine with other approaches (Hybrid model)

Overfitting

Issue: Great training performance but poor test performance. Solution:

  • Increase regularization (try 0.05-0.1)
  • Reduce number of factors (try 50-70)
  • Use fewer epochs (10-15)
  • Ensure sufficient training data (1000+ interactions minimum)

Slow Training

Issue: Training takes too long. Solution:

  • Reduce number of factors (try 50)
  • Reduce number of epochs (try 10)
  • Subsample data for initial experiments
  • Use sklearn SVD for faster alternative

Poor Prediction Accuracy

Issue: High RMSE or MAE on test set. Solution:

  • Ensure data quality (valid ratings, no duplicates)
  • Normalize ratings if they have different scales
  • Try more factors (100-200)
  • Experiment with learning rate (0.001-0.01)
  • Check for sufficient data density
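Per-user mean-centering is one simple form of the normalization suggested above; it removes each user's rating bias (harsh vs. generous raters) so the model learns deviations. The ratings here are hypothetical:

```python
import numpy as np

# Hypothetical raw ratings from a generous rater and a harsh rater.
ratings = {"u1": [5, 4, 5], "u2": [2, 1, 3]}

# Subtract each user's mean; add it back when converting predictions
# to the original rating scale.
centered = {u: list(np.asarray(r, dtype=float) - np.mean(r))
            for u, r in ratings.items()}
```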

Example Use Cases

Movie Recommendations

Scenario: Streaming service with 10M users rating 50k movies. Configuration:

  • 150 factors (complex preference patterns)
  • 20 epochs
  • 0.005 learning rate
  • 0.02 regularization
  • Top-10 recommendations

Why: Large dataset with explicit ratings, need accurate predictions.

E-commerce Product Ratings

Scenario: Online store with 100k users rating 20k products. Configuration:

  • 80 factors
  • 20 epochs
  • 0.01 learning rate
  • 0.03 regularization (more overfitting risk with smaller data)
  • Top-20 recommendations

Why: Medium dataset, need both rating prediction and recommendations.

Music Streaming

Scenario: Music app with implicit feedback (plays) and explicit ratings. Configuration:

  • 100 factors
  • 15 epochs
  • 0.005 learning rate
  • 0.02 regularization
  • Top-15 recommendations

Why: Mix of explicit and implicit feedback, balanced approach.
