Matrix Factorization (SVD)
Matrix factorization using scipy's sparse SVD for collaborative filtering. Learns latent factors for users and items from rating patterns.
When to use:
- Have explicit ratings (1-5 stars, thumbs up/down)
- Want to predict ratings for unrated items
- Need accurate collaborative filtering
- Have enough user-item interactions
Strengths: Excellent for rating prediction, handles sparsity well, scalable, learns meaningful latent factors.
Weaknesses: Requires ratings (explicit feedback), cold start for new users/items, can overfit with too many factors.
How it Works
Matrix Factorization decomposes the user-item rating matrix into two lower-dimensional matrices: user factors and item factors. Each user and item is represented by a vector of latent features (factors) that capture hidden patterns in preferences.
The model learns these factors by minimizing the error between actual and predicted ratings. Once trained, it predicts ratings by computing the dot product of user and item factor vectors.
Key Concept: If a user's and an item's latent factor vectors align (a large dot product), the predicted rating will be high, suggesting a good recommendation.
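As a concrete sketch of the idea, the toy example below factorizes a small ratings table with scipy.sparse.linalg.svds and predicts a rating from the dot product of factor vectors. The column names and data are illustrative, not this component's actual interface.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy explicit ratings (illustrative data and column names).
ratings = pd.DataFrame({
    "user_id": [0, 0, 1, 1, 2, 2],
    "item_id": [0, 1, 0, 2, 1, 2],
    "rating":  [5.0, 3.0, 4.0, 1.0, 2.0, 5.0],
})

# Sparse user-item rating matrix; unrated pairs stay zero.
R = csr_matrix(
    (ratings["rating"], (ratings["user_id"], ratings["item_id"])),
    shape=(3, 3),
)

# Truncated SVD with k latent factors (k must be < min(n_users, n_items)).
U, sigma, Vt = svds(R, k=2)
user_factors = U * sigma   # (n_users, k): singular values folded into users
item_factors = Vt.T        # (n_items, k)

# Predicted rating = dot product of the user's and item's factor vectors.
print(user_factors[0] @ item_factors[2])  # user 0's predicted rating for item 2
```

Note that plain truncated SVD treats missing entries as zeros; the gradient-based training controlled by the parameters below instead fits only the observed ratings.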
Parameters
Feature Configuration
Feature Columns (required) List of columns to use. Must include user_id and item_id; rating is optional.
User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.
Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.
Rating Column (optional) Name of the column containing ratings. If provided, used for explicit feedback (e.g., 1-5 stars). If not provided, presence of interaction indicates implicit positive feedback.
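When no rating column is provided, the implicit case amounts to treating each observed (user, item) row as a constant positive rating. A minimal sketch of that assumption:

```python
import pandas as pd

# Illustrative: interactions without a rating column become implicit
# positive feedback by filling in a constant value.
interactions = pd.DataFrame({"user_id": [0, 0, 1], "item_id": [3, 5, 3]})
interactions["rating"] = 1.0  # presence of a row = positive signal
```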
Model-Specific Parameters
Number of Factors (default: 100) Number of latent factors to learn for users and items. Controls model capacity.
- 20-50: Simpler model, less overfitting, faster
- 50-100: Good balance (default)
- 100-200: Captures more nuanced patterns, may overfit
- 200+: For very large datasets with complex patterns
Number of Epochs (default: 20) Number of training iterations through the data.
- 10-20: Usually sufficient (default)
- 20-50: For better convergence on complex data
- 50+: Rarely needed, risk of overfitting
Learning Rate (default: 0.005) Step size for gradient descent optimization. Controls how fast the model learns.
- 0.001-0.003: Conservative, slow but stable convergence
- 0.005-0.01: Balanced (default range)
- 0.01-0.05: Aggressive, faster but may overshoot
- Start higher and decrease if training is unstable
Regularization (default: 0.02) L2 regularization term to prevent overfitting. Penalizes large factor values.
- 0.0-0.01: Light regularization (may overfit)
- 0.01-0.05: Standard regularization (default range)
- 0.05-0.1: Heavy regularization (underfitting risk)
- Increase if model overfits training data
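To make the roles of the learning rate and regularization concrete, here is an illustrative FunkSVD-style SGD loop. It sketches the standard update rule under assumed names, not this component's actual implementation.

```python
import numpy as np

def sgd_factorize(triples, n_users, n_items, n_factors=100,
                  n_epochs=20, lr=0.005, reg=0.02, seed=0):
    """Illustrative SGD matrix factorization (FunkSVD-style update)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - P[u] @ Q[i]          # prediction error for this rating
            p_u = P[u].copy()              # keep old user vector for Q's step
            # Gradient step: learning rate scales it, L2 term shrinks factors.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Usage: P, Q = sgd_factorize([(0, 1, 4.0), (2, 0, 5.0)], n_users=3, n_items=2)
```

A larger learning rate takes bigger steps (faster, but it may overshoot); a larger regularization term shrinks the factor vectors toward zero on every update, which is what limits overfitting.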
Top-K Recommendations (default: 10) Number of items to recommend for each user.
- 5-10: Focused recommendations
- 10-20: Standard recommendation lists
- 20-50: For exploration
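After training, a top-K list is produced by scoring every item for a user and keeping the K highest-scoring items the user has not rated. A minimal sketch; the helper name and the `seen` mapping are assumptions:

```python
import numpy as np

def recommend_top_k(user_factors, item_factors, seen, user, k=10):
    """Rank all items for `user` by predicted score, skipping rated ones."""
    scores = item_factors @ user_factors[user]   # one score per item
    scores[list(seen.get(user, ()))] = -np.inf   # mask already-rated items
    return np.argsort(-scores)[:k]               # indices of the top-k items

# Usage: top = recommend_top_k(user_factors, item_factors, {0: {0, 1}}, user=0)
```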
Configuration Tips
Dataset Size Considerations
- Small (<10k interactions): Use 20-50 factors, higher regularization (0.05)
- Medium (10k-1M): Use 50-100 factors, standard regularization (0.02)
- Large (>1M): Use 100-200 factors, experiment with regularization
Parameter Tuning Guidance
- Start with defaults: 100 factors, 0.005 learning rate, 0.02 regularization
- Adjust factors: If underfitting (high train/test error), increase factors
- Adjust regularization: If overfitting (low train error, high test error), increase regularization
- Adjust learning rate: If loss doesn't decrease, reduce learning rate; if too slow, increase
- Monitor metrics: Watch RMSE for rating prediction, Precision@K for recommendations
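Both metrics from the last point take only a few lines to compute; a sketch assuming trained factor matrices P and Q and integer-indexed IDs:

```python
import numpy as np

def rmse(test_triples, P, Q):
    """Root-mean-square error of predictions on held-out (u, i, r) triples."""
    errors = [r - P[u] @ Q[i] for u, i, r in test_triples]
    return float(np.sqrt(np.mean(np.square(errors))))

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that are actually relevant."""
    return sum(1 for item in recommended[:k] if item in relevant) / k
```

A falling training RMSE paired with a rising test RMSE is the overfitting signature referenced above.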
When to Choose This Over Alternatives
- vs. sklearn SVD: Choose this for explicit ratings and rating prediction
- vs. Item-Based KNN: Choose this for better accuracy on sparse data
- vs. User-Based KNN: Choose this for scalability and rating prediction
- vs. Content-Based: Choose this when you have sufficient interaction data
- vs. Hybrid: Choose this when you don't have item content features
Common Issues and Solutions
Cold Start Problem
Issue: Cannot recommend to new users or recommend new items.
Solution:
- Use item/user averages for new entities (see the fallback sketch after this list)
- Combine with content-based features (use Hybrid model instead)
- Collect initial preferences through onboarding
- Fall back to popularity-based recommendations
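A hedged sketch of the averages fallback from this list; the helper name and the index convention (indices beyond the trained matrices are treated as new entities) are assumptions:

```python
def predict_with_fallback(u, i, P, Q, item_means, global_mean):
    """Predict a rating, falling back for cold-start users/items."""
    known_user = 0 <= u < P.shape[0]
    known_item = 0 <= i < Q.shape[0]
    if known_user and known_item:
        return P[u] @ Q[i]        # normal factor-based prediction
    if known_item:
        return item_means[i]      # new user: fall back to the item's average
    return global_mean            # new item: fall back to the global average
```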
Sparsity Issues
Issue: Most user-item pairs have no rating, making learning difficult.
Solution:
- Use implicit feedback (clicks, views) in addition to ratings
- Apply regularization (default 0.02 is usually good)
- Consider using more factors to capture patterns
- Combine with other approaches (Hybrid model)
Overfitting
Issue: Great training performance but poor test performance.
Solution:
- Increase regularization (try 0.05-0.1)
- Reduce number of factors (try 50-70)
- Use fewer epochs (10-15)
- Ensure sufficient training data (1000+ interactions minimum)
Slow Training
Issue: Training takes too long.
Solution:
- Reduce number of factors (try 50)
- Reduce number of epochs (try 10)
- Subsample data for initial experiments
- Use sklearn SVD as a faster alternative
Poor Prediction Accuracy
Issue: High RMSE or MAE on test set.
Solution:
- Ensure data quality (valid ratings, no duplicates)
- Normalize ratings if they have different scales (see the mean-centering sketch after this list)
- Try more factors (100-200)
- Experiment with learning rate (0.001-0.01)
- Check for sufficient data density
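For the normalization point flagged in this list, per-user mean-centering is one common scheme: train on centered ratings, then add the user's mean back at prediction time. A minimal sketch assuming the same user_id and rating columns:

```python
import pandas as pd

def mean_center(ratings):
    """Subtract each user's mean rating; keep the means to add back later."""
    user_means = ratings.groupby("user_id")["rating"].mean()
    centered = ratings.copy()
    centered["rating"] -= ratings["user_id"].map(user_means)
    return centered, user_means  # predict on centered, then add the mean back
```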
Example Use Cases
Movie Recommendations
Scenario: Streaming service with 10M users rating 50k movies.
Configuration:
- 150 factors (complex preference patterns)
- 20 epochs
- 0.005 learning rate
- 0.02 regularization
- Top-10 recommendations
Why: Large dataset with explicit ratings; accurate predictions are the priority.
E-commerce Product Ratings
Scenario: Online store with 100k users rating 20k products.
Configuration:
- 80 factors
- 20 epochs
- 0.01 learning rate
- 0.03 regularization (more overfitting risk with smaller data)
- Top-20 recommendations
Why: Medium dataset; need both rating prediction and recommendations.
Music Streaming
Scenario: Music app with implicit feedback (plays) and explicit ratings.
Configuration:
- 100 factors
- 15 epochs
- 0.005 learning rate
- 0.02 regularization
- Top-15 recommendations
Why: Mix of explicit and implicit feedback; a balanced approach.