BERT4Rec (Sequential Transformer)
BERT-based sequential recommendation using transformer architecture. Models temporal patterns in user-item interactions for next-item prediction. Best for session-based and time-ordered recommendations.
When to use:
- Have sequential/temporal interaction data
- User behavior follows patterns over time
- Session-based recommendations (e.g., "next video")
- Need to predict what users will interact with next
Strengths: Captures temporal patterns, understands sequences, state-of-the-art accuracy, learns complex user behavior.
Weaknesses: Requires large datasets, computationally expensive, needs sequential data, complex to tune.
How it Works
BERT4Rec applies the BERT (Bidirectional Encoder Representations from Transformers) architecture to sequential recommendation. It treats a user's interaction history as a sequence and learns to predict masked items.
Training Process:
- Take user's interaction sequence: [item1, item2, item3, item4, item5]
- Randomly mask some items: [item1, [MASK], item3, [MASK], item5]
- Use transformer to predict masked items using bidirectional context
- Learn patterns like "users who watched A then B often watch C next"
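The masking step above can be sketched in plain Python. The `[MASK]` token and the guarantee of at least one masked position are common conventions, not details taken from any specific implementation:

```python
import random

MASK = "[MASK]"

def mask_sequence(seq, mask_prob=0.2, rng=None):
    """Randomly replace items with [MASK], BERT-style (Cloze task).

    Returns the masked sequence and the (position, original item)
    pairs the model must reconstruct. Forces at least one mask so
    every training example yields a loss signal.
    """
    rng = rng or random.Random()
    masked, targets = list(seq), []
    for i, item in enumerate(seq):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets.append((i, item))
    if not targets:  # nothing was masked: force one position
        i = rng.randrange(len(seq))
        targets.append((i, masked[i]))
        masked[i] = MASK
    return masked, targets
```

During training, the model is asked to recover each `(position, item)` target from the bidirectional context around the mask.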
Inference:
- Given user's recent interactions, predict most likely next items
- Attention mechanism captures both short-term and long-term patterns
- Positional encoding preserves sequence order
Key Advantage: Unlike traditional methods, BERT4Rec understands that order matters and can capture complex sequential patterns like "after watching trilogy part 1, likely to watch part 2".
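At inference time, ranking reduces to scoring candidate items at the final (masked) position and excluding items the user has already interacted with. A minimal sketch, assuming per-item scores have already been computed as a dict:

```python
def top_k_next_items(scores, seen, k=10):
    """Rank candidate next items by model score, excluding items
    already in the user's sequence (standard next-item ranking)."""
    ranked = sorted(
        (item for item in scores if item not in seen),
        key=lambda item: scores[item],
        reverse=True,
    )
    return ranked[:k]
```

For example, `top_k_next_items({"a": 0.9, "b": 0.5, "c": 0.8}, seen={"a"}, k=2)` returns `["c", "b"]`.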
Parameters
Feature Configuration
Feature Columns (required) List of columns to use: must include user_id, item_id, and optionally timestamp.
User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.
Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item in sequences.
Timestamp Column (optional) Name of the column containing timestamps for ordering interactions. If provided, sequences are ordered by time. If not provided, assumes data is already ordered.
- Essential for accurate sequential modeling
- Can be Unix timestamp, datetime, or ordinal values
- Used to sort interactions chronologically
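If the interaction log arrives as unordered rows, the chronological per-user sequences described above can be built with a short grouping-and-sorting pass (the row layout here is illustrative):

```python
from collections import defaultdict

def build_sequences(interactions):
    """Group (user_id, item_id, timestamp) rows into per-user item
    sequences ordered by timestamp, oldest first."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    return {
        user: [item for _, item in sorted(events)]
        for user, events in by_user.items()
    }
```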
Model-Specific Parameters
Hidden Size (default: 64) Dimension of transformer hidden layers. Controls model capacity.
- 32-64: Small model, fast, good for small datasets
- 64-128: Balanced (default range)
- 128-256: Large model, better patterns, needs more data
- 256+: Very large, requires massive datasets
Number of Attention Heads (default: 2) Number of parallel attention heads in transformer.
- 1-2: Simple attention, faster (default)
- 2-4: Better pattern capture
- 4-8: Complex patterns, needs large hidden_size
- Must divide hidden_size evenly
Number of Layers (default: 2) Number of transformer blocks stacked.
- 1-2: Simple model, fast (default)
- 2-4: Standard depth
- 4-8: Deep model, complex patterns, slower
- More layers = more capacity but more data needed
Max Sequence Length (default: 50) Maximum number of recent items to consider in sequence.
- 10-20: Short-term patterns only
- 20-50: Good balance (default range)
- 50-100: Long-term patterns
- 100+: Very long context, expensive
- Longer = more memory and computation
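Truncating to the most recent `max_seq_len` items, plus left-padding so batches have uniform length, is the usual preprocessing step. A sketch, assuming id 0 is reserved for padding:

```python
PAD = 0  # assumed reserved padding id

def truncate_and_pad(seq, max_seq_len=50):
    """Keep only the most recent max_seq_len items, then left-pad
    shorter sequences so every model input has the same length."""
    seq = seq[-max_seq_len:]  # most recent items matter most
    return [PAD] * (max_seq_len - len(seq)) + list(seq)
```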
Mask Probability (default: 0.2) Probability of masking items during training (BERT-style).
- 0.1-0.15: Conservative masking
- 0.15-0.2: Standard (default range)
- 0.2-0.3: Aggressive masking
- Higher = more training signal but harder task
Batch Size (default: 256) Number of sequences processed together.
- 64-128: Small batch, less memory
- 128-256: Balanced (default)
- 256-512: Large batch, faster training, more memory
- Larger = more stable but more memory
Number of Epochs (default: 10) Number of training passes through data.
- 5-10: Quick training (default)
- 10-20: Standard training
- 20-50: Extensive training, risk of overfitting
- Monitor validation metrics to avoid overfitting
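One way to monitor validation metrics as suggested above is patience-based early stopping; this is a generic training pattern, not a feature of any particular trainer:

```python
class EarlyStopping:
    """Stop training when a validation metric (higher is better,
    e.g. NDCG@10) has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Call once per epoch; returns True when training should stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```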
Learning Rate (default: 0.001) Step size for optimizer (Adam).
- 0.0001-0.0005: Conservative, stable
- 0.0005-0.001: Balanced (default range)
- 0.001-0.005: Aggressive, faster convergence
- Start higher, reduce if training unstable
Top-K Recommendations (default: 10) Number of next items to recommend.
- 5-10: Focused next-item predictions
- 10-20: Standard recommendation lists
- 20-50: Broad exploration
Configuration Tips
Dataset Size Considerations
- Small (<10k sequences): Hidden: 32, Layers: 1-2, may not have enough data
- Medium (10k-100k): Hidden: 64, Layers: 2, Heads: 2 (default)
- Large (100k-1M): Hidden: 128, Layers: 3, Heads: 4
- Very Large (>1M): Hidden: 256, Layers: 4-6, Heads: 8
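The size-to-configuration table above can be expressed as a small helper; the thresholds and values are the rough starting points listed, not tuned recommendations:

```python
def suggest_config(n_sequences):
    """Map dataset size to the starting hyperparameters suggested
    above. Every suggestion keeps hidden_size divisible by num_heads."""
    if n_sequences < 10_000:
        return {"hidden_size": 32, "num_layers": 1, "num_heads": 2}
    if n_sequences < 100_000:
        return {"hidden_size": 64, "num_layers": 2, "num_heads": 2}
    if n_sequences < 1_000_000:
        return {"hidden_size": 128, "num_layers": 3, "num_heads": 4}
    return {"hidden_size": 256, "num_layers": 4, "num_heads": 8}
```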
Parameter Tuning Guidance
Start with defaults, then adjust:
Model Capacity:
- If underfitting: Increase hidden_size, layers, or heads
- If overfitting: Decrease hidden_size or layers, add dropout
Sequence Length:
- Short sessions (e.g., shopping): 20-30
- Medium sessions (e.g., browsing): 30-50
- Long sessions (e.g., binge-watching): 50-100
Training:
- Start: 10 epochs, 0.001 learning rate, 256 batch size
- If loss plateaus: Reduce learning rate, increase epochs
- If overfitting: Reduce epochs, increase regularization
Architecture:
- hidden_size should be divisible by num_heads
- More layers = more capacity but diminishing returns
- Balance model size with dataset size
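The divisibility constraint is easy to overlook, so it is worth checking before training starts. A minimal validation sketch:

```python
def validate_heads(hidden_size, num_heads):
    """Enforce the constraint above: hidden_size must be divisible
    by num_heads. Returns the per-head dimension on success."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by "
            f"num_heads ({num_heads})"
        )
    return hidden_size // num_heads  # dimension of each attention head
```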
When to Choose This Over Alternatives
- vs. Matrix Factorization: Choose this for sequential patterns and "what's next"
- vs. Item-Based KNN: Choose this when order/time matters
- vs. Content-Based: Choose this for behavior-based sequential prediction
- vs. Association Rules: Choose this for personalized sequential patterns
- Best for: Session-based, next-item prediction, temporal patterns, binge-watching
Common Issues and Solutions
Insufficient Sequential Data
Issue: Users have too few interactions to form meaningful sequences.
Solution:
- Reduce max_seq_len to work with shorter sequences
- Combine sessions from multiple users
- Use simpler model (Item-Based KNN)
- Collect more interaction data before using BERT4Rec
Overfitting
Issue: Great training performance, poor test performance.
Solution:
- Reduce model capacity (hidden_size, num_layers)
- Increase mask_prob (e.g., to 0.3)
- Use fewer epochs (5-10)
- Increase dropout (if configurable)
- Ensure sufficient training data
Slow Training
Issue: Training takes too long.
Solution:
- Reduce hidden_size (try 32-64)
- Reduce num_layers (try 2)
- Reduce max_seq_len (try 30)
- Increase batch_size (if memory allows)
- Use GPU if available
- Sample data for initial experiments
Cold Start Problem
Issue: New users have no sequence history.
Solution:
- Use popularity-based recommendations initially
- Quick preference collection (rate a few items)
- Fall back to content-based or hybrid
- Require a minimum sequence length before BERT4Rec kicks in
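The fallback logic above can be sketched as a thin dispatch layer; `min_seq_len` and the popularity list are placeholders you would tune for your data:

```python
def recommend(user_seq, popular_items, model_top_k, min_seq_len=3, k=10):
    """Fall back to popularity for users with too little history;
    otherwise call the sequential model. `model_top_k` is any
    callable (a stand-in here for BERT4Rec inference)."""
    if len(user_seq) < min_seq_len:
        seen = set(user_seq)
        return [i for i in popular_items if i not in seen][:k]
    return model_top_k(user_seq, k)
```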
Memory Issues
Issue: Out of memory during training.
Solution:
- Reduce batch_size (try 128 or 64)
- Reduce hidden_size (try 32-64)
- Reduce max_seq_len (try 30)
- Use gradient accumulation
- Use mixed precision training
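Gradient accumulation, mentioned above, trades wall-clock time for memory by averaging gradients over several small batches before each optimizer update. A framework-agnostic sketch (gradients reduced to plain floats for clarity; a trailing partial accumulation is discarded):

```python
def train_with_accumulation(batches, compute_grad, apply_update, accum_steps=4):
    """Average gradients over `accum_steps` small batches before each
    update, emulating a batch `accum_steps` times larger without the
    extra memory. Returns the number of optimizer updates performed."""
    acc, n_updates = 0.0, 0
    for step, batch in enumerate(batches, start=1):
        acc += compute_grad(batch) / accum_steps  # scale before summing
        if step % accum_steps == 0:
            apply_update(acc)  # one "large-batch" update
            acc, n_updates = 0.0, n_updates + 1
    return n_updates
```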
Poor Test Metrics
Issue: Low Hit Rate@K or NDCG.
Solution:
- Ensure proper temporal split (train past, test future)
- Check sequence quality (enough interactions per user)
- Tune hyperparameters (hidden_size, layers)
- Increase training epochs
- Ensure timestamp ordering is correct
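A correct temporal split is usually leave-last-out: each user's most recent item is held out for testing. A sketch of that split together with Hit Rate@K:

```python
def leave_last_out(sequences):
    """Temporal split: train on everything except each user's most
    recent item, which becomes the held-out test target."""
    train, test = {}, {}
    for user, seq in sequences.items():
        if len(seq) >= 2:  # need at least one item to train on
            train[user], test[user] = seq[:-1], seq[-1]
    return train, test

def hit_rate_at_k(test, recommendations, k=10):
    """Fraction of test users whose held-out item is in their top-k."""
    hits = sum(test[u] in recommendations[u][:k] for u in test)
    return hits / len(test) if test else 0.0
```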
Example Use Cases
Video Streaming "Next Episode"
Scenario: Streaming service predicting the next video users will watch.
Configuration:
- Hidden: 128, Heads: 4, Layers: 3
- Max sequence: 50 (capture binge-watching patterns)
- 20 epochs, batch size 256
- Top-10 recommendations
- Use viewing timestamp
Why: Clear sequential patterns (series episodes, related content), session-based viewing
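As a concrete illustration, this configuration could be written as a plain dict; the key names mirror this page's parameter names and are not tied to any specific library's API:

```python
# Hypothetical config for the video-streaming scenario above.
video_config = {
    "hidden_size": 128,
    "num_heads": 4,
    "num_layers": 3,
    "max_seq_len": 50,          # capture binge-watching patterns
    "mask_prob": 0.2,
    "batch_size": 256,
    "num_epochs": 20,
    "learning_rate": 0.001,
    "top_k": 10,
    "timestamp_col": "view_timestamp",  # illustrative column name
}
```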
E-commerce Session-Based
Scenario: Online store recommending the next product during a browsing session.
Configuration:
- Hidden: 64, Heads: 2, Layers: 2
- Max sequence: 20 (shorter shopping sessions)
- 15 epochs, batch size 256
- Top-5 recommendations
- Use click timestamp
Why: Shopping sessions have order (browse categories, compare products), predict purchase intent
Music Playlist Prediction
Scenario: Music app predicting the next song a user will play.
Configuration:
- Hidden: 96, Heads: 3, Layers: 3
- Max sequence: 30 (recent listening history)
- 25 epochs, batch size 512
- Top-15 recommendations
- Use play timestamp
Why: Music listening has strong sequential patterns (genre, mood, artist), temporal context matters