Documentation (English)

BERT4Rec (Sequential Transformer)

BERT-based sequential recommendation using transformer architecture. Models temporal patterns in user-item interactions for next-item prediction. Best for session-based and time-ordered recommendations.

When to use:

  • Have sequential/temporal interaction data
  • User behavior follows patterns over time
  • Session-based recommendations (e.g., "next video")
  • Need to predict what users will interact with next

Strengths: Captures temporal patterns, understands sequences, state-of-the-art accuracy, learns complex user behavior.

Weaknesses: Requires large datasets, computationally expensive, needs sequential data, complex to tune.

How it Works

BERT4Rec applies the BERT (Bidirectional Encoder Representations from Transformers) architecture to sequential recommendation. It treats a user's interaction history as a sequence and learns to predict masked items.

Training Process:

  1. Take user's interaction sequence: [item1, item2, item3, item4, item5]
  2. Randomly mask some items: [item1, [MASK], item3, [MASK], item5]
  3. Use transformer to predict masked items using bidirectional context
  4. Learn patterns like "users who watched A then B often watch C next"
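The masking step above can be sketched in Python. This is a minimal illustration; `mask_sequence` and the `"[MASK]"` token string are stand-ins for the internal training logic, not the product's API:

```python
import random

MASK = "[MASK]"

def mask_sequence(seq, mask_prob=0.2, rng=None):
    """Replace each item with [MASK] with probability mask_prob (BERT-style).

    Returns the masked sequence plus (position, original item) pairs,
    which become the prediction targets during training.
    """
    rng = rng or random.Random(0)
    masked, targets = [], []
    for pos, item in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append((pos, item))
        else:
            masked.append(item)
    # Guarantee at least one training target per sequence.
    if not targets:
        pos = rng.randrange(len(seq))
        targets.append((pos, seq[pos]))
        masked[pos] = MASK
    return masked, targets
```

The model is then trained to recover the original items at the masked positions from the surrounding (bidirectional) context.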

Inference:

  • Given user's recent interactions, predict most likely next items
  • Attention mechanism captures both short-term and long-term patterns
  • Positional encoding preserves sequence order
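A minimal sketch of the inference step, assuming the trained transformer is exposed as a scoring function (`model_scores_fn` is a hypothetical stand-in, not the actual API): append `[MASK]` to the recent history and rank items by their score at that position.

```python
def predict_next(model_scores_fn, history, top_k=10, max_seq_len=50):
    """Rank candidate next items: keep the most recent max_seq_len - 1
    items, append [MASK], and score items for that masked position.

    model_scores_fn stands in for the trained transformer: it maps a
    masked sequence to an {item: score} dict for the final position.
    """
    seq = history[-(max_seq_len - 1):] + ["[MASK]"]
    scores = model_scores_fn(seq)
    ranked = sorted(scores, key=scores.get, reverse=True)
    seen = set(history)  # do not re-recommend recently seen items
    return [item for item in ranked if item not in seen][:top_k]
```

Filtering out already-seen items is a common design choice for next-item prediction; drop the `seen` filter for domains where repeat consumption is expected (e.g., music).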

Key Advantage: Unlike traditional methods, BERT4Rec understands that order matters and can capture complex sequential patterns like "after watching trilogy part 1, likely to watch part 2".

Parameters

Feature Configuration

Feature Columns (required) List of columns to use: must include user_id, item_id, and optionally timestamp.

User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.

Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item in sequences.

Timestamp Column (optional) Name of the column containing timestamps for ordering interactions. If provided, sequences are ordered by time. If not provided, assumes data is already ordered.

  • Essential for accurate sequential modeling
  • Can be Unix timestamp, datetime, or ordinal values
  • Used to sort interactions chronologically
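A small sketch of how per-user sequences might be assembled from raw interaction rows, assuming the column names above (`build_sequences` is illustrative, not part of the product API):

```python
from collections import defaultdict

def build_sequences(interactions, user_col="user_id",
                    item_col="item_id", time_col="timestamp"):
    """Group interactions per user and order each sequence by timestamp.

    interactions: iterable of dicts with user/item/timestamp keys.
    If time_col is missing from a row, input order is preserved
    (Python's sort is stable).
    """
    per_user = defaultdict(list)
    for row in interactions:
        per_user[row[user_col]].append(row)
    return {
        user: [r[item_col] for r in sorted(rows, key=lambda r: r.get(time_col, 0))]
        for user, rows in per_user.items()
    }
```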

Model-Specific Parameters

Hidden Size (default: 64) Dimension of transformer hidden layers. Controls model capacity.

  • 32-64: Small model, fast, good for small datasets
  • 64-128: Balanced (default range)
  • 128-256: Large model, better patterns, needs more data
  • 256+: Very large, requires massive datasets

Number of Attention Heads (default: 2) Number of parallel attention heads in transformer.

  • 1-2: Simple attention, faster (default)
  • 2-4: Better pattern capture
  • 4-8: Complex patterns, needs large hidden_size
  • Must divide hidden_size evenly

Number of Layers (default: 2) Number of transformer blocks stacked.

  • 1-2: Simple model, fast (default)
  • 2-4: Standard depth
  • 4-8: Deep model, complex patterns, slower
  • More layers = more capacity but more data needed

Max Sequence Length (default: 50) Maximum number of recent items to consider in sequence.

  • 10-20: Short-term patterns only
  • 20-50: Good balance (default range)
  • 50-100: Long-term patterns
  • 100+: Very long context, expensive
  • Longer = more memory and computation

Mask Probability (default: 0.2) Probability of masking items during training (BERT-style).

  • 0.1-0.15: Conservative masking
  • 0.15-0.2: Standard (default range)
  • 0.2-0.3: Aggressive masking
  • Higher = more training signal but harder task

Batch Size (default: 256) Number of sequences processed together.

  • 64-128: Small batch, less memory
  • 128-256: Balanced (default)
  • 256-512: Large batch, faster training, more memory
  • Larger = more stable but more memory

Number of Epochs (default: 10) Number of training passes through data.

  • 5-10: Quick training (default)
  • 10-20: Standard training
  • 20-50: Extensive training, risk of overfitting
  • Monitor validation metrics to avoid overfitting

Learning Rate (default: 0.001) Step size for optimizer (Adam).

  • 0.0001-0.0005: Conservative, stable
  • 0.0005-0.001: Balanced (default range)
  • 0.001-0.005: Aggressive, faster convergence
  • Start higher and reduce if training becomes unstable

Top-K Recommendations (default: 10) Number of next items to recommend.

  • 5-10: Focused next-item predictions
  • 10-20: Standard recommendation lists
  • 20-50: Broad exploration
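The parameters above, with their documented defaults, could be collected into a single configuration like this (the key names are hypothetical; the actual training interface may spell them differently):

```python
# Hypothetical key names mirroring the documented defaults.
bert4rec_defaults = {
    "hidden_size": 64,       # transformer hidden dimension
    "num_heads": 2,          # must divide hidden_size evenly
    "num_layers": 2,         # stacked transformer blocks
    "max_seq_len": 50,       # most recent items kept per sequence
    "mask_prob": 0.2,        # fraction of items masked during training
    "batch_size": 256,       # sequences per training step
    "num_epochs": 10,        # passes through the data
    "learning_rate": 0.001,  # Adam step size
    "top_k": 10,             # recommendations returned
}

# The attention constraint from this section, checked up front.
assert bert4rec_defaults["hidden_size"] % bert4rec_defaults["num_heads"] == 0
```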

Configuration Tips

Dataset Size Considerations

  • Small (<10k sequences): Hidden: 32, Layers: 1-2, may not have enough data
  • Medium (10k-100k): Hidden: 64, Layers: 2, Heads: 2 (default)
  • Large (100k-1M): Hidden: 128, Layers: 3, Heads: 4
  • Very Large (>1M): Hidden: 256, Layers: 4-6, Heads: 8
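The dataset-size table above can be expressed as a small helper. Thresholds and values mirror the guidance here; the small-dataset head count is an assumption (the table leaves it unspecified), and all values are starting points rather than tuned results:

```python
def config_for_dataset(num_sequences):
    """Suggest starting capacity settings by number of training sequences."""
    if num_sequences < 10_000:
        # Small: may not have enough data for BERT4Rec at all.
        return {"hidden_size": 32, "num_layers": 1, "num_heads": 2}
    if num_sequences < 100_000:
        return {"hidden_size": 64, "num_layers": 2, "num_heads": 2}
    if num_sequences < 1_000_000:
        return {"hidden_size": 128, "num_layers": 3, "num_heads": 4}
    return {"hidden_size": 256, "num_layers": 4, "num_heads": 8}
```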

Parameter Tuning Guidance

Start with defaults, then adjust:

  1. Model Capacity:

    • If underfitting: Increase hidden_size, layers, or heads
    • If overfitting: Decrease hidden_size or layers, add dropout
  2. Sequence Length:

    • Short sessions (e.g., shopping): 20-30
    • Medium sessions (e.g., browsing): 30-50
    • Long sessions (e.g., binge-watching): 50-100
  3. Training:

    • Start: 10 epochs, 0.001 learning rate, 256 batch size
    • If loss plateaus: Reduce learning rate, increase epochs
    • If overfitting: Reduce epochs, increase regularization
  4. Architecture:

    • hidden_size should be divisible by num_heads
    • More layers = more capacity but diminishing returns
    • Balance model size with dataset size
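The divisibility constraint can be validated before training starts (an illustrative helper, not part of the product):

```python
def heads_fit(hidden_size, num_heads):
    """Check the architecture constraint: each attention head attends over
    a hidden_size // num_heads slice, so the division must be exact.
    Returns the per-head dimension."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads
```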

When to Choose This Over Alternatives

  • vs. Matrix Factorization: Choose this for sequential patterns and "what's next"
  • vs. Item-Based KNN: Choose this when order/time matters
  • vs. Content-Based: Choose this for behavior-based sequential prediction
  • vs. Association Rules: Choose this for personalized sequential patterns
  • Best for: Session-based, next-item prediction, temporal patterns, binge-watching

Common Issues and Solutions

Insufficient Sequential Data

Issue: Users have too few interactions to form meaningful sequences.

Solution:

  • Reduce max_seq_len to work with shorter sequences
  • Combine sessions from multiple users
  • Use simpler model (Item-Based KNN)
  • Collect more interaction data before using BERT4Rec

Overfitting

Issue: Great training performance, poor test performance.

Solution:

  • Reduce model capacity (hidden_size, num_layers)
  • Increase mask_prob (e.g., to 0.3)
  • Use fewer epochs (5-10)
  • Increase dropout (if configurable)
  • Ensure sufficient training data

Slow Training

Issue: Training takes too long.

Solution:

  • Reduce hidden_size (try 32-64)
  • Reduce num_layers (try 2)
  • Reduce max_seq_len (try 30)
  • Increase batch_size (if memory allows)
  • Use GPU if available
  • Sample data for initial experiments

Cold Start Problem

Issue: New users have no sequence history.

Solution:

  • Use popularity-based recommendations initially
  • Quick preference collection (rate a few items)
  • Fall back to content-based or hybrid
  • Require a minimum sequence length before BERT4Rec kicks in

Memory Issues

Issue: Out of memory during training.

Solution:

  • Reduce batch_size (try 128 or 64)
  • Reduce hidden_size (try 32-64)
  • Reduce max_seq_len (try 30)
  • Use gradient accumulation
  • Use mixed precision training
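Gradient accumulation keeps the effective batch size while lowering peak memory: gradients are summed over several micro-batches before one optimizer step. A sketch of how the split might be planned (illustrative helper; the actual training loop would apply these numbers):

```python
def accumulation_plan(target_batch_size, max_micro_batch):
    """Split one effective batch into micro-batches that fit in memory.

    Gradients are summed over accum_steps micro-batches before a single
    optimizer step, so training behaves like the larger batch while peak
    memory follows the micro-batch size.
    """
    accum_steps = -(-target_batch_size // max_micro_batch)  # ceiling division
    micro_batch = -(-target_batch_size // accum_steps)
    return micro_batch, accum_steps
```

For example, an effective batch of 256 with memory for only 64 sequences at a time yields 4 accumulation steps of 64.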

Poor Test Metrics

Issue: Low Hit Rate@K or NDCG.

Solution:

  • Ensure proper temporal split (train past, test future)
  • Check sequence quality (enough interactions per user)
  • Tune hyperparameters (hidden_size, layers)
  • Increase training epochs
  • Ensure timestamp ordering is correct
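The temporal split and the Hit Rate@K check can be sketched as follows (illustrative helpers, assuming a leave-last-out protocol with the held-out items at the end of each time-ordered sequence):

```python
def temporal_split(sequence, n_test=1):
    """Leave-last-out split: train on everything except the final n_test
    items, test on those (train on the past, test on the future)."""
    if len(sequence) <= n_test:
        return sequence, []  # too short to hold anything out
    return sequence[:-n_test], sequence[-n_test:]

def hit_rate_at_k(recommended, held_out_item, k=10):
    """1.0 if the held-out next item appears in the top-k list, else 0.0."""
    return 1.0 if held_out_item in recommended[:k] else 0.0
```

Averaging `hit_rate_at_k` over all test users gives Hit Rate@K for the run.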

Example Use Cases

Video Streaming "Next Episode"

Scenario: Streaming service predicting next video users will watch

Configuration:

  • Hidden: 128, Heads: 4, Layers: 3
  • Max sequence: 50 (capture binge-watching patterns)
  • 20 epochs, batch size 256
  • Top-10 recommendations
  • Use viewing timestamp

Why: Clear sequential patterns (series episodes, related content), session-based viewing

E-commerce Session-Based

Scenario: Online store recommending next product during browsing session

Configuration:

  • Hidden: 64, Heads: 2, Layers: 2
  • Max sequence: 20 (shorter shopping sessions)
  • 15 epochs, batch size 256
  • Top-5 recommendations
  • Use click timestamp

Why: Shopping sessions have order (browse categories, compare products), predict purchase intent

Music Playlist Prediction

Scenario: Music app predicting next song user will play

Configuration:

  • Hidden: 96, Heads: 3, Layers: 3
  • Max sequence: 30 (recent listening history)
  • 25 epochs, batch size 512
  • Top-15 recommendations
  • Use play timestamp

Why: Music listening has strong sequential patterns (genre, mood, artist), temporal context matters
