Content-Based Filtering (TF-IDF)

Content-based recommendation using TF-IDF on item descriptions/features. Recommends items whose content is similar to the user's interaction history.

When to use:

  • Have rich item descriptions or metadata
  • Cold start: new users or new items
  • Items have textual features (descriptions, tags, categories)
  • Need recommendations without user interaction data

Strengths: No cold start problem for items, explainable, doesn't need many users, privacy-friendly

Weaknesses: Limited discovery (recommends similar content), requires good item descriptions, can create filter bubbles

How it Works

Content-Based Filtering with TF-IDF analyzes the textual content of items to find similarities. It uses Term Frequency-Inverse Document Frequency (TF-IDF) to convert item descriptions into numerical vectors, where:

  • TF (Term Frequency): How often a word appears in an item's description
  • IDF (Inverse Document Frequency): How unique a word is across all items
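The two quantities can be computed in a few lines of plain Python. This sketch uses the raw textbook formulation with a made-up three-item corpus; libraries such as scikit-learn use smoothed variants, so exact values differ slightly:

```python
import math

# Toy corpus of item descriptions (hypothetical examples).
docs = [
    "red cotton shirt",
    "blue cotton shirt",
    "red leather boots",
]

def tf(term, doc_tokens):
    # Term frequency: occurrences of `term` divided by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # Inverse document frequency: log of (num docs / docs containing term).
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    return math.log(len(corpus_tokens) / df)

tokenized = [d.split() for d in docs]
# "cotton" appears in 2 of 3 items, so its IDF is lower than
# "leather", which appears in only 1 of 3.
print(idf("cotton", tokenized))   # log(3/2) ≈ 0.405
print(idf("leather", tokenized))  # log(3/1) ≈ 1.099
```

The TF-IDF weight of a term in a document is the product `tf * idf`, so rare, document-specific terms dominate the resulting vectors.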

For each user, the algorithm:

  1. Creates a user profile from items they've interacted with (aggregates their TF-IDF vectors)
  2. Computes cosine similarity between user profile and all candidate items
  3. Ranks items by similarity to user's preferences
  4. Returns top-K most similar items

Key Concept: If you liked items with certain features/descriptions, you'll likely enjoy other items with similar content.
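The four steps above can be sketched with scikit-learn. Item IDs and descriptions here are made up, and the pipeline is a minimal illustration rather than this tool's implementation:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item catalogue.
item_ids = ["i1", "i2", "i3", "i4"]
descriptions = [
    "action movie with car chases and explosions",
    "romantic comedy set in paris",
    "fast action thriller with car chases",
    "documentary about french cooking",
]

vectorizer = TfidfVectorizer()
item_vectors = vectorizer.fit_transform(descriptions)  # sparse (n_items, n_terms)

# Step 1: user profile = mean of TF-IDF vectors of items the user liked.
liked = [0]  # the user interacted with item "i1"
profile = np.asarray(item_vectors[liked].mean(axis=0))

# Step 2: cosine similarity between the profile and every item.
scores = cosine_similarity(profile, item_vectors).ravel()

# Steps 3-4: rank and return top-K, excluding already-seen items.
scores[liked] = -1.0
top_k = np.argsort(scores)[::-1][:2]
print([item_ids[i] for i in top_k])  # the action thriller ranks first
```

Because "i3" shares the terms "action", "car", and "chases" with the liked item, it gets the highest cosine similarity to the user profile.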

Parameters

Feature Configuration

Feature Columns (required) List of columns to use: must include user_id, item_id, and content/description.

User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.

Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.

Content Column (default: "description", required) Name of the column containing item descriptions or features. This is the text content used to compute similarities.

  • Can be product descriptions, article text, movie plots, song metadata
  • Better quality content = better recommendations
  • Can concatenate multiple fields (e.g., "title + description + tags")
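One way to build such a combined content column is a simple concatenation step, sketched here with pandas (the column names are illustrative):

```python
import pandas as pd

# Hypothetical catalogue with separate metadata fields.
items = pd.DataFrame({
    "item_id": ["i1", "i2"],
    "title": ["Red Shirt", "Blue Boots"],
    "description": ["soft cotton shirt", "leather ankle boots"],
    "tags": ["clothing casual", "footwear leather"],
})

# Build a single content column by joining the text fields;
# fillna guards against missing metadata.
text_cols = ["title", "description", "tags"]
items["content"] = items[text_cols].fillna("").agg(" ".join, axis=1)
print(items.loc[0, "content"])  # "Red Shirt soft cotton shirt clothing casual"
```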

Model-Specific Parameters

Top-K Recommendations (default: 10) Number of items to recommend for each user.

  • 5-10: Focused, high-relevance recommendations
  • 10-20: Standard recommendation lists
  • 20-50: For exploration and discovery

Configuration Tips

Dataset Size Considerations

  • Small (<1k items): Works well, but limited diversity
  • Medium (1k-100k items): Ideal range, good performance
  • Large (100k-1M items): Good, but TF-IDF vectors can be large
  • Very Large (>1M items): Consider Embeddings Similarity for better scaling

Parameter Tuning Guidance

  1. Improve content quality:

    • Clean text (remove HTML, special characters)
    • Include relevant metadata (categories, tags, attributes)
    • Combine multiple text fields
    • Use item titles + descriptions + tags
  2. Handle vocabulary:

    • Remove stop words for cleaner signals
    • Use stemming/lemmatization for word variations
    • Consider n-grams for multi-word concepts
  3. Monitor diversity:

    • Content-based can create filter bubbles
    • Apply diversity post-processing
    • Combine with collaborative filtering (Hybrid model)
  4. Optimize for scale:

    • Limit vocabulary size (top 10k-50k terms)
    • Use sparse matrix operations
    • Pre-compute item similarities
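Several of the vocabulary and scale levers above map directly onto parameters of scikit-learn's TfidfVectorizer; the values below are illustrative starting points, not recommendations:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of the tuning levers above with a tiny made-up corpus.
vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common words for cleaner signals
    ngram_range=(1, 2),     # unigrams + bigrams capture multi-word concepts
    max_features=50_000,    # cap vocabulary size for large catalogues
    min_df=2,               # ignore terms appearing in fewer than 2 items
    max_df=0.8,             # ignore terms appearing in >80% of items
)
docs = [
    "machine learning for recommender systems",
    "deep learning for computer vision",
    "machine learning in production systems",
]
matrix = vectorizer.fit_transform(docs)
# "learning" occurs in every document, so max_df removes it, while the
# bigram "machine learning" (2 of 3 documents) survives min_df.
print(sorted(vectorizer.vocabulary_))
```

The output stays a SciPy sparse matrix, so downstream cosine-similarity computations can use sparse operations throughout.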

When to Choose This Over Alternatives

  • vs. Collaborative Filtering: Choose this for cold start handling and when interaction data is sparse
  • vs. Embeddings Similarity: Choose this for a simpler, faster, more interpretable approach
  • vs. Hybrid: Choose this when you don't have sufficient interaction data yet
  • vs. Item-Based KNN: Choose this when you have better item descriptions than interaction data
  • vs. BERT4Rec: Choose this for non-sequential, content-focused recommendations

Common Issues and Solutions

Cold Start Problem (New Users)

Issue: New users have no interaction history.

Solution:

  • Collect initial preferences through questionnaire
  • Use demographic or contextual information
  • Show popular items until interactions accumulate
  • Ask users to rate/like a few initial items

Filter Bubbles

Issue: Users only see items similar to what they've already seen.

Solution:

  • Add diversity objectives (e.g., MMR - Maximal Marginal Relevance)
  • Mix content-based with collaborative signals (Hybrid model)
  • Include exploration (random or trending items)
  • Apply category diversification
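MMR re-ranking, mentioned above, greedily picks items that are relevant to the user but dissimilar to what has already been selected. This is a generic NumPy sketch of the technique, not this tool's implementation:

```python
import numpy as np

def mmr(scores, item_sims, k, lam=0.7):
    """Maximal Marginal Relevance re-ranking (sketch).

    scores:    relevance of each candidate to the user profile, shape (n,)
    item_sims: pairwise item-item similarity matrix, shape (n, n)
    lam:       trade-off; 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    candidates = list(range(len(scores)))
    while candidates and len(selected) < k:
        if selected:
            # Penalise similarity to anything already picked.
            redundancy = item_sims[np.ix_(candidates, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        mmr_scores = lam * scores[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(mmr_scores))]
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate high-scoring items (0 and 1) and a distinct item (2):
scores = np.array([0.9, 0.88, 0.6])
sims = np.array([[1.0, 0.95, 0.1],
                 [0.95, 1.0, 0.1],
                 [0.1, 0.1, 1.0]])
print(mmr(scores, sims, k=2, lam=0.5))  # picks 0, then the diverse item 2
```

With `lam=1.0` the same call degenerates to plain relevance ranking and returns the two near-duplicates.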

Poor Content Quality

Issue: Item descriptions are sparse, generic, or low-quality.

Solution:

  • Enrich content with external data sources
  • Use user-generated tags/reviews
  • Combine multiple metadata fields
  • Consider switching to Embeddings Similarity for semantic understanding
  • Fall back to collaborative filtering when content is insufficient

Over-specialization

Issue: Recommendations are too narrow or predictable.

Solution:

  • Reduce similarity threshold (allow more diverse items)
  • Weight recent interactions less heavily
  • Include serendipitous recommendations
  • Combine with collaborative filtering

Scalability with Large Vocabulary

Issue: TF-IDF vectors become very large with extensive text.

Solution:

  • Limit vocabulary to top N terms (10k-50k)
  • Use min/max document frequency thresholds
  • Apply dimensionality reduction
  • Consider Embeddings Similarity for dense representations
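For the dimensionality-reduction route, one common option is truncated SVD (latent semantic analysis), which works directly on the sparse TF-IDF matrix. A minimal sketch with made-up descriptions:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "wireless bluetooth headphones with noise cancelling",
    "noise cancelling over ear headphones",
    "stainless steel kitchen knife set",
    "professional chef knife with steel blade",
]
tfidf = TfidfVectorizer().fit_transform(docs)  # sparse, one column per term
# Project the sparse TF-IDF matrix onto 2 latent dimensions (LSA).
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)             # dense (n_items, 2)
print(reduced.shape)                           # (4, 2)
```

Similarities are then computed on the small dense vectors instead of the full vocabulary-sized sparse ones.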

Language and Domain Issues

Issue: TF-IDF struggles with synonyms and multi-language content.

Solution:

  • Apply stemming/lemmatization
  • Use domain-specific stop words
  • Consider Embeddings Similarity for semantic understanding
  • Handle multiple languages separately or use translation

Example Use Cases

News Article Recommendations

Scenario: News platform with 500k articles, need immediate recommendations for new content

Configuration:

  • Content column: article_text (title + body + category)
  • Top-20 recommendations
  • No cold start delay

Why: New articles are published constantly; recommendations must be instant; content is rich and recommendations are explainable

Job Recommendations

Scenario: Job board with 100k job postings, want to match jobs to user preferences

Configuration:

  • Content column: job_description (title + description + skills + location)
  • Top-15 recommendations
  • Combine with user resume/profile

Why: Rich job descriptions; need to match skills and requirements; limited interaction data initially

Product Recommendations (Cold Start)

Scenario: New e-commerce site with 10k products, few user interactions

Configuration:

  • Content column: product_details (title + description + category + brand + attributes)
  • Top-10 recommendations
  • Bootstrap recommendations until sufficient interaction data

Why: New platform; recommendations needed from day one; detailed product information available
