Content-Based Filtering (TF-IDF)
Content-based recommendation using TF-IDF on item descriptions/features. Recommends items with similar content to user's history.
When to use:
- Have rich item descriptions or metadata
- Cold start for new items (brand-new users still need a few interactions or stated preferences; see Common Issues and Solutions)
- Items have textual features (descriptions, tags, categories)
- Need recommendations that do not depend on other users' interaction data
Strengths: No cold start problem for new items, explainable, doesn't need many users, privacy-friendly
Weaknesses: Limited discovery (recommends similar content), requires good item descriptions, can create filter bubbles
How it Works
Content-Based Filtering with TF-IDF analyzes the textual content of items to find similarities. It uses Term Frequency-Inverse Document Frequency (TF-IDF) to convert item descriptions into numerical vectors, where:
- TF (Term Frequency): How often a word appears in an item's description
- IDF (Inverse Document Frequency): How rare a word is across all items; words that appear in few descriptions get a higher weight than words that appear in most of them
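To make the two terms concrete, here is a minimal from-scratch sketch in plain Python. It assumes whitespace tokenization and a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, which is one common variant and not necessarily the exact formula this model uses. The function name `tfidf_vectors` and the sample descriptions are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF is the term count normalized by document length.
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors

docs = [
    "space adventure with robots",
    "romantic comedy in paris",
    "space robots fight in space",
]
vectors = tfidf_vectors(docs)
```

In the second description, "paris" occurs in only one item and so outweighs "in", which occurs in two. In the third, "space" outweighs "robots" because it appears twice in that description.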
For each user, the algorithm:
- Creates a user profile from items they've interacted with (aggregates their TF-IDF vectors)
- Computes cosine similarity between user profile and all candidate items
- Ranks items by similarity to user's preferences
- Returns top-K most similar items
Key Concept: If you liked items with certain features/descriptions, you'll likely enjoy other items with similar content.
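The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the model's actual implementation: item vectors are given as sparse dicts with made-up weights, the user profile is a plain average of the history vectors, and already-seen items are excluded from the candidates. The names `build_profile` and `recommend` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_profile(item_vectors, history):
    """Step 1: average the vectors of the items the user interacted with."""
    profile = {}
    for item_id in history:
        for term, weight in item_vectors[item_id].items():
            profile[term] = profile.get(term, 0.0) + weight
    n = len(history)
    return {t: w / n for t, w in profile.items()} if n else {}

def recommend(item_vectors, history, k=10):
    """Steps 2-4: score unseen items against the profile, rank, keep top K."""
    profile = build_profile(item_vectors, history)
    seen = set(history)  # assumption: do not re-recommend seen items
    scored = [
        (item_id, cosine(profile, vec))
        for item_id, vec in item_vectors.items()
        if item_id not in seen
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up TF-IDF weights for four items.
item_vectors = {
    "scifi_1": {"space": 0.8, "robots": 0.6},
    "scifi_2": {"space": 0.7, "aliens": 0.7},
    "romance_1": {"love": 0.9, "paris": 0.4},
    "scifi_3": {"robots": 0.9, "space": 0.3},
}
recs = recommend(item_vectors, history=["scifi_1"], k=2)
```

With this toy data, a user whose history contains only scifi_1 gets scifi_3 first and scifi_2 second, while romance_1 shares no terms with the profile and scores zero.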
Parameters
Feature Configuration
Feature Columns (required) List of columns to use: must include user_id, item_id, and content/description.
User Column (default: "user_id", required) Name of the column containing user identifiers. Each unique value represents a different user.
Item Column (default: "item_id", required) Name of the column containing item identifiers. Each unique value represents a different item to recommend.
Content Column (default: "description", required) Name of the column containing item descriptions or features. This is the text content used to compute similarities.
- Can be product descriptions, article text, movie plots, song metadata
- Better quality content = better recommendations
- Can concatenate multiple fields (e.g., "title + description + tags")
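If the content lives in several fields, one way to prepare the content column is to join them into a single string before training. The sketch below assumes each item is a dict and that tags may be stored as a list; the helper name `build_content` and the field names are illustrative, not part of the tool.

```python
def build_content(item, fields=("title", "description", "tags")):
    """Join the chosen fields into one text string, skipping empty ones."""
    parts = []
    for field in fields:
        value = item.get(field)
        if not value:
            continue  # missing or empty field
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)  # flatten tag lists
        parts.append(str(value).strip())
    return " ".join(parts)

item = {
    "item_id": "p1",
    "title": "Trail Running Shoes",
    "description": "Lightweight shoes with grippy soles.",
    "tags": ["running", "outdoor"],
}
content = build_content(item)
```

The resulting string would then be stored in the column named by Content Column.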
Model-Specific Parameters
Top-K Recommendations (default: 10) Number of items to recommend for each user.
- 5-10: Focused, high-relevance recommendations
- 10-20: Standard recommendation lists
- 20-50: For exploration and discovery
Configuration Tips
Dataset Size Considerations
- Small (<1k items): Works well, but limited diversity
- Medium (1k-100k items): Ideal range, good performance
- Large (100k-1M items): Good, but TF-IDF vectors can be large
- Very Large (>1M items): Consider Embeddings Similarity for better scaling
Parameter Tuning Guidance
Improve content quality:
- Clean text (remove HTML, special characters)
- Include relevant metadata (categories, tags, attributes)
- Combine multiple text fields
- Use item titles + descriptions + tags
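A minimal text-cleaning sketch for the first point, using only the standard library. It assumes English, ASCII-oriented content (the character filter would drop accented letters) and uses a simple regular expression for tags, which is adequate for light markup but is not a full HTML parser. The name `clean_text` is illustrative.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")           # anything that looks like a tag
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")  # assumption: ASCII text only
SPACE_RE = re.compile(r"\s+")

def clean_text(raw):
    """Strip tags, decode entities, lowercase, drop special characters."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

cleaned = clean_text("<p>Waterproof &amp; breathable <b>hiking</b> jacket!</p>")
```

For the sample input, this leaves only the four content words, lowercased and separated by single spaces.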
Handle vocabulary:
- Remove stop words for cleaner signals
- Use stemming/lemmatization for word variations
- Consider n-grams for multi-word concepts
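Stop-word removal and n-grams can be sketched as below. The stop-word set is a tiny illustrative list, not a recommended one, and stemming/lemmatization is left out because it normally relies on a dedicated NLP library. The function name `tokenize` and its parameters are assumptions for this example.

```python
# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "with", "to"}

def tokenize(text, ngram_range=(1, 2), stop_words=STOP_WORDS):
    """Drop stop words, then emit every n-gram in the given size range."""
    words = [w for w in text.lower().split() if w not in stop_words]
    low, high = ngram_range
    tokens = []
    for n in range(low, high + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

tokens = tokenize("the history of machine learning")
```

Here the bigram "machine learning" becomes its own term, so items about that topic match each other more strongly than items that merely mention "machine" or "learning". Because stop words are removed first, "history machine" is also produced as a bigram.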
Monitor diversity:
- Content-based can create filter bubbles
- Apply diversity post-processing
- Combine with collaborative filtering (Hybrid model)
Optimize for scale:
- Limit vocabulary size (top 10k-50k terms)
- Use sparse matrix operations
- Pre-compute item similarities
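Pre-computing item similarities means building, once and offline, a table of each item's nearest neighbours so that serving becomes a lookup. The sketch below is a brute-force version that compares every pair, which is only practical for small catalogues; larger ones would need sparse matrix operations or approximate search. The name `precompute_neighbors` and the toy vectors are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def precompute_neighbors(item_vectors, top_n=10):
    """Map each item to its top_n most similar other items (O(n^2) pairs)."""
    neighbors = {}
    for item_id, vec in item_vectors.items():
        sims = [
            (other_id, cosine(vec, other_vec))
            for other_id, other_vec in item_vectors.items()
            if other_id != item_id
        ]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[item_id] = sims[:top_n]
    return neighbors

item_vectors = {
    "a": {"space": 1.0, "robots": 1.0},
    "b": {"space": 1.0, "aliens": 1.0},
    "c": {"love": 1.0, "paris": 1.0},
}
neighbors = precompute_neighbors(item_vectors, top_n=1)
```

Items "a" and "b" share one of two terms and end up as each other's nearest neighbour, while "c" has no overlap with either.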
When to Choose This Over Alternatives
- vs. Collaborative Filtering: Choose this for cold start handling and when interaction data is sparse
- vs. Embeddings Similarity: Choose this for a simpler, faster, more interpretable approach
- vs. Hybrid: Choose this when you don't have sufficient interaction data yet
- vs. Item-Based KNN: Choose this when you have better item descriptions than interaction data
- vs. BERT4Rec: Choose this for non-sequential, content-focused recommendations
Common Issues and Solutions
Cold Start Problem (New Users)
Issue: New users have no interaction history.
Solution:
- Collect initial preferences through questionnaire
- Use demographic or contextual information
- Show popular items until interactions accumulate
- Ask users to rate/like a few initial items
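The "show popular items until interactions accumulate" option can be sketched as a simple fallback rule. This assumes interactions are available as (user_id, item_id) pairs; the `min_history` threshold, the function names, and the `personalized` callback standing in for the content-based recommender are all hypothetical.

```python
from collections import Counter

def popular_items(interactions, k=10):
    """Most frequently interacted-with items across all users."""
    counts = Counter(item_id for _, item_id in interactions)
    return [item_id for item_id, _ in counts.most_common(k)]

def recommend_with_fallback(user_id, interactions, personalized, k=10,
                            min_history=3):
    """Use popularity until the user has at least min_history interactions."""
    history = [item for user, item in interactions if user == user_id]
    if len(history) < min_history:
        return popular_items(interactions, k)
    return personalized(user_id, k)

interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u3", "a"), ("u3", "b"),
]

def personalized(user_id, k):
    # Stand-in for the content-based recommender.
    return ["x", "y"][:k]

new_user_recs = recommend_with_fallback("u9", interactions, personalized, k=2)
active_user_recs = recommend_with_fallback("u1", interactions, personalized, k=2)
```

A user with no history ("u9") gets the two most popular items, while "u1", who has three interactions, is routed to the personalized recommender.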
Filter Bubbles
Issue: Users only see items similar to what they've already seen.
Solution:
- Add diversity objectives (e.g., MMR - Maximal Marginal Relevance)
- Mix content-based with collaborative signals (Hybrid model)
- Include exploration (random or trending items)
- Apply category diversification
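As an illustration of the MMR option, the sketch below greedily re-ranks an already-scored candidate list. Each step picks the item that maximizes lambda * relevance minus (1 - lambda) * (highest similarity to anything already selected). The function name `mmr_rerank`, the `lambda_` default, and the toy scores are assumptions for this example, not settings exposed by the model.

```python
def mmr_rerank(relevance, similarity, k=10, lambda_=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    relevance:  {item_id: score against the user profile}
    similarity: function (item_a, item_b) -> item-to-item similarity
    lambda_:    1.0 means pure relevance; lower values favour diversity
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(item):
            # Penalize items that resemble something already selected.
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lambda_ * relevance[item] - (1 - lambda_) * max_sim
        best = max(sorted(candidates), key=mmr_score)  # sorted: stable ties
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = {"scifi_a": 0.9, "scifi_b": 0.85, "drama_a": 0.6}
pair_sim = {
    frozenset(("scifi_a", "scifi_b")): 0.95,
    frozenset(("scifi_a", "drama_a")): 0.10,
    frozenset(("scifi_b", "drama_a")): 0.10,
}

def similarity(a, b):
    return pair_sim[frozenset((a, b))]

reranked = mmr_rerank(relevance, similarity, k=3, lambda_=0.5)
```

With lambda_ = 0.5, the near-duplicate scifi_b is pushed below drama_a even though its raw relevance is higher. With lambda_ = 1.0, the order falls back to plain relevance.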
Poor Content Quality
Issue: Item descriptions are sparse, generic, or low-quality.
Solution:
- Enrich content with external data sources
- Use user-generated tags/reviews
- Combine multiple metadata fields
- Consider switching to Embeddings Similarity for semantic understanding
- Fall back to collaborative filtering when content is insufficient
Over-specialization
Issue: Recommendations are too narrow or predictable.
Solution:
- Reduce similarity threshold (allow more diverse items)
- Weight recent interactions less heavily
- Include serendipitous recommendations
- Combine with collaborative filtering
Scalability with Large Vocabulary
Issue: TF-IDF vectors become very large with extensive text.
Solution:
- Limit vocabulary to top N terms (10k-50k)
- Use min/max document frequency thresholds
- Apply dimensionality reduction
- Consider Embeddings Similarity for dense representations
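The first two options can be combined in a single vocabulary-selection pass, sketched below. It assumes documents are already tokenized and that "top N" is ranked by document frequency; other rankings, such as total term count, are equally plausible. The function name `select_vocabulary` and its parameter names are illustrative.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, min_df=2, max_df_ratio=0.8,
                      max_terms=10000):
    """Keep terms that are neither too rare nor too common, capped at max_terms."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenized_docs for t in set(tokens))
    kept = [
        (term, count)
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    ]
    # Rank by document frequency (assumption), ties broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept[:max_terms]]

tokenized_docs = [
    ["red", "shoe", "sale"],
    ["blue", "shoe", "sale"],
    ["red", "hat", "sale"],
    ["green", "scarf", "sale"],
]
vocabulary = select_vocabulary(tokenized_docs, min_df=2, max_df_ratio=0.8)
```

In this toy corpus "sale" appears in every document and is dropped as uninformative, the one-off terms fall below `min_df`, and only "red" and "shoe" remain.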
Language and Domain Issues
Issue: TF-IDF matches exact terms, so it struggles with synonyms and multi-language content.
Solution:
- Apply stemming/lemmatization
- Use domain-specific stop words
- Consider Embeddings Similarity for semantic understanding
- Handle multiple languages separately or use translation
Example Use Cases
News Article Recommendations
Scenario: News platform with 500k articles that needs immediate recommendations for new content
Configuration:
- Content column: article_text (title + body + category)
- Top-20 recommendations
- No cold start delay
Why: New articles are published constantly and need instant recommendations; the content is rich and the results are explainable
Job Recommendations
Scenario: Job board with 100k job postings that wants to match postings to user preferences
Configuration:
- Content column: job_description (title + description + skills + location)
- Top-15 recommendations
- Combine with user resume/profile
Why: Job descriptions are rich, skills and requirements need to be matched, and interaction data is limited initially
Product Recommendations (Cold Start)
Scenario: New e-commerce site with 10k products and few user interactions
Configuration:
- Content column: product_details (title + description + category + brand + attributes)
- Top-10 recommendations
- Bootstrap recommendations until sufficient interaction data is available
Why: The platform is new and needs recommendations from day one, and detailed product information is available