Apriori
Classic breadth-first algorithm for association rule mining
Finds frequent itemsets with a level-wise, breadth-first search: candidates of size k are generated from frequent (k-1)-itemsets, and infrequent ones are pruned at each pass.
When to Use Apriori
- Learning association analysis (most intuitive)
- Small to medium datasets (<100k transactions)
- Need to understand how the algorithm works
- Educational purposes
Strengths
- Easy to understand
- Straightforward implementation
- Well-documented with extensive resources
- Guaranteed to find all patterns above threshold
Weaknesses
- Can be slow on large datasets
- Generates many candidate itemsets
- Memory intensive for low support values
- Multiple database scans required
How it Works
- Find all frequent 1-itemsets (single items above min_support)
- Generate candidate 2-itemsets from frequent 1-itemsets
- Keep only frequent 2-itemsets
- Repeat: Generate k-itemsets from (k-1)-itemsets
- Stop when no more frequent itemsets found
- Generate rules from frequent itemsets
Key Principle: If an itemset is infrequent, all its supersets are also infrequent (Apriori property). This allows pruning of candidate itemsets without checking them against the database.
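The level-wise search and the pruning step above can be sketched in plain Python. This is a teaching sketch, not an optimized implementation, and the toy basket data is made up for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    n = len(transactions)
    # Step 1: candidate 1-itemsets are all distinct items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # One scan of the database per level to count candidate supports
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Generate (k+1)-candidates by joining frequent k-itemsets, and prune any
        # candidate that has an infrequent k-subset (the Apriori property)
        prev = set(survivors)
        current = set()
        for a in prev:
            for b in prev:
                cand = a | b
                if len(cand) == k + 1 and all(frozenset(s) in prev for s in combinations(cand, k)):
                    current.add(cand)
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"bread", "milk", "butter"}, {"bread", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "butter", "eggs"},
)]
freq = apriori(transactions, min_support=0.5)
# freq maps each frequent itemset to its support,
# e.g. freq[frozenset({"bread"})] == 0.75
```

Note how {bread, eggs} is counted, found infrequent, and then never extended: every superset of it is skipped without touching the database.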
Parameters
All association algorithms share these common parameters:
Data Format
Input Format: 'long' or 'wide'
How your transaction data is structured:
Wide Format:
- Each column represents one item
- Each row is a transaction
- Values are 1 (item present) or 0 (item absent)
- Example:
TransactionID | Bread | Milk | Eggs | Butter
1             | 1     | 1    | 0    | 1
2             | 0     | 1    | 1    | 0
Long Format:
- Each row is one item in a transaction
- Requires Transaction ID column to group items
- More natural for real-world data
- Example:
TransactionID | Item
1             | Bread
1             | Milk
1             | Butter
2             | Milk
2             | Eggs
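Converting between the two layouts is a common preprocessing step. A minimal sketch in plain Python, using the example transactions above:

```python
from collections import defaultdict

# Long format: one (transaction_id, item) pair per row
long_rows = [
    (1, "Bread"), (1, "Milk"), (1, "Butter"),
    (2, "Milk"), (2, "Eggs"),
]

# Group items into baskets by transaction ID
baskets = defaultdict(set)
for tid, item in long_rows:
    baskets[tid].add(item)

# Wide format: one row per transaction, one 0/1 column per item
items = sorted({item for _, item in long_rows})
wide = {tid: [1 if i in basket else 0 for i in items]
        for tid, basket in sorted(baskets.items())}

print(items)    # ['Bread', 'Butter', 'Eggs', 'Milk']
print(wide[1])  # [1, 1, 0, 1]
print(wide[2])  # [0, 0, 1, 1]
```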
Feature Configuration
Feature Columns (required)
- Wide format: List all item columns
- Long format: Select the single column containing item names
Transaction ID Column (required for long format) Column that identifies which transaction each item belongs to.
Contains Multiple Items (long format only) Check if a single row can contain multiple items (e.g., "Bread, Milk, Eggs").
Item Separator (if multiple items per row) Character separating the items within a row (default: comma).
- Example: "Bread, Milk, Eggs" uses "," as separator
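Splitting such multi-item rows back into one-item-per-row form can be sketched as follows (comma assumed as separator, matching the default; the rows are made up):

```python
raw_rows = [
    (1, "Bread, Milk, Eggs"),
    (2, "Milk"),
]
separator = ","

# Split each cell on the separator and strip surrounding whitespace
long_rows = [(tid, item.strip())
             for tid, cell in raw_rows
             for item in cell.split(separator)]
print(long_rows)  # [(1, 'Bread'), (1, 'Milk'), (1, 'Eggs'), (2, 'Milk')]
```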
Segmentation (Optional)
Segmentation Column Analyze different customer segments separately:
- Store locations (downtown vs. suburban)
- Customer types (premium vs. regular)
- Time periods (weekday vs. weekend)
Target Segment Value Filter to analyze only specific segment.
Model Parameters
Minimum Support (default: 0.02, required) Threshold for how frequently an itemset must appear.
- 0.02 = 2% of transactions
- Lower values: Find rarer patterns, but run slower and return more results
- Higher values: Only common patterns, faster
- Recommendations:
- Large stores (>10k transactions): 0.001-0.01 (0.1%-1%)
- Medium stores: 0.01-0.05 (1%-5%)
- Small datasets: 0.05-0.1 (5%-10%)
Maximum Itemset Length (default: 3, required) Maximum number of items in a pattern.
- 2: Pairs only (A -> B)
- 3: Triples (A, B -> C)
- 4+: Complex patterns (slower, harder to interpret)
- Recommendations:
- Start with 2-3 for interpretability
- Increase only if needed
Rule Evaluation Metric (default: "lift", required) How to measure rule strength:
- lift: Strength of association (recommended)
- confidence: Reliability of rule
- leverage: Difference between observed and expected co-occurrence
- conviction: Dependency strength
Metric Threshold (default: 1.2, required) Minimum value for the selected metric to keep a rule.
- For lift: >1.0 (1.2 = 20% more likely)
- For confidence: 0.5-0.9 (50%-90% probability)
Advanced Filtering (Optional)
Enable Advanced Filtering Set both confidence and lift thresholds simultaneously for stricter rules.
Minimum Confidence (default: 0.6) Probability that Y is purchased given X is purchased.
- 0.6 = 60% of transactions with X also have Y
- Range: 0.1-1.0
Minimum Lift (default: 1.1) How much more likely Y is with X versus without X.
- 1.0 = No association (independent)
- 1.1 = 10% increase in likelihood
- 2.0 = 2x more likely
- Range: >0.0 (typically >1.0 for meaningful rules)
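The combined confidence-and-lift filter amounts to a simple predicate over already-mined rules. A sketch with made-up rule tuples:

```python
# Hypothetical mined rules: (antecedent, consequent, confidence, lift)
rules = [
    ({"bread"}, {"butter"}, 0.60, 1.5),
    ({"milk"},  {"bread"},  0.90, 1.0),   # popular consequent: high confidence, no lift
    ({"eggs"},  {"bacon"},  0.55, 2.1),
]

def advanced_filter(rules, min_confidence=0.6, min_lift=1.1):
    """Keep only rules that pass BOTH thresholds simultaneously."""
    return [r for r in rules
            if r[2] >= min_confidence and r[3] >= min_lift]

kept = advanced_filter(rules)
# Only bread -> butter survives: the milk rule fails min_lift,
# and the eggs rule fails min_confidence.
```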
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
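The definition translates directly to code. A sketch over a toy five-transaction dataset (made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

transactions = [
    {"Bread", "Milk"}, {"Bread", "Milk", "Eggs"},
    {"Milk"}, {"Bread"}, {"Eggs"},
]
print(support({"Bread", "Milk"}, transactions))  # 0.4 (2 of 5 transactions)
print(support({"Milk"}, transactions))           # 0.6 (3 of 5 transactions)
```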
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X)
- 1.5 = 50% increase in likelihood
- 2.0 = 2x more likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
Conviction
Definition: How much more often X would occur without Y if the two were independent, compared to how often it actually does - higher values mean Y depends more strongly on X.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- >1.0: Y depends on X
- infinity: Perfect dependency (always Y when X)
Use: Measures how much the rule deviates from independence
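All four rule metrics above derive from supports. Reproducing the shared worked example (support(Bread) = 0.5, support(Butter) = 0.4, support of both = 0.3):

```python
sup_x, sup_y, sup_xy = 0.5, 0.4, 0.3  # Bread, Butter, both together

confidence = sup_xy / sup_x                  # 0.6
lift = confidence / sup_y                    # 1.5
leverage = sup_xy - sup_x * sup_y            # 0.1 (0.3 observed vs 0.2 expected)
conviction = (1 - sup_y) / (1 - confidence)  # 1.5

# All values match the worked examples above (up to floating-point rounding).
```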
Configuration Tips
Best Practices for Apriori
Start Conservative:
- Begin with min_support = 0.05 (5%)
- Use max_length = 2 or 3
- Monitor number of results
For Learning:
- Apriori is the best algorithm to understand association mining
- Observe how itemset sizes grow each iteration
- Watch how the Apriori property prunes candidates
Performance Tips:
- Increase min_support if too slow
- Reduce max_length for faster results
- Consider switching to FP-Growth for larger datasets
When to Use:
- Datasets under 100k transactions
- Learning and understanding the fundamentals
- Need to explain the algorithm to stakeholders
Common Issues and Solutions
Too Many Patterns
Symptom: Thousands of itemsets and rules generated
Solutions:
- Increase min_support (0.02 -> 0.05)
- Enable advanced filtering
- Reduce max_length to 2
- Focus on specific product categories
Algorithm Too Slow
Symptom: Takes many minutes or doesn't complete
Solutions:
- Increase min_support significantly
- Reduce max_length to 2
- Switch to FP-Growth algorithm
- Filter data to fewer items
No Patterns Found
Symptom: Zero itemsets or rules generated
Solutions:
- Lower min_support (0.05 -> 0.01)
- Check data format (wide vs. long)
- Verify transaction ID column is correct
- Ensure items are properly separated
All High Confidence, Low Lift
Symptom: Many rules with 90%+ confidence but lift near 1.0
Solutions:
- Focus on lift instead of confidence
- Set min_lift to 1.5 or higher
- High confidence + low lift = item is just popular
- Look for surprising patterns (high lift)
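The popular-item trap can be seen numerically. Suppose (made-up numbers) milk appears in 90% of all baskets and co-occurs with bread exactly as often as independence predicts:

```python
sup_milk = 0.9          # milk is in 90% of all baskets
sup_bread = 0.3
sup_bread_milk = 0.27   # exactly sup_bread * sup_milk: no real association

confidence = sup_bread_milk / sup_bread  # ~0.9: looks impressive
lift = confidence / sup_milk             # ~1.0: the rule adds nothing
```

The 90% confidence only reflects milk's overall popularity; lift near 1.0 reveals that buying bread changes nothing.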