Association Analysis
Discover patterns and relationships in transaction data
Association analysis discovers relationships between items that frequently occur together in transactions. Use it for market basket analysis, recommendation systems, cross-selling strategies, and pattern discovery in transaction databases.
Understanding Association Analysis
New to association analysis? Check out our Association Analysis AI Task Guide to learn the fundamentals of market basket analysis, key concepts like itemsets and rules, and when to use this technique.
Available Algorithms
We support five algorithms for mining frequent itemsets and generating association rules. The core algorithms all find the same frequent itemsets but use different search strategies with different performance trade-offs; FPMax is the exception, returning only maximal itemsets.
Core Algorithms
- Apriori - Classic breadth-first algorithm, easy to understand, good for learning
- FP-Growth - Fast tree-based algorithm, best for most use cases
- Eclat - Vertical format algorithm, fast for sparse data
Specialized Algorithms
- Relim - Memory-efficient recursive elimination
- FPMax - Finds only maximal itemsets (compact representation)
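To build intuition for what these algorithms compute, here is a minimal pure-Python Apriori sketch. It is for illustration only - far slower than the implementations above - but it shows the breadth-first search and the Apriori pruning property:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: breadth-first search over itemset sizes."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # Level 1: candidate itemsets are the individual items.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    while current:
        # Count support for each candidate; keep those above the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori property).
        keys = list(survivors)
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == len(a) + 1}
        current = {c for c in current
                   if all(frozenset(s) in survivors for s in combinations(c, len(c) - 1))}
    return frequent

transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
]
result = apriori(transactions, min_support=0.4)
# result maps each frequent itemset to its support, e.g.
# frozenset({'bread', 'butter', 'milk'}) -> 0.4
```

FP-Growth and the other algorithms reach the same `result` without materializing candidate sets level by level, which is why they scale better.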
What is Association Analysis?
Association analysis answers questions like:
- "What products do customers buy together?"
- "If someone buys X, what else are they likely to buy?"
- "What items frequently appear in the same transaction?"
Classic Example: "Customers who buy diapers also buy beer" - a famous (and possibly apocryphal) retail anecdote illustrating how association analysis can surface unexpected relationships.
Key Concepts
Itemset: A collection of items (e.g., [Bread, Milk])
Transaction: A set of items purchased together (e.g., one shopping cart)
Association Rule: If X then Y, written as X -> Y
- Example: [Bread, Butter] -> [Milk]
- Meaning: "Customers who buy bread and butter also buy milk"
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
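The worked example above can be reproduced in a few lines. This sketch assumes transactions are plain lists of item names (an illustrative format, not a required input schema):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# 100 transactions; [Bread, Milk] appears in 20 of them.
transactions = [["bread", "milk", "eggs"]] * 20 + [["eggs"]] * 80
print(support(["bread", "milk"], transactions))  # 0.2
```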
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common - a popular consequent produces high confidence even when there is no real association
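Using the same list-of-lists transaction format as above, confidence is just a ratio of two supports (an illustrative sketch):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = support(X u Y) / support(X)"""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

# 50% of carts contain bread; 30% contain both bread and butter.
transactions = [["bread", "butter"]] * 3 + [["bread"]] * 2 + [["milk"]] * 5
print(confidence(["bread"], ["butter"], transactions))  # 0.6
```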
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X); 1.5 means a 50% increase in likelihood, 2.0 means twice as likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
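The lift calculation and its symmetry are easy to verify with the same transaction format (illustrative sketch):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    """lift(X -> Y) = support(X u Y) / (support(X) * support(Y))"""
    x, y = set(antecedent), set(consequent)
    return support(x | y, transactions) / (
        support(x, transactions) * support(y, transactions))

# 50% buy bread, 40% buy butter, 30% buy both.
transactions = ([["bread", "butter"]] * 3 + [["bread"]] * 2
                + [["butter"]] + [["milk"]] * 4)
print(round(lift(["bread"], ["butter"], transactions), 2))  # 1.5

# Symmetry: swapping antecedent and consequent gives the same value.
assert lift(["bread"], ["butter"], transactions) == lift(["butter"], ["bread"], transactions)
```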
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
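The observed-minus-expected calculation above, sketched with the same transaction format:

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def leverage(antecedent, consequent, transactions):
    """leverage(X -> Y) = support(X u Y) - support(X) * support(Y)"""
    x, y = set(antecedent), set(consequent)
    return (support(x | y, transactions)
            - support(x, transactions) * support(y, transactions))

# Observed co-occurrence 0.3 vs. 0.5 * 0.4 = 0.2 expected under independence.
transactions = ([["bread", "butter"]] * 3 + [["bread"]] * 2
                + [["butter"]] + [["milk"]] * 4)
print(round(leverage(["bread"], ["butter"], transactions), 2))  # 0.1
```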
Conviction
Definition: How much more often the rule X -> Y would be wrong if X and Y were independent, compared to how often it is actually wrong.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- >1.0: Y depends on X
- infinity: Perfect dependency (always Y when X)
Use: Measures how much the rule deviates from independence
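Conviction follows the same pattern; note the guard for a rule that is never wrong, which matches the "perfect dependency = infinity" case above (illustrative sketch):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def conviction(antecedent, consequent, transactions):
    """conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))"""
    x, y = set(antecedent), set(consequent)
    conf = support(x | y, transactions) / support(x, transactions)
    if conf == 1.0:  # rule is never wrong -> perfect dependency
        return float("inf")
    return (1 - support(y, transactions)) / (1 - conf)

# support(Butter) = 0.4, confidence(Bread -> Butter) = 0.6
transactions = ([["bread", "butter"]] * 3 + [["bread"]] * 2
                + [["butter"]] + [["milk"]] * 4)
print(round(conviction(["bread"], ["butter"], transactions), 2))  # 1.5
```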
Choosing the Right Algorithm
Quick Decision Guide
Start with FP-Growth unless:
- Learning (use Apriori for intuition)
- Sparse data (try Eclat)
- Limited memory (try Relim)
- Want compact results (use FPMax)
By Dataset Size
- Small (<1k transactions): Any algorithm, Apriori is fine
- Medium (1k-100k): FP-Growth (best), Eclat (for sparse)
- Large (>100k): FP-Growth, Eclat, Relim
By Data Characteristics
Dense transactions (many items per transaction):
- FP-Growth (best)
- Apriori (small datasets)
Sparse transactions (few items per transaction):
- Eclat (best for sparse)
- FP-Growth
Many unique items:
- Eclat (handles many items well)
- FP-Growth
By Goal
Learning / Understanding:
- Apriori (most intuitive)
Production / Performance:
- FP-Growth (fastest, most reliable)
Compact Results:
- FPMax (only longest patterns)
Memory Constraints:
- Relim (memory-efficient)
Best Practices
1. Start with the Right Support
- Don't start too low (<0.001)
- Begin with moderate support (0.01-0.05)
- Lower gradually if needed
- Monitor number of results
2. Focus on Actionable Rules
- High lift (>1.5) for strong associations
- Reasonable confidence (>0.5) for reliability
- Consider support (not too rare)
- Look for surprising patterns (high lift + moderate confidence)
3. Filter and Interpret Results
Good Rules:
- Lift >1.5 (strong association)
- Confidence >0.5 (reliable)
- Support >0.01 (not too rare)
- Make business sense
Suspicious Rules:
- Lift ≈ 1.0 (no real association)
- Very high confidence + low lift (item just popular)
- Very low support (might be noise)
- Contradicts domain knowledge
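The thresholds above are straightforward to apply once rules are mined. This sketch assumes each rule is a dict with precomputed metrics - the field names and sample rules are illustrative, not a fixed output format:

```python
rules = [
    {"rule": "bread -> butter", "support": 0.30, "confidence": 0.60, "lift": 1.8},
    {"rule": "milk -> bags",    "support": 0.45, "confidence": 0.90, "lift": 1.01},  # bags are just popular
    {"rule": "caviar -> vodka", "support": 0.002, "confidence": 0.70, "lift": 8.0},  # too rare, might be noise
]

def is_actionable(rule, min_support=0.01, min_confidence=0.5, min_lift=1.5):
    """Keep rules that are frequent, reliable, and genuinely associated."""
    return (rule["support"] > min_support
            and rule["confidence"] > min_confidence
            and rule["lift"] > min_lift)

good = [r["rule"] for r in rules if is_actionable(r)]
print(good)  # ['bread -> butter']
```

Filtering on all three metrics together is what removes both the "popular consequent" rules (high confidence, lift near 1) and the rare-pattern noise (high lift, negligible support).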
4. Domain Validation
- Validate with domain experts
- Check if patterns make business sense
- Look for actionable insights
- Test recommendations with A/B testing
5. Segment Your Analysis
Analyze different segments separately:
- Store locations
- Customer demographics
- Time periods (seasonal patterns)
- Product categories
6. Practical Applications
Retail / E-commerce:
- Product recommendations ("You might also like...")
- Store layout optimization
- Promotional bundling
- Cross-selling strategies
Healthcare:
- Symptom-disease associations
- Drug interaction patterns
- Treatment combinations
Web Analytics:
- Page navigation patterns
- Feature usage combinations
- User behavior sequences
Common Pitfalls
1. Support Too Low
- Generates too many patterns
- Includes noise and spurious patterns
- Very slow computation
- Fix: Start with 0.01-0.05, lower gradually
2. Ignoring Lift
- Using only confidence can be misleading
- Popular items have high confidence by default
- Fix: Always check lift >1.0, prefer >1.5
3. Too Many Items
- Exponential growth in patterns
- Overwhelming results
- Fix:
- Increase min_support
- Limit max_length to 2-3
- Focus on specific product categories
4. Not Filtering Results
- Raw output is overwhelming
- Many redundant patterns
- Fix:
- Use advanced filters (confidence + lift)
- Focus on high-lift rules
- Sort by interestingness metrics
5. Misinterpreting Causation
- Association ≠ causation
- Correlation might be coincidental
- Fix: Validate with experiments and domain knowledge