Eclat
Vertical data format algorithm using depth-first search
Uses vertical data format (itemset -> transaction list) instead of horizontal format (transaction -> itemset list). Finds frequent itemsets through set intersection operations.
When to Use Eclat
- Sparse datasets (many items, few per transaction)
- Need fast depth-first search
- Memory is available for vertical representation
- Alternative to FP-Growth for sparse data
Strengths
- Fast for sparse data
- Simple set intersection operations
- Good for large number of items
- Depth-first search is memory efficient
- No repeated database scans
Weaknesses
- High memory for dense data
- Less intuitive vertical format
- May be slower than FP-Growth on dense data
- Requires full vertical database in memory
How it Works
- Convert to vertical format: Each item -> list of transactions containing it
- Intersect transaction lists to find co-occurrences
- Use depth-first search to find patterns
- Calculate support from transaction list sizes
Example Vertical Format:
Bread: [1, 3, 5, 7] (appears in transactions 1, 3, 5, 7)
Milk: [1, 2, 5, 8]
Butter: [1, 5]
[Bread, Milk] -> intersection: [1, 5] -> support = 2/8 = 0.25
Key Advantage: Set intersection is very fast, especially for sparse data where transaction lists are short.
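The steps above can be sketched in pure Python. This is a minimal illustration, not the tool's actual implementation: the function name `eclat` and the representation of transactions as sets are assumptions for the example.

```python
from collections import defaultdict

def eclat(transactions, min_support, max_length=3):
    """Depth-first Eclat sketch: mine frequent itemsets via tidlist intersection."""
    n = len(transactions)
    min_count = min_support * n

    # Step 1: convert to vertical format (item -> set of transaction IDs).
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)

    frequent = {}

    def dfs(prefix, prefix_tids, candidates):
        for i, (item, tids) in enumerate(candidates):
            # Step 2: intersect tidlists; support comes from the intersection size.
            new_tids = prefix_tids & tids if prefix else tids
            if len(new_tids) >= min_count:
                itemset = prefix + (item,)
                frequent[itemset] = len(new_tids) / n
                # Step 3: recurse depth-first, extending only with later items.
                if len(itemset) < max_length:
                    dfs(itemset, new_tids, candidates[i + 1:])

    dfs((), set(), sorted(tidlists.items()))
    return frequent

# The vertical-format example above, as 0-indexed sets
# (transactions 4 and 6 are left empty for brevity).
transactions = [
    {"Bread", "Milk", "Butter"},  # t1
    {"Milk"},                     # t2
    {"Bread"},                    # t3
    set(),                        # t4
    {"Bread", "Milk", "Butter"},  # t5
    set(),                        # t6
    {"Bread"},                    # t7
    {"Milk"},                     # t8
]
frequent = eclat(transactions, min_support=0.25)
# frequent[("Bread", "Milk")] == 0.25, matching the worked example.
```

Note how the recursion extends each prefix only with items that sort after it, so every itemset is generated exactly once.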
When to Choose Eclat
Best for:
- Retail data with many SKUs but few items per basket
- Web clickstream data (many pages, few per session)
- Library checkout records
- Any scenario with high item count, low items per transaction
Choose FP-Growth instead when:
- Dense transactions (many items per transaction)
- Limited memory
- Need the fastest possible algorithm
Parameters
All association algorithms share these common parameters:
Data Format
Input Format: 'long' or 'wide'
How your transaction data is structured:
Wide Format:
- Each column represents one item
- Each row is a transaction
- Values are 1 (item present) or 0 (item absent)
- Example:
TransactionID | Bread | Milk | Eggs | Butter
1             | 1     | 1    | 0    | 1
2             | 0     | 1    | 1    | 0
Long Format:
- Each row is one item in a transaction
- Requires Transaction ID column to group items
- More natural for real-world data
- Example:
TransactionID | Item
1             | Bread
1             | Milk
1             | Butter
2             | Milk
2             | Eggs
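Converting long format to wide format can be sketched in plain Python. The function name `long_to_wide`, the input as (transaction ID, item field) pairs, and the comma separator are illustrative assumptions; the tool performs this conversion internally based on the options described below.

```python
def long_to_wide(rows, separator=","):
    """Convert long-format rows to wide-format one-hot rows.
    `rows` is a list of (transaction_id, item_field) pairs; an item
    field may hold several items joined by `separator`."""
    baskets = {}  # transaction_id -> set of items
    for tid, field in rows:
        items = [part.strip() for part in field.split(separator)]
        baskets.setdefault(tid, set()).update(items)

    # One column per distinct item, in sorted order.
    columns = sorted({item for items in baskets.values() for item in items})
    wide = []
    for tid in sorted(baskets):
        wide.append([tid] + [1 if c in baskets[tid] else 0 for c in columns])
    return ["TransactionID"] + columns, wide

# Mixed input: one row holds two comma-separated items.
header, data = long_to_wide([(1, "Bread"), (1, "Milk, Butter"), (2, "Milk"), (2, "Eggs")])
```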
Feature Configuration
Feature Columns (required)
- Wide format: List all item columns
- Long format: Select the single column containing item names
Transaction ID Column (required for long format) Column that identifies which transaction each item belongs to.
Contains Multiple Items (long format only) Check if a single row can contain multiple items (e.g., "Bread, Milk, Eggs").
Item Separator (if multiple items per row) Character separating the items (default: comma).
- Example: "Bread, Milk, Eggs" uses "," as separator
Segmentation (Optional)
Segmentation Column Analyze different customer segments separately:
- Store locations (downtown vs. suburban)
- Customer types (premium vs. regular)
- Time periods (weekday vs. weekend)
Target Segment Value Filter the analysis to a single segment value.
Model Parameters
Minimum Support (default: 0.02, required) Threshold for how frequently an itemset must appear.
- 0.02 = 2% of transactions
- Lower values: Find rare patterns, but slower and more results
- Higher values: Only common patterns, faster
- Recommendations:
- Large stores (>10k transactions): 0.001-0.01 (0.1%-1%)
- Medium stores: 0.01-0.05 (1%-5%)
- Small datasets: 0.05-0.1 (5%-10%)
Maximum Itemset Length (default: 3, required) Maximum number of items in a pattern.
- 2: Pairs only (A -> B)
- 3: Triples (A, B -> C)
- 4+: Complex patterns (slower, harder to interpret)
- Recommendations:
- Start with 2-3 for interpretability
- Increase only if needed
Rule Evaluation Metric (default: "lift", required) How to measure rule strength:
- lift: Strength of association (recommended)
- confidence: Reliability of rule
- leverage: Difference between observed and expected co-occurrence
- conviction: Dependency strength
Metric Threshold (default: 1.2, required) Minimum value for the selected metric to keep a rule.
- For lift: >1.0 (1.2 = 20% more likely)
- For confidence: 0.5-0.9 (50%-90% probability)
Advanced Filtering (Optional)
Enable Advanced Filtering Set both confidence and lift thresholds simultaneously for stricter rules.
Minimum Confidence (default: 0.6) Probability that Y is purchased given X is purchased.
- 0.6 = 60% of transactions with X also have Y
- Range: 0.1-1.0
Minimum Lift (default: 1.1) How much more likely Y is with X versus without X.
- 1.0 = No association (independent)
- 1.1 = 10% increase in likelihood
- 2.0 = 2x more likely
- Range: >0.0 (typically >1.0 for meaningful rules)
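Advanced filtering keeps a rule only if it clears both thresholds at once. A small sketch, using illustrative rule records and the default thresholds from the parameters above:

```python
# Hypothetical mined rules with their computed metrics (illustrative values).
rules = [
    {"rule": "Bread -> Butter", "confidence": 0.60, "lift": 1.5},
    {"rule": "Milk -> Eggs",    "confidence": 0.45, "lift": 1.3},  # fails confidence
    {"rule": "Tea -> Sugar",    "confidence": 0.70, "lift": 0.9},  # fails lift
]

min_confidence = 0.6  # default Minimum Confidence
min_lift = 1.1        # default Minimum Lift

# A rule must satisfy BOTH thresholds to survive.
kept = [r for r in rules if r["confidence"] >= min_confidence and r["lift"] >= min_lift]
# Only "Bread -> Butter" passes both filters.
```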
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X)
- 1.5 = 50% increase in likelihood
- 2.0 = 2x more likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
Conviction
Definition: Dependency measure - how strongly Y depends on X.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- > 1.0: Y depends on X (higher = stronger dependency)
- infinity: Perfect dependency (Y always appears when X does)
Use: Measures how much the rule deviates from independence
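All five metrics can be checked against the worked examples with a few lines of Python. The 10-transaction dataset below is constructed so the counts reproduce the numbers used throughout this section (support(Bread) = 0.5, support(Butter) = 0.4, support(Bread, Butter) = 0.3):

```python
# 10 illustrative transactions matching the examples above.
transactions = (
    [{"Bread", "Butter"}] * 3   # both items
    + [{"Bread"}] * 2           # bread only
    + [{"Butter"}]              # butter only
    + [{"Milk"}] * 4            # neither
)
n = len(transactions)

def support(*items):
    """Fraction of transactions containing all of the given items."""
    return sum(1 for t in transactions if set(items) <= t) / n

s_x, s_y, s_xy = support("Bread"), support("Butter"), support("Bread", "Butter")

confidence = s_xy / s_x                    # 0.3 / 0.5 = 0.6
lift = confidence / s_y                    # 0.6 / 0.4 = 1.5
leverage = s_xy - s_x * s_y                # 0.3 - 0.5 * 0.4 = 0.1
conviction = (1 - s_y) / (1 - confidence)  # 0.6 / 0.4 = 1.5
```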
Configuration Tips
Best Practices for Eclat
Optimal Use Cases:
- Sparse transaction data
- Many unique items (thousands of SKUs)
- Few items per transaction (average < 10)
- Need depth-first mining strategy
Memory Considerations:
- Eclat stores transaction IDs for each item
- Sparse data: Small transaction lists, low memory
- Dense data: Large transaction lists, high memory
- Monitor memory usage with large datasets
Performance Tips:
- Works best with min_support >= 0.01
- Increase min_support if memory issues occur
- Consider data characteristics before choosing
When Eclat Performs Best
Ideal Characteristics:
- 1000+ unique items
- Average 5-15 items per transaction
- Sparse transaction matrix
- Need to find rare patterns efficiently
Examples:
- Supermarket with 10,000 products, baskets of 20 items
- E-commerce site with 50,000 products, orders of 3-5 items
- Library with 100,000 books, checkouts of 2-3 books
Common Issues and Solutions
High Memory Usage
Symptom: Out of memory errors or swapping
Causes:
- Dense transaction data
- Very low min_support
- Large number of transactions
Solutions:
- Increase min_support to 0.02 or higher
- Switch to FP-Growth for dense data
- Process data in segments
- Filter to most relevant items first
Slower than Expected
Symptom: Eclat slower than FP-Growth
Causes:
- Dense transactions (many items per transaction)
- Data not actually sparse
- Very low min_support
Solutions:
- Verify data is sparse (check avg items per transaction)
- Switch to FP-Growth if data is dense
- Increase min_support slightly
- Reduce max_length
Results Differ from Other Algorithms
Symptom: Different itemsets found
Note: All algorithms should find identical frequent itemsets above threshold. If they differ:
- Verify parameters match exactly
- Check data preprocessing
- Ensure min_support is identical
- Order may differ, but content should match
Conversion to Vertical Format Takes Long
Symptom: Slow initialization before mining starts
Explanation: Eclat must first convert the data from horizontal to vertical format. This is a one-time cost.
Solutions:
- Normal for large datasets
- Consider caching vertical format if running multiple times
- Switch to FP-Growth if initialization dominates runtime