Apriori
Classic breadth-first algorithm for association rule mining
Finds frequent itemsets with a level-wise, breadth-first search: candidates of size k are generated from frequent (k-1)-itemsets, and infrequent ones are pruned at each pass.
When to Use Apriori
- Learning association analysis (most intuitive)
- Small to medium datasets (<100k transactions)
- Need to understand how the algorithm works
- Educational purposes
Strengths
- Easy to understand
- Straightforward implementation
- Well-documented with extensive resources
- Guaranteed to find all patterns above threshold
Weaknesses
- Can be slow on large datasets
- Generates many candidate itemsets
- Memory intensive for low support values
- Multiple database scans required
How it Works
- Find all frequent 1-itemsets (single items above min_support)
- Generate candidate 2-itemsets from frequent 1-itemsets
- Keep only frequent 2-itemsets
- Repeat: Generate k-itemsets from (k-1)-itemsets
- Stop when no more frequent itemsets found
- Generate rules from frequent itemsets
Key Principle: If an itemset is infrequent, all its supersets are also infrequent (Apriori property). This allows pruning of candidate itemsets without checking them against the database.
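The level-wise search and the pruning step above can be sketched in plain Python. This is a teaching sketch, not an optimized implementation, and the toy basket data is made up for illustration:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    n = len(transactions)
    # Step 1: candidate 1-itemsets are all distinct items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # One scan of the database per level to count candidate supports
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Generate (k+1)-candidates by joining frequent k-itemsets, and prune any
        # candidate that has an infrequent k-subset (the Apriori property)
        prev = set(survivors)
        current = set()
        for a in prev:
            for b in prev:
                cand = a | b
                if len(cand) == k + 1 and all(frozenset(s) in prev for s in combinations(cand, k)):
                    current.add(cand)
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"bread", "milk", "butter"}, {"bread", "butter"},
    {"milk", "eggs"}, {"bread", "milk", "butter", "eggs"},
)]
freq = apriori(transactions, min_support=0.5)
# freq maps each frequent itemset to its support,
# e.g. freq[frozenset({"bread"})] == 0.75
```

Note how {bread, eggs} is counted, found infrequent, and then never extended: every superset of it is skipped without touching the database.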
Parameters
All association algorithms share these common parameters:
Data Format
Input Format: 'long' or 'wide'
How your transaction data is structured:
Wide Format:
- Each column represents one item
- Each row is a transaction
- Values are 1 (item present) or 0 (item absent)
- Example:
TransactionID | Bread | Milk | Eggs | Butter
1             | 1     | 1    | 0    | 1
2             | 0     | 1    | 1    | 0
Long Format:
- Each row is one item in a transaction
- Requires Transaction ID column to group items
- More natural for real-world data
- Example:
TransactionID | Item
1             | Bread
1             | Milk
1             | Butter
2             | Milk
2             | Eggs
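Converting between the two layouts is a common preprocessing step. A minimal sketch in plain Python, using the example transactions above:

```python
from collections import defaultdict

# Long format: one (transaction_id, item) pair per row
long_rows = [
    (1, "Bread"), (1, "Milk"), (1, "Butter"),
    (2, "Milk"), (2, "Eggs"),
]

# Group items into baskets by transaction ID
baskets = defaultdict(set)
for tid, item in long_rows:
    baskets[tid].add(item)

# Wide format: one row per transaction, one 0/1 column per item
items = sorted({item for _, item in long_rows})
wide = {tid: [1 if i in basket else 0 for i in items]
        for tid, basket in sorted(baskets.items())}

print(items)    # ['Bread', 'Butter', 'Eggs', 'Milk']
print(wide[1])  # [1, 1, 0, 1]
print(wide[2])  # [0, 0, 1, 1]
```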
Feature Configuration
Feature Columns (required)
- Wide format: List all item columns
- Long format: Select the single column containing item names
Transaction ID Column (required for long format) Column that identifies which transaction each item belongs to.
Contains Multiple Items (long format only) Check if a single row can contain multiple items (e.g., "Bread, Milk, Eggs").
Item Separator (if multiple items per row) Character separating the items within a row (default: comma).
- Example: "Bread, Milk, Eggs" uses "," as separator
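Splitting such multi-item rows back into one-item-per-row form can be sketched as follows (comma assumed as separator, matching the default; the rows are made up):

```python
raw_rows = [
    (1, "Bread, Milk, Eggs"),
    (2, "Milk"),
]
separator = ","

# Split each cell on the separator and strip surrounding whitespace
long_rows = [(tid, item.strip())
             for tid, cell in raw_rows
             for item in cell.split(separator)]
print(long_rows)  # [(1, 'Bread'), (1, 'Milk'), (1, 'Eggs'), (2, 'Milk')]
```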
Segmentation (Optional)
Segmentation Column Analyze different customer segments separately:
- Store locations (downtown vs. suburban)
- Customer types (premium vs. regular)
- Time periods (weekday vs. weekend)
Target Segment Value Filter to analyze only specific segment.
Model Parameters
Minimum Support (default: 0.02, required) Threshold for how frequently an itemset must appear.
- 0.02 = 2% of transactions
- Lower values: Find rarer patterns, but run slower and return more results
- Higher values: Only common patterns, faster
- Recommendations:
- Large stores (>10k transactions): 0.001-0.01 (0.1%-1%)
- Medium stores: 0.01-0.05 (1%-5%)
- Small datasets: 0.05-0.1 (5%-10%)
Maximum Itemset Length (default: 3, required) Maximum number of items in a pattern.
- 2: Pairs only (A -> B)
- 3: Triples (A, B -> C)
- 4+: Complex patterns (slower, harder to interpret)
- Recommendations:
- Start with 2-3 for interpretability
- Increase only if needed
Rule Evaluation Metric (default: "lift", required) How to measure rule strength:
- lift: Strength of association (recommended)
- confidence: Reliability of rule
- leverage: Difference between observed and expected co-occurrence
- conviction: Dependency strength
Metric Threshold (default: 1.2, required) Minimum value for the selected metric to keep a rule.
- For lift: >1.0 (1.2 = 20% more likely)
- For confidence: 0.5-0.9 (50%-90% probability)
Advanced Filtering (Optional)
Enable Advanced Filtering Set both confidence and lift thresholds simultaneously for stricter rules.
Minimum Confidence (default: 0.6) Probability that Y is purchased given X is purchased.
- 0.6 = 60% of transactions with X also have Y
- Range: 0.1-1.0
Minimum Lift (default: 1.1) How much more likely Y is with X versus without X.
- 1.0 = No association (independent)
- 1.1 = 10% increase in likelihood
- 2.0 = 2x more likely
- Range: >0.0 (typically >1.0 for meaningful rules)
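The combined confidence-and-lift filter amounts to a simple predicate over already-mined rules. A sketch with made-up rule tuples:

```python
# Hypothetical mined rules: (antecedent, consequent, confidence, lift)
rules = [
    ({"bread"}, {"butter"}, 0.60, 1.5),
    ({"milk"},  {"bread"},  0.90, 1.0),   # popular consequent: high confidence, no lift
    ({"eggs"},  {"bacon"},  0.55, 2.1),
]

def advanced_filter(rules, min_confidence=0.6, min_lift=1.1):
    """Keep only rules that pass BOTH thresholds simultaneously."""
    return [r for r in rules
            if r[2] >= min_confidence and r[3] >= min_lift]

kept = advanced_filter(rules)
# Only bread -> butter survives: the milk rule fails min_lift,
# and the eggs rule fails min_confidence.
```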
Understanding Association Metrics
Support
Definition: How frequently an itemset appears in the database.
Formula: support(X) = (transactions containing X) / (total transactions)
Example:
- 100 transactions total
- [Bread, Milk] appears in 20 transactions
- support([Bread, Milk]) = 20/100 = 0.2 = 20%
Interpretation:
- 0.01 (1%): Rare pattern
- 0.05 (5%): Moderate frequency
- 0.2 (20%): Very common pattern
Use: Filter out rare, potentially spurious patterns
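The definition translates directly to code. A sketch over a toy five-transaction dataset (made up for illustration):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)

transactions = [
    {"Bread", "Milk"}, {"Bread", "Milk", "Eggs"},
    {"Milk"}, {"Bread"}, {"Eggs"},
]
print(support({"Bread", "Milk"}, transactions))  # 0.4 (2 of 5 transactions)
print(support({"Milk"}, transactions))           # 0.6 (3 of 5 transactions)
```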
Confidence
Definition: Probability of finding Y in transactions that contain X.
Formula: confidence(X -> Y) = support(X U Y) / support(X)
Example:
- support([Bread]) = 0.5 (50% of transactions)
- support([Bread, Butter]) = 0.3 (30% of transactions)
- confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%
Interpretation:
- 0.6 = 60% of customers who buy bread also buy butter
- Higher confidence = more reliable rule
Limitation: Can be misleading if Y is very common
Lift
Definition: How much more likely Y is with X versus without X.
Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)
Example:
- confidence(Bread -> Butter) = 0.6
- support(Butter) = 0.4 (40% buy butter overall)
- lift(Bread -> Butter) = 0.6 / 0.4 = 1.5
Interpretation:
- lift = 1.0: No association (X and Y are independent)
- lift > 1.0: Positive association (Y more likely with X)
- 1.5 = 50% increase in likelihood
- 2.0 = 2x more likely (100% increase)
- lift < 1.0: Negative association (Y less likely with X)
Why Lift is Best for Discovery:
- Accounts for item popularity
- Detects true associations vs. coincidence
- Symmetric: lift(X -> Y) = lift(Y -> X)
Leverage
Definition: Difference between observed and expected co-occurrence.
Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)
Example:
- support([Bread, Butter]) = 0.3 (observed)
- support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
- leverage = 0.3 - 0.2 = 0.1
Interpretation:
- 0: No association
- Positive: Items appear together more than expected
- Negative: Items appear together less than expected
- Magnitude matters: Higher absolute value = stronger relationship
Conviction
Definition: How much more often X would occur without Y if the two were independent, compared to how often it actually does - higher values mean Y depends more strongly on X.
Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))
Example:
- support(Butter) = 0.4
- confidence(Bread -> Butter) = 0.6
- conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5
Interpretation:
- 1.0: No association (independent)
- >1.0: Y depends on X
- infinity: Perfect dependency (always Y when X)
Use: Measures how much the rule deviates from independence
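All four rule metrics above derive from supports. Reproducing the shared worked example (support(Bread) = 0.5, support(Butter) = 0.4, support of both = 0.3):

```python
sup_x, sup_y, sup_xy = 0.5, 0.4, 0.3  # Bread, Butter, both together

confidence = sup_xy / sup_x                  # 0.6
lift = confidence / sup_y                    # 1.5
leverage = sup_xy - sup_x * sup_y            # 0.1 (0.3 observed vs 0.2 expected)
conviction = (1 - sup_y) / (1 - confidence)  # 1.5

# All values match the worked examples above (up to floating-point rounding).
```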
Configuration Tips
Best Practices for Apriori
Start Conservative:
- Begin with min_support = 0.05 (5%)
- Use max_length = 2 or 3
- Monitor number of results
For Learning:
- Apriori is the best algorithm to understand association mining
- Observe how itemset sizes grow each iteration
- Watch how the Apriori property prunes candidates
Performance Tips:
- Increase min_support if too slow
- Reduce max_length for faster results
- Consider switching to FP-Growth for larger datasets
When to Use:
- Datasets under 100k transactions
- Learning and understanding the fundamentals
- Need to explain the algorithm to stakeholders
Common Issues and Solutions
Too Many Patterns
Symptom: Thousands of itemsets and rules generated
Solutions:
- Increase min_support (0.02 -> 0.05)
- Enable advanced filtering
- Reduce max_length to 2
- Focus on specific product categories
Algorithm Too Slow
Symptom: Takes many minutes or doesn't complete
Solutions:
- Increase min_support significantly
- Reduce max_length to 2
- Switch to FP-Growth algorithm
- Filter data to fewer items
No Patterns Found
Symptom: Zero itemsets or rules generated
Solutions:
- Lower min_support (0.05 -> 0.01)
- Check data format (wide vs. long)
- Verify transaction ID column is correct
- Ensure items are properly separated
All High Confidence, Low Lift
Symptom: Many rules with 90%+ confidence but lift near 1.0
Solutions:
- Focus on lift instead of confidence
- Set min_lift to 1.5 or higher
- High confidence + low lift = item is just popular
- Look for surprising patterns (high lift)
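The popular-item trap can be seen numerically. Suppose (made-up numbers) milk appears in 90% of all baskets and co-occurs with bread exactly as often as independence predicts:

```python
sup_milk = 0.9          # milk is in 90% of all baskets
sup_bread = 0.3
sup_bread_milk = 0.27   # exactly sup_bread * sup_milk: no real association

confidence = sup_bread_milk / sup_bread  # ~0.9: looks impressive
lift = confidence / sup_milk             # ~1.0: the rule adds nothing
```

The 90% confidence only reflects milk's overall popularity; lift near 1.0 reveals that buying bread changes nothing.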