
Association Analysis

Discover patterns and relationships in transaction data

Association analysis discovers relationships between items that frequently occur together in transactions. Use it for market basket analysis, recommendation systems, cross-selling strategies, and pattern discovery in transaction databases.

Understanding Association Analysis

New to association analysis? Check out our Association Analysis AI Task Guide to learn the fundamentals of market basket analysis, key concepts like itemsets and rules, and when to use this technique.

Available Algorithms

We support five algorithms for mining frequent itemsets and generating association rules. The core algorithms all find the same frequent itemsets but use different search strategies with different performance trade-offs; FPMax is the exception, returning only the maximal frequent itemsets.

Core Algorithms

  • Apriori - Classic breadth-first algorithm, easy to understand, good for learning
  • FP-Growth - Fast tree-based algorithm, best for most use cases
  • Eclat - Vertical format algorithm, fast for sparse data

Specialized Algorithms

  • Relim - Memory-efficient recursive elimination
  • FPMax - Finds only maximal itemsets (compact representation)
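To make the level-wise, breadth-first strategy that Apriori uses concrete, here is a minimal pure-Python sketch (not the production implementation; it omits the subset-based candidate pruning of full Apriori, and the `baskets` data is invented for illustration):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise (breadth-first) frequent-itemset mining, Apriori style."""
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the support threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets pairwise to build (k+1)-itemset candidates
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
result = apriori([frozenset(t) for t in baskets], min_support=0.5)
```

With `min_support=0.5`, the sketch returns the three single items plus `{bread, milk}` and `{bread, butter}`; the triple `{bread, milk, butter}` appears in only 1 of 4 baskets and is pruned.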

What is Association Analysis?

Association analysis answers questions like:

  • "What products do customers buy together?"
  • "If someone buys X, what else are they likely to buy?"
  • "What items frequently appear in the same transaction?"

Classic Example: "Customers who buy diapers also buy beer" - a famous retail discovery showing unexpected associations.

Key Concepts

Itemset: A collection of items (e.g., [Bread, Milk])

Transaction: A set of items purchased together (e.g., one shopping cart)

Association Rule: If X then Y, written as X -> Y

  • Example: [Bread, Butter] -> [Milk]
  • Meaning: "Customers who buy bread and butter also buy milk"

Understanding Association Metrics

Support

Definition: How frequently an itemset appears in the database.

Formula: support(X) = (transactions containing X) / (total transactions)

Example:

  • 100 transactions total
  • [Bread, Milk] appears in 20 transactions
  • support([Bread, Milk]) = 20/100 = 0.2 = 20%

Interpretation:

  • 0.01 (1%): Rare pattern
  • 0.05 (5%): Moderate frequency
  • 0.2 (20%): Very common pattern

Use: Filter out rare, potentially spurious patterns
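The support calculation above can be reproduced directly from raw transactions. A minimal sketch, using an invented 100-basket dataset that matches the worked numbers:

```python
# Hypothetical data: 100 baskets, 20 of which contain both bread and milk
transactions = [{"bread", "milk"}] * 20 + [{"eggs"}] * 80

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

print(support({"bread", "milk"}, transactions))  # 0.2
```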

Confidence

Definition: Probability of finding Y in transactions that contain X.

Formula: confidence(X -> Y) = support(X U Y) / support(X)

Example:

  • support([Bread]) = 0.5 (50% of transactions)
  • support([Bread, Butter]) = 0.3 (30% of transactions)
  • confidence(Bread -> Butter) = 0.3 / 0.5 = 0.6 = 60%

Interpretation:

  • 0.6 = 60% of customers who buy bread also buy butter
  • Higher confidence = more reliable rule

Limitation: Can be misleading if Y is very common
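Expressed in code, the confidence formula is a one-line ratio of supports; this sketch plugs in the bread/butter numbers from the example above:

```python
def confidence(support_xy, support_x):
    """confidence(X -> Y) = support(X U Y) / support(X)"""
    return support_xy / support_x

# support([Bread, Butter]) = 0.3, support([Bread]) = 0.5
print(confidence(0.3, 0.5))  # 0.6
```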

Lift

Definition: How much more likely Y is with X versus without X.

Formula: lift(X -> Y) = confidence(X -> Y) / support(Y)

Example:

  • confidence(Bread -> Butter) = 0.6
  • support(Butter) = 0.4 (40% buy butter overall)
  • lift(Bread -> Butter) = 0.6 / 0.4 = 1.5

Interpretation:

  • lift = 1.0: No association (X and Y are independent)
  • lift > 1.0: Positive association (Y more likely with X)
    • 1.5 = 50% increase in likelihood
    • 2.0 = 2x more likely (100% increase)
  • lift < 1.0: Negative association (Y less likely with X)

Why Lift is Best for Discovery:

  • Accounts for item popularity
  • Detects true associations vs. coincidence
  • Symmetric: lift(X -> Y) = lift(Y -> X)
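The lift calculation follows the same pattern; a sketch with the example's numbers:

```python
def lift(confidence_xy, support_y):
    """lift(X -> Y) = confidence(X -> Y) / support(Y)"""
    return confidence_xy / support_y

# confidence(Bread -> Butter) = 0.6, support(Butter) = 0.4
print(lift(0.6, 0.4))  # ≈ 1.5: butter is 50% more likely when bread is bought
```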

Leverage

Definition: Difference between observed and expected co-occurrence.

Formula: leverage(X -> Y) = support(X U Y) - support(X) x support(Y)

Example:

  • support([Bread, Butter]) = 0.3 (observed)
  • support(Bread) x support(Butter) = 0.5 x 0.4 = 0.2 (expected if independent)
  • leverage = 0.3 - 0.2 = 0.1

Interpretation:

  • 0: No association
  • Positive: Items appear together more than expected
  • Negative: Items appear together less than expected
  • Magnitude matters: Higher absolute value = stronger relationship
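In code, leverage is the observed joint support minus the product expected under independence; a sketch with the example's numbers:

```python
def leverage(support_xy, support_x, support_y):
    """leverage(X -> Y) = support(X U Y) - support(X) * support(Y)"""
    return support_xy - support_x * support_y

# observed 0.3 vs. 0.5 * 0.4 = 0.2 expected if independent
print(leverage(0.3, 0.5, 0.4))  # ≈ 0.1
```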

Conviction

Definition: A dependency measure - how much more often the rule X -> Y would be wrong if X and Y were independent, compared to how often it is actually wrong.

Formula: conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))

Example:

  • support(Butter) = 0.4
  • confidence(Bread -> Butter) = 0.6
  • conviction = (1 - 0.4) / (1 - 0.6) = 0.6 / 0.4 = 1.5

Interpretation:

  • 1.0: No association (independent)
  • >1.0: Y depends on X
  • infinity: Perfect dependency (always Y when X)

Use: Measures how much the rule deviates from independence
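A sketch of the conviction formula, including the perfect-dependency edge case where the denominator would be zero:

```python
def conviction(support_y, confidence_xy):
    """conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))"""
    if confidence_xy == 1.0:
        return float("inf")  # the rule never fails: perfect dependency
    return (1 - support_y) / (1 - confidence_xy)

# support(Butter) = 0.4, confidence(Bread -> Butter) = 0.6
print(conviction(0.4, 0.6))  # ≈ 1.5
```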

Choosing the Right Algorithm

Quick Decision Guide

Start with FP-Growth unless:

  • Learning (use Apriori for intuition)
  • Sparse data (try Eclat)
  • Limited memory (try Relim)
  • Want compact results (use FPMax)

By Dataset Size

  • Small (<1k transactions): Any algorithm, Apriori is fine
  • Medium (1k-100k): FP-Growth (best), Eclat (for sparse)
  • Large (>100k): FP-Growth, Eclat, Relim

By Data Characteristics

Dense transactions (many items per transaction):

  • FP-Growth (best)
  • Apriori (small datasets)

Sparse transactions (few items per transaction):

  • Eclat (best for sparse)
  • FP-Growth

Many unique items:

  • Eclat (handles many items well)
  • FP-Growth

By Goal

Learning / Understanding:

  • Apriori (most intuitive)

Production / Performance:

  • FP-Growth (fastest, most reliable)

Compact Results:

  • FPMax (only longest patterns)

Memory Constraints:

  • Relim (memory-efficient)

Best Practices

1. Start with the Right Support

  • Don't start too low (<0.001)
  • Begin with moderate support (0.01-0.05)
  • Lower gradually if needed
  • Monitor number of results

2. Focus on Actionable Rules

  • High lift (>1.5) for strong associations
  • Reasonable confidence (>0.5) for reliability
  • Consider support (not too rare)
  • Look for surprising patterns (high lift + moderate confidence)

3. Filter and Interpret Results

Good Rules:

  • Lift >1.5 (strong association)
  • Confidence >0.5 (reliable)
  • Support >0.01 (not too rare)
  • Make business sense

Suspicious Rules:

  • Lift ≈ 1.0 (no real association)
  • Very high confidence + low lift (item just popular)
  • Very low support (might be noise)
  • Contradicts domain knowledge
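The good-rule thresholds above translate directly into a filter. A minimal sketch, where the rule tuples and their numbers are invented for illustration:

```python
# Hypothetical rules as (antecedent, consequent, support, confidence, lift)
rules = [
    ({"bread", "butter"}, {"milk"}, 0.20, 0.60, 1.80),
    ({"bag"}, {"receipt"}, 0.90, 0.99, 1.01),      # popular item: high confidence, lift near 1
    ({"caviar"}, {"truffle"}, 0.002, 0.80, 4.00),  # very low support: likely noise
]

# Keep only rules that meet all three "good rule" thresholds
good = [
    (antecedent, consequent)
    for antecedent, consequent, sup, conf, lift in rules
    if lift > 1.5 and conf > 0.5 and sup > 0.01
]
print(good)  # only the bread+butter -> milk rule survives
```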

4. Domain Validation

  • Validate with domain experts
  • Check if patterns make business sense
  • Look for actionable insights
  • Test recommendations with A/B testing

5. Segment Your Analysis

Analyze different segments separately:

  • Store locations
  • Customer demographics
  • Time periods (seasonal patterns)
  • Product categories

6. Practical Applications

Retail / E-commerce:

  • Product recommendations ("You might also like...")
  • Store layout optimization
  • Promotional bundling
  • Cross-selling strategies

Healthcare:

  • Symptom-disease associations
  • Drug interaction patterns
  • Treatment combinations

Web Analytics:

  • Page navigation patterns
  • Feature usage combinations
  • User behavior sequences

Common Pitfalls

1. Support Too Low

  • Generates too many patterns
  • Includes noise and spurious patterns
  • Very slow computation
  • Fix: Start with 0.01-0.05, lower gradually

2. Ignoring Lift

  • Using only confidence can be misleading
  • Popular items have high confidence by default
  • Fix: Always check lift >1.0, prefer >1.5

3. Too Many Items

  • Exponential growth in patterns
  • Overwhelming results
  • Fix:
    • Increase min_support
    • Limit max_length to 2-3
    • Focus on specific product categories

4. Not Filtering Results

  • Raw output is overwhelming
  • Many redundant patterns
  • Fix:
    • Use advanced filters (confidence + lift)
    • Focus on high-lift rules
    • Sort by interestingness metrics

5. Misinterpreting Causation

  • Association ≠ causation
  • Correlation might be coincidental
  • Fix: Validate with experiments and domain knowledge
