Mining frequent patterns and rules
Association rules: conditional dependencies
Two stages
- Find frequent patterns
- Derive associations (A → B) from frequent patterns
Find patterns in
- Sequences (time series data, fault analysis)
- Transactions (market basket data)
- Graphs (social network analysis)
Mining Transactions
- A (sub) of items is called an itemsetAssociation rules: conditional dependencies
Two stages
- Find frequent patterns
- Derive associations (A → B) from frequent patterns
Find patterns in
- Sequences (time series data, fault analysis)
- Transactions (market basket data)
- Graphs (social network analysis)
Mining Transactions
- Transaction is a collection of items bought together
- Find frequent itemsets
- Itemset A → B, if both A and A ሀ B are frequent itemsets.
Frequent Pattern Mining
- Support of rule is the percentage of itemsets containing A ሀ B
- Confidence of a rule is the percentage of itemsets containing A that also contain A ሀ B
- We look for rules with both high support and confidence
Association Rules
Applications
- Market Basket analysis
- Topic identification
- co-occurrence of words
- Plagiarism Detection !
- Biomarkers
- Genes or proteins vs. disease
- Time series analysis !
- Trigger Events
Finding Frequent Itemsets
- Generate candidate frequent itemsets and then prune based on count
- Combinatorial number of candidates !
- Need a clever way of generating fewer candidates.
- Apriori Property !
Apriori Algorithm
- Apriori property : All nonempty subsets of a frequent itemset must also be frequent.
- R. Agrawal and R. Srikant Fast algorithms for mining association rules. VLDB 1994. pp. 487-499
- Seminal algorithm
- Set the tone for the field
- Numerous improvements
Apriori illustrated
Caveat
- High confidence is not always a good idea
- Buys games => Buys videos confidence 66% support 37%
- But, Buys videos 75% of the transactions !
- Negative correlation
- Lift
- Ratio of Confidence of rule to that of default rule
- Interest : difference
Challenges
- Millions of transaction
- Billions of potential itemsets
- Non discrete data
- Time series and images
- Graphs
- Non frequent, but significant
- A occurs 0.3% of all transactions, but when B occurs it occurs in 1.2% of the transactions
No comments:
Post a Comment