Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques
● Association Rules Problem Overview
  - Large itemsets
● Association Rules Algorithms
  - Apriori
  - Sampling
  - Partitioning
  - Parallel Algorithms
● Comparing Techniques
● Incremental Algorithms
● Advanced AR Techniques
Market Basket Analysis
● Example: Market Basket Data - items frequently purchased together:
    Bread ⇒ PeanutButter
● Uses:
  - Placement
  - Advertising
  - Sales
  - Coupons
● Objective: increase sales and reduce costs

Association Rule Definitions
● Set of items: I = {I1, I2, …, Im}
● Transactions: D = {t1, t2, …, tn}, tj ⊆ I
● Itemset: {Ii1, Ii2, …, Iik} ⊆ I
● Support of an itemset: percentage of transactions which contain that itemset.
● Large (Frequent) itemset: itemset whose number of occurrences is above a threshold.
Association Rules Example
● I = {Beer, Bread, Jelly, Milk, PeanutButter}
● Support of {Bread, PeanutButter} is 60%

Association Rule Definitions
● Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
● Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
  - P(X, Y) = σ(X ∪ Y) / |T|
● Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
  - P(Y|X) = σ(X ∪ Y) / σ(X)

Association Rules Example (cont'd)
● 500,000 transactions
  - 20,000 transactions contain diapers
  - 30,000 transactions contain milk
  - 10,000 transactions contain both diapers & milk

  Rule               s       c
  Diapers ⇒ Milk     0.02    0.50
  Milk ⇒ Diapers     0.02    0.33

● Continuing the example:
  - 10,000 transactions contain wipes
  - 8,000 transactions contain wipes & diapers
  - 220 transactions contain wipes & milk
  - 200 transactions contain wipes & diapers & milk

  Rule                       c, %     s, %
  Wipes ⇒ Diapers            80       1.6
  Diapers ⇒ Wipes            40       1.6
  Wipes ⇒ Milk               2.2      0.04
  Milk ⇒ Wipes               0.73     0.04
  Wipes & Milk ⇒ Diapers     90.91    0.04
  Diapers ⇒ Milk             50       2
  Milk ⇒ Diapers             33.33    2
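As a quick illustration (ours, not part of the original slides), the support and confidence values in the first table above can be reproduced in Python directly from the stated counts; the variable names are ours:

# Support and confidence from raw transaction counts (diapers/milk example).
total = 500_000          # total transactions
n_diapers = 20_000       # transactions containing diapers
n_milk = 30_000          # transactions containing milk
n_both = 10_000          # transactions containing diapers AND milk

support = n_both / total              # s(Diapers => Milk) = s(Milk => Diapers)
conf_d_to_m = n_both / n_diapers      # c(Diapers => Milk)
conf_m_to_d = n_both / n_milk         # c(Milk => Diapers)
print(support, conf_d_to_m, conf_m_to_d)   # 0.02 0.5 0.333...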
Association Rule Problem
● Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
● Link Analysis
● NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y.
Association Rule Techniques
1. Find Large Itemsets.
2. Generate rules from frequent itemsets.
Association Rule Mining
● Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:
  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Cocoa}
  {Milk, Bread} → {Eggs, Coke}
  {Cocoa, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
● Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Diaper}
  - k-itemset: an itemset that contains k items
● Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
● Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
  - An itemset whose support is greater than or equal to a minsup threshold

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke
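A minimal sketch (ours, not part of the slides) of the support count σ and support s over the market-basket table above; the helper names support_count and support are assumptions for illustration:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(itemset): number of transactions containing every item in itemset
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    # s(itemset): fraction of transactions containing the itemset
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))   # 2
print(support({"Milk", "Bread", "Diaper"}))         # 0.4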
Definition: Association Rule
● Association Rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Cocoa}
● Rule Evaluation Metrics
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X

Example (using the market-basket transactions TID 1 to 5 above): {Milk, Diaper} ⇒ Cocoa
  s = σ(Milk, Diaper, Cocoa) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Cocoa) / σ(Milk, Diaper) = 2/3 = 0.67

Association Rule Mining Task
● Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
● Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
Mining Association Rules

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Cocoa}     (s = 0.4, c = 0.67)
  {Milk, Cocoa} → {Diaper}     (s = 0.4, c = 1.0)
  {Diaper, Cocoa} → {Milk}     (s = 0.4, c = 0.67)
  {Cocoa} → {Milk, Diaper}     (s = 0.4, c = 0.67)
  {Diaper} → {Milk, Cocoa}     (s = 0.4, c = 0.5)
  {Milk} → {Diaper, Cocoa}     (s = 0.4, c = 0.5)

Observations:
- All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements (see the sketch below)
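The six rules listed above can be checked with a short sketch (ours) that enumerates every binary partition of {Milk, Diaper, Cocoa} and computes s and c over the transactions; helper names are ours:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Milk", "Diaper", "Cocoa"}
n = len(transactions)
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        X = set(lhs)
        Y = itemset - X
        s = sigma(itemset) / n          # identical for every rule from this itemset
        c = sigma(itemset) / sigma(X)   # differs with the antecedent X
        print(f"{X} -> {Y}: s={s:.1f}, c={c:.2f}")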
Mining Association Rules
● Two-step approach:
  1. Frequent Itemset Generation
     - Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
● Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Itemset lattice over items A, B, C, D, E, from the null set up to ABCDE]
● Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
● Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database: match each of the N transactions (of width w) against every one of the M candidates
  - Complexity ~ O(NMw) => expensive, since M = 2^d !!!

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke
Computational Complexity
● Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules:
      R = Σ_{k=1..d-1} [ C(d, k) × Σ_{j=1..d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1
  - If d = 6, R = 602 rules (verified in the sketch below)
  - Complete search: M = 2^d candidate itemsets
● Ways to reduce the cost:
  - Reduce the size of N as the size of the itemset increases
  - Use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction
  - Exploit the fact that if an itemset is frequent, then all of its subsets must also be frequent; the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
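A quick check (ours) that the double-sum rule-count formula above equals 3^d - 2^(d+1) + 1 and gives 602 rules for d = 6:

from math import comb

def num_rules(d):
    # sum over antecedent size k and consequent size j, as in the formula above
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d), 3 ** d - 2 ** (d + 1) + 1)   # 602 602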
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
● Reduce the number of transactions (N)
  - Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
Apriori
● Large Itemset Property: Any subset of a large itemset is large.
● Contrapositive: If an itemset is not large, none of its supersets are large.

Reducing Number of Candidates
● Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
● The Apriori principle holds due to the following property of the support measure:
    ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
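A tiny check (ours) of this property on the market-basket data used earlier: support never increases as the itemset grows.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    # support: fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(s({"Milk"}), s({"Milk", "Diaper"}), s({"Milk", "Diaper", "Cocoa"}))  # 0.8 0.6 0.4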
Large Itemset Property

Apriori Example (cont'd): s = 30%, α = 50%

Illustrating Apriori Principle
[Itemset lattice over items A, B, C, D, E: itemset AB is found to be infrequent, so all of its supersets are pruned]
Illustrating Apriori Principle (using the market-basket transactions above, Minimum Support = 3)

Items (1-itemsets):
  Item      Count
  Bread     4
  Coke      2
  Milk      4
  Eggs      1
  Diaper    4
  Cocoa     3

Pairs (2-itemsets), generated only from the frequent items
(no need to generate candidates involving Coke or Eggs):
  Itemset            Count
  {Bread,Milk}       3
  {Bread,Diaper}     3
  {Bread,Cocoa}      2
  {Milk,Diaper}      3
  {Milk,Cocoa}       2
  {Cocoa,Diaper}     3

Triplets (3-itemsets):
  Itemset                 Count
  {Bread,Milk,Diaper}     2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
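A short sketch (ours) reproducing the candidate counts quoted above, 41 without pruning versus 6 + 6 + 1 = 13 with support-based pruning (minimum support count 3):

from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))            # the 6 distinct items

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# If every 1-, 2- and 3-itemset is considered:
print(comb(6, 1) + comb(6, 2) + comb(6, 3))           # 41

# With support-based pruning (minimum support count 3):
frequent1 = [i for i in items if sigma({i}) >= 3]     # Bread, Cocoa, Diaper, Milk
pairs = list(combinations(frequent1, 2))              # 6 candidate pairs
frequent2 = [p for p in pairs if sigma(p) >= 3]
triples = [t for t in combinations(frequent1, 3)
           if all(p in frequent2 for p in combinations(t, 2))]   # only {Bread, Diaper, Milk}
print(len(items) + len(pairs) + len(triples))         # 6 + 6 + 1 = 13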
The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB:
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan -> C1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {D}       1
  {E}       3

L1 ({D} is pruned, sup < 2):
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {E}       3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan -> C2 with counts:
  Itemset   sup
  {A,B}     1
  {A,C}     2
  {A,E}     1
  {B,C}     2
  {B,E}     3
  {C,E}     2

L2:
  Itemset   sup
  {A,C}     2
  {B,C}     2
  {B,E}     3
  {C,E}     2

C3 (generated from L2): {B,C,E}

3rd scan -> L3:
  Itemset    sup
  {B,C,E}    2

Apriori Algorithm
  C1 = Itemsets of size one in I;
  Determine all large itemsets of size 1, L1;
  i = 1;
  Repeat
      i = i + 1;
      Ci = Apriori-Gen(L(i-1));
      Count Ci to determine Li;
  until no more large itemsets found;
Apriori Algorithm (Method):
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
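A compact Apriori sketch in Python (ours; the function names apriori_gen and apriori are assumptions, and this is not the original pseudocode): candidate k-itemsets are joined from the frequent (k-1)-itemsets, pruned if any (k-1)-subset is infrequent, and then counted against the database, exactly as in the method above.

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Join: form k-itemsets from items appearing in frequent (k-1)-itemsets.
    # Prune: drop any candidate that has an infrequent (k-1)-subset.
    items = sorted(set().union(*prev_frequent))
    return [frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))]

def apriori(transactions, minsup_count):
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup_count}
    all_frequent, k = set(frequent), 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        frequent = {c for c in candidates if count(c) >= minsup_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

# The TDB example above, with Sup_min = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(tdb, 2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
# -> ['A'], ['B'], ['C'], ['E'], ['A','C'], ['B','C'], ['B','E'], ['C','E'], ['B','C','E']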
Problems with Support and Confidence
● A rule may have high support and high confidence simply because it is an obvious rule: someone who buys potato chips is highly likely to buy a soft drink -> not surprising.
● Confidence totally ignores P(B): P(B|A) = σ(A ∪ B) / σ(A)

Interestingness Measure: Correlations (Lift)
● Measure of dependent/correlated events: lift
    lift(A ⇒ B) = P(A ∪ B) / (P(A) P(B))
  - lift > 1: positively correlated
  - lift = 1: independent
  - lift < 1: negatively correlated
● Lift(Wipes, Milk ⇒ Diapers) is 22.73
  - People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than customers overall.
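A quick check (ours) of the quoted lift value, using the counts from the earlier wipes/milk/diapers example:

N = 500_000
p_wipes_milk = 220 / N        # P(Wipes, Milk)
p_diapers = 20_000 / N        # P(Diapers)
p_all_three = 200 / N         # P(Wipes, Milk, Diapers)

lift = p_all_three / (p_wipes_milk * p_diapers)
print(round(lift, 2))         # 22.73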
● play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is already greater than 66.7%.
● play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

                 Basketball   Not basketball   Sum (row)
  Cereal             2000           1750           3750
  Not cereal         1000            250           1250
  Sum (col.)         3000           2000           5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Conviction
    conviction(A ⇒ B) = P(A) P(¬B) / P(A, ¬B)
● Conviction = 1: A and B are independent (not related).
● Conviction = ∞: the rule always holds (B appears in every transaction that contains A).
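A sketch (ours) computing lift and conviction on the basketball/cereal contingency table above:

N = 5000
p_b = 3000 / N           # P(basketball)
p_c = 3750 / N           # P(cereal)
p_bc = 2000 / N          # P(basketball, cereal)
p_b_notc = 1000 / N      # P(basketball, not cereal)

lift_b_c = p_bc / (p_b * p_c)
lift_b_notc = p_b_notc / (p_b * (1250 / N))
print(round(lift_b_c, 2), round(lift_b_notc, 2))    # 0.89 1.33

# conviction(B => C) = P(B) * P(not C) / P(B, not C)
conviction_b_c = p_b * (1 - p_c) / p_b_notc
print(round(conviction_b_c, 2))                     # 0.75: below 1, negatively associated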
Chi-square (χ²)
    χ² = Σ (observed - expected)² / expected

For an r × c contingency table:
    χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij - e_ij)² / e_ij
    e_ij = count(A = a_i) × count(B = b_j) / N
    DOF = (r - 1) × (c - 1)
where attribute A takes distinct values a_1, a_2, a_3, …, attribute B takes distinct values b_1, b_2, …, and N is the total number of observations.
Example: Coke consumption vs. osteoporosis (76 people)

Observed:
                      Coke   Non-Coke   TOTAL
  Osteoporosis          22         10      32
  No osteoporosis       16         28      44
  TOTAL                 38         38      76

Expected:
                      Coke   Non-Coke   TOTAL
  Osteoporosis          16         16      32
  No osteoporosis       22         22      44
  TOTAL                 38         38      76

● DOF (df) = (2 - 1)(2 - 1) = 1
● Significance level α = 0.01
● χ² (tabulated) = 6.63
● χ² (calculated) = (22 - 16)²/16 + (16 - 22)²/22 + (10 - 16)²/16 + (28 - 22)²/22 = 7.77
● Since 7.77 > 6.63, the null hypothesis (no correlation) is rejected: the two attributes are correlated.
Chi-square test procedure:
● Null hypothesis: A & B are not correlated
● Calculate the expected values
● Calculate the chi-square value
● Compare the calculated and tabulated values
● Accept or reject the null hypothesis accordingly
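A sketch (ours) reproducing the chi-square calculation for the Coke/osteoporosis example above:

observed = [[22, 10],   # osteoporosis:    Coke, Non-Coke
            [16, 28]]   # no osteoporosis: Coke, Non-Coke

row_totals = [sum(r) for r in observed]            # 32, 44
col_totals = [sum(c) for c in zip(*observed)]      # 38, 38
N = sum(row_totals)                                # 76

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N      # expected count for cell (i, j)
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))   # 7.77 > 6.63 (tabulated, df = 1, alpha = 0.01) -> reject H0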
Chi-square distribution table
Which Measures Should Be Used?