Association Rules
Lecture 4/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia
Objectives
- Introduction
- What is Association Mining?
- Mining Association Rules
- Algorithms for Association Rules Mining
- Visualization
Introduction
- You sell more if customers can see the product.
- Customers that purchase one type of product are likely to be interested in other particular products.
- Market-basket analysis = studying the composition of the shopping basket of products purchased during a single shopping event.
- Market-basket data = the transactional list of purchases by customer. It is challenging, because:
  - Very large number of records (often millions of transactions/day)
  - Sparseness (each market basket contains only a small portion of items carried)
  - Heterogeneity (those with different tastes tend to purchase a specific subset of items).
Introduction (2)
- Product presentations can be more intelligently planned for specific times of day, days of the week, or holidays.
- Can also involve sequential relationships.
- Market-basket analysis is an undirected DM operation (along with clustering), seeking patterns that were previously unknown.
- Cross-selling
  - The propensity for the purchaser of a specific item to purchase a different item
  - Can be maximized by locating products that tend to be purchased by the same consumer in places where both products can be seen.
What is Association Mining?
- Association rule mining (ARM): finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Frequent pattern: a pattern that occurs frequently in a database.
- Motivation: finding regularities in data
  - What products were often purchased together? — Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
Why is Frequent Pattern or Association Mining an Essential Task in DM?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.
What is Association Mining?
- Examples:
  - Rule form: "A → B [support, confidence]".
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
- A support of 0.5% for an association rule means that 0.5% of all transactions show that diapers and beers are purchased together.
- A confidence of 60% means that 60% of the customers who purchased diapers also bought beers.
- Rules that satisfy both the minimum support and minimum confidence thresholds are called strong.
What is Association Mining?
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
  - {beer, diaper} is a 2-itemset.
- If an itemset satisfies minimum support, then it is a frequent itemset.
- The set of frequent k-itemsets is commonly denoted by Lk.
- ARM is a two-step process:
  - Find all frequent itemsets.
  - Generate strong association rules from the frequent itemsets.
- The second step is the easier of the two. The overall performance of mining association rules is determined by the first step.
Association Mining

Research lineage, starting from mining association rules (Agrawal et al., SIGMOD '93):
- Better algorithms:
  - Fast algorithms (Agrawal et al., VLDB '94)
  - Hash-based (Park et al., SIGMOD '95)
  - Partitioning (Navathe et al., VLDB '95)
  - Direct Itemset Counting (Brin et al., SIGMOD '97)
  - Parallel mining (Agrawal et al., TKDE '96)
  - Distributed mining (Cheung et al., PDIS '96)
  - Incremental mining (Cheung et al., ICDE '96)
- Problem extensions:
  - Generalized A.R. (Srikant et al.; Han et al., VLDB '95)
  - Quantitative A.R. (Srikant et al., SIGMOD '96)
  - N-dimensional A.R. (Lu et al., DMKD '98)
  - Meta-rule-guided mining
Many Kinds of Association Rules
- Boolean association rule:
  - Concerns associations between the presence or absence of items.
  - Example: buys(x, "diapers") → buys(x, "beers") [0.5%, 60%] (R1)
- Quantitative association rule:
  - Describes associations between quantitative items. Quantitative values for items are partitioned into intervals.
  - Example: age(X, "30-39") ∧ income(X, "42K-48K") → buys(X, "LCD TV") (R2)
  - Age and income have been discretized.
- Single-dimensional association rule = R1
- Multidimensional association rule = R2
Many Kinds of Association Rules
- Single-level association rule
  - Example: age(X, "30-39") → buys(X, "laptop computer")
- Multilevel association rule
  - Example: age(X, "30-39") → buys(X, "computer")
  - Computer is a higher-level abstraction of laptop
- Various extensions
  - Mining maximal frequent patterns.
  - If p is a maximal frequent pattern, then any superpattern of p is not frequent.
  - Used to substantially reduce the number of frequent itemsets generated in mining.
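Maximal patterns can be extracted from a set of frequent itemsets with a simple superset check; a minimal sketch (the toy collection and function name are illustrative, not from the lecture):

```python
def maximal(frequent):
    """Keep only the itemsets that have no frequent proper superset."""
    sets = [frozenset(s) for s in frequent]
    return [s for s in sets if not any(s < t for t in sets)]

# Toy collection: {a, b, c} is frequent, so all its subsets are too
frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"},
            {"b", "c"}, {"a", "b", "c"}]
print(maximal(frequent))  # only frozenset({'a', 'b', 'c'}) survives
```

Seven frequent itemsets collapse to a single maximal one, which is the compression effect the slide describes.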
Mining Single-Dimensional Boolean Association Rules
- Given
  - A database of customer transactions
  - Each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items in the consequent/antecedent of a rule
  - Possible to specify constraints on rules (e.g., find only rules involving home laundry appliances).
Application Examples
- Market-basket analysis
  - * → Fanta: what the store should do to boost Fanta sales
  - Bodrex → *: what other products the store should stock up on if it has a sale on Bodrex
- Attached mailing in direct marketing
Rule Measures: Support and Confidence
- Find all the rules X & Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)

(Diagram: Venn diagram of "Customer buys beer", "Customer buys diaper", and "Customer buys both".)
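The two measures can be computed directly from the transaction table above; a minimal sketch (the helper names are illustrative, not from the lecture):

```python
from fractions import Fraction

# The four transactions from the slide
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    # Conditional probability: support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "C"}))       # 1/2  -> the 50% support of A ⇒ C
print(confidence({"A"}, {"C"}))  # 2/3  -> the 66.6% confidence of A ⇒ C
print(confidence({"C"}, {"A"}))  # 1    -> the 100% confidence of C ⇒ A
```

Using `Fraction` keeps the percentages exact, which makes it easy to check the numbers on the slide.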
Mining Association Rules -- Example

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Min. support 50%, min. confidence 50%:

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%

The Apriori principle:
Any subset of a frequent itemset must be frequent.
Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset, i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
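The rule-generation step can be sketched as follows, reusing the frequent itemsets and support counts of the A/B/C example from the earlier slide (the function name and data layout are my own):

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Step 2 of ARM: emit every strong rule X ⇒ Y (conf >= min_conf)
    from a dict mapping frequent itemsets (frozensets) to support counts."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / freq[lhs]   # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Frequent itemsets of the 4-transaction example (support counts out of 4)
freq = {frozenset("A"): 3, frozenset("B"): 2,
        frozenset("C"): 2, frozenset("AC"): 2}
rules = generate_rules(freq, min_conf=0.5)
for lhs, rhs, conf in rules:
    print(set(lhs), "=>", set(rhs), round(conf, 3))
```

This emits exactly the two strong rules from the slide: A ⇒ C with confidence 2/3 and C ⇒ A with confidence 1.0. Because every subset of a frequent itemset is itself frequent, `freq[lhs]` is always present — no database scan is needed in this step.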
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
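The pseudocode above can be rendered directly in Python; a compact, unoptimized sketch (candidate generation joins Lk with itself and prunes by the Apriori principle; names are my own, and the data is the Example 1 database):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a sorted tuple) with its support count."""
    freq = {}
    # L1: frequent 1-itemsets
    level = []
    for i in sorted({i for t in transactions for i in t}):
        count = sum(1 for t in transactions if i in t)
        if count >= min_support:
            freq[(i,)] = count
            level.append((i,))
    k = 1
    while level:
        # Join Lk with itself: merge itemsets sharing the first k-1 items,
        # then prune candidates that have an infrequent k-subset.
        candidates = set()
        for a in level:
            for b in level:
                if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                    cand = a + (b[k - 1],)
                    if all(sub in freq for sub in combinations(cand, k)):
                        candidates.add(cand)
        # One pass over the database counts all surviving candidates.
        level = []
        for cand in sorted(candidates):
            count = sum(1 for t in transactions if set(cand) <= set(t))
            if count >= min_support:
                freq[cand] = count
                level.append(cand)
        k += 1
    return freq

# Database D from Example 1, min support = 2 transactions (50%)
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, 2)
print(result[(2, 3, 5)])  # 2 -- the single frequent 3-itemset
```

Each `while` iteration performs one full database scan, which is exactly the weakness discussed under "Major Drawbacks of Apriori" below.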
The Apriori Algorithm – Example 1

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for C1:
{1}: 2   {2}: 3   {3}: 3   {4}: 1   {5}: 3

L1 (min support = 2):
{1}: 2   {2}: 3   {3}: 3   {5}: 3

C2 (generated from L1), scan D:
{1 2}: 1   {1 3}: 2   {1 5}: 1   {2 3}: 2   {2 5}: 3   {3 5}: 2

L2:
{1 3}: 2   {2 3}: 2   {2 5}: 3   {3 5}: 2

C3 (generated from L2), scan D:
{2 3 5}: 2

L3:
{2 3 5}: 2
The Apriori Algorithm – Example 2

Sample database:
Tid   Items
1     3 4 5 6 7 9
2     1 3 4 5 13
3     1 2 4 5 7 11
4     1 3 4 8
5     1 3 4 10

Frequent patterns with MinSupport = 40%:
Itemset   Tid-list     Support
1         2,3,4,5      80% (4)
3         1,2,4,5      80% (4)
4         1,2,3,4,5    100% (5)
5         1,2,3        60% (3)
7         1,3          40% (2)
1 3       2,4,5        60% (3)
1 4       2,3,4,5      80% (4)
1 5       2,3          40% (2)
3 4       1,2,4,5      80% (4)
3 5       1,2          40% (2)
4 5       1,2,3        60% (3)
4 7       1,3          40% (2)
5 7       1,3          40% (2)
1 3 4     2,4,5        60% (3)
1 4 5     2,3          40% (2)
3 4 5     1,2          40% (2)
4 5 7     1,3          40% (2)

Association rules with MinSupport = 40% and MinConf = 70%:
Rule       Support    Confidence
7 ⇒ 5      40% (2)    100%
7 ⇒ 4      40% (2)    100%
5 ⇒ 4      60% (3)    100%
3 ⇒ 1      60% (3)    75%
1 ⇒ 3      60% (3)    75%
3 ⇒ 4      80% (4)    100%
4 ⇒ 3      80% (4)    80%
1 ⇒ 4      80% (4)    100%
4 ⇒ 1      80% (4)    80%
5 7 ⇒ 4    40% (2)    100%
4 7 ⇒ 5    40% (2)    100%
3 5 ⇒ 4    40% (2)    100%
1 5 ⇒ 4    40% (2)    100%
1 3 ⇒ 4    60% (3)    100%
3 4 ⇒ 1    60% (3)    75%
1 4 ⇒ 3    60% (3)    75%
The Apriori Algorithm – Example 2 (candidate generation trace)

C1 (scan D for the count of each candidate):
1: 4   2: 1   3: 4   4: 5   5: 3   6: 1
7: 2   8: 1   9: 1   10: 1  11: 1  13: 1

L1 (prune infrequent patterns):
1: 4   3: 4   4: 5   5: 3   7: 2

C2 (generate candidates from L1, scan D for counts):
1 3: 3   1 4: 4   1 5: 2   1 7: 1   3 4: 4
3 5: 2   3 7: 1   4 5: 3   4 7: 2   5 7: 2

L2 (prune infrequent patterns):
1 3: 3   1 4: 4   1 5: 2   3 4: 4
3 5: 2   4 5: 3   4 7: 2   5 7: 2

C3 (generate candidates from L2, scan D for counts):
1 3 4: 3   1 3 5: 1   1 4 5: 2   1 4 7: 1   1 5 7: 1
3 4 5: 2   3 4 7: 1   3 5 7: 1   4 5 7: 2

L3 (prune infrequent patterns):
1 3 4: 3   1 4 5: 2   3 4 5: 2   4 5 7: 2
Major Drawbacks of Apriori
- Apriori has to read the database many times to test the support of candidate patterns. To find a frequent pattern X of length 50, it has to traverse the database 50 times.
- On dense datasets with long patterns, as the pattern length increases, the performance of Apriori drops rapidly due to the explosion of candidate patterns.
TreeProjection Algorithm
(Agarwal, Aggarwal & Prasad 2000)

(Figure: a lexicographical tree of frequent itemsets — levels 0 to 3, e.g. null → 7 → 75 → 754 — with the projected transaction sets attached to the nodes, and a triangular matrix at each node for counting frequent patterns of length two.)
Eclat Algorithm
(Zaki et al. 1997)

(Figure: depth-first search over the prefix-based itemset lattice of {1, 3, 4, 5, 7}. Each itemset carries a tid-list; the tid-list of a candidate such as {3, 4, 5} is obtained by intersecting the tid-lists of {3, 4} and {3, 5}. Support is simply the length of the tid-list, so no rescan of the database is needed.)
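The tid-list scheme can be sketched with set intersection over a vertical (item → set of tids) layout of the Example 2 database; a minimal sketch (the function and variable names are my own):

```python
def eclat(prefix, items, min_sup, out):
    """DFS over the prefix lattice; `items` is a list of (item, tidset) pairs,
    all already frequent. The support of an itemset is the size of its
    intersected tid-list, so the database is never rescanned."""
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Extend the prefix: intersect with every later item's tid-list.
        suffix = []
        for item2, tids2 in items[i + 1:]:
            inter = tids & tids2
            if len(inter) >= min_sup:
                suffix.append((item2, inter))
        eclat(itemset, suffix, min_sup, out)

# Vertical layout of the sample database (only items frequent at 40%)
vertical = [(1, {2, 3, 4, 5}), (3, {1, 2, 4, 5}), (4, {1, 2, 3, 4, 5}),
            (5, {1, 2, 3}), (7, {1, 3})]
out = {}
eclat((), vertical, 2, out)
print(len(out))  # 17 -- the same frequent itemsets as in Example 2
```

The recursion mirrors the lattice in the figure: each call explores one prefix class, carrying only the tid-lists it still needs.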
FP-Growth
(Han, Pei & Yin 2000)

(Figure: FP-Tree for the sample database. Items are ordered by descending frequency — 4, 3, 1, 5, 7 — and each transaction is inserted as a path from the root, sharing prefixes; the root's only child is 4:5, with 3:4 and 1:1 below it. A header table links all nodes of each item via node-links.)
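FP-Tree construction can be sketched in a few lines. Note that tie-breaking between equally frequent items (here, items 1 and 3) is arbitrary, so the exact shape may differ from the figure; the class and function names are my own:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fptree(transactions, min_sup):
    """One pass counts items; a second pass inserts frequency-sorted
    transactions into a prefix tree with a header table of node-links."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # Keep frequent items only, ordered by descending count (ties: by item id)
    rank = {i: (-c, i) for i, c in counts.items() if c >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)          # item -> all nodes carrying it
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            else:
                child.count += 1
            node = child
    return root, header

db = [{3, 4, 5, 6, 7, 9}, {1, 3, 4, 5, 13}, {1, 2, 4, 5, 7, 11},
      {1, 3, 4, 8}, {1, 3, 4, 10}]
root, header = build_fptree(db, 2)
print(root.children[4].count)  # 5 -- item 4 heads every path
```

Infrequent items (6, 8, 9, 10, 11, 13) never enter the tree, and shared prefixes are stored once — the compression FP-Growth mines without candidate generation.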
CT-PRO
(Sucahyo & Gopalan 2004)

(a) Frequent items:
Tid   Items
1     3 4 5 7
2     1 3 4 5
3     1 4 5 7
4     1 3 4
5     1 3 4

(b) Mapped (items renumbered by descending frequency):
Tid   Items
1     1 2 4 5
2     1 2 3 4
3     1 3 4 5
4     1 2 3
5     1 2 3

ItemTable:
Index   Item   Count
1       4      5
2       3      4
3       1      4
4       5      3
5       7      2

(c) Global CFP-Tree (figure): the mapped transactions stored in a Compressed FP-Tree (CFP-Tree), a non-recursive, level-by-level representation in which each node carries an array of counts rather than the single count per node of an FP-Tree.
Mining Very Large Database
Partition Algorithm (Savasere, Omiecinski & Navathe 1995)

Tid   Items
1     3 4 5 6 7 9
2     1 3 4 5 13
3     1 2 4 5 7 11
4     1 3 4 8
5     1 3 4 10

Scan each partition (P1 = {T1, T2, T3}, P2 = {T4, T5}) to find local frequent patterns:

FP 1: {{1}, {1,4}, {1,5}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
FP 2: {{1}, {1,3}, {1,4}, {1,3,4}, {3}, {3,4}, {4}, ..., {8}, {10}}

Candidates C = FP 1 ∪ FP 2:
C: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}, ..., {8}, {10}}

Scan the whole database to count support for the patterns in C:
FP: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
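The two-phase idea can be sketched as follows: brute-force local mining per partition (acceptable for these tiny transactions), then one full scan to verify the candidates. Any globally frequent itemset must be locally frequent in at least one partition, so the union of local results is a complete candidate set. Function names are my own:

```python
from itertools import combinations

def local_frequent(part, min_frac):
    """Brute force: count every non-empty subset of each transaction."""
    counts = {}
    for t in part:
        for r in range(1, len(t) + 1):
            for sub in combinations(sorted(t), r):
                counts[sub] = counts.get(sub, 0) + 1
    return {s for s, c in counts.items() if c >= min_frac * len(part)}

def partition_mine(db, cuts, min_frac):
    # Phase 1: mine each partition with the same relative support threshold.
    candidates = set()
    for lo, hi in cuts:
        candidates |= local_frequent(db[lo:hi], min_frac)
    # Phase 2: one scan of the whole database verifies each candidate.
    freq = {}
    for cand in candidates:
        c = sum(1 for t in db if set(cand) <= t)
        if c >= min_frac * len(db):
            freq[cand] = c
    return freq

db = [{3, 4, 5, 6, 7, 9}, {1, 3, 4, 5, 13}, {1, 2, 4, 5, 7, 11},
      {1, 3, 4, 8}, {1, 3, 4, 10}]
fp = partition_mine(db, [(0, 3), (3, 5)], 0.4)
print(len(fp))  # 17 frequent itemsets, matching the slide
```

Local false positives such as {8} and {10} (frequent only inside P2) are discarded in phase 2, so the whole database is read just twice.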
Mining Very Large Database
- Projection (Pei 2002)

Tid   Items (sorted in the mining order 7, 5, 3, 1, 4)
1     7 5 3 4
2     5 3 1 4
3     7 5 1 4
4     3 1 4
5     3 1 4

(a) Parallel projection — every transaction is projected into the database of each frequent item it contains:
Projection 7: {5 3 4}, {5 1 4}
Projection 5: {3 4}, {3 1 4}, {1 4}
Projection 3: {4}, {1 4}, {1 4}, {1 4}
Projection 1: {4}, {4}, {4}, {4}
Projection 4: {}

(b) Partition projection — every transaction is stored in only one projected database (that of its first item) and is passed on to later projections as mining proceeds:
Projection 7: {5 3 4}, {5 1 4}
Projection 5: {3 1 4}
Projection 3: {1 4}, {1 4}
Projection 1, Projection 4: filled incrementally while the earlier projections are mined.
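Both schemes rely on the same primitive: for each transaction containing an item, keep only the items that follow it in the mining order. A minimal sketch over the table above (the function name is my own):

```python
def project(db, item, order):
    """Project db onto `item`: for each transaction containing it,
    keep only the items that come later in the mining order."""
    pos = {x: i for i, x in enumerate(order)}
    out = []
    for t in db:
        if item in t:
            suffix = [x for x in t if pos[x] > pos[item]]
            if suffix:
                out.append(suffix)
    return out

order = [7, 5, 3, 1, 4]                  # mining order from the slide
db = [[7, 5, 3, 4], [5, 3, 1, 4], [7, 5, 1, 4], [3, 1, 4], [3, 1, 4]]
print(project(db, 3, order))  # [[4], [1, 4], [1, 4], [1, 4]]
```

Parallel projection applies this to every item at once (duplicating data), while partition projection assigns each transaction to a single projection and forwards its suffix later.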
Presentation of Association Rules (Table Form)
(Figure: discovered rules listed in a table.)

Visualization of Association Rule Using Rule Graph
(Figure: rule graph.)

Visualization of Association Rule Using Plane Graph
(Figure: plane graph.)
Conclusion
- Association rule mining
  - Probably the most significant contribution from the database community in KDD
  - A large number of papers have been published
- Many interesting issues have been explored
- An interesting research direction
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
- David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.
- Agarwal, R. C., Aggarwal, C. C. & Prasad, V. V. V. 2001, 'A Tree Projection Algorithm for Generation of Frequent Item Sets', Journal of Parallel and Distributed Computing (Special Issue on High-Performance Data Mining), vol. 61, no. 3, pp. 350-371.
- Han, J., Pei, J. & Yin, Y. 2000, 'Mining Frequent Patterns without Candidate Generation', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
- Savasere, A., Omiecinski, E. & Navathe, S. 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', in Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, pp. 432-444.
References (2)
- Pei, J. 2002, Pattern-growth Methods for Frequent Pattern Mining, PhD Thesis, Simon Fraser University, Canada.
- Zaki, M. J. 1997, 'Parallel Algorithms for Fast Discovery of Association Rules', Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, pp. 343-373.
- Sucahyo, Y. G. & Gopalan, R. P. 2004, 'CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure', in Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK.