Association Rules
Lecture 4/DMBI/IKI83403T/MTI/UI

Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia

Objectives
- Introduction
- What is Association Mining?
- Mining Association Rules
- Algorithms for Association Rules Mining
- Visualization

University of Indonesia

Introduction
- You sell more if customers can see the product.
- Customers who purchase one type of product are likely to be interested in other particular products.
- Market-basket analysis = studying the composition of the shopping basket of products purchased during a single shopping event.
- Market-basket data = the transactional list of purchases by customer. It is challenging because of:
  - The very large number of records (often millions of transactions/day)
  - Sparseness (each market basket contains only a small portion of the items carried)
  - Heterogeneity (those with different tastes tend to purchase a specific subset of items)

Introduction (2)
- Product presentations can be more intelligently planned for specific times of day, days of the week, or holidays.
- Can also involve sequential relationships.
- Market-basket analysis is an undirected DM operation (along with clustering), seeking patterns that were previously unknown.
- Cross-selling
  - The propensity for the purchaser of a specific item to purchase a different item
  - Can be maximized by locating products that tend to be purchased by the same consumer in places where both products can be seen.

What is Association Mining?
- Association rule mining (ARM):
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
  - Frequent pattern: a pattern that occurs frequently in a database.
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?


Why is Frequent Pattern or Association Mining an Essential Task in DM?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.


What is Association Mining?
- Examples:
  - Rule form: "A → B [support, confidence]"
    - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
    - major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]
- A support of 0.5% for an association rule means that 0.5% of all the transactions show that diapers and beers are purchased together.
- A confidence of 60% means that 60% of the customers who purchased diapers also bought beers.
- Rules that satisfy both the minimum support and minimum confidence thresholds are called strong.


What is Association Mining?
- A set of items is referred to as an itemset.
- An itemset that contains k items is a k-itemset.
  - {beer, diaper} is a 2-itemset.
- If an itemset satisfies minimum support, then it is a frequent itemset.
- The set of frequent k-itemsets is commonly denoted by Lk.
- ARM is a two-step process:
  - Find all frequent itemsets.
  - Generate strong association rules from the frequent itemsets.
- The second step is the easier of the two. The overall performance of mining association rules is determined by the first step.
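The second step, generating strong rules from the frequent itemsets, can be sketched as follows (a minimal sketch; the function and variable names are illustrative, not from the lecture). Because every subset of a frequent itemset is itself frequent, all the needed supports are already available from the first step, so no further database scan is required:

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Step 2 of ARM: emit strong rules (antecedent, consequent, confidence)
    from the frequent itemsets found in step 1.

    supports: dict mapping frozenset itemsets to support counts.
    """
    rules = []
    for itemset, count in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        # Every non-empty proper subset can serve as an antecedent.
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules
```

For example, with supports {A}: 3, {C}: 2, {A, C}: 2 and a 50% confidence threshold, this yields A ⇒ C (confidence 2/3) and C ⇒ A (confidence 1.0).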


Association Mining: A Road Map
- Mining association rules (Agrawal et al., SIGMOD '93)
- Better algorithms
  - Fast algorithms (Agrawal et al., VLDB '94)
  - Hash-based (Park et al., SIGMOD '95)
  - Partitioning (Navathe et al., VLDB '95)
  - Direct itemset counting (Brin et al., SIGMOD '97)
  - Parallel mining (Agrawal et al., TKDE '96)
  - Distributed mining (Cheung et al., PDIS '96)
  - Incremental mining (Cheung et al., ICDE '96)
- Problem extensions
  - Generalized A.R. (Srikant et al.; Han et al., VLDB '95)
  - Quantitative A.R. (Srikant et al., SIGMOD '96)
  - N-dimensional A.R. (Lu et al., DMKD '98)
  - Meta-rule-guided mining


Many Kinds of Association Rules
- Boolean association rule:
  - Concerns associations between the presence or absence of items.
  - Example: buys(x, "diapers") → buys(x, "beers") [0.5%, 60%] (R1)
- Quantitative association rule:
  - Describes associations between quantitative items. Quantitative values for items are partitioned into intervals.
  - Example: age(X, "30-39") ∧ income(X, "42K-48K") → buys(X, "LCD TV") (R2)
  - Age and income have been discretized.
- Single-dimensional association rule: R1
- Multidimensional association rule: R2

Many Kinds of Association Rules
- Single-level association rule
  - Example: age(X, "30-39") → buys(X, "laptop computer")
- Multilevel association rule
  - Example: age(X, "30-39") → buys(X, "computer")
  - Computer is a higher-level abstraction of laptop.
- Various extensions
  - Mining maximal frequent patterns.
  - If p is a maximal frequent pattern, then any superpattern of p is not frequent.
  - Used to substantially reduce the number of frequent itemsets generated in mining.
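As a small illustration (a hypothetical helper, not from the slides), the maximal patterns are exactly the frequent itemsets that are not a proper subset of any other frequent itemset:

```python
def maximal_patterns(frequent):
    """Filter a collection of frequent itemsets down to the maximal ones,
    i.e. those with no frequent proper superset."""
    sets = [frozenset(s) for s in frequent]
    # Keep s only if no other frequent itemset strictly contains it.
    return [s for s in sets if not any(s < other for other in sets)]
```

For the frequent itemsets {A}, {B}, {C}, {A, C} of a later example, only {B} and {A, C} are maximal; every frequent itemset is a subset of some maximal one, which is why reporting maximal patterns alone can substantially shrink the output.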


Mining Single-Dimensional Boolean Association Rules
- Given
  - A database of customer transactions
  - Each transaction is a list of items (purchased by a customer in a visit)
- Find all rules that correlate the presence of one set of items with that of another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent/antecedent of a rule
  - It is possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).


Application Examples
- Market-basket analysis
  - * → Fanta: what the store should do to boost Fanta sales
  - Bodrex → *: what other products the store should stock up on if it has a sale on Bodrex
- Attached mailing in direct marketing


Rule Measures: Support and Confidence
- Find all the rules X & Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

With minimum support 50% and minimum confidence 50%, we have:
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)

(Figure: a Venn diagram of "customer buys beer", "customer buys diaper", and "customer buys both".)

Mining Association Rules: Example
Min. support 50%, min. confidence 50%

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
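The computation on this slide can be reproduced directly (a minimal sketch; `support` and `confidence` are illustrative helper names):

```python
# The four transactions from the slide.
transactions = [
    {"A", "B", "C"},  # TID 2000
    {"A", "C"},       # TID 1000
    {"A", "D"},       # TID 4000
    {"B", "E", "F"},  # TID 5000
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of X => Y: support(X ∪ Y) / support(X)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

s = support({"A", "C"})       # 0.5, i.e. 50%
c = confidence({"A"}, {"C"})  # 0.666..., i.e. 66.6%
```

Both thresholds are met, so A ⇒ C is a strong rule in this database.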

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.


The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
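The pseudocode translates almost line for line into Python. A minimal sketch, assuming transactions are given as sets of items and `min_support` is an absolute count (all names are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its
    support count."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 1
    while current:
        # Candidate generation: join Lk with itself, then prune every
        # candidate with an infrequent k-subset (the Apriori principle).
        prev = set(current)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in prev for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # One scan of the database counts all candidates of size k+1.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```

On the Example 1 database, `apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2)` returns L1 ∪ L2 ∪ L3 = {1}, {2}, {3}, {5}, {1,3}, {2,3}, {2,5}, {3,5}, {2,3,5} with their counts.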

The Apriori Algorithm – Example 1 (minimum support count = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
Prune → L1: {1}:2, {2}:3, {3}:3, {5}:3

Generate C2 from L1: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
Prune → L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

Generate C3 from L2: {2 3 5}
Scan D → {2 3 5}:2
L3: {2 3 5}:2

The Apriori Algorithm – Example 2

Sample database:
Tid | Items
1   | 3 4 5 6 7 9
2   | 1 3 4 5 13
3   | 1 2 4 5 7 11
4   | 1 3 4 8
5   | 1 3 4 10

Frequent patterns with MinSupport = 40%:
Itemset  | Tids      | Support
{4}      | 1,2,3,4,5 | 100% (5)
{1}      | 2,3,4,5   | 80% (4)
{3}      | 1,2,4,5   | 80% (4)
{1,4}    | 2,3,4,5   | 80% (4)
{3,4}    | 1,2,4,5   | 80% (4)
{5}      | 1,2,3     | 60% (3)
{1,3}    | 2,4,5     | 60% (3)
{4,5}    | 1,2,3     | 60% (3)
{1,3,4}  | 2,4,5     | 60% (3)
{7}      | 1,3       | 40% (2)
{1,5}    | 2,3       | 40% (2)
{3,5}    | 1,2       | 40% (2)
{4,7}    | 1,3       | 40% (2)
{5,7}    | 1,3       | 40% (2)
{1,4,5}  | 2,3       | 40% (2)
{3,4,5}  | 1,2       | 40% (2)
{4,5,7}  | 1,3       | 40% (2)

Association rules with MinSupport = 40% and MinConf = 70%:
Rule      | Support | Confidence
7 ⇒ 5     | 40% (2) | 100%
7 ⇒ 4     | 40% (2) | 100%
5 ⇒ 4     | 60% (3) | 100%
3 ⇒ 1     | 60% (3) | 75%
1 ⇒ 3     | 60% (3) | 75%
3 ⇒ 4     | 80% (4) | 100%
4 ⇒ 3     | 80% (4) | 80%
1 ⇒ 4     | 80% (4) | 100%
4 ⇒ 1     | 80% (4) | 80%
5 7 ⇒ 4   | 40% (2) | 100%
4 7 ⇒ 5   | 40% (2) | 100%
3 5 ⇒ 4   | 40% (2) | 100%
1 5 ⇒ 4   | 40% (2) | 100%
1 3 ⇒ 4   | 60% (3) | 100%
3 4 ⇒ 1   | 60% (3) | 75%
1 4 ⇒ 3   | 60% (3) | 75%


 

The Apriori Algorithm – Example 2 (cont.)

Scan D for the count of each candidate:
C1: 1:4, 2:1, 3:4, 4:5, 5:3, 6:1, 7:2, 8:1, 9:1, 10:1, 11:1, 13:1
Prune infrequent patterns → L1: 1:4, 3:4, 4:5, 5:3, 7:2

Generate C2 candidates from L1: {1 3}, {1 4}, {1 5}, {1 7}, {3 4}, {3 5}, {3 7}, {4 5}, {4 7}, {5 7}
Scan D for the count of each candidate:
C2: {1 3}:3, {1 4}:4, {1 5}:2, {1 7}:1, {3 4}:4, {3 5}:2, {3 7}:1, {4 5}:3, {4 7}:2, {5 7}:2
Prune infrequent patterns → L2: {1 3}:3, {1 4}:4, {1 5}:2, {3 4}:4, {3 5}:2, {4 5}:3, {4 7}:2, {5 7}:2

Generate C3 candidates from L2: {1 3 4}, {1 3 5}, {1 4 5}, {1 4 7}, {1 5 7}, {3 4 5}, {3 4 7}, {3 5 7}, {4 5 7}
Scan D for the count of each candidate:
C3: {1 3 4}:3, {1 3 5}:1, {1 4 5}:2, {1 4 7}:1, {1 5 7}:1, {3 4 5}:2, {3 4 7}:1, {3 5 7}:1, {4 5 7}:2
Prune infrequent patterns → L3: {1 3 4}:3, {1 4 5}:2, {3 4 5}:2, {4 5 7}:2


Major Drawbacks of Apriori
- Apriori has to read the database many times to test the support of candidate patterns. To find a frequent pattern X of length 50, it has to traverse the database 50 times.
- On dense datasets with long patterns, as the length of the pattern increases, the performance of Apriori drops rapidly due to the explosion of the candidate patterns.


TreeProjection Algorithm (Agarwal, Aggarwal & Prasad 2000)

(Figure: a lexicographic tree of frequent itemsets grown level by level from the null root — level 1: 7, 5, 1, 3, 4; level 2: 75, 74, 51, 53, 13, 14, 34; level 3: 754, 514, 534, 134 — with transactions projected onto the tree nodes, and a triangular matrix over the items 7, 5, 1, 3, 4 used for counting frequent patterns of length two.)


Eclat Algorithm (Zaki et al. 1997)

(Figure: the prefix-based search space over the frequent items 1, 3, 4, 5, 7, explored depth-first from the empty set {} up to 1 3 4 5 7. Each item carries a tid-list, the list of transaction ids in which it occurs; the support of any itemset is obtained by tid-list intersection, e.g. intersecting the tid-lists of 3 and 4 gives the transactions containing {3, 4}.)
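The tid-list intersection at the heart of Eclat can be sketched recursively (a minimal sketch; `eclat` and `tidlists` are illustrative names, and `min_support` is an absolute count):

```python
def eclat(tidlists, min_support, prefix=frozenset()):
    """Depth-first frequent-itemset mining over vertical (tid-list) data.

    tidlists: dict item -> set of ids of transactions containing the item.
    Returns {itemset: support count}.
    """
    frequent = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < min_support:
            continue
        itemset = prefix | {item}
        frequent[itemset] = len(tids)
        # The tid-list of (prefix + item + j) is the intersection of the
        # tid-lists of (prefix + item) and (prefix + j).
        suffix = {j: tidlists[j] & tids for j in items[i + 1:]}
        frequent.update(eclat(suffix, min_support, itemset))
    return frequent
```

With the sample database's tid-lists and min_support = 2, the recursion finds, e.g., {1, 3, 4} with support 3 and {4, 5, 7} with support 2, matching the Apriori trace.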

FP-Growth

(Han, Pei & Yin 2000)

(Figure: the FP-Tree for the sample database. A header table lists the frequent items 4, 3, 1, 5, 7, each with node-links into the tree; transactions are inserted with items in descending frequency order, so common prefixes share a path — e.g. the root child 4:5 compresses all five occurrences of item 4 into a single node, followed by 3:4 for the four transactions continuing with item 3.)

FP-Tree for the sample database
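Building such a tree can be sketched as below (a minimal sketch with illustrative names; ties in item frequency are broken by item id here, so the child order may differ from the figure):

```python
class FPNode:
    """One FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(transactions, min_support):
    """Insert each transaction with its frequent items sorted by descending
    frequency, so transactions sharing a prefix share a path."""
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    keep = {i for i, c in counts.items() if c >= min_support}

    def ordered(t):
        # Drop infrequent items; sort the rest by descending frequency.
        return sorted((i for i in t if i in keep),
                      key=lambda i: (-counts[i], i))

    root = FPNode(None)
    for t in transactions:
        node = root
        for item in ordered(t):
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root
```

On the sample database with min_support = 2, item 4 occurs in all five transactions, so the root has a single child for item 4 with count 5, and its subtrees together account for all five transactions.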


CT-PRO (Sucahyo & Gopalan 2004)

(a) Frequent items:
Tid | Items
1   | 3 4 5 7
2   | 1 3 4 5
3   | 1 4 5 7
4   | 1 3 4
5   | 1 3 4

ItemTable (index, item, count): 1 = item 4 (5), 2 = item 3 (4), 3 = item 1 (4), 4 = item 5 (3), 5 = item 7 (2)

(b) Mapped via the ItemTable:
Tid | Items
1   | 1 2 4 5
2   | 1 2 3 4
3   | 1 3 4 5
4   | 1 2 3
5   | 1 2 3

(c) (Figure: the global CFP-Tree, a compressed FP-Tree built from the mapped transactions and stored level by level, each node recording an index, a count, and its prefix path.)

Mining Very Large Databases
Partition Algorithm (Savasere, Omiecinski & Navathe 1995)

Tid | Items
1   | 3 4 5 6 7 9
2   | 1 3 4 5 13
3   | 1 2 4 5 7 11
4   | 1 3 4 8
5   | 1 3 4 10

- Divide the database into partitions P1 and P2 that fit in memory.
- Scan 1: scan P1 and P2 to find the local frequent patterns
  - FP 1: {{1}, {1,4}, {1,5}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
  - FP 2: {{1}, {1,3}, {1,4}, {1,3,4}, {3}, {3,4}, {4}, ……. {8}, {10}}
- Their union forms the candidate set C: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7} ……. {8}, {10}}
- Scan 2: scan the whole database to count the support of every pattern in C, yielding the globally frequent patterns
  - FP: {{1}, {1,3}, {1,4}, {1,5}, {1,3,4}, {1,4,5}, {3}, {3,4}, {3,4,5}, {3,5}, {4}, {4,5}, {4,7}, {4,5,7}, {5}, {5,7}, {7}}
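The two scans can be sketched as follows (a minimal sketch; `local_frequent` is a brute-force stand-in for whatever in-memory miner is run per partition, and all names are illustrative):

```python
from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force stand-in for the in-memory miner run on one partition."""
    n = len(part)
    items = sorted({i for t in part for i in t})
    freq = set()
    for r in range(1, len(items) + 1):
        found = False
        for cand in map(frozenset, combinations(items, r)):
            if sum(cand <= t for t in part) >= min_frac * n:
                freq.add(cand)
                found = True
        if not found:  # no frequent r-itemset => none longer (Apriori)
            break
    return freq

def partition_mine(transactions, min_frac, n_parts=2):
    """Two scans: (1) mine each partition locally; a globally frequent
    itemset must be locally frequent in some partition, so the union of
    local results is a complete candidate set. (2) Count candidates
    against the whole database."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    result = {}
    for c in candidates:
        count = sum(c <= t for t in transactions)
        if count >= min_frac * len(transactions):
            result[c] = count
    return result
```

Locally frequent but globally infrequent candidates (like {8} or {10} above) are discarded by the second scan; nothing globally frequent can be missed.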

Mining Very Large Databases (2)
- Projection (Pei 2002)

Tid | Items
1   | 7 5 3 4
2   | 5 3 1 4
3   | 7 5 1 4
4   | 3 1 4
5   | 3 1 4

(a) Parallel projection: every transaction is written to the projected database of each of its items.
Projection 7: {5 3 4; 5 1 4}
Projection 5: {3 4; 3 1 4; 1 4}
Projection 3: {4; 1 4; 1 4; 1 4}
Projection 1: {4; 4; 4; 4}
Projection 4: {}

(b) Partition projection: each transaction is written to exactly one projected database and is passed on to later projections as mining proceeds.
Projection 7: {5 3 4; 5 1 4}
Projection 5: {3 1 4}
Projection 3: {1 4; 1 4}
Projection 1: {}
Projection 4: {}


Presentation of Association Rules (Table Form)

(Figure: association rules presented in table form.)



Visualization of Association Rules Using Rule Graph

(Figure: association rules visualized as a rule graph.)


Visualization of Association Rules Using Plane Graph

(Figure: association rules visualized as a plane graph.)


Conclusion
- Association rule mining
  - Probably the most significant contribution from the database community to KDD
  - A large number of papers have been published
  - Many interesting issues have been explored
- An interesting research direction
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.


References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.
- David Olson and Yong Shi, Introduction to Business Data Mining, McGraw-Hill, 2007.
- Agarwal, R. C., Aggarwal, C. C. & Prasad, V. V. V. 2001, 'A Tree Projection Algorithm for Generation of Frequent Item Sets', Journal of Parallel and Distributed Computing (Special Issue on High-Performance Data Mining), vol. 61, no. 3, pp. 350-371.
- Han, J., Pei, J. & Yin, Y. 2000, 'Mining Frequent Patterns without Candidate Generation', in Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pp. 1-12.
- Savasere, A., Omiecinski, E. & Navathe, S. 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', in Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, pp. 432-444.

References (2)
- Pei, J. 2002, Pattern-growth Methods for Frequent Pattern Mining, PhD Thesis, Simon Fraser University, Canada.
- Zaki, M. J. 1997, 'Parallel Algorithms for Fast Discovery of Association Rules', Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, pp. 343-373.
- Sucahyo, Y. G. & Gopalan, R. P. 2004, 'CT-PRO: A Bottom-Up Non Recursive Frequent Itemset Mining Algorithm Using Compressed FP-Tree Data Structure', in Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI), Brighton, UK.
