Association Rules Outline
Goal: Provide an overview of basic Association Rule mining techniques
● Association Rules Problem Overview
  - Large itemsets
● Association Rules Algorithms
  - Apriori
  - Sampling
  - Partitioning
  - Parallel Algorithms
● Comparing Techniques
● Incremental Algorithms
● Advanced AR Techniques
Market Basket Analysis
● Example: Market Basket Data - items frequently purchased together:
    Bread ⇒ PeanutButter
● Uses:
  - Placement
  - Advertising
  - Sales
  - Coupons
● Objective: increase sales and reduce costs

Association Rule Definitions
● Set of items: I = {I1, I2, …, Im}
● Transactions: D = {t1, t2, …, tn}, tj ⊆ I
● Itemset: {Ii1, Ii2, …, Iik} ⊆ I
● Support of an itemset: percentage of transactions which contain that itemset.
● Large (Frequent) itemset: itemset whose number of occurrences is above a threshold.
Association Rules Example
● I = {Beer, Bread, Jelly, Milk, PeanutButter}
● Support of {Bread, PeanutButter} is 60%

Association Rule Definitions
● Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
● Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
  - P(X, Y) = σ(X ∪ Y) / |T|
● Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the number that contain X
  - P(Y|X) = σ(X ∪ Y) / σ(X)

Association Rules Example (cont'd)
● 500,000 transactions
  - 20,000 transactions contain diapers
  - 30,000 transactions contain milk
  - 10,000 transactions contain both diapers & milk

  Rule               s       c
  Diapers ⇒ Milk     0.02    0.50
  Milk ⇒ Diapers     0.02    0.33

● Continuing the example:
  - 10,000 transactions contain wipes
  - 8,000 transactions contain wipes & diapers
  - 220 transactions contain wipes & milk
  - 200 transactions contain wipes & diapers & milk

  Rule                       c, %     s, %
  Wipes ⇒ Diapers            80       1.6
  Diapers ⇒ Wipes            40       1.6
  Wipes ⇒ Milk               2.2      0.04
  Milk ⇒ Wipes               0.73     0.04
  Wipes & Milk ⇒ Diapers     90.91    0.04
  Diapers ⇒ Milk             50       2
  Milk ⇒ Diapers             33.33    2
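As a quick illustration (ours, not part of the original slides), the support and confidence values in the first table above can be reproduced in Python directly from the stated counts; the variable names are ours:

# Support and confidence from raw transaction counts (diapers/milk example).
total = 500_000          # total transactions
n_diapers = 20_000       # transactions containing diapers
n_milk = 30_000          # transactions containing milk
n_both = 10_000          # transactions containing diapers AND milk

support = n_both / total              # s(Diapers => Milk) = s(Milk => Diapers)
conf_d_to_m = n_both / n_diapers      # c(Diapers => Milk)
conf_m_to_d = n_both / n_milk         # c(Milk => Diapers)
print(support, conf_d_to_m, conf_m_to_d)   # 0.02 0.5 0.333...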
Association Rule Problem
● Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.
● Link Analysis
● NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y.
Association Rule Techniques
1. Find Large Itemsets.
2. Generate rules from frequent itemsets.
Association Rule Mining
● Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:
  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Cocoa}
  {Milk, Bread} → {Eggs, Coke}
  {Cocoa, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
● Itemset
  - A collection of one or more items
  - Example: {Milk, Bread, Diaper}
  - k-itemset: an itemset that contains k items
● Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
● Support (s)
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
● Frequent Itemset
  - An itemset whose support is greater than or equal to a minsup threshold

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke
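A minimal sketch (ours, not part of the slides) of the support count σ and support s over the market-basket table above; the helper names support_count and support are assumptions for illustration:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    # sigma(itemset): number of transactions containing every item in itemset
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    # s(itemset): fraction of transactions containing the itemset
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))   # 2
print(support({"Milk", "Bread", "Diaper"}))         # 0.4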
Definition: Association Rule
● Association Rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Cocoa}
● Rule Evaluation Metrics
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X

Example (using the market-basket transactions TID 1 to 5 above): {Milk, Diaper} ⇒ Cocoa
  s = σ(Milk, Diaper, Cocoa) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Cocoa) / σ(Milk, Diaper) = 2/3 = 0.67

Association Rule Mining Task
● Given a set of transactions T, the goal of association rule mining is to find all rules having
- support ≥ minsup threshold
- confidence ≥ minconf threshold
● Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
Mining Association Rules

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Cocoa}     (s = 0.4, c = 0.67)
  {Milk, Cocoa} → {Diaper}     (s = 0.4, c = 1.0)
  {Diaper, Cocoa} → {Milk}     (s = 0.4, c = 0.67)
  {Cocoa} → {Milk, Diaper}     (s = 0.4, c = 0.67)
  {Diaper} → {Milk, Cocoa}     (s = 0.4, c = 0.5)
  {Milk} → {Diaper, Cocoa}     (s = 0.4, c = 0.5)

Observations:
- All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements (see the sketch below)
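The six rules listed above can be checked with a short sketch (ours) that enumerates every binary partition of {Milk, Diaper, Cocoa} and computes s and c over the transactions; helper names are ours:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = {"Milk", "Diaper", "Cocoa"}
n = len(transactions)
for k in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), k):
        X = set(lhs)
        Y = itemset - X
        s = sigma(itemset) / n          # identical for every rule from this itemset
        c = sigma(itemset) / sigma(X)   # differs with the antecedent X
        print(f"{X} -> {Y}: s={s:.1f}, c={c:.2f}")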
Mining Association Rules
● Two-step approach:
  1. Frequent Itemset Generation
     - Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
● Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Itemset lattice over items A, B, C, D, E, from the null set up to ABCDE]
● Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
● Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database: match each of the N transactions (of width w) against every one of the M candidates
  - Complexity ~ O(NMw) => expensive, since M = 2^d !!!

  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Cocoa, Eggs
  3     Milk, Diaper, Cocoa, Coke
  4     Bread, Milk, Diaper, Cocoa
  5     Bread, Milk, Diaper, Coke
Computational Complexity
● Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules:
      R = Σ_{k=1..d-1} [ C(d, k) × Σ_{j=1..d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1
  - If d = 6, R = 602 rules (verified in the sketch below)
  - Complete search: M = 2^d candidate itemsets
● Ways to reduce the cost:
  - Reduce the size of N as the size of the itemset increases
  - Use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction
  - Exploit the fact that if an itemset is frequent, then all of its subsets must also be frequent; the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
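A quick check (ours) that the double-sum rule-count formula above equals 3^d - 2^(d+1) + 1 and gives 602 rules for d = 6:

from math import comb

def num_rules(d):
    # sum over antecedent size k and consequent size j, as in the formula above
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d), 3 ** d - 2 ** (d + 1) + 1)   # 602 602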
Frequent Itemset Generation Strategies
● Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
● Reduce the number of transactions (N)
  - Used by DHP and vertical-based mining algorithms
● Reduce the number of comparisons (NM)
Apriori
● Large Itemset Property: Any subset of a large itemset is large.
● Contrapositive: If an itemset is not large, none of its supersets are large.

Reducing Number of Candidates
● Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent.
● The Apriori principle holds due to the following property of the support measure:
    ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
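A tiny check (ours) of this property on the market-basket data used earlier: support never increases as the itemset grows.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    # support: fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(s({"Milk"}), s({"Milk", "Diaper"}), s({"Milk", "Diaper", "Cocoa"}))  # 0.8 0.6 0.4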
Large Itemset Property

Apriori Example (cont'd): s = 30%, α = 50%

Illustrating Apriori Principle
[Itemset lattice over items A, B, C, D, E: itemset AB is found to be infrequent, so all of its supersets are pruned]
Illustrating Apriori Principle (using the market-basket transactions above, Minimum Support = 3)

Items (1-itemsets):
  Item      Count
  Bread     4
  Coke      2
  Milk      4
  Eggs      1
  Diaper    4
  Cocoa     3

Pairs (2-itemsets), generated only from the frequent items
(no need to generate candidates involving Coke or Eggs):
  Itemset            Count
  {Bread,Milk}       3
  {Bread,Diaper}     3
  {Bread,Cocoa}      2
  {Milk,Diaper}      3
  {Milk,Cocoa}       2
  {Cocoa,Diaper}     3

Triplets (3-itemsets):
  Itemset                 Count
  {Bread,Milk,Diaper}     2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
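A short sketch (ours) reproducing the candidate counts quoted above, 41 without pruning versus 6 + 6 + 1 = 13 with support-based pruning (minimum support count 3):

from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))            # the 6 distinct items

def sigma(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# If every 1-, 2- and 3-itemset is considered:
print(comb(6, 1) + comb(6, 2) + comb(6, 3))           # 41

# With support-based pruning (minimum support count 3):
frequent1 = [i for i in items if sigma({i}) >= 3]     # Bread, Cocoa, Diaper, Milk
pairs = list(combinations(frequent1, 2))              # 6 candidate pairs
frequent2 = [p for p in pairs if sigma(p) >= 3]
triples = [t for t in combinations(frequent1, 3)
           if all(p in frequent2 for p in combinations(t, 2))]   # only {Bread, Diaper, Milk}
print(len(items) + len(pairs) + len(triples))         # 6 + 6 + 1 = 13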
The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB:
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan -> C1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {D}       1
  {E}       3

L1 ({D} is pruned, sup < 2):
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {E}       3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan -> C2 with counts:
  Itemset   sup
  {A,B}     1
  {A,C}     2
  {A,E}     1
  {B,C}     2
  {B,E}     3
  {C,E}     2

L2:
  Itemset   sup
  {A,C}     2
  {B,C}     2
  {B,E}     3
  {C,E}     2

C3 (generated from L2): {B,C,E}

3rd scan -> L3:
  Itemset    sup
  {B,C,E}    2

Apriori Algorithm
  C1 = Itemsets of size one in I;
  Determine all large itemsets of size 1, L1;
  i = 1;
  Repeat
      i = i + 1;
      Ci = Apriori-Gen(L(i-1));
      Count Ci to determine Li;
  until no more large itemsets found;
Apriori Algorithm (Method):
- Let k = 1
- Generate frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
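A compact Apriori sketch in Python (ours; the function names apriori_gen and apriori are assumptions, and this is not the original pseudocode): candidate k-itemsets are joined from the frequent (k-1)-itemsets, pruned if any (k-1)-subset is infrequent, and then counted against the database, exactly as in the method above.

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Join: form k-itemsets from items appearing in frequent (k-1)-itemsets.
    # Prune: drop any candidate that has an infrequent (k-1)-subset.
    items = sorted(set().union(*prev_frequent))
    return [frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))]

def apriori(transactions, minsup_count):
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup_count}
    all_frequent, k = set(frequent), 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        frequent = {c for c in candidates if count(c) >= minsup_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

# The TDB example above, with Sup_min = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset in sorted(apriori(tdb, 2), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
# -> ['A'], ['B'], ['C'], ['E'], ['A','C'], ['B','C'], ['B','E'], ['C','E'], ['B','C','E']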
Problems with Support and Confidence
● A rule may have high support and high confidence simply because it is an obvious rule: someone who buys potato chips is highly likely to buy a soft drink -> not surprising.
● Confidence totally ignores P(B): P(B|A) = σ(A ∪ B) / σ(A)

Interestingness Measure: Correlations (Lift)
● Measure of dependent/correlated events: lift
    lift(A ⇒ B) = P(A ∪ B) / (P(A) P(B))
  - lift > 1: positively correlated
  - lift = 1: independent
  - lift < 1: negatively correlated
● Lift(Wipes, Milk ⇒ Diapers) is 22.73
  - People who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than customers overall.
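A quick check (ours) of the quoted lift value, using the counts from the earlier wipes/milk/diapers example:

N = 500_000
p_wipes_milk = 220 / N        # P(Wipes, Milk)
p_diapers = 20_000 / N        # P(Diapers)
p_all_three = 200 / N         # P(Wipes, Milk, Diapers)

lift = p_all_three / (p_wipes_milk * p_diapers)
print(round(lift, 2))         # 22.73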
● play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall percentage of students eating cereal is 75%, which is already greater than 66.7%.
● play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

                 Basketball   Not basketball   Sum (row)
  Cereal             2000           1750           3750
  Not cereal         1000            250           1250
  Sum (col.)         3000           2000           5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Conviction
    conviction(A ⇒ B) = P(A) P(¬B) / P(A, ¬B)
● Conviction = 1: A and B are independent (not related).
● Conviction = ∞: the rule always holds (B appears in every transaction that contains A).
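A sketch (ours) computing lift and conviction on the basketball/cereal contingency table above:

N = 5000
p_b = 3000 / N           # P(basketball)
p_c = 3750 / N           # P(cereal)
p_bc = 2000 / N          # P(basketball, cereal)
p_b_notc = 1000 / N      # P(basketball, not cereal)

lift_b_c = p_bc / (p_b * p_c)
lift_b_notc = p_b_notc / (p_b * (1250 / N))
print(round(lift_b_c, 2), round(lift_b_notc, 2))    # 0.89 1.33

# conviction(B => C) = P(B) * P(not C) / P(B, not C)
conviction_b_c = p_b * (1 - p_c) / p_b_notc
print(round(conviction_b_c, 2))                     # 0.75: below 1, negatively associated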
Chi-square (χ²)
    χ² = Σ (observed - expected)² / expected

For an r × c contingency table:
    χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij - e_ij)² / e_ij
    e_ij = count(A = a_i) × count(B = b_j) / N
    DOF = (r - 1) × (c - 1)
where attribute A takes distinct values a_1, a_2, a_3, …, attribute B takes distinct values b_1, b_2, …, and N is the total number of observations.
Example: Coke consumption vs. osteoporosis (76 people)

Observed:
                      Coke   Non-Coke   TOTAL
  Osteoporosis          22         10      32
  No osteoporosis       16         28      44
  TOTAL                 38         38      76

Expected:
                      Coke   Non-Coke   TOTAL
  Osteoporosis          16         16      32
  No osteoporosis       22         22      44
  TOTAL                 38         38      76

● DOF (df) = (2 - 1)(2 - 1) = 1
● Significance level α = 0.01
● χ² (tabulated) = 6.63
● χ² (calculated) = (22 - 16)²/16 + (16 - 22)²/22 + (10 - 16)²/16 + (28 - 22)²/22 = 7.77
● Since 7.77 > 6.63, the null hypothesis (no correlation) is rejected: the two attributes are correlated.
Chi-square test procedure:
● Null hypothesis: A & B are not correlated
● Calculate the expected values
● Calculate the chi-square value
● Compare the calculated and tabulated values
● Accept or reject the null hypothesis accordingly
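A sketch (ours) reproducing the chi-square calculation for the Coke/osteoporosis example above:

observed = [[22, 10],   # osteoporosis:    Coke, Non-Coke
            [16, 28]]   # no osteoporosis: Coke, Non-Coke

row_totals = [sum(r) for r in observed]            # 32, 44
col_totals = [sum(c) for c in zip(*observed)]      # 38, 38
N = sum(row_totals)                                # 76

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / N      # expected count for cell (i, j)
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))   # 7.77 > 6.63 (tabulated, df = 1, alpha = 0.01) -> reject H0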
Chi-square distribution table
Which Measures Should Be Used?