Association Rules

Outline

Goal: Provide an overview of basic association rule mining techniques.

• Association Rules Problem Overview
  • Large itemsets
• Association Rules Algorithms
  • Apriori
  • Sampling
  • Partitioning
  • Parallel Algorithms
• Comparing Techniques
• Incremental Algorithms
• Advanced AR Techniques

Market Basket Analysis

Example: Market basket data records items frequently purchased together, e.g. Bread and PeanutButter.

Uses:

• Placement
• Advertising
• Sales
• Coupons

Objective: increase sales and reduce costs.

Association Rule Definitions

• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, where each tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of transactions which contain that itemset.
• Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.

Association Rules Example

I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%.

Association Rule Definitions

• Association Rule (AR): an implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.
• Support of AR X ⇒ Y (s): percentage of transactions that contain X ∪ Y.

  s = P(X, Y) = σ(X ∪ Y) / |T|

• Confidence of AR X ⇒ Y (α): ratio of the number of transactions that contain X ∪ Y to the number that contain X.

  α = P(Y|X) = σ(X ∪ Y) / σ(X)
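To make these definitions concrete, here is a minimal Python sketch of the support count σ, support, and confidence. The five transactions are assumed for illustration; they are chosen to be consistent with the example above (they give s({Bread, PeanutButter}) = 60%).

```python
# Minimal sketch: support and confidence computed from transactions.
# The transactions below are assumed data, consistent with the slide's
# claim that s({Bread, PeanutButter}) = 60%.

transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def sigma(itemset, db):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

def support(X, Y, db):
    """s(X => Y) = sigma(X u Y) / |T|"""
    return sigma(X | Y, db) / len(db)

def confidence(X, Y, db):
    """alpha(X => Y) = sigma(X u Y) / sigma(X)"""
    return sigma(X | Y, db) / sigma(X, db)

X, Y = {"Bread"}, {"PeanutButter"}
print(support(X, Y, transactions))     # 0.6  (60%)
print(confidence(X, Y, transactions))  # 0.75
```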

Association Rules Example (cont'd)

Example: 500,000 transactions; 20,000 transactions contain diapers; 30,000 transactions contain milk; 10,000 transactions contain both diapers & milk.

Rule              s      c
Milk ⇒ Diapers    0.02   0.33
Diapers ⇒ Milk    0.02   0.50

Example: 10,000 transactions contain wipes; 8,000 transactions contain wipes & diapers; 220 transactions contain wipes & milk; 200 transactions contain wipes & diapers & milk.

Rule                      s, %   c, %
Wipes ⇒ Diapers           1.6    80
Diapers ⇒ Wipes           1.6    40
Milk ⇒ Wipes              0.04   0.73
Wipes & Milk ⇒ Diapers    0.04   90.91
Milk ⇒ Diapers            2      33.33
Diapers ⇒ Milk            2      50
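A quick sanity check of the second table, sketched in Python using the counts given in the example:

```python
# Sanity check of the wipes table, using the counts from the example.
N = 500_000
diapers = 20_000
wipes_diapers, wipes_milk, wipes_diapers_milk = 8_000, 220, 200

# Wipes & Milk => Diapers
print(wipes_diapers_milk / wipes_milk)   # 0.9091 -> c = 90.91%
print(wipes_diapers_milk / N)            # 0.0004 -> s = 0.04%
# Diapers => Wipes
print(wipes_diapers / diapers)           # 0.4    -> c = 40%
print(wipes_diapers / N)                 # 0.016  -> s = 1.6%
```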

Association Rule Problem

Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.

Association Rule Techniques

1. Find large itemsets.
2. Generate rules from frequent itemsets.

Association Rule Mining

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Cocoa, Eggs
3     Milk, Diaper, Cocoa, Coke
4     Bread, Milk, Diaper, Cocoa
5     Bread, Milk, Diaper, Coke

Example of association rules:

{Diaper} → {Cocoa}
{Milk, Bread} → {Eggs, Coke}
{Cocoa, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  • An itemset whose support is greater than or equal to a minsup threshold

(Counts refer to the market-basket transactions above.)

Definition: Association Rule

• Association Rule
  • An implication expression of the form X → Y, where X and Y are itemsets
  • Example: {Milk, Diaper} → {Cocoa}
• Rule Evaluation Metrics
  • Support (s): fraction of transactions that contain both X and Y
  • Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} ⇒ {Cocoa}

  s = σ(Milk, Diaper, Cocoa) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Cocoa) / σ(Milk, Diaper) = 2/3 = 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having:

• support ≥ minsup threshold
• confidence ≥ minconf threshold

Brute-force approach:

• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds

Mining Association Rules

Example rules from the market-basket transactions above:

{Milk, Diaper} → {Cocoa}    (s=0.4, c=0.67)
{Milk, Cocoa} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Cocoa} → {Milk}    (s=0.4, c=0.67)
{Cocoa} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Cocoa}    (s=0.4, c=0.5)
{Milk} → {Diaper, Cocoa}    (s=0.4, c=0.5)

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Cocoa}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:

1. Frequent Itemset Generation
   • Generate all itemsets whose support ≥ minsup
2. Rule Generation
   • Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (a sketch of this step follows below)

Frequent itemset generation is still computationally expensive.
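A minimal sketch of the rule-generation step: enumerate every binary partition X → Y of a frequent itemset and keep only the high-confidence rules. The transactions are the five market-basket transactions above; the function names are illustrative, not from the slides.

```python
from itertools import combinations

# The five market-basket transactions from the slides above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count of an itemset."""
    return sum(1 for t in db if itemset <= t)

def rules_from_itemset(itemset, db, minconf):
    """Yield every binary partition X -> Y of itemset with confidence >= minconf."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):      # every non-trivial split
        for x in combinations(itemset, r):
            X = frozenset(x)
            Y = itemset - X
            c = sigma(itemset, db) / sigma(X, db)
            if c >= minconf:
                yield X, Y, c

for X, Y, c in rules_from_itemset({"Milk", "Diaper", "Cocoa"}, transactions, 0.6):
    print(set(X), "->", set(Y), f"c={c:.2f}")
# Keeps the four rules with c = 0.67 or 1.0; drops the two with c = 0.5.
```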

Frequent Itemset Generation

(Figure: the itemset lattice over items A-E, from the null itemset up to ABCDE.)

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach (a sketch follows below):

• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database, matching each of the N transactions (of average width w) against each of the M candidates
• Complexity ~ O(NMw) ⇒ expensive, since M = 2^d
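A sketch of this brute-force enumeration on the small market-basket data above; even with d = 6 items it already scans the database for 63 candidates, which is exactly the O(NMw) blow-up described.

```python
from itertools import combinations

# Brute force: enumerate all 2^d - 1 non-empty candidate itemsets and scan
# the whole database for each one.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cocoa", "Eggs"},
    {"Milk", "Diaper", "Cocoa", "Coke"},
    {"Bread", "Milk", "Diaper", "Cocoa"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))   # d = 6 items -> M = 63 candidates
minsup_count = 3

frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):      # every candidate in the lattice
        cand = frozenset(cand)
        count = sum(1 for t in transactions if cand <= t)  # full DB scan
        if count >= minsup_count:
            frequent[cand] = count

for itemset, count in sorted(frequent.items(), key=lambda kv: len(kv[0])):
    print(set(itemset), count)
```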

Computational Complexity

Given d unique items:

• Total number of itemsets = 2^d
• Total number of possible association rules:

  R = Σ_{k=1..d−1} C(d, k) × Σ_{j=1..d−k} C(d−k, j) = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.

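A quick check of the rule-count formula for d = 6 (a sketch):

```python
from math import comb

# Checking the rule-count formula for d = 6.
d = 6
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))
print(R)                      # 602
print(3**d - 2**(d + 1) + 1)  # 602 (closed form)
```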
Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  • Complete search: M = 2^d
  • Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  • Reduce the size of N as the size of the itemset increases
  • Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  • Use efficient data structures to store the candidates or transactions
  • No need to match every candidate against every transaction

Apriori

• Large Itemset Property: any subset of a large itemset is large.
• Contrapositive: if an itemset is not large, none of its supersets are large.

Reducing the Number of Candidates

• Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
• The Apriori principle holds due to the following property of the support measure:

  ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• The support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support

Large Itemset Property: Apriori Example (cont'd), s = 30%, α = 50%

(Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned.)

Illustrating Apriori Principle

Minimum support count = 3, on the market-basket transactions above.

Items (1-itemsets):

Item     Count
Bread    4
Coke     2
Milk     4
Cocoa    3
Diaper   4
Eggs     1

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset           Count
{Bread,Milk}      3
{Bread,Cocoa}     2
{Bread,Diaper}    3
{Milk,Cocoa}      2
{Milk,Diaper}     3
{Cocoa,Diaper}    3

Triplets (3-itemsets): the only candidate whose subsets are all frequent is {Bread,Milk,Diaper}, with count 2.

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB:

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan (C1): {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan (C2): {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

C3 (from L2): {B,C,E}
3rd scan (C3): {B,C,E}: 2
L3: {B,C,E}: 2

Apriori Algorithm

C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
  i = i + 1;
  Ci = Apriori-Gen(Li−1);
  Count Ci to determine Li;
until no more large itemsets found;

Apriori Algorithm

Method (a runnable sketch follows below):

• Let k = 1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
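Below is a compact Python sketch of this loop, applied to the TDB example above with Sup_min = 2. The names (apriori, tdb) and the set-based join/prune are illustrative, not the slides' pseudocode verbatim.

```python
from itertools import combinations

def apriori(db, minsup_count):
    """Return all frequent itemsets of db (a list of sets), level by level."""
    db = [frozenset(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    L = {frozenset([i]) for i in items
         if sum(i in t for t in db) >= minsup_count}
    frequent = set(L)
    k = 1
    while L:
        k += 1
        # Join: combine frequent (k-1)-itemsets into k-item candidates
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: every (k-1)-subset must be frequent (Apriori principle)
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count: one pass over the database per level
        L = {c for c in C
             if sum(1 for t in db if c <= t) >= minsup_count}
        frequent |= L
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s in sorted(apriori(tdb, 2), key=lambda s: (len(s), sorted(s))):
    print(set(s))
# Expected: {A} {B} {C} {E} {A,C} {B,C} {B,E} {C,E} {B,C,E}
```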

Problems with Support and Confidence

• A rule may have high support and high confidence simply because it is an obvious rule: someone who buys potato chips is highly likely to buy a soft drink → not surprising.
• Confidence totally ignores P(B):

  P(A, B) = σ(A ∪ B) / |T|
  P(B|A) = σ(A ∪ B) / σ(A)

Interestingness Measure: Correlations (Lift)

Lift(Wipes & Milk ⇒ Diapers) = 22.73: people who purchased Wipes & Milk are 22.73 times more likely to also purchase Diapers than a randomly chosen customer.

Measure of dependent/correlated events: lift.

  lift(A ⇒ B) = P(A ∪ B) / (P(A) × P(B))

• lift > 1: positively correlated
• lift = 1: independent
• lift < 1: negatively correlated
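The 22.73 figure follows from the earlier wipes example, since lift can equivalently be written as confidence divided by P(B); a quick sketch:

```python
# Reproducing lift(Wipes & Milk => Diapers) from the earlier example's counts.
N = 500_000
p_diapers = 20_000 / N      # P(Diapers) = 0.04
conf = 200 / 220            # confidence of the rule = 0.9091
print(conf / p_diapers)     # 22.73 -> strong positive correlation
```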

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
  • The overall percentage of students eating cereal is 75% > 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Conviction

  conviction(A ⇒ B) = P(A) × P(¬B) / P(A, ¬B)

• Conviction = 1: A and B are independent (not related)
• Conviction = ∞: the rule always holds (A never occurs without B)
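Applying the conviction formula to the basketball/cereal table above (a quick sketch; the probabilities are read directly off the contingency table):

```python
# Conviction for both cereal rules, using the contingency table above.
N = 5000
p_b = 3000 / N            # P(basketball)
p_c = 3750 / N            # P(cereal)
p_b_and_c = 2000 / N      # P(basketball, cereal)
p_b_and_not_c = 1000 / N  # P(basketball, not cereal)

# B => C: conviction < 1, consistent with the negative correlation (lift 0.89)
print(p_b * (1 - p_c) / p_b_and_not_c)   # 0.75
# B => not C: conviction > 1, consistent with lift 1.33
print(p_b * p_c / p_b_and_c)             # 1.125
```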

Chi-square (χ²)

  χ² = Σ (observed − expected)² / expected

For a contingency table with r rows and c columns:

  χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij − e_ij)² / e_ij

  e_ij = count(A = a_i) × count(B = b_j) / N

  DOF = (r − 1) × (c − 1)

where attribute A takes distinct values a1, a2, a3, … along one dimension of the table.

Example

Observed:

                  Coke   Non-Coke   TOTAL
Osteoporosis      22     16         38
No osteoporosis   10     28         38
TOTAL             32     44         76

Expected (e_ij = row total × column total / 76):

                  Coke   Non-Coke   TOTAL
Osteoporosis      16     22         38
No osteoporosis   16     22         38
TOTAL             32     44         76

χ² = (22 − 16)²/16 + (16 − 22)²/22 + (10 − 16)²/16 + (28 − 22)²/22 = 7.77

DOF (df) = (2 − 1) × (2 − 1) = 1; significance level α = 0.01; tabulated χ² = 6.63.
Since 7.77 > 6.63, the null hypothesis is rejected: the two attributes are correlated.

Procedure:

1. State the null hypothesis: A & B are not correlated.
2. Calculate the expected values.
3. Calculate the chi-square value.
4. Compare the calculated & tabulated values.
5. Accept or reject the hypothesis accordingly.
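As a cross-check, the same test can be run in Python with SciPy (a sketch; correction=False disables Yates' continuity correction so the statistic matches the hand calculation above):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[22, 16],    # osteoporosis:    Coke, Non-Coke
                     [10, 28]])   # no osteoporosis: Coke, Non-Coke

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2))   # 7.77
print(dof)              # 1
print(expected)         # [[16. 22.] [16. 22.]]
print(p < 0.01)         # True -> reject the null hypothesis at alpha = 0.01
```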

(Figure: chi-square distribution table, used to look up the critical value 6.63 at df = 1, α = 0.01.)

    Which Measures Should Be Used?