05 Classification: Decision Tree Algorithm

Ericks

KEY PROBLEM

No  Savings  Assets  Income  Credit Risk
1   Medium   High    High    Good
2   Low      Low     Medium  Bad
3   High     Medium  Low     Bad
4   Medium   Medium  Medium  Good
5   Low      Medium  High    Good
6   High     High    Low     Good
7   Low      Low     Low     Bad
8   Medium   Medium  Medium  Good

Records to classify:

Savings  Assets  Income  Credit Risk
Medium   Low     Medium  ?
Low      High    High    ?

COMPUTING IMPURITY

Impurity measures the similarity (homogeneity) or dissimilarity (heterogeneity) of the data in a table that contains attributes and the class of those attributes.

A table is called Pure or Homogeneous if it contains only one class. If it contains more than one class, it is called Impure or Heterogeneous. Impurity can be computed with the following formulas:

PROBABILITY

In the table, Savings, Assets and Income are the attributes and Credit Risk is the class.

There are 3 rows of class Bad and 5 rows of class Good, 8 rows in total. The class probabilities are:

Prob(Bad) = 3/8 = 0.375
Prob(Good) = 5/8 = 0.625

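ENTROPY PARENT

The class probabilities are Prob(Bad) = 3/8 = 0.375 and Prob(Good) = 5/8 = 0.625.

$\text{Entropy Parent} = -0.375 \log_{10}(0.375) - 0.625 \log_{10}(0.625) = -0.375(-0.426) - 0.625(-0.204) = 0.15975 + 0.12750 \approx 0.29$

The logarithm here is base 10. With the more common base-2 logarithm, the same probabilities give an entropy of 0.954.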
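The entropy calculation above can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and its `base` parameter are illustrative choices, not part of the slides.

```python
import math

def entropy(probabilities, base=10):
    """Entropy of a class distribution; classes with probability 0 contribute nothing."""
    return sum(-p * math.log(p, base) for p in probabilities if p > 0)

# Prob(Bad) and Prob(Good) from the parent table
probs = [3 / 8, 5 / 8]

h10 = entropy(probs, base=10)  # base-10 logarithm, as used in the slide
h2 = entropy(probs, base=2)    # base-2 logarithm (bits)

print(round(h10, 2), round(h2, 3))
```

The base only rescales the result, so the ranking of candidate splits is the same with either logarithm.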

GINI INDEX

The class probabilities are Prob(Bad) = 3/8 = 0.375 and Prob(Good) = 5/8 = 0.625.

$\text{Gini Index} = 1 - (0.375^2 + 0.625^2) = 1 - (0.14 + 0.39) = 1 - 0.53 = 0.47$

CLASSIFICATION ERROR INDEX

$\text{Classification Error Index} = 1 - \max\{0.375, 0.625\} = 1 - 0.625 = 0.375$
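The Gini Index and the Classification Error Index for the parent table can be reproduced as follows. This is only a sketch; the function names are assumptions made for illustration.

```python
def gini_index(probabilities):
    """1 minus the sum of squared class probabilities."""
    return 1 - sum(p ** 2 for p in probabilities)

def classification_error(probabilities):
    """1 minus the probability of the most frequent class."""
    return 1 - max(probabilities)

# Prob(Bad) and Prob(Good) from the parent table
probs = [3 / 8, 5 / 8]

parent_gini = gini_index(probs)
parent_error = classification_error(probs)

print(round(parent_gini, 2), parent_error)
```

Both measures are 0 for a pure table and grow as the classes become more evenly mixed.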

SUBSET SAVINGS

Attribute: Savings. Class: Credit Risk. Rows are sorted by Savings and renumbered.

No  Savings  Credit Risk
1   Low      Bad
2   Low      Bad
3   Low      Good
4   Medium   Good
5   Medium   Good
6   Medium   Good
7   High     Bad
8   High     Good

Savings = Low (rows 1-3): Prob(Bad) = 2/3 = 0.66, Prob(Good) = 1/3 = 0.33
$\text{Gini Index} = 1 - (0.66^2 + 0.33^2) = 0.46$ (from the rounded probabilities; the exact value is 4/9, about 0.44)

Savings = Medium (rows 4-6): Prob(Bad) = 0/3 = 0, Prob(Good) = 3/3 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

Savings = High (rows 7-8): Prob(Bad) = 1/2 = 0.5, Prob(Good) = 1/2 = 0.5
$\text{Gini Index} = 1 - (0.5^2 + 0.5^2) = 0.5$
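The per-subset calculation for Savings can be sketched in Python by grouping the class labels by attribute value. The helper names are illustrative assumptions, and the data is the (Savings, Credit Risk) pair of each row of the key problem table. Because the code uses exact fractions rather than the rounded 0.66 and 0.33, the Low subset comes out as about 0.444 instead of 0.46.

```python
from collections import Counter, defaultdict

# (Savings, Credit Risk) for rows 1..8 of the key problem table
rows = [
    ("Medium", "Good"), ("Low", "Bad"), ("High", "Bad"), ("Medium", "Good"),
    ("Low", "Good"), ("High", "Good"), ("Low", "Bad"), ("Medium", "Good"),
]

def gini_index(labels):
    """Gini Index of a list of class labels."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

# Group the class labels by the value of the attribute
subsets = defaultdict(list)
for savings, risk in rows:
    subsets[savings].append(risk)

subset_gini = {value: gini_index(labels) for value, labels in subsets.items()}

for value in ("Low", "Medium", "High"):
    print(value, len(subsets[value]), round(subset_gini[value], 3))
```

The same grouping works for Assets and Income by swapping the first element of each pair.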

SUBSET ASSETS

Attribute: Assets. Class: Credit Risk. Rows are sorted by Assets and renumbered.

No  Assets  Credit Risk
1   Low     Bad
2   Low     Bad
3   Medium  Bad
4   Medium  Good
5   Medium  Good
6   Medium  Good
7   High    Good
8   High    Good

Assets = Low (rows 1-2): Prob(Bad) = 2/2 = 1, Prob(Good) = 0/2 = 0
$\text{Gini Index} = 1 - (1^2 + 0^2) = 0$

Assets = Medium (rows 3-6): Prob(Bad) = 1/4 = 0.25, Prob(Good) = 3/4 = 0.75
$\text{Gini Index} = 1 - (0.25^2 + 0.75^2) = 0.375$

Assets = High (rows 7-8): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

SUBSET INCOME

Attribute: Income. Class: Credit Risk. Rows are sorted by Income and renumbered.

No  Income  Credit Risk
1   Low     Bad
2   Low     Bad
3   Low     Good
4   Medium  Bad
5   Medium  Good
6   Medium  Good
7   High    Good
8   High    Good

Income = Low (rows 1-3): Prob(Bad) = 2/3 = 0.66, Prob(Good) = 1/3 = 0.33
$\text{Gini Index} = 1 - (0.66^2 + 0.33^2) = 0.46$ (from the rounded probabilities; the exact value is 4/9, about 0.44)

Income = Medium (rows 4-6): Prob(Bad) = 1/3 = 0.33, Prob(Good) = 2/3 = 0.66
$\text{Gini Index} = 1 - (0.33^2 + 0.66^2) = 0.46$

Income = High (rows 7-8): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

INFORMATION GAIN

Information Gain (i) with Entropy:
Entropy of parent table D - Sum((number of rows in subset / number of rows in parent) * Entropy of each subset)

Information Gain (i) with Gini Index:
Gini Index of parent table D - Sum((number of rows in subset / number of rows in parent) * Gini Index of each subset)

Information Gain (i) with Classification Error:
Classification Error of parent table D - Sum((number of rows in subset / number of rows in parent) * Classification Error of each subset)
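The three formulas share one shape: the impurity of the parent minus the size-weighted impurity of the subsets. A minimal Python sketch of that shape follows, with the impurity measure passed in as a function. All names are illustrative assumptions. The example uses the (Bad, Good) counts of the Assets split and, for entropy, the base-10 logarithm from the slides.

```python
import math

def probabilities(counts):
    """Turn class counts, e.g. (Bad, Good), into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    return sum(-p * math.log10(p) for p in probabilities(counts) if p > 0)

def gini_index(counts):
    return 1 - sum(p ** 2 for p in probabilities(counts))

def classification_error(counts):
    return 1 - max(probabilities(counts))

def information_gain(parent, subsets, impurity):
    """Impurity of the parent minus the weighted impurity of its subsets."""
    total = sum(parent)
    weighted = sum(sum(s) / total * impurity(s) for s in subsets)
    return impurity(parent) - weighted

# (Bad, Good) counts: parent table, then Assets = Low, Medium, High
parent = (3, 5)
assets_subsets = [(2, 0), (1, 3), (0, 2)]

gains = {
    measure.__name__: information_gain(parent, assets_subsets, measure)
    for measure in (entropy, gini_index, classification_error)
}
for name, gain in gains.items():
    print(name, round(gain, 4))
```

With unrounded intermediate values the Gini-based gain for Assets is 0.28125, slightly below the 0.2825 obtained when the parent Gini Index is rounded to 0.47.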

INFORMATION GAIN

Attribute  Subset (rows)  Gini Index
Savings    Low (3)        0.46
Savings    Medium (3)     0
Savings    High (2)       0.5
Assets     Low (2)        0
Assets     Medium (4)     0.375
Assets     High (2)       0
Income     Low (3)        0.46
Income     Medium (3)     0.46
Income     High (2)       0

Gini Index Parent = 0.47, number of rows in parent = 8.

Information Gain:

Savings: $0.47 - \left(\tfrac{3}{8} \cdot 0.46 + \tfrac{3}{8} \cdot 0 + \tfrac{2}{8} \cdot 0.5\right) = 0.1725$
Assets: $0.47 - \left(\tfrac{2}{8} \cdot 0 + \tfrac{4}{8} \cdot 0.375 + \tfrac{2}{8} \cdot 0\right) = 0.2825$
Income: $0.47 - \left(\tfrac{3}{8} \cdot 0.46 + \tfrac{3}{8} \cdot 0.46 + \tfrac{2}{8} \cdot 0\right) = 0.125$

Maximum Information Gain = subset Assets.
Pure (homogeneous) subsets of Assets: Low and High.
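The comparison of the three attributes can be reproduced directly from the key problem table. The sketch below is one possible way to do it; `gini_gain` and the other names are assumptions made for illustration. It works with unrounded values, so the gains come out as roughly 0.177, 0.281 and 0.135 rather than 0.1725, 0.2825 and 0.125. The ranking is unchanged.

```python
from collections import Counter, defaultdict

COLUMNS = ("Savings", "Assets", "Income", "Credit Risk")
TABLE = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]

def gini_index(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(table, column):
    """Information Gain (Gini Index) of splitting the table on one column."""
    groups = defaultdict(list)
    for row in table:
        groups[row[column]].append(row[-1])  # the class is the last column
    weighted = sum(len(g) / len(table) * gini_index(g) for g in groups.values())
    return gini_index([row[-1] for row in table]) - weighted

gains = {name: gini_gain(TABLE, i) for i, name in enumerate(COLUMNS[:-1])}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(name, round(gain, 3))
print("Best attribute:", best)
```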

DECISION TREE RULE

Assets
  Low    -> Bad
  Medium -> ?
  High   -> Good

If Assets = Low Then Credit Risk = Bad
If Assets = High Then Credit Risk = Good

PARENT - ITERATION #2

Assets = Medium is the only impure subset, so its 4 rows become the parent table for the second iteration.

No  Assets  Savings  Income  Credit Risk
1   Medium  High     Low     Bad
2   Medium  Medium   Medium  Good
3   Medium  Medium   Medium  Good
4   Medium  Low      High    Good

Prob(Bad) = 1/4 = 0.25, Prob(Good) = 3/4 = 0.75

$\text{Gini Index} = 1 - (0.25^2 + 0.75^2) = 0.375$

SUBSET SAVINGS #2

No  Savings  Credit Risk
1   High     Bad
2   Medium   Good
3   Medium   Good
4   Low      Good

Savings = High (row 1): Prob(Bad) = 1/1 = 1, Prob(Good) = 0/1 = 0
$\text{Gini Index} = 1 - (1^2 + 0^2) = 0$

Savings = Medium (rows 2-3): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

Savings = Low (row 4): Prob(Bad) = 0/1 = 0, Prob(Good) = 1/1 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

SUBSET INCOME #2

No  Income  Credit Risk
1   Low     Bad
2   Medium  Good
3   Medium  Good
4   High    Good

Income = Low (row 1): Prob(Bad) = 1/1 = 1, Prob(Good) = 0/1 = 0
$\text{Gini Index} = 1 - (1^2 + 0^2) = 0$

Income = Medium (rows 2-3): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

Income = High (row 4): Prob(Bad) = 0/1 = 0, Prob(Good) = 1/1 = 1
$\text{Gini Index} = 1 - (0^2 + 1^2) = 0$

INFORMATION GAIN #2

Attribute  Subset (rows)  Gini Index
Savings    Low (1)        0
Savings    Medium (2)     0
Savings    High (1)       0
Income     Low (1)        0
Income     Medium (2)     0
Income     High (1)       0

Gini Index Parent = 0.375, number of rows in parent = 4.

Information Gain:

Savings: $0.375 - \left(\tfrac{1}{4} \cdot 0 + \tfrac{2}{4} \cdot 0 + \tfrac{1}{4} \cdot 0\right) = 0.375$
Income: $0.375 - \left(\tfrac{1}{4} \cdot 0 + \tfrac{2}{4} \cdot 0 + \tfrac{1}{4} \cdot 0\right) = 0.375$

Maximum Information Gain = subset Savings and subset Income (a tie).
Pure (homogeneous) subsets of Savings: Low, Medium and High.
Pure (homogeneous) subsets of Income: Low, Medium and High.
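The two iterations follow the same recipe: pick the attribute with the largest Information Gain, split, and repeat on every impure subset. A compact recursive sketch of that recipe is shown below. It is an illustration under stated assumptions, not the procedure from the slides. The names are hypothetical, the Gini Index is the impurity measure, and ties are broken by taking the first candidate, which here selects Savings over Income in the second iteration.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ("Savings", "Assets", "Income")
TABLE = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]

def gini_index(labels):
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def split(rows, index):
    """Group rows by the value found in column `index`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return groups

def gini_gain(rows, index):
    weighted = sum(
        len(group) / len(rows) * gini_index([r[-1] for r in group])
        for group in split(rows, index).values()
    )
    return gini_index([r[-1] for r in rows]) - weighted

def build_tree(rows, candidates):
    labels = [r[-1] for r in rows]
    # Leaf: the subset is pure, or no attribute is left (use the majority class)
    if len(set(labels)) == 1 or not candidates:
        return Counter(labels).most_common(1)[0][0]
    # On a tie, max() keeps the first candidate in the list
    best = max(candidates, key=lambda i: gini_gain(rows, i))
    remaining = [i for i in candidates if i != best]
    return {
        ATTRIBUTES[best]: {
            value: build_tree(group, remaining)
            for value, group in split(rows, best).items()
        }
    }

tree = build_tree(TABLE, [0, 1, 2])
print(tree)
```

Listing Income before Savings in the candidate list would produce the Income version of the tree instead.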

DECISION TREE RULE #2 - SAVINGS VERSION

Assets
  Low    -> Bad
  Medium -> Savings
              Low    -> Good
              Medium -> Good
              High   -> Bad
  High   -> Good

#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3. If Assets = Medium And Savings = Low Then Credit Risk = Good
#4. If Assets = Medium And Savings = High Then Credit Risk = Bad
#5. If Assets = Medium And Savings = Medium Then Credit Risk = Good

DECISION TREE RULE #2 - INCOME VERSION

Assets
  Low    -> Bad
  Medium -> Income
              Low    -> Bad
              Medium -> Good
              High   -> Good
  High   -> Good

#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3. If Assets = Medium And Income = Low Then Credit Risk = Bad
#4. If Assets = Medium And Income = High Then Credit Risk = Good
#5. If Assets = Medium And Income = Medium Then Credit Risk = Good

DECISION TREE RULE - RESULT

#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3a. If Assets = Medium And Savings = Low Then Credit Risk = Good
#4a. If Assets = Medium And Savings = High Then Credit Risk = Bad
#5a. If Assets = Medium And Savings = Medium Then Credit Risk = Good
#3b. If Assets = Medium And Income = Low Then Credit Risk = Bad
#4b. If Assets = Medium And Income = High Then Credit Risk = Good
#5b. If Assets = Medium And Income = Medium Then Credit Risk = Good

  

Savings  Assets  Income  Credit Risk  Savings or Income
Medium   Low     Medium  ?            Bad / Bad
Low      High    High    ?            Good / Good
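The rules can be applied mechanically to the two unlabeled records. The sketch below encodes both rule sets as lookup tables. The function name `classify` and its `second_attribute` parameter are illustrative assumptions.

```python
# Second-level rules, used only when Assets = Medium
SAVINGS_RULES = {"Low": "Good", "Medium": "Good", "High": "Bad"}  # rules 3a-5a
INCOME_RULES = {"Low": "Bad", "Medium": "Good", "High": "Good"}   # rules 3b-5b

def classify(record, second_attribute="Savings"):
    """Return the Credit Risk predicted by the decision rules."""
    if record["Assets"] == "Low":    # rule 1
        return "Bad"
    if record["Assets"] == "High":   # rule 2
        return "Good"
    if second_attribute == "Savings":
        return SAVINGS_RULES[record["Savings"]]
    return INCOME_RULES[record["Income"]]

records = [
    {"Savings": "Medium", "Assets": "Low", "Income": "Medium"},
    {"Savings": "Low", "Assets": "High", "Income": "High"},
]

predictions = [(classify(r, "Savings"), classify(r, "Income")) for r in records]
for record, (by_savings, by_income) in zip(records, predictions):
    print(record, "->", by_savings, "/", by_income)
```

Both records are decided at the Assets level, so the Savings and Income versions agree.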

REFERENCES

Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Chapter 6, Wiley, 2004.

Attributes: Gender, Car Ownership, Travel Cost ($)/km, Income Level. Class: Transportation Mode.

Gender  Car Ownership  Travel Cost ($)/km  Income Level  Transportation Mode
Male                   Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Female  1              Cheap               Medium        Train
Female                 Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Male                   Standard            Medium        Train
Female  1              Standard            Medium        Train
Female  1              Expensive           High          Car
Male    2              Expensive           Medium        Car
Female  2              Expensive           High          Car
Male                   Expensive           Medium