Classification: Decision Tree Algorithm
Ericks

KEY PROBLEM

Training data:

No  Savings  Assets  Income  Credit Risk
1   Medium   High    High    Good
2   Low      Low     Medium  Bad
3   High     Medium  Low     Bad
4   Medium   Medium  Medium  Good
5   Low      Medium  High    Good
6   High     High    Low     Good
7   Low      Low     Low     Bad
8   Medium   Medium  Medium  Good

Records to classify:

Savings  Assets  Income  Credit Risk
Medium   Low     Medium  ?
Low      High    High    ?

MEASURING IMPURITY
Impurity measures how alike (homogeneous) or how mixed (heterogeneous) the data are in a table that contains attributes and the class of those attributes.
A table is called pure, or homogeneous, if it contains only one class. If it contains more than one class it is called impure, or heterogeneous. Impurity can be computed with three measures, each worked through in turn: entropy, the Gini index, and the classification error index.
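
For reference, the three impurity measures can be written compactly as follows. The notation is an assumption introduced here, not taken from the slides: p_j is the fraction of rows that belong to class j, and b is the logarithm base (the worked entropy example uses base 10).

```latex
\begin{align*}
\text{Entropy} &= -\sum_{j} p_j \log_b p_j \\
\text{Gini index} &= 1 - \sum_{j} p_j^{2} \\
\text{Classification error index} &= 1 - \max_{j} p_j
\end{align*}
```

All three are 0 for a pure table and grow as the classes become more evenly mixed.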

PROBABILITY
In the training table, Savings, Assets, and Income are the attributes and Credit Risk is the class.
There are 3 rows of class Bad and 5 rows of class Good, out of 8 rows in total.
The class probabilities are:
Prob(Bad) = 3/8 = 0.375
Prob(Good) = 5/8 = 0.625
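
The class probabilities can be checked with a few lines of Python. This is a minimal sketch: the list `credit_risk` is just the Credit Risk column of the training table typed in by hand, and the function name `class_probabilities` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Credit Risk column of the 8 training rows, in row order.
credit_risk = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]


def class_probabilities(labels):
    """Return {class: fraction of rows that carry that class}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


probs = class_probabilities(credit_risk)
print(probs["Bad"], probs["Good"])  # 0.375 0.625
```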
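
PARENT ENTROPY
With Prob(Bad) = 0.375 and Prob(Good) = 0.625, and using base-10 logarithms:
$$\text{Entropy}(\text{Parent}) = -0.375 \log_{10}(0.375) - 0.625 \log_{10}(0.625) = -0.375(-0.426) - 0.625(-0.204) = 0.1598 + 0.1275 \approx 0.29$$
With base-2 logarithms, the more common convention, the same probabilities give an entropy of about 0.954.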
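
The entropy calculation can be reproduced as below. This is a sketch under the assumption that the logarithm base is a free parameter; `entropy` is a hypothetical helper name. Base 10 matches the worked value above, while base 2 gives the value in bits.

```python
import math


def entropy(probabilities, base=10):
    """Entropy of a class distribution; classes with probability 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


parent = [0.375, 0.625]  # Prob(Bad), Prob(Good)
print(round(entropy(parent, base=10), 2))  # 0.29
print(round(entropy(parent, base=2), 3))   # 0.954
```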

GINI INDEX
With Prob(Bad) = 3/8 = 0.375 and Prob(Good) = 5/8 = 0.625:
$$\text{Gini index} = 1 - (0.375^{2} + 0.625^{2}) = 1 - (0.14 + 0.39) = 1 - 0.53 = 0.47$$

CLASSIFICATION ERROR INDEX
With the same class probabilities:
$$\text{Classification error index} = 1 - \max\{0.375, 0.625\} = 1 - 0.625 = 0.375$$
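
Both measures are one-liners in code. The sketch below uses hypothetical helper names (`gini_index`, `classification_error`) and works on the unrounded probabilities, so the Gini index comes out as 0.46875 rather than the rounded 0.47.

```python
def gini_index(probabilities):
    """1 minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probabilities)


def classification_error(probabilities):
    """1 minus the probability of the most common class."""
    return 1 - max(probabilities)


parent = [0.375, 0.625]  # Prob(Bad), Prob(Good)
print(gini_index(parent))            # 0.46875
print(classification_error(parent))  # 0.375
```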

SUBSET - SAVINGS
Training rows sorted by Savings and renumbered (attribute: Savings, class: Credit Risk):

No  Savings  Credit Risk
1   Low      Bad
2   Low      Bad
3   Low      Good
4   Medium   Good
5   Medium   Good
6   Medium   Good
7   High     Bad
8   High     Good

Savings = Low (rows 1-3): Prob(Bad) = 2/3 = 0.66, Prob(Good) = 1/3 = 0.33
$$\text{Gini index} = 1 - (0.66^{2} + 0.33^{2}) = 0.46$$
Savings = Medium (rows 4-6): Prob(Bad) = 0/3 = 0, Prob(Good) = 3/3 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$
Savings = High (rows 7-8): Prob(Bad) = 1/2 = 0.5, Prob(Good) = 1/2 = 0.5
$$\text{Gini index} = 1 - (0.5^{2} + 0.5^{2}) = 0.5$$
The probabilities 2/3 and 1/3 are truncated to two decimals before squaring; with exact fractions the Low subset has a Gini index of 4/9, about 0.44.
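
The per-subset calculation generalises to any attribute. The following sketch groups the training rows by an attribute value and computes the Gini index of each group; `ROWS`, `COLUMN`, `gini`, and `subset_gini` are hypothetical names chosen for this illustration, and the data are the eight training rows typed in by hand. Exact fractions are used, so the Low subset comes out near 0.444 rather than 0.46.

```python
from collections import Counter, defaultdict

# Training rows as (Savings, Assets, Income, Credit Risk).
ROWS = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]
COLUMN = {"Savings": 0, "Assets": 1, "Income": 2}


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def subset_gini(rows, attribute):
    """Map each attribute value to (subset size, Gini index of its class labels)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMN[attribute]]].append(row[-1])
    return {value: (len(labels), gini(labels)) for value, labels in groups.items()}


savings = subset_gini(ROWS, "Savings")
for value, (size, g) in savings.items():
    print(f"Savings = {value}: {size} rows, Gini = {g:.3f}")
```

Calling `subset_gini(ROWS, "Assets")` or `subset_gini(ROWS, "Income")` gives the corresponding figures for the other two attributes.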

SUBSET - ASSETS
Training rows sorted by Assets and renumbered (attribute: Assets, class: Credit Risk):

No  Assets  Credit Risk
1   Low     Bad
2   Low     Bad
3   Medium  Bad
4   Medium  Good
5   Medium  Good
6   Medium  Good
7   High    Good
8   High    Good

Assets = Low (rows 1-2): Prob(Bad) = 2/2 = 1, Prob(Good) = 0/2 = 0
$$\text{Gini index} = 1 - (1^{2} + 0^{2}) = 0$$
Assets = Medium (rows 3-6): Prob(Bad) = 1/4 = 0.25, Prob(Good) = 3/4 = 0.75
$$\text{Gini index} = 1 - (0.25^{2} + 0.75^{2}) = 0.375$$
Assets = High (rows 7-8): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$

SUBSET - INCOME
Training rows sorted by Income and renumbered (attribute: Income, class: Credit Risk):

No  Income  Credit Risk
1   Low     Bad
2   Low     Bad
3   Low     Good
4   Medium  Bad
5   Medium  Good
6   Medium  Good
7   High    Good
8   High    Good

Income = Low (rows 1-3): Prob(Bad) = 2/3 = 0.66, Prob(Good) = 1/3 = 0.33
$$\text{Gini index} = 1 - (0.66^{2} + 0.33^{2}) = 0.46$$
Income = Medium (rows 4-6): Prob(Bad) = 1/3 = 0.33, Prob(Good) = 2/3 = 0.66
$$\text{Gini index} = 1 - (0.33^{2} + 0.66^{2}) = 0.46$$
Income = High (rows 7-8): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$

INFORMATION GAIN
The information gain of attribute i is the impurity of the parent table D minus the sum of the subset impurities, each weighted by the subset's share of the parent rows:
Information Gain(i), entropy: Entropy of parent table D - Sum( (subset rows / parent rows) * Entropy of each subset )
Information Gain(i), Gini index: Gini index of parent table D - Sum( (subset rows / parent rows) * Gini index of each subset )
Information Gain(i), classification error: Classification error of parent table D - Sum( (subset rows / parent rows) * Classification error of each subset )

Applied to the three attributes with the Gini index:

Attribute  Subset (rows)  Gini index  Information gain
Savings    Low (3)        0.46        0.47 - ( (3/8 * 0.46) + (3/8 * 0) + (2/8 * 0.5) ) = 0.1725
           Medium (3)     0
           High (2)       0.5
Assets     Low (2)        0           0.47 - ( (2/8 * 0) + (4/8 * 0.375) + (2/8 * 0) ) = 0.2825
           Medium (4)     0.375
           High (2)       0
Income     Low (3)        0.46        0.47 - ( (3/8 * 0.46) + (3/8 * 0.46) + (2/8 * 0) ) = 0.125
           Medium (3)     0.46
           High (2)       0

Parent Gini index = 0.47; number of parent rows = 8.
Maximum information gain: Assets (0.2825), so Assets becomes the first split.
Pure (homogeneous) Assets subsets: Low and High.
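
The attribute comparison can be sketched in code as follows. The names `ROWS`, `COLUMN`, `gini`, and `gini_gain` are hypothetical, and the computation uses exact fractions, so the gains differ slightly from the rounded values 0.1725, 0.2825, and 0.125 while the ranking stays the same.

```python
from collections import Counter, defaultdict

# Training rows as (Savings, Assets, Income, Credit Risk).
ROWS = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]
COLUMN = {"Savings": 0, "Assets": 1, "Income": 2}


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_gain(rows, attribute):
    """Parent Gini index minus the size-weighted Gini index of each subset."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMN[attribute]]].append(row[-1])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[-1] for row in rows]) - weighted


gains = {name: gini_gain(ROWS, name) for name in COLUMN}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name}: {value:.4f}")
print("Best split:", best)  # Best split: Assets
```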

DECISION TREE RULE

Assets
  Low    -> Bad
  Medium -> ?
  High   -> Good

If Assets = Low Then Credit Risk = Bad
If Assets = High Then Credit Risk = Good
The Assets = Medium branch is still impure, so it is split again in iteration #2.

PARENT - ITERATION #2
The new parent table contains the four training rows with Assets = Medium:

No  Assets  Savings  Income  Credit Risk
1   Medium  High     Low     Bad
2   Medium  Medium   Medium  Good
3   Medium  Medium   Medium  Good
4   Medium  Low      High    Good

Prob(Bad) = 1/4 = 0.25
Prob(Good) = 3/4 = 0.75
$$\text{Gini index} = 1 - (0.25^{2} + 0.75^{2}) = 0.375$$
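
Building the iteration #2 parent amounts to filtering the training rows on the impure branch. A minimal sketch, assuming the rows are stored as tuples in the original column order (Savings, Assets, Income, Credit Risk); `ROWS`, `gini`, and `medium_rows` are hypothetical names.

```python
from collections import Counter

# Training rows as (Savings, Assets, Income, Credit Risk).
ROWS = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


# Keep only the rows on the impure branch (Assets is column 1).
medium_rows = [row for row in ROWS if row[1] == "Medium"]
parent_gini = gini([row[-1] for row in medium_rows])
print(len(medium_rows), parent_gini)  # 4 0.375
```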

SUBSET SAVINGS - #2

No  Savings  Credit Risk
1   High     Bad
2   Medium   Good
3   Medium   Good
4   Low      Good

Savings = High (row 1): Prob(Bad) = 1/1 = 1, Prob(Good) = 0/1 = 0
$$\text{Gini index} = 1 - (1^{2} + 0^{2}) = 0$$
Savings = Medium (rows 2-3): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$
Savings = Low (row 4): Prob(Bad) = 0/1 = 0, Prob(Good) = 1/1 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$

SUBSET INCOME - #2

No  Income  Credit Risk
1   Low     Bad
2   Medium  Good
3   Medium  Good
4   High    Good

Income = Low (row 1): Prob(Bad) = 1/1 = 1, Prob(Good) = 0/1 = 0
$$\text{Gini index} = 1 - (1^{2} + 0^{2}) = 0$$
Income = Medium (rows 2-3): Prob(Bad) = 0/2 = 0, Prob(Good) = 2/2 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$
Income = High (row 4): Prob(Bad) = 0/1 = 0, Prob(Good) = 1/1 = 1
$$\text{Gini index} = 1 - (0^{2} + 1^{2}) = 0$$

INFORMATION GAIN - #2

Attribute  Subset (rows)  Gini index  Information gain
Savings    Low (1)        0           0.375 - ( (1/4 * 0) + (2/4 * 0) + (1/4 * 0) ) = 0.375
           Medium (2)     0
           High (1)       0
Income     Low (1)        0           0.375 - ( (1/4 * 0) + (2/4 * 0) + (1/4 * 0) ) = 0.375
           Medium (2)     0
           High (1)       0

Parent Gini index = 0.375; number of parent rows = 4.
Maximum information gain: Savings and Income tie at 0.375.
Pure (homogeneous) Savings subsets: Low, Medium, and High.
Pure (homogeneous) Income subsets: Low, Medium, and High.
Because the two attributes tie, either can be used for the second split, which gives two versions of the tree.
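
The tie can be confirmed programmatically. The sketch below recomputes the Gini-based gain on the four Assets = Medium rows and lists every attribute that reaches the maximum; `MEDIUM_ROWS`, `COLUMN`, `gini`, `gini_gain`, and `tied` are hypothetical names used only for this illustration.

```python
import math
from collections import Counter, defaultdict

# Iteration #2 parent: rows with Assets = Medium, as (Savings, Assets, Income, Credit Risk).
MEDIUM_ROWS = [
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("Medium", "Medium", "Medium", "Good"),
]
COLUMN = {"Savings": 0, "Assets": 1, "Income": 2}


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_gain(rows, attribute):
    """Parent Gini index minus the size-weighted Gini index of each subset."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMN[attribute]]].append(row[-1])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[-1] for row in rows]) - weighted


gains2 = {name: gini_gain(MEDIUM_ROWS, name) for name in COLUMN}
top = max(gains2.values())
# Collect every attribute whose gain equals the maximum (within float tolerance).
tied = [name for name, value in gains2.items() if math.isclose(value, top)]
print(gains2)
print("Tied for best split:", tied)  # Tied for best split: ['Savings', 'Income']
```

Assets itself has zero gain here because every row in this subset has the same Assets value.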

DECISION TREE RULE #2 - SAVINGS VERSION

Assets
  Low    -> Bad
  Medium -> Savings
              Low    -> Good
              Medium -> Good
              High   -> Bad
  High   -> Good

#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3. If Assets = Medium And Savings = Low Then Credit Risk = Good
#4. If Assets = Medium And Savings = High Then Credit Risk = Bad
#5. If Assets = Medium And Savings = Medium Then Credit Risk = Good

DECISION TREE RULE #2 - INCOME VERSION

Assets
  Low    -> Bad
  Medium -> Income
              Low    -> Bad
              Medium -> Good
              High   -> Good
  High   -> Good

#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3. If Assets = Medium And Income = Low Then Credit Risk = Bad
#4. If Assets = Medium And Income = High Then Credit Risk = Good
#5. If Assets = Medium And Income = Medium Then Credit Risk = Good
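
The two iterations can be folded into one recursive procedure. The following is a sketch of a greedy Gini-based tree builder, not a definitive implementation: all names (`ROWS`, `ATTRIBUTES`, `split`, `gain`, `build`) are hypothetical, and ties are broken by attribute order, which is an assumption made here to reproduce both versions of the tree.

```python
from collections import Counter, defaultdict

# Training rows as (Savings, Assets, Income, Credit Risk).
ROWS = [
    ("Medium", "High", "High", "Good"),
    ("Low", "Low", "Medium", "Bad"),
    ("High", "Medium", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
    ("Low", "Medium", "High", "Good"),
    ("High", "High", "Low", "Good"),
    ("Low", "Low", "Low", "Bad"),
    ("Medium", "Medium", "Medium", "Good"),
]
ATTRIBUTES = ("Savings", "Assets", "Income")


def gini(labels):
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def split(rows, attribute):
    """Group rows by their value for the given attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[ATTRIBUTES.index(attribute)]].append(row)
    return groups


def gain(rows, attribute):
    """Gini-based information gain of splitting rows on attribute."""
    weighted = sum(len(g) / len(rows) * gini([r[-1] for r in g])
                   for g in split(rows, attribute).values())
    return gini([r[-1] for r in rows]) - weighted


def build(rows, attributes):
    """Return a class label for a leaf, else {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # max() keeps the first attribute on ties, so the order of `attributes`
    # decides between the Savings and the Income version.
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: build(group, rest)
                   for value, group in split(rows, best).items()}}


tree_savings = build(ROWS, ["Savings", "Assets", "Income"])
tree_income = build(ROWS, ["Income", "Assets", "Savings"])
print(tree_savings)
print(tree_income)
```

Both calls pick Assets at the root because its gain is strictly largest; only the split under Assets = Medium depends on the tie-break.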

DECISION TREE RULE - RESULT
#1. If Assets = Low Then Credit Risk = Bad
#2. If Assets = High Then Credit Risk = Good
#3a. If Assets = Medium And Savings = Low Then Credit Risk = Good
#4a. If Assets = Medium And Savings = High Then Credit Risk = Bad
#5a. If Assets = Medium And Savings = Medium Then Credit Risk = Good
#3b. If Assets = Medium And Income = Low Then Credit Risk = Bad
#4b. If Assets = Medium And Income = High Then Credit Risk = Good
#5b. If Assets = Medium And Income = Medium Then Credit Risk = Good
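
The rule set translates directly into a small function. This sketch assumes every attribute value is one of Low, Medium, or High; the function name `credit_risk` and the `second_split` parameter are hypothetical, with `"Savings"` selecting the "a" rules and `"Income"` the "b" rules.

```python
def credit_risk(savings, assets, income, second_split="Savings"):
    """Apply rules #1 to #5; assumes values are Low, Medium, or High."""
    if assets == "Low":
        return "Bad"    # rule #1
    if assets == "High":
        return "Good"   # rule #2
    # Assets = Medium: rules #3 to #5, in the chosen version.
    if second_split == "Savings":
        return "Bad" if savings == "High" else "Good"   # rules #3a, #4a, #5a
    return "Bad" if income == "Low" else "Good"         # rules #3b, #4b, #5b


# The two unlabeled records from the key problem: (Savings, Assets, Income).
queries = [("Medium", "Low", "Medium"), ("Low", "High", "High")]
for query in queries:
    by_savings = credit_risk(*query, second_split="Savings")
    by_income = credit_risk(*query, second_split="Income")
    print(query, by_savings, "/", by_income)
```

Both records are decided by the Assets rules alone, so the choice of second split does not affect them.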
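
Classifying the two records from the key problem (result by Savings version / Income version):

Savings  Assets  Income  Credit Risk  Savings or Income
Medium   Low     Medium  ?            Bad / Bad
Low      High    High    ?            Good / Good

Both records are decided by rules #1 and #2 on Assets alone, so the two versions of the tree agree.

REFERENCES
Daniel T. Larose, Discovering Knowledge in Data (Introduction to Data Mining), Chapter 6, Wiley, 2004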

Attributes: Gender, Car Ownership, Travel Cost ($)/km, Income Level. Class: Transportation Mode.

Gender  Car Ownership  Travel Cost ($)/km  Income Level  Transportation Mode
Male                   Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Female  1              Cheap               Medium        Train
Female                 Cheap               Low           Bus
Male    1              Cheap               Medium        Bus
Male                   Standard            Medium        Train
Female  1              Standard            Medium        Train
Female  1              Expensive           High          Car
Male    2              Expensive           Medium        Car
Female  2              Expensive           High          Car
Male                   Expensive           Medium