Lecture 5: Classification and Prediction
Lecture 5 / DMBI / IKI83403T / MTI / UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia

Objectives
- Introduction
- What is Classification?
- Classification vs Prediction
- Supervised and Unsupervised Learning
- Data Preparation
- Classification Accuracy
- ID3 Algorithm
- Information Gain
- Bayesian Classification
- Predictive Modelling

Introduction
- Databases are rich with hidden information that can be used for making intelligent business decisions.
- Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
- Classification predicts categorical labels. Ex: categorize bank loan applications → safe or risky.
- Prediction models continuous-valued functions. Ex: predict the expenditures of potential customers on computer equipment given their income and occupation.
- Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

What is Classification? – A Two-Step Process (1)
- Model construction:
  - Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
  - Data tuples are also referred to as samples, examples, or objects.
  - All tuples used for model construction form the training set.
  - Since the class label of each training sample is provided → supervised learning. In clustering (unsupervised learning), the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance.
  - The model is represented in the following forms: classification rules (IF-THEN statements), decision trees, mathematical formulae.

What is Classification? – A Two-Step Process (2)
- The model is used for classifying future or unknown objects.
  - First, the predictive accuracy of the model is estimated:
    - The known label of each test sample is compared with the classified result from the model.
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
    - The test set is independent of the training set, otherwise over-fitting will occur (the model may have incorporated particular anomalies of the training data that are not present in the overall sample population).
  - If the accuracy of the model is considered acceptable, the model can be used to classify future objects for which the class label is not known (unknown, previously unseen data).

Classification Process (1)
- Training data are fed to a classification algorithm, which produces the classifier (model):

  NAME | RANK           | YEARS | TENURED
  Mike | Assistant Prof | 3     | no
  Mary | Assistant Prof | 7     | yes
  Bill | Professor      | 2     | yes
  Jim  | Associate Prof | 7     | yes
  Dave | Assistant Prof | 6     | no
  Anne | Associate Prof | 3     | no

- Learned classifier (model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2)
- The classifier is applied first to testing data (to estimate its accuracy) and then to unseen data:

  NAME    | RANK           | YEARS | TENURED
  Tom     | Assistant Prof | 2     | no
  Merlisa | Associate Prof | 7     | no
  George  | Professor      | 5     | yes
  Joseph  | Assistant Prof | 7     | yes

- Unseen data: (Jeff, Professor, 4) → Tenured?
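Below is a minimal sketch (not part of the original slides) of this second step in Python: it hard-codes the rule learned on the previous slide and applies it to the test tuples and to the unseen tuple (Jeff, Professor, 4). The function and variable names are illustrative.

```python
def predict_tenured(rank, years):
    """Classifier learned in step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test set from this slide: (name, rank, years, known label).
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

# Accuracy rate: fraction of test samples whose known label matches the model's output.
correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in test_set)
print(f"accuracy rate: {correct}/{len(test_set)}")   # 3/4 (Merlisa is misclassified)

# Classify the previously unseen tuple (Jeff, Professor, 4).
print(predict_tenured("Professor", 4))               # yes
```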

What is Prediction?
- Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict future or unknown objects.
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification refers to predicting a categorical class label.
  - Prediction refers to predicting a continuous value.

Classification vs Prediction
- Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer → classification.
- Predicting the number of major purchases that a customer will make during a fiscal year → prediction.

Supervised vs Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data are classified based on the training set.
- Unsupervised learning (clustering)
  - We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
  - There are no training data, or the "training data" are not accompanied by class labels.

Issues – Data Preparation (1)
- Data preprocessing can be used to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
- Data cleaning
  - Remove/reduce noise and treat missing values.
- Relevance analysis
  - Many of the attributes in the data may be irrelevant to the classification or prediction task. Ex: data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application.
  - Other attributes may be redundant.
  - This step is known as feature selection.

Issues – Data Preparation (2)
- Data transformation
  - Data can be generalized to higher-level concepts. This is useful for continuous-valued attributes: income can be generalized → low, medium, high; street → city.
  - Generalization compresses the original training data, so fewer input/output operations may be involved during learning.
  - When using neural networks (or other methods involving distance measurements), data may also be normalized, as sketched below.
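As a small illustration of that last point (my own, not from the slides), the sketch below min-max normalizes a numeric attribute such as income into the range [0, 1], one common way to normalize data before distance-based methods or neural networks. The data values and function name are made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale a list of numeric attribute values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to new_min
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [23_000, 41_000, 58_000, 95_000]      # hypothetical customer incomes
print(min_max_normalize(incomes))               # [0.0, 0.25, 0.486..., 1.0]
```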

Comparing Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Scalability
  - efficiency on large databases (data that are not memory resident)
- Robustness
  - handling noise and missing values
- Interpretability
  - the level of understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - the compactness of classification rules

Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - used for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k-1 subsamples as training data and one subsample as test data → k-fold cross-validation
  - for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets
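To make the k-fold idea concrete, here is a small sketch (my own illustration, not from the slides) that partitions a data set into k subsamples and, in each round, trains on k-1 of them and tests on the remaining one. The `train_and_test` callback and the toy data are placeholders for whatever classifier and data set are being evaluated.

```python
def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k (nearly) equal subsamples."""
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_test):
    """Average accuracy over k rounds: each fold serves as the test set exactly once."""
    folds = k_fold_indices(len(data), k)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        accuracies.append(train_and_test([data[j] for j in train_idx],
                                         [data[j] for j in test_idx]))
    return sum(accuracies) / k

# Toy usage: a placeholder "classifier" that ignores its input and reports 1.0 accuracy.
print(cross_validate(list(range(10)), k=3, train_and_test=lambda tr, te: 1.0))
```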

What is a Decision Tree?
- A decision tree is a flow-chart-like tree structure.
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test.
    - All tuples in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or a class label distribution.
- To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.
- Decision trees can easily be converted to classification rules.

Training Dataset
- An example from Quinlan's ID3:

  Outlook  | Temperature | Humidity | Windy | Class
  sunny    | hot         | high     | false | N
  sunny    | hot         | high     | true  | N
  overcast | hot         | high     | false | P
  rain     | mild        | high     | false | P
  rain     | cool        | normal   | false | P
  rain     | cool        | normal   | true  | N
  overcast | cool        | normal   | true  | P
  sunny    | mild        | high     | false | N
  sunny    | cool        | normal   | false | P
  rain     | mild        | normal   | false | P
  sunny    | mild        | normal   | true  | P
  overcast | mild        | high     | true  | P
  overcast | hot         | normal   | false | P
  rain     | mild        | high     | true  | N

A Sample Decision Tree
- outlook?
  - sunny → humidity?
    - high → N
    - normal → P
  - overcast → P
  - rain → windy?
    - true → N
    - false → P

Decision-Tree Classification Methods
- The basic top-down decision tree generation approach usually consists of two phases:
  - Tree construction
    - At the start, all the training examples are at the root.
    - Partition the examples recursively based on selected attributes.
  - Tree pruning
    - Aims at removing tree branches that may lead to errors when classifying test data (the training data may contain noise, outliers, …).
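The sample tree above can be represented directly as nested dictionaries, and classifying a sample is then just the root-to-leaf walk described on the "What is a Decision Tree?" slide. This encoding and the `classify` helper are my own illustration, not material from the lecture.

```python
# The sample decision tree, encoded as {attribute: {value: subtree-or-label}}.
SAMPLE_TREE = {
    "outlook": {
        "sunny":    {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain":     {"windy": {"true": "N", "false": "P"}},
    }
}

def classify(tree, sample):
    """Trace a path from the root to the leaf that holds the class prediction."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                   # attribute tested at this node
        tree = tree[attribute][sample[attribute]]      # follow the branch matching the sample's value
    return tree                                        # a leaf, i.e. the class label

sample = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"}
print(classify(SAMPLE_TREE, sample))                   # N
```

Each root-to-leaf path in this structure corresponds to one IF-THEN rule, e.g. IF outlook = 'sunny' AND humidity = 'high' THEN class = 'N', which is the rule-conversion idea mentioned above.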

ID3 Algorithm
- All attributes are categorical.
- Create a node N;
- if the samples are all of the same class C, then
  - return N as a leaf node labeled with C;
- if the attribute-list is empty, then
  - return N as a leaf node labeled with the most common class;
- select the split-attribute with the highest information gain:
  - label N with the split-attribute;
  - for each value Ai of the split-attribute, grow a branch from node N;
  - let Si be the branch in which all tuples have the value Ai for the split-attribute;
    - if Si is empty, then attach a leaf labeled with the most common class;
    - else recursively run the algorithm at node Si;
- until all branches reach leaf nodes.

Choosing Split Attribute – Information Gain (ID3/C4.5) (1)
- Assume all attributes are categorical (discrete-valued). Continuous-valued attributes must be discretized.
- Information gain is used to select the test attribute at each node in the tree.
- It is also called a measure of the goodness of split.
- The attribute with the highest information gain is chosen as the test attribute for the current node.
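Below is a minimal, self-contained Python sketch of the ID3 procedure just described, run on Quinlan's training dataset from the earlier slide. It relies on the information-gain formulas that the next slides define; the names `entropy`, `info_gain`, `id3`, and `WEATHER` are my own, not from the lecture.

```python
import math
from collections import Counter

# Quinlan's training dataset from the "Training Dataset" slide.
WEATHER = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "false", "class": "N"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": "true",  "class": "N"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": "false", "class": "P"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "windy": "false", "class": "P"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "windy": "false", "class": "P"},
    {"outlook": "rain",     "temperature": "cool", "humidity": "normal", "windy": "true",  "class": "N"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true",  "class": "P"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": "false", "class": "N"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": "false", "class": "P"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "normal", "windy": "false", "class": "P"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": "true",  "class": "P"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": "true",  "class": "P"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": "false", "class": "P"},
    {"outlook": "rain",     "temperature": "mild", "humidity": "high",   "windy": "true",  "class": "N"},
]

def entropy(samples):
    """I(p, n): the expected information needed to classify a sample in this set."""
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(samples, attribute):
    """gain(A) = I(p, n) - E(A), where E(A) is the weighted entropy of the partition by A."""
    total, e_a = len(samples), 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s for s in samples if s[attribute] == value]
        e_a += len(subset) / total * entropy(subset)
    return entropy(samples) - e_a

def id3(samples, attributes):
    """Build a tree as nested dicts {attribute: {value: subtree-or-label}}."""
    classes = {s["class"] for s in samples}
    if len(classes) == 1:                                        # all samples in the same class C
        return classes.pop()
    if not attributes:                                           # attribute list is empty
        return Counter(s["class"] for s in samples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(samples, a))  # highest information gain
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {s[best] for s in samples}:                     # one branch per observed value Ai
        subset = [s for s in samples if s[best] == value]        # Si is never empty here
        branches[value] = id3(subset, remaining)
    return {best: branches}

tree = id3(WEATHER, ["outlook", "temperature", "humidity", "windy"])
print(tree)   # outlook at the root, humidity under sunny, windy under rain: the sample tree above
```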

Information Gain (ID3/C4.5) (2)
- Assume that there are two classes, P and N.
  - Let the set of examples S contain p elements of class P and n elements of class N.
  - The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

      I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))

- Assume that using attribute A as the root in the tree will partition S into sets {S1, S2, …, Sv}.
  - If Si contains pi examples of P and ni examples of N, the information needed to classify objects in all subtrees Si is

      E(A) = Σ_{i=1..v} ((pi + ni) / (p + n)) · I(pi, ni)

Information Gain (ID3/C4.5) (3)
- The attribute A is selected such that the information gain

      gain(A) = I(p, n) - E(A)

  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
- In the given sample data, attribute outlook is chosen to split at the root:
  - gain(outlook) = 0.246
  - gain(temperature) = 0.029
  - gain(humidity) = 0.151
  - gain(windy) = 0.048
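The gains above follow from these formulas together with the class counts in the Quinlan training table. The short sketch below (my own check, not from the slides) plugs the per-value (P, N) counts into I(p, n) and E(A) and reproduces the four numbers.

```python
import math

def I(p, n):
    """Expected information for a node with p examples of class P and n of class N."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

# Class counts (P, N) for each value of each attribute, read off the 14-row training table.
partitions = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rain
    "temperature": [(2, 2), (4, 2), (3, 1)],   # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],           # high, normal
    "windy":       [(3, 3), (6, 2)],           # true, false
}

total_info = I(9, 5)                           # 0.940 for the whole training set
for attribute, counts in partitions.items():
    e_a = sum((p + n) / 14 * I(p, n) for p, n in counts)
    print(f"gain({attribute}) = {total_info - e_a:.3f}")
# Prints gain(outlook) = 0.247, gain(temperature) = 0.029, gain(humidity) = 0.152,
# gain(windy) = 0.048 -- matching the slide's values (0.246 and 0.151 there are rounded differently).
```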

Information Gain (ID3/C4.5) (4)
- See Table 7.1.
- Class label: buys_computer. Two values: YES, NO.
- m = 2. C1 corresponds to yes, C2 corresponds to no.
- There are 9 samples of class yes and 5 samples of class no.
- Compute the expected information needed to classify a given sample:

      I(s1, s2) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Information Gain (ID3/C4.5) (5)
- Next, compute the entropy of each attribute. Let's start with the attribute age.
  - For age = "<=30":  s11 = 2, s21 = 3, I(s11, s21) = 0.971
  - For age = "31…40": s12 = 4, s22 = 0, I(s12, s22) = 0
  - For age = ">40":   s13 = 3, s23 = 2, I(s13, s23) = 0.971
- Using equation (7.2), the expected information needed to classify a given sample if the samples are partitioned by age is

      E(age) = (5/14) I(s11, s21) + (4/14) I(s12, s22) + (5/14) I(s13, s23) = 0.694

- Hence, the gain in information from such a partitioning is

      Gain(age) = I(s1, s2) - E(age) = 0.246

- Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
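A quick arithmetic check of this example (my own, not on the slides), using the per-partition class counts for age listed above:

```python
import math

def I(p, n):
    """I(p, n) as defined earlier: expected information for p 'yes' and n 'no' samples."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

info = I(9, 5)                                                     # 0.940
e_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)     # 0.694
print(round(info, 3), round(e_age, 3), round(info - e_age, 3))
# 0.94 0.694 0.247  (the slide's 0.246 comes from subtracting the already-rounded 0.940 - 0.694)
```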

How to Use a Tree?
- Directly
  - Test the attribute values of the unknown sample against the tree.
  - A path is traced from the root to a leaf, which holds the label.
- Indirectly
  - The decision tree is converted to classification rules.
  - One rule is created for each path from the root to a leaf.
  - IF-THEN rules are easier for humans to understand.
  - Example: IF age = "

Tree Pruning