Simplicity first

- Simple algorithms often work very well!
- There are many kinds of simple structure, e.g.:
  - One attribute does all the work
  - All attributes contribute equally & independently
  - A weighted linear combination might do
  - Instance-based: use a few prototypes
  - Use simple logical rules
- Success of method depends on the domain

Inferring rudimentary rules

- 1R: learns a 1-level decision tree
  - I.e., rules that all test one particular attribute
- Basic version
  - One branch for each value
  - Each branch assigns most frequent class
  - Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  - Choose attribute with lowest error rate (assumes nominal attributes)

Pseudo-code for 1R

    For each attribute,
      For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
      Calculate the error rate of the rules
    Choose the rules with the smallest error rate

- Note: "missing" is treated as a separate attribute value

Evaluating the weather attributes (on the 14-instance weather data)

    Attribute   Rules              Errors   Total errors
    Outlook     Sunny -> No        2/5      4/14
                Overcast -> Yes    0/4
                Rainy -> Yes       2/5
    Temp        Hot -> No*         2/4      5/14
                Mild -> Yes        2/6
                Cool -> Yes        1/4
    Humidity    High -> No         3/7      4/14
                Normal -> Yes      1/7
    Windy       False -> Yes       2/8      5/14
                True -> No*        3/6

    * indicates a tie
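The pseudo-code above translates almost directly into code. Below is a minimal Python sketch of 1R (not the book's Weka implementation); the tiny dataset and the function name one_r are illustrative assumptions.

    from collections import Counter

    def one_r(instances, attributes, class_index=-1):
        """1R: for each attribute build one rule per value (majority class),
        then keep the attribute whose rule set makes the fewest errors."""
        best = None
        for a in attributes:
            # count class frequencies for every value of attribute a
            counts = {}
            for row in instances:
                counts.setdefault(row[a], Counter())[row[class_index]] += 1
            # majority class per value, and the total errors this rule set makes
            rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
            if best is None or errors < best[2]:
                best = (a, rules, errors)
        return best  # (attribute index, {value: class}, total errors)

    # tiny illustration on three columns (outlook, windy, play)
    data = [("sunny", "false", "no"), ("sunny", "true", "no"),
            ("overcast", "false", "yes"), ("rainy", "false", "yes"),
            ("rainy", "true", "no")]
    print(one_r(data, attributes=[0, 1]))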

Dealing with numeric attributes

- Discretize numeric attributes
- Divide each attribute's range into intervals
  - Sort instances according to attribute's values
  - Place breakpoints where class changes (majority class)
  - This minimizes the total error
- Example: temperature from weather data

    64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

The problem of overfitting

- This procedure is very sensitive to noise
  - One instance with an incorrect class label will probably produce a separate interval
- Also: time stamp attribute will have zero errors
- Simple solution: enforce minimum number of instances in majority class per interval
- Example (with min = 3):

    64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

With overfitting avoidance

- Resulting rule set:

    Attribute     Rules                      Errors   Total errors
    Outlook       Sunny -> No                2/5      4/14
                  Overcast -> Yes            0/4
                  Rainy -> Yes               2/5
    Temperature   <= 77.5 -> Yes             3/10     5/14
                  > 77.5 -> No*              2/4
    Humidity      <= 82.5 -> Yes             1/7      3/14
                  > 82.5 and <= 95.5 -> No   2/6
                  > 95.5 -> Yes              0/1
    Windy         False -> Yes               2/8      5/14
                  True -> No*                3/6

Discussion of 1R

- 1R was described in a paper by Holte (1993)
  - Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  - Minimum number of instances was set to 6 after some experimentation
  - 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!

  "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets"
  Robert C. Holte, Computer Science Department, University of Ottawa

Discussion of 1R: Hyperpipes

- Another simple technique: build one rule for each class
  - Each rule is a conjunction of tests, one for each attribute
  - For numeric attributes: test checks whether instance's value is inside an interval
    - Interval given by minimum and maximum observed in training data
  - For nominal attributes: test checks whether value is one of a subset of attribute values
    - Subset given by all possible values observed in training data
  - Class with most matching tests is predicted

Statistical modeling

- "Opposite" of 1R: use all the attributes
- Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value)
    - I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
- Independence assumption is never correct!
- But ... this scheme works well in practice

Probabilities for weather data

- Counts for each attribute value and class (from the 14-instance weather data):

    Outlook     Yes  No      Temperature  Yes  No      Humidity  Yes  No
    Sunny        2    3      Hot           2    2      High       3    4
    Overcast     4    0      Mild          4    2      Normal     6    1
    Rainy        3    2      Cool          3    1

    Windy   Yes  No      Play  Yes  No
    False    6    2             9    5
    True     3    3

- Relative frequencies:

    Sunny 2/9 3/5, Overcast 4/9 0/5, Rainy 3/9 2/5;  Hot 2/9 2/5, Mild 4/9 2/5, Cool 3/9 1/5;
    High 3/9 4/5, Normal 6/9 1/5;  False 6/9 2/5, True 3/9 3/5;  Play 9/14 5/14

- A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

- Likelihood of the two classes:

    For "yes" = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
    For "no"  = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206

- Conversion into a probability by normalization:

    P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
    P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795

Bayes's rule

- Probability of event H given evidence E:

    P(H | E) = P(E | H) P(H) / P(E)

- A priori probability of H: P(H)
  - Probability of event before evidence is seen
- A posteriori probability of H: P(H | E)
  - Probability of event after evidence is seen

  Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.

Naïve Bayes for classification

- Classification learning: what's the probability of the class given an instance?
  - Evidence E = instance
  - Event H = class value for instance
- Naïve assumption: evidence splits into parts (i.e. attributes) that are independent

    P(H | E) = P(E1 | H) P(E2 | H) ... P(En | H) P(H) / P(E)

Weather data example

- Evidence E:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

- Probability of class "yes":

    P(yes | E) = P(Outlook = Sunny | yes) x P(Temp. = Cool | yes) x P(Humidity = High | yes)
                 x P(Windy = True | yes) x P(yes) / P(E)
               = (2/9 x 3/9 x 3/9 x 3/9 x 9/14) / P(E)

The "zero-frequency problem"

- What if an attribute value doesn't occur with every class value? (e.g. "Humidity = high" for class "yes")
  - Probability will be zero!
  - A posteriori probability will also be zero! (No matter how likely the other values are!)
- Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
- Result: probabilities will never be zero!
  (also: stabilizes probability estimates)
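To make the counting scheme and the Laplace estimator concrete, here is a small Python sketch (the names train and classify are illustrative, not Weka's API); it reproduces the 2/9 x 3/9 x ... style calculation with an optional add-one correction.

    from collections import Counter, defaultdict

    def train(instances, classes):
        """Collect class counts, per-(attribute, class) value counts, and attribute domains."""
        class_counts = Counter(classes)
        value_counts = defaultdict(Counter)          # (attribute index, class) -> value counts
        domains = defaultdict(set)                   # attribute index -> set of observed values
        for row, c in zip(instances, classes):
            for i, v in enumerate(row):
                value_counts[(i, c)][v] += 1
                domains[i].add(v)
        return class_counts, value_counts, domains

    def classify(row, class_counts, value_counts, domains, laplace=1):
        """Class probabilities for one instance, with add-one (Laplace) smoothing."""
        n = sum(class_counts.values())
        scores = {}
        for c, cc in class_counts.items():
            p = cc / n                               # prior P(class)
            for i, v in enumerate(row):
                p *= (value_counts[(i, c)][v] + laplace) / (cc + laplace * len(domains[i]))
            scores[c] = p
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}   # normalize as on the slide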

Modified probability estimates

- In some cases adding a constant different from 1 might be more appropriate
- Example: attribute outlook for class yes

    Sunny: (2 + mu/3) / (9 + mu)    Overcast: (4 + mu/3) / (9 + mu)    Rainy: (3 + mu/3) / (9 + mu)

- Weights don't need to be equal (but they must sum to 1):

    Sunny: (2 + mu p1) / (9 + mu)    Overcast: (4 + mu p2) / (9 + mu)    Rainy: (3 + mu p3) / (9 + mu)

Missing values

- Training: instance is not included in frequency count for attribute value-class combination
- Classification: attribute will be omitted from calculation
- Example: Outlook missing on the new day

    Outlook  Temp.  Humidity  Windy  Play
    ?        Cool   High      True   ?

    Likelihood of "yes" = 3/9 x 3/9 x 3/9 x 9/14 = 0.0238
    Likelihood of "no"  = 1/5 x 4/5 x 3/5 x 5/14 = 0.0343
    P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
    P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%

Numeric attributes

- Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
- The probability density function for the normal distribution is defined by two parameters:
  - Sample mean:  mu = (1/n) * sum of the xi
  - Standard deviation:  sigma = sqrt( (1/(n-1)) * sum of (xi - mu)^2 )
- Then the density function f(x) is

    f(x) = 1 / (sqrt(2 pi) sigma) * e^( -(x - mu)^2 / (2 sigma^2) )

Statistics for weather data

- Means and standard deviations of the numeric attributes, per class:

    Temperature   yes: mu = 73,   sigma = 6.2     no: mu = 74.6, sigma = 7.9
    Humidity      yes: mu = 79.1, sigma = 10.2    no: mu = 86.2, sigma = 9.7

- Example density value:

    f(temperature = 66 | yes) = 1 / (sqrt(2 pi) * 6.2) * e^( -(66 - 73)^2 / (2 * 6.2^2) ) = 0.0340
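A quick numerical check of the density formula, using the class-"yes" temperature statistics quoted above (a stand-alone sketch; the function name is illustrative):

    import math

    def normal_density(x, mu, sigma):
        """Gaussian probability density f(x) for mean mu and standard deviation sigma."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # f(temperature = 66 | yes) with mu = 73, sigma = 6.2  ->  about 0.034
    print(round(normal_density(66, 73, 6.2), 4))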

Classifying a new day

- A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    66     90        true   ?

    Likelihood of "yes" = 2/9 x 0.0340 x 0.0221 x 3/9 x 9/14 = 0.000036
    Likelihood of "no"  = 3/5 x 0.0221 x 0.0381 x 3/5 x 5/14 = 0.000108
    P("yes") = 0.000036 / (0.000036 + 0.000108) = 25%
    P("no")  = 0.000108 / (0.000036 + 0.000108) = 75%

- Missing values during training are not included in calculation of mean and standard deviation

Probability densities

- Relationship between probability and density:

    P( c - eps/2 <= x <= c + eps/2 ) is approximately eps * f(c)

- But: this doesn't change calculation of a posteriori probabilities because eps cancels out
- Exact relationship:

    P( a <= x <= b ) = integral from a to b of f(t) dt

Multinomial naïve Bayes I

- Version of naïve Bayes used for document classification using the bag of words model
- n1, n2, ..., nk: number of times word i occurs in the document
- P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
- Probability of observing document E given class H (based on the multinomial distribution):

    Pr[E | H] is approximately N! * product over i of ( Pi^ni / ni! ),   where N = n1 + n2 + ... + nk

- Ignores probability of generating a document of the right length (prob. assumed constant for each class)

Multinomial naïve Bayes II

- Suppose dictionary has two words, yellow and blue
- Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
- Suppose E is the document "blue yellow blue"
- Probability of observing document:

    Pr["blue yellow blue" | H] is approximately 3! x (0.75^1 / 1!) x (0.25^2 / 2!) = 9/64, about 0.14

- Suppose there is another class H' that has Pr[yellow | H'] = 75% and Pr[blue | H'] = 25%:
  - Need to take prior probability of class into account to make final classification
- Factorials don't actually need to be computed
- Underflows can be prevented by using logarithms
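To make the two closing remarks concrete (no factorials needed when only comparing classes, and logarithms against underflow), here is a small sketch of the multinomial likelihood in log space; the word probabilities are the yellow/blue example above and the function name is illustrative.

    import math
    from collections import Counter

    def log_multinomial_likelihood(words, word_probs):
        """log Pr[document | class] up to the N!/prod(ni!) term, which depends only on
        the document and therefore drops out when comparing classes."""
        counts = Counter(words)
        return sum(n * math.log(word_probs[w]) for w, n in counts.items())

    doc = ["blue", "yellow", "blue"]
    h = {"yellow": 0.75, "blue": 0.25}
    print(log_multinomial_likelihood(doc, h))   # compare across classes, then add the log prior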

Naïve Bayes: discussion

- Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates as long as maximum probability is assigned to correct class
- However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed (-> kernel density estimators)

Constructing decision trees

- Strategy: top down
  - Recursive divide-and-conquer fashion
  - First: select attribute for root node; create branch for each possible attribute value
  - Then: split instances into subsets, one for each branch extending from the node
  - Finally: repeat recursively for each branch, using only instances that reach the branch
- Stop if all instances have the same class

Which attribute to select?

- (Figures: tree stumps for splitting the weather data on each of the four attributes)

Criterion for attribute selection

- Which is the best attribute?
  - Want to get the smallest tree
  - Heuristic: choose the attribute that produces the "purest" nodes
- Popular impurity criterion: information gain
  - Information gain increases with the average purity of the subsets
- Strategy: choose attribute that gives greatest information gain

Computing information

- Measure information in bits
  - Given a probability distribution, the info required to predict an event is the distribution's entropy
  - Entropy gives the information required in bits (can involve fractions of bits!)
- Formula for computing the entropy:

    entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 ... - pn log pn

Example: attribute Outlook

- Outlook = Sunny:

    info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

- Outlook = Overcast:

    info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits

  Note: 0 log(0) is normally undefined; it is taken to be 0 here.

- Outlook = Rainy:

    info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

- Expected information for attribute:

    info([2,3], [4,0], [3,2]) = (5/14) x 0.971 + (4/14) x 0 + (5/14) x 0.971 = 0.693 bits

Computing information gain

- Information gain: information before splitting - information after splitting

    gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits

- Information gain for attributes from weather data:

    gain(Outlook)     = 0.247 bits
    gain(Temperature) = 0.029 bits
    gain(Humidity)    = 0.152 bits
    gain(Windy)       = 0.048 bits
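The entropy and gain figures above are easy to reproduce; a minimal Python sketch (the helper names entropy and info_gain are illustrative, not Weka code):

    import math

    def entropy(counts):
        """Entropy in bits of a class-count list, e.g. [9, 5] -> 0.940."""
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def info_gain(before, subsets):
        """Information before splitting minus weighted information after splitting."""
        total = sum(before)
        after = sum(sum(s) / total * entropy(s) for s in subsets)
        return entropy(before) - after

    # gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.247 bits
    print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))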

Continuing to split

- Gains within the Outlook = Sunny subset:

    gain(Temperature) = 0.571 bits
    gain(Humidity)    = 0.971 bits
    gain(Windy)       = 0.020 bits

Final decision tree

- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when data can't be split any further

Wishlist for a purity measure

- Properties we require from a purity measure:
  - When node is pure, measure should be zero
  - When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  - Measure should obey the multistage property (i.e. decisions can be made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4])

- Entropy is the only function that satisfies all three properties!

Properties of the entropy

- The multistage property:

    entropy(p, q, r) = entropy(p, q + r) + (q + r) x entropy( q/(q+r), r/(q+r) )

- Simplification of computation:

    info([2,3,4]) = -2/9 log(2/9) - 3/9 log(3/9) - 4/9 log(4/9)
                  = [ -2 log 2 - 3 log 3 - 4 log 4 + 9 log 9 ] / 9

- Note: instead of maximizing info gain we could just minimize information

Highly-branching attributes

- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
  - => Information gain is biased towards choosing attributes with a large number of values
  - => This may result in overfitting (selection of an attribute that is non-optimal for prediction)
- Another problem: fragmentation

Weather data with ID code

    ID code  Outlook   Temp.  Humidity  Windy  Play
    A        Sunny     Hot    High      False  No
    B        Sunny     Hot    High      True   No
    C        Overcast  Hot    High      False  Yes
    D        Rainy     Mild   High      False  Yes
    E        Rainy     Cool   Normal    False  Yes
    F        Rainy     Cool   Normal    True   No
    G        Overcast  Cool   Normal    True   Yes
    H        Sunny     Mild   High      False  No
    I        Sunny     Cool   Normal    False  Yes
    J        Rainy     Mild   Normal    False  Yes
    K        Sunny     Mild   Normal    True   Yes
    L        Overcast  Mild   High      True   Yes
    M        Overcast  Hot    Normal    False  Yes
    N        Rainy     Mild   High      True   No

Tree stump for ID code attribute

- (Figure: one branch per ID code, each leaf containing a single instance)
- Entropy of split:

    info(ID code) = info([0,1]) + info([0,1]) + ... + info([0,1]) = 0 bits

  => Information gain is maximal for ID code (namely 0.940 bits)

Gain ratio

- Gain ratio: a modification of the information gain that reduces its bias
- Gain ratio takes number and size of branches into account when choosing an attribute
  - It corrects the information gain by taking the intrinsic information of a split into account
- Intrinsic information: entropy of distribution of instances into branches (i.e. how much info do we need to tell which branch an instance belongs to)

Computing the gain ratio

- Example: intrinsic information for ID code

    info([1,1,...,1]) = 14 x ( -1/14 x log(1/14) ) = 3.807 bits

- Value of attribute decreases as intrinsic information gets larger
- Definition of gain ratio:

    gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

- Example:

    gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.247

Gain ratios for weather data

    Outlook                                Temperature
    Gain:        0.940 - 0.693 = 0.247     Gain:        0.940 - 0.911 = 0.029
    Split info:  info([5,4,5]) = 1.577     Split info:  info([4,6,4]) = 1.557
    Gain ratio:  0.247 / 1.577 = 0.157     Gain ratio:  0.029 / 1.557 = 0.019

    Humidity                               Windy
    Gain:        0.940 - 0.788 = 0.152     Gain:        0.940 - 0.892 = 0.048
    Split info:  info([7,7]) = 1.000       Split info:  info([8,6]) = 0.985
    Gain ratio:  0.152 / 1.000 = 0.152     Gain ratio:  0.048 / 0.985 = 0.049
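The split info and gain ratio figures above can be checked with the same entropy helper; a self-contained sketch with illustrative names:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gain_ratio(gain, branch_sizes):
        """Gain divided by the intrinsic information (entropy of the split itself)."""
        return gain / entropy(branch_sizes)

    # Outlook: gain 0.247, branches of size 5, 4 and 5 -> split info 1.577, ratio about 0.157
    print(round(entropy([5, 4, 5]), 3), round(gain_ratio(0.247, [5, 4, 5]), 3))
    # ID code: gain 0.940, fourteen singleton branches -> split info 3.807, ratio about 0.247
    print(round(entropy([1] * 14), 3), round(gain_ratio(0.940, [1] * 14), 3))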

More on the gain ratio

- "Outlook" still comes out top
- However: "ID code" has greater gain ratio
  - Standard fix: ad hoc test to prevent splitting on that type of attribute
- Problem with gain ratio: it may overcompensate
  - May choose an attribute just because its intrinsic information is very low
  - Standard fix: only consider attributes with greater than average information gain

Discussion

- Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
  - Gain ratio just one modification of this basic algorithm
  - C4.5: deals with numeric attributes, missing values, noisy data
- Similar approach: CART
- There are many other attribute selection criteria! (But little difference in accuracy of result)

Covering algorithms

- Convert decision tree into a rule set
  - Straightforward, but rule set overly complex
  - More effective conversions are not trivial
- Instead, can generate rule set directly
  - for each class in turn find rule set that covers all instances in it (excluding instances not in the class)
- Called a covering approach:
  - at each stage a rule is identified that "covers" some of the instances

Example: generating a rule

- (Figure: a 2-D space of instances of classes a and b, progressively covered by)

    If true then class = a
    If x > 1.2 then class = a
    If x > 1.2 and y > 2.6 then class = a

- Possible rule set for class "b":

    If x <= 1.2 then class = b
    If x > 1.2 and y <= 2.6 then class = b

- Could add more rules, get "perfect" rule set

Rules vs. trees

- Corresponding decision tree: (produces exactly the same predictions)
- But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
- Also: in multiclass situations, covering algorithm concentrates on one class at a time whereas decision tree learner takes all classes into account

Simple covering algorithm

- Generates a rule by adding tests that maximize rule's accuracy
- Similar to situation in decision trees: problem of selecting an attribute to split on
  - But: decision tree inducer maximizes overall purity
- Each new test reduces rule's coverage (figure: nested regions of the instance space)

Selecting a test

- Goal: maximize accuracy
  - t: total number of instances covered by rule
  - p: positive examples of the class covered by rule
  - t - p: number of errors made by rule
  - => Select test that maximizes the ratio p/t
- We are finished when p/t = 1 or the set of instances can't be split any further

Example: contact lens data

- Rule we seek:

    If ? then recommendation = hard

- Possible tests:

    Age = Young                              2/8
    Age = Pre-presbyopic                     1/8
    Age = Presbyopic                         1/8
    Spectacle prescription = Myope           3/12
    Spectacle prescription = Hypermetrope    1/12
    Astigmatism = no                         0/12
    Astigmatism = yes                        4/12
    Tear production rate = Reduced           0/12
    Tear production rate = Normal            4/12

Modified rule and resulting data

- Rule with best test added:

    If astigmatism = yes then recommendation = hard

- Instances covered by modified rule:

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   Yes          Reduced               None
    Young           Myope                   Yes          Normal                Hard
    Young           Hypermetrope            Yes          Reduced               None
    Young           Hypermetrope            Yes          Normal                Hard
    Pre-presbyopic  Myope                   Yes          Reduced               None
    Pre-presbyopic  Myope                   Yes          Normal                Hard
    Pre-presbyopic  Hypermetrope            Yes          Reduced               None
    Pre-presbyopic  Hypermetrope            Yes          Normal                None
    Presbyopic      Myope                   Yes          Reduced               None
    Presbyopic      Myope                   Yes          Normal                Hard
    Presbyopic      Hypermetrope            Yes          Reduced               None
    Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

- Current state:

    If astigmatism = yes and ? then recommendation = hard

- Possible tests:

    Age = Young                              2/4
    Age = Pre-presbyopic                     1/4
    Age = Presbyopic                         1/4
    Spectacle prescription = Myope           3/6
    Spectacle prescription = Hypermetrope    1/6
    Tear production rate = Reduced           0/6
    Tear production rate = Normal            4/6

Modified rule and resulting data

- Rule with best test added:

    If astigmatism = yes and tear production rate = normal then recommendation = hard

- Instances covered by modified rule:

    Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
    Young           Myope                   Yes          Normal                Hard
    Young           Hypermetrope            Yes          Normal                Hard
    Pre-presbyopic  Myope                   Yes          Normal                Hard
    Pre-presbyopic  Hypermetrope            Yes          Normal                None
    Presbyopic      Myope                   Yes          Normal                Hard
    Presbyopic      Hypermetrope            Yes          Normal                None

Further refinement

- Current state:

    If astigmatism = yes and tear production rate = normal and ? then recommendation = hard

- Possible tests:

    Age = Young                              2/2
    Age = Pre-presbyopic                     1/2
    Age = Presbyopic                         1/2
    Spectacle prescription = Myope           3/3
    Spectacle prescription = Hypermetrope    1/3

- Tie between the first and the fourth test
  - We choose the one with greater coverage

The result

- Final rule:

    If astigmatism = yes and tear production rate = normal
    and spectacle prescription = myope then recommendation = hard

- Second rule for recommending "hard lenses" (built from instances not covered by first rule):

    If age = young and astigmatism = yes
    and tear production rate = normal then recommendation = hard

- These two rules cover all "hard lenses":
  - Process is repeated with other two classes

Pseudo-code for PRISM

    For each class C
      Initialize E to the instance set
      While E contains instances in class C
        Create a rule R with an empty left-hand side that predicts class C
        Until R is perfect (or there are no more attributes to use) do
          For each attribute A not mentioned in R, and each value v,
            Consider adding the condition A = v to the left-hand side of R
          Select A and v to maximize the accuracy p/t
            (break ties by choosing the condition with the largest p)
          Add A = v to R
        Remove the instances covered by R from E
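A compact Python sketch of the PRISM loop above (instances as dicts of attribute -> value plus a class label; the function name and representation are illustrative assumptions, not the book's implementation):

    def prism_rules_for_class(instances, target_class, class_key="class"):
        """Generate rules (lists of attribute = value conditions) covering target_class."""
        rules, remaining = [], list(instances)
        while any(x[class_key] == target_class for x in remaining):
            rule, covered = [], list(remaining)
            # grow the rule until it is perfect or no attribute is left to test
            while any(x[class_key] != target_class for x in covered):
                used = {a for a, _ in rule}
                candidates = []
                for x in covered:
                    for a, v in x.items():
                        if a == class_key or a in used:
                            continue
                        subset = [y for y in covered if y[a] == v]
                        p = sum(y[class_key] == target_class for y in subset)
                        candidates.append((p / len(subset), p, (a, v)))
                if not candidates:
                    break
                _, _, best = max(candidates)       # maximize p/t, break ties on larger p
                rule.append(best)
                covered = [y for y in covered if y[best[0]] == best[1]]
            if not rule:                           # no usable attribute left; avoid looping forever
                break
            rules.append(rule)
            remaining = [y for y in remaining if not all(y[a] == v for a, v in rule)]
        return rules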

Rules vs. decision lists

- PRISM with outer loop removed generates a decision list for one class
  - Subsequent rules are designed for instances that are not covered by previous rules
  - But: order doesn't matter because all rules predict the same class
- Outer loop considers all classes separately
  - No order dependence implied
- Problems: overlapping rules, default rule required

Separate and conquer

- Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  - First, identify a useful rule
  - Then, separate out all the instances it covers
  - Finally, "conquer" the remaining instances
- Difference to divide-and-conquer methods:
  - Subset covered by rule doesn't need to be explored any further

Mining association rules

- Naïve method for finding association rules:
  - Use separate-and-conquer method
  - Treat every possible combination of attribute values as a separate class
- Two problems:
  - Computational complexity
  - Resulting number of rules (which would have to be pruned on the basis of support and confidence)
- But: we can look for high support rules directly!

Item sets

- Support: number of instances correctly covered by association rule
  - The same as the number of instances covered by all tests in the rule (LHS and RHS!)
- Item: one test/attribute-value pair
- Item set: all items occurring in a rule
- Goal: only rules that exceed pre-defined support
  - => Do it by finding all item sets with the given minimum support and generating rules from them!

Weather data

    Outlook   Temp  Humidity  Windy  Play
    Sunny     Hot   High      False  No
    Sunny     Hot   High      True   No
    Overcast  Hot   High      False  Yes
    Rainy     Mild  High      False  Yes
    Rainy     Cool  Normal    False  Yes
    Rainy     Cool  Normal    True   No
    Overcast  Cool  Normal    True   Yes
    Sunny     Mild  High      False  No
    Sunny     Cool  Normal    False  Yes
    Rainy     Mild  Normal    False  Yes
    Sunny     Mild  Normal    True   Yes
    Overcast  Mild  High      True   Yes
    Overcast  Hot   Normal    False  Yes
    Rainy     Mild  High      True   No

Item sets for weather data

- Examples of item sets (counts in parentheses):

    One-item sets:    Outlook = Sunny (5);  Temperature = Cool (4); ...
    Two-item sets:    Outlook = Sunny, Temperature = Hot (2);  Outlook = Sunny, Humidity = High (3); ...
    Three-item sets:  Outlook = Sunny, Temperature = Hot, Humidity = High (2);
                      Outlook = Sunny, Humidity = High, Windy = False (2); ...
    Four-item sets:   Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2);
                      Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2); ...

- In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)

Generating rules from an item set

- Once all item sets with minimum support have been generated, we can turn them into rules
- Example:

    Humidity = Normal, Windy = False, Play = Yes (4)

- Seven (2^3 - 1) potential rules:

    If Humidity = Normal and Windy = False then Play = Yes              4/4
    If Humidity = Normal and Play = Yes then Windy = False              4/6
    If Windy = False and Play = Yes then Humidity = Normal              4/6
    If Humidity = Normal then Windy = False and Play = Yes              4/7
    If Windy = False then Humidity = Normal and Play = Yes              4/8
    If Play = Yes then Humidity = Normal and Windy = False              4/9
    If True then Humidity = Normal and Windy = False and Play = Yes     4/14

Rules for weather data

- Rules with support > 1 and confidence = 100%:

        Association rule                                          Sup.  Conf.
    1   Humidity = Normal, Windy = False  =>  Play = Yes           4    100%
    2   Temperature = Cool  =>  Humidity = Normal                  4    100%
    3   Outlook = Overcast  =>  Play = Yes                         4    100%
    4   Temperature = Cool, Play = Yes  =>  Humidity = Normal      3    100%
    ...
    58  Outlook = Sunny, Temperature = Hot  =>  Humidity = High    2    100%

- In total:
  - 3 rules with support four
  - 5 with support three
  - 50 with support two

Example rules from the same set

- Item set:

    Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

- Resulting rules (all with 100% confidence):

    Temperature = Cool, Windy = False                     =>  Humidity = Normal, Play = Yes
    Temperature = Cool, Windy = False, Humidity = Normal  =>  Play = Yes
    Temperature = Cool, Windy = False, Play = Yes         =>  Humidity = Normal

- due to the following "frequent" item sets:

    Temperature = Cool, Windy = False (2)
    Temperature = Cool, Humidity = Normal, Windy = False (2)
    Temperature = Cool, Windy = False, Play = Yes (2)

Generating item sets efficiently

- How can we efficiently find all frequent item sets?
- Finding one-item sets is easy
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  - If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  - In general: if X is a frequent k-item set, then all (k-1)-item subsets of X are also frequent
  - => Compute k-item sets by merging (k-1)-item sets
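A sketch of the merge step just described: frequent (k-1)-item sets that share their first k-2 items are joined, and a candidate is kept only if every (k-1)-subset is itself frequent (illustrative Python, not the book's implementation).

    from itertools import combinations

    def candidate_itemsets(frequent, k):
        """Join frequent (k-1)-item sets (sorted tuples) into candidate k-item sets."""
        frequent = set(frequent)
        candidates = set()
        for a in frequent:
            for b in frequent:
                if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                    c = a + (b[k - 2],)
                    # prune: every (k-1)-subset must itself be frequent
                    if all(s in frequent for s in combinations(c, k - 1)):
                        candidates.add(c)
        return candidates

    three_sets = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
                  ("A", "C", "E"), ("B", "C", "D")]
    print(candidate_itemsets(three_sets, 4))   # {('A','B','C','D')}; (A C D E) is pruned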

Example

- Given: five three-item sets (lexicographically ordered!)

    (A B C), (A B D), (A C D), (A C E), (B C D)

- Candidate four-item sets:

    (A B C D)   OK because of (B C D)
    (A C D E)   Not OK because of (C D E)

- Final check by counting instances in dataset!
- (k-1)-item sets are stored in hash table

Generating rules efficiently

- We are looking for all high-confidence rules
  - Support of antecedent obtained from hash table
  - But: brute-force method is (2^N - 1)
- Better way: building (c + 1)-consequent rules from c-consequent ones
  - Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
- Resulting algorithm similar to procedure for large item sets

Example

- 1-consequent rules:

    If Outlook = Sunny and Windy = False and Play = No then Humidity = High (2/2)
    If Humidity = High and Windy = False and Play = No then Outlook = Sunny (2/2)

- Corresponding 2-consequent rule:

    If Windy = False and Play = No then Outlook = Sunny and Humidity = High (2/2)

- Final check of antecedent against hash table!

Association rules: discussion

- Above method makes one pass through the data for each different size of item set
  - Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
  - Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  - Makes sense if data too large for main memory
- Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)

Other issues

- Standard ARFF format very inefficient for typical market basket data
  - Attributes represent items in a basket and most items are usually missing
  - Data should be represented in sparse format
- Instances are also called transactions
- Confidence is not necessarily the best measure
  - Example: milk occurs in almost every supermarket transaction
  - Other measures have been devised (e.g. lift)

Linear models: linear regression

- Work most naturally with numeric attributes
- Standard technique for numeric prediction
  - Outcome is a linear combination of attributes:

      x = w0 + w1 a1 + w2 a2 + ... + wk ak

- Weights are calculated from the training data
- Predicted value for first training instance a(1):

    w0 a0(1) + w1 a1(1) + w2 a2(1) + ... + wk ak(1) = sum over j = 0..k of wj aj(1)

  (assuming each instance is extended with a constant attribute a0 with value 1)

Minimizing the squared error

- Choose k + 1 coefficients to minimize the squared error on the training data
- Squared error:

    sum over i = 1..n of ( x(i) - sum over j = 0..k of wj aj(i) )^2

- Derive coefficients using standard matrix operations
- Can be done if there are more instances than attributes (roughly speaking)
- Minimizing the absolute error is more difficult

Classification

- Any regression technique can be used for classification
  - Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
  - Prediction: predict class corresponding to model with largest output value (membership value)
- For linear regression this is known as multi-response linear regression
- Problem: membership values are not in [0,1] range, so aren't proper probability estimates
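A minimal NumPy sketch of the two slides above: least-squares weights via the standard matrix operations, then multi-response classification with one regression per class (illustrative code; assumes numpy is available and the function names are my own).

    import numpy as np

    def linear_regression(X, y):
        """Least-squares weights; X is extended with a constant column (the bias)."""
        X1 = np.hstack([np.ones((len(X), 1)), X])
        w, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return w

    def multi_response_fit(X, labels):
        """One regression per class: target 1 for members of the class, 0 otherwise."""
        classes = sorted(set(labels))
        return classes, [linear_regression(X, np.array([1.0 if l == c else 0.0 for l in labels]))
                         for c in classes]

    def multi_response_predict(X, classes, weights):
        X1 = np.hstack([np.ones((len(X), 1)), X])
        outputs = np.column_stack([X1 @ w for w in weights])
        return [classes[i] for i in outputs.argmax(axis=1)]   # class with largest membership value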

Linear models: logistic regression

- Builds a linear model for a transformed target variable
- Assume we have two classes
- Logistic regression replaces the target P[1 | a1, a2, ..., ak] by this target:

    log( P[1 | a1, ..., ak] / (1 - P[1 | a1, ..., ak]) )

- Logit transformation maps [0,1] to (-infinity, +infinity)

Logit transformation

- (Figure: the logit function)
- Resulting model:

    P[1 | a1, a2, ..., ak] = 1 / ( 1 + e^( -(w0 + w1 a1 + ... + wk ak) ) )

Example logistic regression model

- (Figure) Model with w0 = 0.5 and w1 = 1
- Parameters are found from training data using maximum likelihood

Maximum likelihood

- Aim: maximize probability of training data with respect to parameters
- Can use logarithms of probabilities and maximize log-likelihood of model:

    sum over i = 1..n of ( (1 - x(i)) log( 1 - P[1 | a(i)] ) + x(i) log( P[1 | a(i)] ) )

  where the x(i) are either 0 or 1
- Weights wi need to be chosen to maximize log-likelihood (relatively simple method: iteratively re-weighted least squares)
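A small sketch that makes the model and the log-likelihood explicit. It uses plain gradient ascent rather than the iteratively re-weighted least squares mentioned above, and all names are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def log_likelihood(w, X1, y):
        """Sum of x(i) log P + (1 - x(i)) log (1 - P); assumes P stays strictly in (0, 1)."""
        p = sigmoid(X1 @ w)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit_logistic(X, y, lr=0.1, steps=2000):
        X1 = np.hstack([np.ones((len(X), 1)), X])   # constant attribute for w0
        w = np.zeros(X1.shape[1])
        for _ in range(steps):
            w += lr * X1.T @ (y - sigmoid(X1 @ w))  # gradient of the log-likelihood
        return w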

Multiple classes

- Can perform logistic regression independently for each class (like multi-response linear regression)
- Problem: probability estimates for different classes won't sum to one
- Better: train coupled models by maximizing likelihood over all classes
- Alternative that often works well in practice: pairwise classification

Pairwise classification

- Idea: build model for each pair of classes, using only training data from those classes
- Problem? Have to solve k(k-1)/2 classification problems for a k-class problem
- Turns out not to be a problem in many cases because training sets become small:
  - Assume data evenly distributed, i.e. 2n/k per learning problem for n instances in total
  - Suppose learning algorithm is linear in n
  - Then runtime of pairwise classification is proportional to (k(k-1)/2) x (2n/k) = (k-1)n

Linear models are hyperplanes

- Decision boundary for two-class logistic regression is where probability equals 0.5:

    P[1 | a1, ..., ak] = 1 / ( 1 + e^( -(w0 + w1 a1 + ... + wk ak) ) ) = 0.5

  which occurs when

    w0 + w1 a1 + ... + wk ak = 0

- Thus logistic regression can only separate data that can be separated by a hyperplane
- Multi-response linear regression has the same problem. Class 1 is assigned if:

    w0(1) + w1(1) a1 + ... + wk(1) ak  >  w0(2) + w1(2) a1 + ... + wk(2) ak

Linear models: the perceptron

- Don't actually need probability estimates if all we want to do is classification
- Different approach: learn separating hyperplane
- Assumption: data is linearly separable
- Algorithm for learning separating hyperplane: perceptron learning rule
- Hyperplane:

    0 = w0 a0 + w1 a1 + w2 a2 + ... + wk ak

  where we again assume that there is a constant attribute a0 with value 1 (bias)
- If the sum is greater than zero we predict the first class, otherwise the second class

The algorithm

    Set all weights to zero
    Until all instances in the training data are classified correctly
      For each instance I in the training data
        If I is classified incorrectly by the perceptron
          If I belongs to the first class add it to the weight vector
          else subtract it from the weight vector

- Why does this work?
  - Consider the situation where an instance a pertaining to the first class has been added; the output for a becomes

      (w0 + a0) a0 + (w1 + a1) a1 + (w2 + a2) a2 + ... + (wk + ak) ak

  - This means the output for a has increased by:

      a0 a0 + a1 a1 + a2 a2 + ... + ak ak

  - This number is always positive, thus the hyperplane has moved into the correct direction (and we can show output decreases for instances of the other class)

Perceptron as a neural network

- (Figure: input layer of attribute nodes connected to a single output node)
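The perceptron rule above in a few lines of Python (classes coded as +1 and -1; a bias term is added as the constant attribute mentioned on the slide; an illustrative sketch that assumes linearly separable data or relies on the epoch cap):

    def train_perceptron(X, y, max_epochs=100):
        """X: list of feature lists, y: +1/-1 labels. Returns weights (bias first)."""
        w = [0.0] * (len(X[0]) + 1)
        for _ in range(max_epochs):
            mistakes = 0
            for row, label in zip(X, y):
                a = [1.0] + list(row)                      # constant attribute (bias)
                s = sum(wi * ai for wi, ai in zip(w, a))
                if (s > 0 and label < 0) or (s <= 0 and label > 0):
                    # add misclassified first-class instances, subtract the others
                    w = [wi + label * ai for wi, ai in zip(w, a)]
                    mistakes += 1
            if mistakes == 0:
                break
        return w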

Linear models: Winnow

- Another mistake-driven algorithm for finding a separating hyperplane
  - Assumes binary data (i.e. attribute values are either zero or one)
- Difference: multiplicative updates instead of additive updates
  - Weights are multiplied by a user-specified parameter alpha > 1 (or its inverse)
- Another difference: user-specified threshold parameter theta
  - Predict first class if

      w1 a1 + w2 a2 + w3 a3 + ... + wk ak > theta

The algorithm

    while some instances are misclassified
      for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
          if a belongs to the first class
            for each ai that is 1, multiply wi by alpha
            (if ai is 0, leave wi unchanged)
          otherwise
            for each ai that is 1, divide wi by alpha
            (if ai is 0, leave wi unchanged)

- Winnow is very effective in homing in on relevant features (it is attribute efficient)
- Can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm)
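And the corresponding multiplicative version: a sketch of plain (unbalanced) Winnow for binary attributes, with user-set alpha and theta as on the slide. The default theta = k/2 and the function name are my own assumptions.

    def train_winnow(X, y, alpha=2.0, theta=None, max_epochs=100):
        """X: lists of 0/1 attribute values, y: True for the first class."""
        k = len(X[0])
        theta = theta if theta is not None else k / 2.0    # assumed default threshold
        w = [1.0] * k
        for _ in range(max_epochs):
            mistakes = 0
            for row, label in zip(X, y):
                predicted_first = sum(wi * ai for wi, ai in zip(w, row)) > theta
                if predicted_first != label:
                    mistakes += 1
                    for i, ai in enumerate(row):
                        if ai == 1:                        # attributes that are 0 are left alone
                            w[i] = w[i] * alpha if label else w[i] / alpha
            if mistakes == 0:
                break
        return w, theta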

Balanced Winnow

- Winnow doesn't allow negative weights and this can be a drawback in some applications
- Balanced Winnow maintains two weight vectors, one for each class:

    while some instances are misclassified
      for each instance a in the training data
        classify a using the current weights
        if the predicted class is incorrect
          if a belongs to the first class
            for each ai that is 1, multiply wi+ by alpha and divide wi- by alpha
            (if ai is 0, leave wi+ and wi- unchanged)
          otherwise
            for each ai that is 1, multiply wi- by alpha and divide wi+ by alpha
            (if ai is 0, leave wi+ and wi- unchanged)

- Instance is classified as belonging to the first class (of two classes) if:

    (w1+ - w1-) a1 + (w2+ - w2-) a2 + ... + (wk+ - wk-) ak > theta

Instance-based learning

- Distance function defines what's learned
- Most instance-based schemes use Euclidean distance:

    sqrt( (a1(1) - a1(2))^2 + (a2(1) - a2(2))^2 + ... + (ak(1) - ak(2))^2 )

  a(1) and a(2): two instances with k attributes
- Taking the square root is not required when comparing distances
- Other popular metric: city-block metric
  - Adds differences without squaring them

Normalization and other issues

- Different attributes are measured on different scales => need to be normalized:

    ai = ( vi - min vi ) / ( max vi - min vi )

  vi: the actual value of attribute i
- Nominal attributes: distance either 0 or 1
- Common policy for missing values: assumed to be maximally distant (given normalized attributes)

Finding nearest neighbors efficiently

- Simplest way of finding nearest neighbour: linear scan of the data
  - Classification takes time proportional to the product of the number of instances in training and test sets
- Nearest-neighbor search can be done more efficiently using appropriate data structures
- We will discuss two methods that represent training data in a tree structure: kD-trees and ball trees
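A sketch of the distance computation with min-max normalization described above (nominal attributes contribute 0 or 1; the helper names are illustrative):

    def min_max_ranges(instances):
        """Per-attribute (min, max) for numeric columns; None for nominal ones."""
        ranges = []
        for i in range(len(instances[0])):
            col = [row[i] for row in instances]
            if all(isinstance(v, (int, float)) for v in col):
                ranges.append((min(col), max(col)))
            else:
                ranges.append(None)
        return ranges

    def squared_distance(a, b, ranges):
        """Squared Euclidean distance on normalized attributes (no square root needed
        when only comparing distances)."""
        d = 0.0
        for x, y, r in zip(a, b, ranges):
            if r is None:                       # nominal: distance 0 or 1
                d += 0.0 if x == y else 1.0
            else:
                lo, hi = r
                span = (hi - lo) or 1.0
                d += ((x - lo) / span - (y - lo) / span) ** 2
        return d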

kD-tree example

- (Figure)

Using kD-trees: example

- (Figure)

More on kD-trees

- Complexity depends on depth of tree, given by logarithm of number of nodes
- Amount of backtracking required depends on quality of tree ("square" vs. "skinny" nodes)
- How to build a good tree? Need to find good split point and split direction
  - Split direction: direction with greatest variance
  - Split point: median value along that direction
- Using value closest to mean (rather than median) can be better if data is skewed
- Can apply this recursively

Building trees incrementally

- Big advantage of instance-based learning: classifier can be updated incrementally
  - Just add new training instance!
- Can we do the same with kD-trees?
- Heuristic strategy:
  - Find leaf node containing new instance
  - Place instance into leaf if leaf is empty
  - Otherwise, split leaf according to the longest dimension (to preserve squareness)
- Tree should be re-built occasionally (i.e. if depth grows to twice the optimum depth)

Ball trees

- Problem in kD-trees: corners
- Observation: no need to make sure that regions don't overlap
- Can use balls (hyperspheres) instead of hyperrectangles
  - A ball tree organizes the data into a tree of k-dimensional hyperspheres
  - Normally allows for a better fit to the data and thus more efficient search

Ball tree example

- (Figure)

Using ball trees

- Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
- Ball can be ruled out from consideration if: distance from target to ball's center exceeds ball's radius plus current upper bound

Building ball trees

- Ball trees are built top down (like kD-trees)
- Don't have to continue until leaf balls contain just two points: can enforce minimum occupancy (same in kD-trees)
- Basic problem: splitting a ball into two
- Simple (linear-time) split selection strategy (see the sketch below):
  - Choose point farthest from ball's center
  - Choose second point farthest from first one
  - Assign each point to these two points
  - Compute cluster centers and radii based on the two subsets to get two balls
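The linear-time split strategy above, sketched in Python (points as equal-length numeric tuples with at least two distinct points; the function name is illustrative):

    import math

    def split_ball(points):
        """Split one ball's points into two groups with the two-farthest-points heuristic."""
        center = tuple(sum(c) / len(points) for c in zip(*points))
        p1 = max(points, key=lambda p: math.dist(p, center))   # farthest from the center
        p2 = max(points, key=lambda p: math.dist(p, p1))       # farthest from the first point
        groups = ([], [])
        for p in points:
            groups[0 if math.dist(p, p1) <= math.dist(p, p2) else 1].append(p)
        # centers and radii of the two child balls
        balls = []
        for g in groups:
            c = tuple(sum(x) / len(g) for x in zip(*g))
            balls.append((c, max(math.dist(p, c) for p in g)))
        return groups, balls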

Discussion of nearest-neighbor learning

- Often very accurate
- Assumes all attributes are equally important
  - Remedy: attribute selection or weights
- Possible remedies against noisy instances:
  - Take a majority vote over the k nearest neighbors
  - Removing noisy instances from dataset (difficult!)
- Statisticians have used k-NN since the early 1950s
  - If n -> infinity and k/n -> 0, error approaches minimum
- kD-trees become inefficient when number of attributes is too large (approximately > 10)
- Ball trees (which are instances of metric trees) work well in higher-dimensional spaces

More discussion

- Instead of storing all training instances, compress them into regions
- Example: hyperpipes (from discussion of 1R)
- Another simple technique (Voting Feature Intervals):
  - Construct intervals for each attribute
    - Discretize numeric attributes
    - Treat each value of a nominal attribute as an "interval"
  - Count number of times class occurs in interval
  - Prediction is generated by letting intervals vote (those that contain the test instance)

Clustering

- Clustering techniques apply when there is no class to be predicted
- Aim: divide instances into "natural" groups
- As we've seen, clusters can be:
  - disjoint vs. overlapping
  - deterministic vs. probabilistic
  - flat vs. hierarchical
- We'll look at a classic clustering algorithm called k-means
  - k-means clusters are disjoint, deterministic, and flat

The k-means algorithm

To cluster data into k groups (k is predefined):

    1. Choose k cluster centers, e.g. at random
    2. Assign instances to clusters based on distance to cluster centers
    3. Compute centroids of clusters; these become the new cluster centers
    4. Go to step 2, until convergence
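The four steps above as a short Python sketch (plain tuples, random initial centers; purely illustrative):

    import math
    import random

    def k_means(points, k, max_iter=100, seed=0):
        """points: list of numeric tuples. Returns (centers, assignment)."""
        rng = random.Random(seed)
        centers = rng.sample(points, k)                      # step 1: pick k centers
        for _ in range(max_iter):
            # step 2: assign each instance to its nearest center
            assignment = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
            # step 3: recompute centroids
            new_centers = []
            for j in range(k):
                members = [p for p, a in zip(points, assignment) if a == j]
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members))
                                   if members else centers[j])
            if new_centers == centers:                       # step 4: repeat until convergence
                break
            centers = new_centers
        return centers, assignment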

Discussion

- Algorithm minimizes squared distance to cluster centers
- Result can vary significantly based on initial choice of seeds
  - Can get trapped in a local minimum
  - Example: (figure: instances and initial cluster centres)
- To increase chance of finding global optimum: restart with different random seeds
- Can be applied recursively with k = 2

Faster distance calculations

- Can we use kD-trees or ball trees to speed up the process? Yes:
  - First, build tree, which remains static, for all the data points
  - At each node, store number of instances and sum of all instances
  - In each iteration, descend tree and find out which cluster each node belongs to
    - Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
    - Use statistics stored at the nodes to compute new cluster centers

Example

- (Figure)

Comments on basic methods

- Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
  - Difficult bit in general: estimating prior probabilities (easy in the case of naïve Bayes)
  - Extension of naïve Bayes: Bayesian networks (which we'll discuss later)
- Algorithm for association rules is called APRIORI
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
  - But: combinations of them can (-> multi-layer neural nets, which we'll discuss later)