Simplicity first
- Simple algorithms often work very well!
- There are many kinds of simple structure, e.g.:
  - One attribute does all the work
  - All attributes contribute equally and independently
  - A weighted linear combination might do
  - Instance-based: use a few prototypes
  - Use simple logical rules
- Success of method depends on the domain

Inferring rudimentary rules
- 1R: learns a 1-level decision tree
  - I.e., rules that all test one particular attribute
- Basic version
  - One branch for each value
  - Each branch assigns most frequent class
  - Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  - Choose attribute with lowest error rate (assumes nominal attributes)
Pseudo-code for 1R

  For each attribute,
    For each value of the attribute, make a rule as follows:
      count how often each class appears
      find the most frequent class
      make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
  Choose the rules with the smallest error rate

- Note: "missing" is treated as a separate attribute value

Evaluating the weather attributes
(The slide shows the 14-instance weather data alongside the rule table below.)

  Attribute    Rules             Errors   Total errors
  Outlook      Sunny → No        2/5      4/14
               Overcast → Yes    0/4
               Rainy → Yes       2/5
  Temp         Hot → No*         2/4      5/14
               Mild → Yes        2/6
               Cool → Yes        1/4
  Humidity     High → No         3/7      4/14
               Normal → Yes      1/7
  Windy        False → Yes       2/8      5/14
               True → No*        3/6

  * indicates a tie
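The following is a minimal Python sketch of the 1R procedure above (the data layout, function name, and the four-instance toy data are my own, not from the slides); it builds one rule per attribute value and picks the attribute whose rules make the fewest errors.

  from collections import Counter, defaultdict

  def one_r(instances, attributes, class_key="play"):
      best = None
      for attr in attributes:
          counts = defaultdict(Counter)            # attribute value -> class counts
          for inst in instances:
              counts[inst[attr]][inst[class_key]] += 1
          # For each value, the rule assigns the most frequent class
          rules = {val: cls.most_common(1)[0][0] for val, cls in counts.items()}
          # Error rate: instances not in the majority class of their branch
          errors = sum(n for val, cls in counts.items()
                       for label, n in cls.items() if label != rules[val])
          if best is None or errors < best[2]:
              best = (attr, rules, errors)
      return best                                   # (attribute, {value: class}, total errors)

  weather = [                                       # a tiny made-up subset, not the full data
      {"outlook": "sunny",    "windy": "false", "play": "no"},
      {"outlook": "overcast", "windy": "false", "play": "yes"},
      {"outlook": "rainy",    "windy": "true",  "play": "no"},
      {"outlook": "rainy",    "windy": "false", "play": "yes"},
  ]
  print(one_r(weather, ["outlook", "windy"]))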
Dealing with numeric attributes
- Discretize numeric attributes
- Divide each attribute's range into intervals
  - Sort instances according to attribute's values
  - Place breakpoints where class changes (majority class)
  - This minimizes the total error
- Example: temperature from weather data

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

The problem of overfitting
- This procedure is very sensitive to noise
  - One instance with an incorrect class label will probably produce a separate interval
- Also: a time stamp attribute will have zero errors
- Simple solution: enforce minimum number of instances in majority class per interval
- Example (with min = 3):

  64  65  68  69  70  71  72  72  75  75  80  81  83  85
  Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
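Below is a hedged Python sketch of discretization with a minimum count per interval (the greedy left-to-right merge and the helper name are my own illustration of the idea, not the book's exact procedure). On the temperature column it yields breakpoints at 70.5 and 77.5; merging adjacent intervals whose majority class agrees then leaves the single 77.5 cut used on the next slide.

  from collections import Counter

  def discretize(values, labels, min_count=3):
      """Close an interval once its majority class has at least min_count members
      and the class is about to change at a new attribute value."""
      pairs = sorted(zip(values, labels))
      cuts, counts = [], Counter()
      for i, (v, y) in enumerate(pairs):
          counts[y] += 1
          nxt = pairs[i + 1] if i + 1 < len(pairs) else None
          if (nxt is not None
                  and counts.most_common(1)[0][1] >= min_count
                  and nxt[1] != counts.most_common(1)[0][0]
                  and nxt[0] != v):
              cuts.append((v + nxt[0]) / 2)     # breakpoint halfway between values
              counts = Counter()
      return cuts

  temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
  plays = ["yes", "no", "yes", "yes", "yes", "no", "no",
           "yes", "yes", "yes", "no", "yes", "yes", "no"]
  print(discretize(temps, plays))               # [70.5, 77.5]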
With overfitting avoidance
- Resulting rule set:

  Attribute     Rules                     Errors   Total errors
  Outlook       Sunny → No                2/5      4/14
                Overcast → Yes            0/4
                Rainy → Yes               2/5
  Temperature   ≤ 77.5 → Yes              3/10     5/14
                > 77.5 → No*              2/4
  Humidity      ≤ 82.5 → Yes              1/7      3/14
                > 82.5 and ≤ 95.5 → No    2/6
                > 95.5 → Yes              0/1
  Windy         False → Yes               2/8      5/14
                True → No*                3/6

Discussion of 1R
- 1R was described in a paper by Holte (1993)
  - Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
  - Minimum number of instances was set to 6 after some experimentation
  - 1R's simple rules performed not much worse than much more complex decision trees
- Simplicity first pays off!

  "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets"
  Robert C. Holte, Computer Science Department, University of Ottawa
Discussion of 1R: Hyperpipes
- Another simple technique: build one rule for each class
  - Each rule is a conjunction of tests, one for each attribute
  - For numeric attributes: test checks whether instance's value is inside an interval
    - Interval given by minimum and maximum observed in training data
  - For nominal attributes: test checks whether value is one of a subset of attribute values
    - Subset given by all possible values observed in training data
  - Class with most matching tests is predicted

Statistical modeling
- "Opposite" of 1R: use all the attributes
- Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value)
    - I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
- Independence assumption is never correct!
- But ... this scheme works well in practice
Probabilities for weather data
(Counts are "yes | no", with the corresponding fractions in parentheses; derived from the 14-instance weather data shown on the slide.)

  Outlook:      Sunny 2|3 (2/9 | 3/5)    Overcast 4|0 (4/9 | 0/5)    Rainy 3|2 (3/9 | 2/5)
  Temperature:  Hot 2|2 (2/9 | 2/5)      Mild 4|2 (4/9 | 2/5)        Cool 3|1 (3/9 | 1/5)
  Humidity:     High 3|4 (3/9 | 4/5)     Normal 6|1 (6/9 | 1/5)
  Windy:        False 6|2 (6/9 | 2/5)    True 3|3 (3/9 | 3/5)
  Play:         Yes 9 (9/14)             No 5 (5/14)

- A new day:

  Outlook   Temp.   Humidity   Windy   Play
  Sunny     Cool    High       True    ?

- Likelihood of the two classes:

  For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

- Conversion into a probability by normalization:

  P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
  P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
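A small Python sketch of the likelihood calculation for the new day, using the fractions from the table above (variable names are mine):

  p_yes, p_no = 9/14, 5/14
  cond_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
  cond_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}

  like_yes, like_no = p_yes, p_no
  for p in cond_yes.values():
      like_yes *= p
  for p in cond_no.values():
      like_no *= p

  print(round(like_yes, 4), round(like_no, 4))        # 0.0053 and 0.0206
  print(round(like_yes / (like_yes + like_no), 3))    # P("yes" | E) = 0.205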
Bayes's rule
- Probability of event H given evidence E:

  Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

- A priori probability of H: Pr[H]
  - Probability of event before evidence is seen
- A posteriori probability of H: Pr[H | E]
  - Probability of event after evidence is seen

  Thomas Bayes. Born: 1702 in London, England. Died: 1761 in Tunbridge Wells, Kent, England.

Naïve Bayes for classification
- Classification learning: what's the probability of the class given an instance?
  - Evidence E = instance
  - Event H = class value for instance
- Naïve assumption: evidence splits into parts (i.e. attributes) that are independent

  Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]
Weather data example

  Outlook   Temp.   Humidity   Windy   Play
  Sunny     Cool    High       True    ?        ← Evidence E

  Pr[yes | E] = Pr[Outlook = Sunny | yes]
              × Pr[Temperature = Cool | yes]
              × Pr[Humidity = High | yes]
              × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

The "zero-frequency problem"
- What if an attribute value doesn't occur with every class value?
  (e.g. "Humidity = high" for class "yes")
  - Probability will be zero!
  - A posteriori probability will also be zero!
    (No matter how likely the other values are!)
- Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
- Result: probabilities will never be zero!
  (also: stabilizes probability estimates)
Modified probability estimates
- In some cases adding a constant different from 1 might be more appropriate
- Example: attribute outlook for class yes

  (2 + μ/3) / (9 + μ)     (4 + μ/3) / (9 + μ)     (3 + μ/3) / (9 + μ)
   Sunny                   Overcast                Rainy

- Weights don't need to be equal (but they must sum to 1):

  (2 + μp1) / (9 + μ)     (4 + μp2) / (9 + μ)     (3 + μp3) / (9 + μ)

Missing values
- Training: instance is not included in frequency count for attribute value-class combination
- Classification: attribute will be omitted from calculation
- Example:

  Outlook   Temp.   Humidity   Windy   Play
  ?         Cool    High       True    ?

  Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
  Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
  P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
  P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
Numeric attributes
- Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
- The probability density function for the normal distribution is defined by two parameters:
  - Sample mean:       μ = (1/n) Σ x_i
  - Standard deviation: σ = sqrt( (1/(n−1)) Σ (x_i − μ)² )
  - Then the density function is:

    f(x) = (1 / (sqrt(2π) σ)) exp( −(x − μ)² / (2σ²) )

Statistics for weather data
(The slide tabulates the numeric temperature and humidity values per class with their means and standard deviations; e.g. temperature given play = yes has μ = 73, σ = 6.2.)
- Example density value:

  f(temperature = 66 | yes) = (1 / (sqrt(2π) · 6.2)) exp( −(66 − 73)² / (2 · 6.2²) ) = 0.0340
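A minimal Python sketch of the density calculation, assuming the training statistics quoted above (μ = 73, σ = 6.2 for temperature given play = yes):

  import math

  def gaussian_density(x, mu, sigma):
      return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

  print(round(gaussian_density(66, mu=73, sigma=6.2), 4))    # 0.0340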
Classifying a new day
- A new day:

  Outlook   Temp.   Humidity   Windy   Play
  Sunny     66      90         true    ?

  Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
  Likelihood of "no"  = 3/5 × 0.0221 × 0.0381 × 3/5 × 5/14 = 0.000108
  P("yes") = 0.000036 / (0.000036 + 0.000108) = 25%
  P("no")  = 0.000108 / (0.000036 + 0.000108) = 75%

- Missing values during training are not included in calculation of mean and standard deviation

Probability densities
- Relationship between probability and density:

  Pr[c − ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)

- But: this doesn't change calculation of a posteriori probabilities because ε cancels out
- Exact relationship:

  Pr[a ≤ x ≤ b] = integral from a to b of f(t) dt
Multinomial naïve Bayes I
- Version of naïve Bayes used for document classification using the bag of words model
- n1, n2, ..., nk: number of times word i occurs in the document
- P1, P2, ..., Pk: probability of obtaining word i when sampling from documents in class H
- Probability of observing document E given class H (based on the multinomial distribution):

  Pr[E | H] ≈ N! × Π ( Pi^ni / ni! )     where N = n1 + n2 + ... + nk

- Ignores probability of generating a document of the right length
  (probability assumed constant for each class)

Multinomial naïve Bayes II
- Suppose dictionary has two words, yellow and blue
- Suppose Pr[yellow | H] = 75% and Pr[blue | H] = 25%
- Suppose E is the document "blue yellow blue"
- Probability of observing the document:

  Pr[blue yellow blue | H] ≈ 3! × (0.75¹/1!) × (0.25²/2!) = 9/64 ≈ 0.14

- Suppose there is another class H' that has Pr[yellow | H'] = 25% and Pr[blue | H'] = 75%:

  Pr[blue yellow blue | H'] ≈ 3! × (0.25¹/1!) × (0.75²/2!) = 27/64 ≈ 0.42

- Need to take prior probability of class into account to make final classification
- Factorials don't actually need to be computed
- Underflows can be prevented by using logarithms
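A hedged Python sketch of the multinomial likelihood in log form (the function name is mine); it reproduces the yellow/blue example and shows how logarithms avoid underflow:

  import math

  def multinomial_log_likelihood(word_counts, word_probs):
      n = sum(word_counts.values())
      ll = math.lgamma(n + 1)                         # log N!
      for w, c in word_counts.items():
          ll += c * math.log(word_probs[w]) - math.lgamma(c + 1)   # + log(P^c / c!)
      return ll

  doc = {"blue": 2, "yellow": 1}                      # the document "blue yellow blue"
  h  = {"yellow": 0.75, "blue": 0.25}
  h2 = {"yellow": 0.25, "blue": 0.75}
  print(round(math.exp(multinomial_log_likelihood(doc, h)), 3))    # 0.141 (9/64)
  print(round(math.exp(multinomial_log_likelihood(doc, h2)), 3))   # 0.422 (27/64)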
Naïve Bayes: discussion
- Naïve Bayes works surprisingly well (even if independence assumption is clearly violated)
- Why? Because classification doesn't require accurate probability estimates as long as maximum probability is assigned to the correct class
- However: adding too many redundant attributes will cause problems (e.g. identical attributes)
- Note also: many numeric attributes are not normally distributed (→ kernel density estimators)

Constructing decision trees
- Strategy: top down
- Recursive divide-and-conquer fashion
  - First: select attribute for root node; create branch for each possible attribute value
  - Then: split instances into subsets, one for each branch extending from the node
  - Finally: repeat recursively for each branch, using only instances that reach the branch
- Stop if all instances have the same class
Which attribute to select?
(The slides show tree stumps for each of the four weather attributes.)

Criterion for attribute selection
- Which is the best attribute?
  - Want to get the smallest tree
  - Heuristic: choose the attribute that produces the "purest" nodes
- Popular impurity criterion: information gain
  - Information gain increases with the average purity of the subsets
- Strategy: choose attribute that gives greatest information gain

Computing information
- Measure information in bits
  - Given a probability distribution, the info required to predict an event is the distribution's entropy
  - Entropy gives the information required in bits (can involve fractions of bits!)
- Formula for computing the entropy:

  entropy(p1, p2, ..., pn) = −p1 log p1 − p2 log p2 − ... − pn log pn
Example: attribute Outlook
- Outlook = Sunny:

  info([2,3]) = entropy(2/5, 3/5) = −2/5 log(2/5) − 3/5 log(3/5) = 0.971 bits

- Outlook = Overcast:

  info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
  (Note: 0 log(0) is normally undefined; it is taken to be 0 here.)

- Outlook = Rainy:

  info([3,2]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits

- Expected information for the attribute:

  info([2,3],[4,0],[3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

Computing information gain
- Information gain: information before splitting – information after splitting

  gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits

- Information gain for attributes from weather data:

  gain(Outlook)     = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity)    = 0.152 bits
  gain(Windy)       = 0.048 bits
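A short Python sketch (helper names are mine) that reproduces the entropy and gain figures above:

  from math import log2

  def entropy(counts):
      total = sum(counts)
      return -sum(c / total * log2(c / total) for c in counts if c > 0)

  def info_gain(parent_counts, child_counts):
      total = sum(parent_counts)
      after = sum(sum(c) / total * entropy(c) for c in child_counts)
      return entropy(parent_counts) - after

  print(round(entropy([9, 5]), 3))                               # 0.940
  print(round(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # gain(Outlook)  = 0.247
  print(round(info_gain([9, 5], [[6, 1], [3, 4]]), 3))           # gain(Humidity) = 0.152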
Continuing to split
(Figure: the Outlook = Sunny branch is split further; gains within this subset are:)

  gain(Temperature) = 0.571 bits
  gain(Humidity)    = 0.971 bits
  gain(Windy)       = 0.020 bits

Final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when data can't be split any further
Wishlist for a purity measure
- Properties we require from a purity measure:
  - When node is pure, measure should be zero
  - When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  - Measure should obey the multistage property (i.e. decisions can be made in several stages)
- Entropy is the only function that satisfies all three properties!

Properties of the entropy
- The multistage property:

  entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q/(q + r), r/(q + r))

- Simplification of computation:

  info([2,3,4]) = −2/9 log(2/9) − 3/9 log(3/9) − 4/9 log(4/9)
                = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9

- Note: instead of maximizing info gain we could just minimize information
Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: ID code)
- Subsets are more likely to be pure if there is a large number of values
  ⇒ Information gain is biased towards choosing attributes with a large number of values
  ⇒ This may result in overfitting (selection of an attribute that is non-optimal for prediction)
- Another problem: fragmentation

Weather data with ID code

  ID code   Outlook    Temp.   Humidity   Windy   Play
  A         Sunny      Hot     High       False   No
  B         Sunny      Hot     High       True    No
  C         Overcast   Hot     High       False   Yes
  D         Rainy      Mild    High       False   Yes
  E         Rainy      Cool    Normal     False   Yes
  F         Rainy      Cool    Normal     True    No
  G         Overcast   Cool    Normal     True    Yes
  H         Sunny      Mild    High       False   No
  I         Sunny      Cool    Normal     False   Yes
  J         Rainy      Mild    Normal     False   Yes
  K         Sunny      Mild    Normal     True    Yes
  L         Overcast   Mild    High       True    Yes
  M         Overcast   Hot     Normal     False   Yes
  N         Rainy      Mild    High       True    No
Tree stump for ID code attribute
(Figure: splitting on ID code gives one pure single-instance branch per code.)
⇒ Information gain is maximal for ID code (namely 0.940 bits)

Gain ratio
- Gain ratio: a modification of the information gain that reduces its bias
- Gain ratio takes number and size of branches into account when choosing an attribute
  - It corrects the information gain by taking the intrinsic information of a split into account
- Intrinsic information: entropy of the distribution of instances into branches
  (i.e. how much info do we need to tell which branch an instance belongs to)
Computing the gain ratio
- Example: intrinsic information for ID code

  info([1,1,...,1]) = 14 × (−1/14 × log(1/14)) = 3.807 bits

- Value of attribute decreases as intrinsic information gets larger
- Definition of gain ratio:

  gain_ratio(attribute) = gain(attribute) / intrinsic_info(attribute)

- Example:

  gain_ratio(ID code) = 0.940 / 3.807 = 0.246

Gain ratios for weather data

  Outlook                              Temperature
  Gain: 0.940 − 0.693       = 0.247    Gain: 0.940 − 0.911       = 0.029
  Split info: info([5,4,5]) = 1.577    Split info: info([4,6,4]) = 1.557
  Gain ratio: 0.247/1.577   = 0.157    Gain ratio: 0.029/1.557   = 0.019

  Humidity                             Windy
  Gain: 0.940 − 0.788       = 0.152    Gain: 0.940 − 0.892       = 0.048
  Split info: info([7,7])   = 1.000    Split info: info([8,6])   = 0.985
  Gain ratio: 0.152/1.000   = 0.152    Gain ratio: 0.048/0.985   = 0.049
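Since the split info is just the entropy of the branch sizes, the gain ratios above can be checked with a few lines of Python (a sketch; the gain values are taken from the earlier slide):

  from math import log2

  def entropy(counts):
      total = sum(counts)
      return -sum(c / total * log2(c / total) for c in counts if c > 0)

  outlook_gain, outlook_sizes = 0.247, [5, 4, 5]
  temp_gain, temp_sizes = 0.029, [4, 6, 4]
  print(round(outlook_gain / entropy(outlook_sizes), 3))   # 0.157
  print(round(temp_gain / entropy(temp_sizes), 3))         # 0.019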
More on the gain ratio
- "Outlook" still comes out top
- However: "ID code" has greater gain ratio
  - Standard fix: ad hoc test to prevent splitting on that type of attribute
- Problem with gain ratio: it may overcompensate
  - May choose an attribute just because its intrinsic information is very low
  - Standard fix: only consider attributes with greater than average information gain

Discussion
- Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
  - Gain ratio just one modification of this basic algorithm
  - C4.5: deals with numeric attributes, missing values, noisy data
- Similar approach: CART
- There are many other attribute selection criteria!
  (But little difference in accuracy of result)
Covering algorithms
- Convert decision tree into a rule set
  - Straightforward, but rule set overly complex
  - More effective conversions are not trivial
- Instead, can generate rule set directly
  - For each class in turn, find rule set that covers all instances in it
    (excluding instances not in the class)
- Called a covering approach:
  - At each stage a rule is identified that "covers" some of the instances

Example: generating a rule
(Figure: instances of classes a and b plotted against attributes x and y; each added test carves out more of the space.)

  If true then class = a
  If x > 1.2 then class = a
  If x > 1.2 and y > 2.6 then class = a

- Possible rule set for class "b":

  If x ≤ 1.2 then class = b
  If x > 1.2 and y ≤ 2.6 then class = b

- Could add more rules, get "perfect" rule set
Rules vs. trees
- Corresponding decision tree (produces exactly the same predictions)
- But: rule sets can be more perspicuous when decision trees suffer from replicated subtrees
- Also: in multiclass situations, the covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account

Simple covering algorithm
- Generates a rule by adding tests that maximize the rule's accuracy
- Similar to situation in decision trees: problem of selecting an attribute to split on
  - But: decision tree inducer maximizes overall purity
- Each new test reduces the rule's coverage
  (Figure: space of examples, rule so far, rule after adding a new term.)
Selecting a test
- Goal: maximize accuracy
  - t: total number of instances covered by rule
  - p: positive examples of the class covered by rule
  - t − p: number of errors made by rule
  ⇒ Select test that maximizes the ratio p/t
- We are finished when p/t = 1 or the set of instances can't be split any further

Example: contact lens data
- Rule we seek:  If ? then recommendation = hard
- Possible tests:

  Age = Young                             2/8
  Age = Pre-presbyopic                    1/8
  Age = Presbyopic                        1/8
  Spectacle prescription = Myope          3/12
  Spectacle prescription = Hypermetrope   1/12
  Astigmatism = no                        0/12
  Astigmatism = yes                       4/12
  Tear production rate = Reduced          0/12
  Tear production rate = Normal           4/12
Modified rule and resulting data
- Rule with best test added:

  If astigmatism = yes then recommendation = hard

- Instances covered by modified rule:
  (Table: the 12 instances with astigmatism = yes, with their age, spectacle prescription, tear production rate and recommended lenses.)

Further refinement
- Current state:

  If astigmatism = yes and ? then recommendation = hard

- Possible tests:

  Age = Young                             2/4
  Age = Pre-presbyopic                    1/4
  Age = Presbyopic                        1/4
  Spectacle prescription = Myope          3/6
  Spectacle prescription = Hypermetrope   1/6
  Tear production rate = Reduced          0/6
  Tear production rate = Normal           4/6
Modified rule and resulting data
- Rule with best test added:

  If astigmatism = yes and tear production rate = normal then recommendation = hard

- Instances covered by modified rule:
  (Table: the 6 instances with astigmatism = yes and tear production rate = normal.)

Further refinement
- Current state:

  If astigmatism = yes and tear production rate = normal and ? then recommendation = hard

- Possible tests:

  Age = Young                             2/2
  Age = Pre-presbyopic                    1/2
  Age = Presbyopic                        1/2
  Spectacle prescription = Myope          3/3
  Spectacle prescription = Hypermetrope   1/3

- Tie between the first and the fourth test
  - We choose the one with greater coverage
The result
- Final rule:

  If astigmatism = yes and tear production rate = normal
     and spectacle prescription = myope
  then recommendation = hard

- Second rule for recommending "hard lenses"
  (built from instances not covered by first rule):

  If age = young and astigmatism = yes and tear production rate = normal
  then recommendation = hard

- These two rules cover all "hard lenses"
- Process is repeated with the other two classes

Pseudo-code for PRISM

  For each class C
    Initialize E to the instance set
    While E contains instances in class C
      Create a rule R with an empty left-hand side that predicts class C
      Until R is perfect (or there are no more attributes to use) do
        For each attribute A not mentioned in R, and each value v,
          Consider adding the condition A = v to the left-hand side of R
        Select A and v to maximize the accuracy p/t
          (break ties by choosing the condition with the largest p)
        Add A = v to R
      Remove the instances covered by R from E
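A compact, hedged Python rendering of the PRISM pseudo-code above (the data layout, helper names, and the toy data are my own; instances are dicts with the class label stored under the key "class"):

  def prism_for_class(instances, cls):
      E, rules = list(instances), []
      while any(x["class"] == cls for x in E):
          covered, conditions = list(E), {}
          while any(x["class"] != cls for x in covered):
              candidates = {(a, v) for x in covered for a, v in x.items()
                            if a != "class" and a not in conditions}
              if not candidates:
                  break
              def score(av):
                  a, v = av
                  t = [x for x in covered if x[a] == v]
                  p = sum(x["class"] == cls for x in t)
                  return (p / len(t), p)          # accuracy first, coverage breaks ties
              a, v = max(candidates, key=score)
              conditions[a] = v
              covered = [x for x in covered if x[a] == v]
          rules.append(conditions)
          E = [x for x in E if not all(x.get(a) == v for a, v in conditions.items())]
      return rules

  toy = [{"astig": "yes", "tear": "normal",  "class": "hard"},
         {"astig": "yes", "tear": "reduced", "class": "none"},
         {"astig": "no",  "tear": "normal",  "class": "none"}]
  print(prism_for_class(toy, "hard"))    # one rule covering the single "hard" instance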
Rules vs. decision lists
- PRISM with outer loop removed generates a decision list for one class
  - Subsequent rules are designed for instances that are not covered by previous rules
  - But: order doesn't matter because all rules predict the same class
- Outer loop considers all classes separately
  - No order dependence implied
- Problems: overlapping rules, default rule required

Separate and conquer
- Methods like PRISM (for dealing with one class) are separate-and-conquer algorithms:
  - First, identify a useful rule
  - Then, separate out all the instances it covers
  - Finally, "conquer" the remaining instances
- Difference to divide-and-conquer methods:
  - Subset covered by a rule doesn't need to be explored any further
Mining association rules
- Naïve method for finding association rules:
  - Use separate-and-conquer method
  - Treat every possible combination of attribute values as a separate class
- Two problems:
  - Computational complexity
  - Resulting number of rules (which would have to be pruned on the basis of support and confidence)
- But: we can look for high-support rules directly!

Item sets
- Support: number of instances correctly covered by association rule
  - The same as the number of instances covered by all tests in the rule (LHS and RHS!)
- Item: one test/attribute-value pair
- Item set: all items occurring in a rule
- Goal: only rules that exceed pre-defined support
  ⇒ Do it by finding all item sets with the given minimum support and generating rules from them!
Weather data

  Outlook    Temp   Humidity   Windy   Play
  Sunny      Hot    High       False   No
  Sunny      Hot    High       True    No
  Overcast   Hot    High       False   Yes
  Rainy      Mild   High       False   Yes
  Rainy      Cool   Normal     False   Yes
  Rainy      Cool   Normal     True    No
  Overcast   Cool   Normal     True    Yes
  Sunny      Mild   High       False   No
  Sunny      Cool   Normal     False   Yes
  Rainy      Mild   Normal     False   Yes
  Sunny      Mild   Normal     True    Yes
  Overcast   Mild   High       True    Yes
  Overcast   Hot    Normal     False   Yes
  Rainy      Mild   High       True    No

Item sets for weather data
(Examples from each column of the slide's table:)

  One-item sets:    Outlook = Sunny (5)
  Two-item sets:    Outlook = Sunny, Temperature = Hot (2)
  Three-item sets:  Outlook = Sunny, Temperature = Hot, Humidity = High (2)
  Four-item sets:   Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2)

- In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets and 0 five-item sets (with minimum support of two)
Generating rules from an item set
- Once all item sets with minimum support have been generated, we can turn them into rules
- Example:

  Humidity = Normal, Windy = False, Play = Yes (4)

- Seven (2³ − 1) potential rules:

  If Humidity = Normal and Windy = False then Play = Yes             4/4
  If Humidity = Normal and Play = Yes then Windy = False             4/6
  If Windy = False and Play = Yes then Humidity = Normal             4/6
  If Humidity = Normal then Windy = False and Play = Yes             4/7
  If Windy = False then Humidity = Normal and Play = Yes             4/8
  If Play = Yes then Humidity = Normal and Windy = False             4/9
  If True then Humidity = Normal and Windy = False and Play = Yes    4/14

Rules for weather data
- Rules with support > 1 and confidence = 100%:

      Association rule                                         Sup.   Conf.
   1  Humidity = Normal, Windy = False  ⇒  Play = Yes           4     100%
   2  Temperature = Cool  ⇒  Humidity = Normal                  4     100%
   3  Outlook = Overcast  ⇒  Play = Yes                         4     100%
   4  Temperature = Cool, Play = Yes  ⇒  Humidity = Normal      3     100%
  ... ...                                                      ...    ...
  58  Outlook = Sunny, Temperature = Hot  ⇒  Humidity = High    2     100%

- In total:
  - 3 rules with support four
  - 5 with support three
  - 50 with support two
Example rules from the same set
- Item set:

  Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)

- Resulting rules (all with 100% confidence):

  Temperature = Cool, Windy = False  ⇒  Humidity = Normal, Play = Yes
  Temperature = Cool, Windy = False, Humidity = Normal  ⇒  Play = Yes
  Temperature = Cool, Windy = False, Play = Yes  ⇒  Humidity = Normal

- due to the following "frequent" item sets:

  Temperature = Cool, Windy = False (2)
  Temperature = Cool, Humidity = Normal, Windy = False (2)
  Temperature = Cool, Windy = False, Play = Yes (2)

Generating item sets efficiently
- How can we efficiently find all frequent item sets?
- Finding one-item sets is easy
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  - If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  - In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
  ⇒ Compute k-item sets by merging (k−1)-item sets
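A hedged Python sketch of the merge-and-prune step (the function name is mine): candidates are formed from (k−1)-item sets that share their first k−2 items, and any candidate with an infrequent (k−1)-subset is dropped. It reproduces the example on the next slide.

  from itertools import combinations

  def candidates(frequent_k_minus_1):
      prev = sorted(frequent_k_minus_1)              # item sets as sorted tuples
      prev_set = set(prev)
      out = []
      for i in range(len(prev)):
          for j in range(i + 1, len(prev)):
              a, b = prev[i], prev[j]
              if a[:-1] == b[:-1]:                   # share all but the last item
                  cand = a + (b[-1],)
                  if all(tuple(s) in prev_set        # prune: every subset must be frequent
                         for s in combinations(cand, len(cand) - 1)):
                      out.append(cand)
      return out

  threes = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
  print(candidates(threes))    # [('A', 'B', 'C', 'D')] — (A C D E) is pruned via (C D E)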
Example
- Given: five three-item sets

  (A B C), (A B D), (A C D), (A C E), (B C D)

  Lexicographically ordered!

- Candidate four-item sets:

  (A B C D)   OK because of (B C D)
  (A C D E)   Not OK because of (C D E)

- Final check by counting instances in the dataset!
- (k−1)-item sets are stored in a hash table

Generating rules efficiently
- We are looking for all high-confidence rules
  - Support of antecedent obtained from hash table
  - But: brute-force method is (2^N − 1)
- Better way: building (c + 1)-consequent rules from c-consequent ones
  - Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
- Resulting algorithm similar to procedure for large item sets
Example
- 1-consequent rules:

  If Outlook = Sunny and Windy = False and Play = No
  then Humidity = High (2/2)

  If Humidity = High and Windy = False and Play = No
  then Outlook = Sunny (2/2)

- Corresponding 2-consequent rule:

  If Windy = False and Play = No
  then Outlook = Sunny and Humidity = High (2/2)

- Final check of antecedent against hash table!

Association rules: discussion
- Above method makes one pass through the data for each different size of item set
  - Other possibility: generate (k+2)-item sets just after (k+1)-item sets have been generated
  - Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  - Makes sense if data too large for main memory
- Practical issue: generating a certain number of rules (e.g. by incrementally reducing min. support)
Other issues
- Standard ARFF format very inefficient for typical market basket data
  - Attributes represent items in a basket and most items are usually missing
  - Data should be represented in sparse format
- Instances are also called transactions
- Confidence is not necessarily the best measure
  - Example: milk occurs in almost every supermarket transaction
  - Other measures have been devised (e.g. lift)

Linear models: linear regression
- Work most naturally with numeric attributes
- Standard technique for numeric prediction
  - Outcome is a linear combination of attributes:

    x = w0 + w1·a1 + w2·a2 + ... + wk·ak

- Weights are calculated from the training data
- Predicted value for the first training instance a(1):

  w0·a0(1) + w1·a1(1) + w2·a2(1) + ... + wk·ak(1) = Σ (j = 0..k) wj·aj(1)

  (assuming each instance is extended with a constant attribute a0 with value 1)
Minimizing the squared error
- Choose k+1 coefficients to minimize the squared error on the training data
- Squared error:

  Σ (i = 1..n) ( x(i) − Σ (j = 0..k) wj·aj(i) )²

- Derive coefficients using standard matrix operations
- Can be done if there are more instances than attributes (roughly speaking)
- Minimizing the absolute error is more difficult

Classification
- Any regression technique can be used for classification
  - Training: perform a regression for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that don't
  - Prediction: predict class corresponding to model with largest output value (membership value)
- For linear regression this is known as multi-response linear regression
- Problem: membership values are not in the [0,1] range, so they aren't proper probability estimates
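A minimal least-squares sketch using NumPy (assumes NumPy is available; the data is made up purely to show the shape of the computation):

  import numpy as np

  A = np.array([[2.0, 1.0], [3.0, 4.0], [5.0, 2.0], [6.0, 5.0]])   # training instances
  x = np.array([5.0, 11.0, 9.0, 16.0])                             # numeric targets
  A1 = np.hstack([np.ones((len(A), 1)), A])                        # constant attribute a0 = 1

  # Least-squares solution of A1 w ≈ x (standard matrix operations)
  w, *_ = np.linalg.lstsq(A1, x, rcond=None)
  print(w)            # weights w0, w1, w2
  print(A1 @ w)       # predicted values for the training instances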
Linear models: logistic regression
- Builds a linear model for a transformed target variable
- Assume we have two classes
- Logistic regression replaces the target

  P(1 | a1, a2, ..., ak)

  by this target:

  log[ P(1 | a1, a2, ..., ak) / (1 − P(1 | a1, a2, ..., ak)) ]

Logit transformation
- The logit transformation maps [0,1] to (−∞, +∞)
- Resulting model:

  Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(−w0 − w1·a1 − ... − wk·ak))
Example logistic regression model
- Model with w0 = 0.5 and w1 = 1 (figure of the resulting sigmoid)
- Parameters are found from training data using maximum likelihood

Maximum likelihood
- Aim: maximize probability of training data with respect to the parameters
- Can use logarithms of probabilities and maximize the log-likelihood of the model:

  Σ (i = 1..n) (1 − x(i)) · log(1 − Pr[1 | a1(i), ..., ak(i)]) + x(i) · log(Pr[1 | a1(i), ..., ak(i)])

  where the x(i) are either 0 or 1
- Weights wi need to be chosen to maximize the log-likelihood
  (relatively simple method: iteratively re-weighted least squares)
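A small Python sketch of the log-likelihood being maximized (no optimizer shown; in practice iteratively re-weighted least squares or a gradient method would adjust the weights; data and names are made up):

  import math

  def log_likelihood(w, instances, labels):
      ll = 0.0
      for a, x in zip(instances, labels):              # x is 0 or 1
          z = w[0] + sum(wi * ai for wi, ai in zip(w[1:], a))
          p = 1.0 / (1.0 + math.exp(-z))               # Pr[1 | a]
          ll += x * math.log(p) + (1 - x) * math.log(1 - p)
      return ll

  data   = [[0.2], [1.0], [2.5], [3.0]]
  labels = [0, 0, 1, 1]
  print(log_likelihood([-2.0, 1.5], data, labels))     # higher is better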
Multiple classes
- Can perform logistic regression independently for each class
  (like multi-response linear regression)
- Problem: probability estimates for different classes won't sum to one
- Better: train coupled models by maximizing likelihood over all classes
- Alternative that often works well in practice: pairwise classification

Pairwise classification
- Idea: build a model for each pair of classes, using only training data from those classes
- Problem? Have to solve k(k−1)/2 classification problems for a k-class problem
- Turns out not to be a problem in many cases because the training sets become small:
  - Assume data evenly distributed, i.e. 2n/k per learning problem for n instances in total
  - Suppose learning algorithm is linear in n
  - Then runtime of pairwise classification is proportional to (k(k−1)/2) × (2n/k) = (k−1)n
Linear models are hyperplanes
- Decision boundary for two-class logistic regression is where the probability equals 0.5:

  Pr[1 | a1, a2, ..., ak] = 1 / (1 + exp(−w0 − w1·a1 − ... − wk·ak)) = 0.5

  which occurs when

  −w0 − w1·a1 − ... − wk·ak = 0

- Thus logistic regression can only separate data that can be separated by a hyperplane
- Multi-response linear regression has the same problem. Class 1 is assigned if:

  w0(1) + w1(1)·a1 + ... + wk(1)·ak > w0(2) + w1(2)·a1 + ... + wk(2)·ak

Linear models: the perceptron
- Don't actually need probability estimates if all we want to do is classification
- Different approach: learn a separating hyperplane
- Assumption: data is linearly separable
- Algorithm for learning a separating hyperplane: the perceptron learning rule
- Hyperplane:

  0 = w0·a0 + w1·a1 + w2·a2 + ... + wk·ak

  where we again assume that there is a constant attribute a0 with value 1 (bias)
- If the sum is greater than zero we predict the first class, otherwise the second class
The algorithm

  Set all weights to zero
  Until all instances in the training data are classified correctly
    For each instance I in the training data
      If I is classified incorrectly by the perceptron
        If I belongs to the first class add it to the weight vector
        else subtract it from the weight vector

- Why does this work?
  Consider the situation where an instance a pertaining to the first class has been added:

  (w0 + a0)·a0 + (w1 + a1)·a1 + (w2 + a2)·a2 + ... + (wk + ak)·ak

  This means the output for a has increased by:

  a0·a0 + a1·a1 + a2·a2 + ... + ak·ak

  This number is always positive, thus the hyperplane has moved in the correct direction
  (and we can show the output decreases for instances of the other class)

Perceptron as a neural network
(Figure: the perceptron drawn as a network with an input layer of attributes and a single output node.)
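A hedged Python sketch of the perceptron rule above (instances include the constant bias attribute a0 = 1; labels are +1 for the first class and −1 for the second; the toy data is made up):

  def perceptron(instances, labels, max_epochs=100):
      w = [0.0] * len(instances[0])
      for _ in range(max_epochs):
          mistakes = 0
          for a, y in zip(instances, labels):
              s = sum(wi * ai for wi, ai in zip(w, a))
              if (s > 0) != (y > 0):                          # misclassified
                  w = [wi + y * ai for wi, ai in zip(w, a)]   # add or subtract the instance
                  mistakes += 1
          if mistakes == 0:
              break
      return w

  X = [[1, 2, 2], [1, 1, 3], [1, -1, -2], [1, -2, -1]]        # a0 = 1 plus two attributes
  y = [1, 1, -1, -1]
  print(perceptron(X, y))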
Linear models: Winnow
- Another mistake-driven algorithm for finding a separating hyperplane
  - Assumes binary data (i.e. attribute values are either zero or one)
- Difference: multiplicative updates instead of additive updates
  - Weights are multiplied by a user-specified parameter α > 1 (or its inverse)
- Another difference: user-specified threshold parameter θ
  - Predict first class if

    w1·a1 + w2·a2 + w3·a3 + ... + wk·ak > θ

The algorithm

  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each ai that is 1, multiply wi by alpha
          (if ai is 0, leave wi unchanged)
        otherwise
          for each ai that is 1, divide wi by alpha
          (if ai is 0, leave wi unchanged)

- Winnow is very effective in homing in on relevant features (it is attribute efficient)
- Can also be used in an on-line setting in which new instances arrive continuously
  (like the perceptron algorithm)
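A hedged Python sketch of the Winnow update (binary attributes; labels are True for the first class; alpha and theta follow the slide; the toy data is made up):

  def winnow(instances, labels, alpha=2.0, theta=2.5, max_epochs=100):
      w = [1.0] * len(instances[0])
      for _ in range(max_epochs):
          mistakes = 0
          for a, first_class in zip(instances, labels):
              predicted_first = sum(wi * ai for wi, ai in zip(w, a)) > theta
              if predicted_first != first_class:
                  mistakes += 1
                  for i, ai in enumerate(a):
                      if ai == 1:                   # attributes that are 0 are left alone
                          w[i] = w[i] * alpha if first_class else w[i] / alpha
          if mistakes == 0:
              break
      return w

  X = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
  y = [True, True, False, False]                    # first class iff attribute 0 is set
  print(winnow(X, y))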
Balanced Winnow
- Winnow doesn't allow negative weights and this can be a drawback in some applications
- Balanced Winnow maintains two weight vectors, one for each class:

  while some instances are misclassified
    for each instance a in the training data
      classify a using the current weights
      if the predicted class is incorrect
        if a belongs to the first class
          for each ai that is 1, multiply wi+ by alpha and divide wi− by alpha
          (if ai is 0, leave wi+ and wi− unchanged)
        otherwise
          for each ai that is 1, multiply wi− by alpha and divide wi+ by alpha
          (if ai is 0, leave wi+ and wi− unchanged)

- Instance is classified as belonging to the first class (of two classes) if:

  (w1+ − w1−)·a1 + (w2+ − w2−)·a2 + ... + (wk+ − wk−)·ak > θ

Instance-based learning
- Distance function defines what's learned
- Most instance-based schemes use Euclidean distance:

  sqrt( (a1(1) − a1(2))² + (a2(1) − a2(2))² + ... + (ak(1) − ak(2))² )

  a(1) and a(2): two instances with k attributes
- Taking the square root is not required when comparing distances
- Other popular metric: city-block metric
  - Adds differences without squaring them
Normalization and other issues
- Different attributes are measured on different scales ⇒ need to be normalized:

  ai = (vi − min vi) / (max vi − min vi)

  vi: the actual value of attribute i
- Nominal attributes: distance either 0 or 1
- Common policy for missing values: assumed to be maximally distant (given normalized attributes)

Finding nearest neighbors efficiently
- Simplest way of finding nearest neighbour: linear scan of the data
  - Classification takes time proportional to the product of the number of instances in training and test sets
- Nearest-neighbor search can be done more efficiently using appropriate data structures
- We will discuss two methods that represent training data in a tree structure: kD-trees and ball trees
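A small Python sketch of the min-max normalization and squared Euclidean distance from the normalization slide (helper names and the toy data are mine):

  def normalize(dataset):
      lo = [min(col) for col in zip(*dataset)]
      hi = [max(col) for col in zip(*dataset)]
      return [[(v - l) / (h - l) if h > l else 0.0
               for v, l, h in zip(row, lo, hi)] for row in dataset]

  def squared_distance(a, b):
      # square root omitted: it does not change which neighbour is nearest
      return sum((x - y) ** 2 for x, y in zip(a, b))

  data = [[85, 85], [80, 90], [64, 65], [72, 95]]       # e.g. temperature, humidity
  norm = normalize(data)
  # index of the nearest neighbour of instance 0 among the others
  print(min(range(1, len(norm)), key=lambda i: squared_distance(norm[0], norm[i])))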
kD-tree example / Using kD-trees: example
(Figures: a kD-tree partition of the instance space and how a nearest-neighbour query descends and backtracks through it.)
More on kD-trees
- Complexity depends on depth of tree, given by the logarithm of the number of nodes
- Amount of backtracking required depends on quality of tree ("square" vs. "skinny" nodes)
- How to build a good tree? Need to find a good split point and split direction
  - Split direction: direction with greatest variance
  - Split point: median value along that direction
- Using the value closest to the mean (rather than the median) can be better if data is skewed
- Can apply this recursively

Building trees incrementally
- Big advantage of instance-based learning: classifier can be updated incrementally
  - Just add new training instance!
- Can we do the same with kD-trees?
- Heuristic strategy:
  - Find leaf node containing new instance
  - Place instance into leaf if leaf is empty
  - Otherwise, split leaf according to the longest dimension (to preserve squareness)
- Tree should be re-built occasionally (i.e. if depth grows to twice the optimum depth)
Ball trees
- Problem in kD-trees: corners
- Observation: no need to make sure that regions don't overlap
- Can use balls (hyperspheres) instead of hyperrectangles
  - A ball tree organizes the data into a tree of k-dimensional hyperspheres
  - Normally allows for a better fit to the data and thus more efficient search

Ball tree example
(Figure: a set of points covered by nested hyperspheres and the corresponding tree.)
Using ball trees
- Nearest-neighbor search is done using the same backtracking strategy as in kD-trees
- A ball can be ruled out from consideration if:
  distance from target to ball's center exceeds ball's radius plus current upper bound

Building ball trees
- Ball trees are built top down (like kD-trees)
- Don't have to continue until leaf balls contain just two points: can enforce minimum occupancy (same in kD-trees)
- Basic problem: splitting a ball into two
- Simple (linear-time) split selection strategy (see the sketch below):
  - Choose point farthest from ball's center
  - Choose second point farthest from first one
  - Assign each point to these two points
  - Compute cluster centers and radii based on the two subsets to get two balls
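A hedged Python sketch of that linear-time split (function names are mine; points are plain coordinate tuples, and at least two distinct points are assumed):

  def split_ball(points, center):
      def dist2(a, b):
          return sum((x - y) ** 2 for x, y in zip(a, b))
      p1 = max(points, key=lambda p: dist2(p, center))    # farthest from the centre
      p2 = max(points, key=lambda p: dist2(p, p1))        # farthest from p1
      left  = [p for p in points if dist2(p, p1) <= dist2(p, p2)]
      right = [p for p in points if dist2(p, p1) >  dist2(p, p2)]
      def ball(pts):                                      # centroid and radius of a subset
          c = tuple(sum(x) / len(pts) for x in zip(*pts))
          return c, max(dist2(p, c) for p in pts) ** 0.5
      return ball(left), ball(right)

  pts = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
  print(split_ball(pts, center=(4.83, 4.83)))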
Discussion of nearest-neighbor learning
- Often very accurate
- Assumes all attributes are equally important
  - Remedy: attribute selection or weights
- Possible remedies against noisy instances:
  - Take a majority vote over the k nearest neighbors
  - Removing noisy instances from dataset (difficult!)
- Statisticians have used k-NN since the early 1950s
  - If n → ∞ and k/n → 0, error approaches minimum
- kD-trees become inefficient when the number of attributes is too large (approximately > 10)
- Ball trees (which are instances of metric trees) work well in higher-dimensional spaces

More discussion
- Instead of storing all training instances, compress them into regions
- Example: hyperpipes (from discussion of 1R)
- Another simple technique (Voting Feature Intervals):
  - Construct intervals for each attribute
    - Discretize numeric attributes
    - Treat each value of a nominal attribute as an "interval"
  - Count number of times class occurs in interval
  - Prediction is generated by letting intervals vote (those that contain the test instance)
Clustering
- Clustering techniques apply when there is no class to be predicted
- Aim: divide instances into "natural" groups
- As we've seen, clusters can be:
  - disjoint vs. overlapping
  - deterministic vs. probabilistic
  - flat vs. hierarchical
- We'll look at a classic clustering algorithm called k-means
  - k-means clusters are disjoint, deterministic, and flat

The k-means algorithm
To cluster data into k groups (k is predefined):
1. Choose k cluster centers, e.g. at random
2. Assign instances to clusters based on distance to cluster centers
3. Compute centroids of clusters
4. Go to step 2 until convergence
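A minimal Python k-means sketch (random initial centres; the data and names are my own):

  import random

  def kmeans(points, k, iterations=20, seed=1):
      random.seed(seed)
      centers = random.sample(points, k)
      for _ in range(iterations):
          clusters = [[] for _ in range(k)]
          for p in points:                                       # assignment step
              i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                  for a, b in zip(p, centers[c])))
              clusters[i].append(p)
          centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                     for i, c in enumerate(clusters)]            # centroid step
      return centers, clusters

  pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
  centers, clusters = kmeans(pts, k=2)
  print(centers)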
Discussion
- Algorithm minimizes squared distance to cluster centers
- Result can vary significantly based on initial choice of seeds
  - Can get trapped in a local minimum
  - Example: (figure of instances with an unlucky choice of initial cluster centres)
- To increase the chance of finding the global optimum: restart with different random seeds
- Can be applied recursively with k = 2

Faster distance calculations
- Can we use kD-trees or ball trees to speed up the process? Yes:
  - First, build a tree, which remains static, for all the data points
  - At each node, store the number of instances and the sum of all instances
  - In each iteration, descend the tree and find out which cluster each node belongs to
    - Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
    - Use statistics stored at the nodes to compute new cluster centers
Example
(Figure illustrating the tree-based speed-up described on the previous slide.)

Comments on basic methods
- Bayes' rule stems from his "Essay towards solving a problem in the doctrine of chances" (1763)
  - Difficult bit in general: estimating prior probabilities (easy in the case of naïve Bayes)
- Extension of naïve Bayes: Bayesian networks (which we'll discuss later)
- Algorithm for association rules is called APRIORI
- Minsky and Papert (1969) showed that linear classifiers have limitations, e.g. can't learn XOR
  - But: combinations of them can (→ multi-layer neural nets, which we'll discuss later)