Backpropagation

♦ How to learn weights given network structure?
♦ Cannot simply use perceptron learning rule because we have hidden layer(s)
♦ Function we are trying to minimize: error
♦ Can use a general function minimization technique called gradient descent
  ♦ Need differentiable activation function: use the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$ instead of the threshold function
  ♦ Need differentiable error function: can't use zero-one loss, but can use squared error $E = \frac{1}{2}(y - f(x))^2$
The two activation functions

[Figure: plots of the threshold and sigmoid activation functions]

Gradient descent example

♦ Function: $x^2 + 1$
♦ Derivative: $2x$
♦ Learning rate: 0.1
♦ Start value: 4
♦ Can only find a local minimum!
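A minimal sketch of this example in Python (the step size and starting point are the ones on the slide; the function name is illustrative):

  def gradient_descent(start, learning_rate, steps):
      """Minimize f(x) = x^2 + 1 by following the negative gradient f'(x) = 2x."""
      x = start
      for _ in range(steps):
          x -= learning_rate * 2 * x  # move against the gradient
      return x

  # Starting at x = 4 with learning rate 0.1, x shrinks toward the minimum at 0:
  print(gradient_descent(start=4.0, learning_rate=0.1, steps=50))  # ~0.0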
Minimizing the error I

♦ Need to find partial derivative of error function for each parameter (i.e. weight)

Minimizing the error II

♦ What about the weights for the connections from the input to the hidden layer?
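The derivations on these two slides did not survive extraction; a sketch of the key step, under the definitions above (squared error $E = \frac{1}{2}(y - f(x))^2$, sigmoid activation, with $a_i$ the output of hidden unit $i$ and $w_i$ the weight on its connection to the output):

$$\frac{\partial E}{\partial w_i} = (f(x) - y)\, f'(x)\, a_i, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)$$

For a weight on a connection from the input to a hidden unit, the chain rule is applied once more: the error term $(f(x) - y)\,f'(x)\,w_i$ is propagated back through the hidden unit's own sigmoid derivative and multiplied by the corresponding input value.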
Remarks

♦ Same process works for multiple hidden layers and multiple output units (e.g. for multiple classes)
♦ Can update weights after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation
♦ Weights are initialized to small random values
♦ How to avoid overfitting?
  ♦ Early stopping: use validation set to check when to stop
  ♦ Weight decay: add penalty term to error function
♦ How to speed up learning?
  ♦ Momentum: re-use proportion of old weight change
  ♦ Use optimization method that employs 2nd derivative

Radial basis function networks

♦ Another type of feedforward network with two layers (plus the input layer)
♦ Hidden units represent points in instance space and activation depends on distance
  ♦ To this end, distance is converted into similarity: Gaussian activation function
  ♦ Width may be different for each hidden unit
  ♦ Points of equal activation form hypersphere (or hyperellipsoid) as opposed to hyperplane
♦ Output layer same as in MLP
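A small sketch of the Gaussian activation just described (each hidden unit has its own center and width; names here are illustrative):

  import math

  def rbf_activation(x, center, width):
      """Gaussian RBF: distance to the center converted into a similarity in (0, 1]."""
      sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
      return math.exp(-sq_dist / (2 * width ** 2))

  print(rbf_activation([1.0, 2.0], center=[0.0, 0.0], width=1.5))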
Learning RBF networks

♦ Parameters: centers and widths of the RBFs + weights in output layer
♦ Can learn two sets of parameters independently and still get accurate models
  ♦ E.g.: clusters from k-means can be used to form basis functions
  ♦ Linear model can be used based on fixed RBFs
  ♦ Makes learning RBFs very efficient
♦ Disadvantage: no built-in attribute weighting based on relevance
♦ RBF networks are related to RBF SVMs

Instance-based learning

♦ Practical problems of 1-NN scheme:
  ♦ Slow (but: fast tree-based approaches exist)
    Remedy: remove irrelevant data
  ♦ Noise (but: k-NN copes quite well with noise)
    Remedy: remove noisy instances
  ♦ All attributes deemed equally important
    Remedy: weight attributes (or simply select)
  ♦ Doesn't perform explicit generalization
    Remedy: rule-based NN approach
Learning prototypes

♦ Only those instances involved in a decision need to be stored
♦ Noisy instances should be filtered out
♦ Idea: only use prototypical examples

Speed up, combat noise

♦ IB2: save memory, speed up classification
  ♦ Work incrementally
  ♦ Only incorporate misclassified instances
  ♦ Problem: noisy data gets incorporated
♦ IB3: deal with noise
  ♦ Discard instances that don't perform well
  ♦ Compute confidence intervals for
    1. Each instance's success rate
    2. Default accuracy of its class
  ♦ Accept/reject instances
    ♦ Accept if lower limit of 1 exceeds upper limit of 2
    ♦ Reject if upper limit of 1 is below lower limit of 2
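A minimal sketch of the IB2 idea in Python (1-NN over the stored set; the helper names are made up here):

  def nearest_label(stored, x):
      """Classify x with 1-NN over the stored (instance, label) pairs."""
      inst, label = min(stored, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
      return label

  def ib2(training_data):
      """IB2: keep only instances the current store misclassifies."""
      stored = [training_data[0]]          # seed with the first instance
      for x, y in training_data[1:]:
          if nearest_label(stored, x) != y:
              stored.append((x, y))        # misclassified -> incorporate
      return stored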
Weight attributes

♦ IB4: weight each attribute (weights can be class-specific)
♦ Weighted Euclidean distance (a code sketch follows the next slide):
  $\sqrt{w_1^2 (x_1 - y_1)^2 + \dots + w_n^2 (x_n - y_n)^2}$
♦ Update weights based on nearest neighbor
  ♦ Class correct: increase weight
  ♦ Class incorrect: decrease weight
  ♦ Amount of change for the i-th attribute depends on $|x_i - y_i|$

Rectangular generalizations

♦ Nearest-neighbor rule is used outside rectangles
♦ Rectangles are rules! (But they can be more conservative than "normal" rules.)
♦ Nested rectangles are rules with exceptions
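The sketch referenced above — the weighted distance in Python (an illustrative helper):

  def weighted_distance(x, y, weights):
      """Weighted Euclidean distance used by IB4 (weights may be class-specific)."""
      return sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y)) ** 0.5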
Generalized exemplars

♦ Generalize instances into hyperrectangles
  ♦ Online: incrementally modify rectangles
  ♦ Offline version: seek small set of rectangles that cover the instances
♦ Important design decisions:
  ♦ Allow overlapping rectangles? Requires conflict resolution
  ♦ Allow nested rectangles?
  ♦ Dealing with uncovered instances?

Separating generalized exemplars

[Figure: exemplars of Class 1 and Class 2 with the separation line between them]
Generalized distance functions

♦ Given: some transformation operations on attributes
♦ K*: similarity = probability of transforming instance A into B by chance
  ♦ Average over all transformation paths
  ♦ Weight paths according to their probability (need way of measuring this)
♦ Uniform way of dealing with different attribute types
♦ Easily generalized to give distance between sets of instances

Numeric prediction

♦ Counterparts exist for all schemes previously discussed: decision trees, rule learners, SVMs, etc.
♦ (Almost) all classification schemes can be applied to regression problems using discretization:
  ♦ Discretize the class into intervals
  ♦ Predict weighted average of interval midpoints
  ♦ Weight according to class probabilities
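A sketch of the discretization trick, assuming some classifier has already produced a probability per interval (names are illustrative):

  def predict_numeric(interval_midpoints, class_probabilities):
      """Regression via classification: weighted average of interval midpoints."""
      return sum(m * p for m, p in zip(interval_midpoints, class_probabilities))

  # E.g. three intervals with midpoints 5, 15, 25 and predicted probabilities:
  print(predict_numeric([5.0, 15.0, 25.0], [0.2, 0.5, 0.3]))  # 15.5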
Regression trees

♦ Like decision trees, but:
  ♦ Splitting criterion: minimize intra-subset variation
  ♦ Termination criterion: std dev becomes small
  ♦ Pruning criterion: based on numeric error measure
  ♦ Prediction: leaf predicts average class value of instances
♦ Piecewise constant functions
♦ Easy to interpret
♦ More sophisticated version: model trees

Model trees

♦ Build a regression tree
♦ Each leaf ⇒ linear regression function
♦ Smoothing: factor in ancestor's predictions
  ♦ Smoothing formula: $p' = \frac{np + kq}{n + k}$, where $p$ is the prediction passed up from below, $q$ the value predicted by the model at this node, $n$ the number of instances reaching the node below, and $k$ a smoothing constant
  ♦ Same effect can be achieved by incorporating ancestor models into the leaves
♦ Need linear regression function at each node
♦ At each node, use only a subset of attributes
  ♦ Those occurring in subtree
  ♦ (+ maybe those occurring in path to the root)
♦ Fast: tree usually uses only a small subset of the attributes
Building the tree

♦ Splitting: standard deviation reduction (a code sketch follows the next slide):
  $SDR = \mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i)$
♦ Termination:
  ♦ Standard deviation < 5% of its value on full training set
  ♦ Too few instances remain (e.g. < 4)
♦ Pruning:
  ♦ Heuristic estimate of absolute error of LR models: $\frac{n + \nu}{n - \nu} \times$ average absolute error, with $n$ instances and $\nu$ parameters
  ♦ Greedily remove terms from LR models to minimize estimated error
  ♦ Heavy pruning: single model may replace whole subtree
  ♦ Proceed bottom-up: compare error of LR model at internal node to error of subtree

Nominal attributes

♦ Convert nominal attributes to binary ones
♦ Sort attribute by average class value
♦ If attribute has k values, generate k – 1 binary attributes
  ♦ The i-th is 0 if the value lies within the first i, otherwise 1
♦ Treat binary attributes as numeric
♦ Can prove: best split on one of the new attributes is the best (binary) split on original
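As referenced above, a small Python sketch of the SDR computation (an illustrative helper, using the population standard deviation):

  import statistics

  def sdr(parent_values, child_value_lists):
      """Standard deviation reduction achieved by splitting parent into children."""
      n = len(parent_values)
      weighted_child_sd = sum(
          len(child) / n * statistics.pstdev(child) for child in child_value_lists
      )
      return statistics.pstdev(parent_values) - weighted_child_sd

  print(sdr([1, 2, 3, 10, 11, 12], [[1, 2, 3], [10, 11, 12]]))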
Missing values

♦ Modify splitting criterion: $SDR = \frac{m}{|T|} \times \left[\mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|}\,\mathrm{sd}(T_i)\right]$, where $m$ is the number of instances without missing values
♦ To determine which subset an instance goes into, use surrogate splitting
  ♦ Split on the attribute whose correlation with original is greatest
  ♦ Problem: complex and time-consuming
  ♦ Simple solution: always use the class

Surrogate splitting based on class

♦ Choose split point based on instances with known values
♦ Split point divides instances into 2 subsets
  ♦ L (smaller class average)
  ♦ R (larger)
♦ m is the average of the two averages
♦ For an instance with a missing value:
  ♦ Choose L if class value < m
  ♦ Otherwise R
♦ Test set: replace missing value with average
♦ Once full tree is built, replace missing values with averages of corresponding leaf nodes
Pseudo-code for M5'

♦ Four methods:
  ♦ Main method: MakeModelTree
  ♦ Method for splitting: split
  ♦ Method for pruning: prune
  ♦ Method that computes error: subtreeError
♦ We'll briefly look at each method in turn
♦ Assume that linear regression method performs attribute subset selection based on error

MakeModelTree

  MakeModelTree (instances)
  {
    SD = sd(instances)
    for each k-valued nominal attribute
      convert into k-1 synthetic binary attributes
    root = newNode
    root.instances = instances
    split(root)
    prune(root)
    printTree(root)
  }

split

  split(node)
  {
    if sizeof(node.instances) < 4 or
       sd(node.instances) < 0.05*SD
      node.type = LEAF
    else
      node.type = INTERIOR
      for each attribute
        for all possible split positions of attribute
          calculate the attribute's SDR
      node.attribute = attribute with maximum SDR
      split(node.left)
      split(node.right)
  }

prune

  prune(node)
  {
    if node = INTERIOR then
      prune(node.leftChild)
      prune(node.rightChild)
      node.model = linearRegression(node)
      if subtreeError(node) > error(node) then
        node.type = LEAF
  }
subtreeError

  subtreeError(node)
  {
    l = node.left; r = node.right
    if node = INTERIOR then
      return (sizeof(l.instances)*subtreeError(l)
              + sizeof(r.instances)*subtreeError(r))
             / sizeof(node.instances)
    else
      return error(node)
  }

Model tree for servo data

[Figure: model tree learned from the servo data, showing the result of merging]
Rules from model trees

♦ PART algorithm generates classification rules by building partial decision trees
♦ Can use the same method to build rule sets for regression
  ♦ Use model trees instead of decision trees
  ♦ Use variance instead of entropy to choose node to expand when building partial tree
♦ Rules will have linear models on right-hand side
♦ Caveat: using smoothed trees may not be appropriate due to separate-and-conquer strategy

Locally weighted regression

♦ Numeric prediction that combines
  ♦ instance-based learning
  ♦ linear regression
♦ "Lazy":
  ♦ computes regression function at prediction time
  ♦ works incrementally
♦ Weight training instances according to distance to test instance
  ♦ needs weighted version of linear regression
♦ Advantage: nonlinear approximation
♦ But: slow
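A compact sketch of locally weighted regression with a Gaussian kernel, using NumPy (the bandwidth plays the role of the smoothing parameter discussed on the next slide; all names are illustrative):

  import numpy as np

  def locally_weighted_prediction(X, y, query, bandwidth):
      """Predict at `query` by fitting weighted least squares around it."""
      # Gaussian kernel weights from Euclidean distances to the query point.
      distances = np.linalg.norm(X - query, axis=1)
      w = np.exp(-(distances ** 2) / (2 * bandwidth ** 2))
      # Add an intercept column, then solve the weighted normal equations.
      Xb = np.hstack([np.ones((len(X), 1)), X])
      W = np.diag(w)
      beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
      return np.concatenate([[1.0], query]) @ beta

  X = np.array([[0.0], [1.0], [2.0], [3.0]])
  y = np.array([0.0, 1.0, 4.0, 9.0])        # nonlinear target
  print(locally_weighted_prediction(X, y, np.array([1.5]), bandwidth=1.0))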
Design decisions

♦ Weighting function:
  ♦ Inverse Euclidean distance
  ♦ Gaussian kernel applied to Euclidean distance
  ♦ Triangular kernel used the same way
  ♦ etc.
♦ Smoothing parameter is used to scale the distance function
  ♦ Multiply distance by inverse of this parameter
  ♦ Possible choice: distance of k-th nearest training instance (makes it data dependent)

Discussion

♦ Regression trees were introduced in CART
♦ Quinlan proposed model tree method (M5)
♦ M5': slightly improved, publicly available
♦ Quinlan also investigated combining instance-based learning with M5
♦ CUBIST: Quinlan's commercial rule learner for numeric prediction
♦ Interesting comparison: neural nets vs. M5
Clustering: how many clusters?

♦ How to choose k in k-means? Possibilities:
  ♦ Choose k that minimizes cross-validated squared distance to cluster centers
  ♦ Use penalized squared distance on the training data (e.g. using an MDL criterion)
  ♦ Apply k-means recursively with k = 2 and use stopping criterion (e.g. based on MDL)
    ♦ Seeds for subclusters can be chosen by seeding along direction of greatest variance in cluster (one standard deviation away in each direction from cluster center of parent cluster)
    ♦ Implemented in algorithm called X-means (using Bayesian Information Criterion instead of MDL)

Incremental clustering

♦ Heuristic approach (COBWEB/CLASSIT): form a hierarchy of clusters incrementally
♦ Start: tree consists of empty root node
♦ Then:
  ♦ add instances one by one
  ♦ update tree appropriately at each stage
  ♦ to update, find the right leaf for an instance
  ♦ May involve restructuring the tree
♦ Base update decisions on category utility
Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: hierarchies built as instances are added one by one; merge best host and runner-up, and consider splitting the best host if merging doesn't help]
Final hierarchy

[Figure: final hierarchy over the weather instances, with the table of instances A–D repeated alongside. Oops: clusters a and b are actually very similar]

Example: the iris data (subset)

[Figure: hierarchy produced by incremental clustering on a subset of the iris data]
Clustering with cutoff

[Figure: the iris-subset hierarchy clustered with a cutoff applied]

Category utility

♦ Category utility: quadratic loss function defined on conditional probabilities:
  $CU(C_1, C_2, \dots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}$
♦ Every instance in a different category ⇒ numerator becomes $n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2$, where $n$ is the number of attributes (hence the division by k, which penalizes this extreme)
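A minimal Python sketch of category utility for nominal attributes (the representation — each cluster a list of attribute-value tuples — is an assumption made here for illustration):

  from collections import Counter

  def category_utility(clusters):
      """Category utility; `clusters` is a list of instance lists,
      each instance a tuple of nominal attribute values."""
      all_instances = [inst for cluster in clusters for inst in cluster]
      n_total = len(all_instances)
      n_attrs = len(all_instances[0])

      def sq_prob_sum(instances):
          # sum over attributes i and values j of Pr[a_i = v_ij]^2
          total = 0.0
          for i in range(n_attrs):
              counts = Counter(inst[i] for inst in instances)
              total += sum((c / len(instances)) ** 2 for c in counts.values())
          return total

      base = sq_prob_sum(all_instances)
      weighted = sum(len(c) / n_total * (sq_prob_sum(c) - base) for c in clusters)
      return weighted / len(clusters)

  print(category_utility([[("sunny", "hot"), ("sunny", "mild")],
                          [("rainy", "cool"), ("rainy", "cool")]]))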
Numeric attributes

♦ Assume normal distribution: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
♦ Thus the sum $\sum_j \Pr[a_i = v_{ij}]^2$ becomes the integral $\int f(a_i)^2\, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}$
♦ Prespecified minimum variance: the acuity parameter

Probability-based clustering

♦ Problems with heuristic approach:
  ♦ Division by k?
  ♦ Order of examples?
  ♦ Are restructuring operations sufficient?
  ♦ Is result at least local minimum of category utility?
♦ Probabilistic perspective ⇒ seek the most likely clusters given the data
♦ Also: instance belongs to a particular cluster with a certain probability
Finite mixtures

♦ Model data using a mixture of distributions
♦ One cluster, one distribution
  ♦ governs probabilities of attribute values in that cluster
♦ Finite mixtures: finite number of clusters
♦ Individual distributions are normal (usually)
♦ Combine distributions using cluster weights

Two-class mixture model

[Figure: a two-class dataset of numeric values labeled A and B (e.g. A 51, B 64, ...) and the two-component normal mixture fitted to it]
Using the mixture model

♦ Probability that instance x belongs to cluster A:
  $\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x;\, \mu_A, \sigma_A)\, p_A}{\Pr[x]}$
  with $f(x;\, \mu, \sigma)$ the normal density
♦ Probability of an instance given the clusters:
  $\Pr[x \mid \text{the clusters}] = \sum_i \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]$

Learning the clusters

♦ Assume: we know there are k clusters
♦ Learn the clusters ⇒ determine their parameters, i.e. means and standard deviations
♦ Performance criterion: probability of training data given the clusters
♦ EM algorithm: finds a local maximum of the likelihood
EM algorithm

♦ EM = Expectation–Maximization
♦ Generalize k-means to probabilistic setting
♦ Iterative procedure:
  ♦ E "expectation" step: calculate cluster probability for each instance
  ♦ M "maximization" step: estimate distribution parameters from cluster probabilities
♦ Store cluster probabilities as instance weights
♦ Stop when improvement is negligible

More on EM

♦ Estimate parameters from weighted instances:
  $\mu_A = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \sigma_A^2 = \frac{\sum_i w_i (x_i - \mu_A)^2}{\sum_i w_i}$
♦ Stop when log-likelihood saturates
♦ Log-likelihood: $\sum_i \log\left(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\right)$
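A minimal sketch of this procedure for a one-dimensional two-component mixture, using NumPy (initialization and stopping rule are simplified here; a fixed iteration count stands in for checking log-likelihood saturation):

  import numpy as np

  def norm_pdf(x, mu, sd):
      return np.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (np.sqrt(2 * np.pi) * sd)

  def em_two_gaussians(x, iterations=100):
      """EM for a two-component 1-D Gaussian mixture; a minimal sketch."""
      mu_a, mu_b = x.min(), x.max()          # crude initialization
      sd_a = sd_b = x.std()
      p_a = 0.5
      for _ in range(iterations):
          # E step: posterior probability that each point belongs to cluster A
          like_a = p_a * norm_pdf(x, mu_a, sd_a)
          like_b = (1 - p_a) * norm_pdf(x, mu_b, sd_b)
          w = like_a / (like_a + like_b)
          # M step: weighted parameter estimates, as in the formulas above
          mu_a, mu_b = np.average(x, weights=w), np.average(x, weights=1 - w)
          sd_a = np.sqrt(np.average((x - mu_a) ** 2, weights=w))
          sd_b = np.sqrt(np.average((x - mu_b) ** 2, weights=1 - w))
          p_a = w.mean()
      return mu_a, sd_a, mu_b, sd_b, p_a

  x = np.concatenate([np.random.normal(50, 5, 100), np.random.normal(65, 2, 100)])
  print(em_two_gaussians(x))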
Extending the mixture model

♦ More than two distributions: easy
♦ Several attributes: easy—assuming independence!
♦ Correlated attributes: difficult
  ♦ Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  ♦ n attributes: need to estimate n + n(n+1)/2 parameters

More mixture model extensions

♦ Nominal attributes: easy if independent
♦ Correlated nominal attributes: difficult
  ♦ Two correlated attributes ⇒ $v_1 v_2$ parameters
♦ Missing values: easy
♦ Can use other distributions than normal:
  ♦ "log-normal" if predetermined minimum is given
  ♦ "log-odds" if bounded from above and below
  ♦ Poisson for attributes that are integer counts
♦ Use cross-validation to estimate k!
Bayesian clustering

♦ Problem: many parameters ⇒ EM overfits
♦ Bayesian approach: give every parameter a prior probability distribution
  ♦ Incorporate prior into overall likelihood figure
  ♦ Penalizes introduction of parameters
  ♦ E.g.: Laplace estimator for nominal attributes
♦ Can also have prior on number of clusters!
♦ Implementation: NASA's AUTOCLASS

Discussion

♦ Can interpret clusters by using supervised learning
  ♦ post-processing step
♦ Decrease dependence between attributes?
  ♦ pre-processing step
  ♦ E.g. use principal component analysis
♦ Can be used to fill in missing values
♦ Key advantage of probabilistic clustering:
  ♦ Can estimate likelihood of data
  ♦ Use it to compare different models objectively
From naïve Bayes to Bayesian networks

♦ Naïve Bayes assumes: attributes conditionally independent given the class
♦ Doesn't hold in practice but classification accuracy often high
♦ However: sometimes performance much worse than e.g. decision tree
♦ Can we eliminate the assumption?

Enter Bayesian networks

♦ Graphical models that can represent any probability distribution
♦ Graphical representation: directed acyclic graph, one node for each attribute
♦ Overall probability distribution factorized into component distributions
♦ Graph's nodes hold component distributions (conditional distributions)
[Figure: two Bayesian networks for the weather data]
Computing the class probabilities

♦ Two steps: computing a product of probabilities for each class, and normalization (see the sketch after the next slide)
  ♦ For each class value:
    ♦ Take all attribute values and class value
    ♦ Look up corresponding entries in conditional probability distribution tables
    ♦ Take the product of all probabilities
  ♦ Divide the product for each class by the sum of the products (normalization)

Why can we do this? (Part I)

♦ Single assumption: values of a node's parents completely determine probability distribution for current node
♦ Means that node/attribute is conditionally independent of other ancestors given parents
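Returning to the two steps above, a minimal sketch in Python, assuming CPTs are stored as dictionaries keyed by (node value, tuple of parent values) — all names here are illustrative:

  def class_probabilities(network, instance, class_values):
      """network: maps node -> (parents list, CPT dict keyed by (value, parent tuple)).
      instance: dict of attribute values. Returns normalized class probabilities."""
      scores = {}
      for c in class_values:
          assignment = dict(instance, **{"class": c})
          product = 1.0
          for node, (parents, cpt) in network.items():
              parent_vals = tuple(assignment[p] for p in parents)
              product *= cpt[(assignment[node], parent_vals)]  # CPT lookup
          scores[c] = product
      total = sum(scores.values())
      return {c: s / total for c, s in scores.items()}  # normalization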
Why can we do this? (Part II)

♦ Chain rule from probability theory:
  $\Pr[a_1, a_2, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \dots, a_1]$
♦ Because of our assumption from the previous slide:
  $\Pr[a_1, a_2, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_i\text{'s parents}]$

Learning Bayes nets

♦ Basic components of algorithms for learning Bayes nets:
  ♦ Method for evaluating the goodness of a given network
    ♦ Measure based on probability of training data given the network (or the logarithm thereof)
  ♦ Method for searching through space of possible networks
    ♦ Amounts to searching through sets of edges because nodes are fixed
Problem: overfitting

♦ Can't just maximize probability of the training data
  ♦ Because then it's always better to add more edges (fit the training data more closely)
♦ Need to use cross-validation or some penalty for complexity of the network (a scoring sketch follows the next slide)
  ♦ AIC measure: $\text{AIC score} = -LL + K$
  ♦ MDL measure: $\text{MDL score} = -LL + \frac{K}{2} \log N$
  ♦ LL: log-likelihood (log of probability of data), K: number of free parameters, N: number of instances
♦ Another possibility: Bayesian approach with prior distribution over networks

Searching for a good structure

♦ Task can be simplified: can optimize each node separately
  ♦ Because probability of an instance is product of individual nodes' probabilities
  ♦ Also works for AIC and MDL criterion because penalties just add up
♦ Can optimize node by adding or removing edges from other nodes
♦ Must not introduce cycles!
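The scoring sketch referenced above — a tiny Python helper implementing the penalized scores as defined on the slide (lower is better; the function name is illustrative):

  import math

  def network_score(log_likelihood, n_free_params, n_instances, measure="MDL"):
      """Penalized network score: AIC = -LL + K, MDL = -LL + (K/2) log N."""
      if measure == "AIC":
          return -log_likelihood + n_free_params
      return -log_likelihood + n_free_params / 2 * math.log(n_instances)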
The K2 algorithm

♦ Starts with given ordering of nodes (attributes)
♦ Processes each node in turn
♦ Greedily tries adding edges from previous nodes to current node
♦ Moves to next node when current node can't be optimized further
♦ Result depends on initial order

Some tricks

♦ Sometimes it helps to start the search with a naïve Bayes network
♦ It can also help to ensure that every node is in the Markov blanket of the class node
  ♦ Markov blanket of a node includes all parents, children, and children's parents of that node
  ♦ Given values for Markov blanket, node is conditionally independent of nodes outside blanket
  ♦ I.e. node is irrelevant to classification if not in Markov blanket of class node
Other algorithms

♦ Extending K2 to consider greedily adding or deleting edges between any pair of nodes
♦ Further step: considering inverting the direction of edges
♦ TAN (Tree Augmented Naïve Bayes):
  ♦ Starts with naïve Bayes
  ♦ Considers adding second parent to each node (apart from class node)
  ♦ Efficient algorithm exists

Likelihood vs. conditional likelihood

♦ In classification what we really want is to maximize probability of class given other attributes
  ♦ Not probability of the instances
♦ But: no closed-form solution for probabilities in nodes' tables that maximize this
♦ However: can easily compute conditional probability of data based on given network
  ♦ Seems to work well when used for network scoring
Data structures for fast learning

♦ Learning Bayes nets involves a lot of counting for computing conditional probabilities
♦ Naïve strategy for storing counts: hash table
  ♦ Runs into memory problems very quickly
♦ More sophisticated strategy: all-dimensions (AD) tree
  ♦ Analogous to kD-tree for numeric data
  ♦ Stores counts in a tree but in a clever way such that redundancy is eliminated
  ♦ Only makes sense to use it for large datasets

AD tree example

[Figure: AD tree storing attribute-value counts for a small dataset]
Building an AD tree

♦ Assume each attribute in the data has been assigned an index; the root node is given index zero
♦ Then, expand node for attribute i with the values of all attributes j > i
♦ Two important restrictions:
  ♦ Most populous expansion for each attribute is omitted (breaking ties arbitrarily)
  ♦ Expansions with counts that are zero are also omitted

Discussion

♦ We have assumed: discrete data, no missing values, no new nodes
♦ Different method of using Bayes nets for classification: Bayesian multinets
  ♦ I.e. build one network for each class and make prediction using Bayes' rule
♦ Different class of learning methods for Bayes nets: testing conditional independence assertions
♦ Can also build Bayes nets for regression tasks