Binary vs multiway splits

Examples

Backpropagation

♦ How to learn the weights, given the network structure?
♦ Cannot simply use the perceptron learning rule, because we have hidden layer(s)
♦ Function we are trying to minimize: the error
♦ Can use a general function-minimization technique called gradient descent
  - Need a differentiable activation function: use the sigmoid function f(x) = 1 / (1 + e^(-x)) instead of the threshold function
  - Need a differentiable error function: can't use the zero-one loss, but can use the squared error
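As a small illustration of these two requirements, here is a minimal Python sketch (the function names are mine, not from the slides) of the sigmoid, its derivative, and the squared-error loss that backpropagation differentiates:

import math

def sigmoid(x):
    # Differentiable replacement for the threshold (step) activation.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def squared_error(target, output):
    # Differentiable error function used in place of the zero-one loss.
    return 0.5 * (target - output) ** 2

print(sigmoid(0.0), sigmoid_derivative(0.0), squared_error(1.0, sigmoid(0.0)))
# prints 0.5 0.25 0.125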

The two activation functions

Gradient descent example

♦ Function: x^2 + 1
♦ Derivative: 2x
♦ Learning rate: 0.1
♦ Start value: 4
♦ Can only find a local minimum!
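A minimal Python sketch of this gradient-descent example (function x^2 + 1, derivative 2x, learning rate 0.1, start value 4); the variable names are mine:

def f(x):
    return x ** 2 + 1        # function to minimize

def f_prime(x):
    return 2 * x             # its derivative

x = 4.0                      # start value
learning_rate = 0.1

for step in range(20):
    x -= learning_rate * f_prime(x)   # move against the gradient

print(x, f(x))               # x approaches 0, the minimum of x^2 + 1

For this convex function the local minimum that gradient descent finds is also the global one; the error surface of a multilayer network generally has many local minima.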


Minimizing the error I

♦ Need to find the partial derivative of the error function for each parameter (i.e. weight)

Minimizing the error II

♦ What about the weights for the connections from the input to the hidden layer?


Remarks

♦ The same process works for multiple hidden layers and multiple output units (e.g. for multiple classes)
♦ Can update the weights after all training instances have been processed, or incrementally:
  - batch learning vs. stochastic backpropagation
♦ Weights are initialized to small random values
♦ How to avoid overfitting?
  - Early stopping: use a validation set to check when to stop
  - Weight decay: add a penalty term to the error function
♦ How to speed up learning?
  - Momentum: re-use a proportion of the old weight change
  - Use an optimization method that employs the 2nd derivative

Radial basis function networks

♦ Another type of feedforward network with two layers (plus the input layer)
♦ Hidden units represent points in instance space; activation depends on distance
  - To this end, distance is converted into similarity: a Gaussian activation function
  - The width may be different for each hidden unit
  - Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane
♦ Output layer is the same as in an MLP
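As an illustrative sketch (not code from the book), the Gaussian activation of an RBF hidden unit can be computed from the distance between the input and the unit's centre, with a per-unit width:

import numpy as np

def rbf_activation(x, center, width):
    # Gaussian similarity: points close to the center give activation near 1.
    distance_sq = np.sum((np.asarray(x) - np.asarray(center)) ** 2)
    return np.exp(-distance_sq / (2.0 * width ** 2))

# Two hidden units with different centers and widths; the output layer
# (as in an MLP) would combine these activations with learned weights.
x = [0.4, 0.7]
hidden = [rbf_activation(x, c, w) for c, w in [((0.5, 0.5), 0.3),
                                               ((-1.0, 0.0), 0.8)]]
print(hidden)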

Learning RBF networks

♦ Parameters: centers and widths of the RBFs + weights in the output layer
♦ Can learn the two sets of parameters independently and still get accurate models
  - E.g.: clusters found by k-means can be used to form the basis functions
  - A linear model can then be used based on the fixed RBFs
  - Makes learning RBF networks very efficient
♦ Disadvantage: no built-in attribute weighting based on relevance
♦ RBF networks are related to RBF SVMs

Instance-based learning

♦ Practical problems of the 1-NN scheme:
  - Slow (but: fast tree-based approaches exist)
    Remedy: remove irrelevant data
  - Noise (but: k-NN copes quite well with noise)
    Remedy: remove noisy instances
  - All attributes deemed equally important
    Remedy: weight attributes (or simply select them)
  - Doesn't perform explicit generalization
    Remedy: rule-based NN approach


Learning prototypes

♦ Only those instances involved in a decision need to be stored
♦ Noisy instances should be filtered out
♦ Idea: only use prototypical examples

Speed up, combat noise

♦ IB2: save memory, speed up classification
  - Works incrementally
  - Only incorporates misclassified instances
  - Problem: noisy data gets incorporated
♦ IB3: deal with noise
  - Discard instances that don't perform well
  - Compute confidence intervals for
    1. each instance's success rate
    2. the default accuracy of its class
  - Accept/reject instances (see the sketch below)
    - Accept if the lower limit of 1 exceeds the upper limit of 2
    - Reject if the upper limit of 1 is below the lower limit of 2
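The slide does not give IB3's exact confidence levels, so the sketch below uses plain normal-approximation intervals with the z-values left as illustrative parameters; it only demonstrates the accept/reject test itself:

import math

def proportion_interval(successes, trials, z):
    # Normal-approximation confidence interval for a proportion.
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def ib3_decision(inst_succ, inst_trials, cls_succ, cls_trials,
                 z_accept=1.64, z_reject=1.28):   # illustrative confidence levels only
    inst_low, _ = proportion_interval(inst_succ, inst_trials, z_accept)
    _, cls_high = proportion_interval(cls_succ, cls_trials, z_accept)
    if inst_low > cls_high:
        return "accept"          # instance clearly beats its class's default accuracy
    _, inst_high = proportion_interval(inst_succ, inst_trials, z_reject)
    cls_low, _ = proportion_interval(cls_succ, cls_trials, z_reject)
    if inst_high < cls_low:
        return "reject"          # instance clearly performs worse
    return "undecided"

print(ib3_decision(18, 20, 300, 600))   # a well-performing instance -> "accept"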


Weight attributes

♦ IB4: weight each attribute (weights can be class-specific)
♦ Weighted Euclidean distance (sketched in code below):
  √( w_1^2 (x_1 - y_1)^2 + ... + w_k^2 (x_k - y_k)^2 )
♦ Update weights based on the nearest neighbor
  - Class correct: increase weight
  - Class incorrect: decrease weight
  - The amount of change for the i-th attribute depends on |x_i - y_i|

Rectangular generalizations

♦ The nearest-neighbor rule is used outside the rectangles
♦ Rectangles are rules! (But they can be more conservative than "normal" rules.)
♦ Nested rectangles are rules with exceptions
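A minimal sketch of the weighted Euclidean distance from the "Weight attributes" slide, together with the kind of weight update IB4 performs; the exact update rule is not given on the slide, so the increment used here is purely illustrative:

import math

def weighted_distance(x, y, weights):
    # Weighted Euclidean distance: sqrt(sum_i w_i^2 * (x_i - y_i)^2)
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights)))

def update_weights(x, neighbor, weights, same_class, step=0.05):
    # Increase weights after a correct classification, decrease otherwise;
    # the size of the change for attribute i depends on |x_i - y_i|.
    # (Illustrative rule, not the one from the IB4 paper.)
    for i, (a, b) in enumerate(zip(x, neighbor)):
        change = step * (1.0 - abs(a - b))
        weights[i] += change if same_class else -change
    return weights

w = [1.0, 1.0, 1.0]
print(weighted_distance([0.2, 0.5, 0.9], [0.1, 0.5, 0.4], w))
print(update_weights([0.2, 0.5, 0.9], [0.1, 0.5, 0.4], w, same_class=True))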

Generalized exemplars

♦ Generalize instances into hyperrectangles
  - Online: incrementally modify rectangles
  - Offline version: seek a small set of rectangles that cover the instances
♦ Important design decisions:
  - Allow overlapping rectangles?
    Requires conflict resolution
  - Allow nested rectangles?
  - How to deal with uncovered instances?

Separating generalized exemplars

(Figure: instances of Class 1 and Class 2 grouped into rectangles, with a separation line between them.)


Generalized distance functions

♦ Given: some transformation operations on attributes
♦ K*: similarity = probability of transforming instance A into B by chance
  - Average over all transformation paths
  - Weight paths according to their probability (need a way of measuring this)
♦ Uniform way of dealing with different attribute types
♦ Easily generalized to give a distance between sets of instances

Numeric prediction

♦ Counterparts exist for all schemes previously discussed
  - Decision trees, rule learners, SVMs, etc.
♦ (Almost) all classification schemes can be applied to regression problems using discretization
  - Discretize the class into intervals
  - Predict the weighted average of the interval midpoints
  - Weight according to the class probabilities
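A small sketch of the discretization idea: given class probabilities over the intervals from any classifier, the numeric prediction is the probability-weighted average of the interval midpoints (intervals and probabilities below are made up):

# Intervals the numeric class was discretized into, plus the class-probability
# distribution some classifier produced for one test instance.
intervals = [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0)]
class_probs = [0.2, 0.5, 0.3]

midpoints = [(lo + hi) / 2.0 for lo, hi in intervals]
prediction = sum(p * m for p, m in zip(class_probs, midpoints))
print(prediction)   # 0.2*5 + 0.5*15 + 0.3*25 = 16.0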


Regression trees

♦ Like decision trees, but:
  - Splitting criterion: minimize intra-subset variation
  - Termination criterion: standard deviation becomes small
  - Pruning criterion: based on a numeric error measure
  - Prediction: a leaf predicts the average class value of its instances
♦ Piecewise constant functions
♦ Easy to interpret
♦ More sophisticated version: model trees

Model trees

♦ Build a regression tree
♦ Each leaf ⇒ linear regression function
♦ Smoothing: factor in the ancestors' predictions
  - Smoothing formula (illustrated below): p' = (n p + k q) / (n + k), where p is the prediction passed up from below, q is the value predicted by the model at this node, n is the number of instances reaching the node below, and k is a smoothing constant
  - The same effect can be achieved by incorporating the ancestor models into the leaves
♦ Need a linear regression function at each node
♦ At each node, use only a subset of the attributes
  - Those occurring in the subtree
  - (+ maybe those occurring on the path to the root)
♦ Fast: the tree usually uses only a small subset of the attributes
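A one-line illustration of the smoothing formula above; the numbers and the value of k are made up:

def smooth(p, q, n, k=15.0):
    # Blend the prediction p coming up from below with the value q predicted by
    # the model at this node; the fewer instances n reach the node below, the
    # more weight the node's own model gets. (k = 15 is an illustrative constant.)
    return (n * p + k * q) / (n + k)

# Leaf model predicts 12.0 from 5 instances; the parent node's model predicts 9.0.
print(smooth(p=12.0, q=9.0, n=5))   # -> 9.75, pulled toward the parent's prediction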

Building the tree

♦ Splitting: standard deviation reduction (a small example follows below)
  SDR = sd(T) - Σ_i ( |T_i| / |T| ) * sd(T_i)
♦ Termination:
  - Standard deviation falls below 5% of its value on the full training set
  - Too few instances remain (e.g. < 4)
♦ Pruning:
  - Heuristic estimate of the absolute error of LR models:
    (n + v) / (n - v) * average_absolute_error
    (n = number of training instances reaching the node, v = number of parameters in the model)
  - Greedily remove terms from the LR models to minimize the estimated error
  - Heavy pruning: a single model may replace a whole subtree
  - Proceed bottom-up: compare the error of the LR model at an internal node to the error of its subtree

Nominal attributes

♦ Convert nominal attributes to binary ones
  - Sort the attribute's values by average class value
  - If the attribute has k values, generate k - 1 binary attributes
  - The i-th is 0 if the value lies within the first i, otherwise 1
♦ Treat the binary attributes as numeric
♦ Can prove: the best split on one of the new attributes is the best (binary) split on the original
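A short sketch of the standard-deviation-reduction splitting criterion from the "Building the tree" slide, evaluated for one candidate split (data made up; uses only Python's statistics module):

from statistics import pstdev

def sdr(parent, subsets):
    # Standard deviation reduction: sd(T) - sum_i |T_i|/|T| * sd(T_i)
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

class_values = [3.1, 2.9, 8.0, 8.4, 3.3, 7.9]
left = [v for v in class_values if v < 5]     # one candidate split on some attribute
right = [v for v in class_values if v >= 5]
print(sdr(class_values, [left, right]))       # large reduction -> promising split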


Missing values

♦ Modify the splitting criterion: weight the SDR by the proportion of instances with a known value for the attribute
♦ To determine which subset an instance with a missing value goes into, use surrogate splitting
  - Split on the attribute whose correlation with the original attribute is greatest
  - Problem: complex and time-consuming
  - Simple solution: always use the class as the surrogate attribute
♦ Test set: replace a missing value with the average

Surrogate splitting based on class

♦ Choose the split point based on the instances with known values
♦ The split point divides the instances into two subsets
  - L (smaller class average)
  - R (larger class average)
♦ m is the average of the two averages
♦ For an instance with a missing value:
  - Choose L if its class value < m
  - Otherwise choose R
♦ Once the full tree is built, replace missing values with the averages of the corresponding leaf nodes
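A tiny sketch of the class-based surrogate decision for one instance (the numbers are made up):

# Class averages of the two subsets produced by the chosen split point.
avg_L = 4.2    # subset with the smaller class average
avg_R = 9.6    # subset with the larger class average
m = (avg_L + avg_R) / 2.0

def route(class_value):
    # Send an instance whose attribute value is missing to L or R using its class.
    return "L" if class_value < m else "R"

print(m, route(5.1), route(8.3))   # 6.9 L R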


Pseudo-code for M5'

♦ Four methods:
  - Main method: MakeModelTree
  - Method for splitting: split
  - Method for pruning: prune
  - Method that computes the error: subtreeError
♦ We'll briefly look at each method in turn
♦ Assume that the linear regression method performs attribute subset selection based on error

MakeModelTree

MakeModelTree (instances)
{
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
}

split

split(node)
{
  if sizeof(node.instances) < 4 or
     sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
}

prune

prune(node)
{
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
}


subtreeError

subtreeError(node)
{
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
}

Model tree for servo data

(Figure: the model tree learned for the servo data, and the result of merging subtrees during pruning.)


Rules from model trees

♦ The PART algorithm generates classification rules by building partial decision trees
♦ The same method can be used to build rule sets for regression
  - Use model trees instead of decision trees
  - Use variance instead of entropy to choose the node to expand when building the partial tree
♦ Rules will have linear models on the right-hand side
♦ Caveat: using smoothed trees may not be appropriate due to the separate-and-conquer strategy

Locally weighted regression

♦ Numeric prediction that combines
  - instance-based learning
  - linear regression
♦ "Lazy":
  - computes the regression function at prediction time
  - works incrementally
♦ Weight the training instances
  - according to their distance to the test instance
  - needs a weighted version of linear regression
♦ Advantage: nonlinear approximation
♦ But: slow
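A minimal sketch of locally weighted regression, assuming a Gaussian weighting function and numpy's least-squares solver for the weighted linear fit; names and data are mine, not from the book:

import numpy as np

def locally_weighted_prediction(X, y, x_query, smoothing=1.0):
    # Fit a linear model at prediction time, weighting each training instance
    # by a Gaussian kernel of its distance to the query point.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    dists = np.linalg.norm(X - x_query, axis=1) / smoothing
    weights = np.exp(-0.5 * dists ** 2)            # Gaussian weighting function
    Xb = np.hstack([np.ones((len(X), 1)), X])      # add an intercept column
    W = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(Xb * W, y * W[:, 0], rcond=None)   # weighted least squares
    return np.concatenate([[1.0], np.atleast_1d(x_query)]) @ coef

X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.0, 1.0, 4.0, 9.0, 16.0]                     # nonlinear target: y = x^2
print(locally_weighted_prediction(X, y, np.array([2.5]), smoothing=1.0))

The smoothing parameter scales the distances, exactly as described on the next slide: a larger value spreads the weight over more training instances and gives a smoother (more nearly linear) fit.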

Design decisions

♦ Weighting function:
  - inverse Euclidean distance
  - Gaussian kernel applied to the Euclidean distance
  - triangular kernel used the same way
  - etc.
♦ A smoothing parameter is used to scale the distance function
  - Multiply the distance by the inverse of this parameter
  - Possible choice: distance to the k-th nearest training instance (makes it data-dependent)

Discussion

♦ Regression trees were introduced in CART
♦ Quinlan proposed the model tree method (M5)
♦ M5': slightly improved, publicly available
♦ Quinlan also investigated combining instance-based learning with M5
♦ CUBIST: Quinlan's commercial rule learner for numeric prediction
♦ Interesting comparison: neural nets vs. M5


Clustering: how many clusters?

♦ How to choose k in k-means? Possibilities:
  - Choose the k that minimizes the cross-validated squared distance to the cluster centers
  - Use a penalized squared distance on the training data (e.g. using an MDL criterion)
  - Apply k-means recursively with k = 2 and use a stopping criterion (e.g. based on MDL)
    - Seeds for the subclusters can be chosen by seeding along the direction of greatest variance in the cluster (one standard deviation away in each direction from the cluster center of the parent cluster)
    - Implemented in the X-means algorithm (using the Bayesian Information Criterion instead of MDL)

Incremental clustering

♦ Heuristic approach (COBWEB/CLASSIT)
♦ Forms a hierarchy of clusters incrementally
♦ Start: the tree consists of an empty root node
♦ Then:
  - add instances one by one
  - update the tree appropriately at each stage
  - to update, find the right leaf for the instance (may involve restructuring the tree)
♦ Base update decisions on category utility


Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

(Figures: the cluster tree as the instances are added one by one.)

♦ Merge the best host and the runner-up
♦ Consider splitting the best host if merging doesn't help

Final hierarchy

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False

(Figure: the final cluster hierarchy. Oops! a and b are actually very similar.)

Example: the iris data (subset)

(Figure: the cluster hierarchy produced for a subset of the iris data.)


Clustering with cutoff

Category utility

♦ Category utility: a quadratic loss function defined on conditional probabilities:
  CU(C_1, ..., C_k) = ( Σ_l Pr[C_l] Σ_i Σ_j ( Pr[a_i = v_ij | C_l]^2 - Pr[a_i = v_ij]^2 ) ) / k
♦ If every instance is put in a different category, Pr[a_i = v_ij | C_l] = 1 for the value the instance actually has, and the numerator becomes at most the number of attributes (this is why the division by k is needed)
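A small sketch that computes category utility for nominal attributes directly from the definition above; the instances and the clustering are made up:

from collections import Counter

def category_utility(clusters, n_attrs):
    # CU = (1/k) * sum_l Pr[C_l] * sum_{i,j} (Pr[a_i=v_ij|C_l]^2 - Pr[a_i=v_ij]^2)
    all_instances = [inst for cluster in clusters for inst in cluster]
    n, k = len(all_instances), len(clusters)
    total = 0.0
    for cluster in clusters:
        inner = 0.0
        for i in range(n_attrs):
            within = Counter(inst[i] for inst in cluster)
            overall = Counter(inst[i] for inst in all_instances)
            inner += sum((c / len(cluster)) ** 2 for c in within.values())
            inner -= sum((c / n) ** 2 for c in overall.values())
        total += (len(cluster) / n) * inner
    return total / k

clusters = [
    [("sunny", "hot"), ("sunny", "mild")],     # cluster 1
    [("rainy", "cool"), ("rainy", "mild")],    # cluster 2
]
print(category_utility(clusters, n_attrs=2))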


Numeric attributes

♦ Assume a normal distribution:
  f(a) = ( 1 / (√(2π) σ) ) exp( -(a - μ)^2 / (2 σ^2) )
♦ The sum over squared value probabilities is replaced by an integral:
  Σ_j Pr[a_i = v_ij]^2  becomes  ∫ f(a_i)^2 da_i = 1 / (2 √π σ_i)
♦ Thus
  CU(C_1, ..., C_k) = (1/k) Σ_l Pr[C_l] (1 / (2 √π)) Σ_i ( 1/σ_il - 1/σ_i )
♦ Prespecified minimum variance: the acuity parameter

Probability-based clustering

♦ Problems with the heuristic approach:
  - Division by k?
  - Order of examples?
  - Are the restructuring operations sufficient?
  - Is the result at least a local minimum of category utility?
♦ Probabilistic perspective ⇒ seek the most likely clusters given the data
♦ Also: an instance belongs to a particular cluster with a certain probability

Finite mixtures

♦ Model the data using a mixture of distributions
♦ One cluster, one distribution
  - governs the probabilities of attribute values in that cluster
♦ Finite mixtures: a finite number of clusters
♦ Individual distributions are (usually) normal
♦ Combine the distributions using cluster weights

Two-class mixture model

(Figure: a sample of values, each generated by cluster A or cluster B, together with the two fitted normal distributions and the mixture-model parameters.)

Using the mixture model

♦ Probability that instance x belongs to cluster A:
  Pr[A | x] = Pr[x | A] Pr[A] / Pr[x] = f(x; μ_A, σ_A) p_A / Pr[x]
  with f(x; μ, σ) = ( 1 / (√(2π) σ) ) exp( -(x - μ)^2 / (2 σ^2) )
♦ Probability of an instance given the clusters:
  Pr[x | the clusters] = Σ_i Pr[x | cluster_i] Pr[cluster_i]

Learning the clusters

♦ Assume: we know there are k clusters
♦ Learn the clusters ⇒ determine their parameters, i.e. the means and standard deviations
♦ Performance criterion: probability of the training data given the clusters
♦ EM algorithm: finds a local maximum of the likelihood


EM algorithm

♦ EM = Expectation-Maximization
  - Generalizes k-means to a probabilistic setting
♦ Iterative procedure:
  - E "expectation" step: calculate the cluster probability for each instance
  - M "maximization" step: estimate the distribution parameters from the cluster probabilities
♦ Store the cluster probabilities as instance weights
♦ Stop when improvement is negligible

More on EM

♦ Estimate the parameters from the weighted instances:
  μ_A = Σ_i w_i x_i / Σ_i w_i
  σ_A^2 = Σ_i w_i (x_i - μ_A)^2 / Σ_i w_i
♦ Stop when the log-likelihood saturates
♦ Log-likelihood:
  Σ_i log( p_A Pr[x_i | A] + p_B Pr[x_i | B] )
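A compact sketch of EM for the one-dimensional two-class mixture described above, using only numpy; the data and starting values are made up, and a fixed number of iterations is used rather than checking for log-likelihood saturation, for brevity:

import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def em_two_class(x, mu_a, sigma_a, mu_b, sigma_b, p_a=0.5, iterations=50):
    for _ in range(iterations):
        # E step: cluster probability (weight) of each instance for cluster A
        dens_a = p_a * normal_pdf(x, mu_a, sigma_a)
        dens_b = (1 - p_a) * normal_pdf(x, mu_b, sigma_b)
        w = dens_a / (dens_a + dens_b)
        # M step: re-estimate the parameters from the weighted instances
        mu_a = np.sum(w * x) / np.sum(w)
        mu_b = np.sum((1 - w) * x) / np.sum(1 - w)
        sigma_a = np.sqrt(np.sum(w * (x - mu_a) ** 2) / np.sum(w))
        sigma_b = np.sqrt(np.sum((1 - w) * (x - mu_b) ** 2) / np.sum(1 - w))
        p_a = np.mean(w)
    log_likelihood = np.sum(np.log(p_a * normal_pdf(x, mu_a, sigma_a)
                                   + (1 - p_a) * normal_pdf(x, mu_b, sigma_b)))
    return (mu_a, sigma_a), (mu_b, sigma_b), p_a, log_likelihood

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 60), rng.normal(65, 2, 40)])
print(em_two_class(data, mu_a=45.0, sigma_a=10.0, mu_b=70.0, sigma_b=10.0))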

Extending the mixture model

♦ More than two distributions: easy
♦ Several attributes: easy, assuming independence!
♦ Correlated attributes: difficult
  - Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  - n attributes: need to estimate n + n(n+1)/2 parameters

More mixture model extensions

♦ Nominal attributes: easy if independent
♦ Correlated nominal attributes: difficult
  - Two correlated attributes ⇒ v_1 v_2 parameters
♦ Missing values: easy
♦ Can use distributions other than the normal:
  - "log-normal" if a predetermined minimum is given
  - "log-odds" if bounded from above and below
  - Poisson for attributes that are integer counts
♦ Use cross-validation to estimate k!


Bayesian clustering

♦ Problem: many parameters ⇒ EM overfits
♦ Bayesian approach: give every parameter a prior probability distribution
  - Incorporate the prior into the overall likelihood figure
  - Penalizes the introduction of parameters
♦ E.g.: Laplace estimator for nominal attributes
♦ Can also have a prior on the number of clusters!
♦ Implementation: NASA's AUTOCLASS

Discussion

♦ Can interpret clusters by using supervised learning
  - post-processing step
♦ Decrease dependence between attributes?
  - pre-processing step
  - e.g. use principal component analysis
♦ Can be used to fill in missing values
♦ Key advantage of probabilistic clustering:
  - can estimate the likelihood of the data
  - use it to compare different models objectively


From naïve Bayes to Bayesian Networks

♦ Naïve Bayes assumes: attributes are conditionally independent given the class
♦ Doesn't hold in practice, but classification accuracy is often high
♦ However: sometimes performance is much worse than e.g. a decision tree
♦ Can we eliminate the assumption?

Enter Bayesian networks

♦ Graphical models that can represent any probability distribution
♦ Graphical representation: a directed acyclic graph, one node for each attribute
♦ Overall probability distribution factorized into component distributions
♦ The graph's nodes hold the component distributions (conditional distributions)

Can we eliminate the assumption?

(Figures: two Bayesian networks for the weather data.)


Computing the class probabilities

♦ Two steps: computing a product of probabilities for each class, and normalization (sketched below)
  - For each class value:
    - Take all attribute values and the class value
    - Look up the corresponding entries in the conditional probability distribution tables
    - Take the product of all these probabilities
  - Divide the product for each class by the sum of the products (normalization)

Why can we do this? (Part I)

♦ Single assumption: the values of a node's parents completely determine the probability distribution for the current node
♦ This means that a node/attribute is conditionally independent of its other ancestors, given its parents
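A minimal sketch of the two steps, using a hypothetical tiny network and made-up probability tables (not the book's weather-data network):

# Hypothetical network: class -> attr1, class -> attr2 (naive-Bayes-shaped
# for brevity). Each node holds a table indexed by its parents' values.
p_class = {"yes": 0.6, "no": 0.4}
p_attr1 = {("yes", "sunny"): 0.3, ("yes", "rainy"): 0.7,
           ("no", "sunny"): 0.8, ("no", "rainy"): 0.2}
p_attr2 = {("yes", "high"): 0.4, ("yes", "normal"): 0.6,
           ("no", "high"): 0.9, ("no", "normal"): 0.1}

instance = {"attr1": "sunny", "attr2": "normal"}

# Step 1: product of the looked-up probabilities, once per class value.
products = {c: p_class[c]
               * p_attr1[(c, instance["attr1"])]
               * p_attr2[(c, instance["attr2"])]
            for c in p_class}

# Step 2: normalization.
total = sum(products.values())
print({c: p / total for c, p in products.items()})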


Why can we do this? (Part II)

♦ Chain rule from probability theory:
  Pr[a_1, a_2, ..., a_n] = Π_i Pr[a_i | a_(i-1), ..., a_1]
♦ Because of our assumption from the previous slide:
  Pr[a_1, a_2, ..., a_n] = Π_i Pr[a_i | a_i's parents]

Learning Bayes nets

♦ Basic components of algorithms for learning Bayes nets:
  - A method for evaluating the goodness of a given network
    - Measure based on the probability of the training data given the network (or the logarithm thereof)
  - A method for searching through the space of possible networks
    - Amounts to searching through sets of edges, because the nodes are fixed

Problem: overfitting

♦ Can't just maximize the probability of the training data
  - Because then it is always better to add more edges (fit the training data more closely)
♦ Need to use cross-validation or some penalty for the complexity of the network (see the sketch below)
  - AIC measure:  AIC score = -LL + K
  - MDL measure:  MDL score = -LL + (K/2) log N
  - LL: log-likelihood (log of the probability of the data), K: number of free parameters, N: number of instances
♦ Another possibility: a Bayesian approach with a prior distribution over networks

Searching for a good structure

♦ The task can be simplified: each node can be optimized separately
  - Because the probability of an instance is the product of the individual nodes' probabilities
  - This also works for the AIC and MDL criteria, because the penalties just add up
♦ Can optimize a node by adding or removing edges from other nodes
♦ Must not introduce cycles!
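The scoring measures above are easy to compute once the log-likelihood is known; a tiny sketch with illustrative numbers (log base 2 is an assumption, the slide only writes "log N"):

import math

def aic_score(log_likelihood, num_params):
    return -log_likelihood + num_params

def mdl_score(log_likelihood, num_params, num_instances):
    # log base 2 (bits) assumed here
    return -log_likelihood + 0.5 * num_params * math.log2(num_instances)

# Two candidate networks for the same data: the denser one fits better
# (higher log-likelihood) but pays a larger complexity penalty.
print(aic_score(-1200.0, 25), mdl_score(-1200.0, 25, 1000))
print(aic_score(-1185.0, 60), mdl_score(-1185.0, 60, 1000))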


The K2 algorithm

♦ Starts with a given ordering of the nodes (attributes)
♦ Processes each node in turn
♦ Greedily tries adding edges from previously processed nodes to the current node (see the sketch below)
♦ Moves on to the next node when the current node can't be optimized further
♦ Result depends on the initial order

Some tricks

♦ Sometimes it helps to start the search with a naïve Bayes network
♦ It can also help to ensure that every node is in the Markov blanket of the class node
  - The Markov blanket of a node includes all parents, children, and children's parents of that node
  - Given values for the Markov blanket, a node is conditionally independent of nodes outside the blanket
  - I.e. a node is irrelevant to classification if it is not in the Markov blanket of the class node
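A sketch of the K2-style greedy loop; the scoring function stands for whatever network-scoring measure is used (e.g. the AIC/MDL scores above) and is left abstract, and the node names and toy score are my own:

def k2_search(node_order, score, max_parents=2):
    # Greedy K2-style structure search.
    # score(node, parents) returns a goodness value; it is left abstract here.
    # Only earlier nodes may become parents, so no cycles can arise.
    parents = {node: set() for node in node_order}
    for i, node in enumerate(node_order):
        current = score(node, parents[node])
        improved = True
        while improved and len(parents[node]) < max_parents:
            improved = False
            candidates = [c for c in node_order[:i] if c not in parents[node]]
            scored = [(score(node, parents[node] | {c}), c) for c in candidates]
            if scored:
                best_score, best_candidate = max(scored)
                if best_score > current:
                    parents[node].add(best_candidate)
                    current, improved = best_score, True
    return parents

# Toy scoring function, purely illustrative: it favors one specific edge.
toy = lambda node, ps: 1.0 if (node == "humidity" and "outlook" in ps) else 0.0
print(k2_search(["outlook", "temperature", "humidity", "play"], toy))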


Other algorithms

♦ Extend K2 to greedily consider adding or deleting edges between any pair of nodes
  - Further step: consider inverting the direction of edges
♦ TAN (Tree Augmented Naïve Bayes):
  - Starts with naïve Bayes
  - Considers adding a second parent to each node (apart from the class node)
  - An efficient algorithm exists

Likelihood vs. conditional likelihood

♦ In classification, what we really want is to maximize the probability of the class given the other attributes
  - Not the probability of the instances
♦ But: there is no closed-form solution for the probabilities in the nodes' tables that maximize this
♦ However: the conditional probability of the data given the network is easy to compute
  - Seems to work well when used for network scoring

Data structures for fast learning

♦ Learning Bayes nets involves a lot of counting for computing conditional probabilities
♦ Naïve strategy for storing the counts: a hash table
  - Runs into memory problems very quickly
♦ More sophisticated strategy: the all-dimensions (AD) tree
  - Analogous to the kD-tree for numeric data
  - Stores the counts in a tree, but in a clever way such that redundancy is eliminated
  - Only makes sense for large datasets

AD tree example


Building an AD tree

♦ Assume each attribute in the data has been assigned an index
♦ Then, expand the node for attribute i with the values of all attributes j > i (a code sketch follows the Discussion slide)
♦ Two important restrictions:
  - The most populous expansion for each attribute is omitted (breaking ties arbitrarily)
  - Expansions with counts that are zero are also omitted
♦ The root node is given index zero

Discussion

♦ We have assumed: discrete data, no missing values, no new nodes
♦ A different method of using Bayes nets for classification: Bayesian multinets
  - I.e. build one network for each class and make predictions using Bayes' rule
♦ A different class of learning methods for Bayes nets: testing conditional independence assertions
♦ Can also build Bayes nets for regression tasks
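A simplified sketch of the AD-tree construction described on the "Building an AD tree" slide, over nominal records represented as tuples; the dictionary layout is my own simplification:

from collections import Counter

def build_ad_tree(records, attr_index=0):
    # Each node stores the count of the records it covers and, for every
    # attribute with index >= attr_index, one subtree per value, except that
    # the most populous value and zero-count values are omitted, as on the slide.
    node = {"count": len(records), "expansions": {}}
    n_attrs = len(records[0]) if records else 0
    for j in range(attr_index, n_attrs):
        counts = Counter(rec[j] for rec in records)
        most_populous = counts.most_common(1)[0][0]   # omitted; ties broken arbitrarily
        children = {}
        for value in counts:
            if value == most_populous:
                continue
            subset = [rec for rec in records if rec[j] == value]
            children[value] = build_ad_tree(subset, j + 1)
        if children:
            node["expansions"][j] = children
    return node

data = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "mild")]
tree = build_ad_tree(data)
print(tree["count"], list(tree["expansions"].keys()))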

