Backpropagation

♦ How to learn weights given network structure?
♦ Cannot simply use perceptron learning rule because we have hidden layer(s)
♦ Function we are trying to minimize: error
♦ Can use a general function minimization technique called gradient descent
  ♦ Need differentiable activation function: use the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$ instead of the threshold function
  ♦ Need differentiable error function: can't use zero-one loss, but can use squared error $E = \frac{1}{2}(y - f(x))^2$
The two activation functions

[Figure: plots of the threshold and sigmoid activation functions]

Gradient descent example

♦ Function: $x^2 + 1$
♦ Derivative: $2x$
♦ Learning rate: 0.1
♦ Start value: 4
♦ Can only find a local minimum!
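A minimal sketch of this example in Python (the step size and starting point are the ones on the slide; the function name is illustrative):

  def gradient_descent(start, learning_rate, steps):
      """Minimize f(x) = x^2 + 1 by following the negative gradient f'(x) = 2x."""
      x = start
      for _ in range(steps):
          x -= learning_rate * 2 * x  # move against the gradient
      return x

  # Starting at x = 4 with learning rate 0.1, x shrinks toward the minimum at 0:
  print(gradient_descent(start=4.0, learning_rate=0.1, steps=50))  # ~0.0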
Minimizing the error I

♦ Need to find partial derivative of error function for each parameter (i.e. weight)

Minimizing the error II

♦ What about the weights for the connections from the input to the hidden layer?
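The derivations on these two slides did not survive extraction; a sketch of the key step, under the definitions above (squared error $E = \frac{1}{2}(y - f(x))^2$, sigmoid activation, with $a_i$ the output of hidden unit $i$ and $w_i$ the weight on its connection to the output):

$$\frac{\partial E}{\partial w_i} = (f(x) - y)\, f'(x)\, a_i, \qquad f'(x) = f(x)\,\bigl(1 - f(x)\bigr)$$

For a weight on a connection from the input to a hidden unit, the chain rule is applied once more: the error term $(f(x) - y)\,f'(x)\,w_i$ is propagated back through the hidden unit's own sigmoid derivative and multiplied by the corresponding input value.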
Remarks

♦ Same process works for multiple hidden layers and multiple output units (e.g. for multiple classes)
♦ Can update weights after all training instances have been processed or incrementally: batch learning vs. stochastic backpropagation
♦ Weights are initialized to small random values
♦ How to avoid overfitting?
  ♦ Early stopping: use validation set to check when to stop
  ♦ Weight decay: add penalty term to error function
♦ How to speed up learning?
  ♦ Momentum: re-use proportion of old weight change
  ♦ Use optimization method that employs 2nd derivative

Radial basis function networks

♦ Another type of feedforward network with two layers (plus the input layer)
♦ Hidden units represent points in instance space and activation depends on distance
  ♦ To this end, distance is converted into similarity: Gaussian activation function
  ♦ Width may be different for each hidden unit
  ♦ Points of equal activation form hypersphere (or hyperellipsoid) as opposed to hyperplane
♦ Output layer same as in MLP
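A small sketch of the Gaussian activation just described (each hidden unit has its own center and width; names here are illustrative):

  import math

  def rbf_activation(x, center, width):
      """Gaussian RBF: distance to the center converted into a similarity in (0, 1]."""
      sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
      return math.exp(-sq_dist / (2 * width ** 2))

  print(rbf_activation([1.0, 2.0], center=[0.0, 0.0], width=1.5))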
Learning RBF networks

♦ Parameters: centers and widths of the RBFs + weights in output layer
♦ Can learn two sets of parameters independently and still get accurate models
  ♦ E.g.: clusters from k-means can be used to form basis functions
  ♦ Linear model can be used based on fixed RBFs
  ♦ Makes learning RBFs very efficient
♦ Disadvantage: no built-in attribute weighting based on relevance
♦ RBF networks are related to RBF SVMs

Instance-based learning

♦ Practical problems of 1-NN scheme:
  ♦ Slow (but: fast tree-based approaches exist)
    Remedy: remove irrelevant data
  ♦ Noise (but: k-NN copes quite well with noise)
    Remedy: remove noisy instances
  ♦ All attributes deemed equally important
    Remedy: weight attributes (or simply select)
  ♦ Doesn't perform explicit generalization
    Remedy: rule-based NN approach
Learning prototypes

♦ Only those instances involved in a decision need to be stored
♦ Noisy instances should be filtered out
♦ Idea: only use prototypical examples

Speed up, combat noise

♦ IB2: save memory, speed up classification
  ♦ Work incrementally
  ♦ Only incorporate misclassified instances
  ♦ Problem: noisy data gets incorporated
♦ IB3: deal with noise
  ♦ Discard instances that don't perform well
  ♦ Compute confidence intervals for
    1. Each instance's success rate
    2. Default accuracy of its class
  ♦ Accept/reject instances
    ♦ Accept if lower limit of 1 exceeds upper limit of 2
    ♦ Reject if upper limit of 1 is below lower limit of 2
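A minimal sketch of the IB2 idea in Python (1-NN over the stored set; the helper names are made up here):

  def nearest_label(stored, x):
      """Classify x with 1-NN over the stored (instance, label) pairs."""
      inst, label = min(stored, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
      return label

  def ib2(training_data):
      """IB2: keep only instances the current store misclassifies."""
      stored = [training_data[0]]          # seed with the first instance
      for x, y in training_data[1:]:
          if nearest_label(stored, x) != y:
              stored.append((x, y))        # misclassified -> incorporate
      return stored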
Weight attributes

♦ IB4: weight each attribute (weights can be class-specific)
♦ Weighted Euclidean distance (a code sketch follows the next slide):
  $\sqrt{w_1^2 (x_1 - y_1)^2 + \dots + w_n^2 (x_n - y_n)^2}$
♦ Update weights based on nearest neighbor
  ♦ Class correct: increase weight
  ♦ Class incorrect: decrease weight
  ♦ Amount of change for the i-th attribute depends on $|x_i - y_i|$

Rectangular generalizations

♦ Nearest-neighbor rule is used outside rectangles
♦ Rectangles are rules! (But they can be more conservative than "normal" rules.)
♦ Nested rectangles are rules with exceptions
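The sketch referenced above — the weighted distance in Python (an illustrative helper):

  def weighted_distance(x, y, weights):
      """Weighted Euclidean distance used by IB4 (weights may be class-specific)."""
      return sum((w * (a - b)) ** 2 for w, a, b in zip(weights, x, y)) ** 0.5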
Generalized exemplars

♦ Generalize instances into hyperrectangles
  ♦ Online: incrementally modify rectangles
  ♦ Offline version: seek small set of rectangles that cover the instances
♦ Important design decisions:
  ♦ Allow overlapping rectangles? Requires conflict resolution
  ♦ Allow nested rectangles?
  ♦ Dealing with uncovered instances?

Separating generalized exemplars

[Figure: exemplars of Class 1 and Class 2 with the separation line between them]
Generalized distance functions

♦ Given: some transformation operations on attributes
♦ K*: similarity = probability of transforming instance A into B by chance
  ♦ Average over all transformation paths
  ♦ Weight paths according to their probability (need way of measuring this)
♦ Uniform way of dealing with different attribute types
♦ Easily generalized to give distance between sets of instances

Numeric prediction

♦ Counterparts exist for all schemes previously discussed: decision trees, rule learners, SVMs, etc.
♦ (Almost) all classification schemes can be applied to regression problems using discretization:
  ♦ Discretize the class into intervals
  ♦ Predict weighted average of interval midpoints
  ♦ Weight according to class probabilities
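A sketch of the discretization trick, assuming some classifier has already produced a probability per interval (names are illustrative):

  def predict_numeric(interval_midpoints, class_probabilities):
      """Regression via classification: weighted average of interval midpoints."""
      return sum(m * p for m, p in zip(interval_midpoints, class_probabilities))

  # E.g. three intervals with midpoints 5, 15, 25 and predicted probabilities:
  print(predict_numeric([5.0, 15.0, 25.0], [0.2, 0.5, 0.3]))  # 15.5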
Regression trees

♦ Like decision trees, but:
  ♦ Splitting criterion: minimize intra-subset variation
  ♦ Termination criterion: std dev becomes small
  ♦ Pruning criterion: based on numeric error measure
  ♦ Prediction: leaf predicts average class value of instances
♦ Piecewise constant functions
♦ Easy to interpret
♦ More sophisticated version: model trees

Model trees

♦ Build a regression tree
♦ Each leaf ⇒ linear regression function
♦ Smoothing: factor in ancestor's predictions
  ♦ Smoothing formula: $p' = \frac{np + kq}{n + k}$, where $p$ is the prediction passed up from below, $q$ the value predicted by the model at this node, $n$ the number of instances reaching the node below, and $k$ a smoothing constant
  ♦ Same effect can be achieved by incorporating ancestor models into the leaves
♦ Need linear regression function at each node
♦ At each node, use only a subset of attributes
  ♦ Those occurring in subtree
  ♦ (+ maybe those occurring in path to the root)
♦ Fast: tree usually uses only a small subset of the attributes
Building the tree

♦ Splitting: standard deviation reduction (a code sketch follows the next slide):
  $SDR = \mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i)$
♦ Termination:
  ♦ Standard deviation < 5% of its value on full training set
  ♦ Too few instances remain (e.g. < 4)
♦ Pruning:
  ♦ Heuristic estimate of absolute error of LR models: $\frac{n + \nu}{n - \nu} \times$ average absolute error, with $n$ instances and $\nu$ parameters
  ♦ Greedily remove terms from LR models to minimize estimated error
  ♦ Heavy pruning: single model may replace whole subtree
  ♦ Proceed bottom-up: compare error of LR model at internal node to error of subtree

Nominal attributes

♦ Convert nominal attributes to binary ones
♦ Sort attribute by average class value
♦ If attribute has k values, generate k – 1 binary attributes
  ♦ The i-th is 0 if the value lies within the first i, otherwise 1
♦ Treat binary attributes as numeric
♦ Can prove: best split on one of the new attributes is the best (binary) split on original
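As referenced above, a small Python sketch of the SDR computation (an illustrative helper, using the population standard deviation):

  import statistics

  def sdr(parent_values, child_value_lists):
      """Standard deviation reduction achieved by splitting parent into children."""
      n = len(parent_values)
      weighted_child_sd = sum(
          len(child) / n * statistics.pstdev(child) for child in child_value_lists
      )
      return statistics.pstdev(parent_values) - weighted_child_sd

  print(sdr([1, 2, 3, 10, 11, 12], [[1, 2, 3], [10, 11, 12]]))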
Missing values

♦ Modify splitting criterion: $SDR = \frac{m}{|T|} \times \left[\mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|}\,\mathrm{sd}(T_i)\right]$, where $m$ is the number of instances without missing values
♦ To determine which subset an instance goes into, use surrogate splitting
  ♦ Split on the attribute whose correlation with original is greatest
  ♦ Problem: complex and time-consuming
  ♦ Simple solution: always use the class

Surrogate splitting based on class

♦ Choose split point based on instances with known values
♦ Split point divides instances into 2 subsets
  ♦ L (smaller class average)
  ♦ R (larger)
♦ m is the average of the two averages
♦ For an instance with a missing value:
  ♦ Choose L if class value < m
  ♦ Otherwise R
♦ Test set: replace missing value with average
♦ Once full tree is built, replace missing values with averages of corresponding leaf nodes
Pseudo-code for M5'

♦ Four methods:
  ♦ Main method: MakeModelTree
  ♦ Method for splitting: split
  ♦ Method for pruning: prune
  ♦ Method that computes error: subtreeError
♦ We'll briefly look at each method in turn
♦ Assume that linear regression method performs attribute subset selection based on error

MakeModelTree

  MakeModelTree (instances)
  {
    SD = sd(instances)
    for each k-valued nominal attribute
      convert into k-1 synthetic binary attributes
    root = newNode
    root.instances = instances
    split(root)
    prune(root)
    printTree(root)
  }

split

  split(node)
  {
    if sizeof(node.instances) < 4 or
       sd(node.instances) < 0.05*SD
      node.type = LEAF
    else
      node.type = INTERIOR
      for each attribute
        for all possible split positions of attribute
          calculate the attribute's SDR
      node.attribute = attribute with maximum SDR
      split(node.left)
      split(node.right)
  }

prune

  prune(node)
  {
    if node = INTERIOR then
      prune(node.leftChild)
      prune(node.rightChild)
      node.model = linearRegression(node)
      if subtreeError(node) > error(node) then
        node.type = LEAF
  }
subtreeError

  subtreeError(node)
  {
    l = node.left; r = node.right
    if node = INTERIOR then
      return (sizeof(l.instances)*subtreeError(l)
              + sizeof(r.instances)*subtreeError(r))
             / sizeof(node.instances)
    else
      return error(node)
  }

Model tree for servo data

[Figure: model tree learned from the servo data, showing the result of merging]
Rules from model trees

♦ PART algorithm generates classification rules by building partial decision trees
♦ Can use the same method to build rule sets for regression
  ♦ Use model trees instead of decision trees
  ♦ Use variance instead of entropy to choose node to expand when building partial tree
♦ Rules will have linear models on right-hand side
♦ Caveat: using smoothed trees may not be appropriate due to separate-and-conquer strategy

Locally weighted regression

♦ Numeric prediction that combines
  ♦ instance-based learning
  ♦ linear regression
♦ "Lazy":
  ♦ computes regression function at prediction time
  ♦ works incrementally
♦ Weight training instances according to distance to test instance
  ♦ needs weighted version of linear regression
♦ Advantage: nonlinear approximation
♦ But: slow
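A compact sketch of locally weighted regression with a Gaussian kernel, using NumPy (the bandwidth plays the role of the smoothing parameter discussed on the next slide; all names are illustrative):

  import numpy as np

  def locally_weighted_prediction(X, y, query, bandwidth):
      """Predict at `query` by fitting weighted least squares around it."""
      # Gaussian kernel weights from Euclidean distances to the query point.
      distances = np.linalg.norm(X - query, axis=1)
      w = np.exp(-(distances ** 2) / (2 * bandwidth ** 2))
      # Add an intercept column, then solve the weighted normal equations.
      Xb = np.hstack([np.ones((len(X), 1)), X])
      W = np.diag(w)
      beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
      return np.concatenate([[1.0], query]) @ beta

  X = np.array([[0.0], [1.0], [2.0], [3.0]])
  y = np.array([0.0, 1.0, 4.0, 9.0])        # nonlinear target
  print(locally_weighted_prediction(X, y, np.array([1.5]), bandwidth=1.0))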
Design decisions

♦ Weighting function:
  ♦ Inverse Euclidean distance
  ♦ Gaussian kernel applied to Euclidean distance
  ♦ Triangular kernel used the same way
  ♦ etc.
♦ Smoothing parameter is used to scale the distance function
  ♦ Multiply distance by inverse of this parameter
  ♦ Possible choice: distance of k-th nearest training instance (makes it data dependent)

Discussion

♦ Regression trees were introduced in CART
♦ Quinlan proposed model tree method (M5)
♦ M5': slightly improved, publicly available
♦ Quinlan also investigated combining instance-based learning with M5
♦ CUBIST: Quinlan's commercial rule learner for numeric prediction
♦ Interesting comparison: neural nets vs. M5
Clustering: how many clusters?

♦ How to choose k in k-means? Possibilities:
  ♦ Choose k that minimizes cross-validated squared distance to cluster centers
  ♦ Use penalized squared distance on the training data (e.g. using an MDL criterion)
  ♦ Apply k-means recursively with k = 2 and use stopping criterion (e.g. based on MDL)
    ♦ Seeds for subclusters can be chosen by seeding along direction of greatest variance in cluster (one standard deviation away in each direction from cluster center of parent cluster)
    ♦ Implemented in algorithm called X-means (using Bayesian Information Criterion instead of MDL)

Incremental clustering

♦ Heuristic approach (COBWEB/CLASSIT): form a hierarchy of clusters incrementally
♦ Start: tree consists of empty root node
♦ Then:
  ♦ add instances one by one
  ♦ update tree appropriately at each stage
  ♦ to update, find the right leaf for an instance
  ♦ May involve restructuring the tree
♦ Base update decisions on category utility
Clustering weather data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Figure: hierarchies built as instances are added one by one; merge best host and runner-up, and consider splitting the best host if merging doesn't help]
Final hierarchy

[Figure: final hierarchy over the weather instances, with the table of instances A–D repeated alongside. Oops: clusters a and b are actually very similar]

Example: the iris data (subset)

[Figure: hierarchy produced by incremental clustering on a subset of the iris data]
Clustering with cutoff

[Figure: the iris-subset hierarchy clustered with a cutoff applied]

Category utility

♦ Category utility: quadratic loss function defined on conditional probabilities:
  $CU(C_1, C_2, \dots, C_k) = \frac{\sum_l \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)}{k}$
♦ Every instance in a different category ⇒ numerator becomes $n - \sum_i \sum_j \Pr[a_i = v_{ij}]^2$, where $n$ is the number of attributes (hence the division by k, which penalizes this extreme)
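A minimal Python sketch of category utility for nominal attributes (the representation — each cluster a list of attribute-value tuples — is an assumption made here for illustration):

  from collections import Counter

  def category_utility(clusters):
      """Category utility; `clusters` is a list of instance lists,
      each instance a tuple of nominal attribute values."""
      all_instances = [inst for cluster in clusters for inst in cluster]
      n_total = len(all_instances)
      n_attrs = len(all_instances[0])

      def sq_prob_sum(instances):
          # sum over attributes i and values j of Pr[a_i = v_ij]^2
          total = 0.0
          for i in range(n_attrs):
              counts = Counter(inst[i] for inst in instances)
              total += sum((c / len(instances)) ** 2 for c in counts.values())
          return total

      base = sq_prob_sum(all_instances)
      weighted = sum(len(c) / n_total * (sq_prob_sum(c) - base) for c in clusters)
      return weighted / len(clusters)

  print(category_utility([[("sunny", "hot"), ("sunny", "mild")],
                          [("rainy", "cool"), ("rainy", "cool")]]))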
Numeric attributes

♦ Assume normal distribution: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
♦ Thus the sum $\sum_j \Pr[a_i = v_{ij}]^2$ becomes the integral $\int f(a_i)^2\, da_i = \frac{1}{2\sqrt{\pi}\,\sigma_i}$
♦ Prespecified minimum variance: the acuity parameter

Probability-based clustering

♦ Problems with heuristic approach:
  ♦ Division by k?
  ♦ Order of examples?
  ♦ Are restructuring operations sufficient?
  ♦ Is result at least local minimum of category utility?
♦ Probabilistic perspective ⇒ seek the most likely clusters given the data
♦ Also: instance belongs to a particular cluster with a certain probability
Finite mixtures

♦ Model data using a mixture of distributions
♦ One cluster, one distribution
  ♦ governs probabilities of attribute values in that cluster
♦ Finite mixtures: finite number of clusters
♦ Individual distributions are normal (usually)
♦ Combine distributions using cluster weights

Two-class mixture model

[Figure: a two-class dataset of numeric values labeled A and B (e.g. A 51, B 64, ...) and the two-component normal mixture fitted to it]
Using the mixture model

♦ Probability that instance x belongs to cluster A:
  $\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x;\, \mu_A, \sigma_A)\, p_A}{\Pr[x]}$
  with $f(x;\, \mu, \sigma)$ the normal density
♦ Probability of an instance given the clusters:
  $\Pr[x \mid \text{the clusters}] = \sum_i \Pr[x \mid \text{cluster}_i]\,\Pr[\text{cluster}_i]$

Learning the clusters

♦ Assume: we know there are k clusters
♦ Learn the clusters ⇒ determine their parameters, i.e. means and standard deviations
♦ Performance criterion: probability of training data given the clusters
♦ EM algorithm: finds a local maximum of the likelihood
EM algorithm

♦ EM = Expectation–Maximization
♦ Generalize k-means to probabilistic setting
♦ Iterative procedure:
  ♦ E "expectation" step: calculate cluster probability for each instance
  ♦ M "maximization" step: estimate distribution parameters from cluster probabilities
♦ Store cluster probabilities as instance weights
♦ Stop when improvement is negligible

More on EM

♦ Estimate parameters from weighted instances:
  $\mu_A = \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad \sigma_A^2 = \frac{\sum_i w_i (x_i - \mu_A)^2}{\sum_i w_i}$
♦ Stop when log-likelihood saturates
♦ Log-likelihood: $\sum_i \log\left(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\right)$
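A minimal sketch of this procedure for a one-dimensional two-component mixture, using NumPy (initialization and stopping rule are simplified here; a fixed iteration count stands in for checking log-likelihood saturation):

  import numpy as np

  def norm_pdf(x, mu, sd):
      return np.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (np.sqrt(2 * np.pi) * sd)

  def em_two_gaussians(x, iterations=100):
      """EM for a two-component 1-D Gaussian mixture; a minimal sketch."""
      mu_a, mu_b = x.min(), x.max()          # crude initialization
      sd_a = sd_b = x.std()
      p_a = 0.5
      for _ in range(iterations):
          # E step: posterior probability that each point belongs to cluster A
          like_a = p_a * norm_pdf(x, mu_a, sd_a)
          like_b = (1 - p_a) * norm_pdf(x, mu_b, sd_b)
          w = like_a / (like_a + like_b)
          # M step: weighted parameter estimates, as in the formulas above
          mu_a, mu_b = np.average(x, weights=w), np.average(x, weights=1 - w)
          sd_a = np.sqrt(np.average((x - mu_a) ** 2, weights=w))
          sd_b = np.sqrt(np.average((x - mu_b) ** 2, weights=1 - w))
          p_a = w.mean()
      return mu_a, sd_a, mu_b, sd_b, p_a

  x = np.concatenate([np.random.normal(50, 5, 100), np.random.normal(65, 2, 100)])
  print(em_two_gaussians(x))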
Extending the mixture model

♦ More than two distributions: easy
♦ Several attributes: easy—assuming independence!
♦ Correlated attributes: difficult
  ♦ Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  ♦ n attributes: need to estimate n + n(n+1)/2 parameters

More mixture model extensions

♦ Nominal attributes: easy if independent
♦ Correlated nominal attributes: difficult
  ♦ Two correlated attributes ⇒ $v_1 v_2$ parameters
♦ Missing values: easy
♦ Can use other distributions than normal:
  ♦ "log-normal" if predetermined minimum is given
  ♦ "log-odds" if bounded from above and below
  ♦ Poisson for attributes that are integer counts
♦ Use cross-validation to estimate k!
Bayesian clustering

♦ Problem: many parameters ⇒ EM overfits
♦ Bayesian approach: give every parameter a prior probability distribution
  ♦ Incorporate prior into overall likelihood figure
  ♦ Penalizes introduction of parameters
  ♦ E.g.: Laplace estimator for nominal attributes
♦ Can also have prior on number of clusters!
♦ Implementation: NASA's AUTOCLASS

Discussion

♦ Can interpret clusters by using supervised learning
  ♦ post-processing step
♦ Decrease dependence between attributes?
  ♦ pre-processing step
  ♦ E.g. use principal component analysis
♦ Can be used to fill in missing values
♦ Key advantage of probabilistic clustering:
  ♦ Can estimate likelihood of data
  ♦ Use it to compare different models objectively
From naïve Bayes to Bayesian networks

♦ Naïve Bayes assumes: attributes conditionally independent given the class
♦ Doesn't hold in practice but classification accuracy often high
♦ However: sometimes performance much worse than e.g. decision tree
♦ Can we eliminate the assumption?

Enter Bayesian networks

♦ Graphical models that can represent any probability distribution
♦ Graphical representation: directed acyclic graph, one node for each attribute
♦ Overall probability distribution factorized into component distributions
♦ Graph's nodes hold component distributions (conditional distributions)
[Figure: two Bayesian networks for the weather data]
Computing the class probabilities

♦ Two steps: computing a product of probabilities for each class, and normalization (see the sketch after the next slide)
  ♦ For each class value:
    ♦ Take all attribute values and class value
    ♦ Look up corresponding entries in conditional probability distribution tables
    ♦ Take the product of all probabilities
  ♦ Divide the product for each class by the sum of the products (normalization)

Why can we do this? (Part I)

♦ Single assumption: values of a node's parents completely determine probability distribution for current node
♦ Means that node/attribute is conditionally independent of other ancestors given parents
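Returning to the two steps above, a minimal sketch in Python, assuming CPTs are stored as dictionaries keyed by (node value, tuple of parent values) — all names here are illustrative:

  def class_probabilities(network, instance, class_values):
      """network: maps node -> (parents list, CPT dict keyed by (value, parent tuple)).
      instance: dict of attribute values. Returns normalized class probabilities."""
      scores = {}
      for c in class_values:
          assignment = dict(instance, **{"class": c})
          product = 1.0
          for node, (parents, cpt) in network.items():
              parent_vals = tuple(assignment[p] for p in parents)
              product *= cpt[(assignment[node], parent_vals)]  # CPT lookup
          scores[c] = product
      total = sum(scores.values())
      return {c: s / total for c, s in scores.items()}  # normalization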
Why can we do this? (Part II)

♦ Chain rule from probability theory:
  $\Pr[a_1, a_2, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_{i-1}, \dots, a_1]$
♦ Because of our assumption from the previous slide:
  $\Pr[a_1, a_2, \dots, a_n] = \prod_{i=1}^{n} \Pr[a_i \mid a_i\text{'s parents}]$

Learning Bayes nets

♦ Basic components of algorithms for learning Bayes nets:
  ♦ Method for evaluating the goodness of a given network
    ♦ Measure based on probability of training data given the network (or the logarithm thereof)
  ♦ Method for searching through space of possible networks
    ♦ Amounts to searching through sets of edges because nodes are fixed
Problem: overfitting

♦ Can't just maximize probability of the training data
  ♦ Because then it's always better to add more edges (fit the training data more closely)
♦ Need to use cross-validation or some penalty for complexity of the network (a scoring sketch follows the next slide)
  ♦ AIC measure: $\text{AIC score} = -LL + K$
  ♦ MDL measure: $\text{MDL score} = -LL + \frac{K}{2} \log N$
  ♦ LL: log-likelihood (log of probability of data), K: number of free parameters, N: number of instances
♦ Another possibility: Bayesian approach with prior distribution over networks

Searching for a good structure

♦ Task can be simplified: can optimize each node separately
  ♦ Because probability of an instance is product of individual nodes' probabilities
  ♦ Also works for AIC and MDL criterion because penalties just add up
♦ Can optimize node by adding or removing edges from other nodes
♦ Must not introduce cycles!
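The scoring sketch referenced above — a tiny Python helper implementing the penalized scores as defined on the slide (lower is better; the function name is illustrative):

  import math

  def network_score(log_likelihood, n_free_params, n_instances, measure="MDL"):
      """Penalized network score: AIC = -LL + K, MDL = -LL + (K/2) log N."""
      if measure == "AIC":
          return -log_likelihood + n_free_params
      return -log_likelihood + n_free_params / 2 * math.log(n_instances)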
The K2 algorithm

♦ Starts with given ordering of nodes (attributes)
♦ Processes each node in turn
♦ Greedily tries adding edges from previous nodes to current node
♦ Moves to next node when current node can't be optimized further
♦ Result depends on initial order

Some tricks

♦ Sometimes it helps to start the search with a naïve Bayes network
♦ It can also help to ensure that every node is in the Markov blanket of the class node
  ♦ Markov blanket of a node includes all parents, children, and children's parents of that node
  ♦ Given values for Markov blanket, node is conditionally independent of nodes outside blanket
  ♦ I.e. node is irrelevant to classification if not in Markov blanket of class node
Other algorithms

♦ Extending K2 to consider greedily adding or deleting edges between any pair of nodes
♦ Further step: considering inverting the direction of edges
♦ TAN (Tree Augmented Naïve Bayes):
  ♦ Starts with naïve Bayes
  ♦ Considers adding second parent to each node (apart from class node)
  ♦ Efficient algorithm exists

Likelihood vs. conditional likelihood

♦ In classification what we really want is to maximize probability of class given other attributes
  ♦ Not probability of the instances
♦ But: no closed-form solution for probabilities in nodes' tables that maximize this
♦ However: can easily compute conditional probability of data based on given network
  ♦ Seems to work well when used for network scoring
Data structures for fast learning

♦ Learning Bayes nets involves a lot of counting for computing conditional probabilities
♦ Naïve strategy for storing counts: hash table
  ♦ Runs into memory problems very quickly
♦ More sophisticated strategy: all-dimensions (AD) tree
  ♦ Analogous to kD-tree for numeric data
  ♦ Stores counts in a tree but in a clever way such that redundancy is eliminated
  ♦ Only makes sense to use it for large datasets

AD tree example

[Figure: AD tree storing attribute-value counts for a small dataset]
Building an AD tree

♦ Assume each attribute in the data has been assigned an index; the root node is given index zero
♦ Then, expand node for attribute i with the values of all attributes j > i
♦ Two important restrictions:
  ♦ Most populous expansion for each attribute is omitted (breaking ties arbitrarily)
  ♦ Expansions with counts that are zero are also omitted

Discussion

♦ We have assumed: discrete data, no missing values, no new nodes
♦ Different method of using Bayes nets for classification: Bayesian multinets
  ♦ I.e. build one network for each class and make prediction using Bayes' rule
♦ Different class of learning methods for Bayes nets: testing conditional independence assertions
♦ Can also build Bayes nets for regression tasks