Slide TIF311 DM 10 11

DATA MINING WITH
CLUSTERING AND
CLASSIFICATION

Overview
• Definition of Clustering
• Existing clustering methods
• Clustering examples
• Classification
• Classification examples
• Conclusion

Definition
• Clustering can be considered the most important
unsupervised learning technique; like every other
problem of this kind, it deals with finding structure
in a collection of unlabeled data.
• Clustering is “the process of organizing objects into
groups whose members are similar in some way”.
• A cluster is therefore a collection of objects which
are “similar” to one another and “dissimilar” to the
objects belonging to other clusters.

Mu-Yu Lu, SJSU

Why clustering?
A few good reasons ...
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process

Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

Which method should I use?
• Type of attributes in the data
• Scalability to larger datasets
• Ability to work with irregular data
• Time cost
• Complexity
• Data order dependency
• Result presentation

Major existing clustering
methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic

Measuring Similarity
• Dissimilarity/Similarity metric: Similarity is expressed
in terms of a distance function, which is typically
metric: d(i, j)

• There is a separate “quality” function that measures
the “goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical,
ordinal and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Professor Lee, Sin-Min
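To make the distance function d(i, j) concrete, here is a minimal Python sketch for interval-scaled records; the optional per-variable weights illustrate the point about weighting variables by application, and the example values are made up:

import math

def euclidean(p, q, weights=None):
    # Weighted Euclidean distance between two numeric records.
    weights = weights or [1.0] * len(p)
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))

def manhattan(p, q):
    # Manhattan (city-block) distance between two numeric records.
    return sum(abs(a - b) for a, b in zip(p, q))

# Example: two records of (age, height, weight)
print(euclidean((20, 170, 80), (30, 160, 120)))   # about 42.4
print(manhattan((20, 170, 80), (30, 160, 120)))   # 60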

Distance based method

• In this case we can easily identify the four clusters into which the data can
be divided; the similarity criterion is distance: two or more objects
belong to the same cluster if they are “close” according to a given
distance. This is called distance-based clustering.

Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (a singleton).
2. Recursively merge the most appropriate clusters.
3. Stop when k clusters are reached.

Divisive (top down)
1. Start with one big cluster containing all the points.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are reached.

General steps of hierarchical
clustering
Given a set of N items to be clustered, and an N*N distance
(or similarity) matrix, the basic process of hierarchical
clustering (defined by S.C. Johnson in 1967) is this:
• Start by assigning each item to a cluster, so that if you
have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between
the clusters be the same as the distances (similarities)
between the items they contain.
• Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that you now have one
cluster less.
• Compute distances (similarities) between the new
cluster and each of the old clusters.
• Repeat steps 2 and 3 until all items are clustered into K
clusters.


Mu-Yu Lu, SJSU
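As a rough sketch of these steps in practice, scikit-learn's AgglomerativeClustering performs the same bottom-up merging; the toy points and the choice of K = 2 below are assumptions for illustration:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six toy 2-D items; in practice these are the N items to be clustered.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Start from N singleton clusters and merge until K = 2 clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1] (label numbering may differ)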

Exclusive vs. non-exclusive
clustering
• In the first case data are grouped in an
exclusive way, so that if a certain
datum belongs to a definite cluster
then it cannot be included in any
other cluster. A simple example of
this is shown in the figure below,
where the separation of points is
achieved by a straight line on a
two-dimensional plane.
• The second type, on the contrary, the
overlapping clustering, uses fuzzy sets
to cluster data, so that each point may
belong to two or more clusters with
different degrees of membership.
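To make the distinction concrete, here is a minimal sketch contrasting hard (exclusive) assignment with fuzzy membership degrees; the inverse-distance rule used below is just one simple choice, not a method prescribed by these slides:

import math

def memberships(point, centers):
    # Fuzzy membership degrees: closer centers get higher weight; degrees sum to 1.
    inv = [1.0 / (math.dist(point, c) + 1e-9) for c in centers]
    total = sum(inv)
    return [w / total for w in inv]

centers = [(0.0, 0.0), (5.0, 5.0)]
point = (2.0, 2.0)

degrees = memberships(point, centers)
print(degrees)                      # overlapping: about [0.6, 0.4] membership in the two clusters
print(degrees.index(max(degrees)))  # exclusive: the single cluster with the highest degree (0)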

Partitioning clustering

1. Divide the data into proper subsets.
2. Recursively go through each subset
and relocate points between
clusters (in contrast to the visit-once
approach of hierarchical clustering).
This recursive relocation yields higher-quality clusters.

Probabilistic clustering
1. Data are assumed to be drawn from a
mixture of probability distributions.
2. The mean and variance of each
distribution serve as the parameters
of a cluster.
3. Each point receives a single cluster
membership.
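A hedged sketch of this idea using a Gaussian mixture model, where each cluster is summarized by the mean and variance of one distribution; the data below are generated only for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Points drawn from a mixture of two Gaussian distributions.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(6, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)                # estimated mean of each distribution (cluster parameters)
print(gmm.predict(X[:5]))        # single cluster membership per point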

Single-Linkage
Clustering (hierarchical)
• The N*N proximity matrix is D = [d(i,j)]
• The clusterings are assigned sequence
numbers 0, 1, ..., (N-1)
• L(k) is the level of the kth clustering
• A cluster with sequence number m is
denoted (m)
• The proximity between clusters (r) and
(s) is denoted d[(r),(s)]
Mu-Yu Lu, SJSU

The algorithm is composed of
the following steps:
• Begin with the disjoint clustering having
level L(0) = 0 and sequence number m = 0.
• Find the least dissimilar pair of clusters in
the current clustering, say pair (r), (s),
according to

d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of
clusters in the current clustering.

The algorithm is composed of the
following steps (cont.):
• Increment the sequence number: m = m + 1. Merge
clusters (r) and (s) into a single cluster to form the
next clustering m. Set the level of this clustering to

L(m) = d[(r),(s)]
• Update the proximity matrix, D, by deleting the rows
and columns corresponding to clusters (r) and (s) and
adding a row and column corresponding to the newly
formed cluster. The proximity between the new
cluster, denoted (r,s), and an old cluster (k) is defined in
this way:

d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
• If all objects are in one cluster, stop. Otherwise, go to
step 2.
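A minimal from-scratch sketch of these steps (written for clarity, not speed; instead of updating D in place it recomputes the single-linkage proximity as the minimum member-to-member distance, which is equivalent to the update rule above):

def single_linkage(D, k):
    # D is a symmetric N*N distance matrix (list of lists);
    # merge clusters until k remain and record the level L(m) of each merge.
    clusters = [[i] for i in range(len(D))]      # step 1: disjoint clustering, L(0) = 0, m = 0
    levels = []
    while len(clusters) > k:
        # step 2: find the least dissimilar pair of clusters (r), (s)
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d_rs = min(D[i][j] for i in clusters[r] for j in clusters[s])
                if best is None or d_rs < best[0]:
                    best = (d_rs, r, s)
        d_rs, r, s = best
        # step 3: merge (r) and (s); the level of this clustering is d[(r),(s)]
        clusters[r] = clusters[r] + clusters[s]
        del clusters[s]
        levels.append(d_rs)
    return clusters, levels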

Hierarchical clustering example
• Let’s now see a simple example: a hierarchical
clustering of distances in kilometers between some
Italian cities. The method used is single-linkage.
• Input distance matrix (L = 0 for all the clusters):

• The nearest pair of cities is MI and TO, at distance 138. These
are merged into a single cluster called "MI/TO". The level of
the new cluster is L(MI/TO) = 138 and the new sequence
number is m = 1.
Then we compute the distance from this new compound object
to all other objects. In single link clustering the rule is that
the distance from the compound object to another object is
equal to the shortest distance from any member of the
cluster to the outside object. So the distance from "MI/TO"
to RM is chosen to be 564, which is the distance from MI to
RM, and so on.

• After merging MI with TO we obtain the
following matrix:

• min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a
new cluster called NA/RM
L(NA/RM) = 219
m = 2

• min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM
into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3

• min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and
FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

• Finally, we merge the last two clusters at level 295.
• The process is summarized by the following hierarchical tree:
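As a cross-check of the trace above, the same merge levels can be reproduced with SciPy. The distance table below is the commonly quoted version of this classic example and is assumed here, since the slide's matrix image is not reproduced, but it is consistent with the merge levels 138, 219, 255, 268 and 295 above:

import numpy as np
from scipy.cluster.hierarchy import linkage

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
# Assumed pairwise distances in km (not taken from the slide image).
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
])

# SciPy expects the condensed (upper-triangle) form of the distance matrix.
condensed = D[np.triu_indices(6, k=1)]
Z = linkage(condensed, method="single")
print(Z[:, 2])   # merge levels: [138. 219. 255. 268. 295.]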

K-means algorithm
1. It accepts the number of clusters to group
data into, and the dataset to cluster, as input
values.
2. It then creates the first K initial clusters (K =
number of clusters needed) from the dataset by
choosing K rows of data randomly from the
dataset. For example, if there are 10,000 rows
of data in the dataset and 3 clusters need to be
formed, then the first K = 3 initial clusters will
be created by selecting 3 records randomly
from the dataset as the initial clusters. Each of
the 3 initial clusters formed will have just one
row of data.

3. The K-Means algorithm calculates the arithmetic
mean of each cluster formed in the dataset. The
arithmetic mean of a cluster is the mean of all the
individual records in the cluster. In each of the first K initial
clusters, there is only one record. The arithmetic mean of a
cluster with one record is the set of values that make up
that record. For example, if the dataset we are discussing
is a set of Height, Weight and Age measurements for
students in a university, where a record P in the dataset S
is represented by a Height, Weight and Age
measurement, then P = {Age, Height, Weight}. Then
a record containing the measurements of a student John
would be represented as John = {20, 170, 80}, where
John's Age = 20 years, Height = 170 cm and Weight =
80 pounds. Since there is only one record in each
initial cluster, the arithmetic mean of a cluster with
only the record for John as a member = {20, 170, 80}.

4. Next, K-Means assigns each record in the dataset to only one of the
initial clusters. Each record is assigned to the nearest cluster (the
cluster which it is most similar to) using a measure of distance or
similarity such as the Euclidean distance measure or the
Manhattan/city-block distance measure.

5. K-Means re-assigns each record in the dataset to the most
similar cluster and re-calculates the arithmetic mean of all the clusters
in the dataset. The arithmetic mean of a cluster is the arithmetic mean
of all the records in that cluster. For example, if a cluster contains two
records, where the record of the set of measurements for John = {20,
170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean
is represented as Pmean = {Agemean, Heightmean, Weightmean}.
Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean =
(80 + 120)/2. The arithmetic mean of this cluster = {25, 165,
100}. This new arithmetic mean becomes the center of this new
cluster. Following the same procedure, new cluster centers are
formed for all the existing clusters.

6. K-Means re-assigns each record in the dataset to only one of
the new clusters formed. A record or data point is assigned to
the nearest cluster (the cluster which it is most similar to)
using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are
formed and the K-Means clustering procedure is completed.
Stable clusters are formed when new iterations or repetitions
of the K-Means clustering algorithm do not create new
clusters, i.e. the cluster center or arithmetic mean of each
cluster formed is the same as the old cluster center. There
are different techniques for determining when stable
clusters are formed or when the K-Means clustering
procedure is completed.
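Pulling the steps together, here is a hedged from-scratch sketch of the procedure just described (random initial records, Euclidean distance, stop when the cluster centers no longer change); the records are the illustrative (age, height, weight) rows used above plus two made-up ones:

import math
import random

def kmeans(records, k, seed=0):
    # K-Means as described above: pick K random rows as initial clusters,
    # assign every record to the nearest center, recompute the arithmetic
    # means, and repeat until the centers are stable.
    rng = random.Random(seed)
    centers = rng.sample(records, k)
    while True:
        clusters = [[] for _ in range(k)]
        for rec in records:                              # steps 4 and 6: nearest-center assignment
            nearest = min(range(k), key=lambda c: math.dist(rec, centers[c]))
            clusters[nearest].append(rec)
        new_centers = [                                  # step 5: arithmetic mean of each cluster
            tuple(sum(v) / len(v) for v in zip(*cluster)) if cluster else centers[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                       # step 7: stable clusters => stop
            return centers, clusters
        centers = new_centers

records = [(20, 170, 80), (30, 160, 120), (22, 175, 85), (28, 158, 115)]
centers, clusters = kmeans(records, k=2)
print(centers)   # two cluster centers, e.g. the means of the two groups of students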

Classification
Goal: Provide an overview of the
classification problem and introduce some of
the basic algorithms

• Classification Problem Overview
• Classification Techniques
  • Regression
  • Distance
  • Decision Trees (see the sketch below)
  • Rules
  • Neural Networks
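As a small illustration of the decision-tree technique in this list (applied to the poisonous-vs-edible mushroom example on the next slide), here is a hedged scikit-learn sketch; the two features and the training values are invented for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [cap_diameter_cm, has_ring] -> poisonous (1) or edible (0).
X = [[5.0, 1], [6.5, 1], [3.0, 0], [2.5, 0], [7.0, 1], [4.0, 0]]
y = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["cap_diameter_cm", "has_ring"]))
print(tree.predict([[6.0, 1]]))   # predicted class for a new mushroom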

Classification Examples
• Teachers classify students’ grades
as A, B, C, D, or F.
• Identify mushrooms as poisonous
or edible.
• Predict when a river will flood.
• Identify individuals with credit risks.
• Speech recognition
• Pattern recognition

Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80
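As a sketch of the full grading rule, assuming the conventional 90/80/70/60 cutoffs (the slide is cut off after the A case, so every threshold below 90 is an assumption):

def grade(x):
    # Classify a numeric score x into a letter grade.
    if x >= 90:
        return "A"
    elif x >= 80:        # assumed cutoff; the slide breaks off at "If 80 ..."
        return "B"
    elif x >= 70:        # assumed
        return "C"
    elif x >= 60:        # assumed
        return "D"
    else:
        return "F"

print([grade(s) for s in (95, 83, 71, 65, 40)])   # ['A', 'B', 'C', 'D', 'F']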