Data Preprocessing
Lecture 3/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA ([email protected])
Faculty of Computer Science, University of Indonesia

Objectives
- Motivation: Why preprocess the data?
- Data preprocessing techniques:
  - Data cleaning
  - Data integration and transformation
  - Data reduction

Why Preprocess the Data?
- Quality decisions must be based on quality data.
- Data can be incomplete, noisy, and inconsistent.
- A data warehouse needs consistent integration of quality data.
- Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data. Causes:
  - Not considered important at the time of entry
  - Equipment malfunctions
  - Data not entered due to misunderstanding
  - Inconsistent with other recorded data and thus deleted

Why Preprocess the Data? (2)
- Noisy (having incorrect attribute values): containing errors, or outlier values that deviate from the expected. Causes:
  - Data collection instruments used may be faulty
  - Human or computer errors occurring at data entry
  - Errors in data transmission
- Inconsistent: containing discrepancies, e.g., in the department codes used to categorize items.

Data Preprocessing Techniques
- Data Cleaning
  - To remove noise and correct inconsistencies in the data.
  - "Clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
  - Some examples of inconsistencies: customer_id vs cust_id; Bill vs William vs B.
- Data Integration
  - Merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube.
  - Some attributes may be inferred from others; data cleaning includes detecting and removing the redundancies that may have resulted.

Data Preprocessing Techniques (2)
- Data Transformation
  - Normalization: to improve the accuracy and efficiency of mining algorithms involving distance measurements, e.g., neural networks and nearest-neighbor methods.
- Data Discretization
- Data Reduction (next slide)

Data Preprocessing Techniques (3)
- Data Reduction
  - A warehouse may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set.
  - Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
- Strategies for data reduction:
  - Data aggregation (e.g., building a data cube)
  - Dimension reduction (e.g., removing irrelevant attributes through correlation analysis)
  - Data compression (e.g., using encoding schemes such as minimum length encoding or wavelets)
  - Numerosity reduction
  - Generalization

Data Cleaning – Missing Values
1. Ignore the tuple
   - Usually done when the class label is missing → classification.
   - Not effective when the missing values are spread across attributes in different tuples.
2. Fill in the missing value manually: tedious + infeasible?
3. Use a global constant to fill in the missing value
   - 'unknown', a new class?
   - The mining program may mistakenly think the tuples form an interesting concept, since they all have a value in common → not recommended.
4. Use the attribute mean to fill in the missing value → e.g., the average income (see the sketch below).
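The two mean-based strategies (methods 4 and 5, the latter on the next slide) are easy to sketch. A minimal illustration in plain Python; the records, the income field, and the risk classes are hypothetical:

```python
# Minimal sketch of methods 4 and 5: fill missing values with a mean.
# Records and field names are hypothetical illustrations.
records = [
    {"income": 45000, "risk": "low"},
    {"income": None,  "risk": "low"},
    {"income": 30000, "risk": "high"},
    {"income": None,  "risk": "high"},
    {"income": 52000, "risk": "low"},
]

# Method 4: global attribute mean.
known = [r["income"] for r in records if r["income"] is not None]
global_mean = sum(known) / len(known)

# Method 5: mean per class (here, per credit-risk category).
by_class = {}
for r in records:
    if r["income"] is not None:
        by_class.setdefault(r["risk"], []).append(r["income"])
class_mean = {c: sum(v) / len(v) for c, v in by_class.items()}

for r in records:
    if r["income"] is None:
        # Prefer the class mean (method 5); fall back to the global mean.
        r["income"] = class_mean.get(r["risk"], global_mean)

print(records)
```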

Data Cleaning – Missing Values (2)
5. Use the attribute mean for all samples belonging to the same class as the given tuple → e.g., the same credit risk category.
6. Use the most probable value to fill in the missing value
   - Determined with regression, inference-based tools such as a Bayesian formalism, or decision tree induction.
- Methods 3 to 6 bias the data; the filled-in value may not be correct. However, method 6 is a popular strategy, since:
  - It uses the most information from the present data to predict missing values.
  - There is a greater chance that the relationships between income and the other attributes are preserved.

Data Cleaning – Noise and Incorrect (Inconsistent) Data
- Noise is a random error or variance in a measured variable.
- How can we smooth out the data to remove the noise?
- Binning method
  - Smooths a sorted data value by consulting its "neighborhood", that is, the values around it.
  - The sorted values are distributed into a number of buckets, or bins.
  - Because binning methods consult the neighborhood of values, they perform local smoothing.
  - Binning is also used as a discretization technique (discussed later).

Data Cleaning – Noisy Data: Binning Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equidepth) bins of depth 4, so each bin contains four values:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of its bin's minimum or maximum) → the larger the bin width, the greater the effect:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
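A small sketch of equidepth binning with both smoothing variants, reproducing the price example above in plain Python (bin depth 4, as on the slide):

```python
# Equidepth binning with smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of
# the bin's minimum or maximum (ties go to the minimum).
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```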

Data Cleaning – Noisy Data: Clustering
- Similar values are organized into groups, or clusters.
- Values that fall outside of the set of clusters may be considered outliers.

Data Cleaning – Noisy Data: Regression
- Data can be smoothed by fitting the data to a function, such as with regression.
- Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
- Multiple linear regression → more than two variables; the data are fit to a multidimensional surface.
[Figure: a fitted line y = x + 1; an observed value Y1 at X1 is smoothed to the value Y1' on the line.]

Data Smoothing vs Data Reduction
- Many methods for data smoothing are also methods for data reduction involving discretization. Examples:
  - Binning techniques → reduce the number of distinct values per attribute. Useful for decision tree induction, which repeatedly makes value comparisons on sorted data.
  - Concept hierarchies are a form of data discretization that can also be used for data smoothing:
    - Mapping real price into inexpensive, moderately_priced, expensive.
    - Reducing the number of data values to be handled by the mining process.
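A minimal sketch of smoothing by linear regression in plain Python: fit a least-squares line to two variables, then replace each observed y with its prediction. The data points are invented for illustration:

```python
# Smooth one variable by regressing it on another:
# fit y ~ a*x + b by least squares, then replace each y with its prediction.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.3]   # noisy observations (invented data)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

smoothed = [a * x + b for x in xs]
print(f"y = {a:.2f}x + {b:.2f}")
print([round(v, 2) for v in smoothed])
```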

Data Cleaning – Inconsistent Data
- May be corrected manually.
- Errors made at data entry may be corrected by performing a paper trace, coupled with routines designed to help correct the inconsistent use of codes.
- Tools can also be used to detect violations of known data constraints.
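The last point can be sketched as a constraint check in plain Python; the set of valid department codes and the records are hypothetical:

```python
# Detect violations of a known constraint: every record's department
# code must come from an agreed code list (hypothetical codes).
VALID_DEPT_CODES = {"ELEC", "FURN", "TOYS"}

records = [("item-001", "ELEC"), ("item-002", "ELC"), ("item-003", "TOYS")]

violations = [(item, code) for item, code in records
              if code not in VALID_DEPT_CODES]
print(violations)  # [('item-002', 'ELC')] -> flagged for manual correction
```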

Data Integration and Transformation
- Data integration: combines data from multiple data stores.
- Schema integration:
  - Integrate metadata from different sources.
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#.
- Detecting and resolving data value conflicts:
  - For the same real-world entity, attribute values from different sources differ.
  - Possible reasons: different representations, different scales (feet vs metres).
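In practice, part of the entity identification problem is mapping equivalent keys before joining records. A minimal plain-Python sketch; the sources, key names (cust_id, cust_no), and values are hypothetical stand-ins for A.cust-id ≡ B.cust-#:

```python
# Schema integration sketch: A.cust_id and B.cust_no name the same entity key.
source_a = [{"cust_id": 1, "name": "Bill"}, {"cust_id": 2, "name": "Ann"}]
source_b = [{"cust_no": 1, "city": "Depok"}, {"cust_no": 2, "city": "Jakarta"}]

# Normalize both schemas onto one key name, then join on it.
b_by_key = {r["cust_no"]: r for r in source_b}
integrated = [
    {**a, **{k: v for k, v in b_by_key.get(a["cust_id"], {}).items()
             if k != "cust_no"}}
    for a in source_a
]
print(integrated)
# [{'cust_id': 1, 'name': 'Bill', 'city': 'Depok'},
#  {'cust_id': 2, 'name': 'Ann', 'city': 'Jakarta'}]
```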

Data Transformation
- Data are transformed into forms appropriate for mining. Methods:
  - Smoothing: binning, clustering, and regression.
  - Aggregation: summarization, data cube construction.
  - Generalization: low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies.
    - Street → city or country.
    - Numeric values of age → young, middle-aged, senior.
  - Normalization: attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0.

Data Transformation (2)
- Normalization: values are scaled to fall within a small, specified range.
  - Min-max normalization:
    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
  - Z-score normalization:
    v' = (v - mean_A) / stand_dev_A
  - Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
- Useful for classification involving neural networks, or for distance measurements such as nearest-neighbor classification and clustering.

Data Reduction – Data Cube Aggregation
- Suppose the data consist of sales per quarter, for several years, and the user is interested in the annual sales (the total per year) → the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
- The resulting data set is smaller in volume, without loss of the information necessary for the analysis task.
- See Figure 3.4 [JH].
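All three normalization formulas translate directly into code. A minimal plain-Python sketch over a hypothetical list of attribute values (the decimal-scaling exponent j is computed by digit count, a simplification that assumes positive integer values):

```python
values = [200, 300, 400, 600, 1000]  # hypothetical attribute values

# Min-max normalization to the new range [0.0, 1.0].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) * (1.0 - 0.0) + 0.0 for v in values]

# Z-score normalization.
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
zscore = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|v'|) < 1.  For positive integers, j is the digit
# count of the largest magnitude.
j = len(str(int(max(abs(v) for v in values))))
decimal = [v / 10 ** j for v in values]

print([round(v, 3) for v in minmax])   # [0.0, 0.125, 0.25, 0.5, 1.0]
print([round(v, 3) for v in zscore])
print(decimal)                         # [0.02, 0.03, 0.04, 0.06, 0.1]
```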

Data Reduction – Dimensionality Reduction
- Attribute subset selection: heuristic (greedy) methods, sketched below:
  - Stepwise forward selection
  - Stepwise backward selection (or a combination of both)
  - Decision tree induction
- Example: initial attribute set {A1, A2, A3, A4, A5, A6} → reduced attribute set {A1, A4, A6}.
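Stepwise forward selection can be sketched generically. Here score() is a hypothetical placeholder for whatever evaluation the miner uses (e.g., the accuracy of a classifier trained on the candidate subset); the toy stub below simply rewards three designated attributes:

```python
# Stepwise forward selection: start empty, greedily add the attribute
# that most improves the evaluation score, stop when nothing helps.
def forward_select(attributes, score):
    selected, best = [], score([])
    remaining = list(attributes)
    while remaining:
        gains = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(gains)
        if top_score <= best:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Toy scoring stub: pretend A1, A4, A6 are the informative attributes,
# with a small penalty per attribute to discourage bloat.
useful = {"A1", "A4", "A6"}
score = lambda attrs: len(useful & set(attrs)) - 0.01 * len(attrs)
print(forward_select(["A1", "A2", "A3", "A4", "A5", "A6"], score))
# ['A6', 'A4', 'A1'] -- the informative attributes, in greedy order
```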

Data Compression
- Data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.
- Lossless data compression: the original data can be reconstructed from the compressed data without any loss of information.
- Lossy data compression: only an approximation of the original data can be reconstructed.
- Two popular and effective methods of lossy data compression: wavelet transforms and principal components analysis.
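A compact sketch of principal components analysis as lossy compression, assuming numpy is available: project the centered data onto its top-k components, keep the small projection plus the component matrix, and reconstruct an approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # hypothetical 100 tuples, 5 attributes
Xc = X - X.mean(axis=0)                  # center the data

# SVD gives the principal components in the rows of Vt.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                    # keep only 2 components (lossy)
Z = Xc @ Vt[:k].T                        # compressed representation: 100 x 2

X_approx = Z @ Vt[:k] + X.mean(axis=0)   # reconstruct an approximation
print("reconstruction error:", np.linalg.norm(X - X_approx))
```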

Data Compression (2)
[Figure: lossless compression maps the original data to compressed data and back without loss; lossy compression reconstructs only an approximation of the original data.]

Numerosity Reduction
- Parametric methods:
  - Assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
  - Log-linear models: obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces. (See the regression slide above.)
- Non-parametric methods:
  - Do not assume a model.
  - Three major families: clustering (see the clustering slide above), histograms, and sampling.

Numerosity Reduction – Histograms
- A popular data reduction technique.
- Divide the data into buckets and store the average (or sum) for each bucket.
- Partitioning rules:
  - Equiwidth
  - Equidepth
  - Etc.
[Figure: an equiwidth histogram of prices, x-axis 10,000 to 90,000, y-axis counts 0 to 40.]
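A minimal plain-Python sketch of an equiwidth histogram that stores a (count, mean) summary per bucket; the price values are invented:

```python
# Equiwidth histogram: buckets of fixed width; store (count, mean) per bucket.
prices = [5, 8, 12, 14, 15, 18, 21, 25, 26, 28, 30, 34]  # invented data
width = 10
buckets = {}
for p in prices:
    lo = (p // width) * width            # bucket covers [lo, lo + width)
    buckets.setdefault(lo, []).append(p)

summary = {(lo, lo + width): (len(v), sum(v) / len(v))
           for lo, v in sorted(buckets.items())}
print(summary)
# {(0, 10): (2, 6.5), (10, 20): (4, 14.75),
#  (20, 30): (4, 25.0), (30, 40): (2, 32.0)}
```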

Numerosity Reduction – Sampling
- Allows a large data set to be represented by a much smaller random sample (or subset) of the data.
- Choose a representative subset of the data:
  - Simple random sampling may have very poor performance in the presence of skew.
- Develop adaptive sampling methods:
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data.
- Variants: simple random sample without replacement (SRSWOR) and simple random sample with replacement (SRSWR).

Numerosity Reduction – Sampling (2)
[Figure: SRSWOR and SRSWR samples drawn from the raw data.]

Numerosity Reduction – Sampling (3)
[Figure: a cluster/stratified sample compared with the raw data.]
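The sampling variants are short with the standard library. A sketch over hypothetical data, seeded for reproducibility:

```python
import random

random.seed(42)
data = list(range(1000))                                  # hypothetical tuples
classes = ["rare" if i < 50 else "common" for i in data]  # skewed: 5% rare

srswor = random.sample(data, 100)                  # without replacement
srswr = [random.choice(data) for _ in range(100)]  # with replacement

# Stratified sample: keep each class's share of the database (10% here),
# so the rare class is not lost the way it can be under simple sampling.
strata = {}
for item, cls in zip(data, classes):
    strata.setdefault(cls, []).append(item)
stratified = [x for members in strata.values()
              for x in random.sample(members, max(1, len(members) // 10))]
print(len(srswor), len(srswr), len(stratified))  # 100 100 100
```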

Discretization and Concept Hierarchy
- Discretization can be used to reduce the number of values for a given continuous attribute, by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (a small sketch follows this list)
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Segmentation by natural partitioning → the 3-4-5 rule
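A small plain-Python sketch of the first option: equal-width discretization of a continuous attribute into labeled intervals, echoing the young/middle-aged/senior example (the ages and cut points are illustrative):

```python
# Discretize a continuous attribute into labeled intervals (equal width).
ages = [13, 15, 22, 25, 33, 35, 36, 40, 45, 52, 70]
labels = ["young", "middle-aged", "senior"]

lo, hi = min(ages), max(ages)
width = (hi - lo) / len(labels)

def discretize(age):
    # Clamp the top edge so the maximum falls in the last interval.
    i = min(int((age - lo) / width), len(labels) - 1)
    return labels[i]

print([(a, discretize(a)) for a in ages])
# With these ages: 13-31 -> young, 32-50 -> middle-aged, 51-70 -> senior
```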

Example of the 3-4-5 Rule
[Reconstructed from the slide's diagram of profit values:]
- Step 1: Min = -$351, Low (i.e., the 5%-tile) = -$159, High (i.e., the 95%-tile) = $1,838, Max = $4,700.
- Step 2: the most significant digit msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
- Step 3: partition (-$1,000 - $2,000) into 3 equiwidth intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
- Step 4: adjust for the actual Min and Max (the first interval shrinks to (-$400 - 0), and ($2,000 - $5,000) is added), then recursively apply the rule within each interval:
  - (-$400 - 0) into 4: (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
  - (0 - $1,000) into 5: (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
  - ($1,000 - $2,000) into 5: ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
  - ($2,000 - $5,000) into 3: ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)

Concept Hierarchy Generation for Categorical Data
- Categorical data are discrete data: they have a finite number of distinct values, with no ordering among the values. Examples: location, job category.
- Specification of a set of attributes: a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3,567 distinct values
  - street: 674,339 distinct values
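The distinct-value heuristic above is essentially a one-line sort. A sketch using the counts from the location example:

```python
# Auto-generate a hierarchy order: fewer distinct values -> higher level.
distinct_counts = {            # counts from the slide's location example
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" -> ".join(hierarchy))
# country -> province_or_state -> city -> street (top level to bottom)
```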

Conclusion
- Data preparation is a big issue for both warehousing and mining.
- Data preparation includes:
  - Data cleaning
  - Data integration and data transformation
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but data preparation is still an active area of research.

References
[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.