Ericks

WHY DO WE NEED TO PREPROCESS THE DATA?
Fields that are obsolete or redundant
Missing values
Outliers
Data in a form not suitable for data mining models
Values not consistent with policy or common sense

DATA CLEANING
Customer ID | Zip    | Gender | Income   | Age | Marital Status | Transaction Amount
1001        | 10048  | M      | 75000    | C   | M              | 5000
1002        | J2S7K7 | F      | −40000   | 40  | W              | 4000
1003        | 90210  |        | 10000000 | 45  | S              | 7000
1004        | 6269   | M      | 50000    | 0   | S              | 1000
1005        | 55101  | M      | 99999    | 30  | D              | 3000

Can You Find Any Problems in This Tiny Data Set?

DATA CLEANING - ZIP

Customer ID | Zip
1001        | 10048
1002        | J2S7K7
1003        | 90210
1004        | 6269
1005        | 55101

Customer 1002 has a zip code of J2S7K7. (Actually, this is the zip code of St. Hyacinthe, Quebec, Canada.) Customer 1004? (The zip code is probably 06269, which refers to Storrs, Connecticut.) A standard U.S. zip code is a five-digit numeral.

DATA CLEANING - GENDER

Customer ID | Gender
1001        | M
1002        | F
1003        |
1004        | M
1005        | M

The Gender field contains a missing value for customer 1003.
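As a small sketch of how such a format check can be automated, the snippet below flags Zip values that are not standard five-digit numerals. The customer IDs and values mirror the tiny data set above; the code itself is illustrative, not from the source.

```python
import re

# Flag Zip entries that are not standard five-digit U.S. zip codes.
FIVE_DIGITS = re.compile(r"\d{5}")

zips = {1001: "10048", 1002: "J2S7K7", 1003: "90210", 1004: "6269", 1005: "55101"}
flagged = {cid: z for cid, z in zips.items() if not FIVE_DIGITS.fullmatch(z)}
print(flagged)  # {1002: 'J2S7K7', 1004: '6269'}
```

Both problem records surface immediately: the Canadian postal code and the zip code with a dropped leading zero.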
DATA CLEANING - INCOME

Customer ID | Income
1001        | 75000
1002        | −40000
1003        | 10000000
1004        | 50000
1005        | 99999

Customer 1003 is shown as having an income of $10,000,000 per year. Although entirely possible, especially when considering the customer's zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value.

Customer 1004's reported income of −$40,000 lies beyond the field bounds for income and therefore must be an error. What about customer 1005's income of $99,999? Perhaps nothing; it may in fact be valid. But if all the other incomes are rounded to the nearest $5000, why the precision with customer 1005? Often, in legacy databases, certain specified values are meant to be codes for anomalous entries, such as missing values. Perhaps 99999 is one such code here.
DATA CLEANING - ZIP & INCOME

Customer ID | Zip    | Income
1001        | 10048  | 75000
1002        | J2S7K7 | −40000
1003        | 90210  | 10000000
1004        | 6269   | 50000
1005        | 55101  | 99999

Finally, are we clear as to which unit of measure the income variable is measured in? Databases often get merged, sometimes without bothering to check whether such merges are entirely appropriate for all fields. For example, it is quite possible that customer 1002, with the Canadian zip code, has an income measured in Canadian dollars, not U.S. dollars.

DATA CLEANING - AGE
The age field has a couple of problems. Although all the other customers have numerical values for age, customer 1001's "age" of C probably reflects an earlier categorization of this man's age into a bin labeled C.

Customer ID | Age
1001        | C
1002        | 40
1003        | 45
1004        | 0
1005        | 30

The data mining software will definitely not like this categorical value in an otherwise numerical field, and we will have to resolve this problem somehow. How about customer 1004's age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000. More likely, the age of this person is probably missing and was coded as 0 to indicate this or some other anomalous condition (e.g., refused to provide the age information).
HANDLING MISSING DATA
Replace the missing value with some constant, specified by the analyst.
Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
Replace the missing values with a value generated at random from the variable distribution observed.
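The three strategies can be sketched in plain Python. The lists below reuse the Age and Gender values from the tiny data set, with None marking the missing or unusable entries (treating customer 1001's categorical "C" as missing is an assumption for illustration).

```python
import random
import statistics

# Age and Gender from the tiny data set; None marks missing/unusable values.
ages = [None, 40, 45, 0, 30]
genders = ["M", "F", None, "M", "M"]

observed = [a for a in ages if a is not None]

# 1. Replace with an analyst-specified constant (here, -1).
by_constant = [a if a is not None else -1 for a in ages]

# 2. Replace with the field mean (numerical) or mode (categorical).
mean_age = statistics.mean(observed)  # (40 + 45 + 0 + 30) / 4 = 28.75
by_mean = [a if a is not None else mean_age for a in ages]

mode_gender = statistics.mode(g for g in genders if g is not None)  # 'M'
by_mode = [g if g is not None else mode_gender for g in genders]

# 3. Replace with a value drawn at random from the observed distribution.
by_random = [a if a is not None else random.choice(observed) for a in ages]

print(by_mean)  # [28.75, 40, 45, 0, 30]
print(by_mode)  # ['M', 'F', 'M', 'M', 'M']
```

Each strategy has trade-offs: the constant and mean distort the variable's spread, while random draws preserve the distribution but add noise.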
MEAN

Describes the central value of a set of data:

    Mean = ( Σ Xᵢ ) / N

where Xᵢ = the data values and N = the number of data values.

REPLACE MISSING VALUE WITH MEAN

Customer ID | Age
1001        | C
1002        | 40
1003        | 45
1004        | 0
1005        | 30

For customer 1001, the value C is replaced with the mean of the Age field:

    Mean = (40 + 45 + 0 + 30) / 4 = 28.75
MODE

Describes the value that occurs most often in a set of data.

    Data = 1, 2, 1, 4, 3, 1, 5, 3, 1, 2  →  Mode = 1

REPLACE MISSING VALUE WITH MODE

Customer ID | Gender
1001        | M
1002        | F
1003        |
1004        | M
1005        | M

For customer 1003, the missing Gender field is filled with 'M', the mode of Gender.
IDENTIFYING MISCLASSIFICATIONS

Notice Anything Strange about This Frequency Distribution?

Level Name | Count
USA        | 1
France     | 1
US         | 156
Europe     | 46
Japan      | 51

Two classes, USA and France, have a count of only one automobile each. What is clearly happening here is that two of the records have been classified inconsistently with respect to the origin of manufacture. To maintain consistency with the remainder of the data set, the record labeled USA should be relabeled US, and the record labeled France should be relabeled Europe.
GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS
DATA TRANSFORMATION
Data miners should normalize their numerical variables, to standardize the scale of effect each variable has on the results
For example, if we are interested in major league baseball, players’ batting averages will range from zero to less than 0.400, while the number of home runs hit in a season will range from zero to around 70.
For some data mining algorithms, such differences in the ranges will lead to a tendency for the variable with greater range to have undue influence on the results.
Histogram of time-to-60, with summary statistics:
Min = 8, Max = 25, Mean = 15.548, SD = 2.911, N = 261

MIN-MAX NORMALIZATION

    X* = (X − min(X)) / range(X) = (X − min(X)) / (max(X) − min(X))
For a “drag-racing-ready” vehicle, which takes only 8 seconds (the field minimum) to reach 60 mph, the min-max normalization is

    X* = (8 − 8) / (25 − 8) = 0

From this we can learn that data values which represent the minimum for the variable will have a min-max normalization value of zero. For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the min-max normalization is

    X* = (15.548 − 8) / (25 − 8) = 0.444

This tells us that we may expect variable values near the center of the distribution to have a min-max normalization value near 0.5. For an “I’ll get there when I’m ready” vehicle, which takes 25 seconds (the variable maximum) to reach 60 mph, the min-max normalization is

    X* = (25 − 8) / (25 − 8) = 1.0
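The three worked examples can be checked with a short function; the minimum, maximum, and mean come from the histogram summary above. This is a sketch, not part of the source.

```python
# Min-max normalization: scale x into [0, 1] relative to the observed range.
def min_max(x, lo, hi):
    return (x - lo) / (hi - lo)

lo, hi = 8, 25  # field minimum and maximum for time-to-60

print(min_max(8, lo, hi))                 # 0.0   (the minimum)
print(round(min_max(15.548, lo, hi), 3))  # 0.444 (the mean)
print(min_max(25, lo, hi))                # 1.0   (the maximum)
```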
Z-SCORE STANDARDIZATION

    X* = (X − mean(X)) / SD(X)

For the vehicle that takes only 8 seconds to reach 60 mph, the Z-score standardization is

    X* = (8 − 15.548) / 2.911 = −2.593

Thus, data values that lie below the mean will have a negative Z-score standardization.
For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the Z-score standardization is
    X* = (15.548 − 15.548) / 2.911 = 0
This tells us that variable values falling exactly on the mean will have a Z-score standardization of zero.
For the car that takes 25 seconds to reach 60 mph, the Z-score standardization is

    X* = (25 − 15.548) / 2.911 = 3.247
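Likewise, the Z-score examples can be verified in a few lines; the mean and SD are the summary values from the time-to-60 histogram.

```python
# Z-score standardization: distance from the mean in standard-deviation units.
def z_score(x, mean, sd):
    return (x - mean) / sd

mean, sd = 15.548, 2.911  # summary statistics for time-to-60

print(round(z_score(8, mean, sd), 3))   # -2.593 (the minimum)
print(z_score(15.548, mean, sd))        # 0.0    (the mean)
print(round(z_score(25, mean, sd), 3))  # 3.247  (the maximum)
```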
Histogram of standardized time-to-60, with summary statistics:
Min = −2.593, Max = 3.247, Mean = 0.00, SD = 1.0, N = 261

VARIANCE & STANDARD DEVIATION

Variance is a measure of the variability of the data: the larger the variance, the more strongly the values fluctuate from one data point to the next.
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10
Mean = 60 / 10 = 6

Sum of squared deviations
= (2−6)² + (3−6)² + (4−6)² + (4−6)² + (5−6)² + (7−6)² + (8−6)² + (8−6)² + (9−6)² + (10−6)²
= 16 + 9 + 4 + 4 + 1 + 1 + 4 + 4 + 9 + 16
= 68

Variance = 68 / 10 = 6.8 (dividing by N = 10), and standard deviation = √6.8 ≈ 2.61.
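The computation can be verified directly; this sketch uses the population variance (dividing by N), matching the worked example.

```python
# Population variance and standard deviation for the slide's data set.
data = [2, 3, 4, 4, 5, 7, 8, 8, 9, 10]

mean = sum(data) / len(data)             # 60 / 10 = 6.0
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 68.0
variance = ss / len(data)                # 6.8
sd = variance ** 0.5                     # ≈ 2.608

print(mean, ss, variance, round(sd, 3))  # 6.0 68.0 6.8 2.608
```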
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

Interquartile Range. The quartiles of a data set divide the data set into four parts, each containing 25% of the data. The first quartile (Q1) is the 25th percentile. The second quartile (Q2) is the 50th percentile, that is, the median. The third quartile (Q3) is the 75th percentile.

The interquartile range (IQR) is a measure of variability that is much more robust than the standard deviation. The IQR is calculated as IQR = Q3 − Q1 and may be interpreted to represent the spread of the middle 50% of the data. A robust measure of outlier detection is therefore defined as follows. A data value is an outlier if:
a. It is located 1.5(IQR) or more below Q1, or
b. It is located 1.5(IQR) or more above Q3.

INTERQUARTILE RANGE
For example, suppose that for a set of test scores, the 25th percentile was Q1 = 70 and the 75th percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80. Then the interquartile range, the difference between these
quartiles, was IQR = 80 − 70 = 10.
A test score would be robustly identified as an outlier if:
a. It is lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or
b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
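A sketch of the robust outlier rule, using the quartiles from the test-score example (the strict inequalities follow the thresholds 55 and 95 above):

```python
# IQR outlier rule: flag values beyond 1.5 * IQR outside the quartiles.
def iqr_outlier(x, q1, q3):
    iqr = q3 - q1
    return x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr

q1, q3 = 70, 80  # test-score quartiles from the example

print(iqr_outlier(54, q1, q3))  # True  (below the lower fence of 55)
print(iqr_outlier(72, q1, q3))  # False (within the fences)
print(iqr_outlier(96, q1, q3))  # True  (above the upper fence of 95)
```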
MEDIAN

Describes the middle value in a set of data. Its position in the sorted data is

    position = (N + 1) / 2,  where N = the number of data values

MEDIAN EXAMPLES

Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12 → position = (11 + 1) / 2 = 6, so the median is the 6th value = 7
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10 → position = (10 + 1) / 2 = 5.5; the "5.5th value" is the average of the 5th and 6th values = (5 + 7) / 2 = 6

QUARTILES

The quartiles split the sorted data into a low group, the median, and a high group:
Q1 = median of the low group
Q2 = median of the whole data set
Q3 = median of the high group
QUARTILE EXAMPLES

Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12 → Q2 = 7 (the middle value); Q1 = 4 (median of 2, 3, 4, 4, 5); Q3 = 9 (median of 8, 8, 9, 10, 12)
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10 → Q2 = (5 + 7) / 2 = 6; Q1 = 4 (median of 2, 3, 4, 4, 5); Q3 = 8 (median of 7, 8, 8, 9, 10)

Handling Noisy Data (#1)

What is noise? Noise is random error, or variance arising in the measurement of a variable.
Solution: smoothing the data.
Handling Noisy Data (#2)

Some smoothing approaches:
– Binning
– Clustering
– Regression

Binning methods smooth a sorted data value by "consulting" its neighborhood, that is, the values around it. The sorted values are distributed into a number of "buckets", or bins. Binning performs local smoothing of the data.
In this example, the data are first sorted and then partitioned into equal-depth bins, e.g., of depth 3 (each bin contains three values).

Binning (#2)

Example 1: sorted data for the price variable (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
First the data are partitioned into bins of equidepth 3 (equal depth):
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Binning (#3)
Smoothing by bin means (average value):
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by bin medians (middle value):
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Binning (#4)

Smoothing by bin boundaries (boundary values):
Bin 1: 4, 4, 15 {8 becomes 4 because it is closer to 4 than to 15}
Bin 2: 21, 21, 24 {21 is unchanged because the values are equal}
Bin 3: 25, 25, 34 {28 becomes 25 because it is closer to 25 than to 34}
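The three smoothing variants can be sketched as follows, reproducing the bin values above (equidepth bins of depth 3; breaking ties in the boundary rule toward the lower boundary is an assumption, since the example contains no ties).

```python
from statistics import mean, median

# Sorted price data from the example, partitioned into equidepth bins of 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
smoothed_means = [[mean(b)] * len(b) for b in bins]

# Smoothing by bin medians: every value becomes its bin's median.
smoothed_medians = [[median(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the nearer bin boundary.
def by_boundaries(b):
    lo, hi = b[0], b[-1]
    return [lo if x - lo <= hi - x else hi for x in b]

smoothed_bounds = [by_boundaries(b) for b in bins]

print(smoothed_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smoothed_medians)  # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smoothed_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```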
REFERENCES

Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Chapter 2, Wiley, 2004.