
  Ericks

WHY DO WE NEED TO PREPROCESS THE DATA?

  • Fields that are obsolete or redundant
  • Missing values
  • Outliers
  • Data in a form not suitable for data mining models
  • Values not consistent with policy or common sense

DATA CLEANING

  Customer ID  Zip     Gender   Income    Age  Marital Status  Transaction Amount
  1001         10048   M        75000     C    M               5000
  1002         J2S7K7  F        −40000    40   W               4000
  1003         90210   (blank)  10000000  45   S               7000
  1004         6269    M        50000     0    S               1000
  1005         55101   M        99999     30   D               3000

  Can You Find Any Problems in This Tiny Data Set?

DATA CLEANING - ZIP

  Customer ID  Zip
  1001         10048
  1002         J2S7K7
  1003         90210
  1004         6269
  1005         55101

  Customer 1002 has a zip code of J2S7K7. (Actually, this is the postal code of St. Hyacinthe, Quebec, Canada.)

  Customer 1004? A standard U.S. zip code is a five-digit numeral, so this zip code is probably 06269, which refers to Storrs, Connecticut.

DATA CLEANING - GENDER

  Customer ID  Gender
  1001         M
  1002         F
  1003         (blank)
  1004         M
  1005         M

  The gender field contains a missing value for customer 1003.

DATA CLEANING - INCOME

  Customer ID  Income
  1001         75000
  1002         −40000
  1003         10000000
  1004         50000
  1005         99999

  Customer 1003 is shown as having an income of $10,000,000 per year. Although entirely possible, especially when considering the customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value.

  Customer 1002’s reported income of −$40,000 lies beyond the field bounds for income and therefore must be an error.

  What is wrong with customer 1005’s income of $99,999? Perhaps nothing; it may in fact be valid. But if all the other incomes are rounded to the nearest $5000, why the precision with customer 1005? Often, in legacy databases, certain specified values are meant to be codes for anomalous entries, such as missing values. Perhaps 99999 is one such code.

DATA CLEANING - ZIP & INCOME

  Customer ID  Zip     Income
  1001         10048   75000
  1002         J2S7K7  −40000
  1003         90210   10000000
  1004         6269    50000
  1005         55101   99999

  Finally, are we clear as to which unit of measure the income variable is measured in? Databases often get merged, sometimes without bothering to check whether such merges are entirely appropriate for all fields. For example, it is quite possible that customer 1002, with the Canadian postal code, has an income measured in Canadian dollars, not U.S. dollars.

DATA CLEANING - AGE

  Customer ID  Age
  1001         C
  1002         40
  1003         45
  1004         0
  1005         30

  The age field has a couple of problems. Although all the other customers have numerical values for age, customer 1001’s “age” of C probably reflects an earlier categorization of this man’s age into a bin labeled C. The data mining software will definitely not like this categorical value in an otherwise numerical field, and we will have to resolve this problem somehow.

  How about customer 1004’s age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000. More likely, the age of this person is missing and was coded as 0 to indicate this or some other anomalous condition (e.g., the customer refused to provide the age information).
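The checks discussed for this tiny data set can be sketched as simple validation rules in Python. This is only an illustration: the field names, the `find_problems` helper, and the choice of 99999 and 0 as suspected missing-value codes are assumptions made for this example, not part of the original material.

```python
import re

# The tiny data set from the slides; None marks a blank field.
records = [
    {"id": 1001, "zip": "10048",  "gender": "M",  "income": 75000,    "age": "C"},
    {"id": 1002, "zip": "J2S7K7", "gender": "F",  "income": -40000,   "age": "40"},
    {"id": 1003, "zip": "90210",  "gender": None, "income": 10000000, "age": "45"},
    {"id": 1004, "zip": "6269",   "gender": "M",  "income": 50000,    "age": "0"},
    {"id": 1005, "zip": "55101",  "gender": "M",  "income": 99999,    "age": "30"},
]

def find_problems(rec):
    """Return a list of human-readable issues found in one record."""
    issues = []
    # A standard U.S. zip code is exactly five digits.
    if not re.fullmatch(r"\d{5}", rec["zip"]):
        issues.append("zip is not five digits")
    if rec["gender"] is None:
        issues.append("gender is missing")
    # Income cannot be negative; 99999 is an assumed legacy code.
    if rec["income"] < 0:
        issues.append("income is negative")
    elif rec["income"] == 99999:
        issues.append("income may be a missing-value code")
    # Age should be numeric; 0 is an assumed missing-value code.
    if not rec["age"].isdigit():
        issues.append("age is not numeric")
    elif int(rec["age"]) == 0:
        issues.append("age of 0 may be a missing-value code")
    return issues

report = {rec["id"]: find_problems(rec) for rec in records}
for cust_id, issues in report.items():
    print(cust_id, issues)
```

Rules like these catch format and range errors only. The $10,000,000 income passes every check here, because spotting an outlier requires comparing a value with the rest of the distribution.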

HANDLING MISSING DATA

  • Replace the missing value with some constant, specified by the analyst.

  • Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).

  • Replace the missing value with a value generated at random from the observed distribution of the variable.

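MEAN

  Describes the central value of a set of data.

  $$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

  where $X_i$ are the data values and $N$ is the number of data values.

REPLACE MISSING VALUE WITH MEAN

  Customer ID  Age
  1001         C
  1002         40
  1003         45
  1004         0
  1005         30

  For customer 1001, the value C is replaced with the mean of Age:

  Mean = (40 + 45 + 0 + 30) / 4 = 28.75

  (This calculation treats customer 1004’s age of 0 as a real value. If 0 is a missing-value code, it should be excluded before the mean is computed.)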

MODE

  Describes the value that occurs most often in a data set.

  Data = 1, 2, 1, 4, 3, 1, 5, 3, 1, 2
  Mode = 1

REPLACE MISSING VALUE WITH MODE

  Customer ID  Gender
  1001         M
  1002         F
  1003         (blank)
  1004         M
  1005         M

  For customer 1003, the Gender field is filled in with ‘M’, the most frequent value.
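The mean and mode replacements above, together with the random-draw strategy listed under HANDLING MISSING DATA, can be sketched in Python as follows. The function names are hypothetical, and treating any non-numeric age (such as "C") as missing is an assumption of this example.

```python
import random
from statistics import mean, mode

ages = ["C", 40, 45, 0, 30]            # customer 1001 holds the code "C"
genders = ["M", "F", None, "M", "M"]   # customer 1003 is blank

def impute_mean(values):
    """Replace every non-numeric entry with the mean of the numeric ones."""
    observed = [v for v in values if isinstance(v, (int, float))]
    fill = mean(observed)
    return [v if isinstance(v, (int, float)) else fill for v in values]

def impute_mode(values):
    """Replace every None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

def impute_random(values, rng):
    """Replace every None with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages_filled = impute_mean(ages)
genders_mode = impute_mode(genders)
genders_random = impute_random(genders, random.Random(0))

print(ages_filled)    # the "C" entry becomes 28.75
print(genders_mode)   # the blank entry becomes "M"
```

Drawing from the observed values keeps the imputed field's distribution close to the original one, whereas mean or mode replacement concentrates the filled entries on a single value.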

IDENTIFYING MISCLASSIFICATIONS

  Notice Anything Strange about This Frequency Distribution?

  Level Name  Count
  USA         1
  France      1
  US          156
  Europe      46
  Japan       51

  Two of the levels, USA and France, have a count of only one automobile each. What is clearly happening here is that two of the records have been classified inconsistently with respect to the origin of manufacture.

  To maintain consistency with the remainder of the data set, the record with origin USA should be relabeled US, and the record with origin France should be relabeled Europe.
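A frequency count like the one above is easy to produce and act on in Python. The sketch below rebuilds the counts from the table, flags levels that occur only once, and recodes them; the `recode` mapping (USA to US, France to Europe) and the "count of one" threshold are assumptions chosen for this example.

```python
from collections import Counter

# Rebuild the origin field from the frequency table in the slides.
origins = ["USA"] + ["France"] + ["US"] * 156 + ["Europe"] * 46 + ["Japan"] * 51

counts = Counter(origins)

# Levels that appear only once are candidates for misclassification.
rare = sorted(level for level, n in counts.items() if n <= 1)

# Assumed mapping from each inconsistent label to its intended level.
recode = {"USA": "US", "France": "Europe"}
cleaned = Counter(recode.get(origin, origin) for origin in origins)

print(rare)
print(cleaned)
```

A rare level is only a hint. Whether it is a misclassification or a genuine small category has to be decided by someone who knows the data.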

  

GRAPHICAL METHODS FOR IDENTIFYING OUTLIERS

  


DATA TRANSFORMATION

  Data miners should normalize their numerical variables, to standardize the scale of effect each variable has on the results.

  For example, if we are interested in major league baseball, players’ batting averages will range from zero to less than 0.400, while the number of home runs hit in a season will range from zero to around 70.

  For some data mining algorithms, such differences in the ranges will lead to a tendency for the variable with greater range to have undue influence on the results.

  Histogram of time-to-60, with summary statistics:
  Min = 8, Max = 25, Mean = 15.548, SD = 2.911, N = 261

  Min–Max Normalization

  $$X^* = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

  For a “drag-racing-ready” vehicle, which takes only 8 seconds (the field minimum) to reach 60 mph, the min–max normalization is

  $$X^* = \frac{8 - 8}{25 - 8} = 0$$

  From this we can learn that data values which represent the minimum for the variable will have a min–max normalization value of zero.

  For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the min–max normalization is

  $$X^* = \frac{15.548 - 8}{25 - 8} = 0.444$$

  This tells us that we may expect variable values near the center of the distribution to have a min–max normalization value near 0.5.

  For an “I’ll get there when I’m ready” vehicle, which takes 25 seconds (the variable maximum) to reach 60 mph, the min–max normalization is

  $$X^* = \frac{25 - 8}{25 - 8} = 1.0$$
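The three worked values above can be reproduced with a few lines of Python. The function name `min_max` is a hypothetical helper for this sketch; the minimum (8) and maximum (25) are the summary statistics quoted for the time-to-60 variable.

```python
def min_max(x, lo, hi):
    """Min-max normalization: map x from the range [lo, hi] onto [0, 1]."""
    return (x - lo) / (hi - lo)

LO, HI = 8, 25  # field minimum and maximum of time-to-60, in seconds

fastest = min_max(8, LO, HI)
average = min_max(15.548, LO, HI)
slowest = min_max(25, LO, HI)

print(fastest, round(average, 3), slowest)
```

Because the result depends only on the minimum and maximum, a single extreme value stretches the scale for every other record, which is one reason to look for outliers before normalizing.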

  Z-Score Standardization

  $$X^* = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$

  For the vehicle that takes only 8 seconds to reach 60 mph, the Z-score standardization is

  $$X^* = \frac{8 - 15.548}{2.911} = -2.593$$

  Thus, data values that lie below the mean will have a negative Z-score standardization.

  For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the Z-score standardization is

  $$X^* = \frac{15.548 - 15.548}{2.911} = 0$$

  This tells us that variable values falling exactly on the mean will have a Z-score standardization of zero.

  For the car that takes 25 seconds to reach 60 mph, the Z-score standardization is

  $$X^* = \frac{25 - 15.548}{2.911} = 3.247$$

  Histogram of time-to-60 (Z-score standardized), with summary statistics:
  Min = −2.593, Max = 3.247, Mean = 0.00, SD = 1.0, N = 261

VARIANCE & STANDARD DEVIATION

  Variance is a measure of the variability of the data: the larger the variance, the more the data values fluctuate from one another.

  Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10
  Mean = 60 / 10 = 6

  Sum of squared deviations from the mean:

  $$\sum_{i=1}^{N} (X_i - \bar{X})^2 = (2-6)^2 + (3-6)^2 + (4-6)^2 + (4-6)^2 + (5-6)^2 + (7-6)^2 + (8-6)^2 + (8-6)^2 + (9-6)^2 + (10-6)^2$$
  $$= 16 + 9 + 4 + 4 + 1 + 1 + 4 + 4 + 9 + 16 = 68$$

  The variance is this sum divided by the number of data values, 68 / 10 = 6.8 (or divided by N − 1 for the sample variance, 68 / 9 ≈ 7.56). The standard deviation is the square root of the variance.
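The arithmetic above can be checked with a short Python sketch. It computes the sum of squared deviations (68), divides it by N and by N − 1 to get the population and sample variances, and then reuses the same mean-and-SD idea for the Z-score examples. All names here are hypothetical helpers for illustration.

```python
from math import sqrt

data = [2, 3, 4, 4, 5, 7, 8, 8, 9, 10]
n = len(data)

mean = sum(data) / n                                 # 60 / 10
sum_sq_dev = sum((x - mean) ** 2 for x in data)      # 16 + 9 + ... + 16

population_variance = sum_sq_dev / n                 # divide by N
sample_variance = sum_sq_dev / (n - 1)               # divide by N - 1
sample_sd = sqrt(sample_variance)

def z_score(x, mean, sd):
    """Z-score standardization: distance from the mean in SD units."""
    return (x - mean) / sd

# Time-to-60 examples, using the summary statistics from the slides.
z_fast = z_score(8, 15.548, 2.911)
z_avg = z_score(15.548, 15.548, 2.911)
z_slow = z_score(25, 15.548, 2.911)

print(mean, sum_sq_dev, population_variance, round(sample_variance, 2))
print(round(z_fast, 3), z_avg, round(z_slow, 3))
```

Whether to divide by N or N − 1 depends on whether the data are treated as the whole population or as a sample; the slides do not say which convention is intended.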

NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

  Interquartile Range

  The quartiles of a data set divide the data set into four parts, each containing 25% of the data.

  The first quartile (Q1) is the 25th percentile. The second quartile (Q2) is the 50th percentile, that is, the median. The third quartile (Q3) is the 75th percentile.

  The interquartile range (IQR) is a measure of variability that is much more robust than the standard deviation. The IQR is calculated as IQR = Q3 − Q1 and may be interpreted to represent the spread of the middle 50% of the data.

  A robust measure of outlier detection is therefore defined as follows. A data value is an outlier if:

  a. It is located 1.5(IQR) or more below Q1, or
  b. It is located 1.5(IQR) or more above Q3.

INTERQUARTILE RANGE

  For example, suppose that for a set of test scores, the 25th percentile was Q1 = 70 and the 75th percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80. Then the interquartile range, the difference between these quartiles, was IQR = 80 − 70 = 10.

  A test score would be robustly identified as an outlier if:

  a. It is lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or
  b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
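The IQR rule from the test-score example can be written as a small Python sketch. The function names and the sample scores are hypothetical. The comparison is strict ("lower than" and "higher than"), following the wording of the worked example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3):
    """True if x lies strictly outside the 1.5 * IQR fences."""
    lower, upper = iqr_fences(q1, q3)
    return x < lower or x > upper

fences = iqr_fences(70, 80)

# Made-up test scores, used only to exercise the rule.
scores = [50, 62, 75, 88, 99]
flagged = [s for s in scores if is_outlier(s, 70, 80)]

print(fences)
print(flagged)
```

The multiplier 1.5 is exposed as the parameter `k` so that a stricter or looser rule can be tried without changing the function.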

MEDIAN

  Describes the middle value of a sorted data set.

  $$\text{Median position} = \frac{N + 1}{2}$$

  where $N$ is the number of data values.

MEDIAN EXAMPLE

  Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12
  Median position = (11 + 1) / 2 = 6
  The 6th value = 7

  Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10
  Median position = (10 + 1) / 2 = 5.5
  Position 5.5 falls between the 5th value (5) and the 6th value (7), so the median is their average, (5 + 7) / 2 = 6.

QUARTILES

  Quartiles divide the data into a low group, the median, and a high group.

  Q1 = median of the low group
  Q2 = median of the entire data set
  Q3 = median of the high group

QUARTILE EXAMPLE

  Low group: 2, 3, 4, 4, 5      Median: 7      High group: 8, 8, 9, 10, 12
  Q1 = 4                        Q2 = 7         Q3 = 9
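The median-of-halves definition of the quartiles can be sketched in Python as follows. The `quartiles` function is a hypothetical helper; it assumes that for an odd number of values the median itself belongs to neither group, and that for an even number the data are simply split in half. Statistical packages often use other quartile conventions, so their results may differ slightly.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    low = data[:half]
    # For an odd count, skip the middle value when forming the high group.
    high = data[half + 1:] if len(data) % 2 else data[half:]
    return median(low), median(data), median(high)

odd_case = quartiles([2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12])
even_case = quartiles([2, 3, 4, 4, 5, 7, 8, 8, 9, 10])

print(odd_case)
print(even_case)
```

For the eleven-value data set the function returns Q1 = 4, Q2 = 7, and Q3 = 9; for the ten-value data set the middle quartile is the average of 5 and 7.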

  

  With ten values, the median falls between two data values:

  Low group: 2, 3, 4, 4, 5      Median: (5 + 7) / 2 = 6      High group: 7, 8, 8, 9, 10
  Q1 = 4                        Q2 = 6                       Q3 = 8

Handling Noisy Data (#1)

  “What is noise?”

  Noise is error that occurs at random, or variation that arises in the measurement of a variable.

  Solution: smoothing the data.

Handling Noisy Data (#2)

  Several approaches to smoothing:

  • Binning
  • Clustering
  • Regression
Binning (#1)

  • Binning methods smooth the values of sorted data by “consulting” their “neighbors”, that is, the values around them.

  • The sorted values are distributed into a number of “buckets”, or bins.

  • The smoothing is therefore local.

  • In this example, the data are first sorted and then partitioned into bins of equal depth, for example 3 (each bin holds three values).

Binning (#2)

  Example 1: sorted data for the variable price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

  The data are first partitioned into equidepth bins of depth 3:

  Bin 1 : 4, 8, 15
  Bin 2 : 21, 21, 24
  Bin 3 : 25, 28, 34

Binning (#3)

  Smoothing by bin means (each value is replaced by the mean of its bin):

  Bin 1 : 9, 9, 9
  Bin 2 : 22, 22, 22
  Bin 3 : 29, 29, 29

  Smoothing by bin medians (each value is replaced by the median of its bin):

  Bin 1 : 8, 8, 8
  Bin 2 : 21, 21, 21
  Bin 3 : 28, 28, 28

Binning (#4)

  Smoothing by bin boundaries (each value is replaced by the nearer of the bin’s minimum and maximum):

  Bin 1 : 4, 4, 15 {8 becomes 4 because it is closer to 4 than to 15}
  Bin 2 : 21, 21, 24 {21 is unchanged because it already equals the lower boundary}
  Bin 3 : 25, 25, 34 {28 becomes 25 because it is closer to 25 than to 34}
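The three smoothing variants can be reproduced with a short Python sketch. The function names are hypothetical, and sending a value that is equally close to both boundaries to the lower one is an assumption, since the example above contains no such tie.

```python
from statistics import mean, median

def equal_depth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` items."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)

by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_medians)
print(by_boundaries)
```

A larger bin depth smooths more aggressively, because each value is replaced using information from more neighbors.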

REFERENCES

  • Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, Chapter 2, Wiley, 2004.