Conceptual Learning Data Machine Learning

Study Program: Manajemen Bisnis Telekomunikasi & Informatika. Course: Big Data and Data Analytics. By: Teaching Team

CONCEPTUAL DATA SCIENCE Telkom University Creating the great business leaders

Study Program: Manajemen Bisnis Telekomunikasi & Informatika. Lecturer: Yudi Priyadi, M.T. Fakultas Ekonomi dan Bisnis (School of Economics and Business)

OUTLINE

o Data Simulation (Monte Carlo)
o Data Preprocessing
o Conceptual Learning Data / Machine Learning
o Model Evaluation / Accuracy
o Case Study / Exercise


Modeling and Simulation

– Modeling and simulation (M&S) refers to using models (physical, mathematical, or otherwise logical representations of a system, entity, phenomenon, or process) as a basis for simulations (methods for implementing a model, either statically or over time) to develop data as a basis for managerial or technical decision making.
– M&S helps in getting information about how something will behave without actually testing it in real life. (Wikipedia)

An Example of Simulation: Monte Carlo Methods

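Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying idea is to use randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches.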
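As an illustration of repeated random sampling applied to a deterministic problem, the sketch below estimates pi by throwing random points into the unit square and counting how many land inside the quarter circle. This is a common textbook example, not one taken from the slides; the function name `estimate_pi` and the seed value are assumptions made for the illustration.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # the point lies inside the quarter circle
            inside += 1
    # The quarter circle covers pi/4 of the unit square's area.
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    # More samples generally give an estimate closer to pi.
    for n in (100, 10_000, 100_000):
        print(n, estimate_pi(n))
```

The estimate is itself random: a different seed gives a slightly different value, and the error tends to shrink as the number of samples grows.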


Monte Carlo Example


GoldSim Video Monte Carlo Simulation


Why Simulation

Simulation is generally cheaper, safer, and sometimes more ethical than conducting real-world experiments. For example, simulations are used to study hurricanes and other natural catastrophes.
Simulations can often be even more realistic than traditional experiments, as they allow the free configuration of environment parameters found in the operational application field of the final product. Examples include supporting deep-water operations of the US Navy and simulating the surface of neighboring planets in preparation for NASA missions.

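Simulations can often be conducted faster than real time. This allows them to be used for efficient what-if analyses of different alternatives, in particular when the data needed to initialize the simulation can easily be obtained from operational data. This use of simulation adds decision support simulation systems to the toolbox of traditional decision support systems.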


Data Preprocessing (Why?)

Measures for data quality: a multidimensional view

• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how far can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?


Major Tasks in Data Preprocessing

1. Data cleaning
   • Fill in missing values
   • Smooth noisy data
   • Identify or remove outliers
   • Resolve inconsistencies
2. Data reduction
   • Dimensionality reduction
   • Numerosity reduction
   • Data compression
3. Data transformation and data discretization
   • Normalization
   • Concept hierarchy generation
4. Data integration
   • Integration of multiple databases or files
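Normalization, listed under data transformation, can be sketched as min-max scaling, which maps every value of an attribute into a chosen range. The slides do not say which normalization method is meant, so min-max scaling is an assumption here, and the function name and the income figures are made up for the illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


# Hypothetical attribute values (not from the course dataset).
incomes = [12_000, 35_000, 58_000, 98_000]
print(min_max_normalize(incomes))
```

After scaling, attributes measured in very different units become comparable, which matters for distance-based methods such as K-Nearest Neighbor.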


Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error.

• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – e.g., Occupation=“ ” (missing data)
• Noisy: containing noise, errors, or outliers
  – e.g., Salary=“−10” (an error)
• Inconsistent: containing discrepancies in codes or names
  – e.g., Age=“42”, Birthday=“03/07/2010”
  – Was rating “1, 2, 3”, now rating “A, B, C”
  – Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  – Jan. 1 as everyone’s birthday?


Incomplete (Missing) Data

• Data is not always available
  – e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  – equipment malfunction
  – data that was inconsistent with other recorded data and thus deleted
  – data not entered due to misunderstanding
  – certain data not being considered important at the time of entry
  – history or changes of the data not being registered
• Missing data may need to be inferred
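One simple way to infer missing values is to fill them with the mean of the values that were recorded. The sketch below assumes missing entries are represented as `None`; mean imputation is only one of several possible strategies and is not prescribed by the slides. The function name and sample values are hypothetical.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical customer incomes with two unrecorded entries.
incomes = [40, None, 60, 50, None]
print(impute_with_mean(incomes))
```

Mean imputation keeps the attribute's average unchanged but understates its spread, so it is best treated as a baseline.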


Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same analytical results.

Why data reduction?

• A database/data warehouse may store terabytes of data
• Complex data analysis may take a very long time to run on the complete dataset

Data reduction strategies

1. Dimensionality reduction
   1. Feature Extraction
   2. Feature Selection
2. Numerosity reduction (Data Reduction)
   • Regression and Log-Linear Models
   • Histograms, clustering, sampling
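Two of the numerosity reduction techniques named above, sampling and histograms, can be sketched in a few lines. The code assumes simple random sampling without replacement and equal-width bins; both are illustrative choices, and the function names are hypothetical.

```python
import random
from collections import Counter


def simple_random_sample(rows, n, seed=0):
    """Keep n rows chosen at random without replacement."""
    return random.Random(seed).sample(rows, n)


def equal_width_histogram(values, n_bins):
    """Summarize values as (bin_start, bin_end, count) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all values identical: everything falls in the first bin
    counts = Counter()
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(n_bins)]


data = list(range(1, 11))
print(simple_random_sample(data, 4))
print(equal_width_histogram(data, 3))
```

In both cases the full data set is replaced by something smaller: a subset of the rows, or a handful of bin counts.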


General Methods in Data Analytics

1. Estimation:
   • Linear Regression, Neural Network, Support Vector Machine, etc.
2. Prediction/Forecasting:
   • Linear Regression, Neural Network, Support Vector Machine, etc.
3. Classification:
   • Naive Bayes, K-Nearest Neighbor, C4.5, ID3, CART, Linear Discriminant Analysis, Logistic Regression, etc.
4. Clustering:
   • K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means, etc.
5. Association:
   • FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc.


Evaluation (Accuracy, Error)

1. Estimation:
   • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
2. Prediction/Forecasting:
   • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc.
3. Classification:
   • Confusion Matrix: Accuracy
   • ROC Curve: Area Under Curve (AUC)
4. Clustering:
   • Internal Evaluation: Davies–Bouldin index, Dunn index
   • External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix
5. Association:
   • Lift Charts: Lift Ratio
   • Precision and Recall (F-measure)
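The error measures for estimation and forecasting, and accuracy from a confusion matrix, can be written out directly from their standard definitions. The sketch below assumes equal-length lists of actual and predicted values and, for MAPE, that no actual value is zero; the function names and numbers are illustrative only.

```python
import math


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions in a 2x2 confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)


actual, predicted = [100, 200], [110, 180]  # made-up values
print(mse(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
print(accuracy(tp=40, tn=50, fp=5, fn=5))
```

RMSE is expressed in the same unit as the target variable, while MAPE is unit-free, which makes it easier to compare across data sets.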


Machine Learning

In the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, engineers, and analysts to "produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning from historical relationships and trends in the data. (Wikipedia)

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. (Stanford/Coursera)

Machine learning is a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. (whatis.com)


Data Split

The Split Data operator takes a dataset as its input and delivers subsets of that dataset through its output ports. The sampling type parameter decides how the examples are shuffled in the resulting partitions:

1. Linear sampling: divides the dataset into partitions without changing the order of the examples.
   Subsets with consecutive examples are created.
2. Shuffled sampling: builds random subsets of the dataset.
   Examples are chosen randomly for making subsets.
3. Stratified sampling: builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset.
   In the case of a binominal classification, stratified sampling builds random subsets so that each subset contains roughly the same proportions of the two values of the label.

We split the data into two groups: training data and testing data.
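The three sampling types can be sketched as plain functions that return a training part and a testing part. This is not the Split Data operator itself, only an illustration of the ideas; the function names, the `label_of` parameter, and the sample rows are assumptions.

```python
import random
from collections import defaultdict


def linear_split(rows, train_ratio=0.7):
    """Linear sampling: keep the original order and cut at the ratio."""
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


def shuffled_split(rows, train_ratio=0.7, seed=0):
    """Shuffled sampling: shuffle a copy, then cut at the ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, train_ratio)


def stratified_split(rows, label_of, train_ratio=0.7, seed=0):
    """Stratified sampling: shuffle and cut each class separately, so both
    parts keep roughly the class proportions of the whole dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical rows: (id, label) with 20% "yes" and 80% "no".
rows = [(i, "yes" if i < 20 else "no") for i in range(100)]
train, test = stratified_split(rows, label_of=lambda r: r[1], train_ratio=0.75)
print(len(train), len(test))
```

With linear sampling on data sorted by label, the testing part could contain only one class; stratified sampling avoids that.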


Cross-Validation Methods

• Cross-validation is used to avoid overlap among the testing data
• Cross-validation steps:
  1. Divide the data into k subsets of equal size
  2. Use each subset in turn as testing data and the rest as training data
• This method is also called k-fold cross-validation
• We often use stratified sampling before the cross-validation process, because it reduces the variance of the estimate
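The two cross-validation steps can be sketched as a generator of index lists: each fold serves once as testing data while the remaining folds form the training data. The function name is hypothetical, and the sketch assumes the rows were already shuffled or stratified beforehand.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Fold sizes differ by at most one when n_rows is not divisible by k.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


for train, test in k_fold_indices(10, 5):
    print("test:", test, "train:", train)
```

Because the folds do not overlap, every row is used for testing exactly once across the k experiments.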


10-Fold Cross-Validation

[Figure: ten experiments on the dataset; in each experiment the orange box marks the k-th subset used as testing data. Accuracies of the ten experiments: 91%, 90%, 91%, 93%, 94%, 93%, 90%, 91%, 93%, 93%. Average accuracy: 92%.]
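The average accuracy on this slide can be checked by taking the mean of the ten per-experiment accuracies shown in the figure. The values below are copied from the slide; their assignment to particular folds is not assumed.

```python
from statistics import mean

# Accuracies (in percent) of the ten experiments, as listed on the slide.
fold_accuracies = [91, 90, 91, 93, 94, 93, 90, 91, 93, 93]

average = mean(fold_accuracies)
print(f"{average:.1f}% (about {round(average)}%)")  # prints "91.9% (about 92%)"
```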


Case Study: NBA


Exercise:

1. Use one of the following tools: RapidMiner, R, Orange, Weka
2. Create a prediction model (predicting the electability of legislative candidates) from the training data in the election data set (datapemilukpu.xls) using the following algorithms:
   1. Decision Tree (C4.5)
   2. Naïve Bayes (NB)
   3. K-Nearest Neighbor (K-NN)
3. Evaluate the models (accuracy testing) using 10-fold cross-validation

            C4.5      NB        K-NN
Accuracy    92.45%    77.46%    88.72%
AUC         0.851     0.840     0.5
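For readers who want to see the mechanics of step 3 without one of the listed tools, the sketch below runs 10-fold cross-validation for a small K-NN classifier in plain Python. It does not read datapemilukpu.xls and will not reproduce the figures in the table; the two-cluster data set is synthetic, and all function names and parameter values are assumptions made for the illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train, point, k=3):
    """Predict the majority label among the k nearest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(rows, n_folds=10, k=3, seed=0):
    """Return the K-NN accuracy on each fold of a shuffled dataset."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    accuracies = []
    for fold in range(n_folds):
        # Rows whose position is congruent to `fold` form the testing data.
        test = rows[fold::n_folds]
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(hits / len(test))
    return accuracies


def make_toy_data(n_per_class=50, seed=1):
    """Two well-separated Gaussian clusters; purely synthetic."""
    rng = random.Random(seed)
    data = []
    for label, center in (("A", 0.0), ("B", 5.0)):
        for _ in range(n_per_class):
            features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((features, label))
    return data


scores = cross_validate(make_toy_data())
print("per-fold accuracy:", scores)
print("average accuracy:", sum(scores) / len(scores))
```

The same loop works for any classifier: only `knn_predict` would need to be swapped for a decision tree or Naïve Bayes model.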
