A SENTIMENT KNOWLEDGE DISCOVERY MODEL IN
TWITTER’S TV CONTENT USING STOCHASTIC
GRADIENT DESCENT ALGORITHM

LIRA RUHWINANINGSIH

GRADUATE SCHOOL
BOGOR AGRICULTURAL UNIVERSITY
BOGOR
2015

STATEMENT OF THESIS AND SOURCES OF
INFORMATION AND DEVOLUTION OF COPYRIGHT
I hereby declare that the thesis entitled "A Sentiment Knowledge Discovery Model in Twitter's TV Content using Stochastic Gradient Descent Algorithm" is my own work, prepared to the best of my knowledge under the supervision of Dr. Eng. Taufik Djatna, STP, M.Si and Dr. Ir. Agus Buono, M.Si, M.Kom. It has never previously been published at any university. All material incorporated from other published as well as unpublished papers is stated clearly in the text and in the references.
I hereby assign the copyright of my thesis to Bogor Agricultural University.


Bogor, October 2015

Lira Ruhwinaningsih
Student ID G651110634

SUMMARY
LIRA RUHWINANINGSIH. A Sentiment Knowledge Discovery Model in
Twitter’s TV Content using Stochastic Gradient Descent Algorithm. Supervised by
TAUFIK DJATNA and AGUS BUONO.
The explosive usage of social media makes it a rich source for data mining. Meanwhile, the increasing number and variety of TV shows and TV programs motivate people to write comments via social media. Social networks contain abundant information that is unstructured, heterogeneous, high-dimensional, and noisy. Abundant data can be a rich source of information, but it is difficult to analyze manually.
This research proposes an approach to preprocess unstructured, noisy, and highly diverse data, and to find patterns of information and knowledge of user activities on social media in the form of positive and negative sentiment in Twitter TV content.

Several methodologies and techniques were used for preprocessing: removing punctuation and symbols, removing numbers, replacing numbers with letters, translating Alay words, removing stop words, and stemming using the Porter algorithm. The methodology used to find patterns of information and knowledge from social media in this study is Stochastic Gradient Descent (SGD).
Text preprocessing produced more structured text, reduced noise, and reduced the diversity of the text; thus, preprocessing affected both accuracy and processing time. The experimental results showed that using SGD to discover positive and negative sentiment tends to be faster for large and streaming data. The maximum accuracy in finding sentiment patterns is 88%.
Keywords: Stochastic Gradient Descent, opinion mining, sentiment analysis, Porter stemming, stream data mining.

RINGKASAN
LIRA RUHWINANINGSIH. A Sentiment Knowledge Discovery Model in Twitter's TV Content using Stochastic Gradient Descent Algorithm. Supervised by TAUFIK DJATNA and AGUS BUONO.
The explosive use of social media can be a rich source for data mining. Meanwhile, television programs are growing in number and variety, motivating people to comment through social media accounts. Social media data contain abundant information but are unstructured, heterogeneous, high-dimensional, and noisy. Abundant data can be a rich source of information but are difficult to identify manually.
The contribution of this research is to perform preprocessing to handle unstructured, noisy, and highly diverse data, and to find patterns of information and knowledge of social media user activities in the form of positive and negative sentiment in Twitter TV content.
Several methodologies and techniques were used for preprocessing: removing punctuation and symbols, removing numbers, replacing numbers with letters, translating Alay words, removing stop words, and the Porter stemming algorithm. The methodology used to find sentiment patterns in this research is Stochastic Gradient Descent (SGD).
Preprocessing produced more structured text, reduced noise, and reduced the diversity of the text; thus, preprocessing affected classification accuracy and processing time. The experimental results show that using SGD to discover positive and negative sentiment tends to be faster for large or streaming data. The maximum accuracy in finding sentiment patterns is 88%.

Keywords: Stochastic Gradient Descent, opinion mining, sentiment analysis, Porter stemming, stream data mining.

© Copyright of This Thesis Belongs to Bogor Agricultural
University (IPB), 2015
All Rights Reserved
Citing part or all of this thesis without including or citing the source is prohibited. Quotation is permitted only for the purposes of education, research, scientific writing, report writing, criticism, or review of a problem, and such citations must not be detrimental to the interests of IPB.
Announcing or reproducing part or all of this publication in any form without the permission of IPB is prohibited.

A SENTIMENT KNOWLEDGE DISCOVERY MODEL IN
TWITTER’S TV CONTENT USING STOCHASTIC
GRADIENT DESCENT ALGORITHM

LIRA RUHWINANINGSIH

Thesis

as partial fulfillment of the requirements for the degree of
Master of Computer Science
in
the Department of Computer Science

GRADUATE SCHOOL
BOGOR AGRICULTURAL UNIVERSITY
BOGOR
2015

Non-committee examiner on thesis examination:
Dr. Imas Sukaesih Sitanggang, S.Si, M.Kom

PREFACE
I would like to thank Allah subhanahu wa ta'ala for all His blessings, which allowed this research to be completed successfully. The theme chosen for this research, conducted since July 2013, is sentiment analysis, with the title A Sentiment Knowledge Discovery Model in Twitter's TV Content Using Stochastic Gradient Descent.
I would like to express my sincere gratitude to Dr. Eng. Taufik Djatna, STP, M.Si as Chair of the Advisory Committee for his support and encouragement during my study at Bogor Agricultural University. I am very grateful to Dr. Ir. Agus Buono, M.Si, M.Kom as Member of the Advisory Committee for his advice and supervision during the thesis work. I am also very grateful to Dr. Imas Sukaesih Sitanggang, S.Si, M.Kom, who gave a lot of advice for this research.
I would like to say many thanks to my beloved family, Fery Dergantoro (husband), Euis Winarti (mother), I. Ruhayat (father), Lisda Ruhwinawati (sister), Linda Ruhwina Fauziah (sister), and Asriwati (mother-in-law), for their true and endless love and their never-failing patience and encouragement.
I would like to thank all lecturers and staff of the Computer Science Department and all my colleagues, especially my classmates in the Computer Science Department, namely Probo Kusumo, Zaenal Abidin, Wahyu, Novan, Agung, Ayu, and Sriyono. Last but not least, I would like to thank Rinto, Loren, Setiaji, Haryadi, Latief, and all my team (Tyo, Meli, Ika, Garbel, Reza, Danu, Ika) as my office colleagues for their understanding and support.
I hope this research is useful.

Bogor, September 2015
Lira Ruhwinaningsih

TABLE OF CONTENT

TABLE OF CONTENT  vi
LIST OF FIGURES  vi
1 INTRODUCTION  1
  Background  1
  Problem Statement  2
  Objectives  2
  Benefits  3
  Scope  3
2 LITERATURE REVIEW  4
  Social Media  4
  Data Mining  4
  Content Analysis  6
  TV Content  6
  Preprocessing  6
  Stochastic Gradient Descent (SGD)  8
3 METHODOLOGY  9
  Preprocessing  12
  Classification Model using SGD  19
  Model Evaluation  29
4 RESULT AND DISCUSSION  29
  Evaluation Model using Split Test  31
  Evaluation Model Using Cross Validation  35
5 CONCLUSION  36
REFERENCES  37

LIST OF TABLES
1 List of converting the numbers into letters  15
2 Testing with learning rate 0.01 and epoch 100  32
3 Correctly Classified Instances and Total Time Execution using Cross Validation  35

LIST OF FIGURES
1 Scope of research  3
2 Data mining as a step in the process of knowledge discovery  5
3 General Flow  10
4 Data Flow Diagram level 0  11
5 Data Flow Diagram level 1  11
6 Data Flow Diagram level 2  12
7 Preprocessing stages  12
8 Eliminate all punctuation and symbols algorithm  13
9 Algorithm of eliminate numbers in front of words  14
10 Replace number with letter  15
11 Algorithm for eliminate repeated letters  16
12 Algorithm for translate "Alay" words into normal words  17
13 Algorithm for eliminate stop words  18
14 Stemming Algorithm  19
15 Flow classification model  20
16 Stage to create a classifier and training data  21
17 Training flow referenced from Weka libraries  22
18 Flow update classifier part 1 referenced from Weka libraries  22
19 Flow update classifier part 2 referenced from Weka libraries  23
20 Flow tokenizer referenced from Weka libraries  24
21 Flow "dotprod" referenced from Weka libraries  27
22 Testing flow and predicted value referenced from Weka libraries  28
23 Estimating Accuracy with holdout method (Han et al. 2012)  29
24 Chart execution time on preprocessing variations  31
25 Correctly classified instances chart  32
26 Total time execution chart  33
27 Chart of ratio on changing learning rate  33
28 Chart of learning rate changes on execution time  34
29 Chart of epoch changes to correctly classified instances  34
30 Chart correctly classified instances using Cross Validation  35
31 Sentiment pattern  36

1 INTRODUCTION
Background
The variety of TV programs presented to audiences ranges from good-quality, touching content to content that annoys people or is monotonous. People may comment on Twitter TV content, that is, Twitter accounts used by TV program providers as a medium for comments on their TV shows, for example @kickandyshow and @matanajwa. These comments can be useful information for TV content providers and a rich source of information for data mining. Billions of pieces of information can be taken from social media web pages.
Text comments are good indicators of user satisfaction. Sentiment analysis algorithms offer an analysis of the users' preferences in which the comments may not be associated with an explicit rating. Thus, this analysis will also have an impact on the popularity of a given media show (Peleja et al. 2013). Automatic TV content analysis is very important for the efficient indexing and retrieval of the vast amount of TV content available, not only for end users, so that they can easily access content of interest, but also for content producers and broadcasters, in order to identify copyright violations and to filter or characterize undesirable content (Kompatsiaris et al. 2012). Information captured from social media produces valuable insights for organizations. By mining thousands of viewing choices and millions of patterns, advertisers and TV networks identify household characteristics, tastes, and desires to create and deliver custom targeted advertising (Spangler et al. 2003). TV content that attracts a large audience needs continuous improvement to avoid monotony and losing audiences. TV content providers need to make improvements in every episode so that the content is useful, of high quality, and gives positive value to society. There are a number of quality attributes from the TV viewer's point of view: relevance, coverage, diversity, understandability, novelty, and serendipity (Bambini et al. 2012).
Social media data are very diverse, noisy, and unstructured, so several preprocessing stages are needed before further data processing. Vastness, noise, distribution, and lack of structure are characteristics of social media data (Gundecha and Liu 2012). Preprocessing can improve the accuracy and efficiency of mining algorithms involving distance measurements (Han et al. 2012). The next process is to find patterns and valuable information in the preprocessed social media data. Social media data form a data stream, which requires methodologies and algorithms that can operate with limited resources, in terms of both time and memory, and that can handle data that change over time (Bifet and Frank 2010).
In this thesis we address the above issues, which represent its main contribution. Tweet data were taken using the Twitter Streaming API provided by Twitter. The data undergo preprocessing, which eliminates punctuation and symbols, eliminates numbers, replaces numbers with letters, translates Alay words, and eliminates stop words. The next step is word stemming using a Porter algorithm approach for the Indonesian language. Preprocessing is an important step before the classification process and is necessary for cleansing social media data, which are noisy and unstructured, so that the data are ready for the next step. Preprocessing was built for the Indonesian language, and the preprocessing modules were created by the author, except the stemming algorithm, which adopts the Porter algorithm for Indonesian. The classification algorithm used to find patterns and information in this study is Stochastic Gradient Descent (SGD). SGD is suitable for large and streaming data (Bifet and Frank 2010); stochastic gradient descent is a versatile technique that has proven invaluable as a learning algorithm for large datasets (Bottou 2013). In the research reported here, the SGD model yields short processing times when processing large amounts of data or data streams.
Research on knowledge discovery in Twitter streaming data was done by Bifet and Frank (2010). Tweet data were taken using the Twitter Streaming API. Several approaches were applied to classify the data, such as Multinomial Naive Bayes, Stochastic Gradient Descent, and Hoeffding Tree. The evaluation of changes over time and of accuracy on the data stream in that research used the Kappa statistic. The experimental results showed that the Hoeffding Tree method is poorly suited for data streams in terms of accuracy, kappa, and processing time, and that Naive Bayes performs better than Hoeffding Tree. Based on the tests carried out, SGD models were recommended for data streams, given an appropriate learning rate. The data used were not real data but research data from the Edinburgh corpus and the website twittersentiment.appspot.com, so the behavior of each method on real data was not known (Bifet and Frank 2010). Suggested extensions of that work are to expand the research using geographic scope and the number of followers or friends, and to use real data. Previous research on positive and negative sentiment was done by Putranti and Winarko (2014), namely Twitter sentiment analysis for the Indonesian language with Maximum Entropy and Support Vector Machines. Their classification implementation achieved 86.81% accuracy under 7-fold cross-validation with a sigmoid kernel.
Problem Statement
Here are the problems addressed as challenges in this research:
1. Social media data contain abundant information that is unstructured, heterogeneous, high-dimensional, and noisy.
2. Abundant data can be a rich source of information, but it is difficult to identify manually: how can patterns of information and knowledge of social media user activities be found in the form of positive and negative sentiment on Twitter TV content?
Objectives

There were two main objectives in this study: first, to perform preprocessing to address data that are unstructured, noisy, and heterogeneous or diverse; second, to find patterns of information and knowledge of social media user activities in the form of positive and negative sentiment on Twitter TV content.

Benefits

The benefits of this research are:
1. Preprocessing makes it easier to find patterns of positive and negative sentiment.
2. This research serves as a basis for further research to build a recommendation system and improvement scenarios for TV content providers and audiences.
Scope

The data source used in this study was Twitter TV content data for @kickandyshow, with Indonesian-language text. The overall scope of the research is illustrated in Figure 1. The data taken for this study were 760 tweets from the @kickandyshow account, collected from May 14, 2015 to May 17, 2015. After selection and preprocessing, 751 tweets were ready to be processed in the next step.

[Figure 1 Scope of research: flowchart of the crawler (crawl Twitter data using the Twitter Streaming API; store and spread to tables), target-class labeling, the preprocessing stages (remove punctuation & symbols, remove numbers at the beginning of words, replace numbers with letters, remove repeated letters, translate Alay words, remove stop words, stemming), and the classification model (Stochastic Gradient Descent, model, testing, measurement, prediction)]

Based on Figure 1, a crawler is a program that visits web sites and reads their pages and other information in order to create entries for a search engine index. The social media platform used in this research is Twitter. Twitter provides an Application Programming Interface (API) that developers can use to access the data. The Twitter Streaming API allows users to access subsets of public status descriptions in almost real time, including replies and mentions created by public accounts. The resulting data from the crawler are inserted into the database. In the target-class part, the tweets are labeled manually as positive or negative sentiment: positive sentiment is rated 1 and negative sentiment is rated 0. This labeling is used for the learning process in classification. Before the preprocessing phase, the data were taken from the database. The next step is preprocessing, which consists of removing punctuation and symbols, removing numbers at the beginning of words, replacing numbers with letters, removing repeated letters, translating "alay" words, removing stop words, and stemming. The next step is to find patterns of information and knowledge of social media user activities in the form of positive and negative sentiment on Twitter TV content using SGD. This algorithm generates a data mining model to predict the pattern of positive and negative sentiment. The resulting model is evaluated by measuring accuracy and processing time.

2 LITERATURE REVIEW
Social Media
Social media is defined as a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content (Kaplan and Haenlein 2010).
The following are characteristics of social media data (Gundecha and Liu 2012):
1. Vast - content generated by users is very large in volume.
2. Noisy - social media data can often be very noisy. Removing the noise from the data is essential before performing effective mining.
3. Distributed - social media data are distributed because there is no central authority that maintains data from all social media sites.
4. Unstructured - social media sites serve different purposes and meet different needs of users.
5. Dynamic - social media sites are dynamic and continuously evolving. For example, Facebook recently introduced many concepts including a user's timeline, the creation of in-groups for a user, and numerous user privacy policy changes.
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data (Han et al. 2012). The knowledge discovery
process is shown in Figure 2 as an iterative sequence of the following steps:
1. Data cleaning - to remove noise and inconsistent data

2. Data integration - where multiple data sources may be combined
3. Data selection - where data relevant to the analysis task are retrieved from the database
4. Data transformation - where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations
5. Data mining - an essential process where intelligent methods are applied to extract data patterns
6. Pattern evaluation - to identify the truly interesting patterns representing knowledge based on interestingness measures
7. Knowledge presentation - where visualization and knowledge representation techniques are used to present mined knowledge to users

[Figure 2 Data mining as a step in the process of knowledge discovery (Han et al. 2012)]
Content Analysis

Content analysis is used to study a broad range of ‘texts’ from transcripts of
interviews and discussions in clinical and social research to the narrative and form
of films, TV programs and the editorial and advertising content of newspapers and
magazines (Macnamara 2005). Berelson (1952) suggested five main purposes of
content analysis as follows:
1. To describe substance characteristics of message content;
2. To describe form characteristics of message content;
3. To make inferences to producers of content;
4. To make inferences to audiences of content;
5. To predict the effects of content on audiences.
TV Content
There are three important variables that affect the methodologies employed in the image analysis part, namely the content properties, the quality of the data, and the given computational and memory resources (Ekin 2012). Social media are definitely one of the easiest ways of linking TV and web. They are a place for audiences to influence and to create content for the actual TV broadcasts. Since they are free, they are also a very cost-effective place for advertisements. It can be argued that the potential of the non-chargeable ways of linking TV and web relies heavily on the features and possibilities of social media. To make the convergence of TV and web content entirely successful in both countries, the following are necessary (Bachmayer and Tuomi 2012):
1. Connect TV sets to the web by default.
2. Connect TV content to web content by defining hook points on both parts.
3. Provide adequate interfaces and input devices.
4. Provide automatism - analyzing TV content, analyzing web activity, updating information, connecting to TV content.
5. Provide nonlinear television content to invite the viewer to web activity.

The content structure describes the geometric design or the media narrative space
of a story. A narrative is defined as a series of events that are linked in various ways,
including by cause and effect, time, and place. Something that happens in the first
event causes an action in the second event, and so on (Samsel and Wimberley 1998).
Smart TV is expected to combine TV and Internet, and provide more applications,
contents, and multimedia, as well as evolve into a device that serves as a digital hub
for households in the future. Smart TV seems to be changing current TV watching
patterns, and bringing about the promotion of active and intelligent broadcasting
and communication convergence services (Kim 2011).
Preprocessing
There are some major steps involved in data preprocessing, namely data cleaning, data integration, data reduction, and data transformation (Han et al. 2012).

Data Cleaning
Data cleansing is the process of detecting and correcting or removing corrupt or inaccurate records from a record set, table, or database. Data cleansing routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Missing values can be addressed in a few ways (Han et al. 2012), namely:
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use a measure of central tendency for the attribute
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value

Noise is a random error or variance in a measured variable. The following are data smoothing techniques (Han et al. 2012):
1. Binning - binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values around it.
2. Smoothing by bin medians - each bin value is replaced by the bin median.
3. Regression - a technique that conforms data values to a function.
4. Outlier analysis - values that fall outside of the set of clusters may be considered outliers.
Data Integration
Data integration is the combination of technical and business processes used
to combine data from disparate sources into meaningful and valuable information.
Data Integration can help reduce and avoid redundancies and inconsistencies in the
resulting data set. This can help improve the accuracy and speed of the subsequent
data mining process.
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation
of the data set that is much smaller in volume, yet closely maintains the integrity of
the original data. That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results. Data reduction
strategies include dimensionality reduction, numerosity reduction, and data
compression (Han et al. 2012).
Data Transformation
A data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system. Data transformation helps the mining process become more efficient and makes the discovered patterns easier to understand.
In addition to the above techniques there is also the feature selection technique. Feature selection is one of the most important and frequently used techniques in data mining preprocessing. This technique reduces the number of features involved in determining a target class value, removing irrelevant and redundant features and data that tend to mislead the target class, which has an immediate effect on applications (Djatna and Morimoto 2008). Feature selection can reveal hidden important pairwise attributes that are abandoned by conventional algorithms because of their nonlinear correlation properties. These results hold for both categorical and continuous problems (Djatna and Morimoto 2008).

Stochastic Gradient Descent (SGD)
SGD is the methodology used in this research to find patterns of information in the form of positive and negative sentiment in Twitter TV content. Stochastic Gradient Descent (SGD) is an algorithm used for large-scale learning problems, and it has excellent performance in solving them. Let us first consider a simple supervised learning setup. Each example z is a pair (x, y) composed of an arbitrary input x and a scalar output y. We consider a loss function ℓ(ŷ, y) that measures the cost of predicting ŷ when the actual answer is y, and we choose a family F of functions f_w(x) parametrized by a weight vector w. We seek the function f ∈ F that minimizes the loss Q(z, w) = ℓ(f_w(x), y) averaged on the examples. Although we would like to average over the unknown distribution dP(z) that embodies the Laws of Nature, we must often settle for computing the average on a sample (Bottou 2013):
$$E(f) = \int \ell(f(x), y)\, dP(z), \qquad E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) \qquad (1)$$

The empirical risk $E_n(f)$ measures the training set performance. The expected risk $E(f)$ measures the generalization performance, that is, the expected performance on future examples. The stochastic gradient descent (SGD) algorithm is a drastic simplification. Instead of computing the gradient of $E_n(f_w)$ exactly, each iteration estimates this gradient on the basis of a single randomly picked example $z_t$ (Bottou 2013):
$$w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t) \qquad (2)$$

The stochastic process $\{w_t;\ t = 1, \ldots\}$ depends on the examples randomly picked at each iteration. SGD is efficient in performing classification even if it is based on a non-differentiable loss function $F(\theta)$ (Bottou 2013):
$$F(\theta) = \frac{\lambda}{2} \lVert w \rVert^2 + \sum_{i} \left[ 1 - y_i (w \cdot x_i + b) \right]_{+} \qquad (3)$$

where w is the weight vector of the tweet data, b is the bias, λ is a regularization parameter, and the class labels y are assumed to be {+1, -1}. Explicit specification of the learning rate is important because it relates to change over time in the data stream.
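To make equations (2) and (3) concrete, the following is a minimal Java sketch of the per-example SGD update for the regularized hinge loss. It is not the thesis implementation (which uses the Weka library, described in the Methodology chapter); the sparse bag-of-words representation and all class and method names here are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

/** Per-example SGD update for the regularized hinge loss of equation (3):
 *  F(theta) = (lambda/2)||w||^2 + sum_i [1 - y_i (w.x_i + b)]_+ .
 *  A tweet is represented as a sparse bag of words: token -> count. */
public class HingeSgd {
    private final Map<String, Double> w = new HashMap<>(); // per-word weights
    private double b = 0.0;       // bias
    private final double gamma;   // learning rate
    private final double lambda;  // regularization parameter

    public HingeSgd(double gamma, double lambda) {
        this.gamma = gamma;
        this.lambda = lambda;
    }

    private double dotProd(Map<String, Double> x) {
        double dot = b;
        for (Map.Entry<String, Double> e : x.entrySet())
            dot += w.getOrDefault(e.getKey(), 0.0) * e.getValue();
        return dot;
    }

    /** One stochastic step of equation (2); y must be +1 or -1. */
    public void update(Map<String, Double> x, int y) {
        double margin = y * dotProd(x);
        // L2 shrinkage coming from the (lambda/2)||w||^2 term
        for (Map.Entry<String, Double> e : w.entrySet())
            e.setValue(e.getValue() * (1.0 - gamma * lambda));
        // Hinge subgradient: move only when the example violates the margin
        if (margin < 1.0) {
            for (Map.Entry<String, Double> e : x.entrySet())
                w.merge(e.getKey(), gamma * y * e.getValue(), Double::sum);
            b += gamma * y;
        }
    }

    /** Predicted class label: +1 (positive) or -1 (negative). */
    public int predict(Map<String, Double> x) {
        return dotProd(x) >= 0.0 ? 1 : -1;
    }
}

Because each update touches only one example, this style of training matches the streaming setting: tweets can be consumed one by one without the whole dataset in memory.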
Bifet and Frank (2010) investigated Twitter data streams using several methods: Multinomial Naive Bayes, Stochastic Gradient Descent, and Hoeffding Tree. Based on the tests done with the three methods, the SGD model with an appropriate learning rate is recommended to process the data stream. The formulation of the classifier update in SGD ($F_{new}$) is (Bifet and Frank 2010):


$$F_{new} = \sum_{i=1}^{n} \left[ \sum_{j=1}^{m} w_j \cdot x_{ij} - \gamma \lambda w_j + \gamma\, y_i\, \lvert m_j + b \rvert \right] \qquad (4)$$

where:
n = number of tweet data rows
i = iteration index
w = weight
γ = learning rate
λ = lambda (regularization parameter)
y = sentiment status (yes or no)
m = accumulated weight (per word)
b = bias

3 METHODOLOGY
The case study in this research was to determine positive or negative sentiment based on tweets about TV content. The tweet data were taken using the Twitter Streaming API, collected continuously, and stored in a table in real time. In the database, the collected tweets were parsed, and the results were distributed to several tables in preparation for the next process.
Software and tools used in this study were:
1. PHP to build the crawler
2. MySQL to store the results of the crawler
3. Visual Studio 2012 for preprocessing development
4. SQL Server to store preprocessing data
5. Java to implement SGD with the Weka library
6. NetBeans as the Java development tool
7. Weka for visualizing the pattern of sentiment
8. Excel for visualizing the accuracy and processing time
This application environment is intended for future researchers who will explore sentiment analysis on data streams for the case of TV content. The long-term goal is a system that can provide assessments and recommendations for TV audiences and improvement scenarios for TV content providers.
In general, the application flow is illustrated in Figure 3. First, people watch TV, and then they can comment on the content they have watched. A crawler application was made to collect the users' comments from Twitter. The resulting data were stored in the crawler's MySQL database. The data then went through the preprocessing step performed by the preprocessing application. The results of this application were ARFF files, ready to be processed in the next step: the Attribute-Relation File Format (ARFF) file is the input to the classifier application. The classifier application is an application to find patterns and information about positive and negative sentiment.

[Figure 3 General Flow: watching TV → comments on Twitter → crawler application → MySQL database → tweets data → preprocess application → ARFF file → classifier application]

Here are the modules that were made in this study:
1. The crawler module used an existing application from 140Dev (http://140dev.com/). This module was made using PHP scripts.
2. The preprocessing modules consist of several sub-modules, namely: eliminate all punctuation and symbols; eliminate numbers in front of words; replace numbers with letters; eliminate repeated letters; translate "Alay" words into normal words; eliminate stop words; and perform word stemming. All preprocessing modules and algorithms were made by the author, except the Porter algorithm for the stemming step. The modules were made using Visual Studio .NET 2012 with the C# programming language.
3. The classifier model module uses the Java programming language. It was developed from the Weka library (http://www.cs.waikato.ac.nz/ml/weka/contributors.html).
4. The module for evaluation of preprocessing uses C# in Visual Studio .NET 2012. This module was created by the author.
5. The module for evaluation of the model uses the Java programming language. This module was developed from the Weka library (http://www.cs.waikato.ac.nz/ml/weka/contributors.html).
System Design

Here are the data flow diagrams of the sentiment analysis system. This study involves only two entities, illustrated in Figure 4: the user, who gives comments on Twitter, and the researcher, an entity that consumes the information and sentiment patterns for the development of further research. The system contains several processes, illustrated in Figure 5, namely:
1. Crawl process - to get tweet data and store the data in the tweets table
2. Labeling - the data are labeled manually for the training process and stored in the tweetsentimen table
3. Preprocessing - at this stage there are several sub-preprocessing steps, as shown in Figure 6. The process details and the algorithms are described in the next section
4. Update classifier - in this process the SGD model is applied. At this stage there are several sub-processes (Figure 7). The process details and the algorithm are described in the next section
5. Testing - once training is done, tests are conducted on the SGD model that has been formed
6. Evaluation - this process measures the ability of the SGD model, in terms of both execution time and accuracy
The database design is shown in Appendix 1. The user interfaces of the crawler, preprocessing, and classifier are shown in Appendices 10, 11, and 12, respectively.
[Figure 4 Data Flow Diagram level 0: the user sends comments to the Sentiment Analysis System (process 1.0); the researcher supplies parameters and receives information and sentiment patterns]
[Figure 5 Data Flow Diagram level 1: 1.1 crawl process (tweets table), 1.2 label data (class target 1/0, TweetSentiment table), 1.3 preprocessing (ARFF file), 1.4 update classifier (weights, model), 1.5 testing, and evaluation, which delivers information, sentiment patterns, and accuracy to the researcher]

[Figure 6 Data Flow Diagram level 2: preprocessing sub-processes 1.3.1 eliminate punctuation & symbols, 1.3.2 eliminate numbers in front of words, 1.3.3 replace numbers with letters, 1.3.4 eliminate repeated letters, 1.3.5 translate Alay words, 1.3.6 eliminate stop words, 1.3.7 stemming, and 1.3.8 create ARFF file, together with their dictionary tables (NumberConversion, AlayDictionary, StopWord, PronounRootWord, PronounRootWordPrefix, RootWordParticle, RootWordParticlePrefix, RootWordPrefix1, RootWordPrefix1Suffix, RootWordPrefix2, RootWordPrefix2Suffix, RootWordSuffix)]
Preprocessing
To answer the problem of unstructured, diverse, and noisy data, several preprocessing methods and techniques were carried out. The following are the techniques and methodologies used for preprocessing in this research. In general, the preprocessing can be described as in Figure 7:
[Figure 7 Preprocessing stages: remove punctuation & symbols → remove numbers at the beginning of words → replace numbers with letters → remove repeated letters → translate Alay words → remove stop words → stemming]

Eliminate All Punctuation and Symbols
This information generally does not add to the understanding of the text and makes it harder to parse the words in some comments on Twitter. It includes single and double quotation marks, parentheses, punctuation, and other symbols such as dollar signs and stars. For example, "Haruskah kita membayar kembali negeri ini !!, tp selama kita terus bertanya apa negara ini dapat memberikan." Christine Hakim #MN becomes, after removing punctuation and symbols, "Haruskah kita membayar kembali negeri ini, tp selama kita terus bertanya apa negara ini dapat memberikan." Christine Hakim #MN. The algorithm is shown in Figure 8.
[Figure 8 Eliminate all punctuation and symbols algorithm: for each input word, if the regex "(\W|[_])+" finds a symbol, replace it with an empty string, then output the word]
To eliminate the symbols from each word, the replace function replaces each symbol with an empty character using the regular expression engine of the .NET library. Regular expressions were designed to represent regular languages with a mathematical tool, a tool built from a set of primitives and operations. This representation involves a combination of strings of symbols from some alphabet Σ, parentheses, and the operators + and * (Xavier 2005). The formulation is Regex.Replace(string input, string pattern, MatchEvaluator evaluator), and the regex pattern is "(\W|[_])+". The characters "\W" mean "not word": they match any character that is not a word character (alphanumeric and underscore). The character "|" means alternation and acts like a boolean OR, matching the expression before or after the "|". The characters "[_]" are a character set, matching a character in the set. The meaning of each character is taken from http://www.regexr.com/, created by Grant Skinner & The Gskinner Team in 2008. The input is the string to search for a match, the pattern is the combination of regex characters for capturing symbols, and every substring that matches the regular expression is replaced.
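For illustration, here is a minimal sketch of the same replacement. The thesis implements it in C# with .NET's Regex.Replace; this sketch uses Java (the language of the thesis classifier module), and the class and method names are assumptions.

import java.util.regex.Pattern;

public class RemoveSymbols {
    // The same pattern as in Figure 8: any run of non-word characters or underscores.
    private static final Pattern SYMBOLS = Pattern.compile("(\\W|[_])+");

    // Replace every matched symbol run with an empty string.
    public static String clean(String word) {
        return SYMBOLS.matcher(word).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("ini!!,"));   // prints: ini
    }
}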
Eliminate Numbers in front of Words
For example, "2hari" becomes "hari". Just like symbol removal, number removal also uses a match function with a regular expression from the .NET library. The formulation is Regex.Match(string input, string pattern), which searches the specified input string for the first occurrence of the specified regular expression. The regex pattern is "^(\d{1,})([\w\W]+)$". The character "^" matches the beginning of the string. The characters "(...)" group multiple tokens together and create a capture group for extracting a substring or using a back reference. The characters "\d" match any digit character (0-9). The characters "{1,}" are a quantifier matching one or more of the preceding token. The characters "[...]" are a character set, matching a character in the set. The characters "\w" match any word character (alphanumeric and underscore), and "\W" match any character that is not a word character. The character "+" matches one or more of the preceding token, and "$" matches the end of the string. In the formulation, the string input is the string to search for a match, and the string pattern is the combination of regex characters that checks whether there are numbers at the beginning of the word. If there are numeric characters in front of the word, they are eliminated, and the output is the string after the index of the numeric characters. The algorithm is shown in Figure 9, and a sketch follows the figure:
[Figure 9 Algorithm of eliminate numbers in front of words: empty, single-character, and all-numeric inputs (regex "^[0-9]+$") are returned unchanged; otherwise the input is matched against "^(\d{1,})([\w\W]+)$" and, on a match, the substring after the leading digits is returned]
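The following is a minimal Java sketch of this step under the same rules as Figure 9 (the thesis implementation is in C#; the class and method names are assumptions):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RemoveLeadingNumber {
    private static final Pattern ONLY_DIGITS = Pattern.compile("^[0-9]+$");
    private static final Pattern LEADING = Pattern.compile("^(\\d{1,})([\\w\\W]+)$");

    // "2hari" -> "hari"; empty, single-character, and all-numeric
    // inputs are returned unchanged, as in Figure 9.
    public static String strip(String word) {
        if (word == null || word.isEmpty() || word.length() == 1) return word;
        if (ONLY_DIGITS.matcher(word).matches()) return word;
        Matcher m = LEADING.matcher(word);
        return m.matches() ? m.group(2) : word;
    }
}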
Replace Number with Letter
Here is the algorithm, illustrated in Figure 10 (a sketch follows the regex explanation after Table 1):
1. Get the list of conversion data from the database, based on Table 1.
2. Insert the list of conversion data into a hash variable.
3. Check for a numeric character in the middle of the input string and check that the string does not consist of numeric characters only, using regex; if true, then:
3.1. Check whether the string contains "00"; if true, replace it with "u". Otherwise, if it contains a string that matches data in Table 1, replace it with the conversion data from Table 1.
4. If point 3 is false, return the original input string.
[Figure 10 Replace number with letter: query the conversion data from the database table; if the input has a numeric character in the middle of the word and is not numeric-only, replace each number with its conversion value based on the table]
Table 1 List of converting the numbers into letters (Naradhipa and Purwarianti 2012)

Number   Conversion
0        o
00       u
1        i
2        copy the character before "2"
3        e
4        a
5        s
6        g
7        t
8        b
9        g

The regex patterns for point 3 are "(?:([0])\1{0,1})|([0-9])" and "^[0-9]+$". The character "^" matches the beginning of the string. The characters "[0-9]" match a character in the range "0" to "9". The character "+" matches one or more of the preceding token, and "$" matches the end of the string.
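A minimal Java sketch of this conversion follows. The thesis loads the conversion list of Table 1 from a database table; here it is hard-coded purely for illustration, and the special rule for "2" (copy the preceding character) is applied against the output built so far:

public class NumberToLetter {
    // Convert digits embedded in a word using Table 1; all-numeric
    // tokens and words without digits are returned unchanged.
    public static String convert(String word) {
        if (word.matches("^[0-9]+$") || !word.matches(".*[0-9].*")) return word;
        String s = word.replace("00", "u");  // "00" -> "u" first (Table 1)
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '0': out.append('o'); break;
                case '1': out.append('i'); break;
                case '3': out.append('e'); break;
                case '4': out.append('a'); break;
                case '5': out.append('s'); break;
                case '6': out.append('g'); break;
                case '7': out.append('t'); break;
                case '8': out.append('b'); break;
                case '9': out.append('g'); break;
                case '2': // copy the character before "2"
                    if (out.length() > 0) out.append(out.charAt(out.length() - 1));
                    break;
                default:  out.append(c);
            }
        }
        return out.toString();  // e.g. "eM4Nk" -> "eMaNk"
    }
}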
Eliminate Repeated Letters
Here is the algorithm, illustrated in Figure 11:
[Figure 11 Algorithm for eliminate repeated letters: numeric-only inputs (regex "^[0-9]+$") are returned unchanged; otherwise any run of consecutive repeated characters matching "([a-zA-Z0-9])\1{1,}" is replaced by its first character]
Examples: "Pagiiiiiiiiiii" becomes "pagi". In Figure 11, check if it has
repeated letter then take the first character only. The combination of regex code for
checking that only numeric character is "^[0-9]+$". And the combination of regex
code for checking repeated letter is "([a-zA-Z0-9])\1{1,}". Characters "a-z" means
matches a character in the range "a" to "z". Characters "A-Z" means matches a
character in the range "A" to "Z". Characters "0-9" means matches a character in
the range "0" to "9". Characters "\1" means matches the results of capture group 1
(refers to "([a-zA-Z0-9])"). Characters "{1,}" means match one or more of the
preceding token.
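A minimal Java sketch of this step, using the same two patterns (the thesis version is in C#; the names below are assumptions):

public class RemoveRepeatedLetters {
    // "Pagiiiiiiiiiii" -> "Pagi": collapse each run of a repeated
    // alphanumeric character to its first occurrence; numeric-only
    // tokens are skipped, as in Figure 11.
    public static String collapse(String word) {
        if (word.matches("^[0-9]+$")) return word;
        return word.replaceAll("([a-zA-Z0-9])\\1{1,}", "$1");
    }
}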
Translating the "Alay" Words into Normal Words
An "Alay" word is an excessive or strange word, often written with numbers mixed in, or with uppercase and lowercase letters used unreasonably (Meyke 2013). These words are stored in the database as an Alay data dictionary. The program checks in the database whether a word is an Alay word, and then translates the Alay word into a normal word. For example, "eM4Nk 4LaY GhHeToO" becomes "emang alay gitu". The following is the algorithm for translating an "Alay" word into a normal word, illustrated in Figure 12 (a sketch follows the figure):
1. Check the dictionary hash table; if the word does not exist in the data dictionary, then:
1.1. Query the database based on the Alay word input.
1.2. If the query returns a result, add the data to the dictionary (hash table).
1.3. The query result is returned as the translated word.
2. If the word exists in the data dictionary, return the translation from the dictionary hash table.
The list of Alay words was taken from various sources on the Internet:
1. https://radiksacepek28.wordpress.com/kamus-lengkap/kamus-besarbahasa-alay-kbba/
2. http://www.arenasahabat.com/2013/05/kamus-kata-gaul-dan-alay-terbaruistilah.html
3. http://konkzmedia.blogspot.co.id/2012/11/kamus-bahasa-alay-terbaru.html

[Figure 12 Algorithm for translate "Alay" words into normal words: empty inputs are returned unchanged; words already in the dictionary hash table are translated from the cache; otherwise the Alay word table is queried, the result (if any) is added to the dictionary, and the translated word is returned]
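A minimal Java sketch of the cached dictionary lookup in Figure 12 follows. The table and column names (AlayDictionary, alay_word, normal_word) are assumptions for illustration, since the thesis stores the dictionary in its own database schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class AlayTranslator {
    // In-memory cache backed by the database, so each distinct word
    // hits the database at most once, as in Figure 12.
    private final Map<String, String> cache = new HashMap<>();
    private final Connection conn;

    public AlayTranslator(Connection conn) { this.conn = conn; }

    public String translate(String word) throws SQLException {
        if (word == null || word.isEmpty()) return word;
        String cached = cache.get(word);
        if (cached != null) return cached;
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT normal_word FROM AlayDictionary WHERE alay_word = ?")) {
            ps.setString(1, word);
            try (ResultSet rs = ps.executeQuery()) {
                // Words not in the dictionary are returned unchanged.
                String normal = rs.next() ? rs.getString(1) : word;
                cache.put(word, normal);
                return normal;
            }
        }
    }
}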
Eliminating the Stop Word
A stop word is a word that has no meaning and does not contribute during data processing. Stop words are also stored in the database, as a stop word dictionary, just like the Alay data. If a word is listed in the stop word table, the word is removed. For example, "Jika dia pergi" becomes "pergi". The following is the algorithm for eliminating stop words, illustrated in Figure 13 (a sketch follows the figure):
1. Check the dictionary hash table; if the word does not exist in the stop word data dictionary, then:
1.1. Query the database based on the stop word input.
1.2. If the query returns a result, add the data to the stop word dictionary (hash table).
1.3. The query result is returned as output.
2. If the word exists in the data dictionary, return the result from the dictionary hash table.
[Figure 13 Algorithm for eliminate stop words: empty inputs are returned unchanged; words already in the dictionary hash table are resolved from the cache; otherwise the StopWord table is queried, the result (if any) is added to the dictionary, and matching words are removed]
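Since the database lookup mirrors the Alay translation above, the following Java sketch shows only the filtering side; the stop word list is hard-coded here for illustration, whereas the thesis loads the full list from the StopWord table:

import java.util.Set;

public class StopWordFilter {
    // A few Indonesian stop words for illustration only; the thesis
    // reads the complete list from the StopWord database table.
    private static final Set<String> STOP_WORDS = Set.of("jika", "dia", "yang", "dan", "di");

    // "Jika dia pergi" -> "pergi"
    public static String removeStopWords(String sentence) {
        StringBuilder out = new StringBuilder();
        for (String w : sentence.split("\\s+"))
            if (!STOP_WORDS.contains(w.toLowerCase()))
                out.append(w).append(' ');
        return out.toString().trim();
    }
}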
Perform the Word Stemming
Stemming is the process of transforming the words contained in a document into base words (root words) using certain rules. The algorithm used for stemming is the Porter algorithm for the Indonesian language (Agusta 2009), illustrated in Figure 14 (a sketch follows the figure). The Porter algorithm is as follows:
Step 1: Remove the particles.
Step 2: Remove possessive pronouns.
Step 3: Delete the first prefix. If there is a second prefix, go to step 4.1; if there is a suffix, go to step 4.2.
Step 4:
Step 4.1: Remove the second prefix, then proceed to step 5.1.
Step 4.2: Remove the suffix; if none is found, the word is assumed to be the root word. If one is found, proceed to step 5.2.
Step 5:
Step 5.1: Remove the suffix; the final word is assumed to be the root word.
Step 5.2: Delete the second prefix; the final word is assumed to be the root word.

[Figure 14 Stemming Algorithm: the input word is checked successively against the particle, possessive pronoun, and prefix tables in the database; matching affixes are removed until the root word is output]
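The following is a highly simplified Java sketch of the step ordering above. The affix lists are abbreviated illustrations of the Porter rules for Indonesian (the thesis reads them from the rule tables shown in Figure 6, such as RootWordParticle and RootWordPrefix1); a real implementation also handles recoding rules and dictionary checks, which are omitted here.

public class PorterIndonesianStemmer {
    // Abbreviated affix lists for illustration; not the full rule set.
    private static final String[] PARTICLES     = {"lah", "kah", "pun"};
    private static final String[] POSSESSIVES   = {"ku", "mu", "nya"};
    private static final String[] FIRST_PREFIX  = {"meng", "meny", "mem", "men", "me",
                                                   "peng", "peny", "pem", "pen", "di", "ter", "ke"};
    private static final String[] SECOND_PREFIX = {"ber", "per", "be", "pe"};
    private static final String[] SUFFIXES      = {"kan", "an", "i"};

    public static String stem(String word) {
        word = removeEnd(word, PARTICLES);                  // step 1
        word = removeEnd(word, POSSESSIVES);                // step 2
        String stripped = removeStart(word, FIRST_PREFIX);  // step 3
        if (!stripped.equals(word)) {
            // steps 4.1 / 5.1: second prefix, then suffix
            return removeEnd(removeStart(stripped, SECOND_PREFIX), SUFFIXES);
        }
        // steps 4.2 / 5.2: suffix, then second prefix
        return removeStart(removeEnd(word, SUFFIXES), SECOND_PREFIX);
    }

    private static String removeEnd(String w, String[] endings) {
        for (String e : endings)
            if (w.length() > e.length() + 2 && w.endsWith(e))
                return w.substring(0, w.length() - e.length());
        return w;
    }

    private static String removeStart(String w, String[] prefixes) {
        for (String p : prefixes)
            if (w.length() > p.length() + 2 && w.startsWith(p))
                return w.substring(p.length());
        return w;
    }
}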
Classification Model using SGD
SGD is used in this study because the data being processed are a data stream that is continuously updated. Stream data flow continuously, are very large, and arrive in real time. SGD has the ability to use only one training sample from the training set to update a parameter in a particular iteration. In stream data processing, the data do not wait to pile up; processing is done continuously, in real time, as long as the data are being streamed. In most machine learning, the objective function is the cumulative sum of the error over the training examples, but the training set may be very large, and hence computing the actual gradient would be computationally expensive. The SGD method computes an estimate or approximation of the gradient and moves only along this approximation. It is called stochastic because the approximate direction computed at every step can be thought of as a random variable of a stochastic process. There are two main parts in machine learning, namely training and testing. The data model generated from the training process is used for the prediction process in sentiment discovery. This answers the second problem in this research. Figure 15 shows a general overview of the training and testing process.
First of all, the tweet data that have been labeled positive (yes) and negative (no) are stored in an Attribute-Relation File Format (ARFF) file. The ARFF file is used because the SGD library comes from the Weka application.
[Figure 15 Flow classification model: training (labeled raw data after preprocessing → create ARFF file → ARFF for learning → parameter setting → tokenizer → update classifier using SGD, with a data dictionary, producing the data model) and testing (ARFF for testing → prediction → data prediction)]
An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files have two distinct sections: the header information, followed by the data information. The header of the ARFF file contains the name of the relation and a list of the attributes (the columns in the data). An example header for the dataset looks like this:
@relation 'MyArffTVContent'
@attribute text string
@attribute class {no,yes}

The Data of the ARFF file looks like the following:
@data
'Semalam nangis KickAndyShow ',no
'kickandyshow eyuyant kartikarahmaeu lisaongt stargumilar lamat yg tang ',no
'KickAndyShow kau jodohkukenapakarna jodoh ngatur jodoh aturJombloMahGitu',yes
'KickAndyShow om saya suliat regist log',no
'KCA VoteJKTabID MetroTV Pilih Andy F Noya dr KickAndyShow sbg Presenter Talkshow Berita Informasi favorit htptcowzxkdavSTK',yes

The ARFF data are divided into two parts, one for training and the other for testing. The process flow assumes that the evaluation model is based on the percentages of training and testing data; the percentage values are determined by user input. There are several parameters in the training process, for example the learning rate, the number of epochs, and lambda. The next process is the tokenizer: in lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
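As a minimal Java sketch of this pipeline with the Weka library (assuming Weka 3.7 or later; the file name, split ratio, and parameter values are illustrative, not the thesis configuration):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SGD;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SentimentClassifier {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file produced by preprocessing; the class
        // attribute {no, yes} is the last one.
        Instances data = new DataSource("MyArffTVContent.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Tokenize the string attribute into a bag of words, then
        // train Weka's SGD classifier on the result.
        FilteredClassifier model = new FilteredClassifier();
        model.setFilter(new StringToWordVector());
        SGD sgd = new SGD();
        sgd.setOptions(weka.core.Utils.splitOptions("-L 0.01 -E 100")); // learning rate, epochs
        model.setClassifier(sgd);

        // Simple split evaluation: e.g. 66% training, 34% testing.
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        model.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println(eval.pctCorrect() + "% correctly classified instances");
    }
}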

Figure 16 shows that the contents of the ARFF file are called data instances. The evaluation model is selected at the beginning, either "Simple Split" or "K-Fold Cross-Validation". Simple Split divides the data instances into two parts, namely training and testing data. Otherwise, if K-Fold Cross-Validation is selected, the data instances are divided into k folds. The evaluation models are described in more detail in the evaluation model section below. Whichever evaluation model is used, both contain training and testing steps. The training and testing process steps shown in the flow chart above have sub-processes containing several methods, which are described in more detail using illustrations.