
Proceedings of CITEE, August 4, 2009


N-Gram Based Web Categorization on Indonesian
Language
Herman Budianto
Computer Science Dept., Sekolah Tinggi Teknik Surabaya
Surabaya 60284, Indonesia
herman@stts.edu

Gunawan
Electrical Engineering Dept., Faculty of Industrial Technology, Institut Teknologi Sepuluh Nopember
Surabaya 60111, Indonesia
admin@hansmichael.com

Tri Kurniawan Wijaya, Eva Paulina Tjendra
Computer Science Dept., Sekolah Tinggi Teknik Surabaya
Surabaya 60284, Indonesia
tritritri@stts.edu, moshi2_eva@yahoo.com


Abstract- Web categorization or classification can be defined as the task of assigning one or more uncategorized web documents to categories based on their content. N-Gram is one of the methods for text categorization; an N-Gram is an n-character slice of a longer string. This research develops text categorization based on N-Grams. The system developed is divided into three parts: acquisition and preprocessing, building the library, and categorization. Acquisition and preprocessing gather and prepare the web documents (HTML) used as training and testing data; from a web mining perspective, crawling is used to collect the web documents. Building the library produces a list of the N-Gram characteristics of a category. The main function of the system is to assign an unlabeled web document to the library with the highest similarity rank. The similarity rank is calculated by matching the N-Grams of the test document against the N-Grams in each library; the library with the smallest calculated distance is considered the one whose characteristics are most similar to the test document. Hundreds of sports news web documents in the Indonesian language from several dotcom web sites are used as the case study.
Keywords- text categorization, N-Gram, library, profile, crawler

I. INTRODUCTION

Information, which is very valuable and generally accessible through the Internet, has grown very rapidly in recent years. This large amount of information makes it hard to find what is actually useful. Databases and catalogs of information, divided into several categories, therefore help direct Internet users to the information they want. Because most of this information is text, the process of text categorization has emerged, and it is expected to help facilitate information searching.

Text categorization is the selection of a category in a catalog or database that has the same characteristics as the selected text. Building such a catalog by hand, however, takes a long time, because each text (or each part of a text) must be read in order to select the right category. Research on automatic text categorization therefore emerged. Text categorization generally tries to categorize documents based on two characteristics: text language and text topic.
The process of text categorization starts from the training data, a collection of files whose categories have already been set, so that these files can be used as comparison material (the learning process) against the characteristics of a new file that is to be categorized. From the training data, the characteristics of each category are formed. The characteristics of the new file to be classified are then formed, matched against the characteristics of the training data, and the category of the file is concluded.
The objective of this research is to extract information from unstructured web documents and then to take advantage of the information extracted from a collection of previously identified web documents to build a file that describes the characteristics of a category using the N-Gram method.
The system is divided into three main sections (Figure 1), namely acquisition and preprocessing of the text, establishment of the library, and categorization of web documents into previously defined categories. The case study is sports news in the Indonesian language; approximately 100 documents per sub-category are used as training data.


II. TEXT CATEGORIZATION

A. Introduction to Text Categorization
Text categorization or classification can be defined as the job of classifying one or more documents that have not yet been categorized, based on the contents of the documents. Many knowledge-based systems have been implemented for text categorization; more recently, statistical pattern recognition and neural networks have been used to build text classifiers.
Classification is a term with a double meaning in the field of information retrieval, but it is almost always associated with a process for grouping data. Text classification is therefore an appropriate term to characterize several parts of the field of information retrieval that are usually considered different but are all processes for grouping textual data.
If a set of documents is fed into a text classifier system, the result is that each document has been matched with a certain category that describes its contents. The meaning of a document is the result of natural language, such as the words used, the punctuation, and so forth. Some text categorization systems can work very well without using much of the information available in the natural language text; for example, a text categorization system can work well using only the presence or absence of words found in a document, without any information about word frequency or word order.

B. Learning Phase

1) Feature selection: This step provides the data for the learning process; the classification machine is trained with the feature vectors obtained from the documents taken from the training data set (the collection of data for training).
2) Learning methods: Classifiers can be divided into two main types, namely binary classifiers and m-ary (m>2) classifiers [1]. A binary classifier says YES or NO to the category being identified in a document, independently of the decisions made for the other categories. An m-ary classifier uses the same classifier for all categories and creates a priority list of candidate categories for each document, with a degree of confidence for each candidate. The result for each category obtained from a binary classifier can thus be expressed as a priority level or score for the candidate categories. Commonly used classifiers include the following:


• Distance of vectors is the simplest classifier. Vectors are created to describe each category and each document, and then the distance between the category vector and the document vector is measured. The category with the shortest distance to the document is chosen.
• The Decision Trees algorithm retrieves informative words based on information gain criteria and predicts the category of a document based on the number of occurrences and combinations of words.
• Naive Bayes is a probabilistic classifier that uses the combined probability of words and categories to determine the category of a document. A Naive Bayes classifier is more efficient than other classifiers that would otherwise require exponential calculation. Systems based on Naive Bayes are the most frequently used systems in text categorization.
• kNN is an abbreviation of k-Nearest Neighbors. This method selects the k documents from the training data set that are most closely related to the new document and uses their categories to predict a category for the document being classified. kNN is an m-ary classifier.
• The Rocchio algorithm is a vector-space model for document classification. Its main process is to sum the document and category vectors with a positive or negative weight, depending on whether the document belongs to the category.
• Ripper is a nonlinear rule learning algorithm. It uses statistical data to make simple rules for each category and then uses the relationships between the rules to determine whether a document belongs to that category.
• Sleeping experts combine the category predictions of several classifiers; it is therefore very important to choose a suitable classifier and algorithm for the combining process.
• Neural networks can also be used to classify text, by using the multi-layered networks that have been tested.
Figure 1. Main system architecture: acquisition and preprocessing, building library, categorization



The main part of text categorization is the measurement of the similarity of two documents, or between a document and a category. To measure the similarity, a set of features or vectors can be used. Machine learning techniques are used to train all of the parameters and thresholds of the algorithm so that documents from the same category are matched correctly [2].
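As an illustration, the following is a minimal sketch (in Python, not part of the original paper) of such a distance between sparse feature vectors; the word features and weights are invented for the example.

import math

def cosine_distance(doc_vec, cat_vec):
    """Distance between two sparse feature vectors (feature -> weight);
    the smaller the value, the more similar document and category are."""
    dot = sum(w * cat_vec.get(f, 0.0) for f, w in doc_vec.items())
    norm_doc = math.sqrt(sum(w * w for w in doc_vec.values()))
    norm_cat = math.sqrt(sum(w * w for w in cat_vec.values()))
    if norm_doc == 0 or norm_cat == 0:
        return 1.0
    return 1.0 - dot / (norm_doc * norm_cat)

# Pick the category whose vector is closest to the document vector.
doc = {"bola": 3.0, "gol": 2.0, "menang": 1.0}
categories = {"sepakbola": {"bola": 5.0, "gol": 4.0},
              "tenis": {"set": 3.0, "servis": 2.0}}
best = min(categories, key=lambda c: cosine_distance(doc, categories[c]))
print(best)  # sepakbola

The category with the smallest distance is chosen, in line with the distance-of-vectors classifier described above.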
III. N-GRAM

An N-Gram is an N-character substring of a string whose length is more than N [3]. In principle, the substring can be built from any characters contained in the string without considering their order; for example, an N-Gram could be formed from the first and the third characters of a word. In this research, however, we only use substrings made of consecutive characters of a word, with varying length. A blank (space) is added at the beginning and end of each word (hereinafter we use an underscore "_" to represent the blank) to mark the word boundaries. For example, the word "TEXT" yields the bi-grams _T, TE, EX, XT, T_, the tri-grams _TE, TEX, EXT, XT_, T__, and the quad-grams _TEX, TEXT, EXT_, XT__, T___. In general, a word of length k, padded with blanks, has k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so forth.
The main advantage of N-Grams lies in the way a sentence is processed: each sentence is divided into small parts, so errors that may occur are likely to affect only a small portion of the process. If we compute the N-Grams of two sentences, we obtain a measurement of the similarity between the two sentences that is robust against a variety of errors in the content of the sentences.
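As an illustration, a minimal sketch (Python; not the paper's implementation) of this padding scheme and of collecting the N-grams of a sentence:

from collections import Counter

def ngrams(word, n):
    """Character N-grams of a word padded with one leading and n-1 trailing
    blanks (written here as underscores), so a word of length k gives k+1 N-grams."""
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
print(ngrams("TEXT", 4))  # ['_TEX', 'TEXT', 'EXT_', 'XT__', 'T___']

def sentence_ngrams(sentence, n_values=(2, 3, 4)):
    """Frequencies of all N-grams of all words in a sentence; comparing two
    such counts gives a similarity measure that tolerates local errors."""
    counts = Counter()
    for word in sentence.split():
        for n in n_values:
            counts.update(ngrams(word, n))
    return counts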
IV. ACQUISITION AND PREPROCESSING

The process of acquisition and preprocessing consists of two phases, namely the crawling process and the extraction of the content of the web pages (HTML files). The HTML extraction process actually produces two outputs. The first output is a .news file, which is the input for the next process; this .news file contains the important information from the web page itself, such as the news title, date, time, publisher and, of course, the content of the news. The second output is an Extensible Markup Language (XML) file, which is used to display the news content of the web pages used for training and testing in a uniform way. Figure 2 shows the architecture of the acquisition and preprocessing.

Figure 2. Architecture diagram of acquisition and preprocessing

A. Crawling
The crawler is made to fetch a number of web pages relatively fast and save them to a place in local storage. By using a crawler, other web pages linked to the main web page can also be retrieved automatically. The number of retrieved web pages is determined by the maximum depth level. An example of a web page can be seen in Figure 3.

Figure 3. Web document


To run the crawler, input describing the wanted pages is required. The main inputs required to run the crawler are the following (a short sketch of such a crawler is given after the list):
• Root URL: the root Uniform Resource Locator (URL) is the main URL to be crawled. Links on this page will later be crawled and processed further if the desired depth level is more than one.
• Maximum depth level: this input determines how deep the crawling goes. Crawling is terminated when the depth of the tree equals the maximum level. So, if this parameter is set to one, the download process is only done on the root page; no connected web pages are taken.
• Number of threads: the number of threads determines how many crawling processes can run simultaneously.
• Time out: the time out, given in milliseconds, limits the time each thread may spend fetching a page. When retrieving a page takes longer than the time out, the process is canceled and the crawler proceeds with the next pages.
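A minimal sketch of a crawler with these inputs might look as follows (Python, standard library only); the function and parameter names are illustrative assumptions, not the paper's implementation.

import urllib.request
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url, timeout_ms):
    """Fetch one page; on a time out (or any other error) return None so the
    crawler simply proceeds with the next pages."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_ms / 1000.0) as resp:
            return resp.read().decode("utf-8", errors="ignore")
    except Exception:
        return None

def crawl(root_url, max_depth=1, num_threads=4, timeout_ms=5000):
    """Download the root page and, up to max_depth levels, the pages it links to."""
    pages, frontier = {}, [root_url]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for _ in range(max_depth):
            results = list(pool.map(lambda u: (u, fetch(u, timeout_ms)), frontier))
            next_frontier = []
            for url, html in results:
                if html is None:
                    continue
                pages[url] = html
                parser = LinkParser()
                parser.feed(html)
                next_frontier += [urljoin(url, link) for link in parser.links]
            frontier = [u for u in next_frontier if u.startswith("http") and u not in pages]
            if not frontier:
                break
    return pages

With max_depth set to one, only the root page is downloaded, matching the behavior described for the maximum depth level above.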

B. Document Extraction
Each retrieved web document usually contains not only the news content of the page itself but also menus and links that are not required for information extraction. Figure 4 shows the content inside the web page (HTML file) shown in Figure 3. The menus and links can keep information extraction from working properly, which later causes unexpected results. There are also HTML tags that make the contents unclear and the categorization process much longer. Therefore, to make the extraction process more effective and efficient, the HTML tags, menus and links in the HTML file first have to be trimmed so that only the main content, the news, remains.

Figure 5. File .news

The extraction of the web pages aims to clean up the HTML tags, menus and links. There are several steps in extracting the content of a web page (a rough sketch follows the list):



• Find the publisher of the page.
• Search for the HTML section that contains the main news.
• Search for the published time (date and hour) of the news.
• Search for the title of the web page.
• Cut the remaining HTML tags.
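A rough sketch of these steps (Python, using regular expressions) is shown below; the DIV id ("isicontent", as it appears in the page of Figure 4) and the date pattern are illustrative assumptions, since real pages need publisher-specific rules.

import re

def extract_news(html):
    """Extract title, published time, publisher, and cleaned news text from a
    page, roughly following the extraction steps listed above."""
    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    # The DIV assumed to hold the main news; other publishers use different markup.
    body = re.search(r'<div[^>]*id="?isicontent"?[^>]*>(.*?)</div>', html, re.I | re.S)
    date = re.search(r"\d{1,2}/\d{1,2}/\d{4}\s+\d{1,2}:\d{2}", html)
    publisher = re.search(r"detik\w*", html, re.I)  # simple illustrative rule
    text = body.group(1) if body else html
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", text, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)          # cut the remaining HTML tags
    text = re.sub(r"\s+", " ", text).strip()      # collapse leftover whitespace
    return {
        "title": title.group(1).strip() if title else "",
        "published": date.group(0) if date else "",
        "publisher": publisher.group(0) if publisher else "",
        "content": text,
    }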

C. Creating XML files
For better viewing of the .news file, an Extensible Markup Language (XML) document is formed. XML documents have some advantages compared to HTML documents: they can deliver a range of virtual documents, present information in a structured form, and sort, filter, search and manipulate information in a simple way. Figure 6 shows an example of an XML file.
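A minimal sketch of forming such an XML document from the fields of a .news file (Python); the element names are assumptions, since the exact tag set is not listed here.

import xml.etree.ElementTree as ET

def news_to_xml(record, path):
    """Write the fields of a .news record as a small XML document."""
    root = ET.Element("news")
    for field in ("title", "date", "publisher", "content"):
        ET.SubElement(root, field).text = record.get(field, "")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

# Example with the news item visible in Figure 3 (values shortened).
news_to_xml({"title": "Beckham Enam Jam Menahan Sakit",
             "publisher": "detikSport", "date": "...", "content": "..."},
            "beckham.xml")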

Figure 4. Content of the web page (HTML source) shown in Figure 3