Gunawan and Andy Raharja Tanaya, ICGC-RCICT 2010

ISSN: 2086-4868

Technical Paper

Proceedings of ICGC-RCICT 2010

Collocation Detection for Indonesian
Gunawan *), Andy Raharja Tanaya **)

*) Department of Electrical Engineering, Faculty of Industrial Technology, Institut Teknologi Sepuluh Nopember
Kampus ITS Sukolilo, Surabaya 60111, East Java, Indonesia.

**) Department of Computer Science, Sekolah Tinggi Teknik Surabaya
Ngagel Jaya Tengah 73-77, Surabaya 60284, East Java, Indonesia.

Abstract- The task of extracting and detecting collocations in a
corpus has always been a part of computational linguistics
and statistical NLP. Collocations can be useful in various
ways, such as improving parsing quality and attaching POS
tags to a corpus. The term collocation for word
combinations, as introduced by John Rupert Firth, a British
linguist, in 1951, has its own unique distinction.
Regrettably, there are no fixed and clear rules for
detecting collocations on the linguistic side, apart from some
widely accepted collocation criteria and the knowledge of
linguistic experts. This research includes an analysis of
some methods for detecting collocations on the
computational linguistics side, namely association measures
(AM). Information regarding the frequency of occurrence in
a corpus is needed for computing an AM. Bigrams and their
frequencies of occurrence form the bigram list, which is
the result of a corpus preprocessing phase. The result of
computing an AM is an association score for each
bigram used as input, and these scores are needed for
ranking the bigram list. The rankings can further be
used to evaluate the performance of an AM in detecting
collocations. Evaluation can be done using the n-best list
method, which computes precision and recall over the top n
ranks of the ranked bigram list. The results of the
evaluation can be used to discover which AM performs best
in detecting collocations from a corpus, based
on the values of n-best precision and recall.



Keywords- statistical NLP; association measure; collocation; n-best

I. INTRODUCTION

Since Firth first introduced the term "collocation" in 1951,
there has been a lot of research on collocation.
Unfortunately, no formal definition of collocation has been
widely accepted so far.
Because of that, methods for measuring the association
between words, or association measures (AM), are used,
based on the idea that if a set of words has a high
association between its words, it can be suspected of being
a collocation. This paper discusses several AM methods as
well as the performance evaluation of each of them.

II. COLLOCATION
One (of many different) definitions of collocation is that a
collocation is an expression consisting of 2 or more words
that are associated in a conventional way in a collection of
uttered words. Manning and Schütze [1] provide several
criteria for a set of words to be considered a collocation,
namely:

Non-compositionality: kunci jawaban (answer key).

Non-substitutability: kunci inggris (wrench, literally "English key").

Non-modifiability: membanting tulang (to work very hard, literally "to slam bones").

Using only the 3 criteria above, although quite helpful,
is not sufficient to provide a clear definition of
collocation. Therefore, the association score generated by
an AM is expected to provide statistical evidence of whether
a set of words is a collocation. A pilot program is used to
help analyze the AM methods (it is explained in the
following sections).

III. ARCHITECTURE

The system architecture used in the pilot program in this
paper is shown in Figure 1.

[Figure 1 diagram; recoverable stage labels: list of common word, testing, list of ranked bigram]

Figure 1. System Architecture

An explanation of each stage in Figure 1 can be found in
the sections below.

Yogyakarta, 2-3 March 2010

IV.

PREPROCESSING

The preprocessing phase changes the format of the input
text corpus into datasets that contain lists of bigrams. Two
types of corpus are used: the raw corpus (Acts
of the Apostles, Luke, Mark, Romans, the entire New
Testament, and the Old Testament) and the corpus with
part-of-speech (POS) tags (Acts of the Apostles, Luke, Mark,
and Romans). Two types of preprocessing are performed:
counting the frequency of occurrence of contiguous bigrams
(bigrams whose words are adjacent) and computing the
distances for positional bigrams (bigrams whose
constituent words can be separated by several words,
depending on the size of the collocational window, at most
5).
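The two kinds of preprocessing described above can be sketched as follows. This is a minimal illustration, not the pilot program itself: the function names, the toy token list, and the reading of "max 5" as a distance of at most 5 positions (with adjacent words at distance 1) are assumptions, and sentence boundaries are ignored.

```python
from collections import Counter, defaultdict

def contiguous_bigrams(tokens):
    """Count how often each pair of adjacent words occurs."""
    return Counter(zip(tokens, tokens[1:]))

def positional_bigrams(tokens, window=5):
    """Collect, for each word pair, the distances at which the pair occurs.

    Assumption: distance 1 means adjacent, and pairs further apart
    than `window` positions are ignored.
    """
    distances = defaultdict(list)
    for i, left in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            distances[(left, tokens[i + d])].append(d)
    return distances

# Toy token list (hypothetical, not taken from the paper's corpus).
tokens = "tuhan yesus berkata kepada murid murid tuhan yesus".split()
freq = contiguous_bigrams(tokens)
dist = positional_bigrams(tokens, window=5)
print(freq[("tuhan", "yesus")])   # 2
print(dist[("tuhan", "murid")])   # [4, 5]
```

Keeping the full list of distances per pair, rather than a single number, leaves the choice of a summary statistic (for example mean and variance of the distance) to a later stage.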

Dept. of EE and IT, GMU


The outputs of contiguous bigram preprocessing are a
bigram dataset and the frequency of occurrence of each
bigram in the corpus. The outputs of positional bigram
preprocessing are a bigram dataset and, for each bigram in
the dataset, the collection of distances between its
constituents.
The common word filter is used in the preprocessing of
both types of corpus, while the POS filter is used
for the POS-tagged corpus only.
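The two filters can be sketched as below, assuming bigram frequencies are stored in a dictionary. The common-word list and the tags KJ and DP are hypothetical placeholders; the four allowed POS patterns are the ones listed in Table I.

```python
# Hypothetical common-word list, for illustration only.
COMMON_WORDS = {"yang", "dan", "di", "ke", "dari"}

# POS tag patterns as listed in Table I.
ALLOWED_POS_PATTERNS = {("BN", "BN"), ("BN", "BNP"), ("BN", "SF"), ("BNP", "BNP")}

def filter_common(bigram_freq, common_words=COMMON_WORDS):
    """Drop every bigram in which either word is a common word."""
    return {bg: f for bg, f in bigram_freq.items()
            if bg[0] not in common_words and bg[1] not in common_words}

def filter_pos(tagged_bigram_freq, patterns=ALLOWED_POS_PATTERNS):
    """Keep only bigrams whose (tag, tag) pair is an allowed pattern.

    Keys are assumed to look like ((word, tag), (word, tag)).
    """
    return {bg: f for bg, f in tagged_bigram_freq.items()
            if (bg[0][1], bg[1][1]) in patterns}

raw = {("air", "mata"): 3, ("dan", "yesus"): 5}
tagged = {
    (("air", "BN"), ("mata", "BN")): 3,
    (("berkata", "KJ"), ("kepada", "DP")): 4,  # KJ and DP are made-up tags
}
print(filter_common(raw))   # {('air', 'mata'): 3}
print(filter_pos(tagged))   # {(('air', 'BN'), ('mata', 'BN')): 3}
```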
TABLE I.

POS Tag Pattern    Example
BN+BN              Air mata (tears)
BN+BNP             Kota Yerusalem (city of Jerusalem)
BN+SF              Air asin (salt water)
BNP+BNP            Tuhan Yesus (Lord Jesus)

TABLE II.

OBSERVED FREQUENCY CONTINGENCY TABLE FOR BIGRAM (U, V)

          V = v    V ≠ v
U = u     O11      O12
U ≠ u     O21      O22

One-sided measure: a high value indicates a
positive association (the bigram constituents
occur together more often), whereas low values,
including negative values, indicate a negative
association (the bigram constituents more often
appear independently). Examples of AMs that
are one-sided measures are
frequency and the t-score.

Two-sided measure: a high value indicates a
strong association, whether positive or negative,
whereas a low value indicates
independence. An example of an AM that is
a two-sided measure is chi-squared.
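The difference between one-sided and two-sided measures can be illustrated with a small sketch. The t-score and chi-squared formulas used here are the commonly used textbook forms and are an assumption, since this section does not spell out its own formulas; the function names and the toy counts are likewise hypothetical.

```python
import math
from collections import Counter

def contingency(bigram_freq, u, v):
    """Build the observed 2x2 table (O11, O12, O21, O22) for bigram (u, v)."""
    n = sum(bigram_freq.values())
    o11 = bigram_freq.get((u, v), 0)
    r1 = sum(f for (a, _), f in bigram_freq.items() if a == u)  # u as first word
    c1 = sum(f for (_, b), f in bigram_freq.items() if b == v)  # v as second word
    o12 = r1 - o11
    o21 = c1 - o11
    o22 = n - o11 - o12 - o21
    return o11, o12, o21, o22

def t_score(o11, o12, o21, o22):
    """One-sided measure; assumes o11 > 0, true for any bigram in the list."""
    n = o11 + o12 + o21 + o22
    e11 = (o11 + o12) * (o11 + o21) / n   # expected co-occurrence frequency
    return (o11 - e11) / math.sqrt(o11)

def chi_squared(o11, o12, o21, o22):
    """Two-sided measure (Pearson chi-squared for a 2x2 table)."""
    n = o11 + o12 + o21 + o22
    denom = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0

# Toy counts (hypothetical).
counts = Counter({("air", "mata"): 8, ("air", "asin"): 2,
                  ("kota", "mata"): 1, ("kota", "yerusalem"): 9})
attracted = contingency(counts, "air", "mata")    # (8, 2, 1, 9)
repelled = contingency(counts, "kota", "mata")    # (1, 9, 8, 2)
print(t_score(*attracted), chi_squared(*attracted))
print(t_score(*repelled), chi_squared(*repelled))
```

In this toy data both pairs receive the same chi-squared value, while the t-score is positive for the first pair and negative for the second, which is the distinction the two descriptions above draw.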

TESTING

The testing phase aims to provide an association score
for each bigram in the dataset produced by preprocessing,
using one of the AM methods.
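The step from association scores to a ranked list, and the n-best evaluation mentioned in the abstract, can be sketched as follows. The scores and the set of reference collocations are invented for illustration, and the function names are assumptions rather than parts of the pilot program.

```python
def rank_bigrams(scores):
    """Return bigrams sorted from highest to lowest association score."""
    return sorted(scores, key=scores.get, reverse=True)

def n_best(ranked, true_collocations, n):
    """Precision and recall over the top n entries of a ranked bigram list."""
    top = ranked[:n]
    hits = sum(1 for bg in top if bg in true_collocations)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_collocations) if true_collocations else 0.0
    return precision, recall

# Hypothetical scores and reference collocations.
scores = {("air", "mata"): 9.9, ("kota", "yerusalem"): 8.1,
          ("tuhan", "yesus"): 7.5, ("dan", "yesus"): 3.0}
gold = {("air", "mata"), ("tuhan", "yesus"), ("membanting", "tulang")}

ranked = rank_bigrams(scores)
precision, recall = n_best(ranked, gold, n=2)
print(ranked[:2])
print(precision, recall)
```

Repeating the call for several values of n and for several AMs gives the comparison of n-best precision and recall on which the evaluation relies.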

TABLE III.

EXPECTED FREQUENCY CONTINGENCY TABLE FOR BIGRAM (U, V)
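Under the null hypothesis of independence, expected frequencies are conventionally derived from the row and column totals of the observed table. A sketch in the notation of the observed contingency table, where the row sums R_i, the column sums C_j, and the sample size N are names assumed here:

```latex
R_i = O_{i1} + O_{i2}, \qquad
C_j = O_{1j} + O_{2j}, \qquad
N = O_{11} + O_{12} + O_{21} + O_{22}

E_{ij} = \frac{R_i \, C_j}{N}, \qquad i, j \in \{1, 2\}
```

For example, with O11 = 8, O12 = 2, O21 = 1, and O22 = 9, this gives R_1 = 10, C_1 = 9, N = 20, and E_11 = 4.5.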

According to Evert and Krenn [2] there are 4
approaches to AM, i.e.:
1) Significance of association: Calculates the amount of
evidence for rejecting the null hypothesis of independence
(that is, that the words of a bigram occur together merely
by coincidence, as determined by the combination of the
probabilities of each word making up the bigram). There are
3 subclasses of this approach:

Likelihood measure

Exact hypothesis test
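As an illustration of an exact hypothesis test against the null hypothesis of independence, the sketch below computes a one-sided Fisher exact p-value from a 2x2 contingency table. Fisher's test is a commonly used exact test and is an assumption here, since this section does not state which exact test is meant; the function name is hypothetical.

```python
from math import comb

def fisher_exact_one_sided(o11, o12, o21, o22):
    """Probability of seeing at least o11 co-occurrences under independence.

    Uses the hypergeometric distribution with fixed row and column totals.
    A smaller p-value means stronger evidence against independence.
    """
    r1 = o11 + o12              # occurrences with u as first word
    c1 = o11 + o21              # occurrences with v as second word
    n = o11 + o12 + o21 + o22   # total number of bigram tokens
    tail = sum(comb(c1, k) * comb(n - c1, r1 - k)
               for k in range(o11, min(r1, c1) + 1))
    return tail / comb(n, r1)

p_value = fisher_exact_one_sided(8, 2, 1, 9)
print(p_value)
```

Because a smaller p-value indicates a stronger association, a ranking built from this test would sort bigrams in ascending order of p-value, or in descending order of its negative logarithm.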

[6] Evert, Stefan, and Brigitte Krenn. Using Small Random Samples for the Manual Evaluation of Statistical Association Measures.

The common word filter applied in the
preprocessing stage helps filter out
bigrams that contain a common word, because
a bigram that contains a common word is not
likely to be a collocation.

From the research on collocation presented in this paper,
a few things can be taken as suggestions for further
development of research on Indonesian collocation, namely:


[Results table residue; row labels not recoverable. Stray values: 31.11, 5.5. Column P: 1.73, 2.72, 5.19, 6.42, 6.67, 7.41, X.XX, 9.14, 10.12, 10.62]

the collocation from a corpus, because only
bigrams resulting from the preprocessing stage that have
a POS tag combination pattern matching that of the
majority of collocations

Resources in the Indonesian language to
conduct adequate collocation is nec