IALP2010 Gunawan dan Andy
2010 International Conference
on Asian Language
Processing
IALP 2010
Table of Contents
Message from General Chairs..............................................................................................................xi
Message from Program Chairs..........................................................................................................xiii
Conference Committees........................................................................................................................xiv
Program Committee...................................................................................................................................xv
Organizers and Sponsors....................................................................................................................xvii
Invited Talks ..................................................................................................................................................xix
Lexicon and Morphology
A Survey on Rendering Traditional Mongolian Script ...................................................................................3
Biligsaikhan Batjargal, Fuminori Kimura, and Akira Maeda
A Combination of Statistical and Rule-Based Approach for Mongolian
Lexical Analysis ....................................................................................................................................................7
Lili Zhao, Jia Men, Congpin Zhang, Qun Liu, Wenbin Jiang, Jinxing Wu,
and Qing Chang
A Letter Tagging Approach to Uyghur Tokenization ...................................................................................11
Batuer Aisha
Development of Analysis Rules for Bangla Root and Primary Suffix
for Universal Networking Language ................................................................................................................15
Md. Nawab Yousuf Ali, Shahid Al Noor, Md. Zakir Hossain, and Jugal Krishna Das
A Suffix-Based Noun and Verb Classifier for an Inflectional Language ...................................................19
Navanath Saharia, Utpal Sharma, and Jugal Kalita
Behavior of Word ‘kaa’ in Urdu Language ....................................................................................................23
Muhammad Kamran Malik, Aasim Ali, and Shahid Siddiq
Methods to Divide Uygur Morphemes and Treatments for Exceptions .....................................................27
Pu Li and Hao Zhao
Rules for Morphological Analysis of Bangla Verbs for Universal
Networking Language ........................................................................................................................................31
Md. Nawab Yousuf Ali, Mohammad Zakir Hossain Sarker,
Ghulam Farooque Ahmed, and Jugal Krishna Das
Discussion on Collation of Tibetan Syllable ..................................................................................................35
Heming Huang and Feipeng Da
A Dictionary Mechanism for Chinese Word Segmentation Based on
the Finite Automata ............................................................................................................................................39
Wu Yang, Liyun Ren, and Rong Tang
Development of Templates for Dictionary Entries of Bangla Roots
and Primary Suffixes for Universal Networking Language .........................................................................43
Md. Zakir Hossain, Shaikh Muhammad Allayear, Md. Nawab Yousuf Ali,
and Jugal Krishna Das
A Study on "Worry" Separable Words & Its Separable Slots ......................................................................47
Chunling Li and Xiaoxiao Wang
Syntax and Parsing
Improving Dependency Parsing Using Punctuation ......................................................................................53
Zhenghua Li, Wanxiang Che, and Ting Liu
A Tree Probability Generation Using VB-EM for Thai PGLR Parser .......................................................57
Kanokorn Trakultaweekoon, Taneth Ruangrajitpakorn, Prachya Boonkwan,
and Thepchai Supnithi
Research on Verb Subcategorization-Based Syntactic Parsing Postprocess
for Chinese Language .........................................................................................................................................61
Jinyong Wang and Xiwu Han
Identification of Maximal-Length Noun Phrases Based on Maximal-Length
Preposition Phrases in Chinese .........................................................................................................................65
Guiping Zhang, Wenjing Lang, Qiaoli Zhou, and Dongfeng Cai
Urdu Noun Phrase Chunking - Hybrid Approach ..........................................................................................69
Shahid Siddiq, Sarmad Hussain, Aasim Ali, Kamran Malik, and Wajid Ali
The Function of Fixed Word Combination in Chinese Chunk Parsing ......................................................73
Liqun Wang and Shoichi Yokoyama
Problems and Review of Statistical Parsing Language Model ....................................................................77
Faguo Zhou, Fan Zhang, and Bingru Yang
A General Comparison on Sentences Analysis and Its Teaching
Significance between Traditional and Structural Grammars ........................................................................81
Jiaying Yu
Semantics
Two Cores in Chinese Negation System: A Corpus-Based View ...............................................................87
Hio Tong Chan and Chunyu Kit
Finding Semantic Similarity in Vietnamese ...................................................................................................91
Dat Tien Nguyen and Son Bao Pham
Automatic Metaphor Recognition Based on Semantic Relation Patterns ..................................................95
Xuri Tang, Weiguang Qu, Xiaohe Chen, and Shiwen Yu
Event Entailment Extraction Based on EM Iteration ..................................................................................101
Zhen Li, Hanjing Li, Mo Yu, Tiejun Zhao, and Sheng Li
On the Semantic Orientation and Computer Identification of the Adverb
“Jiù” ....................................................................................................................................................................105
Lin He and Jiaqin Wu
Semantic Genes and the Formalized Representation of Lexical Meaning ...............................................110
Dan Hu
Acquisition of Hypernymy-Hyponymy Relation between Nouns
for WordNet Building ......................................................................................................................................114
Gunawan and Erick Pranata
Algorithm for Conversion of Bangla Sentence to Universal Networking
Language ............................................................................................................................................................118
Md. Nawab Yousuf Ali, M. Ameer Ali, Abu Mohammad Nurannabi,
and Jugal Krishna Das
Construction of the Paradigmatic Semantic Network Based on Cognition .............................................122
Xiaofang Ouyang
The Research of Sentence Testing Based on HNC Analysis System
of Sentence Category ........................................................................................................................................126
Zhiying Liu
Semantic Patterns of Chinese Post-Modified V+N Phrases .......................................................................130
Likun Qiu and Wenxian Zhang
Information Extraction
A Grammar-Based Unsupervised Method of Mining Volitive Words .....................................................137
Jianfeng Zhang, Yu Hong, Yuehui Yang, Jianmin Yao, and Qiaoming Zhu
Using Feature Selection to Speed Up Online SVM Based Spam Filtering .............................................142
Yuewu Shen, Guanglu Sun, Haoliang Qi, and Xiaoning He
A Semi-supervised Method for Classification of Semantic Relation
between Nominals .............................................................................................................................................146
Yuan Chen, Yue Lu, Man Lan, Jian Su, and Zhengyu Niu
XPath-Wrapper Induction for Data Extraction .............................................................................................150
Nam-Khanh Tran, Kim-Cuong Pham, and Quang-Thuy Ha
A Block Segmentation Based Approach for Web Information Extraction ..............................................154
Chanwei Wang, Chengjie Sun, Lei Lin, and Xiaolong Wang
Linguistic Features for Named Entity Recognition Using CRFs ..............................................................158
R. Vijay Sundar Ram, A. Akilandeswari, and Sobha Lalitha Devi
Research on Domain-Adaptive Transfer Learning Method and Its
Applications .......................................................................................................................................................162
Geli Fei and Dequan Zheng
Information Theory Based Feature Valuing for Logistic Regression
for Spam Filtering .............................................................................................................................................166
Haoliang Qi, Xiaoning He, Yong Han, Muyun Yang, and Sheng Li
Automatic Named Entity Set Expansion Using Semantic Rules
and Wrappers for Unary Relations .................................................................................................................170
Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, and Hoang-Quynh Le
Anaphora Resolution of Malay Text: Issues and Proposed Solution Model ...........................................174
Noorhuzaimi Karimah Mohd Noor, Shahrul Azman Noah, Mohd Juzaidin Ab. Aziz,
and Mohd Pouzi Hamzah
Combining Multi-features with Conditional Random Fields for Person
Recognition ........................................................................................................................................................178
Suxiang Zhang
Comparison between Typical Discriminative Learning Model
and Generative Model in Chinese Short Messages Service Spam Filtering ............................................182
Xiaoxia Zheng, Chao Liu, Chengzhe Huang, Yu Zou, and Hongwei Yu
Chinese Spam Filter Based on Relaxed Online Support Vector Machine ...............................................185
Yong Han, Xiaoning He, Muyun Yang, Haoliang Qi, and Chao Song
Comment Target Extraction Based on Conditional Random Field &
Domain Ontology ..............................................................................................................................................189
Shengchun Ding and Ting Jiang
Text Understanding and Summarization
Topic-Driven Multi-document Summarization ............................................................................................195
Hongling Wang and Guodong Zhou
Multiple Factors-Based Opinion Retrieval and Coarse-to-Fine Sentiment
Classification .....................................................................................................................................................199
Shu Zhang, Wenjie Jia, Yingju Xia, Yao Meng, and Hao Yu
Chinese Sentence-Level Sentiment Classification Based on Sentiment
Morphemes ........................................................................................................................................................203
Xin Wang and Guohong Fu
Extracting Phrases in Vietnamese Document for Summary Generation ..................................................207
Huong Thanh Le, Rathany Chan Sam, and Phuc Trong Nguyen
User Interest Analysis with Hidden Topic in News Recommendation
System ................................................................................................................................................................211
Mai-Vu Tran, Xuan-Tu Tran, and Huy-Long Uong
Dependency Tree-Based Anaphoricity Determination for Coreference
Resolution ..........................................................................................................................................................215
Fang Kong, Jianmei Zhou, Guodong Zhou, and Qiaoming Zhu
Text Clustering Based on Domain Ontology and Latent Semantic Analysis ..........................................219
Yaxiong Li, Jianqiang Zhang, and Dan Hu
Social Network Mining Based on Wikipedia ...............................................................................................223
Fangfang Yang, Zhiming Xu, Sheng Li, and Zhikai Xu
Retrospective Labels in Chinese Argumentative Discourses .....................................................................227
Donghong Liu
Improve Search by Optimization or Personalization, A Case Study in Sogou
Log ......................................................................................................................................................................231
Jingbin Gao, Muyun Yang, Sheng Li, Tiejun Zhao, and Haoliang Qi
Machine Translation
Conditional Random Fields for Machine Translation System Combination ...........................................237
Tian Xia, Shandian Zhe, and Qun Liu
A Method of Automatic Translation of Words of Multiple Affixes
in Scientific Literature ......................................................................................................................................241
Lei Wang, Baobao Chang, and Janet Harkness
Hierarchical Pitman-Yor Language Model for Machine Translation .......................................................245
Tsuyoshi Okita and Andy Way
Training MT Model Using Structural SVM .................................................................................................249
Tiansang Du and Baobao Chang
English-Hindi Automatic Word Alignment with Scarce Resources .........................................................253
Eknath Venkataramani and Deepa Gupta
Sentence Similarity-Based Source Context Modelling in PBSMT ...........................................................257
Rejwanul Haque, Sudip Kumar Naskar, Andy Way, Marta R. Costa-jussà,
and Rafael E. Banchs
Verb Transfer in a Tamil to Hindi Machine Translation System ..............................................................261
Sobha Lalitha Devi, Pravin Pralayankar, S. Menaka, T. Bakiyavathi,
R. Vijay Sundar Ram, and V. Kavitha
Lexical Gap in English - Vietnamese Machine Translation: What to Do? ..............................................265
Le Manh Hai and Phan Thi Tuoi
Nominal Transfer from Tamil to Hindi .........................................................................................................270
Sobha Lalitha Devi, V. Kavitha, Pravin Pralayankar, S. Menaka, T. Bakiyavathi,
and R. Vijay Sundar Ram
Language Resources
Building Thai FrameNet through a Combination Approach ......................................................................277
Dhanon Leenoi, Sawittree Jumpathong, and Thepchai Supnithi
Evaluating the Quality of Web-Mined Bilingual Sentences Using Multiple
Linguistic Features ............................................................................................................................................281
Xiaohua Liu and Ming Zhou
A Semi-Supervised Approach on Using Syntactic Prior Knowledge
for Construction Thai Treebank ......................................................................................................................285
Taneth Ruangrajitpakorn, Prachya Boonkwan, Thepchai Supnithi,
and Phiradet Bangcharoensap
A Proposed Model for Constructing a Yami WordNet ...............................................................................289
Meng-Chien Yang, D. Victoria Rau, and Ann Hui-Huan Chang
Annotation Guidelines for Hindi-English Word Alignment ......................................................................293
Rahul Kumar Yadav and Deepa Gupta
Building Synsets for Indonesian WordNet with Monolingual Lexical
Resources ...........................................................................................................................................................297
Gunawan and Andy Saputra
A Study of Unique Words in Hawks’ Translation of Hong Lou Meng
in Comparison with Yang’s Translation ........................................................................................................301
Yunfang Liang, Lixin Wang, and Dan Yang
Kazakh Noun Phrase Extraction Based on N-gram and Rules ..................................................................305
Gulila Altenbek and Ruina Sun
Spoken Language Processing
Feature Smoothing and Frame Reduction for Speaker Recognition .........................................................311
Santi Nuratch, Panuthat Boonpramuk, and Chai Wutiwiwatchai
Improved Cantonese Tone Recognition with Approximated F0 Contour:
Implications for Cochlear Implants ................................................................................................................315
Meng Yuan, Haihong Feng, and Tan Lee
Combining Sub-bands SNR on Cochlear Model for Voice Activity
Detection ............................................................................................................................................................319
Qibo Liu, Yi Liu, and Yanjie Li
Precedence of Emotional Features in Emotional Prosody Processing:
Behavioral and ERP Evidence ........................................................................................................................323
Xuhai Chen and Yufang Yang
Acoustic Space of Vowels with Different Tones: Case of Thai Language ..............................................327
Vaishna Narang, Deepshikha Misra, Ritu Yadav, and Sulaganya Punyayodhin
A Study of F1 Correlation with F0 in a Tone Language: Case of Thai ....................................................330
Sulaganya Punyayodhin, Deepshikha Misra, Ritu Yadav, and Vaishna Narang
A Contrastive Study of F3 Correlation with F2 and F1 in Thai and Hindi ..............................................334
Ritu Yadav, Deepshikha Misra, Sulaganya Punyayodhin, and Vaishna Narang
Durational Contrast and Centralization of Vowels in Hindi and Thai ......................................................339
Deepshikha Misra, Ritu Yadav, Sulaganya Punyayodhin, and Vaishna Narang
A Study in Comparing Acoustic Space: Korean and Hindi Vowels .........................................................343
Hyunkyung Lee and Vaishna Narang
Author Index ................................................................................................................................................347
Building Synsets for Indonesian WordNet with
Monolingual Lexical Resources
Gunawan*,**) and Andy Saputra**)
*) Dept. of Electrical Engineering
Faculty of Industrial Technology
Institut Teknologi Sepuluh Nopember
Surabaya, East Java, Indonesia
**) Dept. of Computer Science
Sekolah Tinggi Teknik Surabaya
Surabaya, East Java, Indonesia
[email protected], [email protected]
Abstract - This paper presents an approach to building synsets for
Indonesian WordNet semi-automatically, using monolingual
lexical resources that are freely available in Bahasa Indonesia.
The monolingual lexical resources are Kamus Besar Bahasa
Indonesia or KBBI (the monolingual dictionary of Bahasa Indonesia)
and Tesaurus Bahasa Indonesia (the Indonesian thesaurus). We
assume that monolingual resources play an important role in
synset building because they provide more accurate senses
specifically for Bahasa. Moreover, the resources used are
produced by the Bahasa Indonesia Language Center, a
government institution that manages the development of Bahasa
Indonesia. Nevertheless, the synsets retrieved should be regarded
as a prototype version of the Indonesian WordNet synsets.
Keywords - synset; automatic construction; monolingual lexical
resource; thesaurus; Bahasa Indonesia
I. INTRODUCTION
Princeton WordNet (PWN)[1] is one of the most popular
and most widely used lexical databases for the English
language, mainly in computational linguistics and natural
language processing (NLP) research and applications. It was
developed at Princeton University and built manually by a
group of lexicographers, whose results are compiled into
lexical database files. Manual construction ensures high
accuracy and quality, but it costs a great deal of effort and
resources, such as language experts and time. For this reason,
many researchers try to build WordNets from other existing
lexical resources automatically or semi-automatically. For
example, the Korean WordNet[2] and the Romanian WordNet[3]
were constructed automatically from an existing WordNet[4]
and other available lexical resources.
This paper presents an approach for building the
Indonesian WordNet, limited to synsets only, using the
monolingual lexical resources available in Indonesia. Further
research to complete the semantic relations will be carried out
in the future. Synsets are constructed first because the synset is
the basic concept underlying the other semantic relations in
a lexical database that takes PWN as its main reference.
The main approach to synset construction uses monolingual
resources, namely KBBI and the thesaurus. The thesaurus is the
main resource, since a thesaurus is a lexical resource that
provides synonymy relations between words in a language.
Additional data are retrieved from KBBI, the most widely used
monolingual dictionary for Bahasa. Both resources are freely
available for download and are provided by the Bahasa Indonesia
Language Center, which is part of the Education Ministry of
Indonesia.
This paper is organized as follows. Section 2 describes
the resources used in this research. Section 3 explains the
approach based on monolingual resources in Bahasa. Section 4
discusses the synsets obtained. Finally, sections 5 and 6
present conclusions and future research.
II. LEXICAL RESOURCES
A lexical database that is built semi-automatically depends
heavily on the lexical resources used in the building process.
Candidate resources include dictionaries, thesauri, and corpora.
Unfortunately, only a monolingual dictionary (KBBI) and the
Indonesian thesaurus are available; an ideal corpus (e.g. an
encyclopedia) for this research has not been found. As stated
previously, KBBI and the Indonesian thesaurus can be downloaded
freely from the official site of the Education Ministry of
Indonesia. This section describes the lexical resources used in
this research.
A. Tesaurus Bahasa Indonesia
Tesaurus Bahasa Indonesia is a thesaurus in Bahasa that
provides several relations, such as antonyms, synonyms, and
hyponyms (according to the official information in the book,
an entry in this thesaurus can consist of synonyms or
hyponyms). A thesaurus is a resource that groups words
sharing the same meaning. In the Indonesian thesaurus,
however, the distinction between hyponyms and synonyms is
not as explicit as the distinction between synonyms and
antonyms. This must therefore be taken into account before
the thesaurus data are processed.
The following is an example of an entry in Tesaurus Bahasa
Indonesia:
haram a 1 gelap (ki), ilegal, liar, pantang, sumbang,
tabu, terlarang; 2 mulia, suci;
ant 1 halal
The lemma ‘haram’ in Bahasa has a meaning similar to ‘illegal’ in
English; among its pairs, ‘ilegal’ is similar to ‘illegal’,
‘liar’ to ‘wild’, and ‘tabu’ to ‘taboo’. The word ‘haram’ is the
main lemma of this entry and has class ‘a’ (adjective). The
number following the semantic class is the sense number. In the
example above, ‘haram’ thus has 2 different senses, each with
its own set of synonyms/hyponyms. The antonym of ‘haram’
(‘halal’, the same as ‘legal’ in English) follows the keyword
‘ant’, which marks the antonym part. The number after ‘ant’
shows which sense the antonym belongs to.
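The entry layout described above can be read mechanically. The following is a minimal sketch in Python, assuming that an entry is available as a single line, that the lemma is a single token, and that parenthesised labels such as ‘(ki)’ are annotations rather than words. The function names are hypothetical and are not part of the original work.

```python
import re

def _parse_numbered(text):
    """Map sense number -> list of words; unnumbered text counts as sense 1."""
    result = {}
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        match = re.match(r"(\d+)\s+(.*)", chunk)
        number, words = (int(match.group(1)), match.group(2)) if match else (1, chunk)
        # Assumption: parenthesised labels such as '(ki)' are annotations, not words.
        words = re.sub(r"\s*\([^)]*\)", "", words)
        result[number] = [w.strip() for w in words.split(",") if w.strip()]
    return result

def parse_thesaurus_entry(line):
    """Split one entry into lemma, class, senses, and antonyms."""
    lemma, pos, rest = line.split(None, 2)
    # The keyword 'ant' introduces the antonym part of the entry.
    body, _, ant_part = rest.partition(" ant ")
    return {
        "lemma": lemma,
        "pos": pos,
        "senses": _parse_numbered(body),
        "antonyms": _parse_numbered(ant_part) if ant_part else {},
    }

entry = parse_thesaurus_entry(
    "haram a 1 gelap (ki), ilegal, liar, pantang, sumbang, tabu, terlarang; "
    "2 mulia, suci; ant 1 halal"
)
print(entry["senses"][2], entry["antonyms"])
```

Keeping the sense number as the dictionary key preserves the link between an antonym and the sense it belongs to.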
However, several problems arise in processing the
Indonesian thesaurus:
1. Synonymy in PWN is a bi-directional relation: if a word
w1 is a synonym of a word w2, then w2 should list w1 as
its synonym. This relation does not always hold in
thesaurus entries.
2. Unlike Roget’s Thesaurus for English, the Indonesian
thesaurus does not categorize its entries and senses. It is
therefore impossible to use categories, as in Roget’s, to
support word sense disambiguation for entries with
multiple senses.
B. Kamus Besar Bahasa Indonesia (KBBI)
KBBI is the official monolingual dictionary of Bahasa
Indonesia and the most widely used dictionary in Indonesia.
The latest edition of KBBI is the 4th edition, which is
downloadable in PDF format, but this research uses the 3rd
edition because it is available as a text file, which is easier
to process (we obtained the 3rd edition as a tag-formatted
text file from a mobile application titled KBBI Mobile).
Unlike the thesaurus, this dictionary provides a definition
for each sense of a lemma, together with example sentences
illustrating its usage. The main reason for adding data from
KBBI is to ensure that all words in Bahasa are covered by the
synsets produced. However, we also discovered that KBBI
contains implicit synonyms. In some entries, the definition
consists of only a few words (1-3 words each) separated by
semicolons. Such definitions are processed as synonyms.
The following examples illustrate some entries in KBBI:
apel n nama pohon yg buahnya berdaging keras dan
mengandung air serta berkulit lunak yg warnanya
merah (kemerahan) atau kekuning (kekuningan), jika
matang rasanya manis keasam-asaman, Pyrus malus
‘apel’ is the lemma of the entry, followed by ‘n’, which shows
the semantic category (n stands for noun), then the definition,
and finally the binomial name (given for animals and plants
only). ‘apel’ is similar to ‘apple’ in English, and its definition
reads ‘the name of a tree whose fruit has reddish/yellowish
skin, contains water, and is soft-skinned’. In a different sense,
however, ‘apel’ in Bahasa can mean the same as ‘upacara’,
which is similar to ‘ceremony’ in English.
abai a 1 tidak dihiraukan (tidak dilakukan dng
sungguh-sungguh; tidak diindahkan dsb); 2 lalai: anak-anak
tidak boleh – thd nasihat orang tua dan guru;
As another example, the lemma ‘abai’ (equivalent to ‘don’t care’
in English) shown above has the adjective class (shown by ‘a’
after the lemma) and two senses, distinguished by the sense
number that precedes each definition. The italic sentence
(in the 2nd sense) is the example sentence for that sense.
aba n ayah; bapak; (kadang-kadang juga berarti)
kakek
The lemma ‘aba’ (equivalent to ‘father’ in English), shown
above, has the noun class (shown by ‘n’). Its definition,
however, consists of only a few words separated by semicolons.
In Bahasa, ‘aba’, ‘ayah’, and ‘bapak’ share the same meaning
(all translate to ‘father’ in English). A definition like this is
therefore treated as a set of implicit synonyms, with ‘ayah’
and ‘bapak’ taken as synonyms of ‘aba’. The synonyms are
limited, however, to words or phrases of at most 3 words.
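The rule for implicit synonyms can be sketched as a small filter. This is only an illustration in Python under stated assumptions: the definition text is already separated from the lemma and class, and any part containing a parenthesised remark is skipped. That second rule is our assumption, not a rule stated in the text. The function name `implicit_synonyms` is hypothetical.

```python
MAX_WORDS = 3  # limit stated in the text: a synonym has at most 3 words

def implicit_synonyms(definition, max_words=MAX_WORDS):
    """Return the short, semicolon-separated parts of a KBBI definition."""
    synonyms = []
    for part in definition.split(";"):
        part = part.strip()
        if not part or "(" in part or ")" in part:
            continue  # assumption: parts with parenthesised remarks are not synonyms
        if len(part.split()) <= max_words:
            synonyms.append(part)
    return synonyms

# Definition of 'aba' from the example entry above.
print(implicit_synonyms("ayah; bapak; (kadang-kadang juga berarti) kakek"))
```

For the ‘aba’ entry this keeps ‘ayah’ and ‘bapak’, which matches the treatment described above, while a long descriptive definition such as the one for ‘apel’ yields no synonyms.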
Several problems arise in processing KBBI, especially
when merging it with data from the thesaurus:
1. Some data in KBBI and the thesaurus cannot be related
to each other. For example, a word w1 has 4 senses in
KBBI. The thesaurus also has w1 as a lemma (with only
1 sense), but it gives no information about which of the
4 KBBI senses of w1 corresponds to that thesaurus
entry.
2. Some lemmas exist in both KBBI and the thesaurus with
a single sense (monosemous) but with different semantic
classes, although manual inspection shows that they
refer to the same lemma and should have the same
semantic class.
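The second problem can be detected automatically before merging. The sketch below is a hypothetical Python check, assuming each resource has been reduced to a mapping from lemma to the list of semantic classes of its senses. The lemmas w1, w2, and w3 are placeholders, not real dictionary data.

```python
def class_mismatches(kbbi_classes, thesaurus_classes):
    """List monosemous lemmas whose semantic class differs between resources.

    Both arguments map a lemma to the list of classes of its senses.
    """
    mismatches = []
    for lemma in sorted(kbbi_classes.keys() & thesaurus_classes.keys()):
        k, t = kbbi_classes[lemma], thesaurus_classes[lemma]
        # Only single-sense lemmas can be compared without disambiguation.
        if len(k) == 1 and len(t) == 1 and k[0] != t[0]:
            mismatches.append((lemma, k[0], t[0]))
    return mismatches

# Placeholder data for illustration only.
kbbi = {"w1": ["n"], "w2": ["a"], "w3": ["n", "v"]}
thesaurus = {"w1": ["v"], "w2": ["a"], "w3": ["n"]}
print(class_mismatches(kbbi, thesaurus))
```

Lemmas flagged this way would still need manual review, since the check cannot tell which resource carries the correct class.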
III. MONOLINGUAL RESOURCES BASED APPROACH
A. Overview of Approach
This approach uses only monolingual resources, KBBI and
the Indonesian thesaurus, to build the synsets. The motivation
is that monolingual resources produced by experts in Bahasa
provide better senses and meanings than resources in another
language (such as PWN itself, supported by a bilingual
dictionary). Because of the problems in these two resources
(described in sections 2.A and 2.B), we make some assumptions,
explained in the following parts. In outline, the process
extracts candidate synsets from thesaurus entries, adds more
candidate synsets from KBBI, eliminates redundant (identical)
candidate synsets, and finally uses a clustering technique to
merge candidate synsets that are assumed to have similar
meanings.
B. Synonymy Concept in Thesaurus
As stated in section 2.A, a thesaurus groups synonymous
words. We treat synonymy as a bi-directional relation between
two words. However, not all entries in the thesaurus have this
kind of relation. For example:
aba-aba n arahan, instruksi, isyarat, kode, komando,
perintah, petunjuk, seruan, suruhan, tanda, titah
arahan n 1 bimbingan, firman, panduan, pedoman,
pengarahan, pengawalan, perintah, petunjuk,
pimpinan, sabda, suruhan, tuntunan; 2 aba-aba,
amanat, mandat, titah
seruan n 1 jeritan, laung, pekik, teriakan; 2 ajakan, anjuran,
imbauan, lambaian (ki), panggilan, permintaan,
undangan
In the examples above, the lemma ‘aba-aba’ has several paired
words, including ‘arahan’ and ‘seruan’. The entry for the lemma
‘arahan’ lists ‘aba-aba’ among its pairs in the 2nd sense, which
shows a bi-directional relation between these 2 words. The
entry for the lemma ‘seruan’, however, does not list ‘aba-aba’
among its pairs. The 2nd sense of ‘arahan’ is therefore taken as
a synonym of ‘aba-aba’, while ‘seruan’ is not, because it has no
bi-directional relation with ‘aba-aba’. This is the main concept
of synonymy used in processing thesaurus entries.
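The bi-directional test can be sketched as follows. This is an illustrative Python snippet, assuming the thesaurus has been loaded as a mapping from lemma to sense number to word list. The function name `mutual_synonyms` and the data layout are our assumptions. The sample data are the three entries shown above.

```python
def mutual_synonyms(thesaurus, lemma):
    """Return (lemma_sense, word, word_sense) for bi-directional pairs only."""
    pairs = []
    for sense, words in thesaurus.get(lemma, {}).items():
        for word in words:
            # Keep the pair only if the word's own entry lists the lemma back.
            for back_sense, back_words in thesaurus.get(word, {}).items():
                if lemma in back_words:
                    pairs.append((sense, word, back_sense))
    return pairs

thesaurus = {
    "aba-aba": {1: ["arahan", "instruksi", "isyarat", "kode", "komando",
                    "perintah", "petunjuk", "seruan", "suruhan", "tanda", "titah"]},
    "arahan": {1: ["bimbingan", "firman", "panduan", "pedoman", "pengarahan",
                   "pengawalan", "perintah", "petunjuk", "pimpinan", "sabda",
                   "suruhan", "tuntunan"],
               2: ["aba-aba", "amanat", "mandat", "titah"]},
    "seruan": {1: ["jeritan", "laung", "pekik", "teriakan"],
               2: ["ajakan", "anjuran", "imbauan", "lambaian", "panggilan",
                   "permintaan", "undangan"]},
}
print(mutual_synonyms(thesaurus, "aba-aba"))  # only 'arahan' (2nd sense) qualifies
```

With only these three entries loaded, ‘seruan’ is rejected because its entry never lists ‘aba-aba’, and the remaining words are rejected simply because their entries are absent from the sample.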
C. Adding Lemmas from KBBI as Additional Data
After retrieving candidate synsets from the thesaurus, we
process the data from KBBI to retrieve the implicit synonyms
(as described in section 2.B). We also retrieve synsets
consisting of a single lemma (without any synonyms) that does
not exist in the thesaurus. This process is needed to
complement the candidate synsets retrieved from the thesaurus.
However, because some data in KBBI and the thesaurus cannot
be related to each other, lemmas from KBBI that already exist
in the thesaurus are not processed again. This choice reduces
the ambiguity between KBBI and the thesaurus.
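A minimal sketch of this step in Python is given below. It assumes the implicit synonyms of each KBBI sense have already been extracted into a mapping from lemma to a list of senses, each sense being a possibly empty list of synonyms. The function name and the toy data are hypothetical.

```python
def candidates_from_kbbi(kbbi_synonyms, thesaurus_lemmas):
    """Build candidate synsets from KBBI for lemmas absent from the thesaurus."""
    candidates = []
    for lemma, senses in kbbi_synonyms.items():
        if lemma in thesaurus_lemmas:
            continue  # already covered by the thesaurus; skipped to avoid ambiguity
        for synonyms in senses:
            # A sense without implicit synonyms yields a single-lemma synset.
            candidates.append(frozenset([lemma, *synonyms]))
    return candidates

# Toy data for illustration only; not taken from the real resources.
kbbi_synonyms = {"aba": [["ayah", "bapak"]], "apel": [[]], "haram": [["terlarang"]]}
thesaurus_lemmas = {"haram"}
print(candidates_from_kbbi(kbbi_synonyms, thesaurus_lemmas))
```

Using `frozenset` for each candidate makes the later removal of identical candidates a simple set operation.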
D. Eliminating Redundant Candidate Synsets
Among the candidate synsets retrieved, we found some
redundant synsets that are identical to each other. This is
mainly due to the concept of synonymy used and to the
characteristics of the thesaurus entries. The following
examples are candidate synsets retrieved from the thesaurus:
E. Merging Candidate Synsets Based on Clustering Technique
After the elimination process, some problems remain with
the rest of the candidate synsets. By inspecting the results
manually, we found that some candidates share the same
meaning as other candidates but were not eliminated in the
previous step because they are not identical. Here are some
example candidates:
1. aba-aba, atribut, markah, petunjuk, tanda
2. aba-aba, alamat, fenomena, firasat, indikasi, isyarat,
kode, petunjuk, sasmita, semboyan, sinyal, tanda,
tengara
3. aba-aba, amanat, komando, perintah, pesan, suruhan,
tugas
4. aba-aba, amanat, arahan, instruksi, komando, mandat,
order, perintah, suruhan, titah, tugas
5. aba-aba, duaja, gestur, isyarat, kode, sandi, semboyan,
sinyal
6. aba-aba, amanat, instruksi, komando, nasihat, perintah,
tugas
From the six examples of the candidate synsets above, we
assume that number 3, 4, and 6 can be merged into one synset
because it consists of quite many similar words. The similarity
of the candidates is measured by the words belonging to each
candidate as its member, because these candidates do not have
gloss or another information that can be used to distinguish the
sense perfectly. So, we try to merge the similar candidates using
the method, that adapting the clustering technique, using the
words belong to each candidates as attribute.
The clustering technique that is used in this research is
hierarchical clustering. The primary reasons of choosing
hierarchical clustering are because the result of the clusters can
not be predicted before the clustering process is done (therefore,
the partitional clustering is inappropriate for this need) and the
simplicity concept of hierarchical clustering itself. Furthermore,
the method of hierarchical clustering used is the agglomerative
method. However, since our goal is not to cluster the candidates
into a single cluster, a modification needs to be done, in which
the clustering process will be stopped after it reached a
condition decided by threshold value. The main algorithm of
this process is like the following:
1.
{aborsi, pengguguran}
{pengguguran, aborsi}
The first example retrieved from lemma ‘aborsi’ (equivalent to
‘abortion’ in English’), while the second retrieved from lemma
‘pengguguran’. However, both of them consist of exactly the
same words. This kind of case is the case we consider that some
candidate synsets are redundant to each other. In order to
reduce this kind of redundancy, an elimination process was
done to delete one of the redundant candidate synsets. Based
from our experiments, this elimination process has successfully
eliminated 14,988 redundant synsets.
2.
Calculate the distance matrix of each candidate to
another candidate. Distance value is calculated by
counting the percentage of similar words between 2
candidates. Example:
- Synset #1 consists of 4 words {a, b, c, d}
- Synset #2 consists of 4 words {a, c, e, f }
- Distance value between synset #1 and #2 :
number of similar words between 2 synsets /
total number of unique words. From the
example above, the similar words count is 2 (a
and c), divided by 6 (unique words are:
a,b,c,d,e,f) and produce 2/6 or 0.333 as the
distance value.
Get the first maximum similarity distance value from
the matrix. The threshold value used is a value that is
calculated from the first maximum similarity distance
3.
4.
5.
multiplied with α (α is a coefficient, which can be
changed manually. The α value used in this research is
0.5). However, any further research will be needed to
decide the best threshold value that able to give the best
result.
Do the agglomerative process, by merging the two
candidates which have the maximum similarity
distance.
Recalculate the new distance matrix
If the current maximum similarity distance is not lower
than the threshold value, back to step 3, otherwise, stop
the clustering process.
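The elimination step of section D and the five merging steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this research: the function names are hypothetical, a merged candidate is assumed to be the union of the words of its two parts, and the guard that stops merging when no words are shared is an addition. The ratio of shared words to unique words is the quantity commonly called the Jaccard coefficient.

```python
from itertools import combinations


def deduplicate(candidates):
    """Drop candidates whose word sets are identical (section D)."""
    seen, unique = set(), []
    for cand in candidates:
        key = frozenset(cand)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def similarity(a, b):
    """Shared words divided by unique words, as in the worked example."""
    return len(a & b) / len(a | b)


def best_pair(clusters):
    """Return (similarity, i, j) for the most similar pair of clusters."""
    return max(
        (similarity(clusters[i], clusters[j]), i, j)
        for i, j in combinations(range(len(clusters)), 2)
    )


def merge_candidates(candidates, alpha=0.5):
    clusters = deduplicate(candidates)
    if len(clusters) < 2:
        return clusters
    # Step 2: the threshold is alpha times the first maximum similarity.
    first_max, _, _ = best_pair(clusters)
    threshold = alpha * first_max
    while len(clusters) > 1:
        # Steps 1 and 4: recompute all pairwise values on every pass.
        score, i, j = best_pair(clusters)
        # Step 5: stop once the maximum drops below the threshold.
        # Assumption: also stop when the best pair shares no words.
        if score < threshold or score == 0:
            break
        # Step 3: merge the pair (assumed to be the union of words).
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy candidates, invented for illustration only.
toy = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"d", "c", "b", "a"},
    {"a", "b", "f", "g", "h"},
    {"x", "y"},
]
merged_toy = merge_candidates(toy)
for cluster in merged_toy:
    print(sorted(cluster))
```

Recomputing every pairwise value on each pass keeps the sketch short but is quadratic per pass; an implementation working on tens of thousands of candidates would presumably need to update only the rows affected by a merge.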
The result expected from this process (using the previous
examples) is as follows:
1. aba-aba, atribut, markah, petunjuk, tanda
2. aba-aba, alamat, duaja, fenomena, firasat, gestur,
indikasi, isyarat, kode, petunjuk, sandi, sasmita,
semboyan, sinyal, tanda, tengara
3. aba-aba, amanat, arahan, instruksi, komando, mandat,
nasihat, order, perintah, pesan, suruhan, titah, tugas
Candidate number 2 above is the merged candidate from
previous examples number 2 and 5, while candidate number
3 is the merged candidate from previous examples number 3, 4,
and 6. In our experiment, 7,740 synsets were merged.
IV. RESULT
The result of this research is shown in the table below:
TABLE I. NUMBER OF SYNSETS RETRIEVED
Category     Synsets with     Synsets with        Total     %
             single member    multiple members
Noun         25,587           11,898              37,485    61.78%
Verb          9,090            7,831              16,921    27.88%
Adjective     3,347            2,458               5,805     9.57%
Adverb          235              227                 462     0.77%
Total        38,259           22,414              60,673   100.00%
Some examples of the results are shown below (in the
WordNet lexicographer file format):
{ apel3, upacara2, (upacara;) }
{ aba, ayah4, babe2, bapak7, papi3, (gloss) }
{ empuan2, istri6, pedusi, perempuan4, puan3, (gloss) }
Unfortunately, because this research did not involve experts
in Bahasa, we could hardly find any method to perform accuracy
tests. We found an automatic evaluation method for
synsets[5], but it requires glosses and hyponymy/hypernymy
relations, which do not exist in the synsets retrieved. Therefore,
this result is considered a prototype version of
Indonesian WordNet synsets, to be improved by
further research.
V. CONCLUSION
This paper has explored an approach to building synsets for
a lexical database similar to PWN, specifically for Bahasa
Indonesia. We use only monolingual resources in Bahasa, based
on our assumption that resources produced by experts in
Bahasa provide the best lemmas, senses, and meanings
for Bahasa itself. We hope that the building of lexical
databases for other languages in a condition similar to that
of Bahasa Indonesia (in which the ideal lexical resources
needed are quite limited) will benefit from this research.
VI. FURTHER RESEARCH
Based on our experience conducting this research, we find
that some enhancements can be made to increase the
accuracy and quality of the synsets retrieved. The
enhancements that we suggest for further research are:
1. Provide better lexical resources and ensure the
consistency and congruency between those resources.
This is based on our assumption that a lexical
database built semi-automatically is highly dependent
on the lexical resources used. However, this task will
require cooperation with linguistics experts in Bahasa
Indonesia, considering that we do not have enough
capability in linguistics.
2. A more powerful clustering technique can be used in
the clustering step. Considering that the data used have
categorical attributes, we strongly suggest using a
clustering technique designed for categorical
attributes, such as ROCK[6].
ACKNOWLEDGMENT
We would like to thank the students from the NLP class, who
supported us in providing and pre-processing the raw materials of
the lexical resources used in this research. We also hope that the
synsets retrieved from this research will be helpful for other
research that tries to acquire other semantic relations, such as
hyponymy/hypernymy and holonymy/meronymy, and to acquire
glosses, in order to build a better Indonesian WordNet.
REFERENCES
[1] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller,
“Introduction to WordNet: An Online Lexical Database”, The
International Journal of Lexicography, vol. 3(4), pp. 235-244, 1990.
[2] L. Changki, L. Geunbae, and S. Jungyun, “Automatic WordNet Mapping
using Word Sense Disambiguation”, In Proceedings of the joint SIGDAT
Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora (EMNLP/VLC 2000), 2000.
[3] E. Barbu and V. Barbu Mititelu, “Automatic Building of WordNets”, 2006.
[4] X. Farreres, G. Rigau, and H. Rodriguez, “Using WordNet for Building
WordNets”, In Proceedings of COLING-ACL Workshop on Usage of
WordNet in Natural Language Processing Systems.
[5] R. Nadig, J. Ramanand, and P. Bhattacharyya, “Automatic Evaluation of
WordNet Synonyms and Hypernyms”, In Proceedings of ICON-2008: 6th
International Conference of Natural Language Processing.
[6] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering
Algorithm for Categorical Attributes”, In Proceedings of the 15th ICDE,
pp. 512-521, 1999.
Building Synsets for Indonesian WordNet with
Monolingual Lexical Resources
Gunawan*,**) and Andy Saputra**)
*) Dept. of Electrical Engineering
Faculty of Industrial Technology
Institut Teknologi Sepuluh Nopember
Surabaya, East Java, Indonesia
**) Dept. of Computer Science
Sekolah Tinggi Teknik Surabaya
Surabaya, East Java, Indonesia
[email protected], [email protected]
Abstract — This paper presents an approach to building synsets for
Indonesian WordNet semi-automatically using monolingual
lexical resources that are freely available in Bahasa Indonesia.
The monolingual lexical resources are Kamus Besar Bahasa
Indonesia or KBBI (a monolingual dictionary of Bahasa Indonesia)
and Tesaurus Bahasa Indonesia (an Indonesian thesaurus). We
assume that monolingual resources play an important role in
synset building because they provide more accurate senses
specifically for Bahasa. In addition, the resources used are
produced by the Bahasa Indonesia Language Center, a
government institution that manages the development of Bahasa
Indonesia. Nevertheless, the synsets retrieved are considered
a prototype version of Indonesian WordNet synsets.
Keywords - synset; automatic construction; monolingual lexical
resource; thesaurus; Bahasa Indonesia
I. INTRODUCTION
Princeton WordNet (PWN)[1] is one of the most popular
and most widely used lexical databases for the English
language, mainly for computational linguistics and natural
language processing (NLP) research and applications. It was
developed at Princeton University and was built manually
by a group of lexicographers, whose results were then
compiled into lexical database files. Building PWN manually
ensures high accuracy and quality, but costs a lot of effort
and resources, such as language experts and time. This is
why many researchers try to build WordNets from other
existing lexical resources automatically or
semi-automatically. Some experiments, for example the
construction of Korean WordNet[2] and Romanian WordNet[3],
built a new WordNet automatically using an existing
WordNet[4] and other available lexical resources.
This paper presents an approach used for building
Indonesian WordNet, limited to synsets only, using available
monolingual lexical resources in Indonesia. Further research to
complete all the semantic relations will be carried out in the
future. Synset construction was done first because the synset is
the basic concept that supports many other semantic relations in
a lexical database with PWN as the main reference.
The main approach for synset construction uses
monolingual resources, namely KBBI and the thesaurus. The
thesaurus is the main resource, because a thesaurus is a lexical
resource that provides synonymy relationships between words in
a language. Additional data are retrieved from
KBBI, the most widely used monolingual dictionary for
Bahasa. Both resources are freely available for
download and are provided by the Bahasa Indonesia Language
Center, which is part of the Education Ministry of Indonesia.
This paper is organized as follows. Section 2 describes
the available resources used in this research. Section 3
explains the process of the approach using monolingual
resources in Bahasa. Section 4 discusses the synsets retrieved.
Finally, sections 5 and 6 draw some conclusions and outline
future research.
II. LEXICAL RESOURCES
A lexical database that is built semi-automatically is
highly dependent on the lexical resources used in the
building process. Lexical resources that can be used include
dictionaries, thesauri, and corpora. Unfortunately,
only a monolingual dictionary (KBBI) and an Indonesian thesaurus
are available, while an ideal corpus (e.g. an encyclopedia) for
this research has not been found. KBBI and the Indonesian
thesaurus, as previously stated, are freely downloadable from the
official site of the Education Ministry of Indonesia. This section
explains the lexical resources used in this research.
A. Tesaurus Bahasa Indonesia
Tesaurus Bahasa Indonesia is a thesaurus in Bahasa that
provides several relations, such as antonyms, synonyms, and
hyponyms (the official information inside the book states
that an entry in this thesaurus can consist of synonyms or
hyponyms). A thesaurus is a resource that contains groups
of words sharing the same meaning. However, in the
Indonesian thesaurus, the difference between hyponyms and
synonyms is not as explicit as that between synonyms and
antonyms. Therefore, further consideration is needed
before processing the thesaurus data.
This is an example of an entry in Tesaurus Bahasa
Indonesia:
haram a 1 gelap (ki), ilegal, liar, pantang, sumbang,
tabu, terlarang; 2 mulia, suci;
ant 1 halal
The lemma ‘haram’ in Bahasa has a meaning similar to ‘illegal’
in English; among its pairs, ‘ilegal’ is similar to ‘illegal’,
‘liar’ is similar to ‘wild’, and ‘tabu’ is similar to ‘taboo’.
The word ‘haram’ is the main lemma of this entry and has class
‘a’ (adjective). The number following the semantic class is the
sense number. So, in the example above, ‘haram’ has 2 different
senses, each with its own set of synonyms/hyponyms. The antonym
of ‘haram’ (‘halal’, similar to ‘legal’ in English) is found
after the keyword ‘ant’, which marks the antonym part. The
number following ‘ant’ shows which sense the antonym belongs to.
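To make the entry layout concrete, the following is a minimal sketch of how such an entry could be split into lemma, class, senses, and antonyms. It is not the parser used in this research: the function name `parse_thesaurus_entry` and the returned dictionary keys are hypothetical, and the sketch assumes a single-word lemma, a one-token class label, and that parenthesised labels such as ‘(ki)’ can simply be dropped.

```python
import re


def parse_thesaurus_entry(text):
    """Split one thesaurus entry into lemma, class, senses, and antonyms.

    Assumptions (not from the paper): the lemma is a single token, the
    class is a single token, and the antonym part starts at ' ant '.
    """
    text = " ".join(text.split())  # join wrapped lines into one line
    body, _, ant_part = text.partition(" ant ")
    lemma, word_class, rest = body.split(" ", 2)

    def split_senses(chunk):
        senses = {}
        for part in chunk.split(";"):
            part = part.strip()
            if not part:
                continue
            match = re.match(r"(\d+)\s+(.*)", part)
            # Entries with a single sense carry no sense number: use 1.
            number, words = (int(match.group(1)), match.group(2)) if match else (1, part)
            # Drop parenthesised labels such as "(ki)" and split on commas.
            cleaned = [re.sub(r"\s*\([^)]*\)", "", w).strip() for w in words.split(",")]
            senses.setdefault(number, []).extend(w for w in cleaned if w)
        return senses

    return {
        "lemma": lemma,
        "class": word_class,
        "senses": split_senses(rest),
        "antonyms": split_senses(ant_part),
    }


entry = parse_thesaurus_entry(
    "haram a 1 gelap (ki), ilegal, liar, pantang, sumbang,\n"
    "tabu, terlarang; 2 mulia, suci;\n"
    "ant 1 halal"
)
```

For the ‘haram’ entry, `entry["senses"]` holds one word list per sense number, and `entry["antonyms"]` maps sense 1 to ‘halal’.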
However, some problems were found in processing the
Indonesian thesaurus, such as:
1. Synonymy, in PWN, is a bi-directional relation, which
means that if a word w1 is a synonym of a word w2, then
w2 should list w1 as its synonym. However, this kind of
relation does not always exist in the thesaurus entries.
2. The Indonesian thesaurus, unlike Roget’s Thesaurus for
English, does not provide categorization for its entries
and senses. Therefore, word sense disambiguation of
entries with multiple senses cannot rely on categories
such as those in Roget’s.
B. Kamus Besar Bahasa Indonesia (KBBI)
KBBI is the official monolingual dictionary of Bahasa
Indonesia and the most widely used dictionary in Indonesia.
The latest edition of KBBI is the 4th edition, which is
downloadable in PDF format, but this research uses the 3rd
edition because it is available as a text file, which makes it
easier to process (we obtained the 3rd edition as a tag-formatted
text file from a mobile application titled KBBI Mobile).
This dictionary, unlike the thesaurus, provides a definition
for each sense of a lemma and also example sentences showing
the usage of that lemma. The main reason for adding data from
KBBI in this approach is to ensure that all words in Bahasa are
covered by the synsets produced. However, we also discovered
that KBBI contains implicit synonyms. In some entries, the
definition consists of only a few short items of 1-3 words each,
separated by semicolons. This kind of definition will be
processed as synonyms.
The following examples illustrate some entries in KBBI:
apel n nama pohon yg buahnya berdaging keras dan
mengandung air serta berkulit lunak yg warnanya
merah (kemerahan) atau kekuning (kekuningan), jika
matang rasanya manis keasam-asaman, Pyrus malus
‘apel’ is the lemma of the entry, followed by ‘n’, which shows
the semantic category (n stands for noun), then the definition,
and finally the binomial name (given for animals and plants
only). ‘apel’ is similar to ‘apple’ in English, with the
definition ‘the name of a tree whose fruit has reddish/yellowish
skin, contains water, and is soft-skinned’. However, in a
different sense, ‘apel’ can have the same meaning as ‘upacara’,
which is similar to ‘ceremony’ in English.
abai a 1 tidak dihiraukan (tidak dilakukan dng
sungguh-sungguh; tidak diindahkan dsb); 2 lalai: anak-anak
tidak boleh – thd nasihat orang tua dan guru;
As another example, the lemma ‘abai’ (equivalent to ‘don’t care’
in English) shown above has the adjective class (shown by ‘a’
after the lemma) and has two senses, each introduced by a sense
number placed before its definition. The italic sentence (seen
in the 2nd sense) is the example sentence for that sense.
aba n ayah; bapak; (kadang-kadang juga berarti)
kakek
The lemma ‘aba’ (equivalent to ‘father’ in English), as shown
above, has a noun class (shown by ‘n’). However, its definition
consists of only a few words, separated by semicolons. In
Bahasa, ‘aba’, ‘ayah’, and ‘bapak’ share the same meaning (they
have the same translation in English, which is ‘father’).
Therefore, a definition like the one above will be treated as
implicit synonyms, considering ‘ayah’ and ‘bapak’ as synonyms of
‘aba’. However, synonyms are limited to a word or phrase of up
to 3 words.
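The implicit-synonym rule can be sketched as a small filter over a definition string. This is an illustration under stated assumptions, not the extraction code used in this research: the function name `implicit_synonyms` is hypothetical, each semicolon-separated item is judged on its own, and items containing parentheses are skipped (an assumption made here so that remarks such as ‘(kadang-kadang juga berarti) kakek’ are not taken as synonyms).

```python
def implicit_synonyms(definition, max_words=3):
    """Return the short, semicolon-separated items of a KBBI definition.

    An item is kept when it has at most `max_words` words. Items that
    contain parentheses are skipped; that rule is an assumption of this
    sketch, not something stated in the paper.
    """
    candidates = []
    for part in definition.split(";"):
        part = " ".join(part.split())  # normalise wrapped lines and spaces
        if not part or "(" in part or ")" in part:
            continue
        if len(part.split()) <= max_words:
            candidates.append(part)
    return candidates


aba_synonyms = implicit_synonyms("ayah; bapak; (kadang-kadang juga berarti)\nkakek")
apel_synonyms = implicit_synonyms(
    "nama pohon yg buahnya berdaging keras dan mengandung air serta "
    "berkulit lunak yg warnanya merah (kemerahan) atau kekuning "
    "(kekuningan), jika matang rasanya manis keasam-asaman, Pyrus malus"
)
```

With these inputs, the ‘aba’ definition yields ‘ayah’ and ‘bapak’, while the long ‘apel’ definition yields no implicit synonyms.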
Some problems were found in processing KBBI, especially
when merging it with data from the thesaurus, such as:
1. Some data from KBBI and the thesaurus do not correspond
to each other. For example, a word w1 has 4 senses in
KBBI. The thesaurus also has w1 as a lemma (with only 1
sense), but it gives no information about which of the 4
KBBI senses of w1 corresponds to that thesaurus entry.
2. Some entries exist in both KBBI and the thesaurus with a
single sense (monosemous) but have different semantic
classes, although on manual inspection they refer to the
same lemma and should have the same semantic class.
III. MONOLINGUAL RESOURCES BASED APPROACH
A. Overview of Approach
This approach uses only monolingual resources, KBBI and the
Indonesian thesaurus, to build the synsets. The motivation is
that monolingual resources produced by experts in Bahasa should
provide better senses and meanings than resources from other
languages (such as PWN itself, with the support of a bilingual
dictionary). Because of the problems in these two resources
(described in Sections 2.A and 2.B), we make some assumptions,
explained in the following parts. In outline, the process
extracts candidate synsets from the thesaurus entries, adds more
candidate synsets from KBBI, eliminates redundant (identical)
candidate synsets, and finally uses a clustering technique to
merge candidate synsets that are assumed to have similar
meanings.
B. Synonymy Concept in Thesaurus
As stated in Section 2.A, a thesaurus is known for grouping
synonymous words. We treat synonymy as a bi-directional relation
between two words. However, not all entries in the thesaurus
have this kind of relation. For example:
aba-aba n arahan, instruksi, isyarat, kode, komando,
perintah, petunjuk, seruan, suruhan, tanda, titah
arahan n 1 bimbingan, firman, panduan, pedoman,
pengarahan, pengawalan, perintah, petunjuk,
pimpinan, sabda, suruhan, tuntunan; 2 aba-aba,
amanat, mandat, titah
seruan n 1 jeritan, laung, pekik, teriakan; 2 ajakan, anjuran,
imbauan, lambaian (ki), panggilan, permintaan,
undangan
In the examples above, the lemma ‘aba-aba’ lists several words
as its pairs, including ‘arahan’ and ‘seruan’. The entry for
‘arahan’ lists ‘aba-aba’ among the pairs of its 2nd sense (this
shows the bi-directional relation between these 2 words).
However, the entry for ‘seruan’ does not list ‘aba-aba’. So, the
2nd sense of ‘arahan’ is accepted as a synonym of ‘aba-aba’,
while ‘seruan’ is not, because it has no bi-directional relation
with ‘aba-aba’. This is the main concept of synonymy used in
processing the thesaurus entries.
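The bi-directional check can be illustrated with a short sketch. The dictionary layout (lemma mapped to a list of senses, each sense a list of words) and the function name `mutual_synonyms` are assumptions made for this illustration. Only the three entries quoted above are included, so the other pairs of ‘aba-aba’ cannot be confirmed or rejected here.

```python
def mutual_synonyms(lemma, thesaurus):
    """Return the words listed under `lemma` whose own entry lists `lemma` back."""
    confirmed = []
    for sense in thesaurus.get(lemma, []):
        for word in sense:
            back_references = thesaurus.get(word, [])
            if any(lemma in other_sense for other_sense in back_references):
                confirmed.append(word)
    return confirmed


# Toy data: only the three entries quoted in the text.
thesaurus = {
    "aba-aba": [["arahan", "instruksi", "isyarat", "kode", "komando", "perintah",
                 "petunjuk", "seruan", "suruhan", "tanda", "titah"]],
    "arahan": [["bimbingan", "firman", "panduan", "pedoman", "pengarahan",
                "pengawalan", "perintah", "petunjuk", "pimpinan", "sabda",
                "suruhan", "tuntunan"],
               ["aba-aba", "amanat", "mandat", "titah"]],
    "seruan": [["jeritan", "laung", "pekik", "teriakan"],
               ["ajakan", "anjuran", "imbauan", "lambaian", "panggilan",
                "permintaan", "undangan"]],
}

confirmed = mutual_synonyms("aba-aba", thesaurus)
```

On this toy data, ‘arahan’ is confirmed as a synonym of ‘aba-aba’ and ‘seruan’ is not, which matches the decision described above.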
C. Adding Lemmas from KBBI as Additional Data
After retrieving candidate synsets from the thesaurus, we
process data from KBBI to retrieve the implicit synonyms (as
described in Section 2.B). In addition, we retrieve synsets that
consist of only one lemma (without any synonyms) that does not
exist in the thesaurus. This process is needed to complement the
candidate synsets retrieved from the thesaurus. However, because
some data from KBBI and the thesaurus do not correspond to each
other, lemmas from KBBI that already exist in the thesaurus are
not processed again. This decision reduces the ambiguity between
KBBI and the thesaurus.
D. Eliminating Redundant Candidate Synsets
From the candidate synsets retrieved, we found some
redundant synsets that are identical to each other. This is
mainly caused by the concept of synonymy used and by the
characteristics of the thesaurus entries. The following examples
are candidate synsets retrieved from the thesaurus:
{aborsi, pengguguran}
{pengguguran, aborsi}
The first example was retrieved from the lemma ‘aborsi’
(equivalent to ‘abortion’ in English), while the second was
retrieved from the lemma ‘pengguguran’. However, both consist of
exactly the same words. We consider such candidate synsets
redundant. To reduce this redundancy, an elimination process
deletes one of each pair of redundant candidate synsets. In our
experiments, this process eliminated 14,988 redundant synsets.
E. Merging Candidate Synsets Based on Clustering Technique
After the elimination process, some problems remain in the
rest of the candidate synsets. By manually inspecting the
result, we found that some candidates share the same meaning as
other candidates but were not eliminated in the previous process
because they are not identical. Here are some example
candidates:
1. aba-aba, atribut, markah, petunjuk, tanda
2. aba-aba, alamat, fenomena, firasat, indikasi, isyarat,
kode, petunjuk, sasmita, semboyan, sinyal, tanda,
tengara
3. aba-aba, amanat, komando, perintah, pesan, suruhan,
tugas
4. aba-aba, amanat, arahan, instruksi, komando, mandat,
order, perintah, suruhan, titah, tugas
5. aba-aba, duaja, gestur, isyarat, kode, sandi, semboyan,
sinyal
6. aba-aba, amanat, instruksi, komando, nasihat, perintah,
tugas
From the six candidate synsets above, we assume that numbers
3, 4, and 6 can be merged into one synset because they share
many words. The similarity of candidates is measured by their
member words, because the candidates have no gloss or other
information that could be used to distinguish their senses. So,
we merge similar candidates using a method adapted from
clustering, with the member words of each candidate as
attributes.
The clustering technique used in this research is
hierarchical clustering. The primary reasons for choosing it are
that the number of resulting clusters cannot be predicted before
the clustering process is done (therefore, partitional
clustering is inappropriate for this need) and that the concept
of hierarchical clustering is simple. We use the agglomerative
method. However, since our goal is not to cluster all candidates
into a single cluster, we modify it so that the clustering
process stops when it reaches a condition determined by a
threshold value. The main algorithm of this process is as
follows:
1. Calculate the distance matrix between every pair of
candidates. Although we call it a distance, the value is a
similarity score (higher means more similar): the
proportion of shared words between 2 candidates. Example:
- Synset #1 consists of 4 words {a, b, c, d}
- Synset #2 consists of 4 words {a, c, e, f}
- Distance value between synset #1 and #2: number of
shared words between the 2 synsets / total number of
unique words. In this example, the shared word count is
2 (a and c), divided by 6 (unique words are a, b, c, d,
e, f), which gives 2/6 or 0.333 as the distance value.
2. Get the first maximum similarity distance value from
the matrix. The threshold value is the first maximum
similarity distance multiplied by α (α is a coefficient
that can be changed manually; the value used in this
research is 0.5). Further research is needed to find the
threshold value that gives the best result.
3. Do the agglomerative process by merging the two
candidates that have the maximum similarity distance.
4. Recalculate the distance matrix.
5. If the current maximum similarity distance is not lower
than the threshold value, go back to step 3; otherwise,
stop the clustering process.
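The five steps above can be sketched as follows. This is a minimal illustration, not the implementation used in this research. The names `jaccard` and `merge_candidates` are hypothetical, the similarity score is written as the Jaccard index (shared words divided by unique words, as in the worked example), and a merged candidate is assumed to be the union of its members' words, which the description of step 4 does not state explicitly.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared words divided by unique words, e.g. 2/6 for {a,b,c,d} and {a,c,e,f}."""
    return len(a & b) / len(a | b)


def merge_candidates(candidates, alpha=0.5):
    """Agglomerative merging that stops at alpha * (first maximum similarity)."""
    clusters = [set(c) for c in candidates]
    threshold = None
    while len(clusters) > 1:
        # Steps 1 and 4: (re)compute all pairwise scores and take the best pair.
        score, i, j = max(
            (jaccard(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if threshold is None:
            threshold = alpha * score  # step 2: fixed after the first pass
        if score == 0 or score < threshold:
            break  # step 5: best pair is below the threshold
        # Step 3: merge the best pair (assumption: union of member words).
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


candidates = [
    {"aba-aba", "atribut", "markah", "petunjuk", "tanda"},
    {"aba-aba", "alamat", "fenomena", "firasat", "indikasi", "isyarat", "kode",
     "petunjuk", "sasmita", "semboyan", "sinyal", "tanda", "tengara"},
    {"aba-aba", "amanat", "komando", "perintah", "pesan", "suruhan", "tugas"},
    {"aba-aba", "amanat", "arahan", "instruksi", "komando", "mandat", "order",
     "perintah", "suruhan", "titah", "tugas"},
    {"aba-aba", "duaja", "gestur", "isyarat", "kode", "sandi", "semboyan", "sinyal"},
    {"aba-aba", "amanat", "instruksi", "komando", "nasihat", "perintah", "tugas"},
]
merged = merge_candidates(candidates, alpha=0.5)
```

Run on these six candidates alone, the sketch leaves candidate 1 unchanged, merges 2 with 5, and merges 3, 4, and 6. On the full data set the first maximum similarity, and therefore the threshold, may differ.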
The expected result of this process (using the previous
examples) is:
1. aba-aba, atribut, markah, petunjuk, tanda
2. aba-aba, alamat, duaja, fenomena, firasat, gestur,
indikasi, isyarat, kode, petunjuk, sandi, sasmita,
semboyan, sinyal, tanda, tengara
3. aba-aba, amanat, arahan, instruksi, komando, mandat,
nasihat, order, perintah, pesan, suruhan, titah, tugas
Candidate number 2 above is merged from previous examples
number 2 and 5, while candidate number 3 is merged from previous
examples number 3, 4, and 6. In our experiment, 7,740 synsets
were merged.
IV. RESULT
The result of this research is shown in the table below:
TABLE I. NUMBER OF SYNSETS RETRIEVED
Category     Synsets with     Synsets with        Total     %
             single member    multiple members
Noun         25,587           11,898              37,485    61.78%
Verb          9,090            7,831              16,921    27.88%
Adjective     3,347            2,458               5,805     9.57%
Adverb          235              227                 462     0.77%
Total        38,259           22,414              60,673   100.00%
Some examples of the results are shown below (in the
WordNet lexicographer file format):
{ apel3, upacara2, (upacara;) }
{ aba, ayah4, babe2, bapak7, papi3, (gloss) }
{ empuan2, istri6, pedusi, perempuan4, puan3, (gloss) }
Unfortunately, because this research did not involve experts
in Bahasa, we could hardly find any method to test accuracy. We
found an automatic evaluation method for synsets [5], but it
requires glosses and hyponym/hypernym relations, which do not
exist in the synsets retrieved. Therefore, this result is
considered a prototype version of Indonesian WordNet synsets, to
be improved by further research.
V. CONCLUSION
This paper has explored an approach to building synsets for
a lexical database similar to PWN, specifically for Bahasa
Indonesia. We use only monolingual resources in Bahasa, based
on our assumption that resources produced by experts in Bahasa
provide the best lemmas, senses, and meanings for Bahasa itself.
We hope that efforts to build lexical databases for other
languages in a similar condition to Bahasa Indonesia (where the
ideal lexical resources are quite limited) will benefit from
this research.
VI. FURTHER RESEARCH
Based on our experience conducting this research, we find
that some enhancements can increase the accuracy and quality of
the synsets retrieved. The enhancements we suggest for further
research are:
1. Provide better lexical resources and ensure consistency
and congruency between them. This is based on our
assumption that a lexical database built
semi-automatically is highly dependent on the lexical
resources used. However, this task requires cooperation
with linguistics experts in Bahasa Indonesia, since we do
not have enough expertise in linguistics.
2. Use a more powerful clustering technique in the
clustering step. Since the data have categorical
attributes, we strongly suggest a clustering technique
designed for categorical attributes, such as ROCK [6].
ACKNOWLEDGMENT
We would like to thank the students from the NLP class, who
supported us in providing and pre-processing the raw materials
of the lexical resources used in this research. We also hope
that the synsets retrieved in this research will be helpful for
other research that tries to acquire other semantic relations,
such as hyponymy/hypernymy and holonymy/meronymy, or glosses, to
build a better Indonesian WordNet.
REFERENCES
[1] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller,
“Introduction to WordNet: An Online Lexical Database”, The
International Journal of Lexicography, vol. 3(4), pp. 235-244, 1990.
[2] L. Changki, L. Geunbae, and S. Jungyun, “Automatic WordNet Mapping
using Word Sense Disambiguation”, In Proceedings of the joint SIGDAT
Conference on Empirical Methods in Natural Language Processing and
Very Large Corpora (EMNLP/VLC 2000), 2000.
[3] E. Barbu and V. Barbu Mititelu, Automatic Building of WordNets, 2006.
[4] X. Farreres, G. Rigau, and H. Rodriguez, “Using WordNet for Building
WordNets”, In Proceedings of COLING-ACL Workshop on Usage of
WordNet in Natural Language Processing Systems.
[5] R. Nadig, J. Ramanand, and P. Bhattacharyya, “Automatic Evaluation of
WordNet Synonyms and Hypernyms”, In Proceedings of ICON-2008: 6th
International Conference of Natural Language Processing.
[6] S. Guha, R. Rastogi, and K. Shim, “ROCK: A Robust Clustering
Algorithm for Categorical Attributes”, In Proceedings of the 15th ICDE,
pp. 512-521, 1999.