Automatically Assigning Tags to Text
defined in the XML tag data file describing some words and their scores associated with the tag “religion_buddhism”:
    <tags>
      <topic name="religion_buddhism">
        <term name="buddhism" score="52" />
        <term name="buddhist" score="50" />
        <term name="mind" score="50" />
        <term name="medit" score="41" />
        <term name="buddha" score="37" />
        <term name="practic" score="31" />
        <term name="teach" score="15" />
        <term name="path" score="14" />
        <term name="mantra" score="14" />
        <term name="thought" score="14" />
        <term name="school" score="13" />
        <term name="zen" score="13" />
        <term name="mahayana" score="13" />
        <term name="suffer" score="12" />
        <term name="dharma" score="12" />
        <term name="tibetan" score="11" />
        . . .
      </topic>
      . . .
    </tags>

Notice that the term names are stemmed words and all lower case. There are 28 tags
defined in the input XML file included in the ZIP file for this book.

For data access, I also maintain an array of tag names and an associated list of the word frequency hash tables for each tag name:

    private static String[] tagClassNames;
    private static List<Hashtable<String, Float>> hashes =
        new ArrayList<Hashtable<String, Float>>();

The XML data is read and these data structures are filled during static class load time, so creating multiple instances of the class AutoTagger has no performance penalty in either memory use or processing time. Except for an empty default class constructor, there is only one public API for this class, the method getTags:

    public List<NameValue<String, Float>> getTags(String text) {
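As a rough illustration of filling the tag data structures at static class load time, the following minimal sketch parses a small inline XML sample with the standard JAXP DOM parser. The class name TagDataLoaderSketch, the inline SAMPLE_XML string, and the use of a List (rather than an array) for the tag names are assumptions made for this example; this is not the code in AutoTagger.java.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: not the AutoTagger implementation from the book.
class TagDataLoaderSketch {
    // Parallel structures: tagNames.get(i) is the tag for hashes.get(i).
    static final List<String> tagNames = new ArrayList<>();
    static final List<Hashtable<String, Float>> hashes = new ArrayList<>();

    // Assumed inline sample standing in for the real XML tag data file.
    static final String SAMPLE_XML =
        "<tags>"
      + "<topic name=\"religion_buddhism\">"
      + "<term name=\"buddhism\" score=\"52\" />"
      + "<term name=\"zen\" score=\"13\" />"
      + "</topic>"
      + "</tags>";

    // The static initializer runs once, when the class is first loaded,
    // so creating many instances does not repeat the parsing work.
    static {
        try {
            load(SAMPLE_XML);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Parse the XML and add one hash table of term scores per topic.
    static void load(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList topics = doc.getElementsByTagName("topic");
        for (int i = 0; i < topics.getLength(); i++) {
            Element topic = (Element) topics.item(i);
            Hashtable<String, Float> scores = new Hashtable<>();
            NodeList terms = topic.getElementsByTagName("term");
            for (int j = 0; j < terms.getLength(); j++) {
                Element term = (Element) terms.item(j);
                scores.put(term.getAttribute("name"),
                           Float.valueOf(term.getAttribute("score")));
            }
            tagNames.add(topic.getAttribute("name"));
            hashes.add(scores);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < tagNames.size(); i++) {
            System.out.println(tagNames.get(i) + " -> " + hashes.get(i));
        }
    }
}
```

Keeping the names and the hash tables in index-aligned collections means a tag index is enough to look up both the tag name and its term scores.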
The utility class NameValue is defined in the file:

    src-statistical-nlp/com/knowledgebooks/nlp/util/NameValue.java

To determine the tags for input text, we keep a running score for each defined tag type. I use the internal class SFtriple to hold triple values of word, score, and tag index. I choose the tags with the highest scores as the automatically assigned tags for the input text. Scores for each tag are calculated by taking each word in the input text, stemming it, and, if the stem is in the word frequency hash table for the tag, adding the score value in the hash table to the running sum for the tag. You can refer to the AutoTagger.java source code for details. Here is an example use of class AutoTagger:
    AutoTagger test = new AutoTagger();
    String s = "The President went to Congress to argue" +
        " for his tax bill before leaving on a vacation to Las Vegas to see some shows" +
        " and gamble.";
    List<NameValue<String, Float>> results =
        test.getTags(s);
    for (NameValue<String, Float> result : results) {
        System.out.println(result);
    }

The output looks like:

    [NameValue: news_economy : 1.0]
    [NameValue: news_politics : 0.84]
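The running-sum scoring described earlier in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the class TagScorerSketch, the toy stem method (a stand-in for a real stemmer), and the two small score tables with made-up values are all hypothetical, and the sketch returns raw sums rather than reproducing the score values printed above.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of per-tag running sums; not the book's AutoTagger code.
class TagScorerSketch {
    // Assumed tag names, index-aligned with the tables from sampleTables().
    static final String[] TAG_NAMES = {"news_politics", "news_economy"};

    // Made-up score tables for illustration only.
    static List<Hashtable<String, Float>> sampleTables() {
        Hashtable<String, Float> politics = new Hashtable<>();
        politics.put("president", 10f);
        politics.put("congress", 8f);
        Hashtable<String, Float> economy = new Hashtable<>();
        economy.put("tax", 12f);
        economy.put("bill", 3f);
        List<Hashtable<String, Float>> tables = new ArrayList<>();
        tables.add(politics);
        tables.add(economy);
        return tables;
    }

    // Toy stand-in for a stemmer: lowercases and strips a simple plural "s".
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        boolean plural = w.length() > 3 && w.endsWith("s") && !w.endsWith("ss");
        return plural ? w.substring(0, w.length() - 1) : w;
    }

    // For each word, stem it and add its score to every tag whose table has it.
    static float[] score(String text, List<Hashtable<String, Float>> hashes) {
        float[] sums = new float[hashes.size()];
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            String s = stem(token);
            for (int i = 0; i < hashes.size(); i++) {
                Float value = hashes.get(i).get(s);
                if (value != null) sums[i] += value;
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        float[] sums = score(
            "The President went to Congress to argue for his tax bill",
            sampleTables());
        for (int i = 0; i < sums.length; i++) {
            System.out.println(TAG_NAMES[i] + " : " + sums[i]);
        }
    }
}
```

Choosing the assigned tags then amounts to sorting the tag indices by their sums and keeping the highest-scoring ones.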