Automatically Assigning Tags to Text

defined in the XML tag data file describing some words (and their scores) associated with the tag “religion_buddhism”:

    <tags>
      <topic name="religion_buddhism">
        <term name="buddhism" score="52" />
        <term name="buddhist" score="50" />
        <term name="mind" score="50" />
        <term name="medit" score="41" />
        <term name="buddha" score="37" />
        <term name="practic" score="31" />
        <term name="teach" score="15" />
        <term name="path" score="14" />
        <term name="mantra" score="14" />
        <term name="thought" score="14" />
        <term name="school" score="13" />
        <term name="zen" score="13" />
        <term name="mahayana" score="13" />
        <term name="suffer" score="12" />
        <term name="dharma" score="12" />
        <term name="tibetan" score="11" />
        . . .
      </topic>
      . . .
    </tags>

Notice that the term names are stemmed words and all lower case. There are 28 tags defined in the input XML file included in the ZIP file for this book.

For data access, I also maintain an array of tag names and an associated list of the word frequency hash tables for each tag name:

    private static String[] tagClassNames;
    private static List<Hashtable<String, Float>> hashes =
        new ArrayList<Hashtable<String, Float>>();

The XML data is read and these data structures are filled during static class load time, so creating multiple instances of the class AutoTagger has no performance penalty in either memory use or processing time. Except for an empty default class constructor, there is only one public API for this class, the method getTags:

    public List<NameValue<String, Float>> getTags(String text) {

The utility class NameValue is defined in the file:

    src-statistical-nlp/com/knowledgebooks/nlp/util/NameValue.java

To determine the tags for input text, we keep a running score for each defined tag type. I use the internal class SFtriple to hold triple values of word, score, and tag index. I choose the tags with the highest scores as the automatically assigned tags for the input text. Scores for each tag are calculated by taking each word in the input text, stemming it, and, if the stem is in the word frequency hash table for the tag, adding the score value in the hash table to the running sum for the tag.
You can refer to the AutoTagger.java source code for details. Here is an example use of class AutoTagger:

    AutoTagger test = new AutoTagger();
    String s = "The President went to Congress to argue " +
               "for his tax bill before leaving on a " +
               "vacation to Las Vegas to see some shows " +
               "and gamble.";
    List<NameValue<String, Float>> results = test.getTags(s);
    for (NameValue<String, Float> result : results) {
      System.out.println(result);
    }

The output looks like:

    [NameValue: news_economy : 1.0]
    [NameValue: news_politics : 0.84]
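
The running-score idea described above can be sketched in a few lines. This is only an illustration and not the code in AutoTagger.java: the class name TagScoringSketch, the method scoreTags, and the stand-in stem method (which only lowercases) are my assumptions, the news_politics scores are made-up values, and the sketch reports raw sums without any normalization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TagScoringSketch {
    // Tag name -> (stem -> score). The Buddhism scores come from the XML snippet;
    // the news_politics entries are made-up values for illustration only.
    static final Map<String, Map<String, Float>> TAG_TABLES = new LinkedHashMap<>();
    static {
        Map<String, Float> buddhism = new HashMap<>();
        buddhism.put("buddhism", 52f);
        buddhism.put("buddhist", 50f);
        buddhism.put("mantra", 14f);
        buddhism.put("zen", 13f);
        TAG_TABLES.put("religion_buddhism", buddhism);

        Map<String, Float> politics = new HashMap<>();
        politics.put("president", 30f);
        politics.put("congress", 25f);
        TAG_TABLES.put("news_politics", politics);
    }

    // Hypothetical stand-in for a real stemmer: it only lowercases the word.
    static String stem(String word) {
        return word.toLowerCase();
    }

    // Returns (tag, running score) pairs, highest score first.
    // Tags that never matched a stem are not included.
    static List<Map.Entry<String, Float>> scoreTags(String text) {
        Map<String, Float> totals = new HashMap<>();
        for (String word : text.split("[^A-Za-z]+")) {
            if (word.isEmpty()) continue;
            String s = stem(word);
            for (Map.Entry<String, Map<String, Float>> tag : TAG_TABLES.entrySet()) {
                Float score = tag.getValue().get(s);
                if (score != null) {
                    // Add this stem's score to the running sum for the tag.
                    totals.merge(tag.getKey(), score, Float::sum);
                }
            }
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return ranked;
    }

    public static void main(String[] args) {
        String text = "A Zen Buddhist teacher met the President";
        for (Map.Entry<String, Float> e : scoreTags(text)) {
            System.out.println(e.getKey() + " : " + e.getValue());
        }
    }
}
```

With these toy tables, the words "Zen" and "Buddhist" contribute 13 and 50 to religion_buddhism, and "President" contributes 30 to news_politics, so religion_buddhism is ranked first.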

9.5 Text Clustering

The text clustering system that I have written for my own projects, in simplified form, will be used in this section. It is inherently inefficient when clustering a large number of text documents because I perform significant semantic processing on each text document and then compare all combinations of documents. The runtime performance is O(N^2) where N is the number of text documents. If you need to cluster or compare a very large number of documents you will probably want to use a K-Means clustering algorithm (search for “K-Means clustering Java” for some open source projects).

I use a few different algorithms to rate the similarity of any two text documents and I will combine these depending on the requirements of the project that I am working on:

1. Calculate the intersection of common words in the two documents.
2. Calculate the intersection of common word stems in the two documents.
3. Calculate the intersection of tags assigned to the two documents.
4. Calculate the intersection of human and place names in the two documents.

In this section we will implement the second option: calculate the intersection of word stems in two documents. Without showing the package and import statements, it takes just a few lines of code to implement this algorithm when we use the Stemmer class.

The following listing shows the implementation of class ComparableDocument with comments. We start by defining constructors for documents defined by a File object and a String object:

    public class ComparableDocument {
      // disable default constructor calls:
      private ComparableDocument() { }
      public ComparableDocument(File document)
             throws FileNotFoundException {
        // read the whole file into a single String:
        this(new Scanner(document).
               useDelimiter("\\Z").next());
      }
      public ComparableDocument(String text) {
        List<String> stems =
           new Stemmer().stemString(text);
        for (String stem : stems) {
          stem_count++;
          if (stemCountMap.containsKey(stem)) {
            Integer count = stemCountMap.get(stem);
            stemCountMap.put(stem, 1 + count);
          } else {
            stemCountMap.put(stem, 1);
          }
        }
      }

In the last constructor, I simply create a count of how many times each stem occurs in the document.
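
To make the idea of intersecting stem counts concrete, here is a minimal, self-contained sketch. It is not necessarily how ComparableDocument scores a pair of documents: the class name StemOverlapSketch, the methods countStems and similarity, and the choice to divide the shared count by the larger document's stem total are all my assumptions, and the stand-in "stemmer" only lowercases tokens. The nested loop in main is the all-pairs comparison that gives O(N^2) behavior.

```java
import java.util.HashMap;
import java.util.Map;

public class StemOverlapSketch {
    // Hypothetical stand-in for the Stemmer class: splits on non-letters
    // and lowercases each token, then counts occurrences.
    static Map<String, Integer> countStems(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("[^A-Za-z]+")) {
            if (token.isEmpty()) continue;
            counts.merge(token.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Shared stem occurrences divided by the larger document's total,
    // giving a value between 0 (nothing shared) and 1 (same stem counts).
    static float similarity(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0, totalA = 0, totalB = 0;
        for (int c : a.values()) totalA += c;
        for (int c : b.values()) totalB += c;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) shared += Math.min(e.getValue(), other);
        }
        int larger = Math.max(totalA, totalB);
        return larger == 0 ? 0f : (float) shared / larger;
    }

    public static void main(String[] args) {
        String[] docs = { "the cat sat", "the cat ran", "stock prices fell" };
        // Comparing every pair of documents is the O(N^2) step.
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                float sim = similarity(countStems(docs[i]), countStems(docs[j]));
                System.out.println(i + "," + j + " -> " + sim);
            }
        }
    }
}
```

In this toy data the first two documents share two of their three stems, while the third shares none with either, so only the first pair gets a nonzero score.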