Suggested Project: Using a Part of Speech Tagger to Use the Correct WordNet Synonyms

9.3.4 Suggested Project: Using WordNet Synonyms to Improve Document Clustering

Another suggestion for a WordNet-based project is to use the Tagger to identify the probable part of speech for each word in all text documents that you want to cluster, and augment the documents with synset synonym data. You can then cluster the documents similarly to how we will calculate document similarity in Section 9.5.
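As a starting point for this project, here is a minimal sketch of the augmentation step. It assumes you already have a part of speech tag for each word, and it uses a small in-memory map as a stand-in for real WordNet lookups. The class name SynonymAugmentSketch, the "word/POS" key format, the Penn Treebank style tag names, and the sample synonyms are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SynonymAugmentSketch {
    // Hypothetical stand-in for WordNet: key is "word/POS", value is a list of synonyms.
    static Map<String, List<String>> synonyms = new HashMap<>();

    // Return the original words followed by the synonyms found for each (word, POS) pair.
    static List<String> augment(String[] words, String[] posTags) {
        List<String> out = new ArrayList<>(Arrays.asList(words));
        for (int i = 0; i < words.length; i++) {
            List<String> syns = synonyms.get(words[i].toLowerCase() + "/" + posTags[i]);
            if (syns != null) out.addAll(syns);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up entries: the noun and verb senses of "bank" get different synonyms.
        synonyms.put("bank/NN", Arrays.asList("depository"));
        synonyms.put("bank/VB", Arrays.asList("deposit"));
        String[] words = {"the", "bank", "closed"};
        String[] tags = {"DT", "NN", "VBD"};
        System.out.println(augment(words, tags)); // prints [the, bank, closed, depository]
    }
}
```

Keying the lookup on both the word and its part of speech is what keeps the verb synonyms out of a document that uses the word as a noun. The augmented word lists can then be fed to whatever clustering code you use.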

9.4 Automatically Assigning Tags to Text

By tagging I mean assigning zero or more categories like “politics”, “economy”, etc. to text based on the words contained in the text. While the code for doing this is simple, there is usually much work to do to build a word count database for different classifications.

I have been working on commercial products for automatic tagging and semantic extraction for about ten years (see www.knowledgebooks.com if you are interested). In this section I will show you some simple techniques for automatically assigning tags or categories to text using some code snippets from my own commercial product. We will use a set of tags for which I have collected word frequency statistics. For example, a tag of “Java” might be associated with the use of the words “Java,” “JVM,” “Sun,” etc. You can find my pre-trained tag data in the file:

    test_data/classification_tags.xml

The Java source code for the class AutoTagger is in the file:

    src-statistical-nlp/com/knowledgebooks/nlp/AutoTagger.java

The AutoTagger class uses a few data structures to keep track of both the names of tags and the word count statistics for words associated with each tag name. I use a temporary hash table for processing the XML input data:

    private static Hashtable<String, Hashtable<String, Float>> tagClasses;

The names of tags used are defined in the XML tag data file: change this file, and you alter both the tags and behavior of this utility class.
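To make the shape of this nested hash table concrete, here is a minimal sketch of how a table like tagClasses could be filled and then used to score a piece of text. This is not the AutoTagger implementation: the class name TagScoreSketch, the helper methods addTerm and score, the rule of simply summing matching word scores, and the sample scores are all assumptions made for illustration.

```java
import java.util.Hashtable;

public class TagScoreSketch {
    // Outer key: tag name. Inner table: word -> score for that tag.
    static Hashtable<String, Hashtable<String, Float>> tagClasses = new Hashtable<>();

    // Register one word and its score under a tag, creating the inner table on first use.
    static void addTerm(String tag, String word, float score) {
        tagClasses.computeIfAbsent(tag, k -> new Hashtable<>()).put(word, score);
    }

    // Assumed scoring rule: sum the scores of every token in the text listed for the tag.
    static float score(String tag, String text) {
        Hashtable<String, Float> words = tagClasses.get(tag);
        if (words == null) return 0f;
        float total = 0f;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            Float s = words.get(token);
            if (s != null) total += s;
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up scores for the "Java" tag example words mentioned in the text.
        addTerm("java", "java", 50f);
        addTerm("java", "jvm", 40f);
        addTerm("java", "sun", 10f);
        System.out.println(score("java", "The JVM runs Java bytecode")); // prints 90.0
    }
}
```

A real tagger would also stem each token before the lookup and would compare scores across all tags rather than scoring a single tag in isolation.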
Here is a snippet of data defined in the XML tag data file describing some words and their scores associated with the tag “religion_buddhism”:

    <tags>
      <topic name="religion_buddhism">
        <term name="buddhism" score="52" />
        <term name="buddhist" score="50" />
        <term name="mind" score="50" />
        <term name="medit" score="41" />
        <term name="buddha" score="37" />
        <term name="practic" score="31" />
        <term name="teach" score="15" />
        <term name="path" score="14" />
        <term name="mantra" score="14" />
        <term name="thought" score="14" />
        <term name="school" score="13" />
        <term name="zen" score="13" />
        <term name="mahayana" score="13" />
        <term name="suffer" score="12" />
        <term name="dharma" score="12" />
        <term name="tibetan" score="11" />
        . . .
      </topic>
      . . .
    </tags>

Notice that the term names are stemmed words and all lower case. There are 28 tags defined in the input XML file included in the ZIP file for this book.

For data access, I also maintain an array of tag names and an associated list of the word frequency hash tables for each tag name:

    private static String[] tagClassNames;
    private static List<Hashtable<String, Float>> hashes =
        new ArrayList<Hashtable<String, Float>>();

The XML data is read and these data structures are filled during static class load time, so creating multiple instances of the class AutoTagger has no performance penalty in either memory use or processing time. Except for an empty default class constructor, there is only one public API for this class, the method getTags:

    public List<NameValue<String, Float>> getTags(String text) {