
I use a few different algorithms to rate the similarity of any two text documents and I will combine these depending on the requirements of the project that I am working on:

1. Calculate the intersection of common words in the two documents.
2. Calculate the intersection of common word stems in the two documents.
3. Calculate the intersection of tags assigned to the two documents.
4. Calculate the intersection of human and place names in the two documents.

In this section we will implement the second option: calculate the intersection of word stems in two documents. Without showing the package and import statements, it takes just a few lines of code to implement this algorithm when we use the Stemmer class.

The following listing shows the implementation of class ComparableDocument with comments. We start by defining constructors for documents defined by a File object and a String object:

public class ComparableDocument {
    // disable default constructor calls:
    private ComparableDocument() {
    }
    public ComparableDocument(File document)
                      throws FileNotFoundException {
        this(new Scanner(document).
                 useDelimiter("\\Z").next());
    }
    public ComparableDocument(String text) {
        List<String> stems = new Stemmer().stemString(text);
        for (String stem : stems) {
            stem_count++;
            if (stemCountMap.containsKey(stem)) {
                Integer count = stemCountMap.get(stem);
                stemCountMap.put(stem, 1 + count);
            } else {
                stemCountMap.put(stem, 1);
            }
        }
    }

In the last constructor, I simply create a count of how many times each stem occurs in the document.

The public API allows us to get the stem count hash table, the number of stems in the original document, and a numeric comparison value for comparing this document with another (this is the first version – we will add an improvement later):

    public Map<String, Integer> getStemMap() {
        return stemCountMap;
    }
    public int getStemCount() {
        return stem_count;
    }
    public float compareTo(ComparableDocument otherDocument) {
        long count = 0;
        Map<String, Integer> map2 = otherDocument.getStemMap();
        Iterator<String> iter = stemCountMap.keySet().iterator();
        while (iter.hasNext()) {
            String key = iter.next();
            Integer count1 = stemCountMap.get(key);
            Integer count2 = map2.get(key);
            if (count1 != null && count2 != null) {
                count += count1 * count2;
            }
        }
        return (float) Math.sqrt(
                   ((float) (count * count)) /
                   (double) (stem_count *
                             otherDocument.getStemCount()))
               / 2f;
    }
    private Map<String, Integer> stemCountMap =
        new HashMap<String, Integer>();
    private int stem_count = 0;
}

I normalize the return value for the method compareTo to return a value of 1.0 if compared documents are identical after stemming and 0.0 if they contain no common stems. There are four test text documents in the test data directory and the following test code compares various combinations.
Note that I am careful to test the case of comparing identical documents:

ComparableDocument news1 =
    new ComparableDocument("testdata/news_1.txt");
ComparableDocument news2 =
    new ComparableDocument("testdata/news_2.txt");
ComparableDocument econ1 =
    new ComparableDocument("testdata/economy_1.txt");
ComparableDocument econ2 =
    new ComparableDocument("testdata/economy_2.txt");
System.out.println("news 1 - news1: " + news1.compareTo(news1));
System.out.println("news 1 - news2: " + news1.compareTo(news2));
System.out.println("news 2 - news2: " + news2.compareTo(news2));
System.out.println("news 1 - econ1: " + news1.compareTo(econ1));
System.out.println("econ 1 - econ1: " + econ1.compareTo(econ1));
System.out.println("news 1 - econ2: " + news1.compareTo(econ2));
System.out.println("econ 1 - econ2: " + econ1.compareTo(econ2));
System.out.println("econ 2 - econ2: " + econ2.compareTo(econ2));

The following listing shows output that indicates mediocre results; we will soon make an improvement that makes the results better. The output for this test code is:

news 1 - news1: 1.0
news 1 - news2: 0.4457711
news 2 - news2: 1.0
news 1 - econ1: 0.3649214
econ 1 - econ1: 1.0
news 1 - econ2: 0.32748842
econ 1 - econ2: 0.42922822
econ 2 - econ2: 1.0

There is not as much differentiation in comparison scores between political news stories and economic news stories as we would like to see. What is going on here? The problem is that I did not remove common words (and therefore common word stems) when creating stem counts for each document. I wrote a utility class NoiseWords for identifying both common words and their stems; you can see the implementation in the file NoiseWords.java. Removing noise words improves the comparison results (I added a few tests since the last printout):

news 1 - news1: 1.0
news 1 - news2: 0.1681978
news 1 - econ1: 0.04279895
news 1 - econ2: 0.034234844
econ 1 - econ2: 0.26178515
news 2 - econ2: 0.106673114
econ 1 - econ2: 0.26178515

Much better results! The API for com.knowledgebooks.nlp.util.NoiseWords is:

public static boolean checkFor(String stem)

You can add additional noise words to the data section in the file NoiseWords.java, depending on your application.
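To make the noise word filtering concrete, here is a minimal sketch of how the stem counting loop in the ComparableDocument(String) constructor could skip noise words. This is only an illustration, not the actual code in ComparableDocument.java or NoiseWords.java, and it assumes that checkFor returns true when its argument is a common word or common word stem:

public ComparableDocument(String text) {
    List<String> stems = new Stemmer().stemString(text);
    for (String stem : stems) {
        // assumed behavior: checkFor is true for noise words/stems
        if (NoiseWords.checkFor(stem)) continue;
        stem_count++;
        if (stemCountMap.containsKey(stem)) {
            Integer count = stemCountMap.get(stem);
            stemCountMap.put(stem, 1 + count);
        } else {
            stemCountMap.put(stem, 1);
        }
    }
}

Because noise stems never enter stemCountMap or the stem_count total, they can no longer inflate the dot product computed in compareTo.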

9.6 Spelling Correction

Automating spelling correction is a task that you may use for many types of projects. This includes both programs that involve users entering text that will be automatically processed with no further interaction with the user and programs that keep the user “in the loop” by offering them possible spelling choices that they can select. I have used five different approaches in my own work for automating spelling correction and getting spelling suggestions:

• An old project of mine (overly complex, but with good accuracy)
• Embedding the GNU ASpell utility
• Using the LGPL licensed Jazzy spelling checker (a port of the GNU ASpell spelling system to Java)
• Using Peter Norvig’s statistical spelling correction algorithm
• Using Norvig’s algorithm, adding word pair statistics

We will use the last three options in the following Sections 9.6.1 and 9.6.2, and in Section 9.6.3 we will extend Norvig’s algorithm by also using word pair statistics. This last approach is computationally expensive and is best used in applications with a highly specialized domain of discourse (e.g., systems dealing just with boats, sports, etc.). Section 9.6.3 also provides a good lead-in to Section 9.7, dealing with a similar but more general technique covered later in this chapter: Markov Models.
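Before working through these options, it may help to see the core idea behind Norvig’s algorithm in compact form. The following Java sketch is only a preview under stated assumptions, not the implementation developed later: the class name SpellingSketch and the wordCounts table are hypothetical, and in practice the counts would be collected from a large training corpus. The idea is to generate every string within one edit (a deletion, transposition, substitution, or insertion) of the misspelled word and return the candidate that occurs most frequently in the corpus:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SpellingSketch {
    // Hypothetical word frequency table; real counts would be
    // gathered by tokenizing a large training corpus.
    private static final Map<String, Integer> wordCounts =
        new HashMap<>();

    // Generate every string one edit away from word: deletions,
    // transpositions, substitutions, and insertions.
    static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<>();
        String letters = "abcdefghijklmnopqrstuvwxyz";
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (!tail.isEmpty()) {
                edits.add(head + tail.substring(1));      // deletion
            }
            if (tail.length() > 1) {                      // transposition
                edits.add(head + tail.charAt(1) + tail.charAt(0)
                          + tail.substring(2));
            }
            for (char c : letters.toCharArray()) {
                if (!tail.isEmpty()) {
                    edits.add(head + c + tail.substring(1)); // substitution
                }
                edits.add(head + c + tail);                  // insertion
            }
        }
        return edits;
    }

    // Return the candidate with the highest corpus count, or the
    // original word if no candidate is seen more often.
    static String correct(String word) {
        String best = word;
        int bestCount = wordCounts.getOrDefault(word, 0);
        for (String candidate : edits1(word)) {
            int n = wordCounts.getOrDefault(candidate, 0);
            if (n > bestCount) {
                best = candidate;
                bestCount = n;
            }
        }
        return best;
    }
}

With corpus counts loaded, a call like correct("speling") would generate "spelling" as a one-insertion candidate and return it if it is the most frequent candidate in the corpus.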

9.6.1 GNU ASpell Library and Jazzy

The GNU ASpell system is a hybrid system combining letter substitution and addition (which we will implement as a short example program in Section 9.6.2), the Soundex algorithm, and dynamic programming. I consider ASpell to be a best-of-breed spelling utility and I use it fairly frequently with scripting languages like Ruby where it is simple to “shell out” and run external programs. You can also “shell out” external commands to new processes in Java but there is no need to do this if we use the LGPLed Jazzy library that is similar to ASpell and written in pure Java.

For the sake of completeness, here is a simple example of how you would use ASpell as an external program; first, we will run ASpell in a command shell (not all output is shown):

markw$ echo "ths doog" | /usr/local/bin/aspell -a list
International Ispell (but really Aspell 0.60.5)
& ths 22 0: Th's, this, thus, Th, ...
& doog 6 4: dog, Doug, dong, door, ...

This output is easy enough to parse; here is an example in Ruby (Python, Perl, or Java would be similar):

def ASpell text
  s = `echo "#{text}" | /usr/local/bin/aspell -a list`
  s = s.split("\n")
  s.shift  # discard the version header line
  results = []
  s.each {|line|
    tokens = line.split(",")
    # header is like ["&", "ths", "22", "0", "Th's"]: marker,
    # misspelled word, suggestion count, offset, first suggestion
    header = tokens[0].gsub(':','').split(' ')
    tokens[0] = header[4]
    results << [header[1], header[3],
                tokens.collect {|tt| tt.strip}] if header[1]
  }
  results
end

I include the source code to the LGPLed Jazzy library and a test class in the directory src-spelling-Jazzy. The Jazzy library source code is in the sub-directory com/swabunga. We will spend no time looking at the implementation of the Jazzy library: this short section is simply meant to get you started quickly using Jazzy. Here is the test code from the file SpellingJazzyTester.java: