Text Clustering Practical Artificial Intelligence Programming With Java
I use a few different algorithms to rate the similarity of any two text documents, and I combine them depending on the requirements of the project that I am working on:
1. Calculate the intersection of common words in the two documents.
2. Calculate the intersection of common word stems in the two documents.
3. Calculate the intersection of tags assigned to the two documents.
4. Calculate the intersection of human and place names in the two documents.
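Before turning to word stems, it may help to see the simplest option in the list, the intersection of common words. The following is a minimal sketch, not code from this book's library; the class name, the normalization by the smaller word set, and the whitespace tokenization are my own choices:

```java
import java.util.*;

public class WordIntersection {
  // Score two documents by the size of their common-word
  // intersection, normalized by the smaller word set.
  public static float similarity(String text1, String text2) {
    Set<String> words1 = new HashSet<>(
        Arrays.asList(text1.toLowerCase().split("\\s+")));
    Set<String> words2 = new HashSet<>(
        Arrays.asList(text2.toLowerCase().split("\\s+")));
    Set<String> common = new HashSet<>(words1);
    common.retainAll(words2);  // set intersection
    int smaller = Math.min(words1.size(), words2.size());
    return smaller == 0 ? 0f : (float) common.size() / smaller;
  }

  public static void main(String[] args) {
    // 3 shared words out of 4 in each document:
    System.out.println(similarity("the economy is growing",
                                  "the economy is shrinking"));  // prints 0.75
  }
}
```

Options 2 through 4 follow the same pattern, substituting stems, tags, or extracted names for raw words.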
In this section we will implement the second option: calculate the intersection of word stems in two documents. Without showing the package and import statements, it takes just a few lines of code to implement this algorithm when we use the Stemmer class.

The following listing shows the implementation of class ComparableDocument with comments. We start by defining constructors for documents defined by a File object and a String object:
    public class ComparableDocument {
      // disable default constructor calls:
      private ComparableDocument() { }

      public ComparableDocument(File document)
                     throws FileNotFoundException {
        this(new Scanner(document).
               useDelimiter("\\Z").next());
      }

      public ComparableDocument(String text) {
        List<String> stems =
            new Stemmer().stemString(text);
        for (String stem : stems) {
          stem_count++;
          if (stemCountMap.containsKey(stem)) {
            Integer count = stemCountMap.get(stem);
            stemCountMap.put(stem, 1 + count);
          } else {
            stemCountMap.put(stem, 1);
          }
        }
      }

In the last constructor, I simply create a count of how many times each stem occurs in the document.
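On Java 8 and later, the containsKey/get/put sequence in that counting loop can be collapsed into a single Map.merge call. This is a sketch of the same counting logic in isolation; the class and method names are mine, not part of the book's library:

```java
import java.util.*;

public class StemCounter {
  // Count occurrences of each stem. Map.merge inserts 1 for a new
  // key, or applies Integer::sum to add 1 to an existing count.
  public static Map<String, Integer> countStems(List<String> stems) {
    Map<String, Integer> counts = new HashMap<>();
    for (String stem : stems) {
      counts.merge(stem, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    System.out.println(countStems(Arrays.asList("run", "jump", "run")));
  }
}
```

The behavior is identical to the loop in the constructor; merge simply avoids the explicit branch on containsKey.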
The public API allows us to get the stem count hash table, the number of stems in the original document, and a numeric comparison value for comparing this document with another (this is the first version; we will add an improvement later):
      public Map<String, Integer> getStemMap() {
        return stemCountMap;
      }

      public int getStemCount() {
        return stem_count;
      }

      public float compareTo(ComparableDocument otherDocument) {
        long count = 0;
        Map<String, Integer> map2 = otherDocument.getStemMap();
        Iterator iter = stemCountMap.keySet().iterator();
        while (iter.hasNext()) {
          Object key = iter.next();
          Integer count1 = stemCountMap.get(key);
          Integer count2 = map2.get(key);
          if (count1 != null && count2 != null) {
            count += count1 * count2;
          }
        }
        return (float) Math.sqrt(
                 ((float) (count * count)) /
                 (double) (stem_count *
                           otherDocument.getStemCount()))
               / 2f;
      }

      private Map<String, Integer> stemCountMap =
          new HashMap<String, Integer>();
      private int stem_count = 0;
    }

I normalize the return value of the method compareTo to return a value of 1.0 if the compared documents are identical after stemming and 0.0 if they contain no common stems. There are four test text documents in the test data directory, and the following test code compares various combinations. Note that I am careful to test the case of comparing identical documents:
    ComparableDocument news1 =
        new ComparableDocument(new File("testdata/news_1.txt"));
    ComparableDocument news2 =
        new ComparableDocument(new File("testdata/news_2.txt"));
    ComparableDocument econ1 =
        new ComparableDocument(new File("testdata/economy_1.txt"));
    ComparableDocument econ2 =
        new ComparableDocument(new File("testdata/economy_2.txt"));
    System.out.println("news 1 - news1: " +
                       news1.compareTo(news1));
    System.out.println("news 1 - news2: " +
                       news1.compareTo(news2));
    System.out.println("news 2 - news2: " +
                       news2.compareTo(news2));
    System.out.println("news 1 - econ1: " +
                       news1.compareTo(econ1));
    System.out.println("econ 1 - econ1: " +
                       econ1.compareTo(econ1));
    System.out.println("news 1 - econ2: " +
                       news1.compareTo(econ2));
    System.out.println("econ 1 - econ2: " +
                       econ1.compareTo(econ2));
    System.out.println("econ 2 - econ2: " +
                       econ2.compareTo(econ2));

The following listing shows output that indicates mediocre results; we will soon make an improvement that makes the results better. The output for this test code is:
    news 1 - news1: 1.0
    news 1 - news2: 0.4457711
    news 2 - news2: 1.0
    news 1 - econ1: 0.3649214
    econ 1 - econ1: 1.0
    news 1 - econ2: 0.32748842
    econ 1 - econ2: 0.42922822
    econ 2 - econ2: 1.0
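The scoring formula in compareTo can also be exercised in isolation on hand-built stem-count maps. The following sketch restates the same calculation as a static method; the class name, method signature, and sample counts are my own, not from the book's code:

```java
import java.util.*;

public class StemSimilarity {
  // Same calculation as compareTo: sum the products of counts for
  // shared stems, then normalize by the two documents' stem totals.
  public static float score(Map<String, Integer> m1, int total1,
                            Map<String, Integer> m2, int total2) {
    long count = 0;
    for (Map.Entry<String, Integer> e : m1.entrySet()) {
      Integer c2 = m2.get(e.getKey());
      if (c2 != null) count += (long) e.getValue() * c2;
    }
    return (float) Math.sqrt(
             ((float) (count * count)) /
             (double) (total1 * total2))
           / 2f;
  }

  public static void main(String[] args) {
    // A document with two stems, each occurring twice,
    // compared against itself:
    Map<String, Integer> doc = new HashMap<>();
    doc.put("bank", 2);
    doc.put("rate", 2);
    System.out.println(score(doc, 4, doc, 4));  // prints 1.0
  }
}
```

Tracing the arithmetic: count = 2*2 + 2*2 = 8, so the result is sqrt(64 / 16) / 2 = 1.0 for this self-comparison.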
There is not as much differentiation in comparison scores between political news stories and economic news stories as we would like. What is going on here? The problem is that I did not remove common words (and therefore common word stems) when creating stem counts for each document. I wrote a utility class NoiseWords for identifying both common words and their stems; you can see the implementation in the file NoiseWords.java. Removing noise words improves the comparison results (I added a few tests since the last printout):
news 1 - news1: 1.0
    news 1 - news2: 0.1681978
    news 1 - econ1: 0.04279895
    news 1 - econ2: 0.034234844
    econ 1 - econ2: 0.26178515
    news 2 - econ2: 0.106673114
Much better results! The API for com.knowledgebooks.nlp.util.NoiseWords is:

    public static boolean checkFor(String stem)
You can add additional noise words to the data section in the file NoiseWords.java, depending on your application.
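To illustrate how such a check fits into the stem-counting step, here is a minimal sketch of a noise-word filter. The tiny stop list and the filter helper are my own stand-ins; the real NoiseWords class carries a much larger data section of common words and their stems:

```java
import java.util.*;

public class NoiseFilter {
  // A tiny stand-in stop list for illustration only.
  private static final Set<String> NOISE =
      new HashSet<>(Arrays.asList("the", "a", "is", "of", "and"));

  // Mirrors the shape of NoiseWords.checkFor(String stem).
  public static boolean checkFor(String stem) {
    return NOISE.contains(stem.toLowerCase());
  }

  // Drop noise words before they reach the stem counts.
  public static List<String> filter(List<String> stems) {
    List<String> kept = new ArrayList<>();
    for (String stem : stems) {
      if (!checkFor(stem)) kept.add(stem);
    }
    return kept;
  }

  public static void main(String[] args) {
    System.out.println(
        filter(Arrays.asList("the", "economy", "is", "grow")));
    // prints [economy, grow]
  }
}
```

Filtering in the constructor, before stems are counted, is what removes the shared high-frequency stems that inflated the earlier comparison scores.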