Extending the Norvig Algorithm by Using Word Pair Statistics
He went to Paris. The weather was warm. Ideally, we would not want to collect statistics on word (or token) pairs like “Paris .” or “Paris The” that include the final period of a sentence or that span a sentence boundary. In practice, since we will be discarding seldom-occurring word pairs, this does not matter too much, so in our example we will collect all tokenized word pairs at the same time that we collect single-word frequency statistics:
    Pattern p = Pattern.compile("[,.'\";:\\s]+");
    Scanner scanner = new Scanner(new File("/tmp/small.txt"));
    scanner.useDelimiter(p);
    String last = "ahjhjhdsgh";
    while (scanner.hasNext()) {
      String word = scanner.next();
      if (wordCounts.containsKey(word)) {
        Integer count = wordCounts.get(word);
        wordCounts.put(word, count + 1);
      } else {
        wordCounts.put(word, 1);
      }
      String pair = last + " " + word;
      if (wordPairCounts.containsKey(pair)) {
        Integer count = wordPairCounts.get(pair);
        wordPairCounts.put(pair, count + 1);
      } else {
        wordPairCounts.put(pair, 1);
      }
      last = word;
    }
    scanner.close();
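The listing above never actually discards the seldom-occurring pairs mentioned earlier. One way to sketch that pruning step, using a hypothetical minimum count of 2, is a pass over the table after training (the helper name and threshold here are my own, not from the original code):

```java
import java.util.HashMap;
import java.util.Map;

public class PrunePairs {
    // remove pairs seen fewer than minCount times to bound table size
    static Map<String, Integer> prune(Map<String, Integer> counts,
                                      int minCount) {
        counts.entrySet().removeIf(e -> e.getValue() < minCount);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordPairCounts = new HashMap<>();
        wordPairCounts.put("Conan Doyle", 3);
        wordPairCounts.put("went to", 1);
        prune(wordPairCounts, 2);
        System.out.println(wordPairCounts.keySet());
    }
}
```

Pruning after training keeps the hash table small enough to hold in memory for modestly sized corpora; for very large corpora the table can still grow too large, a problem revisited at the end of this section.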
For the first page of text in the test file, if we print out word pairs that occur at least two times using this code:
    for (String pair : wordPairCounts.keySet()) {
      if (wordPairCounts.get(pair) > 1) {
        System.out.println(pair + ": " + wordPairCounts.get(pair));
      }
    }
then we get this output:
    Arthur Conan: 3
    by Sir: 2
    of Sherlock: 2
    Project Gutenberg: 5
    how to: 2
    The Adventures: 2
    Sherlock Holmes: 2
    Sir Arthur: 3
    Adventures of: 2
    information about: 2
    Conan Doyle: 3
The words “Conan” and “Doyle” tend to appear together frequently. If we want to suggest spelling corrections for “the author Conan Doyyle wrote”, it seems intuitive that we should prefer the correction “Doyle”: if we take the list of possible corrections for “Doyyle” and combine each with the preceding word “Conan” in the text, then we notice that the hash table wordPairCounts has a relatively high count for the key “Conan Doyle”, which is a single string containing a word pair.
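The lookup just described can be sketched as follows; the pair count matches the sample output above, and the two candidate corrections are hypothetical stand-ins for entries in the full candidate list:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PairLookup {
    // count for the pair "previous candidate", or 0 if unseen
    static int pairCount(Map<String, Integer> pairs,
                         String previous, String candidate) {
        return pairs.getOrDefault(previous + " " + candidate, 0);
    }

    public static void main(String[] args) {
        Map<String, Integer> wordPairCounts = new HashMap<>();
        wordPairCounts.put("Conan Doyle", 3);  // from the sample output

        for (String candidate : Arrays.asList("Doyle", "Dole")) {
            System.out.println(candidate + ": "
                + pairCount(wordPairCounts, "Conan", candidate));
        }
    }
}
```

“Doyle” scores 3 because “Conan Doyle” was seen in the training text, while “Dole” scores 0, so the pair statistics break the tie in favor of “Doyle”.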
In theory this may look like a good approach, but there are a few things that keep this technique from being generally practical:
• It is computationally expensive to train the system for a large training text.
• It is more computationally expensive to perform spelling suggestions.
• The results are not likely to be much better than the single-word approach unless the text is in one narrow domain and you have a lot of training text.
In the example of misspelling Doyyle, calling the method edits as edits("Doyyle") returns a list with 349 elements. The method edits is identical to the one used in the single-word spelling corrector in the last section.
I changed the method correct by adding an argument for the previous word, factoring in statistics from the word pair count hash table, and, for this example, by not calculating “edits of edits” as we did in the last section. Here is the modified code:
    public String correct(String word,
                          String previous_word) {
      if (wordCounts.containsKey(word)) return word;
      List<String> list = edits(word);
      // candidate hash has word counts as keys, word as value:
      HashMap<Integer, String> candidates =
          new HashMap<Integer, String>();
      for (String testWord : list) {
        // look for word pairs with testWord in the
        // second position:
        String word_pair = previous_word + " " + testWord;
        int count_from_1_word = 0;
        int count_from_word_pairs = 0;
        if (wordCounts.containsKey(testWord)) {
          count_from_1_word += wordCounts.get(testWord);
          candidates.put(wordCounts.get(testWord),
                         testWord);
        }
        if (wordPairCounts.containsKey(word_pair)) {
          count_from_word_pairs +=
              wordPairCounts.get(word_pair);
        }
        // look for word pairs with testWord in the
        // first position:
        word_pair = testWord + " " + previous_word;
        if (wordPairCounts.containsKey(word_pair)) {
          count_from_word_pairs +=
              wordPairCounts.get(word_pair);
        }
        int sum = count_from_1_word +
                  count_from_word_pairs;
        if (sum > 0) {
          candidates.put(sum, testWord);
        }
      }
      // if candidates is not empty, then return the
      // word with the largest key (word count) value:
      if (candidates.size() > 0) {
        return candidates.get(
            Collections.max(candidates.keySet()));
      }
      return word;
    }
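To see the method in action, here is a minimal self-contained harness; the edits stub, the single-word counts, and the pair count are made up for this demonstration (the real edits method generates all single-edit variants, and the real tables come from training), and the scoring is condensed into one expression equivalent to the logic above:

```java
import java.util.*;

public class CorrectDemo {
    // tiny stand-in tables; real ones come from the training code above
    static Map<String, Integer> wordCounts = new HashMap<>();
    static Map<String, Integer> wordPairCounts = new HashMap<>();

    // stub: the real edits method generates all single-edit variants
    static List<String> edits(String word) {
        return Arrays.asList("Doyle", "Dole");
    }

    static String correct(String word, String previousWord) {
        if (wordCounts.containsKey(word)) return word;
        HashMap<Integer, String> candidates = new HashMap<>();
        for (String testWord : edits(word)) {
            // single-word count plus pair counts with testWord in
            // either the second or the first position:
            int sum = wordCounts.getOrDefault(testWord, 0)
                + wordPairCounts.getOrDefault(previousWord + " " + testWord, 0)
                + wordPairCounts.getOrDefault(testWord + " " + previousWord, 0);
            if (sum > 0) candidates.put(sum, testWord);
        }
        if (!candidates.isEmpty())
            return candidates.get(Collections.max(candidates.keySet()));
        return word;
    }

    public static void main(String[] args) {
        wordCounts.put("Doyle", 3);
        wordCounts.put("Dole", 4);               // higher single-word count
        wordPairCounts.put("Conan Doyle", 3);    // but the pair is attested
        System.out.println(correct("Doyyle", "Conan"));
    }
}
```

With these made-up counts, “Dole” has the higher single-word count (4 versus 3), but the pair count for “Conan Doyle” raises the score of “Doyle” to 6, so the pair statistics override the single-word preference.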
Using word pair statistics can be a good technique if you need to build an automated spelling corrector that only needs to work on text in one subject area. You will need a lot of training text in your subject area and should be prepared for the extra work of performing the training: as I mentioned before, for one customer project I could not fit the word pair hash table in memory on the server that I had to use, so I used a disk-based hash table, and the training run took a long while. Another good alternative for building systems that handle text in one subject area is to augment a standard spelling library like ASpell or Jazzy with custom word dictionaries.