Peter Norvig’s Spelling Algorithm
      new HashMap<String, Integer>();

  static {
    // Use Peter Norvig's training file big.txt:
    //   http://www.norvig.com/spell-correct.html
    // A static initializer cannot throw a checked exception,
    // so an I/O failure is rethrown as an unchecked error.
    try {
      FileInputStream fstream =
          new FileInputStream("/tmp/big.txt");
      DataInputStream in = new DataInputStream(fstream);
      BufferedReader br =
          new BufferedReader(new InputStreamReader(in));
      String line;
      while ((line = br.readLine()) != null) {
        List<String> words = Tokenizer.wordsToList(line);
        for (String word : words) {
          if (wordCounts.containsKey(word)) {
            Integer count = wordCounts.get(word);
            wordCounts.put(word, count + 1);
          } else {
            wordCounts.put(word, 1);
          }
        }
      }
      in.close();
    } catch (IOException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

The class has two static methods that implement the algorithm. The first method
edits, shown in the following listing, is private and returns a list of permutations of a word. Permutations are created by removing a character, by
reversing the order of two adjacent characters, by replacing a single character with each letter from a to z, and by inserting each letter from a to z at every position
in the word, including before the first character and after the last:
  private static List<String> edits(String word) {
    int wordL = word.length(), wordLm1 = wordL - 1;
    List<String> possible = new ArrayList<String>();
    // drop a character:
    for (int i = 0; i < wordL; ++i) {
      possible.add(word.substring(0, i) +
                   word.substring(i + 1));
    }
    // reverse order of 2 characters:
    for (int i = 0; i < wordLm1; ++i) {
      possible.add(word.substring(0, i) +
                   word.substring(i + 1, i + 2) +
                   word.substring(i, i + 1) +
                   word.substring(i + 2));
    }
    // replace a character in each location in the word:
    for (int i = 0; i < wordL; ++i) {
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch +
                     word.substring(i + 1));
      }
    }
    // add in a character in each location in the word:
    for (int i = 0; i <= wordL; ++i) {
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch +
                     word.substring(i));
      }
    }
    return possible;
  }
Here is a sample test case for the method edits, where we call it with the word “cat” and get a list of 187 permutations (3 deletions, 2 transpositions, 78 replacements, and 104 insertions):
[at, ct, ca, act, cta, aat, bat, cat, .., fat, .., cct, cdt, cet, .., caty, catz]
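The figure of 187 can be checked by generating the list and counting. The following is a minimal sketch rather than part of the class described in the text: the class name EditCounts and the helper expectedSize are hypothetical names introduced for illustration, and the body of edits is repeated only so that the sketch compiles on its own. For a word of n letters (n at least 1) the four loops yield n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions, or 54n + 25 strings in total.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class EditCounts {

  // Same four loops as the edits method in the text, repeated
  // here so that this sketch is self-contained.
  static List<String> edits(String word) {
    int n = word.length();
    List<String> possible = new ArrayList<String>();
    for (int i = 0; i < n; ++i) {              // n deletions
      possible.add(word.substring(0, i) + word.substring(i + 1));
    }
    for (int i = 0; i < n - 1; ++i) {          // n - 1 transpositions
      possible.add(word.substring(0, i) + word.substring(i + 1, i + 2)
          + word.substring(i, i + 1) + word.substring(i + 2));
    }
    for (int i = 0; i < n; ++i) {              // 26n replacements
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch + word.substring(i + 1));
      }
    }
    for (int i = 0; i <= n; ++i) {             // 26(n + 1) insertions
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch + word.substring(i));
      }
    }
    return possible;
  }

  // Closed-form list size for a word of n >= 1 letters:
  // n + (n - 1) + 26n + 26(n + 1) = 54n + 25.
  static int expectedSize(int n) {
    return 54 * n + 25;
  }

  public static void main(String[] args) {
    List<String> all = edits("cat");
    // A set drops repeated strings while keeping first-seen order.
    Set<String> distinct = new LinkedHashSet<String>(all);
    System.out.println(all.size() + " strings generated, "
        + distinct.size() + " distinct");
  }
}
```

The list contains repeats: replacing a letter with itself reproduces the original word once per position, and inserting a letter next to an identical one gives the same string twice. The repeats do not change which words are found in wordCounts, but collecting the results in a set would avoid looking the same string up more than once.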
The public static method correct has four possible return values:
• If the word is in the spelling hash table, simply return the word.
• Generate a permutation list of the input word using the method edits. Build a hash table candidates from the permutation list, with the keys being the word counts from the main hash table wordCounts and the values being the words in the permutation list. If the hash table candidates is not empty, then return the permutation with the highest word count.
• For each word in the permutation list, call the method edits with that word, creating a new candidates hash table with permutations of permutations. If candidates is not empty, then return the word with the highest word count.
• Otherwise, return the original word (no suggestions).
  public static String correct(String word) {
    if (wordCounts.containsKey(word)) return word;
    List<String> list = edits(word);
    // Candidate hash has word counts as keys, word as value:
    HashMap<Integer, String> candidates =
        new HashMap<Integer, String>();
    for (String testWord : list) {
      if (wordCounts.containsKey(testWord)) {
        candidates.put(wordCounts.get(testWord), testWord);
      }
    }
    // If candidates is not empty, then return the word with
    // the largest key (word count) value:
    if (candidates.size() > 0) {
      return candidates.get(
          Collections.max(candidates.keySet()));
    }
    // If the edits method does not provide a candidate word
    // that matches, then we will call edits again with each
    // of the previous permutation words. Note: this case
    // occurs only about 20% of the time and obviously
    // increases the runtime of method correct.
    candidates.clear();
    for (String editWords : list) {
      for (String wrd : edits(editWords)) {
        if (wordCounts.containsKey(wrd)) {
          candidates.put(wordCounts.get(wrd), wrd);
        }
      }
    }
    if (candidates.size() > 0) {
      return candidates.get(
          Collections.max(candidates.keySet()));
    }
    return word;
  }
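To see the three stages of correct at work without the big.txt file, the following sketch trains on a one-sentence toy corpus. It is an illustration under stated assumptions, not the class from the text: the names CorrectSketch, train, and best are hypothetical, the corpus is made up, and splitting on non-letters merely stands in for Tokenizer.wordsToList, whose behavior is not shown here. It also swaps in one variation, named plainly: instead of a hash table keyed by word count, it keeps track of the best candidate directly, so two candidates with the same count cannot overwrite each other (ties go to the first one found).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CorrectSketch {

  static final Map<String, Integer> wordCounts =
      new HashMap<String, Integer>();

  // Count the words in a text. Lowercasing and splitting on
  // non-letters is an assumption made for this sketch.
  static void train(String text) {
    for (String w : text.toLowerCase().split("[^a-z]+")) {
      if (w.isEmpty()) continue;
      Integer count = wordCounts.get(w);
      wordCounts.put(w, count == null ? 1 : count + 1);
    }
  }

  // Same four loops as the edits method in the text.
  static List<String> edits(String word) {
    int n = word.length();
    List<String> possible = new ArrayList<String>();
    for (int i = 0; i < n; ++i) {
      possible.add(word.substring(0, i) + word.substring(i + 1));
    }
    for (int i = 0; i < n - 1; ++i) {
      possible.add(word.substring(0, i) + word.substring(i + 1, i + 2)
          + word.substring(i, i + 1) + word.substring(i + 2));
    }
    for (int i = 0; i < n; ++i) {
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch + word.substring(i + 1));
      }
    }
    for (int i = 0; i <= n; ++i) {
      for (char ch = 'a'; ch <= 'z'; ++ch) {
        possible.add(word.substring(0, i) + ch + word.substring(i));
      }
    }
    return possible;
  }

  // Most frequent known word among the candidates, or null if
  // none of them appears in wordCounts.
  static String best(List<String> candidates) {
    String bestWord = null;
    int bestCount = 0;
    for (String c : candidates) {
      Integer count = wordCounts.get(c);
      if (count != null && count > bestCount) {
        bestWord = c;
        bestCount = count;
      }
    }
    return bestWord;
  }

  static String correct(String word) {
    if (wordCounts.containsKey(word)) return word;  // known word
    List<String> once = edits(word);
    String found = best(once);                      // one edit away
    if (found != null) return found;
    List<String> twice = new ArrayList<String>();
    for (String e : once) twice.addAll(edits(e));   // two edits away
    found = best(twice);
    return found != null ? found : word;            // no suggestion
  }

  public static void main(String[] args) {
    train("the cat sat on the mat while the other cat slept");
    System.out.println(correct("mat"));    // known word, returned as is
    System.out.println(correct("cta"));    // one transposition from "cat"
    System.out.println(correct("caaat"));  // needs two edits to reach "cat"
    System.out.println(correct("zzzzzz")); // nothing nearby, returned as is
  }
}
```

The second stage is where the cost lies: a word of n letters has 54n + 25 single edits, and each of those is expanded again, so the two-edit list for even a five-letter word runs to tens of thousands of strings.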
Although Peter Norvig’s spelling algorithm is much simpler than the algorithm used in ASpell, it works well. I have used Norvig’s spelling algorithm instead of ASpell for one customer
project that had a small, specific vocabulary. We will extend Norvig’s spelling algorithm in the next section to also take advantage of word pair
statistics.