Peter Norvig’s Spelling Algorithm

The class uses a hash table of word counts that is built in a static initializer block from Peter Norvig's training file big.txt (available at http://www.norvig.com/spell-correct.html):

```java
private static HashMap<String, Integer> wordCounts =
    new HashMap<String, Integer>();

static {
  try {
    // Use Peter Norvig's training file big.txt:
    //   http://www.norvig.com/spell-correct.html
    FileInputStream fstream = new FileInputStream("/tmp/big.txt");
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = br.readLine()) != null) {
      List<String> words = Tokenizer.wordsToList(line);
      for (String word : words) {
        if (wordCounts.containsKey(word)) {
          Integer count = wordCounts.get(word);
          wordCounts.put(word, count + 1);
        } else {
          wordCounts.put(word, 1);
        }
      }
    }
    in.close();
  } catch (IOException e) {
    e.printStackTrace();
  }
}
```

The class has two static methods that implement the algorithm. The first method, edits, seen in the following listing, is private and returns a list of permutations for a string containing a word. Permutations are created by removing characters, by reversing the order of two adjacent characters, by replacing single characters with all other characters, and by adding all possible letters to each space between characters in the word:

```java
private static List<String> edits(String word) {
  int wordL = word.length(), wordLm1 = wordL - 1;
  List<String> possible = new ArrayList<String>();
  // drop a character:
  for (int i = 0; i < wordL; ++i) {
    possible.add(word.substring(0, i) + word.substring(i + 1));
  }
  // reverse the order of 2 adjacent characters:
  for (int i = 0; i < wordLm1; ++i) {
    possible.add(word.substring(0, i) + word.substring(i + 1, i + 2)
        + word.substring(i, i + 1) + word.substring(i + 2));
  }
  // replace a character in each location in the word:
  for (int i = 0; i < wordL; ++i) {
    for (char ch = 'a'; ch <= 'z'; ++ch) {
      possible.add(word.substring(0, i) + ch + word.substring(i + 1));
    }
  }
  // add in a character in each location in the word:
  for (int i = 0; i <= wordL; ++i) {
    for (char ch = 'a'; ch <= 'z'; ++ch) {
      possible.add(word.substring(0, i) + ch + word.substring(i));
    }
  }
  return possible;
}
```

Here is a sample test case for the method edits, where we call it with the word "cat" and get a list of 187 permutations:

    [at, ct, ca, act, cta, aat, bat, cat, .., fat, .., cct, cdt, cet, .., caty, catz]

The public static method correct has four
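As a quick sanity check of the listing above, here is a self-contained sketch of the same four edit loops (the class name EditsDemo is ours, not from the book's source). For a word of length n the loops generate n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions, i.e. 54n + 25 strings, which for "cat" (n = 3) is the 187 permutations shown above:

```java
import java.util.ArrayList;
import java.util.List;

public class EditsDemo {
    // Same four edit loops as the listing above.
    static List<String> edits(String word) {
        int n = word.length();
        List<String> possible = new ArrayList<String>();
        for (int i = 0; i < n; ++i)          // drop a character
            possible.add(word.substring(0, i) + word.substring(i + 1));
        for (int i = 0; i < n - 1; ++i)      // swap 2 adjacent characters
            possible.add(word.substring(0, i) + word.charAt(i + 1)
                + word.charAt(i) + word.substring(i + 2));
        for (int i = 0; i < n; ++i)          // replace each character
            for (char ch = 'a'; ch <= 'z'; ++ch)
                possible.add(word.substring(0, i) + ch + word.substring(i + 1));
        for (int i = 0; i <= n; ++i)         // insert at each position
            for (char ch = 'a'; ch <= 'z'; ++ch)
                possible.add(word.substring(0, i) + ch + word.substring(i));
        return possible;
    }

    public static void main(String[] args) {
        // 54 * 3 + 25 = 187 permutations for "cat"
        System.out.println(edits("cat").size());
    }
}
```

Note that the list may contain duplicates and even the original word itself (for example, replacing 'c' with 'c'); that does no harm because the candidates are later scored against the wordCounts table.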
possible return values:

• If the word is in the spelling hash table, simply return the word.
• Generate a permutation list of the input word using the method edits. Build a hash table candidates from the permutation list, with keys being the word counts from the main hash table wordCounts and values being the words from the permutation list. If the hash table candidates is not empty, then return the permutation with the largest key (word count) value.
• For each new word in the permutation list, call the method edits with the word, creating a new candidates hash table with permutations of permutations. If candidates is not empty, then return the word with the highest score.
• Otherwise, return the original word (no suggestions).

```java
public static String correct(String word) {
  if (wordCounts.containsKey(word)) return word;
  List<String> list = edits(word);
  // Candidate hash has word counts as keys, words as values:
  HashMap<Integer, String> candidates =
      new HashMap<Integer, String>();
  for (String testWord : list) {
    if (wordCounts.containsKey(testWord)) {
      candidates.put(wordCounts.get(testWord), testWord);
    }
  }
  // If candidates is not empty, then return the word with the
  // largest key (word count) value:
  if (candidates.size() > 0) {
    return candidates.get(Collections.max(candidates.keySet()));
  }
  // If the edits method does not provide a candidate word that
  // matches, then call edits again with each previous permutation
  // word. Note: this case occurs only about 20% of the time and
  // obviously increases the runtime of the method correct.
  candidates.clear();
  for (String editWords : list) {
    for (String wrd : edits(editWords)) {
      if (wordCounts.containsKey(wrd)) {
        candidates.put(wordCounts.get(wrd), wrd);
      }
    }
  }
  if (candidates.size() > 0) {
    return candidates.get(Collections.max(candidates.keySet()));
  }
  return word;
}
```

Although Peter Norvig's spelling algorithm is much simpler than the algorithm used in ASpell, it works well. I have used Norvig's spelling algorithm instead of ASpell for one customer project that had a small, specific vocabulary.
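The second and third cases both reduce to the same selection step: pick the candidate whose training count is the map's largest key. Here is a minimal, self-contained sketch of just that step, using a hand-filled candidates map with made-up counts (the class name and the numbers are ours, for illustration only):

```java
import java.util.Collections;
import java.util.HashMap;

public class BestCandidateDemo {
    // Pick the candidate with the largest word count, as correct() does.
    // Because the counts are the map keys, two candidates that happen to
    // have the same training count collide and only one of them survives;
    // with a large training corpus this is rarely a practical problem.
    static String best(HashMap<Integer, String> candidates) {
        return candidates.get(Collections.max(candidates.keySet()));
    }

    public static void main(String[] args) {
        HashMap<Integer, String> candidates =
            new HashMap<Integer, String>();
        candidates.put(12, "cut");    // hypothetical training counts
        candidates.put(852, "cat");
        candidates.put(91, "coat");
        System.out.println(best(candidates));   // cat
    }
}
```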
We will extend Norvig’s spelling algorithm in the next section to also take advantage of word pair statistics.

9.6.3 Extending the Norvig Algorithm by Using Word Pair Statistics

It is possible to use statistics for which words commonly appear together to improve spelling suggestions. In my experience this is only worthwhile when applications have two traits:

1. The vocabulary for the application is specialized. For example, a social networking site for people interested in boating might want a more accurate spelling system than one that has to handle more general English text. In this example, common word pairs might be multi-word boat and manufacturer names, boating locations, etc.

2. There is a very large amount of text in this limited subject area to use for training. This is because there will be many more combinations of word pairs than single words, and a very large training set helps to determine which pairs are genuinely common rather than just coincidental.

We will proceed in a similar fashion to the implementation in the last section, but we will also keep an additional hash table containing counts for word pairs. Since there will be many more word pair combinations than single words, you should expect both the memory requirements and the CPU time for training to be much larger. For one project, there was so much training data that I ended up having to use disk-based hash tables to store word pair counts. To make this training process take less time and less memory to hold the large word combination hash table, we will edit the input file big.txt from the last section, deleting the 1200 lines that contain random words added to the end of the Project Gutenberg texts. Furthermore, we will experiment with an even smaller version of this file, renamed small.txt, that is about ten percent of the size of the original training file. Because we are using a smaller training set we should expect marginal results. For your own projects you should use as much data as possible.
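The core idea of keeping a second hash table of adjacent-pair counts can be sketched in a few lines. This stand-alone example is independent of the book's classes; the token array is made up purely for illustration:

```java
import java.util.HashMap;

public class PairCountDemo {
    // Count adjacent word pairs in an already-tokenized stream,
    // keyed by "first second". This is the extra table kept
    // alongside the single-word counts described above.
    static HashMap<String, Integer> pairCounts(String[] tokens) {
        HashMap<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < tokens.length - 1; ++i) {
            String pair = tokens[i] + " " + tokens[i + 1];
            Integer c = counts.get(pair);
            counts.put(pair, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "boat", "dock", "near", "the", "boat"};
        // "the boat" occurs twice in this tiny stream
        System.out.println(pairCounts(tokens).get("the boat"));
    }
}
```

A stream of n tokens yields n - 1 pairs, and the number of *distinct* pairs grows much faster than the number of distinct words, which is why the memory and training-time costs discussed above are so much higher.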
In principle, when we collect a word pair hash table where the hash values are the number of times a word pair occurs in the training text, we would want to be sure that we do not collect word pairs across sentence boundaries or across separate phrases occurring inside parentheses, etc. For example, consider the following text fragment:

    He went to Paris. The weather was warm.

Optimally, we would not want to collect statistics on word or token pairs like "Paris ." or "Paris The" that include the final period of a sentence or span a sentence boundary. In a practical sense, since we will be discarding seldom-occurring word pairs, it does not matter too much, so in our example we will collect all tokenized word pairs at the same time that we collect single word frequency statistics:

```java
Pattern p = Pattern.compile("[,.'\";:\\s]+");
Scanner scanner = new Scanner(new File("/tmp/small.txt"));
scanner.useDelimiter(p);
String last = "ahjhjhdsgh";
while (scanner.hasNext()) {
  String word = scanner.next();
  if (wordCounts.containsKey(word)) {
    Integer count = wordCounts.get(word);
    wordCounts.put(word, count + 1);
  } else {
    wordCounts.put(word, 1);
  }
  String pair = last + " " + word;
  if (wordPairCounts.containsKey(pair)) {
    Integer count = wordPairCounts.get(pair);
    wordPairCounts.put(pair, count + 1);
  } else {
    wordPairCounts.put(pair, 1);
  }
  last = word;
}
scanner.close();
```

For the first page of text in the test file, if we print out the word pairs that occur at least two times using this code:

```java
for (String pair : wordPairCounts.keySet()) {
  if (wordPairCounts.get(pair) > 1) {
    System.out.println(pair + ": " + wordPairCounts.get(pair));
  }
}
```

then we get this output: