Extending the Norvig Algorithm by Using Word Pair Statistics

Consider this sample text:

He went to Paris. The weather was warm.

Ideally, we would not want to collect statistics on word or token pairs like "Paris ." or "Paris The" that include the final period of a sentence or span a sentence boundary. In practice, since we will be discarding seldom-occurring word pairs, it does not matter too much, so in our example we will collect all tokenized word pairs at the same time that we collect single-word frequency statistics:

    Pattern p = Pattern.compile("[,.'\";:\\s]+");
    Scanner scanner = new Scanner(new File("/tmp/small.txt"));
    scanner.useDelimiter(p);
    String last = "ahjhjhdsgh";  // an unlikely first "previous word"
    while (scanner.hasNext()) {
        String word = scanner.next();
        if (wordCounts.containsKey(word)) {
            Integer count = wordCounts.get(word);
            wordCounts.put(word, count + 1);
        } else {
            wordCounts.put(word, 1);
        }
        String pair = last + " " + word;
        if (wordPairCounts.containsKey(pair)) {
            Integer count = wordPairCounts.get(pair);
            wordPairCounts.put(pair, count + 1);
        } else {
            wordPairCounts.put(pair, 1);
        }
        last = word;
    }
    scanner.close();

For the first page of text in the test file, if we print out the word pairs that occur at least two times using this code:

    for (String pair : wordPairCounts.keySet()) {
        if (wordPairCounts.get(pair) > 1) {
            System.out.println(pair + ": " + wordPairCounts.get(pair));
        }
    }

then we get this output:

    Arthur Conan: 3
    by Sir: 2
    of Sherlock: 2
    Project Gutenberg: 5
    how to: 2
    The Adventures: 2
    Sherlock Holmes: 2
    Sir Arthur: 3
    Adventures of: 2
    information about: 2
    Conan Doyle: 3

The words "Conan" and "Doyle" tend to appear together frequently. If we want to suggest spelling corrections for "the author Conan Doyyle wrote," it seems intuitive that we can prefer the correction "Doyle": if we take the list of possible corrections for "Doyyle" and combine each with the preceding word "Conan" in the text, we notice that the hash table wordPairCounts has a relatively high count for the key "Conan Doyle", a single string containing a word pair.

In theory this may look like a good approach, but there are a few things that keep this technique from being generally practical:

• It is computationally expensive to train the system on a large training text.
• It is computationally more expensive to perform spelling suggestions.
• The results are not likely to be much better than the single-word approach unless the text is in one narrow domain and you have a lot of training text.

In the example of misspelling Doyyle, calling the method edits:

    edits("Doyyle")

returns a list with 349 elements. The method edits is identical to the one in the single-word spelling corrector in the last section. I changed the method correct by adding an argument for the previous word, factoring in statistics from the word pair count hash table, and, for this example, by not calculating "edits of edits" as we did in the last section.
Here is the modified code:

    public String correct(String word, String previous_word) {
        if (wordCounts.containsKey(word)) return word;
        List<String> list = edits(word);
        // The candidates hash table has word counts as keys
        // and candidate words as values:
        HashMap<Integer, String> candidates =
            new HashMap<Integer, String>();
        for (String testWord : list) {
            // Look for word pairs with testWord in the second position:
            String word_pair = previous_word + " " + testWord;
            int count_from_1_word = 0;
            int count_from_word_pairs = 0;
            if (wordCounts.containsKey(testWord)) {
                count_from_1_word += wordCounts.get(testWord);
                candidates.put(wordCounts.get(testWord), testWord);
            }
            if (wordPairCounts.containsKey(word_pair)) {
                count_from_word_pairs += wordPairCounts.get(word_pair);
            }
            // Look for word pairs with testWord in the first position:
            word_pair = testWord + " " + previous_word;
            if (wordPairCounts.containsKey(word_pair)) {
                count_from_word_pairs += wordPairCounts.get(word_pair);
            }
            int sum = count_from_1_word + count_from_word_pairs;
            if (sum > 0) {
                candidates.put(sum, testWord);
            }
        }
        // If candidates is not empty, return the word with the
        // largest key (word count) value:
        if (candidates.size() > 0) {
            return candidates.get(Collections.max(candidates.keySet()));
        }
        return word;
    }

Using word pair statistics can be a good technique if you need to build an automated spelling corrector that only needs to work on text in one subject area. You will need a lot of training text in your subject area and should be prepared for the extra work of performing the training: as I mentioned before, for one customer project I could not fit the word pair hash table in memory on the server that I had to use, so I used a disk-based hash table, and the training run took a long while. Another good alternative for building systems that handle text in one subject area is to augment a standard spelling library like ASpell or Jazzy with custom word dictionaries.
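As a closing illustration of the pair-count idea, here is a small self-contained sketch that scores two single-edit candidates the same way correct does, by summing single-word and word pair counts. It is separate from the corrector class above; the counts, the competing candidate "Dole", and the class name are invented for illustration:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordPairPreferenceDemo {
        public static void main(String[] args) {
            // Invented counts standing in for a trained corrector's tables:
            Map<String, Integer> wordCounts = new HashMap<String, Integer>();
            wordCounts.put("Doyle", 3);
            wordCounts.put("Dole", 4);  // more common as a single word

            Map<String, Integer> wordPairCounts =
                new HashMap<String, Integer>();
            wordPairCounts.put("Conan Doyle", 3);

            String previousWord = "Conan";
            List<String> candidates = Arrays.asList("Doyle", "Dole");

            String best = null;
            int bestScore = 0;
            for (String candidate : candidates) {
                // Same scoring idea as correct(): single-word count plus
                // the count for the (previous word, candidate) pair.
                int score = wordCounts.getOrDefault(candidate, 0)
                    + wordPairCounts.getOrDefault(
                          previousWord + " " + candidate, 0);
                if (score > bestScore) {
                    bestScore = score;
                    best = candidate;
                }
            }
            // Prints "Doyle": 3 + 3 beats Dole's 4 + 0.
            System.out.println(best);
        }
    }

Even though "Dole" is the more frequent single word in this toy data, the pair count for "Conan Doyle" tips the score in favor of "Doyle", which is exactly the behavior we wanted from the extended corrector.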

9.7 Hidden Markov Models

We used a set of rules in Section 9.1 to assign part of speech tags to words in English text. The rules that we used were a subset of the automatically generated rules that Eric Brill's machine learning thesis project produced. His thesis work used Markov modeling to calculate the most likely tag of words, given the preceding words. He then generated tagging rules, some of which we saw in Section 9.1, where Brill's published results of the most useful learned rules made writing a fast tagger relatively easy.

In this section we will use word-use statistics to assign word type tags to each word in input text. We will look in some detail at one of the most popular approaches to tagging text: building Hidden Markov Models (HMM) and then evaluating these models against input text to assign word use (or part of speech) tags to words.

A complete coverage of the commonly used techniques for training and using HMM is beyond the scope of this section. A full reference for these training techniques is Foundations of Statistical Natural Language Processing [Manning, Schütze, 1999]. We will discuss the training algorithms and sample Java code that implements HMM. The example in this chapter is purposely pedantic: the example code is intended to be easy to understand and experiment with.

In Hidden Markov Models (HMM), we speak of an observable sequence of events that moves a system through a series of states. We attempt to assign transition probabilities based on the recent history of states of the system or, equivalently, the last few events.

In this example, we want to develop an HMM that attempts to assign part of speech tags to English text. To train an HMM, we will assume that we have a large set of training data that is a sequence of words and a parallel sequence of manually assigned part of speech tags. We will see an example of this marked-up training text, which looks like "John/NNP chased/VB the/DT dog/NN", later in this section.

For developing a sample Java program to learn how to train an HMM, we assume that we have two Java lists, words and tags, that are of the same length. So, we will have one list of words like ["John", "chased", "the", "dog"] and an associated list of part of speech tags like ["NNP", "VB", "DT", "NN"]. Once the HMM is trained, we will write another method, testModel, that takes as input a Java vector of words and returns a Java vector of calculated part of speech tags.

We now describe the assumptions made for Markov Models and Hidden Markov Models using this part of speech tagging problem. First, assume that the desired
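To make the training data format concrete, here is a minimal self-contained sketch that splits marked-up text like "John/NNP chased/VB the/DT dog/NN" into the two parallel lists described above. It assumes slash-delimited word/tag tokens; the class name TaggedTextReader is arbitrary and used only for this illustration:

    import java.util.ArrayList;
    import java.util.List;

    public class TaggedTextReader {
        public static void main(String[] args) {
            String training = "John/NNP chased/VB the/DT dog/NN";
            List<String> words = new ArrayList<String>();
            List<String> tags = new ArrayList<String>();
            for (String token : training.split("\\s+")) {
                // Split each token at the last '/' so the word goes in
                // one list and its tag goes in the parallel list:
                int slash = token.lastIndexOf('/');
                words.add(token.substring(0, slash));
                tags.add(token.substring(slash + 1));
            }
            System.out.println(words);  // [John, chased, the, dog]
            System.out.println(tags);   // [NNP, VB, DT, NN]
        }
    }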