Tokenizing, Stemming, and Part of Speech Tagging Text

Before we can process any text we need to break it into individual tokens. Tokens can be words, numbers, and punctuation symbols. The class Tokenizer has two static methods; both take an input string to tokenize and return a list of token strings. The second method has an extra argument that specifies the maximum number of tokens that you want returned:

    static public List<String> wordsToList(String s)
    static public List<String> wordsToList(String s, int maxR)

The following listing shows a fragment of example code using this class:

    String text = "The ball, rolling quickly, went down the hill.";
    List<String> tokens = Tokenizer.wordsToList(text);
    System.out.println(text);
    // print each token in double quotes so token boundaries are visible
    for (String token : tokens)
        System.out.print("\"" + token + "\" ");
    System.out.println();

This code fragment produces the following output:

    The ball, rolling quickly, went down the hill.
    "The" "ball" "," "rolling" "quickly" "," "went" "down" "the" "hill" "."

For many applications, it is better to "stem" word tokens to simplify comparison of similar words. For example "run," "runs," and "running" all stem to "run." The stemmer that we will use, which I believe to be in the public domain, is in the file src/public_domain/Stemmer.java. There are two convenient APIs defined at the end of the class, one to stem a string of multiple words and one to stem a single word token:

    public List<String> stemString(String str)
    public String stemOneWord(String word)

We will use both the FastTag and Stemmer classes often in the remainder of this chapter.

The FastTag project resulted from my using the excellent tagger written by Eric Brill while he was at the University of Pennsylvania. He used machine learning techniques to learn transition rules for tagging text, using manually tagged text as training examples. In reading through his doctoral thesis I noticed that a few transition rules covered most of the cases, and I implemented a simple "fast tagger" in Common Lisp, Ruby, Scheme and Java.
The Java version is in the file src/com/knowledgebooks/nlp/fasttag/FastTag.java. The file src/com/knowledgebooks/nlp/fasttag/README.txt contains information on where to obtain Eric Brill's original tagging system and also defines the tags for both his English language lexicon and the Medpost lexicon. Table 9.1 shows the most commonly used tags (see the README.txt file for a complete description).

    Tag    Description              Examples
    ----   ----------------------   ------------------
    NN     singular noun            dog
    NNS    plural noun              dogs
    NNP    singular proper noun     California
    NNPS   plural proper noun       Watsons
    CC     conjunction              and, but, or
    CD     cardinal number          one, two
    DT     determiner               the, some
    IN     preposition              of, in, by
    JJ     adjective                large, small, green
    JJR    comparative adjective    bigger
    JJS    superlative adjective    biggest
    PP     proper pronoun           I, he, you
    RB     adverb                   slowly
    RBR    comparative adverb       slower
    RP     particle                 up, off
    VB     verb                     eat
    VBN    past participle verb     eaten
    VBG    gerund verb              eating
    VBZ    present verb             eats
    WP     wh pronoun               who, what
    WDT    wh determiner            which, that

    Table 9.1: Most commonly used part of speech tags

Brill's system worked by processing manually tagged text and then creating a list of words followed by the tags found for each word. Here are a few random lines selected from the test_data/lexicon.txt file:

    Arco NNP
    Arctic NNP JJ
    fair JJ NN RB

Here "Arco" is a proper noun because it is the name of a corporation. The word "Arctic" can be either a proper noun or an adjective; it is used most frequently as a proper noun, so the tag "NNP" is listed before "JJ." The word "fair" can be an adjective, a singular noun, or an adverb.

The class Tagger reads the lexicon file either as a resource stream (if, for example, you put lexicon.txt in the same JAR file as the compiled Tagger and Tokenizer class files) or as a local file. Each line in the lexicon.txt file is passed through the utility method parseLine, which uses the first token in the line as a hash key and places the remaining tokens in an array that is the hash value.
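The lexicon loading just described can be sketched in a few lines. This is a minimal illustration, not the book's Tagger source: the class name LexiconSketch, the helper load, and the parseLine signature are assumptions made for the example, and it reads lines from memory rather than from a resource stream or file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of lexicon parsing; names and signatures are assumptions.
class LexiconSketch {

    // The first token on the line becomes the hash key; the remaining
    // tokens (the possible tags) become the array stored as the hash value.
    static void parseLine(String line, Map<String, String[]> lexicon) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) return; // skip blank lines and lines with no tags
        String[] tags = new String[parts.length - 1];
        System.arraycopy(parts, 1, tags, 0, tags.length);
        lexicon.put(parts[0], tags);
    }

    // Build a lexicon from in-memory lines (stands in for reading lexicon.txt).
    static Map<String, String[]> load(String... lines) {
        Map<String, String[]> lexicon = new HashMap<>();
        for (String line : lines) parseLine(line, lexicon);
        return lexicon;
    }

    public static void main(String[] args) {
        Map<String, String[]> lexicon =
            load("Arco NNP", "Arctic NNP JJ", "fair JJ NN RB");
        System.out.println(String.join(" ", lexicon.get("fair"))); // JJ NN RB
        System.out.println(lexicon.get("Arctic")[0]);              // NNP
    }
}
```

Keeping every tag in the array, rather than only the first, costs little memory and leaves room for rules that need the alternative tags.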
So, we would process the line "fair JJ NN RB" as a hash key of "fair" and the hash value would be the array of strings (only the first value is currently used but I keep the other values for future use):

    [JJ, NN, RB]

When the tagger is processing a list of word tokens, it looks each token up in the hash table and stores the first possible tag type for the word. In our example, the word "fair" would be assigned (possibly temporarily) the tag "JJ." We now have a list of word tokens and an associated list of possible tag types. We now loop through all of the word tokens, applying eight transition rules that Eric Brill's system learned. We will look at the first rule in some detail; i is the loop variable in the range [0, number of word tokens - 1] and word is the current value at index i (note that the code compares word against tag names, so it holds the tag assigned so far rather than the token text):

    // rule 1: DT, {VBD | VBP} --> DT, NN
    if (i > 0 && ret.get(i - 1).equals("DT")) {
        if (word.equals("VBD") ||
            word.equals("VBP") ||
            word.equals("VB")) {
            ret.set(i, "NN");
        }
    }

In English, this rule states that if a determiner (DT) at word token index i - 1 is followed by either a past tense verb (VBD) or a present tense verb (VBP), then replace the tag type at index i with "NN." The code also accepts the tag VB. I list the remaining seven rules in a short syntax here and you can look at the Java source code to see how they are implemented:

    rule 2: convert a noun to a number (CD) if "." appears in the word
    rule 3: convert a noun to a past participle if words.get(i) ends with "ed"
    rule 4: convert any type to adverb if it ends in "ly"
    rule 5: convert a common noun (NN or NNS) to an adjective if it ends with "al"
    rule 6: convert a noun to a verb if the preceding word is "would"
    rule 7: if a word has been categorized as a common noun and it ends with "s", then set its type to plural common noun (NNS)
    rule 8: convert a common noun to a present participle verb (i.e., a gerund)

My FastTag tagger is not quite as accurate as Brill's original tagger, so you might want to use his system, which is written in C but can be executed from Java as an external process or through a JNI interface.
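The two-pass approach described above (look up an initial tag, then apply transition rules) can be sketched as follows. This is a minimal illustration under stated assumptions, not the FastTag source: the class name RuleTaggerSketch, the method tag, and the tiny lexicon in main are hypothetical. Assigning "NN" to words missing from the lexicon is also an assumption of this sketch, and only rules 1 and 4 are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of lexicon lookup plus two transition rules.
class RuleTaggerSketch {

    static List<String> tag(List<String> words, Map<String, String[]> lexicon) {
        List<String> ret = new ArrayList<>();

        // Pass 1: take the first (most frequent) tag listed for each word.
        // Assumption: words not in the lexicon start out as "NN".
        for (String w : words) {
            String[] tags = lexicon.get(w);
            ret.add(tags == null ? "NN" : tags[0]);
        }

        // Pass 2: apply transition rules to the provisional tags.
        for (int i = 0; i < words.size(); i++) {
            String current = ret.get(i);

            // rule 1: DT, {VBD | VBP | VB} --> DT, NN
            if (i > 0 && ret.get(i - 1).equals("DT")
                    && (current.equals("VBD") || current.equals("VBP")
                        || current.equals("VB"))) {
                ret.set(i, "NN");
            }

            // rule 4: any word ending in "ly" becomes an adverb
            if (words.get(i).endsWith("ly")) {
                ret.set(i, "RB");
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        Map<String, String[]> lexicon = new HashMap<>();
        lexicon.put("the", new String[] {"DT"});
        lexicon.put("walk", new String[] {"VB", "NN"});
        lexicon.put("ended", new String[] {"VBD"});

        List<String> words = Arrays.asList("the", "walk", "ended", "quickly");
        System.out.println(tag(words, lexicon)); // [DT, NN, VBD, RB]
    }
}
```

In this example "walk" first receives the tag VB from the lexicon and is then corrected to NN by rule 1 because it follows a determiner, while "quickly" is absent from the lexicon and is tagged RB by rule 4.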
In the next section we will use the tokenizer, stemmer, and tagger from this section to develop a system for identifying named entities in text.

9.2 Named Entity Extraction From Text

In this section we will look at identifying names of people and places in text. This can be useful for automatically tagging news articles with the people and place names that occur in the articles. The "secret sauce" for identifying names and places in text is the data in the file test_data/propername.ser, a serialized Java data file containing hash tables for human and place names. This data is read in the constructor for the class Names; it is worthwhile looking at the code if you have not used the Java serialization APIs before:

    // ins is the input stream opened on the serialized data file
    ObjectInputStream p = new ObjectInputStream(ins);
    Hashtable lastNameHash = (Hashtable) p.readObject();
    Hashtable firstNameHash = (Hashtable) p.readObject();
    Hashtable placeNameHash = (Hashtable) p.readObject();
    Hashtable prefixHash = (Hashtable) p.readObject();

If you want to see these data values, use code like

    // enumerate the keys of the place name table
    Enumeration keysE = placeNameHash.keys();
    while (keysE.hasMoreElements()) {
        Object key = keysE.nextElement();
        System.out.println(key + " : " + placeNameHash.get(key));
    }

to see data values like the following:

    Mauritius : country
    Port-Vila : country_capital
    Hutchinson : us_city
    Mississippi : us_state
    Lithuania : country

Before we look at the entity extraction code and how it works, we will first look at an example of using the main APIs for the Names class. The following example uses the methods isPlaceName, isHumanName, and getProperNames (names is an instance of the Names class):

    System.out.println("Los Angeles: " +
        names.isPlaceName("Los Angeles"));
    System.out.println("President Bush: " +
        names.isHumanName("President Bush"));
    System.out.println("President George Bush: " +
        names.isHumanName("President George Bush"));
    System.out.println("President George W. Bush: " +
        names.isHumanName("President George W. Bush"));
    ScoredList[] ret = names.getProperNames(
        "George Bush played golf. President " +
        "George W. Bush went to London England, " +
        "and Mexico to see Mary " +
        "Smith in Moscow. President Bush will " +
        "return home Monday.");
    System.out.println("Human names: " +
        ret[0].getValuesAsString());
    System.out.println("Place names: " +
        ret[1].getValuesAsString());

The output from running this example is:

    Los Angeles: true
    President Bush: true
    President George Bush: true
    President George W. Bush: true
    place name: London, placeNameHash.get(name): country_capital
    place name: Mexico, placeNameHash.get(name): country_capital
    place name: Moscow,