Tokenizing, Stemming, and Part of Speech Tagging Text
Before we can process any text, we need to break it into individual tokens. Tokens can be words, numbers, and punctuation symbols. The class Tokenizer has two static methods; both take an input string to tokenize and return a list of token strings. The second method has an extra argument to specify the maximum number of tokens that you want returned:

    static public List<String> wordsToList(String s)
    static public List<String> wordsToList(String s, int maxR)

The following listing shows a fragment of example code using this class, followed by its output:

    String text = "The ball, rolling quickly, went down the hill.";
    List<String> tokens = Tokenizer.wordsToList(text);
    System.out.println(text);
    for (String token : tokens)
        System.out.print("\"" + token + "\" ");
    System.out.println();

This code fragment produces the following output:

    The ball, rolling quickly, went down the hill.
    "The" "ball" "," "rolling" "quickly" "," "went" "down" "the" "hill" "."

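To make the tokenizing step concrete, here is a minimal sketch of how a tokenizer with these two method signatures might be written. It is not the Tokenizer class from the book's source code: the class name SimpleTokenizer and the regular expression are assumptions chosen only to reproduce the behavior shown above, where words and punctuation symbols become separate tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the book's Tokenizer class (illustration only).
class SimpleTokenizer {
    // Assumption: a token is a run of letters/digits, or one non-space symbol.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z0-9]+|[^A-Za-z0-9\\s]");

    // Tokenize the whole string.
    static List<String> wordsToList(String s) {
        return wordsToList(s, Integer.MAX_VALUE);
    }

    // Tokenize, returning at most maxR tokens.
    static List<String> wordsToList(String s, int maxR) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (tokens.size() < maxR && m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The ball, rolling quickly, went down the hill.";
        for (String token : wordsToList(text)) {
            System.out.print("\"" + token + "\" ");
        }
        System.out.println();
        System.out.println(wordsToList(text, 3).size()); // prints 3
    }
}
```

The one-argument method simply delegates to the two-argument method with no effective limit, so the splitting logic lives in one place.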
For many applications, it is better to “stem” word tokens to simplify comparison of similar words. For example, “run,” “runs,” and “running” all stem to “run.” The stemmer that we will use, which I believe to be in the public domain, is in the file src/public_domain/Stemmer.java. There are two convenient APIs defined at the end of the class, one to stem a string of multiple words and one to stem a single word token:

    public List<String> stemString(String str)
    public String stemOneWord(String word)

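To show what stemming does, the following toy sketch strips two common suffixes. It is not the algorithm in Stemmer.java, which is considerably more thorough; the class name ToyStemmer and its two suffix rules are assumptions made up for this illustration, and only the method names mirror the APIs above.

```java
import java.util.ArrayList;
import java.util.List;

// Toy suffix stripper (illustration only, not the book's Stemmer class).
class ToyStemmer {
    public String stemOneWord(String word) {
        String w = word.toLowerCase();
        if (w.length() > 5 && w.endsWith("ing")) {
            w = w.substring(0, w.length() - 3);          // "running" -> "runn"
            int n = w.length();
            // Undo consonant doubling: "runn" -> "run"
            if (n >= 2 && w.charAt(n - 1) == w.charAt(n - 2)
                    && "aeiou".indexOf(w.charAt(n - 1)) < 0) {
                w = w.substring(0, n - 1);
            }
        } else if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);          // "runs" -> "run"
        }
        return w;
    }

    // Stem every whitespace-separated word in a string.
    public List<String> stemString(String str) {
        List<String> stems = new ArrayList<>();
        for (String token : str.trim().split("\\s+")) {
            if (!token.isEmpty()) stems.add(stemOneWord(token));
        }
        return stems;
    }

    public static void main(String[] args) {
        System.out.println(new ToyStemmer().stemString("run runs running"));
    }
}
```

Two rules are enough to map “run,” “runs,” and “running” to the same stem, but they will mis-stem many other words (for example, “falling” becomes “fal”), which is why a full stemmer is used in practice.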
We will use both the FastTag and Stemmer classes often in the remainder of this chapter.
The FastTag project resulted from my using the excellent tagger written by Eric Brill while he was at the University of Pennsylvania. He used machine learning
techniques to learn transition rules for tagging text using manually tagged text as training examples. In reading through his doctoral thesis I noticed that there were a
few transition rules that covered most of the cases and I implemented a simple “fast tagger” in Common Lisp, Ruby, Scheme and Java. The Java version is in the file
src/com/knowledgebooks/nlp/fasttag/FastTag.java.
The file src/com/knowledgebooks/nlp/fasttag/README.txt contains information on where to obtain Eric Brill’s original tagging system and also defines the tags for both his English language lexicon and the Medpost lexicon. Table 9.1 shows the most commonly used tags (see the README.txt file for a complete description).

    Tag    Description             Examples
    NN     singular noun           dog
    NNS    plural noun             dogs
    NNP    singular proper noun    California
    NNPS   plural proper noun      Watsons
    CC     conjunction             and, but, or
    CD     cardinal number         one, two
    DT     determiner              the, some
    IN     preposition             of, in, by
    JJ     adjective               large, small, green
    JJR    comparative adjective   bigger
    JJS    superlative adjective   biggest
    PP     personal pronoun        I, he, you
    RB     adverb                  slowly
    RBR    comparative adverb      slower
    RP     particle                up, off
    VB     verb                    eat
    VBN    past participle verb    eaten
    VBG    gerund verb             eating
    VBZ    present verb            eats
    WP     wh pronoun              who, what
    WDT    wh determiner           which, that

Table 9.1: Most commonly used part of speech tags

Brill’s system worked by processing manually tagged text and then creating a list of words followed by the tags found for each word. Here are a few random lines selected from the test_data/lexicon.txt file:

    Arco NNP
    Arctic NNP JJ
    fair JJ NN RB

Here “Arco” is a proper noun because it is the name of a corporation. The word “Arctic” can be either a proper noun or an adjective; it is used most frequently as
a proper noun so the tag “NNP” is listed before “JJ.” The word “fair” can be an adjective, singular noun, or an adverb.
The class Tagger reads the lexicon file either as a resource stream (if, for example, you put lexicon.txt in the same JAR file as the compiled Tagger and Tokenizer class files) or as a local file. Each line in the lexicon.txt file is passed through the utility method parseLine, which processes an input string using the first token in the line as a hash key and places the remaining tokens in an array that is the hash value. So, we would process the line “fair JJ NN RB” as a hash key of “fair” and the hash value would be the array of strings (only the first value is currently used, but I keep the other values for future use):

    ["JJ", "NN", "RB"]

When the tagger is processing a list of word tokens, it looks each token up in the hash table and stores the first possible tag type for the word. In our example, the word “fair” would be assigned (possibly temporarily) the tag “JJ.” We now have a list of word tokens and an associated list of possible tag types. We now loop through all of the word tokens, applying eight transition rules that Eric Brill’s system learned. We will look at the first rule in some detail; i is the loop variable in the range [0, number of word tokens - 1] and word is the current word at index i:

    // rule 1: DT, {VBD | VBP} --> DT, NN
    if (i > 0 && ret.get(i - 1).equals("DT")) {
        if (word.equals("VBD") || word.equals("VBP") ||
            word.equals("VB")) {
            ret.set(i, "NN");
        }
    }

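In English, this rule states that if a determiner (DT) at word token index i − 1 is followed by either a past tense verb (VBD) or a present tense verb (VBP), then replace the tag type at index i with “NN.” The code also applies the rule when the tag is VB. Note that word is compared against tag names, so in this fragment it evidently holds the tag currently stored in the list ret at index i.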
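The lexicon lookup and rule 1 described above can be sketched end to end as follows. This is a simplified illustration, not the FastTag source: the class name LexiconSketch, the default tag of “NN” for words missing from the lexicon, and the lexicon entries for “the” and “walk” are assumptions made for the example (only the “fair” line comes from the lexicon excerpt shown earlier).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch of lexicon loading, first-tag lookup, and rule 1.
class LexiconSketch {
    private final Map<String, String[]> lexicon = new HashMap<>();

    // First token is the hash key; the remaining tokens are the possible tags.
    void parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) return;  // skip blank or malformed lines
        lexicon.put(parts[0], Arrays.copyOfRange(parts, 1, parts.length));
    }

    List<String> tag(List<String> words) {
        List<String> ret = new ArrayList<>();
        for (String w : words) {
            String[] tags = lexicon.get(w);
            // Assumption: unknown words default to NN.
            ret.add(tags == null ? "NN" : tags[0]);
        }
        for (int i = 0; i < ret.size(); i++) {
            String tag = ret.get(i);
            // rule 1: DT, {VBD | VBP | VB} --> DT, NN
            if (i > 0 && ret.get(i - 1).equals("DT")
                    && (tag.equals("VBD") || tag.equals("VBP") || tag.equals("VB"))) {
                ret.set(i, "NN");
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        LexiconSketch t = new LexiconSketch();
        t.parseLine("fair JJ NN RB");  // from the lexicon excerpt
        t.parseLine("the DT");         // made-up entry for this example
        t.parseLine("walk VB NN");     // made-up entry for this example
        System.out.println(t.tag(Arrays.asList("the", "walk"))); // [DT, NN]
        System.out.println(t.tag(Arrays.asList("fair")));        // [JJ]
    }
}
```

In the first call, “walk” is initially tagged VB because that is the first tag in its lexicon entry; rule 1 then changes it to NN because the preceding tag is DT.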
I list the remaining seven rules in a short syntax here and you can look at the Java source code to see how they are implemented:

    rule 2: convert a noun to a number (CD) if "." appears in the word
    rule 3: convert a noun to a past participle if words.get(i) ends with "ed"
    rule 4: convert any type to adverb if it ends in "ly"
    rule 5: convert a common noun (NN or NNS) to an adjective if it ends with "al"
    rule 6: convert a noun to a verb if the preceding word is "would"
    rule 7: if a word has been categorized as a common noun and it ends with "s", then set its type to plural common noun (NNS)
    rule 8: convert a common noun to a present participle verb (i.e., a gerund)

My FastTag tagger is not quite as accurate as Brill’s original tagger, so you might want to use his system, which is written in C but can be executed from Java as an external process or through a JNI interface.
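As a rough illustration of how rules 2 through 7 listed above could be applied in a single pass, consider the sketch below. It is my reading of the short rule descriptions, not the FastTag source: the class and method names are hypothetical, “a noun” is interpreted as any tag starting with “NN,” and the MD (modal) tag used for “would” in the example does not appear in Table 9.1. Rule 8 is left out because the list above does not state what triggers it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of rules 2-7; ret holds one tag per word.
class SuffixRulesSketch {
    static List<String> applyRules(List<String> words, List<String> initialTags) {
        List<String> ret = new ArrayList<>(initialTags);
        for (int i = 0; i < words.size(); i++) {
            String w = words.get(i);
            // rule 2: noun -> number if "." appears in the word
            if (ret.get(i).startsWith("NN") && w.contains(".")) ret.set(i, "CD");
            // rule 3: noun -> past participle if the word ends with "ed"
            if (ret.get(i).startsWith("NN") && w.endsWith("ed")) ret.set(i, "VBN");
            // rule 4: any type -> adverb if the word ends in "ly"
            if (w.endsWith("ly")) ret.set(i, "RB");
            // rule 5: common noun -> adjective if the word ends with "al"
            if ((ret.get(i).equals("NN") || ret.get(i).equals("NNS"))
                    && w.endsWith("al")) ret.set(i, "JJ");
            // rule 6: noun -> verb if the preceding word is "would"
            if (i > 0 && ret.get(i).startsWith("NN")
                    && words.get(i - 1).equalsIgnoreCase("would")) ret.set(i, "VB");
            // rule 7: common noun ending with "s" -> plural common noun
            if (ret.get(i).equals("NN") && w.endsWith("s")) ret.set(i, "NNS");
        }
        return ret;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "would", "walk", "quickly", "national", "jumped", "dogs", "3.14");
        // Assumed initial tags: MD for "would", NN for everything else.
        List<String> tags = Arrays.asList("MD", "NN", "NN", "NN", "NN", "NN", "NN");
        System.out.println(applyRules(words, tags));
        // [MD, VB, RB, JJ, VBN, NNS, CD]
    }
}
```

Because each rule reads the tag left by the rules before it, the order of the checks matters; here they run in the same order as the list.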
In the next section we will use the tokenizer, stemmer, and tagger from this section to develop a system for identifying named entities in text.