Named Entity Extraction From Text

to see data values like the following:

    Mauritius : country
    Port-Vila : country_capital
    Hutchinson : us_city
    Mississippi : us_state
    Lithuania : country

Before we look at the entity extraction code and how it works, we will first look at an example of using the main APIs for the Names class. The following example uses the methods isPlaceName, isHumanName, and getProperNames:

    System.out.println("Los Angeles: " +
                       names.isPlaceName("Los Angeles"));
    System.out.println("President Bush: " +
                       names.isHumanName("President Bush"));
    System.out.println("President George Bush: " +
                       names.isHumanName("President George Bush"));
    System.out.println("President George W. Bush: " +
                       names.isHumanName("President George W. Bush"));
    ScoredList[] ret = names.getProperNames(
        "George Bush played golf. President " +
        "George W. Bush went to London England, " +
        "and Mexico to see Mary " +
        "Smith in Moscow. President Bush will " +
        "return home Monday.");
    System.out.println("Human names: " +
                       ret[0].getValuesAsString());
    System.out.println("Place names: " +
                       ret[1].getValuesAsString());

The output from running this example is:

    Los Angeles: true
    President Bush: true
    President George Bush: true
    President George W. Bush: true
    place name: London, placeNameHash.get(name): country_capital
    place name: Mexico, placeNameHash.get(name): country_capital
    place name: Moscow, placeNameHash.get(name): country_capital
    Human names: George Bush:1, President George W . Bush:1, Mary Smith:1, President Bush:1
    Place names: London:1, Mexico:1, Moscow:1

The complete implementation, which you can read through in the source file ExtractNames.java, is reasonably simple. The methods isHumanName and isPlaceName simply look up a string in either the human name or the place name hash table.
For testing a single word this is very easy; for example:

    public boolean isPlaceName(String name) {
        return placeNameHash.get(name) != null;
    }

The versions of these APIs that handle names containing multiple words are just a little more complicated; we need to construct a string from the words between the starting and ending indices and test to see if this new string value is a valid key in the human names or place names hash tables. Here is the code for finding multi-word place names:

    public boolean isPlaceName(List<String> words,
                               int startIndex,
                               int numWords) {
        // Not enough words left to form the candidate name
        if ((startIndex + numWords) > words.size()) {
            return false;
        }
        if (numWords == 1) {
            return isPlaceName(words.get(startIndex));
        }
        // Join the words with single spaces, no trailing space
        String s = "";
        for (int i = startIndex; i < (startIndex + numWords); i++) {
            if (i < (startIndex + numWords - 1)) {
                s = s + words.get(i) + " ";
            } else {
                s = s + words.get(i);
            }
        }
        return isPlaceName(s);
    }

This same scheme is used to test for multi-word human names. The top-level utility method getProperNames is used to find human and place names in text. The code in getProperNames is intentionally easy to understand but not very efficient because of all of the temporary test strings that need to be constructed.
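To make the lookup scheme concrete, here is a minimal, self-contained sketch of how a scan over tokenized text could use the multi-word test. It is not the code from ExtractNames.java: the class name PlaceNameScanSketch, the method findPlaceNames, the tiny hand-made table (including the us_city label for Los Angeles), and the choice to try the longest window first are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the place name lookup; not the book's ExtractNames class.
class PlaceNameScanSketch {
    // Tiny hand-made table (assumed values); the real tables are loaded from data files.
    static final Map<String, String> PLACE_NAMES = new HashMap<>();
    static {
        PLACE_NAMES.put("London", "country_capital");
        PLACE_NAMES.put("Los Angeles", "us_city");
        PLACE_NAMES.put("Mississippi", "us_state");
    }

    // Join numWords words starting at startIndex with single spaces and look the result up.
    static boolean isPlaceName(List<String> words, int startIndex, int numWords) {
        if (startIndex + numWords > words.size()) {
            return false;
        }
        String candidate =
            String.join(" ", words.subList(startIndex, startIndex + numWords));
        return PLACE_NAMES.containsKey(candidate);
    }

    // Scan left to right; at each position try the longest window first
    // so that a multi-word name is preferred over a shorter match.
    static List<String> findPlaceNames(List<String> words, int maxWords) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            int matched = 0;
            for (int n = maxWords; n >= 1; n--) {
                if (isPlaceName(words, i, n)) {
                    matched = n;
                    break;
                }
            }
            if (matched > 0) {
                found.add(String.join(" ", words.subList(i, i + matched)));
                i += matched; // skip past the words that were consumed
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> words =
            Arrays.asList("She", "flew", "from", "Los", "Angeles", "to", "London");
        System.out.println(findPlaceNames(words, 3)); // [Los Angeles, London]
    }
}
```

Every window tested here still builds a temporary string, so this sketch has the same inefficiency described above; it only shows how the pieces fit together.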

9.3 Using the WordNet Linguistic Database

The home page for the WordNet project is http://wordnet.princeton.edu and you will need to download version 3.0 and install it on your computer to use the example programs in this section and in Chapter 10. As you can see on the WordNet web site, there are several Java libraries for accessing the WordNet data files; we will use the JAWS library written by Brett Spell as a student project at Southern Methodist University. I include Brett’s library and the example programs for this section in the directory src-jaws-wordnet in the ZIP file for this book.

9.3.1 Tutorial on WordNet

The WordNet lexical database is an ongoing research project that includes many man-years of effort by professional linguists. My own use of WordNet over the last ten years has been simple, mainly using the database to determine synonyms (called synsets in WordNet) and looking at the possible parts of speech of words. For reference, here is a small subset of the types of relationships that WordNet contains for verbs, shown by examples taken from the Wikipedia article on WordNet:

    hypernym: movement (more general) is a hypernym of travel (less general)
    entailment: to sleep is entailed by to snore because you must be asleep to snore

Here are a few of the relations supported for nouns:

    hypernym: canine is a hypernym of dog since every dog is of type canine
    hyponym: dog (less general) is a hyponym of canine (more general)
    holonym: building is a holonym of window because a window is part of a building
    meronym: window is a meronym of building because a window is part of a building

Some of the related information maintained for adjectives is:

    related nouns
    similar to

I find the WordNet book (WordNet: An Electronic Lexical Database (Language, Speech, and Communication) by Christiane Fellbaum, 1998) to be a detailed reference for WordNet, but there have been several new releases of WordNet since the book was published. The WordNet site and the Wikipedia article on WordNet are also good sources of information if you decide to make WordNet part of your toolkit:

    http://wordnet.princeton.edu
    http://en.wikipedia.org/wiki/WordNet

We will use Brett’s open source Java WordNet utility library in the next section to experiment with WordNet. There are also good open source client applications for browsing the WordNet lexical database that are linked on the WordNet web site.
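The hypernym and hyponym relations for nouns can be pictured as links from a more specific term to a more general one. The following toy sketch models that idea with a plain map. It does not read WordNet data or use the JAWS API; the class HypernymSketch and its methods are hypothetical, only the dog/canine link comes from the examples above, and the other links are simplified assumptions. Real WordNet synsets can have several senses and more than one hypernym.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the noun hypernym relation; hypothetical, not the WordNet file format.
class HypernymSketch {
    // Maps a word to one direct hypernym (its more general term).
    static final Map<String, String> HYPERNYM = new HashMap<>();
    static {
        HYPERNYM.put("dog", "canine");        // example from the text
        HYPERNYM.put("canine", "carnivore");  // simplified assumption
        HYPERNYM.put("carnivore", "animal");  // simplified assumption
    }

    // Follow hypernym links upward until no more general term is recorded.
    static List<String> hypernymChain(String word) {
        List<String> chain = new ArrayList<>();
        String next = HYPERNYM.get(word);
        while (next != null && !chain.contains(next)) { // guard against cycles
            chain.add(next);
            next = HYPERNYM.get(next);
        }
        return chain;
    }

    // A word is a hyponym of every term on its hypernym chain.
    static boolean isHyponymOf(String specific, String general) {
        return hypernymChain(specific).contains(general);
    }

    public static void main(String[] args) {
        System.out.println(hypernymChain("dog"));         // [canine, carnivore, animal]
        System.out.println(isHyponymOf("dog", "animal")); // true
        System.out.println(isHyponymOf("animal", "dog")); // false
    }
}
```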

9.3.2 Example Use of the JAWS WordNet Library

Assuming that you have downloaded and installed WordNet on your computer, if you look at the data files themselves you will notice that the data is divided into index and data files for different data types. The JAWS library (and other WordNet client libraries for many programming languages) provides a useful view of, and convenient access to, the WordNet data. You will need to define a Java property for the location of the raw WordNet data files in order to use JAWS; on my system I set:

    wordnet.database.dir=/Users/markw/temp/wordnet3/dict

The example class WordNetTest finds the different word senses for a given word and prints this data to standard output. We will tweak this code slightly in the next section where we will be combining WordNet with a part of speech tagger in another example program.

Accessing WordNet data using Brett’s library is easy, so we will spend more time actually looking at the WordNet data itself. Here is a sample program that shows how to use the APIs. The class constructor makes a connection to the WordNet data files for reuse:

    public class WordNetTest {
        public WordNetTest() {
            database = WordNetDatabase.getFileInstance();
        }

Here I wrap a JAWS utility method to return lists of synsets instead of raw Java arrays: