Using the Trained Markov Model to Tag Text

The arrays for these probabilities in Markov.java are probabilityWordGivenTag and probabilityTag1ToTag2. The logic for scoring a specific tagging possibility for a sequence of words is in the method score. The method exponential_tagging_algorithm is the top-level API for tagging words. Please note that the word sequence that you pass to exponential_tagging_algorithm must not contain any words that were not in the original training data (i.e., in the file tagged_text.txt).

    public List<String> exponential_tagging_algorithm(List<String> words) {
      possibleTags = new ArrayList<ArrayList<String>>();
      int num = words.size();
      indices = new int[num];
      counts = new int[num];
      int [] best_indices = new int[num];
      for (int i=0; i<num; i++) {
        indices[i] = 0; counts[i] = 0;
      }
      for (int i=0; i<num; i++) {
        String word = "" + words.get(i);
        List<String> v = lexicon.get(word);
        // possible tags at index i:
        ArrayList<String> v2 = new ArrayList<String>();
        for (int j=0; j<v.size(); j++) {
          String tag = "" + v.get(j);
          if (v2.contains(tag) == false) {
            v2.add(tag); counts[i]++;
          }
        }
        // possible tags at index i:
        possibleTags.add(v2);
        System.out.print("^^ word: " + word + ", tag count: " +
                         counts[i] + ", tags: ");
        for (int j=0; j<v2.size(); j++) {
          System.out.print(" " + v2.get(j));
        }
        System.out.println();
      }
      float best_score = -9999;
      do {
        System.out.print("Current indices:");
        for (int k=0; k<num; k++) {
          System.out.print(" " + indices[k]);
        }
        System.out.println();
        float score = score(words);
        if (score > best_score) {
          best_score = score;
          System.out.println(" new best score: " + best_score);
          for (int m=0; m<num; m++) {
            best_indices[m] = indices[m];
          }
        }
      } while (incrementIndices(num)); // see text below
      List<String> tags = new ArrayList<String>(num);
      for (int i=0; i<num; i++) {
        List<String> v = possibleTags.get(i);
        tags.add(v.get(best_indices[i]));
      }
      return tags;
    }

The method incrementIndices is responsible for generating the next possible tagging for a sequence of words. Each word in a sequence can have one or more possible tags. The method incrementIndices counts with a variable base per digit position.
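The variable-base counting that incrementIndices performs can be tried in isolation. The following is a minimal sketch, not code from Markov.java: the class name MixedRadixCounter and the helper method enumerate are hypothetical names introduced for illustration, and the sketch assumes every position has at least one choice (a count of 1 or more).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone illustration; not part of Markov.java.
class MixedRadixCounter {
    private final int[] counts;   // number of choices at each position (assumed >= 1)
    private final int[] indices;  // currently selected choice at each position

    MixedRadixCounter(int[] counts) {
        this.counts = counts.clone();
        this.indices = new int[counts.length]; // all zeros initially
    }

    /** Advance to the next combination; returns false once all have been visited. */
    boolean increment() {
        for (int i = 0; i < indices.length; i++) {
            if (indices[i] < counts[i] - 1) {
                indices[i]++;
                Arrays.fill(indices, 0, i, 0); // reset every lower position to 0
                return true;
            }
        }
        return false;
    }

    int[] current() {
        return indices.clone();
    }

    /** Collect every combination, in counting order, as printable rows. */
    static List<String> enumerate(int[] counts) {
        MixedRadixCounter counter = new MixedRadixCounter(counts);
        List<String> rows = new ArrayList<>();
        do {
            rows.add(Arrays.toString(counter.current()));
        } while (counter.increment());
        return rows;
    }

    public static void main(String[] args) {
        // Two positions: the first has 2 choices, the second has 3.
        for (String row : enumerate(new int[] {2, 3})) {
            System.out.println(row);
        }
    }
}
```

Running main prints six rows, one per combination. In general the number of rows is the product of the per-position counts, which is why the tagger that relies on this kind of counter is called exponential.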
For example, if we had four words in an input sequence, with the first and last words having only one possible tag value, the second word having two possible tag values, and the third word having three possible tag values, then incrementIndices would count like this:

    0 0 0 0
    0 1 0 0
    0 0 1 0
    0 1 1 0
    0 0 2 0
    0 1 2 0

The generated indices (i.e., each row in this listing) are stored in the class instance variable indices, which is used in the method score:

    /**
     * Increment the class variable indices[] to point
     * to the next possible set of tags to check.
     */
    private boolean incrementIndices(int num) {
      for (int i=0; i<num; i++) {
        if (indices[i] < (counts[i] - 1)) {
          indices[i] += 1;
          for (int j=0; j<i; j++) {
            indices[j] = 0;
          }
          return true;
        }
      }
      return false;
    }

This algorithm is not efficient for long word sequences: the number of candidate taggings is the product of the tag counts for each word, so it grows exponentially with the length of the sequence. In practice this is not a real problem because you can break up long texts into smaller pieces for tagging; for example, you might want to tag just one sentence at a time.

10 Information Gathering

We saw techniques for extracting semantic information in Chapter 9, and we will augment that material with the use of Reuters Open Calais web services for information extraction from text. We will then look at information discovery in relational databases, indexing, and search tools and techniques.

10.1 Open Calais

The Open Calais system was developed by Clear Forest (later acquired by Reuters). Reuters allows free use (with registration) of their named entity extraction web service; you can make 20,000 web service calls a day. You need to sign up and get an access key at: www.opencalais.com. Starting in 1999, I have developed a similar named entity extraction system (see www.knowledgebooks.com) and I sometimes use both Open Calais and my own system together.

The example program in this section (OpenCalaisClient.java) expects the key to be set in your environment; on my MacBook I set (the key shown here is fake; get your own):

    OPEN_CALAIS_KEY=al4345lkea48586dgfta3129aq

You will need to make sure that this value can be obtained from a System.getenv() call.

The Open Calais web services support JSON, REST, and SOAP calls. I will use the REST architectural style in this example. The Open Calais server returns an XML RDF payload that can be directly loaded into RDF data stores like Sesame (see Chapter 4). The example class OpenCalaisClient depends on a trick that may break in future versions of the Open Calais web service: an XML comment block at the top of the returned RDF payload lists the types of entities and their values. For example, here is a sample of the header comments with most of the RDF payload removed for brevity:

    <?xml version="1.0" encoding="utf-8"?>
    <string xmlns="http://clearforest.com/">
    <!--Use of the Calais Web Service is governed by the Terms
        of Service located at http://www.opencalais.com. By
        using this service or the results of the service you
        agree to these terms of service.
    -->
    <!--Relations:
    Country: France, United States, Spain
    Person: Hillary Clinton, Doug Hattaway, Al Gore
    City: San Francisco
    ProvinceOrState: Texas
    -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1 ..."
             xmlns:c="http://s.opencalais.com/1/pred/">
      ...
      <rdf:type ... >
      ... .
    </rdf:RDF>
    </string>

Here we will simply parse out the relations from the comment block.
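Parsing the relations out of the comment block might look like the following minimal sketch. This is not the code from OpenCalaisClient.java: the class name RelationCommentParser and its parse method are hypothetical. The sketch assumes the comment begins with the literal text "<!--Relations:", ends at the next "-->", and lists one entity type per line in the form "Type: value, value". Entity values that themselves contain commas would be split incorrectly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not the book's OpenCalaisClient.
class RelationCommentParser {
    // Assumed comment delimiters, taken from the sample payload shown above.
    private static final String START = "<!--Relations:";
    private static final String END = "-->";

    /** Map each entity type to its values; empty map if no block is found. */
    static Map<String, List<String>> parse(String payload) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int start = payload.indexOf(START);
        if (start < 0) {
            return result;
        }
        int bodyStart = start + START.length();
        int end = payload.indexOf(END, bodyStart);
        if (end < 0) {
            return result;
        }
        // One "Type: value, value" entry per line (an assumption about the layout).
        for (String line : payload.substring(bodyStart, end).split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or text that is not a "Type: values" entry
            }
            String type = line.substring(0, colon).trim();
            List<String> values = new ArrayList<>();
            for (String value : line.substring(colon + 1).split(",")) {
                String trimmed = value.trim();
                if (!trimmed.isEmpty()) {
                    values.add(trimmed);
                }
            }
            if (!type.isEmpty() && !values.isEmpty()) {
                result.put(type, values);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "<!--Relations:\n"
            + "Country: France, United States, Spain\n"
            + "Person: Hillary Clinton, Doug Hattaway, Al Gore\n"
            + "City: San Francisco\n"
            + "ProvinceOrState: Texas\n"
            + "-->";
        System.out.println(parse(sample));
    }
}
```

A LinkedHashMap is used so that entity types come back in the order they appear in the comment. Because this approach depends on the comment layout rather than on the RDF itself, it should be treated as fragile in the same way the text warns.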
If you want to use Sesame to parse the RDF payload and load it into a local RDF repository then you can alternatively load the returned Open Calais response by modifying the example code from Chapter 4 using: