Open Calais Practical Artificial Intelligence Programming With Java

string xmlns=http:clearforest.com --Use of the Calais Web Service is governed by the Terms of Service located at http:www.opencalais.com. By using this service or the results of the service you agree to these terms of service. -- --Relations: Country: France, United States, Spain Person: Hillary Clinton, Doug Hattaway, Al Gore City: San Francisco ProvinceOrState: Texas -- rdf:RDF xmlns:rdf=http:www.w3.org1 ... xmlns:c=http:s.opencalais.com1pred ... rdf:type ... ... . rdf:RDF string Here we will simply parse out the relations from the comment block. If you want to use Sesame to parse the RDF payload and load it into a local RDF repository then you can alternatively load the returned Open Calais response by modifying the example code from Chapter 4 using: StringReader sr = new StringReaderresult; RepositoryConnection connection = repository.getConnection; connection.addsr, , RDFFormat.RDFXML; Here are a few code snippets incomplete code: please see the Java source file for more details from the file OpenCalaisClient.java: public HashtableString, ListString getPropertyNamesAndValuesString text throws MalformedURLException, IOException { HashtableString, ListString ret = new HashtableString, ListString; You need an Open Calais license key. The following code sets up the data for a REST style web service call and opens a connection to the server, makes the request, and retrieves the response in the string variable payload. The Java libraries for handling 178 HTTP connections make it simple to make a architecture style web service call and get the response as a text string: String licenseID = System.getenvOPEN_CALAIS_KEY; String content = text; String paramsXML = c:params ... c:params; StringBuilder sb = new StringBuildercontent.length + 512; sb.appendlicenseID=.appendlicenseID; sb.appendcontent=.appendcontent; sb.appendparamsXML=.appendparamsXML; String payload = sb.toString; URLConnection connection = new URLhttp:api.opencalais.com .... openConnection; connection.addRequestPropertyContent-Type, applicationx-www-form-urlencoded; connection.addRequestPropertyContent-Length, String.valueOfpayload.length; connection.setDoOutputtrue; OutputStream out = connection.getOutputStream; OutputStreamWriter writer = new OutputStreamWriterout; writer.writepayload; writer.flush; get response from Open Calais server: String result = new Scanner connection.getInputStream. useDelimiter\\Z.next; result = result.replaceAlllt;, . replaceAllgt;, ; The text that we are parsing looks like: Country: France, United States, Spain Person: Hillary Clinton, Doug Hattaway, Al Gore so the text response is parsed to extract a list of values for each property name contained in the string variable result: int index1 = result.indexOfterms of service.--; 179 index1 = result.indexOf--, index1; int index2 = result.indexOf--, index1; result = result.substringindex1 + 4, index2 - 1 + 1; String[] lines = result.split\\n; for String line : lines { int index = line.indexOf:; if index -1 { String relation = line.substring0, index.trim; String[] entities = line.substringindex + 1.trim.split,; for int i = 0, size = entities.length; i size; i++ { entities[i] = entities[i].trim; } ret.putrelation, Arrays.asListentities; } } return ret; } Again, I want to point out that the above code depends on the format of XML com- ments in the returned XML payload so this code may break in the future and require modification. Here is an example use of this API: String content = Hillary Clinton likes to remind Texans that ...; MapString, ListString results = new OpenCalaisClient. getPropertyNamesAndValuescontent; for String key : results.keySet { System.out.println + key + : + results.getkey; } In this example the string value assigned to the variable content was about 500 words of text from a news article; the full text can be seen in the example data files. The output of this example code is: Person: [Hillary Clinton, Doug Hattaway, Al Gore] Relations: [] City: [San Francisco] Country: [France, United States, Spain] ProvinceOrState: [Texas] 180 There are several ways that you might want to use named entity identification. One idea is to create a search engine that identifies people, places, and products in search results and offers users a linked set of documents or web pages that discuss the same people, places, andor products. Another idea is to load the RDF payload returned by the Open Calais web service calls to an RDF repository and support SPARQL queries. You may also want to modify any content management systems CMS that you use to add tags for documents maintained in a CMS; using Open Calais you are limited to the types of entities that they extract. This limitation is one reason why I maintain and support my own system for named entity and classification knowl- edgebooks.com – I like some flexibility in the type of semantic information that I extract from text data. I covered some of the techniques that I use in my own work in Section 9.2 if you decide to implement your own system to replace or augment Open Calais.

10.2 Information Discovery in Relational Databases

We will look at some techniques for using the JDBC meta-data APIs to explore re- lational database resources where you at least have read access rights. In order to make installation of the example programs easier we will use the Derby pure Java database that is bundled with JDK 1.6. If you are still using JDK 1.5, please down- load the derby.jar file and copy it to the “lib” directory for the Java book examples: http:db.apache.orgderby There are small differences in setting up a JDBC connection to an embedded Derby instance rather than accessing a remote server: these differences are not important to the material in this section, it is mostly a matter of changing a connection call. I will use two XML data sources data on US states and the CIA World FactBook for these examples, and start with the program to insert these XML data files into the relational database: src-info-disc-rdbsCreateSampleDatabase.java and continue with a program to print out all metadata that is implemented in the files: src-info-disc-rdbsDumpMetaData.java src-info-disc-rdbsDatabaseDiscovery.java 181 We will not implement any specific “database spidering” applications but I will pro- vide some basic access techniques and give you some ideas for using database meta data in your own projects. 10.2.1 Creating a Test Derby Database Using the CIA World FactBook and Data on US States The file test dataXMLFactBook.xml contains data that I obtained from the Fact- Book web site and converted to XML. This XML file contains data for individual countries and a few general regions: FactBook year=2001 country name=Aruba location=Caribbean, island in the ... background=Discovered and claimed ... climate=tropical marine; little seasonal ... terrain=flat; scant vegetation resources=NEGL; white sandy beaches hazards=lies outside the Caribbean hurricane belt population=70,007 July 2001 est. government=parliamentary democracy economy=Tourism is the mainstay of the Aruban ... inflation=4.2 2000 est. languages=Dutch official, Papiamento ... religions=Roman Catholic 82, Protestant 8, ... capital=Oranjestad unemployment=0.6 1999 est. industries=tourism, transshipment facilities, ... agriculture=aloes; livestock; fish exports=2.2 billion including oil reexports ... imports=2.5 billion 2000 est. debt=285 million 1996 aid=26 million 1995; note - the Netherlands ... internet_code=.aw ... FactBook 182