Using the JDBC Meta Data APIs

We will use the method getTables() to fetch a list of all tables in the database. The four arguments are:

- String catalog: can be used when database systems support catalogs. We will use null to act as a wildcard match.
- String schemaPattern: can be used when database systems support schemas. We will use null to act as a wildcard match.
- String tableNamePattern: a pattern to match table names; we will use "%" as a wildcard match.
- String types[]: the types of table names to return. Possible values include TABLE, VIEW, ALIAS, SYNONYM, and SYSTEM TABLE.

The method getTables() returns a ResultSet, so we iterate through the returned values just as we would in a regular SQL query using the JDBC APIs:

    ResultSet table_rs =
        md.getTables(null, null, "%",
                     new String[]{"TABLE"});
    while (table_rs.next()) {
      System.out.println("Table: " +
                         table_rs.getString(3));
      tableNames.add(table_rs.getString(3));
    }

We then loop over all tables, printing column meta data and the first row:

    for (String tableName : tableNames) {
      System.out.println("\n\nProcessing table " +
                         tableName + "\n");
      String query = "SELECT * from " + tableName;
      System.out.println(query);
      ResultSet rs = s.executeQuery(query);
      ResultSetMetaData table_meta = rs.getMetaData();
      int columnCount = table_meta.getColumnCount();
      System.out.println("\nColumn meta data for table:");
      List<String> columnNames = new ArrayList<String>(10);
      columnNames.add("");
      for (int col = 1; col <= columnCount; col++) {
        System.out.println("Column " + col + " name: " +
                           table_meta.getColumnLabel(col));
        System.out.println("   column data type: " +
                           table_meta.getColumnTypeName(col));
        columnNames.add(table_meta.getColumnLabel(col));
      }
      System.out.println("\nFirst row in table:");
      if (rs.next()) {
        for (int col = 1; col <= columnCount; col++) {
          System.out.println("   " + columnNames.get(col) +
                             ": " + rs.getString(col));
        }
      }
    }

Output looks like this:

    Table: FACTBOOK
    Table: USSTATES

    Processing table FACTBOOK

    SELECT * from FACTBOOK

    Column meta data for table:
    Column 1 name: NAME
       column data type: VARCHAR
    Column 2 name: LOCATION
       column data type: VARCHAR
    Column 3 name: EXPORT
       column data type: BIGINT
    Column 4 name: IMPORT
       column data type: BIGINT
    Column 5 name: DEBT
       column data type: BIGINT
    Column 6 name: AID
       column data type: BIGINT
    Column 7 name: UNEMPLOYMENT_PERCENT
       column data type: INTEGER
    Column 8 name: INFLATION_PERCENT
       column data type: INTEGER

    First row in table:
       NAME: Aruba
       LOCATION: Caribbean, island in the Caribbean Sea, north of Venezuela
       EXPORT: 2200000000
       IMPORT: 2500000000
       DEBT: 285000000
       AID: 26000000
       UNEMPLOYMENT_PERCENT: 0
       INFLATION_PERCENT: 4

    Processing table USSTATES

    SELECT * from USSTATES

    Column meta data for table:
    Column 1 name: NAME
       column data type: VARCHAR
    Column 2 name: ABBREVIATION
       column data type: CHAR
    Column 3 name: INDUSTRY
       column data type: VARCHAR
    Column 4 name: AGRICULTURE
       column data type: VARCHAR
    Column 5 name: POPULATION
       column data type: BIGINT

    First row in table:
       NAME: Alabama
       ABBREVIATION: AL
       INDUSTRY: Paper, lumber and wood products, mining, rubber and plastic products, transportation equipment, apparel
       AGRICULTURE: Poultry and eggs, cattle, nursery stock, peanuts, cotton, vegetables, milk, soybeans
       POPULATION: 4447100

Using the JDBC meta data APIs is a simple technique but can be very useful, both for searching many tables for specific column names and for pulling meta data and row data into local search engines. While most relational databases provide support for free-text search of text fields in a database, it is often better to export specific text columns in a table to an external search engine. We will spend the rest of this chapter on index and search techniques. While we usually index web pages and local document repositories, keep in mind that data in relational databases can also easily be indexed, either with hand-written export utilities or with automated techniques using the JDBC meta data APIs that we used in this section.

10.2.3 Using the Meta Data APIs to Discern Entity Relationships

When database schemas are defined, a top-down approach is usually taken: entities and their relationships are modeled and then represented as relational database tables. When automatically searching remote databases for information we might need to infer which entities and relationships exist from table and column names. This is likely to be a domain-specific development effort. While it is feasible, and probably useful, to build a "database spider" for databases in a limited domain (for example, car parts or travel destinations) to discern entity models and their relations, building a system that handles multiple data domains is probably not possible without huge resources. The expression "dark web" refers to information on the web that is usually not "spidered" – information that lives mostly in relational databases and often behind query forms. While there are current efforts by search engine companies to determine the data domains of databases hidden behind user entry forms using surrounding text, for most organizations this is simply too large a problem to solve. On the other hand, using the meta data of databases that you or your organization have read access to for "database spidering" is a more tractable problem.
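As a minimal illustration of this kind of domain-specific heuristic, the following sketch guesses entity relationships from naming conventions alone: it reports that one table references another when it contains a column named like the other table plus an "_ID" suffix. The table and column names here are hypothetical examples, not from this chapter's databases, and a real "database spider" would need many more heuristics than this one:

```java
import java.util.*;

public class RelationGuesser {
    // Guess "TABLE_A -> TABLE_B" when TABLE_A has a column named
    // TABLE_B + "_ID" (a common relational naming idiom).
    public static List<String> guessRelations(Map<String, List<String>> tables) {
        List<String> relations = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tables.entrySet()) {
            for (String column : e.getValue()) {
                String upper = column.toUpperCase();
                if (upper.endsWith("_ID")) {
                    String target = upper.substring(0, upper.length() - 3);
                    if (tables.containsKey(target) && !target.equals(e.getKey())) {
                        relations.add(e.getKey() + " -> " + target);
                    }
                }
            }
        }
        Collections.sort(relations);  // deterministic ordering for display
        return relations;
    }

    public static void main(String[] args) {
        // Hypothetical schema: orders reference customers and products.
        Map<String, List<String>> tables = new HashMap<>();
        tables.put("CUSTOMER", Arrays.asList("ID", "NAME", "EMAIL"));
        tables.put("PRODUCT", Arrays.asList("ID", "NAME", "PRICE"));
        tables.put("ORDERS", Arrays.asList("ID", "CUSTOMER_ID",
                                           "PRODUCT_ID", "QUANTITY"));
        System.out.println(guessRelations(tables));
        // prints [ORDERS -> CUSTOMER, ORDERS -> PRODUCT]
    }
}
```

The table-to-columns map could be populated directly from the getTables() and getColumns() meta data calls shown in the previous section.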

10.3 Down to the Bare Metal: In-Memory Index and Search

Indexing and search technology is used in a wide range of applications. In order to get a good understanding of index and search we will design and implement an in-memory library in this section. In Section 10.4 we will take a quick look at the Lucene library and in Section 10.5 we will look at client programs using the Nutch indexing and search system that is based on Lucene.

We need a way to represent data to be indexed. We will use a simple package-visible class (no getters/setters), assumed to be in the same package as the indexing and search class:

    class TestDocument {
      int id;
      String text;
      static int count = 0;
      TestDocument(String text) {
        this.text = text;
        id = count++;
      }
      public String toString() {
        int len = text.length();
        if (len > 25) len = 25;
        return "[Document id: " + id + ": " +
               text.substring(0, len) + "...]";
      }
    }

We will write a class InMemorySearch that indexes instances of the TestDocument class and supplies an API for search. The first decision to make is how to store the index that maps search terms to documents that contain the search terms. One simple idea would be to use a map to maintain a set of document IDs for each search term; something like:

    Map<String, Set<Integer>> index;

This would be easy to implement but leaves much to be desired, so we will take a different approach. We would like to rank documents by relevance, but a relevance measure based just on containing all or most of the search terms is weak. We will improve the index by also storing a score of how many times a search term occurs in a document, scaled by the number of words in the document. Since our document model does not contain links to other documents, we will not use a Google-like page ranking algorithm that increases the relevance of search results based on the number of incoming links to matched documents. We will use a utility class (again, assuming same-package data visibility) to hold a document ID and a search term count.
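To make the indexing scheme just described concrete, here is a compact, self-contained sketch of the same idea: an inverted index mapping each term to per-document scores, where a score is the term count divided by the document's word count. This is not the InMemorySearch class developed in this section, just a simplified illustration of its core data structure and ranking:

```java
import java.util.*;

public class TinyIndex {
    // Inverted index: term -> (document id -> score), where the score is
    // the term's count in the document scaled by the document's length.
    private final Map<String, Map<Integer, Float>> index = new HashMap<>();

    public void add(int docId, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            index.computeIfAbsent(e.getKey(), k -> new HashMap<>())
                 .put(docId, (float) e.getValue() / words.length);
        }
    }

    // Rank documents by the sum of their per-term scores, best first.
    public List<Integer> search(String query) {
        Map<Integer, Float> scores = new HashMap<>();
        for (String term : query.toLowerCase().split("\\s+")) {
            Map<Integer, Float> postings = index.get(term);
            if (postings == null) continue;
            for (Map.Entry<Integer, Float> e : postings.entrySet())
                scores.merge(e.getKey(), e.getValue(), Float::sum);
        }
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> Float.compare(scores.get(b), scores.get(a)));
        return ids;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.add(0, "the cat sat on the mat");
        idx.add(1, "the cat cat cat likes other cats");
        System.out.println(idx.search("cat"));
        // prints [1, 0]: "cat" makes up 3/7 of document 1
        // but only 1/6 of document 0
    }
}
```

Scaling by document length keeps long documents from dominating the rankings simply because they contain more words overall.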
I used generics for the first version of this class to allow alternative types for counting word use in a document, and later changed the code to hardwire the types for the ID and word count to native integer values for runtime efficiency and lower memory use. Here is the second version of the code:

    class IdCount implements Comparable<IdCount> {
      int id = 0;
      int count = 0;
      public IdCount(int k, int v) {
        this.id = k;
        this.count = v;
      }
      public String toString() {
        return "[IdCount: " + id + " : " + count + "]";
      }
      @Override
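The listing above is cut off before the compareTo() method. As a self-contained sketch of how such a class supports ranked search results, the version below adds one plausible compareTo() (an assumption for illustration, not necessarily the author's exact code) that sorts higher counts first, so that sorting a list of hits puts the most relevant documents at the front:

```java
import java.util.*;

class IdCount implements Comparable<IdCount> {
    int id = 0;
    int count = 0;
    public IdCount(int k, int v) { this.id = k; this.count = v; }
    public String toString() { return "[IdCount: " + id + " : " + count + "]"; }
    @Override
    public int compareTo(IdCount other) {
        // Assumed ordering: higher counts sort first (descending by count).
        return Integer.compare(other.count, this.count);
    }
}

public class IdCountDemo {
    public static void main(String[] args) {
        List<IdCount> hits = new ArrayList<>(Arrays.asList(
            new IdCount(7, 2), new IdCount(3, 9), new IdCount(5, 4)));
        Collections.sort(hits);  // best-matching document ID first
        System.out.println(hits);
        // prints [[IdCount: 3 : 9], [IdCount: 5 : 4], [IdCount: 7 : 2]]
    }
}
```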