Generic IR Pipeline

27.1.5 Generic IR Pipeline

As we mentioned earlier, documents are made up of unstructured natural language text composed of character strings from English and other languages. Common examples of documents include newswire services (such as AP or Reuters), corporate manuals and reports, government notices, Web page articles, blogs, tweets, books, and journal papers. There are two main approaches to IR: statistical and semantic.

In a statistical approach, documents are analyzed and broken down into chunks of text (words, phrases, or n-grams, which are the subsequences of n consecutive characters in a text or document), and each word or phrase is counted, weighted, and measured for relevance or importance. These words and their properties are then compared with the query terms for potential degree of match to produce a ranked list of resulting documents that contain the words. Statistical approaches are further classified based on the method employed. The three main statistical approaches are Boolean, vector space, and probabilistic (see Section 27.2).
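The counting and matching steps of a statistical approach can be illustrated with a small Python sketch. It is only a minimal illustration under simplifying assumptions: the function names (tokenize, char_ngrams, score, rank), the sample documents, and the use of raw term counts as the only weight are all hypothetical choices made here, not a method prescribed by this chapter. Section 27.2 covers the actual retrieval models.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text, n):
    """Return every run of n consecutive characters in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def score(query, document):
    """Sum the document's counts of each distinct query term (raw term frequency)."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return (doc_id, score) pairs for matching documents, best score first."""
    scored = [(doc_id, score(query, text)) for doc_id, text in documents.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "Reuters reports on database systems and database research.",
    "d2": "A blog post about gardening.",
    "d3": "Government notice on database access.",
}

print(char_ngrams("index", 3))         # ['ind', 'nde', 'dex']
print(rank("database systems", docs))  # [('d1', 3), ('d3', 1)]
```

Document d2 is dropped because it contains none of the query words, which mirrors the statement that the ranked list holds documents that contain the words. A realistic system would replace the raw counts with one of the weighting schemes discussed later in the chapter.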

Semantic approaches to IR use knowledge-based techniques of retrieval that broadly rely on the syntactic, lexical, sentential, discourse-based, and pragmatic levels of knowledge understanding. In practice, semantic approaches also apply some form of statistical analysis to improve the retrieval process.

Figure 27.1 shows the various stages involved in an IR processing system. The steps shown on the left in Figure 27.1 are typically offline processes, which prepare a set of documents for efficient retrieval; these are document preprocessing, document modeling, and indexing. The steps involved in query formation, query processing, searching mechanism, document retrieval, and relevance feedback are shown on the right in Figure 27.1. In each box, we highlight the important concepts and issues. The rest of this chapter describes some of the concepts involved in the various tasks within the IR process shown in Figure 27.1.

Figure 27.1 Generic IR framework. Document-side boxes: Document Corpus (Document 1, Document 2, Document 3); Preprocessing (stopword removal; stemming; thesaurus; digits, hyphens, punctuation marks, cases; information extraction); Modeling (retrieval models; type of queries); Indexing (inverted index construction; index vocabulary; document statistics; index maintenance); Metadata Integration (metadata; external data ontologies). Query-side boxes: Search Intent and Information Need/Search; Query Formation (keywords, Boolean, phrase, proximity, wildcard queries, etc.); Query Processing (conversion from humanly understandable to internal format; situation assessment; query expansion heuristics such as user's profile, related metadata, etc.); Searching Mechanism (choice of search strategy: approximate vs. exact matches, exhaustive vs. top K; type of similarity measure); Document Retrieval (ranking results; showing useful metadata); Relevance Feedback (storing user's feedback; personalization; pattern analysis of relevant results). Legend: a dashed line indicates the next iteration.

Figure 27.2 shows a simplified IR processing pipeline. To perform retrieval on documents, the documents are first represented in a form suitable for retrieval. The significant terms and their properties are extracted from the documents and stored in a document index, a matrix that holds these terms together with references to the documents that contain them. This index is then converted into an inverted index (see Figure 27.4) of a word/term vs. document matrix. Given the query words, the documents containing these words, together with document properties such as date of creation, author, and type of document, are fetched from the inverted index and compared with the query. This comparison results in a ranked list shown to the user. The user can then provide feedback on the results, which triggers implicit or explicit query expansion to fetch results that are more relevant for the user. Most IR systems allow for an interactive search where the query and the results are successively refined.
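The simplified pipeline just described (preprocess, index, look up, rank, refine) can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions rather than the design of any particular IR system: the stopword list, the function names (terms, build_inverted_index, search, expand_query), the sample documents, and the rule of expanding the query with the most frequent new terms of a document the user marked as relevant are all hypothetical. Document properties such as author or date are not modeled.

```python
from collections import Counter, defaultdict
import re

STOPWORDS = {"a", "an", "and", "of", "on", "the", "to"}  # illustrative list only


def terms(text):
    """Preprocess: lowercase, split into words, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def build_inverted_index(documents):
    """Map each term to {doc_id: term count}, the postings for that term."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(terms(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Fetch postings for each query term and rank documents by summed counts."""
    scores = Counter()
    for term in set(terms(query)):
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


def expand_query(query, relevant_text, k=2):
    """Explicit feedback: append the k most frequent new terms of a relevant document."""
    seen = set(terms(query))
    candidates = Counter(t for t in terms(relevant_text) if t not in seen)
    ranked = sorted(candidates.items(), key=lambda pair: (-pair[1], pair[0]))
    return query + " " + " ".join(term for term, _ in ranked[:k])


# Hypothetical sample corpus, for illustration only.
docs = {
    "d1": "Inverted index construction for text retrieval",
    "d2": "Query expansion improves retrieval of relevant text",
    "d3": "Index maintenance and index statistics",
}
index = build_inverted_index(docs)

print(search(index, "index retrieval"))
# [('d1', 2), ('d3', 2), ('d2', 1)]

# Suppose the user marks d3 as relevant; the refined query favors similar documents.
refined = expand_query("index retrieval", docs["d3"])
print(refined)                 # index retrieval maintenance statistics
print(search(index, refined))  # [('d3', 4), ('d1', 2), ('d2', 1)]
```

The second search shows the interactive refinement loop: after feedback, d3 moves to the top because the expanded query now shares more terms with it. Storing postings per term is what makes the lookup proportional to the number of query terms rather than to the size of the corpus.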