A Brief History of IR

27.1.3 A Brief History of IR

Information retrieval has been a common task since the times of ancient civilizations, which devised ways to organize, store, and catalog documents and records. Media such as papyrus scrolls and stone tablets were used to record information in ancient times. These efforts allowed knowledge to be retained and transferred among generations. With the emergence of public libraries and the printing press, large-scale methods for producing, collecting, archiving, and distributing documents and books evolved. As computers and automatic storage systems emerged, the need to apply these methods to computerized systems arose. Several techniques emerged in the 1950s, such as the seminal work of H. P. Luhn,[5] who proposed using words and their frequency counts as indexing units for documents, and using measures of word overlap between queries and documents as the retrieval criterion. It was soon realized that storing large amounts of text was not difficult. The harder task was to search for and retrieve that information selectively for users with specific information needs. Methods that explored word distribution statistics gave rise to the choice of keywords based on their distribution properties[6] and keyword-based weighting schemes.
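The idea of indexing by word frequency and retrieving by word overlap can be illustrated with a minimal sketch. It is not Luhn's original procedure; the tokenizer, the tiny collection in docs, and the function names index_terms, overlap_score, and rank are all assumptions made for illustration, and the overlap measure used here is simply the number of distinct words shared by the query and the document.

```python
import re
from collections import Counter


def index_terms(text):
    """Return a word -> frequency count mapping (the indexing units)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, doc_terms):
    """Count the distinct query words that also occur in the document."""
    return len(set(index_terms(query)) & set(doc_terms))


# Hypothetical three-document collection, indexed once up front.
docs = {
    "d1": "storing text on tape is easy",
    "d2": "retrieval of text needs an index of words",
    "d3": "libraries catalog books and records",
}
indexed = {doc_id: index_terms(text) for doc_id, text in docs.items()}


def rank(query):
    """Return ids of documents sharing a word with the query, best first."""
    scores = {d: overlap_score(query, terms) for d, terms in indexed.items()}
    matching = [d for d, s in scores.items() if s > 0]
    return sorted(matching, key=lambda d: (-scores[d], d))


print(rank("text retrieval index"))
```

Under these assumptions, a query is answered by comparing its words against each document's word set. A frequency-based weighting scheme would replace the simple count of shared words with a sum of per-word weights.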

Early experiments with document retrieval systems such as SMART[7] in the 1960s adopted the inverted file organization based on keywords and their weights as the method of indexing (see Section 27.5). Serial (or sequential) organization proved inadequate for queries that required fast, near real-time responses. Proper organization of these files became an important area of study; document classification and clustering schemes ensued. The scale of retrieval experiments remained a challenge because large text collections were not available. This soon changed with the World Wide Web. In addition, the Text Retrieval Conference (TREC) was launched by NIST (National Institute of Standards and Technology) in 1992 as a part of the TIPSTER program[8] with the goal of providing a platform for evaluating information retrieval methodologies and facilitating technology transfer to develop IR products.
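The contrast between serial organization and an inverted file can be sketched as follows. This is an illustrative example only, not the SMART implementation: the collection, the function names, and the use of raw term frequency as the keyword weight are assumptions.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text.
docs = {
    1: "inverted file organization for keywords",
    2: "sequential file scan",
    3: "keywords and weights for indexing",
}


def sequential_search(term):
    """Serial organization: every query reads every document."""
    return sorted(d for d, text in docs.items() if term in text.split())


def build_inverted_file(collection):
    """Map each keyword to {doc_id: weight}.

    The weight is the raw term frequency, an assumed stand-in for the
    keyword weights mentioned in the text.
    """
    inverted = defaultdict(dict)
    for doc_id, text in collection.items():
        for word in text.split():
            inverted[word][doc_id] = inverted[word].get(doc_id, 0) + 1
    return inverted


inverted_file = build_inverted_file(docs)


def indexed_search(term):
    """Inverted organization: one lookup, no scan of the collection."""
    return sorted(inverted_file.get(term, {}))


print(sequential_search("keywords"), indexed_search("keywords"))
```

Both functions return the same documents. The difference is cost: the sequential version does work proportional to the size of the whole collection on every query, while the inverted file pays that cost once at indexing time and then answers each keyword with a single lookup.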

A search engine is a practical application of information retrieval to large-scale document collections. With significant advances in computers and communications technologies, people today have interactive access to enormous amounts of user-generated distributed content on the Web. This has spurred the rapid growth in search engine technology, where search engines are trying to discover different kinds of real-time content found on the Web. The part of a search engine responsible for discovering, analyzing, and indexing these new documents is known as a crawler. Other types of search engines exist for specific domains of knowledge. For example, the biomedical literature search database was started in the 1970s and is now supported by the PubMed search engine,[9] which gives access to over 20 million abstracts.

[5] See Luhn (1957), “A statistical approach to mechanized encoding and searching of literary information.”
[6] See Salton, Yang, and Yu (1975).
[7] For details, see Buckley et al. (1993).

While continuous progress is being made to tailor search results to the needs of an end user, the challenge remains in providing high-quality, pertinent, and timely information that is precisely aligned to the information needs of individual users.