
Gerald Kowalski, Information Retrieval Architecture and Algorithms, DOI: 10.1007/978-1-4419-7716-8, © Springer Science+Business Media, LLC 2011

  Gerald Kowalski Ashburn, VA, USA

  ISBN 978-1-4419-7715-1    e-ISBN 978-1-4419-7716-8

  Library of Congress Control Number:

  © Springer Science+Business Media, LLC 2011

  All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

  Printed on acid-free paper

  Springer is part of Springer Science+Business Media (www.springer.com)

  This book is dedicated to my grandchildren, Adeline, Bennet, Mollie Kate and Riley, who are the future

  Jerry Kowalski

  Preface

  Information Retrieval has changed radically over the last 25 years. When I first started teaching Information Retrieval and developing large Information Retrieval systems in the 1980s, it was easy to cover the area in a single semester course. Most of the discussion was theoretical, with testing done on small databases, and only a small subset of the theory could be implemented in commercial systems. There were not massive amounts of data in the right digital format for search. Since 2000, the field of Information Retrieval has undergone a major transformation driven by massive amounts of new data (e.g., the Internet, Facebook) that need to be searched, new hardware technologies that make the storage and processing of data feasible, and software architecture changes that provide the scalability to handle massive data sets. In addition, multimedia information retrieval, in particular of images, audio and video, is part of everyone's information world, and users are looking to retrieve these modalities as well as traditional text. In the textual domain, languages other than English are becoming far more prevalent on the Internet.

  Understanding how to solve information retrieval problems is no longer focused solely on search algorithm improvements. Now that Information Retrieval Systems are commercially available, as in the area of Database Management Systems, a system-level approach is needed to understand how to provide the search and retrieval capabilities users need. To understand modern information retrieval it is necessary to understand search and retrieval for both text and multimedia formats. Although search algorithms are important, other aspects of the total system, such as pre-processing on ingest of data and how search results are displayed, can contribute as much to the user finding the needed information as the search algorithms themselves.

  This book provides a theoretical and practical explanation of the latest advancements in information retrieval and their application to existing systems. It takes a system approach, discussing all aspects of an Information Retrieval System. The system approach to information retrieval starts with a functional discussion of what is needed for an information system, allowing the reader to understand the scope of the information retrieval problem and the challenges in providing the needed functions. The book, starting with Chap. 1, stresses that information retrieval has migrated from textual to multimedia. This theme is carried throughout the book, with multimedia search, retrieval and display being discussed as well as all the classic and new textual techniques. Taking a system view of Information Retrieval explores every functional processing step in a system, showing how implementation decisions at each step can add to the goal of information retrieval: providing users with the information they need while minimizing the resources (i.e., time) spent getting it. This is not limited to search speed; how search results are presented also influences how fast a user can locate the information they need. An information retrieval system can be defined as four major processing steps: the "ingestion" of information to be indexed, the indexing process, the search process, and finally the information presentation process. Every processing step has algorithms associated with it and provides the opportunity to make searching and retrieval more precise. In addition, the changes in hardware and, more importantly, search architectures, such as those introduced by GOOGLE, are discussed as ways of approaching the scalability issues. The last chapter focuses on how to evaluate an information retrieval system and the data sets and forums that are available. Given the continuing introduction of new search technologies, ways of evaluating which are most useful to a particular information domain become important.

  The primary goal of writing this book is to provide a college text on Information Retrieval Systems. But in addition to the theoretical aspects, the book maintains a theme of practicality, putting into perspective the importance and use of the theory in systems being used by anyone on the Internet. The student will gain an understanding of what is achievable using existing technologies and of the deficient areas that warrant additional research. What could once be covered in a one semester course now requires at least three different courses to provide adequate background. The first course provides a complete overview of Information Retrieval System theory and architecture, as provided by this book. Additional courses are needed to go into more depth on the algorithms and theoretical options for the different search, classification, clustering and other related technologies whose basics are provided in this book. Another course is needed to focus in depth on the theory and implementation of the growing area of Multimedia Information Retrieval and of Information Presentation technologies.

  Gerald Kowalski

  Gerald Kowalski, Information Retrieval Architecture and Algorithms, DOI: 10.1007/978-1-4419-7716-8_1, © Springer US 2011

  1. Information Retrieval System Functions

  Gerald Kowalski

  (1) Ashburn, VA, USA

  Abstract

  In order to understand the technologies associated with an Information Retrieval System, an understanding of the goals and objectives of information retrieval systems, along with the users' functions, is needed. This background helps in understanding some of the technical drivers on the final implementation. To place Information Retrieval Systems into perspective, it is also useful to discuss how they are similar to and differ from other information handling systems such as Database Management Systems and Digital Libraries. The major processing subsystems in an information retrieval system are outlined to show the global architecture concerns. The precision and recall metrics are introduced early since they provide the basis for explaining the impacts of algorithms and functions throughout the rest of the architecture discussion.

1.1 Introduction

  Information Retrieval is a very simple concept, with everyone having practical experience in its use.

  The scenario of a user having an information need, translating it into a search statement, and executing that search to locate the information has become ubiquitous in everyday life. The Internet has become a repository of any information a person needs, replacing the library as a more convenient research tool. An Information Retrieval System is a system that ingests information, transforms it into a searchable format and provides an interface to allow a user to search and retrieve information. The most obvious example of an Information Retrieval System is GOOGLE; the English language has even been extended with the term "Google it" to mean searching for something.

  So everyone has had experience with Information Retrieval Systems, and with a little thought it is easy to answer the question: "Does it work?" Everyone who has used such systems has experienced the frustration encountered when looking for certain information. Given the massive amount of intellectual effort going into the design and evolution of a "GOOGLE" or other search systems, the question comes to mind: why is it so hard to find what you are looking for?

  One of the goals of this book is to explain the practical and theoretical issues associated with Information Retrieval that make the design of Information Retrieval Systems one of the challenges of our time. The demand for and expectations of users to quickly find any information they need continue to drive both the theoretical analysis and the development of new technologies to satisfy that need. To scope the problem, one of the first things that needs to be defined is "information". Twenty-five years ago information retrieval was totally focused on textual items, because almost all of the digital items available were text. Today almost everyone carries with them, most of the time, the capability to create images and videos of interest: the cell phone. This has made modalities other than text as common as text. That is coupled with Internet web sites that allow and are designed for easy uploading and storing of those modalities, which more than justifies the need to include more than text as part of the information retrieval problem. There is a lot of parallelism between the information processing steps for text and those for images, audio and video. Although maps are another modality that could be included, they will only be discussed in general terms.

  So in the context of this book, the information considered in Information Retrieval Systems includes text, images, audio and video. The term "item" shall be used to define a specific information object. This could be a textual document, a news item from an RSS feed, an image, a video program or an audio program. It is useful to distinguish the original item from what is processed by the Information Retrieval System as the basic indexable item. The original item will always be kept for display purposes, but a lot of preprocessing can occur on it during creation of the searchable index. The term "item" will refer to the original object. On occasion the term "document" will be used when the item being referred to is textual.

  An Information Retrieval System is the hardware and software that facilitates a user in finding the information the user needs. Hardware is included in the definition because specialized hardware is needed to transform certain modalities into digital processing format (e.g., encoders that translate composite video to digital video). As the detailed processing of items is described, it will become clear that an information retrieval system is not a single application but is composed of many different applications that work together to provide the tools and functions needed to assist users in answering their questions. The overall goal of an Information Retrieval System is to minimize the user overhead in locating the information of value. Overhead from a user's perspective can be defined as the time it takes to locate the needed information. The time starts when a user begins to interact with the system and ends when they have found the items of interest. Human factors play a significant role in this process. For example, most users have a low frustration threshold when waiting for a response. That means in a commercial system on the Internet, the user is more satisfied with a response in under 3 s than with a slower response that has more accurate information. In internal corporate systems, users are willing to wait a little longer for results, but there is still a tradeoff between accuracy and speed. Most users would rather have the faster results and iterate on their searches than allow the system to process the queries with more complex techniques that provide better results. All of the major processing steps are described for an Information Retrieval System, but in many cases only a subset of them is used in operational systems because users are not willing to accept the increase in response time.

  The evolution of Information Retrieval Systems has been closely tied to the evolution of computer processing power. Early information retrieval systems focused on automating the manual indexing processes in libraries. These systems migrated the structure and organization of card catalogs into structured databases. They maintained the same Boolean search query structure associated with the database that was used for other database applications. This was feasible because all of the assignment of terms to describe the content of a document was done by professional indexers. In parallel, there was also academic research being done on small data sets that considered how to automate the indexing process, making all of the text of a document part of the searchable index. The only place that large systems designed to search massive amounts of text were available was in Government and Military systems. As commercial processing power and storage significantly increased, it became feasible to apply the algorithms and techniques being developed in the universities to commercial systems. In addition, the creation of original documents was also migrating to digital format, so that they were in a form that could be processed by the new algorithms. The largest change that drove information technologies to become part of everyone's experience was the introduction and growth of the Internet. The Internet became a massive repository of unstructured information, and information retrieval techniques were the only approach to effectively locate information on it. This changed the funding and development of search techniques from a few Government funded efforts to thousands of new ideas funded by Venture Capitalists, moving the more practical implementations of university algorithms into commercial systems.

  Information Retrieval System architecture can be segmented into four major processing subsystems. Each processing subsystem presents the opportunity to improve the capability of finding and retrieving the information needed by the user. The subsystems are Ingesting, Indexing, Searching and Displaying. This book uses these subsystems to organize the various technologies that are the building blocks to optimize the retrieval of relevant items for a user. That is to say, an end-to-end discussion of information retrieval system architecture is presented.
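The four subsystems can be pictured as a minimal processing pipeline. The sketch below is purely illustrative (the class and method names are not from the book, and a real system splits each stage into many cooperating applications); it shows only how an item flows from ingest through an inverted index to search and display:

```python
# Minimal sketch of the four subsystems: Ingest, Index, Search, Display.
# All names are illustrative; real systems split these into many applications.

class RetrievalSystem:
    def __init__(self):
        self.items = {}   # item_id -> original item (kept for display)
        self.index = {}   # term -> set of item_ids (inverted index)

    def ingest(self, item_id, text):
        """Ingest: accept an item, keep the original, normalize for indexing."""
        self.items[item_id] = text
        self._index(item_id, text.lower().split())

    def _index(self, item_id, terms):
        """Index: build the searchable inverted file."""
        for term in terms:
            self.index.setdefault(term, set()).add(item_id)

    def search(self, query):
        """Search: return item ids matching any query term."""
        hits = set()
        for term in query.lower().split():
            hits |= self.index.get(term, set())
        return hits

    def display(self, hits):
        """Display: present the original items for the hit list."""
        return [self.items[i] for i in sorted(hits)]

irs = RetrievalSystem()
irs.ingest(1, "information retrieval architecture")
irs.ingest(2, "database management systems")
print(irs.display(irs.search("retrieval")))
```

Each stage here is a single method; the chapters that follow expand each one into its own family of algorithms.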

1.1.1 Primary Information Retrieval Problems

  The primary challenge in information retrieval is the difference between how a user expresses what information they are looking for and the way the author of an item expressed the information he is presenting. In other words, the challenge is the mismatch between the language of the user and the language of the author. When authors create an item they have information (i.e., semantics) they are trying to communicate to others. They use the vocabulary they are used to in order to express that information. A user has an information need and translates the semantics of that need into the vocabulary they normally use, which they present as a query. It is easy to imagine the mismatch of vocabulary. There are many different ways of expressing the same concept (e.g., car versus automobile). In many cases both the author and the user know the same vocabulary, but which terms are most used to represent the same concept varies between them. In some cases the vocabulary is different and the user is attempting to describe a concept without the vocabulary used by authors who write about it (see Fig. 1.1). That is why information retrieval systems that focus on a specific domain (e.g., DNA) perform better than general purpose systems that contain diverse information: the vocabularies are more focused and shared within the specific domain.

  Fig. 1.1 Vocabulary domains

  There are obstacles to specification of the information a user needs that come from limits to the user's ability to express what information is needed, ambiguities inherent in languages, and differences between the user's vocabulary and that of the authors of the items in the database. In order for an Information Retrieval System to return good results, it is important to start with a good search statement allowing for the correlation of the search statement to the items in the database. The inability to accurately create a good query is a major issue and needs to be compensated for in information retrieval. Natural languages suffer from word ambiguities such as polysemy, which allows the same word to have multiple meanings, and the use of acronyms that are also words (e.g., the word "field" or the acronym "CARE"). Disambiguation techniques exist, but they introduce system overhead in processing power and extended search times, and often require interaction with the user.

  Most users have trouble generating a good search statement. The typical user does not have significant experience with, or the aptitude for, Boolean logic statements. The use of Boolean logic is a legacy from the evolution of database management systems and implementation constraints. Historically, commercial information retrieval systems were based upon databases. It is only with the introduction of Information Retrieval Systems such as FAST, Autonomy, ORACLE TEXT, and GOOGLE Appliances that accepting natural language queries is becoming a standard system feature. This allows users to state in natural language what they are interested in finding. But the completeness of the user specification is limited by the user's willingness to construct long natural language queries. Most users on the Internet enter one or two search terms, or at most a phrase. Quite often the user does not know the words that best describe the information they are looking for. The norm is now an iterative process where the user enters a search and then, based upon the first page of hit results, revises the query with other terms.

  Multimedia items add an additional level of complexity in search specification. Where the source format can be converted to text (e.g., audio transcription, Optical Character Recognition), the standard text techniques are still applicable. They just need to be enhanced to handle the errors in conversion (e.g., fuzzy searching). But query specification when searching for an image, a unique sound, or a video segment lacks any proven best interface approach. Typically such searches are specified by grabbing an example from the media being displayed or by having prestored examples of known objects in the media and letting the user select them for the search (e.g., images of leaders allowing searches for "Tony Blair"). In some cases the processing of the multimedia extracts metadata describing the item, and that metadata can be searched to locate items of interest (e.g., speaker identification, searching for "notions" in images; these will be discussed in detail later). This type of specification becomes more complex when coupled with Boolean or natural language textual specifications.
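One common way to tolerate conversion errors (a sketch of the general fuzzy-searching idea, not a specific algorithm from the book) is to match query terms to index terms within a small edit distance, so that an OCR or transcription error of a character or two still produces a hit:

```python
# Fuzzy term matching via Levenshtein edit distance: an illustrative way
# to tolerate OCR/transcription errors in converted text.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query_term, index_terms, max_dist=1):
    """Return index terms within max_dist edits of the query term."""
    return [t for t in index_terms if edit_distance(query_term, t) <= max_dist]

# "retrieval" mis-recognized by OCR as "retr1eval" still matches.
print(fuzzy_match("retr1eval", ["retrieval", "architecture", "index"]))
```

Production systems usually avoid scanning every index term like this (e.g., by using n-gram indices to pre-filter candidates), but the distance computation itself is the same.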

  In addition to the complexities in generating a query, quite often the user is not an expert in the area being searched and lacks the domain-specific vocabulary unique to that subject area. The user starts the search process with a general concept of the information required, but does not have a focused definition of exactly what is needed. Limited knowledge of the vocabulary associated with a particular area, along with lack of focus on exactly what information is needed, leads to the use of inaccurate and in some cases misleading search terms. Even when the user is an expert in the area being searched, the ability to select the proper search terms is constrained by lack of knowledge of the author's vocabulary. The problem comes from synonyms: which particular synonym is selected by the author and which by the user searching. All writers have a vocabulary limited by their life experiences, the environment where they were raised, and their ability to express themselves. Other than in very technical, restricted information domains, the user's search vocabulary does not match the author's vocabulary. Users usually start with simple queries that suffer from failure rates approaching 50 % (Nordlie-99).

  Another major problem in information retrieval systems is how to effectively present the possible items of interest identified by the system so the user can focus on the ones of most likely value. Historically, data has been presented in an order dictated by the order in which items are entered into the search indices (i.e., ordered by the date the system ingests the information or the creation date of the item). For those users interested in current events this is useful. But for the majority of searches it does not filter out less useful information. Information Retrieval Systems provide functions that return the results of a query in order of potential relevance to the user's query. But the inherent fallacy in current systems is that they present the information in a linear ordering. As noted before, users have very little patience for browsing long linear lists in sequential order. That is why they seldom look beyond the first page of the linear ordering. So even if the user's query returned the optimum set of items of interest, if there are too many false hits on the first page of the display, the user will revise their search. A non-linear way of presenting search results improves the user's ability to find the information they are interested in. Displaying search hits using visualization techniques allows the natural parallel processing capability of the user's mind to focus and localize on the items of interest, rather than being forced into a sequential processing model.

  Once the user has been able to localize on the many potential items of interest, other sophisticated processing techniques can aid the user in finding the information of interest in the hits. Techniques such as summarization across multiple items, link analysis of information, and timeline correlations of information can reduce the linear process of having to read each item of interest and provide an overall insight into the total information across multiple items. For example, if there has been a plane crash, the user working with the system may be able to localize a large number of news reports on the disaster. But it is not unusual to have almost complete redundancy of information in reports from different sources on the same topic. Thus the user would have to read many documents to try to find any new facts. A summarization across the multiple textual items that eliminates the redundant parts can significantly reduce the user's overhead (the time it takes to find the data the user needs). More importantly, it eliminates the possibility that the user gets tired of reading redundant information and misses reading the item that has significant new information in it.
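The redundancy-elimination idea can be illustrated crudely: keep a sentence from the pooled reports only if it does not heavily overlap a sentence already kept. The word-overlap (Jaccard) measure and the threshold below are assumptions for illustration, not the book's method:

```python
# Sketch: suppress near-duplicate sentences across multiple reports using
# Jaccard word overlap. The measure and threshold are illustrative choices.

def jaccard(a, b):
    """Word-set overlap between two sentences, in [0, 1]."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def summarize(sentences, threshold=0.6):
    """Keep a sentence only if it is not too similar to one already kept."""
    kept = []
    for s in sentences:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

reports = [
    "the plane crashed on landing at the airport",
    "the plane crashed on landing at the city airport",  # nearly redundant
    "two survivors were taken to the hospital",          # new information
]
print(summarize(reports))
```

The second report is dropped as redundant while the sentence carrying new facts survives, which is exactly the overhead reduction the text describes.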

1.1.2 Objectives of an Information Retrieval System

  The general objective of an Information Retrieval System is to minimize the time it takes for a user to locate the information they need. The goal is to provide the information needed to satisfy the user's question. Satisfaction does not necessarily mean finding all information on a particular issue. It means finding sufficient information that the user can proceed with whatever activity initiated the need for information. This is very important because it explains some of the drivers behind existing search systems and suggests that precision is typically more important than recall of all possible information. For example, a user looking for a particular product does not have to find the names of everyone that sells the product, or every company that manufactures the product, to meet their need of getting that product. Of course, if they did have total information, then it is possible they could have gotten it cheaper; but in most cases the consumer will never know what they missed. The fact that a user does not know how much information they missed explains why in most cases the precision of a search is more important than the ability to recall all possible items of interest: the user never knows what they missed, but they can tell if they are seeing a lot of useless information in the first few pages of search results. That does not mean finding everything on a topic is unimportant to some users. If you are trying to make a decision on purchasing a stock or a company, then finding all the facts about that stock or company may be critical to prevent a bad investment. Missing the one article talking about the company being sued and possibly going bankrupt could lead to a very painful investment. But providing comprehensive retrieval of all items that are relevant to a user's search can have the negative effect of information overload on the user. In particular, there is a tendency for important information to be repeated in many items on the same topic. Thus trying to get all information makes the process of reviewing and filtering out redundant information very tedious. The better a system is at finding all items on a question (recall), the more important techniques to present aggregates of that information become.

  From the user's perspective, time is the important factor used to gauge the effectiveness of information retrieval. Except for users that do information retrieval as a primary aspect of their job (e.g., librarians, research assistants), most users have very little patience for investing extensive time in finding information they need. They expect interactive response from their searches, with replies within 3–4 s at most. Instead of looking through all the hits to see what might be of value, they will only review the first and at most the second page before deciding they need to change their search strategy. These aspects of the human nature of searchers have had a direct effect on commercial web sites and the development of commercial information retrieval. The times that are candidates to be minimized in an Information Retrieval System are the time to create the query, the time to execute the query, the time to select which items returned from the query the user wants to review in detail, and the time to determine if a returned item is of value. The initial research in information retrieval focused on the search as the primary area of interest. But to meet users' expectations of fast response and to maximize the relevant information returned requires optimization in all of these areas. The time to create a query used to be considered outside the scope of technical system support. But systems such as Google know what is in their database and what other users have searched on, so as you type a query they provide hints on what to search for. This "vocabulary browse" capability helps the user in expanding the search string and helps in getting better precision.
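The "vocabulary browse" idea can be sketched with a sorted term list and a binary search for prefix completion. This is a minimal illustration of the concept (the function names and vocabulary are invented for the example), not how any particular engine implements its suggestions:

```python
# Sketch of "vocabulary browse": suggest completions for a query prefix
# from the system's known vocabulary, using a sorted list and bisect.
import bisect

def suggest(vocabulary, prefix, limit=5):
    """Return up to `limit` vocabulary terms starting with `prefix`."""
    vocab = sorted(vocabulary)
    i = bisect.bisect_left(vocab, prefix)   # first term >= prefix
    out = []
    while i < len(vocab) and vocab[i].startswith(prefix) and len(out) < limit:
        out.append(vocab[i])
        i += 1
    return out

vocab = ["precision", "predicate", "presentation", "recall", "retrieval"]
print(suggest(vocab, "pre"))
```

Real systems also rank suggestions by query popularity; here the completions simply come back in lexicographic order.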

  In information retrieval the term "relevant" is used to represent an item containing the needed information. In reality, relevance is not a binary classification but a continuous function. Items can exactly match the information need or partially match it. From a user's perspective, "relevant" and "needed" are synonymous. From a system perspective, information could be relevant to a search statement (i.e., matching the criteria of the search statement) even though it is not needed by or relevant to the user (e.g., the user already knew the information or just read it in the previous item reviewed).

  When considering the document space (all items in the information retrieval system), for any specific information request and the documents returned from it based upon a query, the document space can be divided into four quadrants. Documents returned can be relevant to the information request or not relevant. Documents not returned also fall into those two categories: relevant and not relevant (see Fig. 1.2).

  Fig. 1.2 Relevant retrieval document space

  Relevant documents are those that contain some information that helps answer the user's information need. Non-relevant documents do not contain any useful information. Using these definitions, the two primary metrics used in evaluating information retrieval systems can be defined. They are Precision and Recall:

  Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

  Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

  where Number_Possible_Relevant is the number of relevant items in the database, Number_Total_Retrieved is the total number of items retrieved by the query, and Number_Retrieved_Relevant is the number of retrieved items that are relevant to the user's search need.

  Precision is the factor that most users understand. When a user executes a search and has 80 % precision, it means that 4 out of 5 items retrieved are of interest to the user. From a user perspective, the lower the precision, the more likely the user is wasting their resource (time) looking at non-relevant items. From a metric perspective, the precision figure is across all of the "hits" returned by the query. But in reality most users will only look at the first few pages of hit results before deciding to change their query strategy. Thus what is of more value in commercial systems is not the total precision but the precision across the first 20–50 hits. Typically, in a weighted system where the words within a document are assigned weights based upon how well they describe the semantics of the document, precision in the first 20–50 items is higher than the precision across all the possible hits returned (i.e., the further down the hit list, the more likely an item is not of interest). But when comparing search systems the total precision is used.

  Recall is a very useful concept in comparing systems. It measures how well a search system is capable of retrieving all possible hits that exist in the database. Unfortunately, it is impossible to calculate except in very controlled environments. It requires in the denominator the total number of relevant items in the database; if the system could determine that number, then it could simply return them. There have been some attempts to estimate the total number of relevant items in a database, but there are no techniques that provide accurate enough results to be used for a specific search request. In Chap. 9 on Information Retrieval Evaluation, techniques that have been used in evaluating the accuracy of different search systems are described. But they are not applicable in the general case.
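Given relevance judgments, both metrics can be computed directly from the definitions above; a minimal sketch, with an invented set of judged items:

```python
# Precision and Recall computed from a retrieved hit list and the (rarely
# knowable in practice) full set of relevant items in the database.

def precision(retrieved, relevant):
    """Number_Retrieved_Relevant / Number_Total_Retrieved."""
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Number_Retrieved_Relevant / Number_Possible_Relevant."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = [1, 2, 3, 4, 5]   # items returned by the query
relevant = [2, 4, 6, 8]       # all relevant items in the database

print(precision(retrieved, relevant))  # 2 of 5 retrieved are relevant -> 0.4
print(recall(retrieved, relevant))     # 2 of 4 relevant were retrieved -> 0.5
```

Note that recall requires knowing the full relevant set, which is exactly the quantity the text says is unavailable outside controlled evaluation environments.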

  Figure 1.3a shows the values of precision and recall as the number of items retrieved increases, under an optimum query where every returned item is relevant. There are "N" relevant items in the database. Figures 1.3b and 1.3c show the optimal and currently achievable relationships between Precision and Recall (Harman-95). In Fig. 1.3a the basic properties of precision (solid line) and recall (dashed line) can be observed. Precision starts at 100 % and maintains that value as long as relevant items are retrieved. Recall starts close to zero and increases as long as relevant items are retrieved, until all possible relevant items have been retrieved. Once all "N" relevant items have been retrieved, the only items being retrieved are non-relevant. Precision is directly affected by retrieval of non-relevant items and drops to a number close to zero. Recall is not affected by retrieval of non-relevant items and thus remains at 100 %.

  Fig. 1.3 a Ideal precision and recall. b Ideal precision/recall graph. c Achievable precision/recall graph

  Figure 1.3b plots precision against recall across the hit file, assuming the hit file is ordered by ranking from the most relevant to the least relevant item. As with Fig. 1.3a, it shows the perfect case where every item retrieved is relevant. The values of precision and recall are recalculated after every “n” items in the ordered hit list. For example, if “n” is 10 then the first 10 items are used to calculate the first point on the chart for precision and recall.

  The first 20 items are used to calculate the precision and recall for the second point, and so on until the complete hit list is evaluated. The precision stays at 100 % (1.0) until all of the relevant items have been retrieved. Recall continues to increase while moving to the right on the x-axis until it also reaches the 100 % (1.0) point, which is where the figure stops. If the graph were continued, the curve would stay at the same recall value (recall never changes and remains 100 %) while precision decreases down the y-axis toward the x-axis as more non-relevant items are retrieved.

  Figure 1.3c shows a typical result from the TREC conferences (see Chap. 9) and is representative of current search capabilities. This is called the eleven-point interpolated average precision graph. The precision is measured at 11 recall levels (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0). Most systems do not reach recall level 1.0 (found all relevant items) but will end at a lower number. To understand Fig. 1.3c it’s useful to describe the implications of a particular point on the precision/recall graph. Assume that there are 200 relevant items in the database and that, from the graph, at a precision of 0.3 (i.e., 30 % of the retrieved items are relevant) there is an associated recall of 0.5 (i.e., 50 % of the relevant items have been retrieved from the database). The recall of 50 % means there would be 100 relevant items in the Hit file (50 % of 200 items). A precision of 30 % means the user would have to review approximately 333 items to find those 100 relevant items (30 % of 333 is 100 items).
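  As a sketch of how the points on such a graph are computed (the book gives no code; the interpolation rule shown here is the standard TREC convention): interpolated precision at recall level r is the maximum precision observed at any recall at or above r in the ranked hit list.

```python
def eleven_point_interpolated_precision(hit_list, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    Interpolated precision at recall r is the maximum precision
    observed at any point in the ranked list whose recall >= r.
    """
    observed = []          # (recall, precision) after each rank
    hits = 0
    for rank, item in enumerate(hit_list, start=1):
        if item in relevant:
            hits += 1
        observed.append((hits / len(relevant), hits / rank))
    points = []
    for i in range(11):
        r = i / 10
        candidates = [p for (rec, p) in observed if rec >= r]
        points.append(max(candidates) if candidates else 0.0)
    return points
```

  If a system never reaches a recall level (as the text notes most systems do not reach 1.0), the sketch reports 0.0 at that level; real evaluation tools make the same choice.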

1.2 Functional Overview of Information Retrieval Systems

  Most of this book is focused on the detailed technologies associated with information retrieval systems. A functional overview will help to better place the technologies in perspective and provide additional insight into what an information system needs to achieve.

  An information retrieval system starts with the ingestion of information. Chapter 3 describes the ingest process in detail. There are multiple functions that are applied to the information once it has been ingested. The most obvious function is to store the item in its original format in an item database and create a searchable index to allow for later ad hoc searching and retrieval of an item. Another operation that can occur on the item as it is being received is “Selective Dissemination of Information” (SDI). This function allows users to specify search statements of interest (called “Profiles”), and whenever an incoming item satisfies the search specification the item is stored in a user’s “mail” box for later review. This is a dynamic filtering of the input stream for each user down to the subset they want to look at on a daily basis. Since it is a dynamic process, the mail box is constantly receiving new items of possible interest. Associated with the Selective Dissemination of Information process is the “Alert” process. The alert process attempts to notify the user whenever a new item meets the user’s criteria for immediate action. This helps the user in multitasking: doing their normal daily tasks but being made aware when there is something that requires immediate attention.

  Finally there is automatically adding metadata and creating a logical view of the items within a structured taxonomy. The user can then navigate the taxonomy to find items of interest versus having to search for them. The indexing assigns additional descriptive citational and semantic metadata to an item.

  Fig. 1.4 Functional overview

1.2.1 Selective Dissemination of Information

  The Selective Dissemination of Information (Mail) Process (see Fig. 1.4) provides the capability to

  dynamically compare newly received items against stored statements of interest of users and deliver the item to those users whose statement of interest matches the contents of the item. The Mail process is composed of the search process, user statements of interest (Profiles) and user mail files. As each item is received, it is processed against every user’s profile. A profile typically contains a broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied. User mail profiles differ from interactive user queries in that they contain significantly more search terms (10–100 times more terms) and cover a wider range of interests. These profiles define all the areas in which a user is interested, versus an interactive query which is frequently focused to answer a specific question. It has been shown in studies that automatically expanded user profiles perform significantly better than human generated profiles (Harman-95).

  When the search statement is satisfied, the item is placed in the Mail File(s) associated with the profile. Items in Mail files are typically viewed in time of receipt order and automatically deleted after a specified time period (e.g., after one month) or upon command from the user during display. The dynamic asynchronous updating of Mail Files makes it difficult to present the results of dissemination in estimated order of likelihood of relevance to the user (ranked order).

  Very little research has focused exclusively on the Mail Dissemination process. Most systems modify the algorithms they have established for retrospective search of document (item) databases to apply to Mail Profiles. Dissemination differs from the ad hoc search process in that thousands of user profiles are processed against each new item versus the inverse and there is not a large relatively static database of items to be used in development of relevance ranking weights for an item. One common implementation is to not build the mail files as items come into the system. Instead when the user requests to see their Mail File, a query is initiated that will dynamically produce the mail file. This works as long as the user does not have the capability to selectively eliminate items from their mail file. In this case a permanent file structure is needed. When a permanent file structure is implemented typically the mail profiles become a searchable structure and the words in each new item become the queries against it. Chapter 2 will describe n-grams which are one method to help in creating a mail search system.
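  The idea of turning dissemination inside out — indexing the profiles and running each new item’s words as the query — can be sketched as follows. The class, names, and all-terms match rule are illustrative inventions, not the book’s design; a real system would use weighted, partial matching rather than requiring every profile term to appear.

```python
from collections import defaultdict

class Disseminator:
    """Minimal SDI sketch: profiles are indexed by term, and each
    incoming item's words are run as a query against that index."""

    def __init__(self):
        self.term_to_profiles = defaultdict(set)  # inverted profile index
        self.profiles = {}                        # profile id -> required terms
        self.mailboxes = defaultdict(list)        # profile id -> delivered items

    def add_profile(self, profile_id, terms):
        self.profiles[profile_id] = set(terms)
        for term in terms:
            self.term_to_profiles[term].add(profile_id)

    def ingest(self, item_id, text):
        words = set(text.lower().split())
        # Only profiles sharing at least one term are candidates.
        candidates = set()
        for word in words:
            candidates |= self.term_to_profiles.get(word, set())
        for pid in candidates:
            # Toy match rule: every profile term must appear in the item.
            if self.profiles[pid] <= words:
                self.mailboxes[pid].append(item_id)
```

  Because only candidate profiles are ever checked, an item touching a handful of terms avoids scanning all thousands of profiles, which is the point of making the profiles the searchable structure.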

1.2.2 Alerts

  Alerts are very similar to the processing for mail items. The user defines a set of “alert profiles”, which are search statements that define the information a user wants to be alerted on. The profile has additional metadata that may contain a list of e-mail addresses to which an alert notice should be mailed. If the user is currently logged onto the alert system, a dynamic message could also be presented to the user. Alerts on textual items are simple in that the complete textual item can be processed against the alert profiles and then the alert notifications, with links to the alerted item, can be sent out. Typically a user will have a number of focused alert profiles rather than the more general Mail profiles, because the user wants to know more precisely the cause of the alert, whereas Mail profiles collect the general areas of interest to a user. When processing textual items it is possible to process the complete item before the alert profiles are validated against it because the processing is so fast.

  For multimedia (e.g., alerts on television news programs), the processing of the multimedia item happens in real time. But waiting until the end of the complete program to send out the alert could introduce significant delays in the user’s ability to react to the item. In this case, alert notifications are sent out periodically (e.g., every few minutes or after “n” alerts have been identified). This makes it necessary to define other rules to ensure the user is not flooded with alerts. The basic concept that needs to be implemented is that a user should receive only one alert notification for a specific item for each alert profile of the user’s that the item satisfies. This is enough for the user to decide whether they want to look at the item. When the user looks at the item, all instances within the item that have to that point met the alert criteria should be displayed. For example, assume a user has alert profiles on Natural Disaster, Economic Turmoil and Military Action. When a hurricane hit the US Gulf of Mexico oil platforms, a news video could hit on both Natural Disaster and Economic Turmoil. Within minutes into the broadcast the first hits to those profiles would be identified and the alerts sent to the user. The user only needs to know the hits occurred. When the user displays the video, maybe 10 min into the news broadcast, all of the parts of the news program up to the current time that satisfied the profiles should be indicated.
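  The one-notification-per-item-per-profile rule can be sketched as below; the class and method names are hypothetical, and `notify` stands in for whatever delivery mechanism (e-mail, on-screen message) the profile’s metadata specifies. Every match position is still recorded so that all hits up to the current time can be indicated when the user displays the item.

```python
class AlertManager:
    """Sketch of the one-notification rule for streaming alerts:
    a user gets at most one notice per (item, profile) pair, while
    every matching position inside the item is still recorded for
    later display."""

    def __init__(self):
        self.notified = set()   # (item_id, profile_id) pairs already alerted
        self.matches = {}       # (item_id, profile_id) -> list of positions

    def on_match(self, item_id, profile_id, position, notify):
        # Always record the match position for later display.
        self.matches.setdefault((item_id, profile_id), []).append(position)
        # But only send one notification per item/profile pair.
        if (item_id, profile_id) not in self.notified:
            self.notified.add((item_id, profile_id))
            notify(item_id, profile_id)
```

  In the news-video example, the second Natural Disaster hit a few minutes into the broadcast adds a position to `matches` but triggers no second notice.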

1.2.3 Items and Item Index

  The retrospective item Search Process (see Fig. 1.4) provides the capability for a query to search

  against all items received by the system. The Item index is the searchable data structure that is derived from the contents of each item. In addition, the original item is saved to display as the results of a search. The user-entered queries (typically ad hoc queries) are run against the Item index. This is sometimes called the retrospective search of the system. If the user is on-line, the Selective Dissemination of Information system delivers items of interest to the user as soon as they are processed into the system. Any search for information that has already been processed into the system can be considered a “retrospective” search for information. This does not preclude a search from having search statements constraining it to items received in the last few hours, but typically the searches span far greater time periods. Each query is processed against the total item index. Queries differ from alert and mail profiles in that queries are typically short and focused on a specific area of interest. The Item Database can be very large, hundreds of millions or billions of items. Typically items in the Item Database do not change (i.e., are not edited) once received. The value of information quickly decreases over time. Historically these facts were used to partition the database by time and allow for archiving by the time partitions. Advances in storage and processors now allow all the indices to remain on-line. But for multimedia item databases, the original items are often moved to slower but cheaper tape storage (i.e., using Hierarchical Storage Management systems).
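  A toy version of the item index can be sketched as an inverted file, with the original items kept separately for display. This is an illustrative sketch only (the actual index structures are covered in later chapters): it assumes simple whitespace tokenization and Boolean AND queries.

```python
from collections import defaultdict

class ItemIndex:
    """Toy item index: the searchable structure is an inverted file
    mapping each word to the items containing it; original items are
    stored separately so search results can be displayed."""

    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of item ids
        self.items = {}                    # item id -> original item text

    def ingest(self, item_id, text):
        self.items[item_id] = text
        for word in text.lower().split():
            self.postings[word].add(item_id)

    def search(self, query):
        """AND semantics: return items containing every query word."""
        words = query.lower().split()
        if not words:
            return set()
        result = self.postings[words[0]].copy()
        for word in words[1:]:
            result &= self.postings[word]
        return result
```

  Because items are never edited once received, postings only ever grow, which is what historically made time-partitioned archiving of the index straightforward.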

1.2.4 Indexing and Mapping to a Taxonomy

  In addition to the item itself there is additional citational metadata that can be determined for the item.

  Citational metadata typically describes aspects of the item other than the semantics of the item. For example, typical citational metadata that can go into an index of the items received is the date it is received, its source (e.g., CNN news), the author, etc. All of that information may be useful in locating information but does not describe the information in the item. This metadata can subset the total set of items to be searched, reducing the chances for false hits. Automatic indexing can extract the citational information and can also extract additional data from the item that can be used to index the item, but usually the semantic metadata assigned to describe an item is human generated (see Chap. 4). The index of metadata against the entire database of items (called the public index) expands the searchable information beyond the index of each item’s content to satisfy a user’s search. In addition to a public index of the items coming in, users can also generate their own private index to the items. This can be used to logically define subsets of the received items that are focused on a particular user’s interest, along with keywords to describe the items. This subsetting can be used to constrain a user’s search, thereby significantly increasing the precision of a user’s search at the expense of recall.

  In addition to the indexing, some systems attempt to organize the items by mapping items received to locations within a predefined or dynamically defined taxonomy (e.g., the Autonomy system). A Taxonomy (sometimes referred to as an Ontology) is a hierarchical ordering of a set of controlled vocabulary terms that describe concepts. Taxonomies provide an alternative mechanism for users to navigate to information of interest. The user expands the taxonomy tree until they get to the area of interest and then reviews the items at that location in the taxonomy. This has the advantage that users without an in-depth knowledge of an area can let the structured taxonomy guide them to the area of interest. A typical use of a taxonomy is a wine site that lets you navigate through the different wines that are available: it lets you select the general class of wines, then the grapes, and then specific brands. In this case there is a very focused taxonomy. But in the general information retrieval case there can be a large number of taxonomies covering the most important conceptual areas that the information retrieval system’s users care about.

  The data for the taxonomy is often discovered as part of the ingest process and then applied as an alternative index that users can search and navigate. Some systems, as part of their display, will take a hit list of documents and create a taxonomy of the information content for that set of items. This is an example of the visualization process, except that objects are assigned to locations in a static taxonomy (this is discussed in Chap. 7).
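  The wine-site example can be sketched as a nested structure that the user expands level by level. The categories and brand names below are made up for illustration; a real taxonomy would of course be far larger and often machine-discovered during ingest.

```python
# Hypothetical wine-site taxonomy: class of wine -> grape -> brands.
# Each level the user expands narrows the set of items under review.
taxonomy = {
    "Red": {
        "Cabernet Sauvignon": ["Brand A", "Brand B"],
        "Pinot Noir": ["Brand C"],
    },
    "White": {
        "Chardonnay": ["Brand D"],
    },
}

def navigate(tree, path):
    """Follow a sequence of category choices down the taxonomy and
    return whatever sits at that node: a subtree of further choices,
    or the list of items at a leaf."""
    node = tree
    for step in path:
        node = node[step]
    return node
```

  Navigating `["Red", "Pinot Noir"]` returns the brand list at that leaf, while navigating `["Red"]` returns the grape-level choices still to be expanded, mirroring how a user without domain knowledge is led to the subset of interest.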

1.3 Understanding Search Functions

  The objective of the search capability is to allow for a mapping between a user’s information need and the items in the information database that will answer that need. The search query statement is the means that the user employs to communicate a description of the needed information to the system. It can consist of natural language text in composition style and/or query terms with Boolean logic indicators between them. Understanding the functions associated with search helps in understanding what architectures best allow for those functions to be provided.