Index elimination Champion lists
7.1 Efficient scoring and ranking
1377.1.1 Inexact top
K document retrieval Thus far, we have focused on retrieving precisely the K highest-scoring doc- uments for a query. We now consider schemes by which we produce K doc- uments that are likely to be among the K highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the K documents we output, without materially altering the user’s perceived relevance of the top K results. Consequently, in most applications it suffices to retrieve K documents whose scores are very close to those of the K best. In the sections that follow we detail schemes that retrieve K such documents while potentially avoiding computing scores for most of the N documents in the collection. Such inexact top-K retrieval is not necessarily, from the user’s perspective, a bad thing. The top K documents by the cosine measure are in any case not necessarily the K best for the query: cosine similarity is only a proxy for the user’s perceived relevance. In Sections 7.1.2 – 7.1.6 below, we give heuristics using which we are likely to retrieve K documents with cosine scores close to those of the top K documents. The principal cost in computing the out- put stems from computing cosine similarities between the query and a large number of documents. Having a large number of documents in contention also increases the selection cost in the final stage of culling the top K docu- ments from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme: 1. Find a set A of documents that are contenders, where K | A | ≪ N . A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K. 2. Return the K top-scoring documents in A. From the descriptions of these ideas it will be clear that many of them require parameters to be tuned to the collection and application at hand; pointers to experience in setting these parameters may be found at the end of this chapter. It should also be noted that most of these heuristics are well-suited to free text queries, but not for Boolean or phrase queries.7.1.2 Index elimination
For a multi-term query q, it is clear we only consider documents containing at least one of the query terms. We can take this a step further using additional heuristics: 1. We only consider documents containing terms whose idf exceeds a preset threshold. Thus, in the postings traversal, we only traverse the postings Preliminary draft c 2008 Cambridge UP 138 7 Computing scores in a complete search system for terms with high idf. This has a fairly significant benefit: the post- ings lists of low-idf terms are generally long; with these removed from contention, the set of documents for which we compute cosines is greatly reduced. One way of viewing this heuristic: low-idf terms are treated as stop words and do not contribute to scoring. For instance, on the query catcher in the rye , we only traverse the postings for catcher and rye . The cutoff threshold can of course be adapted in a query-dependent manner. 2. We only consider documents that contain many and as a special case, all of the query terms. This can be accomplished during the postings traversal; we only compute scores for documents containing all or many of the query terms. A danger of this scheme is that by requiring all or even many query terms to be present in a document before considering it for cosine computation, we may end up with fewer than K candidate documents in the output. This issue will discussed further in Section 7.2.1 .7.1.3 Champion lists
The idea of champion lists sometimes also called fancy lists or top docs is to precompute, for each term t in the dictionary, the set of the r documents with the highest weights for t; the value of r is chosen in advance. For tf- idf weighting, these would be the r documents with the highest tf values for term t. We call this set of r documents the champion list for term t. Now, given a query q we create a set A as follows: we take the union of the champion lists for each of the terms comprising q. We now restrict cosine computation to only the documents in A. A critical parameter in this scheme is the value r, which is highly application dependent. Intuitively, r should be large compared with K, especially if we use any form of the index elimination described in Section 7.1.2 . One issue here is that the value r is set at the time of index construction, whereas K is application dependent and may not be available until the query is received; as a result we may as in the case of index elimination find ourselves with a set A that has fewer than K documents. There is no reason to have the same value of r for all terms in the dictionary; it could for instance be set to be higher for rarer terms.7.1.4 Static quality scores and ordering
Parts
» cambridge introductiontoinformationretrieval2008 readingversion
» An example information retrieval problem
» A first take at building an inverted index
» Processing Boolean queries cambridge introductiontoinformationretrieval2008 readingversion
» The extended Boolean model versus ranked retrieval
» References and further reading
» Obtaining the character sequence in a document
» Tokenization Determining the vocabulary of terms
» Dropping common terms: stop words
» Normalization equivalence classing of terms
» Stemming and lemmatization Determining the vocabulary of terms
» Faster postings list intersection via skip pointers
» Biword indexes Positional postings and phrase queries
» Positional indexes Positional postings and phrase queries
» Combination schemes Positional postings and phrase queries
» Search structures for dictionaries
» General wildcard queries Wildcard queries
» k-gram indexes for wildcard queries
» Implementing spelling correction Spelling correction
» Forms of spelling correction Edit distance
» k-gram indexes for spelling correction
» Context sensitive spelling correction
» Phonetic correction cambridge introductiontoinformationretrieval2008 readingversion
» Hardware basics Blocked sort-based indexing
» Single-pass in-memory indexing cambridge introductiontoinformationretrieval2008 readingversion
» Distributed indexing cambridge introductiontoinformationretrieval2008 readingversion
» Dynamic indexing cambridge introductiontoinformationretrieval2008 readingversion
» References and further reading Exercises
» Heaps’ law: Estimating the number of terms Zipf’s law: Modeling the distribution of terms
» Blocked storage Dictionary compression
» Variable byte codes γ Postings file compression
» Exercises cambridge introductiontoinformationretrieval2008 readingversion
» Weighted zone scoring Parametric and zone indexes
» Learning weights Parametric and zone indexes
» The optimal weight Parametric and zone indexes
» Inverse document frequency Term frequency and weighting
» Tf-idf weighting Term frequency and weighting
» Dot products The vector space model for scoring
» Queries as vectors Computing vector scores
» Sublinear tf scaling Maximum tf normalization
» Document and query weighting schemes Pivoted normalized document length
» Inexact top Efficient scoring and ranking
» Index elimination Champion lists
» Static quality scores and ordering
» Impact ordering Efficient scoring and ranking
» Cluster pruning Efficient scoring and ranking
» Tiered indexes Query-term proximity
» Designing parsing and scoring functions
» Vector space scoring and query operator interaction
» Information retrieval system evaluation
» Standard test collections cambridge introductiontoinformationretrieval2008 readingversion
» Evaluation of unranked retrieval sets
» Evaluation of ranked retrieval results
» Critiques and justifications of the concept of relevance
» Results snippets cambridge introductiontoinformationretrieval2008 readingversion
» The Rocchio algorithm for relevance feedback
» Probabilistic relevance feedback When does relevance feedback work?
» Relevance feedback on the web
» Evaluation of relevance feedback strategies
» Pseudo relevance feedback Indirect relevance feedback
» Vocabulary tools for query reformulation Query expansion
» Basic XML concepts cambridge introductiontoinformationretrieval2008 readingversion
» A vector space model for XML retrieval
» Text-centric vs. data-centric XML retrieval
» Review of basic probability theory
» The 10 loss case The Probability Ranking Principle
» Deriving a ranking function for query terms
» Probability estimates in theory
» Probability estimates in practice
» Probabilistic approaches to relevance feedback
» An appraisal of probabilistic models
» Tree-structured dependencies between terms Okapi BM25: a non-binary model
» Finite automata and language models
» Multinomial distributions over words
» Using query likelihood language models in IR
» Estimating the query generation probability
» Ponte and Croft’s Experiments
» Language modeling versus other approaches in IR
» Extended language modeling approaches
» The text classification problem
» The Bernoulli model cambridge introductiontoinformationretrieval2008 readingversion
» Mutual information Feature selection
» Frequency-based feature selection Feature selection for multiple classifiers
» Evaluation of text classification
» Document representations and measures of relatedness in vec-
» Rocchio classification cambridge introductiontoinformationretrieval2008 readingversion
» Time complexity and optimality of kNN
» Linear versus nonlinear classifiers
» Classification with more than two classes
» The bias-variance tradeoff cambridge introductiontoinformationretrieval2008 readingversion
» Support vector machines: The linearly separable case
» Soft margin classification Extensions to the SVM model
» Multiclass SVMs Nonlinear SVMs
» Experimental results Extensions to the SVM model
» Choosing what kind of classifier to use
» Improving classifier performance Issues in the classification of text documents
» A simple example of machine-learned scoring
» Result ranking by machine learning
» Clustering in information retrieval
» Evaluation of clustering cambridge introductiontoinformationretrieval2008 readingversion
» Cluster cardinality in K-means
» Model-based clustering cambridge introductiontoinformationretrieval2008 readingversion
» Centroid clustering cambridge introductiontoinformationretrieval2008 readingversion
» Optimality of HAC cambridge introductiontoinformationretrieval2008 readingversion
» Divisive clustering cambridge introductiontoinformationretrieval2008 readingversion
» Cluster labeling cambridge introductiontoinformationretrieval2008 readingversion
» Implementation notes cambridge introductiontoinformationretrieval2008 readingversion
» Term-document matrices and singular value decompositions
» Low-rank approximations cambridge introductiontoinformationretrieval2008 readingversion
» Latent semantic indexing cambridge introductiontoinformationretrieval2008 readingversion
» Background and history cambridge introductiontoinformationretrieval2008 readingversion
» The web graph Web characteristics
» Advertising as the economic model
» Near-duplicates and shingling cambridge introductiontoinformationretrieval2008 readingversion
» Features a crawler Features a crawler
» Distributing indexes cambridge introductiontoinformationretrieval2008 readingversion
» Connectivity servers cambridge introductiontoinformationretrieval2008 readingversion
» Anchor text and the web graph
» The PageRank computation PageRank
Show more