References and further reading
1.5 References and further reading
17 Exercise 1.13 [ ⋆ ] Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar , and submit the queries i burglar , ii burglar AND burglar , and iii burglar OR burglar . Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for i knight , ii conquer , and then iii knight OR conquer . What bound should the number of results from the first two queries place on the third query? Is this bound observed?1.5 References and further reading
The practical pursuit of computerized information retrieval began in the late 1940s Cleverdon 1991 , Liddy 2005 . A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, doc- ument retrieval was always based on author, title, and keywords; full-text search came much later. The article of Bush 1945 provided lasting inspiration for the new field: “Consider a future device for individual use, which is a sort of mech- anized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mech- anized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.” The term Information Retrieval was coined by Calvin Mooers in 19481950 Mooers 1950 . In 1958, much newspaper attention was paid to demonstrations at a con- ference see Taube and Wooster 1958 of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example Moo- ers 1961 dissented: “It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the al- gebra of George Boole 1847 is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.” The observation of AND vs. OR giving you opposite extremes in a precision recall tradeoff, but not the middle ground comes from Lee and Fox 1988 . Preliminary draft c 2008 Cambridge UP 18 1 Boolean retrieval The book Witten et al. 1999 is the standard reference for an in-depth com- parison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation ap- pears in Zobel and Moffat 2006 . We further discuss several approaches in Chapter 5 . Friedl 2006 covers the practical usage of regular expressions for searching. REGULAR EXPRESSIONS The underlying computer science appears in Hopcroft et al. 2000 . Preliminary draft c 2008 Cambridge UP DRAFT © July 12, 2008 Cambridge University Press. Feedback welcome. 19 2 The term vocabulary and postings lists Recall the major steps in inverted index construction: 1. Collect the documents to be indexed. 2. Tokenize the text. 3. Do linguistic preprocessing of tokens. 4. Index the documents that each term occurs in. In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined Section 2.1 . We then examine in detail some of the substantive linguis- tic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms which a system uses Section 2.2 . Tokenization is the process of chopping character streams into tokens, while linguistic prepro- cessing then deals with building equivalence classes of tokens which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4 . Then we return to the implementation of postings lists. In Section 2.3 , we examine an extended postings list data structure that supports faster query- ing, while Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.2.1 Document delineation and character sequence decoding
Parts
» cambridge introductiontoinformationretrieval2008 readingversion
» An example information retrieval problem
» A first take at building an inverted index
» Processing Boolean queries cambridge introductiontoinformationretrieval2008 readingversion
» The extended Boolean model versus ranked retrieval
» References and further reading
» Obtaining the character sequence in a document
» Tokenization Determining the vocabulary of terms
» Dropping common terms: stop words
» Normalization equivalence classing of terms
» Stemming and lemmatization Determining the vocabulary of terms
» Faster postings list intersection via skip pointers
» Biword indexes Positional postings and phrase queries
» Positional indexes Positional postings and phrase queries
» Combination schemes Positional postings and phrase queries
» Search structures for dictionaries
» General wildcard queries Wildcard queries
» k-gram indexes for wildcard queries
» Implementing spelling correction Spelling correction
» Forms of spelling correction Edit distance
» k-gram indexes for spelling correction
» Context sensitive spelling correction
» Phonetic correction cambridge introductiontoinformationretrieval2008 readingversion
» Hardware basics Blocked sort-based indexing
» Single-pass in-memory indexing cambridge introductiontoinformationretrieval2008 readingversion
» Distributed indexing cambridge introductiontoinformationretrieval2008 readingversion
» Dynamic indexing cambridge introductiontoinformationretrieval2008 readingversion
» References and further reading Exercises
» Heaps’ law: Estimating the number of terms Zipf’s law: Modeling the distribution of terms
» Blocked storage Dictionary compression
» Variable byte codes γ Postings file compression
» Exercises cambridge introductiontoinformationretrieval2008 readingversion
» Weighted zone scoring Parametric and zone indexes
» Learning weights Parametric and zone indexes
» The optimal weight Parametric and zone indexes
» Inverse document frequency Term frequency and weighting
» Tf-idf weighting Term frequency and weighting
» Dot products The vector space model for scoring
» Queries as vectors Computing vector scores
» Sublinear tf scaling Maximum tf normalization
» Document and query weighting schemes Pivoted normalized document length
» Inexact top Efficient scoring and ranking
» Index elimination Champion lists
» Static quality scores and ordering
» Impact ordering Efficient scoring and ranking
» Cluster pruning Efficient scoring and ranking
» Tiered indexes Query-term proximity
» Designing parsing and scoring functions
» Vector space scoring and query operator interaction
» Information retrieval system evaluation
» Standard test collections cambridge introductiontoinformationretrieval2008 readingversion
» Evaluation of unranked retrieval sets
» Evaluation of ranked retrieval results
» Critiques and justifications of the concept of relevance
» Results snippets cambridge introductiontoinformationretrieval2008 readingversion
» The Rocchio algorithm for relevance feedback
» Probabilistic relevance feedback When does relevance feedback work?
» Relevance feedback on the web
» Evaluation of relevance feedback strategies
» Pseudo relevance feedback Indirect relevance feedback
» Vocabulary tools for query reformulation Query expansion
» Basic XML concepts cambridge introductiontoinformationretrieval2008 readingversion
» A vector space model for XML retrieval
» Text-centric vs. data-centric XML retrieval
» Review of basic probability theory
» The 10 loss case The Probability Ranking Principle
» Deriving a ranking function for query terms
» Probability estimates in theory
» Probability estimates in practice
» Probabilistic approaches to relevance feedback
» An appraisal of probabilistic models
» Tree-structured dependencies between terms Okapi BM25: a non-binary model
» Finite automata and language models
» Multinomial distributions over words
» Using query likelihood language models in IR
» Estimating the query generation probability
» Ponte and Croft’s Experiments
» Language modeling versus other approaches in IR
» Extended language modeling approaches
» The text classification problem
» The Bernoulli model cambridge introductiontoinformationretrieval2008 readingversion
» Mutual information Feature selection
» Frequency-based feature selection Feature selection for multiple classifiers
» Evaluation of text classification
» Document representations and measures of relatedness in vec-
» Rocchio classification cambridge introductiontoinformationretrieval2008 readingversion
» Time complexity and optimality of kNN
» Linear versus nonlinear classifiers
» Classification with more than two classes
» The bias-variance tradeoff cambridge introductiontoinformationretrieval2008 readingversion
» Support vector machines: The linearly separable case
» Soft margin classification Extensions to the SVM model
» Multiclass SVMs Nonlinear SVMs
» Experimental results Extensions to the SVM model
» Choosing what kind of classifier to use
» Improving classifier performance Issues in the classification of text documents
» A simple example of machine-learned scoring
» Result ranking by machine learning
» Clustering in information retrieval
» Evaluation of clustering cambridge introductiontoinformationretrieval2008 readingversion
» Cluster cardinality in K-means
» Model-based clustering cambridge introductiontoinformationretrieval2008 readingversion
» Centroid clustering cambridge introductiontoinformationretrieval2008 readingversion
» Optimality of HAC cambridge introductiontoinformationretrieval2008 readingversion
» Divisive clustering cambridge introductiontoinformationretrieval2008 readingversion
» Cluster labeling cambridge introductiontoinformationretrieval2008 readingversion
» Implementation notes cambridge introductiontoinformationretrieval2008 readingversion
» Term-document matrices and singular value decompositions
» Low-rank approximations cambridge introductiontoinformationretrieval2008 readingversion
» Latent semantic indexing cambridge introductiontoinformationretrieval2008 readingversion
» Background and history cambridge introductiontoinformationretrieval2008 readingversion
» The web graph Web characteristics
» Advertising as the economic model
» Near-duplicates and shingling cambridge introductiontoinformationretrieval2008 readingversion
» Features a crawler Features a crawler
» Distributing indexes cambridge introductiontoinformationretrieval2008 readingversion
» Connectivity servers cambridge introductiontoinformationretrieval2008 readingversion
» Anchor text and the web graph
» The PageRank computation PageRank
Show more