
20 Web crawling and indexes

20.1 Overview

Web crawling is the process by which we gather pages from the Web, in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible, together with the link structure that interconnects them. In Chapter 19 we studied the complexities of the Web stemming from its creation by millions of uncoordinated individuals. In this chapter we study the resulting difficulties for crawling the Web. The focus of this chapter is the component shown in Figure 19.7 as the web crawler; it is sometimes referred to as a spider.

The goal of this chapter is not to describe how to build the crawler for a full-scale commercial web search engine. We focus instead on a range of issues that are generic to crawling, from the student project scale to substantial research projects. We begin in Section 20.1.1 by listing the desiderata for web crawlers, and then discuss in Section 20.2 how each of these issues is addressed. The remainder of this chapter describes the architecture and some implementation details for a distributed web crawler that satisfies these features. Section 20.3 discusses distributing indexes across many machines for a web-scale implementation.
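To make the basic process concrete, the following minimal sketch fetches pages from a frontier of seed URLs, records each page's out-links (the link structure mentioned above), and enqueues newly discovered links. The library choices here (the requests package and a simple regular expression for hyperlinks) are illustrative assumptions, not part of the architecture developed later in this chapter.

```python
# A minimal, illustrative crawl loop: fetch pages from a frontier, record the
# link structure, and enqueue newly discovered links. Not the book's crawler
# architecture; requests and the href regex are assumptions for illustration.
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests

HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)          # URLs waiting to be fetched
    seen = set(seeds)                # avoid re-fetching the same URL
    link_graph = {}                  # page URL -> list of out-links
    while frontier and len(link_graph) < max_pages:
        url = frontier.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                 # skip unreachable pages
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue                 # only parse HTML pages for links
        out_links = [urljoin(url, href) for href in HREF_RE.findall(resp.text)]
        link_graph[url] = out_links  # keep the link structure alongside the page
        for link in out_links:
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return link_graph
```

Real crawlers replace nearly every line of this loop with more careful machinery (a proper HTML parser, URL normalization, politeness checks, and a persistent frontier), which is exactly what the rest of the chapter is about.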

20.1.1 Features a crawler must provide

We list the desiderata for web crawlers in two categories: features that web crawlers must provide, followed by features they should provide.

Robustness: The Web contains servers that create spider traps, which are generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain. Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some are the inadvertent side-effect of faulty website development.

Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them. These politeness policies must be respected.
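As a sketch of how these two requirements might surface in code, the class below gates each fetch on two simple per-host checks: a minimum delay between requests to the same host (politeness) and a cap on the total pages fetched from one host (a crude guard against spider traps). The delay value, page cap, and class name are illustrative assumptions, not prescriptions from the text; explicit politeness policies such as robots.txt are treated later in the chapter.

```python
# A sketch of two per-host guards a crawler might apply before fetching a URL.
# The numeric thresholds here are arbitrary placeholders.
import time
from collections import defaultdict
from urllib.parse import urlparse

class HostPolicy:
    def __init__(self, min_delay_seconds=2.0, max_pages_per_host=10_000):
        self.min_delay = min_delay_seconds    # politeness: wait between hits to one host
        self.max_pages = max_pages_per_host   # crude spider-trap guard: cap pages per host
        self.last_fetch = {}                  # host -> time of last fetch
        self.page_count = defaultdict(int)    # host -> pages fetched so far

    def allow(self, url):
        host = urlparse(url).netloc
        if self.page_count[host] >= self.max_pages:
            return False                      # likely a trap or an over-crawled host
        now = time.monotonic()
        if now - self.last_fetch.get(host, 0.0) < self.min_delay:
            return False                      # too soon; be polite and retry later
        self.last_fetch[host] = now
        self.page_count[host] += 1
        return True
```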

20.1.2 Features a crawler should provide

Distributed: The crawler should have the ability to execute in a distributed fashion across multiple machines.

Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines and bandwidth.

Performance and efficiency: The crawl system should make efficient use of various system resources including processor, storage and network bandwidth.

Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the crawler should be biased towards fetching “useful” pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that the search engine’s index contains a fairly current representation of each indexed web page. For such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates the rate of change of that page.

Extensible: Crawlers should be designed to be extensible in many ways – to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.
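To illustrate the distributed and scalable requirements, one common partitioning scheme hashes each URL's host so that every URL from a given host is handled by the same crawler node; this also keeps per-host politeness state local to that node. The sketch below assumes a fixed number of nodes and an MD5-based hash purely for illustration; it is not the specific assignment scheme developed later in the chapter.

```python
# A sketch of assigning URLs to crawler nodes by hashing the host.
# num_nodes and the choice of MD5 are assumptions for illustration.
import hashlib
from urllib.parse import urlparse

def assign_node(url, num_nodes):
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes

# Example: partition a few URLs across 4 hypothetical crawler nodes.
urls = ["http://example.com/a", "http://example.com/b", "http://example.org/"]
for u in urls:
    print(u, "-> node", assign_node(u, 4))
```

Note that both URLs from example.com map to the same node, which is what allows that node to enforce politeness for the host without coordinating with the others.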

20.2 Crawling