Crawler Design

The basic crawler design is straightforward: a Master collects the URLs that should be crawled, and a number of Worker threads (or processes) download the content and extract new links from it. In practice, though, there are a number of pitfalls you have to keep an eye on. Just to give one example: you really don’t want to crawl a single host too aggressively, because with enough workers you might effectively mount a denial-of-service attack. And even if the host survives, the site owner might not like you from then on.
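To make the politeness point concrete, here is a minimal Python sketch of a master/worker setup with a per-host delay. It is only an illustration under simple assumptions (an in-memory queue, one fixed delay for every host); none of the names here are part of the Spyder API.

    import queue
    import threading
    import time
    import urllib.parse
    import urllib.request

    POLITENESS_DELAY = 1.0      # assumed fixed delay between requests to one host
    url_queue = queue.Queue()   # URLs waiting to be crawled (the Master's side)
    last_access = {}            # host -> earliest time the next request may start
    lock = threading.Lock()

    def polite_fetch(url):
        """Download a URL, waiting if its host was contacted too recently."""
        host = urllib.parse.urlparse(url).netloc
        with lock:
            ready_at = max(time.time(), last_access.get(host, 0.0))
            last_access[host] = ready_at + POLITENESS_DELAY
        time.sleep(max(0.0, ready_at - time.time()))
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()

    def worker():
        """Worker loop: fetch a URL and feed extracted links back to the Master."""
        while True:
            url = url_queue.get()
            try:
                content = polite_fetch(url)
                # ... extract links from `content` and url_queue.put() the new ones ...
            except Exception as exc:
                print("failed to fetch %s: %s" % (url, exc))
            finally:
                url_queue.task_done()

    for _ in range(4):                       # a handful of worker threads
        threading.Thread(target=worker, daemon=True).start()
    url_queue.put("http://example.com/")     # seed URL
    url_queue.join()                         # wait until all queued URLs are done

In a real crawler the delay would typically be configurable per host and combined with robots.txt handling, but the sketch shows why the per-host bookkeeping has to live next to the Master rather than in each Worker.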

Some Science

Ok, really only a little bit. Basically, there are two papers describing effective crawler designs. The Mercator paper (Mercator: A Scalable, Extensible Web Crawler, 1999) describes the architecture of the Mercator crawler. The crawler is split into several parts (a rough sketch of how they fit together follows the list):

  • Frontier for keeping track of the URLs to be crawled
  • Scheduler for deciding when each URL is crawled
  • Downloader for actually downloading the content
  • Link Extractors for extracting new links from different kinds of content
  • Unique Filter for filtering already-known URLs from the extracted ones
  • Host Splitter for distributing hosts across multiple Frontiers
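The following rough Python sketch shows how these parts could fit together in one scheduler iteration. The class and function names are made up for illustration and do not correspond to Spyder’s actual code.

    from collections import deque

    class Frontier:
        """Keeps track of the URLs that still have to be crawled."""
        def __init__(self):
            self._urls = deque()
        def put(self, url):
            self._urls.append(url)
        def get(self):
            return self._urls.popleft() if self._urls else None

    class UniqueFilter:
        """Drops URLs that have already been seen."""
        def __init__(self):
            self._seen = set()
        def unseen(self, urls):
            fresh = [u for u in urls if u not in self._seen]
            self._seen.update(fresh)
            return fresh

    def crawl_step(frontier, unique_filter, download, extract_links):
        """One scheduler iteration: take a URL, download it, feed new links back."""
        url = frontier.get()
        if url is None:
            return False
        content = download(url)                   # Downloader
        links = extract_links(url, content)       # Link Extractor
        for link in unique_filter.unseen(links):  # Unique Filter
            frontier.put(link)                    # back into the Frontier
        return True

    # Tiny usage example with stand-in download/extract functions:
    frontier, unique = Frontier(), UniqueFilter()
    frontier.put("http://example.com/")
    crawl_step(frontier, unique,
               download=lambda url: "",
               extract_links=lambda url, content: [])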

The second important paper on crawler design is the UbiCrawler paper (UbiCrawler: a scalable fully distributed Web crawler, 2003). In it the authors use a consistent hashing algorithm to split the hosts among several Frontiers.
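As an illustration, here is a generic consistent hashing sketch for mapping hosts to Frontiers. It is not the exact scheme from the UbiCrawler paper or Spyder’s implementation; the frontier names and the replica count are arbitrary.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Maps host names to frontiers via a hash ring with virtual nodes."""

        def __init__(self, frontiers, replicas=64):
            self._ring = []  # sorted list of (hash value, frontier id)
            for frontier_id in frontiers:
                for i in range(replicas):
                    key = self._hash("%s:%d" % (frontier_id, i))
                    self._ring.append((key, frontier_id))
            self._ring.sort()
            self._keys = [key for key, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

        def frontier_for(self, host):
            """Return the frontier responsible for the given host."""
            index = bisect.bisect(self._keys, self._hash(host)) % len(self._keys)
            return self._ring[index][1]

    ring = ConsistentHashRing(["frontier-1", "frontier-2", "frontier-3"])
    print(ring.frontier_for("example.com"))

The virtual replicas smooth out the distribution of hosts, and when a Frontier joins or leaves only the hosts falling on its segments of the ring have to move, which is the property that makes consistent hashing attractive for a distributed crawler.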

The Spyder is designed on the basis of these two papers.

References

The Spyder is not only inspired by these two papers but also by Heritrix, the Internet Archive’s open source crawler. Heritrix is designed much like Mercator, except that it lacks something like a Host Splitter that would allow one to crawl using more than one Frontier. Additionally, Heritrix does not provide any kind of monitoring or revisiting strategy, although this might be possible in version H3.
