The basic crawler design is simple and straightforward: a Master collects the URLs that should be crawled, and a number of Worker threads (or processes) download the content and extract new links from it. In practice, though, there are a number of pitfalls you have to keep an eye on. To give just one example: you really don’t want to crawl one host excessively, because with enough workers you are effectively mounting a denial-of-service attack. And even if the host survives, the site owner might not like you from then on.
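One way to avoid hammering a single host is to track the last fetch time per host and make workers back off until a minimum delay has passed. The sketch below illustrates the idea; the class name, the two-second default, and the method names are my own choices, not part of the Spyder.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Tracks the last fetch time per host and enforces a minimum delay,
    so that many workers cannot hammer a single site in parallel."""

    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_fetch = {}  # host -> timestamp of the last request

    def wait_time(self, url, now=None):
        """Seconds a worker should still wait before fetching this URL."""
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_fetch.get(host)
        if last is None:
            return 0.0  # host never seen before, fetch immediately
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url, now=None):
        """Remember when this host was last contacted."""
        host = urlparse(url).netloc
        self.last_fetch[host] = time.monotonic() if now is None else now
```

A worker would call `wait_time` before each request, sleep for the returned duration, and then call `record_fetch`. In a multi-threaded crawler the dictionary access would additionally need a lock.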
Ok, really only a little bit. Basically, there are two papers describing effective crawler designs. The Mercator paper (Mercator: A Scalable, Extensible Web Crawler, 1999) describes the architecture of the Mercator crawler, which is split into several components, most importantly the Frontier, the data structure that holds the URLs still to be crawled.
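At its core, a Frontier in the Mercator sense is a queue of URLs to be crawled, combined with a test that drops URLs that have already been scheduled. A minimal sketch of that idea (the class and method names are illustrative, not Mercator's actual interface, and Mercator's real frontier uses multiple queues for politeness and prioritization):

```python
from collections import deque

class Frontier:
    """Minimal sketch of a Frontier: a FIFO queue of URLs still to be
    crawled, plus a seen-set acting as a duplicate-URL eliminator."""

    def __init__(self, seeds=()):
        self.queue = deque()
        self.seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        """Enqueue a URL unless it was already scheduled before."""
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        """Hand the next URL to a worker, or None if the frontier is empty."""
        return self.queue.popleft() if self.queue else None
```

Workers take URLs via `next_url`, and every link they extract is fed back in through `add`; the seen-set keeps the crawl from revisiting the same URL endlessly.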
The second important paper on crawler design is the UbiCrawler paper (UbiCrawler: a scalable fully distributed Web crawler, 2003). In it the authors use consistent hashing to split the hosts among several Frontiers, so that each host is always handled by the same Frontier and only a small fraction of hosts move when a Frontier is added or removed.
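The host-splitting idea can be sketched with a standard consistent-hash ring: each Frontier owns several points on the ring, and a host is assigned to the Frontier owning the next point clockwise from the host's hash. This is a generic illustration of the technique, not UbiCrawler's exact implementation (the names and the replica count are assumptions).

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Each frontier gets `replicas` points on a hash ring; a host maps to
    the frontier owning the next point clockwise from the host's hash."""

    def __init__(self, frontiers, replicas=64):
        # Sorted list of (ring position, frontier name) pairs.
        self.ring = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in frontiers
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        # First 8 bytes of SHA-1 as an integer position on the ring.
        return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

    def frontier_for(self, host):
        """Return the frontier responsible for this host."""
        idx = bisect_right(self.keys, self._hash(host)) % len(self.ring)
        return self.ring[idx][1]
```

Because the mapping depends only on the hash of the host name, every node in the crawler can compute it locally without consulting a central coordinator, which is the property UbiCrawler exploits for full distribution.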
The Spyder is designed on the basis of these two papers. It is also inspired by Heritrix, the Internet Archive’s open-source crawler. Heritrix is designed much like Mercator, except that it lacks something like a Host Splitter that would allow crawling with more than one Frontier. Additionally, Heritrix does not provide any kind of monitoring or revisiting strategy, although this might be possible in version H3.