Spyder is a scalable web spider written in Python, using the non-blocking Tornado library and ZeroMQ as the messaging layer. Messages are serialized using Thrift.
The architecture is simple: a Master process holds the crawl Frontier, which organises the URLs that need to be crawled; several Worker processes download the content and extract new URLs that should be crawled in the future. To store the content, you may attach a Sink to the Master and be informed about the interesting events for a URL.
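To make the division of labour concrete, here is a minimal single-process sketch of the Master / Frontier / Worker / Sink roles. It is not Spyder's actual API: every name in it (Frontier, LinkExtractor, worker, CollectingSink, run_master, FAKE_WEB, fake_fetch) is hypothetical, the ZeroMQ messaging and Thrift serialization are replaced by plain function calls, and the "web" is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class Frontier:
    """Hypothetical frontier: a FIFO queue of URLs plus a 'seen' set."""

    def __init__(self, seeds):
        self._queue = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url):
        # Only schedule URLs that have never been scheduled before.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_url(self):
        # Return the next URL to crawl, or None when nothing is left.
        return self._queue.popleft() if self._queue else None


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def worker(url, fetch):
    """Worker role: download one URL and extract the links it contains."""
    body = fetch(url)
    parser = LinkExtractor(url)
    parser.feed(body)
    return body, parser.links


class CollectingSink:
    """Hypothetical sink: stores the content of every crawled URL."""

    def __init__(self):
        self.stored = {}

    def on_success(self, url, body):
        self.stored[url] = body


def run_master(seeds, fetch, sink):
    """Master role: hand URLs to the worker, notify the sink, grow the frontier."""
    frontier = Frontier(seeds)
    order = []
    while True:
        url = frontier.next_url()
        if url is None:
            break
        body, links = worker(url, fetch)
        sink.on_success(url, body)
        order.append(url)
        for link in links:
            frontier.add(link)
    return order


# Assumed stand-in for the real web: three pages that link to each other.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}


def fake_fetch(url):
    return FAKE_WEB.get(url, "")


sink = CollectingSink()
crawled = run_master(["http://example.com/"], fake_fetch, sink)
print(crawled)
```

In the real system the call to worker() would presumably be a message sent to a separate Worker process, and the returned links would come back as another message; the sketch only illustrates who owns which responsibility.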