Libraries used in Spyder

ZeroMQ

With the emergence of multicore systems, Python's Global Interpreter Lock has become a major obstacle to scaling across cores. Libraries like multiprocessing try to circumvent the GIL by forking child processes and establishing a messaging layer between them. This enables Python programmers to scale with the number of available cores, but scaling across node boundaries is not possible with plain multiprocessing.
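A minimal sketch of that approach using the standard library's multiprocessing module (the doubling worker and the queue setup are purely illustrative):

from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # runs in a forked child process with its own interpreter and GIL
    for item in iter(inbox.get, None):
        outbox.put(item * 2)

if __name__ == '__main__':
    inbox, outbox = Queue(), Queue()
    child = Process(target=worker, args=(inbox, outbox))
    child.start()
    inbox.put(21)
    print(outbox.get())   # -> 42, computed in the child process
    inbox.put(None)       # sentinel: tell the worker to exit
    child.join()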

At this point ZeroMQ comes to the rescue. As the name suggests, ZeroMQ is a message queue. But unlike brokered messaging systems built on protocols such as AMQP, or more lightweight ones like STOMP or XMPP, ZeroMQ does not need a central broker (which might act as a single point of failure). Instead, it is a thin layer of code around the plain socket interface that adds simple messaging patterns on top (it's like sockets on steroids).

The beauty of ZeroMQ lies in its simplicity. The programmer basically defines a socket to which one side binds and the other connects, and a messaging pattern with which both sides communicate. Once this is established, scaling across cores/nodes/data centers is easy as pie. Four transport types are supported by ZeroMQ:

  1. inproc sockets can be used for intra-process communication (e.g.
     between threads).

  2. ipc sockets can be used for inter-process communication between
     different processes on the same node.

  3. tcp sockets can be used for inter-process communication between
     different processes on different nodes.

  4. pgm sockets can be used for inter-process communication between one
     process and many other processes on many other nodes.

So by simply changing the transport from ipc to tcp, the application can scale across node boundaries transparently for the programmer, i.e. without changing a single line of code. Awesome!
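To make this concrete, here is what the swap looks like with pyzmq; the endpoints are made-up examples, and the only difference between the two deployments is the endpoint string:

import zmq

context = zmq.Context()
socket = context.socket(zmq.PUSH)

# single node: inter-process communication over a local pipe
socket.bind("ipc:///tmp/spyder-urls")

# many nodes: same code, same pattern, only the endpoint changes
# socket.bind("tcp://*:5556")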

This leaves us with the different messaging patterns. ZeroMQ supports all well-known (at least to me) messaging patterns. The first one that comes to mind is of course the PUB/SUB pattern, which allows one publisher to send messages to many subscribers. The PUSH/PULL pattern allows one master to send each message to exactly one of the available clients (the common producer/consumer pattern). With REQ/REP a simple request and response pattern is possible. Most of the patterns have a non-blocking equivalent.
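For instance, a minimal REQ/REP round trip with pyzmq might look like this (the endpoint and payload are made up, and both sockets live in one process only for the sake of the example):

import zmq

context = zmq.Context()

server = context.socket(zmq.REP)
server.bind("tcp://127.0.0.1:5560")

client = context.socket(zmq.REQ)
client.connect("tcp://127.0.0.1:5560")

client.send(b"ping")   # REQ sends a request ...
print(server.recv())   # ... REP receives it,
server.send(b"pong")   # answers ...
print(client.recv())   # ... and REQ gets the reply.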

Messaging Patterns used in Spyder

ZeroMQ is used as the messaging layer to distribute the workload to an arbitrary number of worker processes, which in turn send the results back to the master. In the context of Spyder, the master process manages the URLs that should be crawled and sends them to the worker processes when they are due. One of the worker processes then downloads the content and possibly extracts new links from it. When finished, it sends the result back to the master.

We do not use the REQ/REP pattern, as it does not scale as easily as we need: we would have to keep track of which worker we sent each URL to, and we would have to do the load balancing ourselves.

With the PUSH/PULL pattern, in contrast, we get the load balancing as a nice little gift: it comes with a fair distribution policy that simply distributes the messages to all connected workers in a round-robin fashion. To send the results back to the master we use the PUB/SUB pattern, where the publisher is the worker process and the subscriber is the master process.


Users familiar with ZeroMQ might already have noted that this messaging setup is shamelessly adapted from Mongrel2. For a web server as well as for a crawler this is a perfect fit, as it helps you scale very easily.

Note

There is another way to implement this type of messaging pattern using XREQ/XREP. A transition to this pattern is planned for the near future.

For a crawler there are two parts that we may want to scale: the workers and the master. While scaling the workers across several processes is fairly obvious, scaling the master at first seems irrelevant. But if you want to crawl large portions of the web (e.g. all German Internet pages), you might run into difficulties, as these are not only many URLs but also many hosts you potentially want to connect to. While the number of URLs might not be the limiting factor, the number of hosts can be, as they require a lot of queue switching.

Note

For more info on this, see the Crawler Design document.

What does all that mean in practice?

The master process binds one socket of type PUSH and another socket of type SUB. On the SUB socket the master registers a ZeroMQ subscription filter so that it only receives messages with a certain topic: its identity.
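A condensed sketch of the master side; the endpoints, the identity string and the message layout are illustrative, not Spyder's actual wire format:

import zmq

IDENTITY = b"master-1"   # hypothetical identity, used as the SUB topic

context = zmq.Context()

push = context.socket(zmq.PUSH)   # outgoing URLs, load balanced by ZeroMQ
push.bind("tcp://*:5556")

sub = context.socket(zmq.SUB)     # incoming results from the workers
sub.bind("tcp://*:5557")
sub.setsockopt(zmq.SUBSCRIBE, IDENTITY)   # only receive our own topic

push.send_multipart([IDENTITY, b"http://example.com/"])
topic, result = sub.recv_multipart()   # blocks until a worker answers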

The worker connects to the PUSH socket using a PULL socket and receives from the master URLs that carry the master's identity. When a URL has been processed, the worker sends the result back using a PUB socket that is connected to the master's SUB socket. By setting the message's topic to the identity of the sending master, it is ensured that only the master process that sent the URL receives the answer.
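The matching worker side, under the same illustrative assumptions (crawl() is a hypothetical placeholder for the download-and-extract step, not an actual Spyder function):

import zmq

context = zmq.Context()

pull = context.socket(zmq.PULL)   # receive URLs from the master
pull.connect("tcp://localhost:5556")

pub = context.socket(zmq.PUB)     # publish results back to the master
pub.connect("tcp://localhost:5557")

while True:
    identity, url = pull.recv_multipart()
    result = crawl(url)   # hypothetical: download content, extract links
    # the topic equals the master's identity, so only that master receives it
    pub.send_multipart([identity, result])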

Future versions of Spyder will thus be able to work with n master and m worker processes.

Tornado

Tornado is a non-blocking, evented IO library developed at FriendFeed (now Facebook) to run their Python front-end servers. Basically, it is a

while True:
    event = wait_for_event()       # block until a registered socket/file is ready
    callback_for_event(event)      # dispatch the event to its callback

loop. The events are read or write events on any of the sockets or files registered with the loop. So instead of starting one thread per socket connection, everything runs in a single thread, or even a single process. Although this might feel strange, it has been shown to be a lot faster for network-intensive applications that potentially serve a large number of clients.
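In Tornado terms such a loop looks roughly like this; the listening socket and the handler are placeholders, not Spyder code:

import socket
from tornado import ioloop

sock = socket.socket()
sock.bind(("", 8888))
sock.listen(128)
sock.setblocking(0)

def on_readable(fd, events):
    # called by the loop whenever the listening socket is readable
    connection, address = sock.accept()
    connection.close()   # placeholder: a real server would handle the request

loop = ioloop.IOLoop.instance()
loop.add_handler(sock.fileno(), on_readable, loop.READ)
loop.start()   # the while True loop from above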

Note

For more info, see the C10k Problem.

An additional reason for choosing Tornado was its nice integration with ZeroMQ. This not only makes programming with ZeroMQ easier but also makes it possible to write non-blocking, evented IO programs with Python and ZeroMQ.
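A minimal sketch of that integration, using the ioloop and ZMQStream helpers shipped with pyzmq (the endpoint is illustrative):

import zmq
from zmq.eventloop import ioloop, zmqstream

context = zmq.Context()
pull = context.socket(zmq.PULL)
pull.connect("tcp://localhost:5556")

def on_message(frames):
    # frames is the list of message parts of one ZeroMQ message
    print(frames)

stream = zmqstream.ZMQStream(pull)   # wraps the socket for the event loop
stream.on_recv(on_message)           # register the non-blocking callback

ioloop.IOLoop.instance().start()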
