.. vim: set fileencoding=UTF-8 :
.. vim: set tw=80 :

.. include:: globals.rst

Libraries used in |spyder|
==========================

.. _seczmq:

ZeroMQ
------

With the emergence of multicore systems, Python's `Global Interpreter Lock
<http://wiki.python.org/moin/GlobalInterpreterLock>`_ becomes a major issue for
scaling across cores. Libraries like `multiprocessing
<http://docs.python.org/library/multiprocessing.html>`_ try to circumvent the
`GIL` by forking child processes and establishing a messaging layer between
them. This enables Python programmers to scale with the number of available
cores, but scaling across node boundaries is not possible using plain
`multiprocessing`.

At this point `ZeroMQ <http://www.zeromq.org>`_ comes to the rescue. As the
name suggests, |zmq| is a message queue. But, unlike heavier messaging
protocols like `AMQP` or more lightweight ones like `STOMP` or `XMPP`, |zmq|
does not need a global broker (which might act as a *single point of
failure*). It is instead a little bit of code around the plain *socket*
interface that adds simple messaging patterns to it (it's like *sockets on
steroids*).

The beauty of |zmq| lies in its simplicity. The programmer basically defines a
*socket* to which one side **binds** and the other **connects**, and a
messaging pattern with which both sides communicate with each other. Once this
is established, scaling across cores, nodes or data centers is as easy as pie.

Four types of *sockets* are supported by |zmq|:

1. `inproc` sockets can be used for **intra-process** communication (between
   threads, e.g.).

2. `ipc` sockets can be used for **inter-process** communication between
   different processes *on the same node*.

3. `tcp` sockets can be used for **inter-process** communication between
   different processes *on different nodes*.

4. `pgm` sockets can be used for **inter-process** communication between one
   and many other processes *on many other nodes*.

So by simply changing the socket type from `ipc` to `tcp` the application can
scale across node boundaries transparently for the programmer, i.e. by **not
changing a single line of code**. Awesome!

This leaves us with the different messaging patterns. |zmq| supports all well
known (at least to me) messaging patterns. The first one that comes to mind is
of course the `PUB/SUB` pattern that allows one publisher to send messages to
many subscribers. The `PUSH/PULL` pattern allows one master to send each
message to exactly one of the available clients (the common producer/consumer
pattern). With `REQ/REP` a simple request and response pattern is possible.
Most of the patterns have a `non-blocking` equivalent.

Messaging Patterns used in |spyder|
+++++++++++++++++++++++++++++++++++

|zmq| is used as the messaging layer to distribute the workload to an
arbitrary number of worker processes which in return send the results back to
the master. In the context of |spyder| the master process controls the |urls|
that should be crawled and sends them to the worker processes when they are
due. One of the worker processes then downloads the content and possibly
extracts new links from it. When finished, it sends the result back to the
master.

We do not use the `REQ/REP` pattern as it does not scale as easily as we need:
we would have to keep track of which worker we sent each |url| to and do the
load balancing ourselves. Instead, with the |pushpull| pattern we get the load
balancing as a nice little gift. It comes with a *fair distribution policy*
that simply distributes the messages to all workers in a *round-robin* way.

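
The following is a minimal sketch, not |spyder|'s actual code, of what the
|pushpull| side of this setup looks like with `pyzmq`; the endpoint, the
example |url| and running both ends in a single process are assumptions made
purely for illustration.

.. code-block:: python

   import zmq

   context = zmq.Context()

   # Master side: bind a PUSH socket and hand out work.
   master = context.socket(zmq.PUSH)
   master.bind("tcp://*:5557")             # endpoint chosen for illustration

   # Worker side (normally a separate process): connect with a PULL socket.
   worker = context.socket(zmq.PULL)
   worker.connect("tcp://localhost:5557")

   # ZeroMQ delivers each pushed message to exactly one connected PULL
   # socket, round-robin, so adding workers scales the crawl without
   # changing this code.
   master.send_string("http://example.com/robots.txt")
   print("worker got:", worker.recv_string())

Connecting additional `PULL` workers to the same endpoint is all it takes to
scale out; the round-robin distribution happens inside |zmq|.
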

In order to send the results back to the master we use the |pubsub| pattern,
where the *publisher* is the worker process and the *subscriber* is the master
process.

Users familiar with |zmq| might already have noted that this messaging setup
is shamelessly *adapted* from `Mongrel2 <http://mongrel2.org>`_. For a *Web
Server* as well as for a crawler this is a perfect fit, as it helps you to
scale **very** easily.

.. note:: There is another way to do this type of messaging pattern using
   *XREQ/XREP* sockets. A transition to this pattern is planned for the near
   future.

For a crawler there are two parts that we possibly want to scale: the worker
*and* the master. While scaling the worker across several processes is
somewhat obvious, scaling the master at first seems to be of no relevance. But
if you want to crawl large portions of the web (all German internet pages,
e.g.), you might experience difficulties as these are not only **many** |urls|
but also **many** hosts you possibly want to connect to. While the number of
|urls| might not be the limiting part, the number of hosts can be, as they
require a lot of queue switching.

.. note:: For more info on this, see the :ref:`seccrawlerdesign` document.

What does all that mean in practice
+++++++++++++++++++++++++++++++++++

The master process binds to one socket of the `PUSH` type and to another
socket of the `SUB` type. On the `SUB` socket the master registers a |zmq|
filter in order to only receive messages with a certain *topic*: its identity.
The worker then connects to the `PUSH` socket using a `PULL` type socket and
receives the |urls| from the master, each message containing the master's
identity. When the |url| has been processed, the worker sends the result back
using the `PUB` socket it has connected to the master's `SUB` socket. By
setting the message's topic to the identity of the sending master, it is
ensured that only the master process that sent this |url| receives the answer.
Future versions of |spyder| will thus be able to work with **n** master and
**m** worker processes.

.. _sectornado:

|tornado|
---------

`Tornado <http://www.tornadoweb.org>`_ is a *non-blocking* or *evented IO*
library developed at FriendFeed (now Facebook) to run their Python front-end
servers. Basically this is a

.. code-block:: python

   while True:
       callback_for_event(event)

loop. The events are any *read* or *write* events on a number of sockets or
files that are registered with the loop. So instead of starting one thread for
each socket connection, everything runs in one thread or even one process.
Although this might feel strange, it has been shown to be **a lot** faster for
network intensive applications that potentially serve a large number of
clients.

.. note:: For more info see the `C10k Problem
   <http://www.kegel.com/c10k.html>`_.

An additional reason for choosing |tornado| was its nice integration with
|zmq|. This not only makes programming with |zmq| easier but also makes it
possible to easily write *non-blocking, evented* IO programs with Python and
|zmq|.

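
As a hedged sketch of that integration (again, not code taken from |spyder|
itself), `pyzmq` ships a `ZMQStream` wrapper that registers a |zmq| socket
with the |tornado| IO loop and invokes a callback for every incoming message.
The endpoint and the callback name below are made up for illustration.

.. code-block:: python

   import zmq
   from zmq.eventloop.zmqstream import ZMQStream
   from tornado.ioloop import IOLoop

   context = zmq.Context()
   socket = context.socket(zmq.PULL)
   socket.connect("tcp://localhost:5557")  # endpoint chosen for illustration

   def handle_message(frames):
       # frames is the list of message parts that arrived on the socket
       print("received:", frames)

   # Wrapping the socket in a ZMQStream lets the Tornado IO loop watch it
   # and call handle_message whenever a message arrives, without any
   # blocking recv() calls.
   stream = ZMQStream(socket)
   stream.on_recv(handle_message)

   IOLoop.current().start()

With this, reacting to incoming |urls| or results becomes a matter of
registering callbacks, which is exactly the event loop shown above.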