Workerprocess

This module contains the default architecture for worker processes. To start a new worker process, call this module's main method.

Communication between the master and the worker, and inside the worker, works as follows:

Master -> PUSH -> Worker Fetcher

Worker Fetcher -> PUSH -> Worker Extractor

Worker Extractor -> PUB -> Master

Each Worker is a ZmqWorker (or AsyncZmqWorker). The Master pushes new CrawlUris to the Worker Fetcher, which downloads the content from the web and pushes the resulting CrawlUri to the Worker Extractor. At this stage, several modules for extracting new URLs are running. The Worker Scoper then decides whether the newly extracted URLs are within the scope of the crawl.
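The message flow described above can be modelled without ZeroMQ. The following is a minimal sketch and not spyder code: in-process queue.Queue objects stand in for the PUSH and PUB sockets, and the functions fetch, extract, and run_stage as well as the dictionary-based message are hypothetical placeholders for the real CrawlUri handling.

```python
import queue

# Stand-ins for the three ZeroMQ links (assumption: real code uses zmq sockets).
master_to_fetcher = queue.Queue()     # Master -> PUSH -> Worker Fetcher
fetcher_to_extractor = queue.Queue()  # Worker Fetcher -> PUSH -> Worker Extractor
extractor_to_master = queue.Queue()   # Worker Extractor -> PUB -> Master


def fetch(curi):
    """Hypothetical fetcher: pretend to download the content for a URL."""
    result = dict(curi)
    result["content"] = '<a href="http://example.com/next">next</a>'
    return result


def extract(curi):
    """Hypothetical extractor: collect the href values found in the content."""
    result = dict(curi)
    parts = curi["content"].split('href="')[1:]
    result["links"] = [part.split('"', 1)[0] for part in parts]
    return result


def run_stage(inbox, outbox, processing):
    """Drain one inbox, apply the stage, and forward each result."""
    while not inbox.empty():
        outbox.put(processing(inbox.get()))


# The master hands one message to the fetcher; it then travels through
# both worker stages and ends up on the link back to the master.
master_to_fetcher.put({"url": "http://example.com/"})
run_stage(master_to_fetcher, fetcher_to_extractor, fetch)
run_stage(fetcher_to_extractor, extractor_to_master, extract)
published = extractor_to_master.get()
```

In this model each stage only knows its inbox and its outbox, which mirrors how the fetcher and extractor are decoupled by the sockets between them.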

spyder.workerprocess.create_processing_function(settings, pipeline)

Create a processing function that runs every processor in the pipeline over the incoming message.
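As an illustration of the idea, a processing function of this kind might be built as a closure over a list of processors. This is a sketch under assumptions, not the library's implementation: make_processing is a hypothetical name, it ignores settings, and it assumes each processor is a plain callable that takes a message and returns the (possibly modified) message.

```python
def make_processing(processors):
    """Return a function that runs each processor over a message in order."""
    def processing(message):
        for processor in processors:
            # Each processor receives the output of the previous one.
            message = processor(message)
        return message
    return processing


def lowercase_url(message):
    """Hypothetical processor: normalise the URL to lower case."""
    return {**message, "url": message["url"].lower()}


def count_links(message):
    """Hypothetical processor: record how many links were extracted."""
    return {**message, "link_count": len(message.get("links", []))}


processing = make_processing([lowercase_url, count_links])
result = processing({"url": "HTTP://Example.com/", "links": ["a", "b"]})
```

Because the processors are applied in list order, the order of the configured pipeline determines what each processor gets to see.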

spyder.workerprocess.create_worker_extractor(settings, mgmt, zmq_context, log_handler, io_loop)

Create and return a new Worker Extractor that combines all configured extractors into a single ZmqWorker.

spyder.workerprocess.create_worker_fetcher(settings, mgmt, zmq_context, log_handler, io_loop)

Create and return a new Worker Fetcher.

spyder.workerprocess.create_worker_management(settings, zmq_context, io_loop)

Create and return a new ZmqMgmt instance.

spyder.workerprocess.main(settings)

The main() method for worker processes.

Here we will:

  • create a ZmqMgmt instance
  • create a Fetcher instance
  • initialize and instantiate the extractor chain

The settings must already be loaded before main() is called.

ZeroMQ Worker

This module contains a ZeroMQ-based Worker abstraction.

The ZmqWorker class expects one incoming and one outgoing zmq.Socket as well as an instance of the spyder.core.mgmt.ZmqMgmt class.

class spyder.core.worker.AsyncZmqWorker(insocket, outsocket, mgmt, processing, log_handler, log_level, io_loop=None)

Asynchronous version of the ZmqWorker.

This worker differs in that the self._processing method takes two arguments: the message, and the socket to which the result should be sent.
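The difference between the two callback shapes can be sketched as follows. This is an illustration under assumptions rather than the actual worker code: RecordingSocket is a hypothetical stand-in for the outgoing socket, and handle_sync and handle_async are hypothetical names for what the two worker types might do when a message arrives.

```python
class RecordingSocket:
    """Hypothetical stand-in for an outgoing socket; records what is sent."""

    def __init__(self):
        self.sent = []

    def send_multipart(self, frames):
        self.sent.append(list(frames))


def sync_processing(message):
    """Synchronous style: return the result and let the worker send it."""
    return message.upper()


def async_processing(message, out_socket):
    """Asynchronous style: the callback sends the result itself, possibly later."""
    out_socket.send_multipart([message.upper()])


def handle_sync(message, processing, out_socket):
    # The worker owns the send step.
    out_socket.send_multipart([processing(message)])


def handle_async(message, processing, out_socket):
    # The worker only hands over the socket; processing decides when to send.
    processing(message, out_socket)


sync_out, async_out = RecordingSocket(), RecordingSocket()
handle_sync(b"hello", sync_processing, sync_out)
handle_async(b"hello", async_processing, async_out)
```

Passing the socket into the callback lets an asynchronous operation, such as a non-blocking download, deliver its result once it completes instead of blocking the worker until a return value is available.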

class spyder.core.worker.ZmqWorker(insocket, outsocket, mgmt, processing, log_handler, log_level, io_loop=None)

This is the ZMQ worker implementation.

The worker will register a ZMQStream with the configured zmq.Socket and zmq.eventloop.ioloop.IOLoop instance.

When the ZMQStream.on_recv callback fires, the configured processors are executed with the deserialized context, and the result is published through the configured zmq.Socket.
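The receive, process, publish cycle described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: FakeStream is a hypothetical stand-in that mimics only the on_recv and stop_on_recv methods of a ZMQStream, SketchWorker is not the real ZmqWorker, and JSON is used as the wire format only because the actual serialization is not specified here.

```python
import json


class FakeStream:
    """Hypothetical stand-in for a ZMQStream; holds one receive callback."""

    def __init__(self):
        self._callback = None

    def on_recv(self, callback):
        self._callback = callback

    def stop_on_recv(self):
        self._callback = None

    def deliver(self, frames):
        """Helper for this sketch: simulate frames arriving on the socket."""
        if self._callback is not None:
            self._callback(frames)


class SketchWorker:
    """Registers a callback that deserializes, processes, and publishes."""

    def __init__(self, in_stream, publish, processing):
        self._in_stream = in_stream
        self._publish = publish
        self._processing = processing

    def _receive(self, frames):
        message = json.loads(frames[0].decode("utf-8"))  # assumed wire format
        result = self._processing(message)
        self._publish([json.dumps(result).encode("utf-8")])

    def start(self):
        self._in_stream.on_recv(self._receive)

    def stop(self):
        self._in_stream.stop_on_recv()


sent_frames = []
stream = FakeStream()
worker = SketchWorker(stream, sent_frames.append, lambda m: {**m, "seen": True})

stream.deliver([b'{"url": "a"}'])  # before start(): no callback, ignored
worker.start()
stream.deliver([b'{"url": "http://example.com/"}'])
worker.stop()
stream.deliver([b'{"url": "b"}'])  # after stop(): ignored again
```

Only the message delivered between start() and stop() is processed, which is the behaviour one would expect from a worker that listens solely through a registered stream callback.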

close()

Close all open sockets.

start()

Start the worker.

stop()

Stop the worker.
