.. vim: set fileencoding=UTF-8 :
.. vim: set tw=80 :
.. include:: globals.rst

.. _secgettingstarted:

Getting Started
===============

*Spyder* is just a library for creating web crawlers. In order to really crawl
content, you first have to create a *Spyder* skeleton:

.. code-block:: bash

   $ mkdir my-crawler && cd my-crawler
   $ spyder start
   $ ls
   log logging.conf master.py settings.py sink.py spyder-ctrl.py

This will copy the skeleton into `my-crawler`. The main file is `settings.py`.
In it, you can configure the logging level for **Masters** and **Workers** and
define the **crawl scope**. In `master.py` you should manipulate the starting
URLs and add your specific `sink.py` into the **Frontier**. `spyder-ctrl.py` is
just a small control script that helps you start the **Log Sink**, **Master** and
**Worker**.

In the skeleton everything is setup as if you would want to crawl Sailing
related pages from **DMOZ**. That should give you a starting point for your own
crawler.

So, when you wrote your sink and have everything configured right, it's time to
start crawling. First, on one of your nodes you start the logsink:

.. code-block:: bash

   $ spyder-ctrl.py logsink &

Again on one node (the same as the logsink, e.g.) you start the **Master**:

.. code-block:: bash

   $ spyder-ctrl.py master &

Finally you can start as many **Workers** as you want:

.. code-block:: bash

   $ spyder-ctrl.py worker &
   $ spyder-ctrl.py worker &
   $ spyder-ctrl.py worker &

Here we started 3 workers since it is a powerful node having a quad core CPU.


Scaling the Crawl
-----------------

With the default settings it is not possible to start workers on different
nodes. Most of the time one node is powerful enough to crawl quite an amount of
data. But there are times when you simply want to crawl using *many* nodes. This
can be done by configuring the **ZeroMQ** transports to something like

   
.. code-block:: python

    ZEROMQ_MASTER_PUSH = "tcp://NodeA:5005"
    ZEROMQ_MASTER_SUB = "tcp://NodeA:5007"

    ZEROMQ_MGMT_MASTER = "tcp://NodeA:5008"
    ZEROMQ_MGMT_WORKER = "tcp://NodeA:5009"

    ZEROMQ_LOGGING = "tcp://NodeA:5010"

Basically we have setup a 2 node crawl cluster. **NodeA** acts as logging sink
and controls the crawl via the **Master**. **NodeB** Is a pure **Worker** node.
Only the **Master** actually *binds* **ZeroMQ** sockets, the **Worker** always
*connect* to them so the **Master** does not have to know where the
**Workers** are really running.


From here
---------

There is plenty of room for improvement and development ahead. Everything will
be handled by Github tickets from now on and, if there is interest, we may setup
a Google Group.