Crawl Scoper

The Crawl Scope defines which URLs the Spyder should process. The main use cases are:

  • only spider content from the Seed Hosts
  • do not spider images, css, videos

and there are probably a lot of other reasons to have at least one scoper configured; otherwise you might end up downloading the whole internet.

Each scoper should iterate over curi.optional_vars[CURI_EXTRACTED_URLS] and decide for every URL whether it should be downloaded or not.

The RegexScoper maintains a list of regular expressions that define the crawl scope. Two classes of expressions exist: positive and negative. The scoper's initial decision is to not download a URL. If a regex from the positive list matches, and no regex from the negative list matches, the URL is marked for downloading. In any other case, the URL is abandoned.
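The decision rule above can be sketched as follows. This is an illustrative standalone function, not the actual RegexScoper implementation; the pattern lists are made-up examples:

```python
import re

# Illustrative scope lists; real patterns come from the Spyder settings.
POSITIVE = [re.compile(r"^http://example\.com/")]
NEGATIVE = [re.compile(r"\.(jpe?g|png|gif|css|js|mp4)$")]

def in_scope(url):
    """Return True only if a positive regex matches and no negative one does."""
    if not any(p.search(url) for p in POSITIVE):
        return False  # initial decision: do not download
    if any(n.search(url) for n in NEGATIVE):
        return False  # a negative match abandons the URL
    return True       # positive match, no negative match: download
```

With these example lists, `in_scope("http://example.com/page.html")` is true, while an image on the same host or any page on another host is rejected.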

Note

We should really split up the regex scoper and allow the user to configure more than just one scoper.

class spyder.processor.scoper.RegexScoper(settings)[source]

The scoper based on regular expressions.

There are two settings that influence this scoper:

  1. settings.REGEX_SCOPE_POSITIVE
  2. settings.REGEX_SCOPE_NEGATIVE

Both have to be a list. The scoper is executed in the __call__() method.
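A minimal sketch of what these two settings might look like. The pattern values are invented for illustration; only the setting names come from the documentation above. Compiling the lists up front catches malformed patterns at startup rather than mid-crawl:

```python
import re

# Hypothetical values for the two settings; both must be lists of
# regex strings, as required by the RegexScoper.
REGEX_SCOPE_POSITIVE = [
    r"^http://www\.example\.com/",
]
REGEX_SCOPE_NEGATIVE = [
    r"\.(jpe?g|png|gif|css|js|mp4)$",
]

# Compile both lists once so bad patterns fail fast.
positive = [re.compile(p) for p in REGEX_SCOPE_POSITIVE]
negative = [re.compile(p) for p in REGEX_SCOPE_NEGATIVE]
```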
