The Crawl Scope defines which URLs the Spyder should process. There are many reasons to have at least one scoper configured; without one you might end up downloading the whole internet.
Each scoper should iterate over the URLs in curi.optional_vars[CURI_EXTRACTED_URLS] and decide for each one whether it should be downloaded.
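A minimal sketch of that contract, assuming the extracted URLs are stored as a list. Only the CURI_EXTRACTED_URLS key comes from the text; the FakeCuri stand-in, the helper name, and the storage format are illustrative assumptions, not Spyder's actual API:

```python
# Key under which extracted URLs are stored (from the text above);
# everything else here is an illustrative assumption.
CURI_EXTRACTED_URLS = "extracted_urls"


class FakeCuri:
    """Stand-in for a crawl URI object carrying optional_vars."""

    def __init__(self, extracted_urls):
        self.optional_vars = {CURI_EXTRACTED_URLS: extracted_urls}


def apply_scoper(curi, in_scope):
    """Drop every extracted URL the scope predicate rejects."""
    urls = curi.optional_vars[CURI_EXTRACTED_URLS]
    curi.optional_vars[CURI_EXTRACTED_URLS] = [u for u in urls if in_scope(u)]
    return curi


curi = FakeCuri(["http://example.com/a", "http://spam.example/b"])
apply_scoper(curi, lambda url: "spam" not in url)
print(curi.optional_vars[CURI_EXTRACTED_URLS])  # only the non-spam URL survives
```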
The RegexScoper maintains two lists of regular expressions that define the crawl scope: a positive list and a negative list. The scoper's initial decision is not to download a URL. Only if at least one regex from the positive list matches, and no regex from the negative list matches, is the URL marked for download; in any other case the URL is abandoned.
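The positive/negative matching described above can be sketched as follows; the class and method names are illustrative and do not reproduce Spyder's actual implementation:

```python
import re


class RegexScoper:
    """Decide crawl-scope membership using two lists of regexes.

    Illustrative sketch: a URL is in scope only if some positive
    pattern matches and no negative pattern matches.
    """

    def __init__(self, positive_patterns, negative_patterns):
        self._positive = [re.compile(p) for p in positive_patterns]
        self._negative = [re.compile(p) for p in negative_patterns]

    def in_scope(self, url):
        # Initial decision: do not download. A positive match is required.
        if not any(r.search(url) for r in self._positive):
            return False
        # A single negative match vetoes the URL.
        if any(r.search(url) for r in self._negative):
            return False
        return True


scoper = RegexScoper(
    positive_patterns=[r"^https?://example\.com/"],
    negative_patterns=[r"\.(?:jpg|png|css|js)$"],
)
print(scoper.in_scope("http://example.com/page.html"))  # positive match, no veto
print(scoper.in_scope("http://example.com/logo.png"))   # vetoed by negative list
print(scoper.in_scope("http://other.org/"))             # no positive match
```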
Note
We should really split up the regex scoper and allow the user to configure more than just one scoper.