Generic Frontier implementation.
The SingleHostFrontier only selects URIs from the queues: it iterates over all available queues and adds their URIs to a priority queue.
The priority of a URI is calculated from the timestamp at which it should be crawled next.
In contrast to the spyder.core.sqlitequeues module, URIs in this module are represented as spyder.thrift.gen.ttypes.CrawlUri.
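The selection step described above can be illustrated with a minimal sketch. It is not the SingleHostFrontier code itself: the names `fill_heap` and `pop_due`, the dict-of-lists layout of the queues, and the use of plain `(url, timestamp)` tuples instead of CrawlUri objects are all assumptions made for illustration.

```python
import heapq


def fill_heap(queues, heap):
    """Iterate over all queues and push every URI onto the heap.

    `queues` is a hypothetical mapping of queue id -> list of
    (url, next_crawl_timestamp) pairs. The timestamp is the priority,
    so the URI that is due earliest ends up at the top of the heap.
    """
    for queue_id in sorted(queues):
        for url, next_crawl_ts in queues[queue_id]:
            heapq.heappush(heap, (next_crawl_ts, url))


def pop_due(heap, now):
    """Pop and return all URLs whose crawl time is at or before `now`."""
    due = []
    while heap and heap[0][0] <= now:
        _, url = heapq.heappop(heap)
        due.append(url)
    return due
```

Using the timestamp as the heap key means that no separate priority value has to be stored; URIs that are not yet due simply stay in the heap until a later call.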
A base class for implementing frontiers.
This class provides the general methods and configuration parameters shared by frontiers.
Add a sink to the frontier. A sink is responsible for the long-term storage of the crawled contents.
Add the specified CrawlUri to the frontier.
next_date is a datetime object giving the next time the URI should be crawled.
Note: time-based crawling is never strict; it generally serves as a form of prioritization.
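One way to turn next_date into a priority is to convert it to epoch seconds, as in the following sketch. The helper names `priority_from_next_date` and `is_due` are hypothetical, and treating naive datetimes as UTC is an assumption, not something the frontier is known to do.

```python
from datetime import datetime, timezone


def priority_from_next_date(next_date):
    """Convert a datetime into epoch seconds usable as a priority.

    Assumption: naive datetimes are interpreted as UTC.
    """
    if next_date.tzinfo is None:
        next_date = next_date.replace(tzinfo=timezone.utc)
    return next_date.timestamp()


def is_due(next_date, now):
    """A URI is eligible at any time at or after next_date, not only exactly then."""
    return priority_from_next_date(now) >= priority_from_next_date(next_date)
```

Because eligibility is a "not before" check, a URI may be crawled later than next_date when the crawler is busy, which matches the non-strict behaviour noted above.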
Called when a URL was not found.
This could mean that the URL has been removed from the server. If so, do something about it!
Override this method in the actual frontier implementation.
Called when there were too many redirects for a URL, or the site has not been updated since the last visit.
In the latter case, update the internal URI and increase the priority level.
Called when there was some kind of server error.
Override this method in the actual frontier implementation.
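The override pattern described above might look like the following sketch. The class and method names (`BaseFrontierSketch`, `handle_not_found`, `handle_redirect`, `handle_server_error`, `RecordingFrontier`) are hypothetical and chosen only for illustration; they are not taken from the spyder API, and plain strings stand in for CrawlUri objects.

```python
class BaseFrontierSketch(object):
    """Hypothetical base class: stores sinks and offers no-op error hooks."""

    def __init__(self):
        self._sinks = []

    def add_sink(self, sink):
        # A sink takes care of long-term storage of crawled contents.
        self._sinks.append(sink)

    def handle_not_found(self, curi):
        """Hook for URLs that were not found; does nothing by default."""

    def handle_redirect(self, curi):
        """Hook for too many redirects or unchanged content; no-op by default."""

    def handle_server_error(self, curi):
        """Hook for server errors; does nothing by default."""


class RecordingFrontier(BaseFrontierSketch):
    """Example subclass that remembers URLs which were not found."""

    def __init__(self):
        super().__init__()
        self.not_found = []

    def handle_not_found(self, curi):
        # Assumed reaction: remember the URL so it can be dropped later.
        self.not_found.append(curi)
```

Keeping the hooks as no-ops in the base class lets a concrete frontier override only the cases it cares about.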