Frontier
Generic Frontier implementation.
The SingleHostFrontier selects URIs only by iterating over all available
queues and adding their URIs to a priority queue.
The priority is derived from the timestamp at which a URI should next be crawled.
In contrast to the spyder.core.sqlitequeues module, URIs in this module
are represented as spyder.thrift.gen.ttypes.CrawlUri.
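The timestamp-based prioritization described above can be illustrated with a minimal sketch. It is not the spyder implementation: the `queues` dictionary, its contents, and the `build_heap` helper are hypothetical stand-ins, and URIs are plain strings here rather than CrawlUri objects.

```python
import heapq
from datetime import datetime

# Hypothetical stand-in for the underlying queues: each queue holds
# (uri, next_crawl_date) pairs. The real queue storage is not shown here.
queues = {
    "queue-a": [("http://example.com/a", datetime(2024, 1, 1, 12, 0))],
    "queue-b": [
        ("http://example.com/b", datetime(2024, 1, 1, 11, 0)),
        ("http://example.com/c", datetime(2024, 1, 1, 13, 0)),
    ],
}


def build_heap(queues):
    """Iterate over all queues and push every URI into one priority queue."""
    heap = []
    for entries in queues.values():
        for uri, next_date in entries:
            # The next crawl date is the priority: earlier dates pop first.
            heapq.heappush(heap, (next_date, uri))
    return heap


heap = build_heap(queues)
order = []
while heap:
    next_date, uri = heapq.heappop(heap)
    order.append(uri)
```

After the loop, `order` lists the URIs from the earliest to the latest scheduled crawl date, regardless of which queue they came from.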
-
class spyder.core.frontier.AbstractBaseFrontier(settings, log_handler, front_end_queues, prioritizer, unique_hash='sha1')
A base class for implementing frontiers.
This class provides the general methods and configuration parameters
shared by frontiers.
-
add_sink(sink)
Add a sink to the frontier. A sink is responsible for the long-term
storage of the crawled contents.
-
add_uri(curi)
Add the specified CrawlUri to the frontier.
next_date is a datetime object for the next time the uri should be
crawled.
Note: time-based crawling is never strict; it generally serves as a
form of prioritization.
-
close()
Close the underlying frontend queues.
-
get_next()
Return the next uri scheduled for crawling.
-
process_not_found(curi)
Called when a URL was not found.
This could mean that the URL has been removed from the server. If so,
do something about it!
Override this method in the actual frontier implementation.
-
process_redirect(curi)
Called when there were too many redirects for a URL, or the site has
not been updated since the last visit.
In the latter case, update the internal uri and increase the priority
level.
-
process_server_error(curi)
Called when there was some kind of server error.
Override this method in the actual frontier implementation.
-
process_successful_crawl(curi)
Called when a URI has been crawled successfully.
curi is a CrawlUri.
-
update_uri(curi)
Update a given uri.
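The process_not_found and process_server_error hooks above are meant to be overridden in concrete frontiers. The following sketch shows one way such overrides might look. Everything in it is an assumption made for illustration: `BaseFrontierSketch` is a stand-in and not AbstractBaseFrontier, the `DroppingFrontier` subclass and its `max_errors` parameter are hypothetical, and a CrawlUri is modelled as a plain dictionary with a "url" key.

```python
class BaseFrontierSketch:
    """Hypothetical stand-in for a frontier base class (not the spyder one)."""

    def __init__(self):
        self.uris = {}  # url -> curi, a toy replacement for the real queues

    def add_uri(self, curi):
        self.uris[curi["url"]] = curi

    def process_not_found(self, curi):
        # Default hook: do nothing. Subclasses are expected to override it.
        pass

    def process_server_error(self, curi):
        # Default hook: do nothing. Subclasses are expected to override it.
        pass


class DroppingFrontier(BaseFrontierSketch):
    """Drops missing URLs at once and failing URLs after repeated errors."""

    def __init__(self, max_errors=3):
        super().__init__()
        self.max_errors = max_errors  # hypothetical tolerance per URL
        self.error_counts = {}

    def process_not_found(self, curi):
        # The page is gone, so stop scheduling it.
        self.uris.pop(curi["url"], None)

    def process_server_error(self, curi):
        url = curi["url"]
        self.error_counts[url] = self.error_counts.get(url, 0) + 1
        if self.error_counts[url] >= self.max_errors:
            self.uris.pop(url, None)


frontier = DroppingFrontier(max_errors=2)
gone = {"url": "http://example.com/gone"}
flaky = {"url": "http://example.com/flaky"}
frontier.add_uri(gone)
frontier.add_uri(flaky)

frontier.process_not_found(gone)
frontier.process_server_error(flaky)
remaining_after_one_error = sorted(frontier.uris)

frontier.process_server_error(flaky)
remaining_after_two_errors = sorted(frontier.uris)
```

Whether a real frontier should delete, postpone, or merely deprioritize such URIs is a design decision this sketch does not settle.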
-
class spyder.core.frontier.MultipleHostFrontier(settings, log_handler)
A Frontier for crawling many hosts simultaneously.
-
get_next()
Get the next URI that is ready to be crawled.
-
process_not_found(curi)
The page does not exist anymore!
-
process_redirect(curi)
There was a redirect.
-
process_server_error(curi)
Punish any server errors in the budget for this queue.
-
process_successful_crawl(curi)
Crawling was successful, now update the politeness rules.
-
class spyder.core.frontier.SingleHostFrontier(settings, log_handler)
A frontier for crawling a single host.
-
get_next()
Get the next URI.
Only return the next URI if we have waited enough.
-
process_successful_crawl(curi)
Add the time-based politeness to this frontier.
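The interplay between get_next ("only return the next URI if we have waited enough") and process_successful_crawl can be sketched as follows. This is an illustration under stated assumptions, not the SingleHostFrontier code: the `PoliteFrontierSketch` class, the fixed `delay_seconds`, and the explicit `now` argument (used instead of the system clock so the behaviour is deterministic) are all hypothetical.

```python
import heapq


class PoliteFrontierSketch:
    """Toy single-host frontier with a fixed politeness delay (assumption)."""

    def __init__(self, delay_seconds=2.0):
        self.delay_seconds = delay_seconds
        self.heap = []         # entries are (next_crawl_time, url)
        self.not_before = 0.0  # earliest time the host may be contacted again

    def add_uri(self, url, next_crawl_time):
        heapq.heappush(self.heap, (next_crawl_time, url))

    def get_next(self, now):
        """Return a URL only if the politeness delay has elapsed."""
        if not self.heap or now < self.not_before:
            return None
        next_crawl_time, url = self.heap[0]
        if next_crawl_time > now:
            return None  # the earliest URL is not due yet
        heapq.heappop(self.heap)
        return url

    def process_successful_crawl(self, url, now):
        """Record the crawl and push back the next allowed contact time."""
        self.not_before = now + self.delay_seconds


frontier = PoliteFrontierSketch(delay_seconds=2.0)
frontier.add_uri("http://example.com/1", 0.0)
frontier.add_uri("http://example.com/2", 0.0)

first = frontier.get_next(now=10.0)
frontier.process_successful_crawl(first, now=10.5)
too_early = frontier.get_next(now=11.0)  # still inside the delay window
second = frontier.get_next(now=13.0)     # delay has elapsed
```

A fixed delay keeps the sketch short; a real crawler might instead scale the delay by the observed response time of the host.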