Frontier¶

Generic Frontier implementation.

The SingleHostFrontier selects URIs by iterating over all available queues and adding their URIs to a priority queue.

The priority is calculated from the timestamp at which the URI should next be crawled.
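The selection scheme described above can be sketched roughly as follows. This is a minimal illustration and not the spyder implementation: `TimestampPriorityQueue`, `fill_from_queues`, and the plain `(timestamp, uri)` tuples are hypothetical names and shapes chosen for the example.

```python
import heapq
import time


class TimestampPriorityQueue:
    """Hypothetical stand-in: orders URIs by their next-crawl timestamp."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal timestamps keep insertion order

    def put(self, next_crawl_ts, uri):
        """Add a URI; a smaller timestamp means it is served earlier."""
        heapq.heappush(self._heap, (next_crawl_ts, self._counter, uri))
        self._counter += 1

    def pop_due(self, now=None):
        """Return the earliest URI if its timestamp has passed, else None."""
        now = time.time() if now is None else now
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None


def fill_from_queues(queues, pq):
    """Iterate over all queues (assumed: dict of name -> [(timestamp, uri)])."""
    for entries in queues.values():
        for next_crawl_ts, uri in entries:
            pq.put(next_crawl_ts, uri)
```

Using the timestamp directly as the heap key means the URI that has been due the longest is always returned first, without re-sorting on every insert.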

In contrast to the spyder.core.sqlitequeues module, URIs in this module are represented as spyder.thrift.gen.ttypes.CrawlUri.

class spyder.core.frontier.AbstractBaseFrontier(settings, log_handler, front_end_queues, prioritizer, unique_hash='sha1')[source]¶

A base class for implementing frontiers.

This class provides the general methods and configuration parameters shared by frontiers.

add_sink(sink)[source]¶: Add a sink to the frontier. A sink is responsible for the long-term storage of the crawled contents.

add_uri(curi)[source]¶

Add the specified CrawlUri to the frontier.

next_date is a datetime object giving the next time the URI should be crawled.

Note: time-based crawling is never strict; it is generally used as a form of prioritization.
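As a rough sketch of how a next_date can act as a priority rather than a strict schedule, the datetime can be turned into a numeric sort key. The helper names below are hypothetical, and the example assumes naive datetime objects expressed in UTC; how spyder itself performs this conversion is not stated here.

```python
import calendar
from datetime import datetime


def next_date_to_timestamp(next_date):
    """Convert a naive UTC datetime into epoch seconds usable as a sort key."""
    return calendar.timegm(next_date.utctimetuple())


def order_by_next_date(entries):
    """Sort (uri, next_date) pairs so the earliest next_date comes first.

    Nothing here enforces that a URI is crawled exactly at next_date;
    the value only decides which URI is preferred.
    """
    ranked = sorted(entries, key=lambda entry: next_date_to_timestamp(entry[1]))
    return [uri for uri, _ in ranked]
```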

close()[source]¶: Close the underlying frontend queues.

get_next()[source]¶: Return the next uri scheduled for crawling.

process_not_found(curi)[source]¶

Called when a URL was not found.

This could mean that the URL has been removed from the server. If so, do something about it!

Override this method in the actual frontier implementation.

process_redirect(curi)[source]¶

Called when there were too many redirects for a URL, or the site has not been updated since the last visit.

In the latter case, update the internal uri and increase the priority level.

process_server_error(curi)[source]¶

Called when there was some kind of server error.

Override this method in the actual frontier implementation.
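Overriding these hooks might look like the sketch below. To keep the example self-contained it does not import spyder: `BaseFrontierSketch` is a hypothetical stand-in for AbstractBaseFrontier, the `curi` argument is a plain dict with a `"url"` key instead of a real CrawlUri, and the error threshold is an invented policy for illustration.

```python
class BaseFrontierSketch:
    """Hypothetical stand-in for AbstractBaseFrontier; hooks do nothing by default."""

    def process_not_found(self, curi):
        pass

    def process_server_error(self, curi):
        pass


class CountingFrontier(BaseFrontierSketch):
    """Example subclass: drops missing URLs and URLs that keep failing."""

    def __init__(self, max_errors=3):
        self.removed = set()      # URLs that should no longer be scheduled
        self.error_counts = {}    # URL -> number of server errors seen so far
        self.max_errors = max_errors

    def process_not_found(self, curi):
        # Assumed policy: a missing page is removed from scheduling.
        self.removed.add(curi["url"])

    def process_server_error(self, curi):
        # Assumed policy: give up once a URL has failed max_errors times.
        url = curi["url"]
        self.error_counts[url] = self.error_counts.get(url, 0) + 1
        if self.error_counts[url] >= self.max_errors:
            self.removed.add(url)
```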

process_successful_crawl(curi)[source]¶

Called when a URI has been crawled successfully.

curi is a CrawlUri.

update_uri(curi)[source]¶: Update a given uri.

class spyder.core.frontier.MultipleHostFrontier(settings, log_handler)[source]¶

A Frontier for crawling many hosts simultaneously.

get_next()[source]¶: Get the next URI that is ready to be crawled.

process_not_found(curi)[source]¶: The page does not exist anymore!

process_redirect(curi)[source]¶: There was a redirect.

process_server_error(curi)[source]¶: Punish any server errors in the budget for this queue.

process_successful_crawl(curi)[source]¶: Crawling was successful, now update the politeness rules.

class spyder.core.frontier.SingleHostFrontier(settings, log_handler)[source]¶

A frontier for crawling a single host.

get_next()[source]¶

Get the next URI.

Only return the next URI if we have waited enough.

process_successful_crawl(curi)[source]¶: Add the time-based politeness to this frontier.
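The interplay between waiting in get_next and the time-based politeness set after a successful crawl can be sketched as follows. `PoliteGate`, its fixed `delay_seconds`, and returning `None` when nothing is ready are all assumptions made for this example; the actual SingleHostFrontier may compute the delay and signal an empty frontier differently.

```python
import time


class PoliteGate:
    """Hypothetical single-host scheduler with a fixed politeness delay."""

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.next_allowed = 0.0  # earliest time the next URI may be handed out
        self.pending = []        # URIs waiting to be crawled, oldest first

    def add(self, uri):
        self.pending.append(uri)

    def get_next(self, now=None):
        """Return the next URI only if the politeness delay has elapsed."""
        now = time.time() if now is None else now
        if not self.pending or now < self.next_allowed:
            return None
        return self.pending.pop(0)

    def process_successful_crawl(self, now=None):
        """After a successful crawl, push back the next allowed crawl time."""
        now = time.time() if now is None else now
        self.next_allowed = now + self.delay_seconds
```

Passing `now` explicitly keeps the sketch deterministic in tests; in normal use both methods fall back to the current wall-clock time.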
