This module contains the default queue storages backed by SQLite.
Exception raised when a queue could not be found.
A queue storage for multiple queues that can be used for crawling multiple hosts simultaneously.
Internally, all URLs are stored in one table. Each queue has its own INTEGER identifier.
Each URL is represented as a tuple of the form uri = (url, queue, etag, mod_date, next_date, priority).
The queue is ordered by next_date in ascending order.
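The layout described above can be sketched with a small schema. The table and column names here are assumptions for illustration, not necessarily the module's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queues (
    queue      INTEGER PRIMARY KEY AUTOINCREMENT,
    identifier TEXT UNIQUE              -- e.g. the host being crawled
);
CREATE TABLE queue_storage (
    url       TEXT PRIMARY KEY,
    queue     INTEGER,                  -- id from the queues table
    etag      TEXT,
    mod_date  INTEGER,
    next_date INTEGER,
    priority  INTEGER
);
-- ordering by next_date benefits from an index
CREATE INDEX next_date_idx ON queue_storage (next_date);
""")
```

Keeping all queues in one table with an INTEGER queue column avoids creating a table per host while still allowing per-queue queries.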
Add a new queue with the given identifier. If the queue already exists, its id is returned; otherwise the id of the newly created queue is returned.
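The get-or-create behaviour can be sketched as follows; the helper name and schema are assumptions, not the module's API:

```python
import sqlite3

def add_or_create_queue(conn, identifier):
    # Return the existing queue id if the identifier is known,
    # otherwise insert a new row and return its id (hypothetical helper).
    row = conn.execute(
        "SELECT queue FROM queues WHERE identifier = ?", (identifier,)).fetchone()
    if row is not None:
        return row[0]
    return conn.execute(
        "INSERT INTO queues (identifier) VALUES (?)", (identifier,)).lastrowid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queues (queue INTEGER PRIMARY KEY AUTOINCREMENT, "
             "identifier TEXT UNIQUE)")
first = add_or_create_queue(conn, "example.com")
again = add_or_create_queue(conn, "example.com")
assert first == again
```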
A generator for iterating over all available URLs.
Note: this does not return the full uri tuple, only the url. It is used to refill the unique URI filter upon restart.
A generator for iterating over all available queues.
This will return (queue, identifier) tuples as (int, str).
Get the queue for the given identifier if there is one. Raises a QueueNotFound error if there is no queue with the identifier.
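A minimal sketch of the lookup-or-raise behaviour, assuming the schema from above; only the QueueNotFound name comes from the source:

```python
import sqlite3

class QueueNotFound(Exception):
    """Raised when a queue could not be found."""

def get_queue(conn, identifier):
    # Hypothetical lookup matching the described behaviour.
    row = conn.execute(
        "SELECT queue FROM queues WHERE identifier = ?", (identifier,)).fetchone()
    if row is None:
        raise QueueNotFound(identifier)
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queues (queue INTEGER PRIMARY KEY, identifier TEXT UNIQUE)")
conn.execute("INSERT INTO queues VALUES (1, 'example.com')")
```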
Called when a URI should be ignored. This is usually the case for an HTTP 404 or recurring HTTP 500s.
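One way to implement this, consistent with the next_date convention described below (status codes are always below 1000, so head queries that require next_date >= 1000 skip the entry); the function name and table layout are assumptions:

```python
import sqlite3

def ignore_uri(conn, url, status_code):
    # Store the HTTP status code in next_date; since it is below 1000,
    # the URI is never returned by the head query again (sketch).
    conn.execute("UPDATE queue_storage SET next_date = ? WHERE url = ?",
                 (status_code, url))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_storage (url TEXT PRIMARY KEY, next_date INTEGER)")
conn.execute("INSERT INTO queue_storage VALUES ('http://example.com/missing', 2000)")
ignore_uri(conn, "http://example.com/missing", 404)
```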
Calculate the number of known URIs. If queue is given, only the size of that queue is returned; otherwise the size of all queues is returned.
Return the top n elements from the queue. By default, return the top element.
If offset is specified, the first offset entries are skipped.
Any entries with a next_date below 1000 are ignored. This enables the crawler to ignore URIs _and_ store the status code.
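The head query described above can be sketched as one SELECT: order by next_date ascending, filter out entries below 1000 (ignored URIs carrying a status code), and honour n and offset. Table and function names are assumptions:

```python
import sqlite3

def queue_head(conn, queue, n=1, offset=0):
    # Return up to n uri tuples for the given queue, skipping ignored
    # entries (next_date < 1000) and the first `offset` rows (sketch).
    return conn.execute(
        """SELECT url, queue, etag, mod_date, next_date, priority
           FROM queue_storage
           WHERE queue = ? AND next_date >= 1000
           ORDER BY next_date ASC
           LIMIT ? OFFSET ?""",
        (queue, n, offset)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue_storage
    (url TEXT PRIMARY KEY, queue INTEGER, etag TEXT,
     mod_date INTEGER, next_date INTEGER, priority INTEGER)""")
conn.executemany("INSERT INTO queue_storage VALUES (?, ?, ?, ?, ?, ?)", [
    ("http://example.com/a", 1, None, 0, 3000, 1),
    ("http://example.com/b", 1, None, 0, 2000, 1),
    ("http://example.com/c", 1, None, 0, 404, 1),   # ignored URI, carries a 404
])
```

Note how the status code stored in next_date falls out of the result set for free, without a separate "ignored" flag column.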
This is a queue that can be used for crawling a single host.
Internally there is only one queue for all URLs. Each URL is represented as a tuple of the form: uri = (url, etag, mod_date, next_date, priority).
The queue is ordered by next_date in ascending order.
A generator for iterating over all available URLs.
Note: this does not return the full uri tuple, only the url. It is used to refill the unique URI filter upon restart.
Called when a URI should be ignored. This is usually the case for an HTTP 404 or recurring HTTP 500s.
Return the top n elements from the queue. By default, return the top element.
If offset is specified, the first offset entries are skipped.
Any entries with a next_date below 1000 are ignored. This enables the crawler to ignore URIs _and_ store the status code.
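For the single-host queue the head query is the same, except that the uri tuple has no queue column and no queue filter is needed; names remain assumptions:

```python
import sqlite3

def queue_head(conn, n=1, offset=0):
    # Single-queue variant: no queue column, otherwise identical semantics.
    return conn.execute(
        """SELECT url, etag, mod_date, next_date, priority
           FROM queue_storage
           WHERE next_date >= 1000
           ORDER BY next_date ASC
           LIMIT ? OFFSET ?""",
        (n, offset)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE queue_storage
    (url TEXT PRIMARY KEY, etag TEXT, mod_date INTEGER,
     next_date INTEGER, priority INTEGER)""")
conn.executemany("INSERT INTO queue_storage VALUES (?, ?, ?, ?, ?)", [
    ("http://example.com/a", None, 0, 2000, 1),
    ("http://example.com/b", None, 0, 404, 1),   # ignored URI
])
```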