Queue Management

This module contains the default queue storages backed by SQLite.

exception spyder.core.sqlitequeues.QueueException

Base exception for errors in the queues.

exception spyder.core.sqlitequeues.QueueNotFound(identifier)

Exception raised when a queue could not be found.

class spyder.core.sqlitequeues.SQLiteMultipleHostUriQueue(db_name)

A queue storage for multiple queues that can be used for crawling multiple hosts simultaneously.

Internally, all URLs are stored in one table. Each queue has its own INTEGER identifier.

Each URL is represented as a tuple of the form uri = (url, queue, etag, mod_date, next_date, priority).

The queue is ordered using the next_date in ascending fashion.
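As a rough illustration of the layout described above, the following sketch keeps the URIs of several queues in one SQLite table and reads one queue back in ascending next_date order. The table name, column names, and sample values are assumptions made for this example; they are not taken from the spyder source.

```python
import sqlite3

# Hypothetical schema: a single table holds the URIs of every queue.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE queues (
           url TEXT PRIMARY KEY,
           queue INTEGER,
           etag TEXT,
           mod_date INTEGER,
           next_date INTEGER,
           priority INTEGER
       )"""
)

# Each uri follows the documented tuple layout:
# (url, queue, etag, mod_date, next_date, priority)
uris = [
    ("http://a.example/late", 1, "etag-1", 1300000000, 1300000300, 1),
    ("http://a.example/soon", 1, "etag-2", 1300000000, 1300000100, 1),
    ("http://b.example/", 2, "etag-3", 1300000000, 1300000200, 1),
]
conn.executemany("INSERT INTO queues VALUES (?, ?, ?, ?, ?, ?)", uris)


def uris_for_queue(queue):
    """Return the urls of one queue, earliest next_date first."""
    rows = conn.execute(
        "SELECT url FROM queues WHERE queue = ? ORDER BY next_date ASC",
        (queue,),
    )
    return [row[0] for row in rows]


queue_one = uris_for_queue(1)
```

Because the queue id is just a column, URIs from different hosts can share the table and still be read back per queue.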

add_or_create_queue(identifier)

Add a new queue with the identifier. If the queue already exists, its id is returned; otherwise the id of the newly created queue is returned.
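The get-or-create behaviour described here could look roughly like the sketch below. It is not the spyder implementation: the queue_identifiers table and its columns are hypothetical, and the function only mirrors the documented contract of returning the same id for a known identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mapping a string identifier to an INTEGER queue id.
conn.execute(
    "CREATE TABLE queue_identifiers ("
    "queue INTEGER PRIMARY KEY AUTOINCREMENT, identifier TEXT UNIQUE)"
)


def add_or_create_queue(identifier):
    """Return the id for identifier, creating the queue if it is missing."""
    row = conn.execute(
        "SELECT queue FROM queue_identifiers WHERE identifier = ?",
        (identifier,),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO queue_identifiers (identifier) VALUES (?)",
        (identifier,),
    )
    return cursor.lastrowid


first = add_or_create_queue("example.com")
again = add_or_create_queue("example.com")
other = add_or_create_queue("example.org")
```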

add_uri(uri)

Add the uri to the given queue.

add_uris(uris)

Add the list of uris to the given queue.

all_uris(queue=None)

A generator for iterating over all available urls.

Note: does not return the full uri object, only the url. This will be used to refill the unique uri filter upon restart.
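A minimal sketch of the restart scenario mentioned in the note, assuming a hypothetical two-column table and a plain Python set standing in for the unique uri filter. The real filter and schema in spyder may look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema; only the url column matters here.
conn.execute("CREATE TABLE queues (url TEXT PRIMARY KEY, next_date INTEGER)")
conn.executemany(
    "INSERT INTO queues VALUES (?, ?)",
    [
        ("http://example.com/a", 1300000100),
        ("http://example.com/b", 1300000200),
    ],
)


def all_uris():
    """Yield only the url of each entry, not the full uri tuple."""
    for (url,) in conn.execute("SELECT url FROM queues"):
        yield url


# A plain set stands in for the unique uri filter that is refilled on restart.
seen = set(all_uris())
```

Yielding urls one at a time keeps memory use low when the table is large, since the filter only needs the url itself.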

get_all_queues()

A generator for iterating over all available queues.

Yields (queue, identifier) pairs as (int, str).

get_queue_count()

Return the number of available queues.

get_queue_for_ident(identifier)

Get the queue for the given identifier if there is one. Raises a QueueNotFound error if there is no queue with the identifier.
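On the caller side, the documented error can be handled with an ordinary try/except. The sketch below uses local stand-ins so that it runs on its own: the QueueNotFound class, the dictionary, and the queue_or_none helper are all hypothetical and only imitate the documented behaviour.

```python
class QueueNotFound(Exception):
    """Local stand-in for spyder.core.sqlitequeues.QueueNotFound."""


# Hypothetical in-memory mapping used instead of the SQLite storage.
_queues = {"example.com": 1}


def get_queue_for_ident(identifier):
    """Return the queue id or raise QueueNotFound, as documented."""
    try:
        return _queues[identifier]
    except KeyError:
        raise QueueNotFound(identifier)


def queue_or_none(identifier):
    """Caller-side helper (hypothetical): map the exception to None."""
    try:
        return get_queue_for_ident(identifier)
    except QueueNotFound:
        return None
```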

get_uri(url)

Return the URI tuple for the given URL.

ignore_uri(url, status)

Called when a URI should be ignored. This is usually the case after an HTTP 404 or recurring HTTP 500s.

qsize(queue=None)

Calculate the number of known uris. If queue is given, only return the size of this queue, otherwise the size of all queues is returned.
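The optional queue argument can be pictured as an optional WHERE clause on a COUNT query. This is a sketch under an assumed schema (table and column names are hypothetical), not the code spyder runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema with just the columns needed for counting.
conn.execute("CREATE TABLE queues (url TEXT PRIMARY KEY, queue INTEGER)")
conn.executemany(
    "INSERT INTO queues VALUES (?, ?)",
    [
        ("http://a.example/1", 1),
        ("http://a.example/2", 1),
        ("http://b.example/1", 2),
    ],
)


def qsize(queue=None):
    """Count all uris, or only those of one queue when queue is given."""
    if queue is None:
        row = conn.execute("SELECT COUNT(*) FROM queues").fetchone()
    else:
        row = conn.execute(
            "SELECT COUNT(*) FROM queues WHERE queue = ?", (queue,)
        ).fetchone()
    return row[0]
```

The check uses `is None` rather than truthiness so that a queue id of 0 would still be treated as a specific queue.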

queue_head(queue, n=1, offset=0)

Return the top n elements from the queue. By default, return the top element from the queue.

If you specify offset, the first offset entries are skipped.

Entries with a next_date below 1000 are ignored. This lets the crawler ignore a URI while still storing its status code.
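One way to read the description above is that an ignored URI has its next_date overwritten with the HTTP status code, which is always below 1000, so the head query can filter it out. That reading is an assumption, as are the table and column names in this sketch; it is meant to illustrate the n, offset, and threshold semantics only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced schema.
conn.execute(
    "CREATE TABLE queues (url TEXT PRIMARY KEY, queue INTEGER, next_date INTEGER)"
)
conn.executemany(
    "INSERT INTO queues VALUES (?, ?, ?)",
    [
        ("http://example.com/a", 1, 1300000100),
        ("http://example.com/b", 1, 1300000200),
        ("http://example.com/c", 1, 1300000300),
        ("http://example.com/gone", 1, 1300000000),
    ],
)


def ignore_uri(url, status):
    """Assumed behaviour: store the HTTP status code in next_date."""
    conn.execute("UPDATE queues SET next_date = ? WHERE url = ?", (status, url))


def queue_head(queue, n=1, offset=0):
    """Return up to n urls, skipping offset entries and ignored uris."""
    rows = conn.execute(
        "SELECT url FROM queues WHERE queue = ? AND next_date >= 1000 "
        "ORDER BY next_date ASC LIMIT ? OFFSET ?",
        (queue, n, offset),
    )
    return [row[0] for row in rows]


ignore_uri("http://example.com/gone", 404)
stored_status = conn.execute(
    "SELECT next_date FROM queues WHERE url = ?", ("http://example.com/gone",)
).fetchone()[0]
```

Without the call to ignore_uri, the entry for /gone would have been the head of the queue, since it had the smallest next_date.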

remove_uris(uris)

Remove all uris.

update_uri(uri)

Update the uri.

update_uris(uris)

Update the list of uris in the database.
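A batch update of this kind can be sketched with executemany. The example assumes that the url identifies the row and that all other fields of the tuple are overwritten; neither assumption, nor the table and column names, is confirmed by the documentation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema following the documented tuple layout.
conn.execute(
    "CREATE TABLE queues (url TEXT PRIMARY KEY, queue INTEGER, etag TEXT, "
    "mod_date INTEGER, next_date INTEGER, priority INTEGER)"
)
conn.execute(
    "INSERT INTO queues VALUES (?, ?, ?, ?, ?, ?)",
    ("http://example.com/", 1, "old", 1300000000, 1300000100, 1),
)


def update_uris(uris):
    """Overwrite the non-key fields of each uri, matched by its url."""
    conn.executemany(
        "UPDATE queues SET queue = ?, etag = ?, mod_date = ?, "
        "next_date = ?, priority = ? WHERE url = ?",
        [
            (queue, etag, mod_date, next_date, priority, url)
            for (url, queue, etag, mod_date, next_date, priority) in uris
        ],
    )


update_uris([("http://example.com/", 1, "new", 1300000500, 1300000900, 2)])
row = conn.execute("SELECT etag, next_date, priority FROM queues").fetchone()
```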

class spyder.core.sqlitequeues.SQLiteSingleHostUriQueue(db_name)

This is a queue that can be used for crawling a single host.

Internally there is only one queue for all URLs. Each URL is represented as a tuple of the form: uri = (url, etag, mod_date, next_date, priority).

The queue is ordered using the next_date in ascending fashion.

add_uri(uri)

Add a uri to the queue.

add_uris(urls)

Add a list of uris.

all_uris()

A generator for iterating over all available urls.

Note: does not return the full uri object, only the url. This will be used to refill the unique uri filter upon restart.

get_uri(url)

Mostly for debugging purposes.

ignore_uri(url, status)

Called when a URI should be ignored. This is usually the case after an HTTP 404 or recurring HTTP 500s.

queue_head(n=1, offset=0)

Return the top n elements from the queue. By default, return the top element from the queue.

If you specify offset, the first offset entries are skipped.

Entries with a next_date below 1000 are ignored. This lets the crawler ignore a URI while still storing its status code.

remove_uris(uris)

Remove all uris.

update_uri(uri)

Update the uri.

update_uris(uris)

Update the list of uris in the database.

class spyder.core.sqlitequeues.SQLiteStore(db_name)

Simple base class for SQLite-based queue storages. It sets the default pragmas and initializes Unicode handling.

checkpoint()

Checkpoint the database, i.e. commit everything.

close()

Close the SQLite connection.
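To make the role of such a base class concrete, here is a small sketch of a store that sets pragmas on connect, commits on checkpoint, and closes the connection. The class name SketchStore and the two pragmas are chosen for illustration only; the pragmas spyder actually sets may differ.

```python
import sqlite3


class SketchStore:
    """Hypothetical stand-in for a SQLite-backed store; not the spyder class."""

    def __init__(self, db_name):
        self._connection = sqlite3.connect(db_name)
        # Example pragmas only; these are assumptions, not spyder's defaults.
        self._connection.execute('PRAGMA encoding = "UTF-8"')
        self._connection.execute("PRAGMA synchronous = OFF")

    def checkpoint(self):
        """Commit all pending changes."""
        self._connection.commit()

    def close(self):
        """Close the underlying SQLite connection."""
        self._connection.close()


store = SketchStore(":memory:")
store._connection.execute("CREATE TABLE t (x INTEGER)")
store._connection.execute("INSERT INTO t VALUES (1)")
pending = store._connection.in_transaction  # the INSERT opened a transaction
store.checkpoint()
committed = not store._connection.in_transaction
store.close()
```

Calling checkpoint at regular intervals bounds how much work is lost if the process stops before close is reached.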

exception spyder.core.sqlitequeues.UriNotFound(url)

Exception raised when a URI could not be found in the storage.
