Link Extractors

The DefaultHtmlLinkExtractor tries to extract new links from curi.content_body. It uses two regular expressions to find them:

  1. The RELEVANT_TAG_EXTRACTOR extracts the following tags:
    • <script>..</script>
    • <style>..</style>
    • <meta>
    • any other opening tag with at least one attribute (so a bare <br>, for example, is skipped).

  2. The LINK_EXTRACTOR extracts links from the href or src attributes of those tags.

If a link is relative, the extractor automatically prepends the appropriate prefix.
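The two-stage approach described above can be sketched roughly as follows. This is only an illustration: TAG_PATTERN and LINK_PATTERN are simplified, hypothetical stand-ins for RELEVANT_TAG_EXTRACTOR and LINK_EXTRACTOR (the real expressions come from Heritrix and are more involved), the function name extract_links is made up, and using urljoin to resolve relative links is an assumption about how the prefix could be added, not necessarily what the extractor does internally.

```python
import re
from urllib.parse import urljoin

# Simplified stand-ins, NOT the Heritrix-derived expressions.
# Stage 1: opening tags that carry at least one attribute.
TAG_PATTERN = re.compile(r"<[a-zA-Z][a-zA-Z0-9]*\s+[^>]*>")
# Stage 2: quoted href or src attribute values inside such a tag.
LINK_PATTERN = re.compile(
    r"""\b(?:href|src)\s*=\s*["']([^"']+)["']""", re.IGNORECASE
)


def extract_links(base_url, content_body):
    """Return absolute URLs found in href/src attributes of content_body."""
    links = []
    for tag in TAG_PATTERN.finditer(content_body):
        for match in LINK_PATTERN.finditer(tag.group(0)):
            # urljoin leaves absolute URLs untouched and resolves
            # relative ones against base_url.
            links.append(urljoin(base_url, match.group(1)))
    return links


html = (
    '<p><a href="/about">About</a><br>'
    '<img src="logo.png">'
    '<a href="http://example.org/x">x</a></p>'
)
links = extract_links("http://example.com/docs/index.html", html)
```

In this sketch the bare <p> and <br> tags never reach the second stage because they have no attributes, which mirrors the filtering done by the first expression.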

The regular expressions have been adopted from Heritrix. See the Heritrix 3 source code:

modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java

Note

Heritrix now has a newer way of extracting links that uses different regular expressions. Since the ones used here are working for me at the moment, I am fine with them.

class spyder.processor.htmllinkextractor.DefaultHtmlLinkExtractor(settings)

The default extractor for links from HTML pages.

The internal regular expressions are currently not modifiable. Only the maximum length of an opening tag can be configured, using settings.REGEX_LINK_XTRACTOR_MAX_ELEMENT_LENGTH.
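To illustrate what a maximum element length means for a tag-matching expression, here is a minimal sketch. The function build_tag_pattern, the simplified pattern, and the value 64 are hypothetical; how the real extractor feeds the setting into its expressions may differ.

```python
import re

# Hypothetical value standing in for
# settings.REGEX_LINK_XTRACTOR_MAX_ELEMENT_LENGTH.
MAX_ELEMENT_LENGTH = 64


def build_tag_pattern(max_element_length):
    """Compile a simplified opening-tag pattern with a bounded length."""
    # At most max_element_length characters may follow the first
    # letter of the tag name before the closing '>'.
    return re.compile(r"<[a-zA-Z][^>]{0,%d}>" % max_element_length)


pattern = build_tag_pattern(MAX_ELEMENT_LENGTH)

short_tag = '<a href="/x">'
long_tag = '<a href="/' + "x" * 100 + '">'

short_match = pattern.search(short_tag)  # within the limit: matches
long_match = pattern.search(long_tag)    # exceeds the limit: no match
```

A bound like this keeps the expression from scanning arbitrarily far through malformed markup, at the cost of skipping unusually long opening tags.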
