The DefaultHtmlLinkExtractor tries to extract new links from the curi.content_body. It uses two regular expressions to find them.
2. The LINK_EXTRACTOR extracts links from tags using href or src attributes.
If the link is relative, the appropriate prefix is automatically added here.
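As a rough illustration of the idea, the following sketch pulls href and src values out of markup and resolves relative ones against the page URL. The pattern SIMPLE_LINK_RE and the function extract_links are hypothetical and much simpler than the Heritrix-derived LINK_EXTRACTOR. Using urljoin for the prefixing step is also an assumption, not necessarily what the extractor does internally.

```python
import re
from urllib.parse import urljoin

# Hypothetical, simplified pattern; NOT the Heritrix-derived LINK_EXTRACTOR.
# It captures double-quoted, single-quoted, or unquoted href/src values.
SIMPLE_LINK_RE = re.compile(
    r"""\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_links(base_url, html):
    """Return an absolute URL for every href/src attribute value in html."""
    links = []
    for match in SIMPLE_LINK_RE.finditer(html):
        # Exactly one of the three groups is set, depending on the quoting.
        value = next(g for g in match.groups() if g is not None)
        # Assumption: relative links are resolved against the page URL.
        links.append(urljoin(base_url, value.strip()))
    return links


if __name__ == "__main__":
    page = '<a href="/about">About</a> <img src="img/logo.png">'
    for link in extract_links("http://example.com/dir/page.html", page):
        print(link)
```

Absolute links pass through urljoin unchanged, so the same code path handles both kinds.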
The regular expressions have been adopted from Heritrix. See the Heritrix 3 source code:
modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java
Note
Heritrix has a newer way of extracting links, i.e. with different regular expressions. Since the ones used here are working for me at the moment, I am fine with them.
The default extractor for links from HTML pages.
The internal regular expressions are currently not modifiable. Only the maximum length of an opening tag can be configured, using settings.REGEX_LINK_XTRACTOR_MAX_ELEMENT_LENGTH.
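To show how a length limit can be built into a tag-matching pattern, here is a minimal sketch. MAX_ELEMENT_LENGTH and build_open_tag_regex are hypothetical names, and the pattern is a simplified assumption, not the extractor's actual expression. A bounded quantifier like this is commonly used so that a missing closing bracket in malformed markup cannot make the regex scan arbitrarily far.

```python
import re

# Hypothetical stand-in for settings.REGEX_LINK_XTRACTOR_MAX_ELEMENT_LENGTH.
MAX_ELEMENT_LENGTH = 64


def build_open_tag_regex(max_length):
    """Compile a pattern for an opening tag whose attribute text is bounded.

    Group 1 is the tag name, group 2 the raw attribute text.
    """
    # {1,N} caps how many characters may follow the tag name before '>'.
    return re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)\s+([^>]{1,%d})>" % max_length)


if __name__ == "__main__":
    tag_re = build_open_tag_regex(MAX_ELEMENT_LENGTH)
    match = tag_re.search('<a href="/index.html">')
    print(match.group(1), match.group(2))
```

Tags whose attribute text exceeds the limit are simply not matched, so raising the setting trades extraction coverage against matching cost.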