External Link Extractor

This plugin checks external sources for broken links.

Most of the extractors work from the inside outwards. They extract links from your site's content and check internal and external links. 

This plugin checks links from the outside. An example usage would be adding your site's sitemap to this plugin's configuration.

A sitemap usually represents the menu structure and the main content of your site as you would like web-bots to see your site.

A generated sitemap shouldn't produce any broken or forbidden URLs, although it might contain URLs that are redirected. A manually maintained sitemap might contain outdated links.

The plugin configuration allows adding plain (HTML) destinations, XML sitemaps and CSV-formatted documents. Note that for a plain destination the plugin checks the link TO the destination page, not the links ON it.

Options & Settings

Re-extract interval

This is the interval between two extracts in days.

An extract loads the external file and parses the links.

External files do not have a modified field that could trigger a re-extract automatically. A check on the 304 Not Modified response is possible and may be added in the future.

Setting the interval to zero will disable the time-based recheck.
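
The interval rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the plugin's own code: the function name needs_reextract and its parameters are hypothetical, and the sketch assumes the time of the previous extract is known.

```python
from datetime import datetime, timedelta


def needs_reextract(last_extract, interval_days, now=None):
    """Decide whether a time-based re-extract is due (hypothetical helper).

    last_extract  -- datetime of the previous extract
    interval_days -- the configured re-extract interval in days
    now           -- current time, injectable so the rule is easy to test
    """
    # An interval of zero disables the time-based recheck entirely.
    if interval_days <= 0:
        return False
    if now is None:
        now = datetime.now()
    # Due once at least `interval_days` have passed since the last extract.
    return now - last_extract >= timedelta(days=interval_days)
```

With an interval of 7, a source extracted eight days ago is due again, while a source extracted yesterday is not.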

Links

Name

The name is used for your convenience in the reporting.

Link

Link can be one of:

An XML-formatted sitemap.

Currently, the <loc> and <image:loc> types of links are supported. A sitemap must respond with a text/xml content-type.

An HTML-formatted sitemap.

Treats the external link as a text source and extracts links using the enabled parsers.

A CSV formatted document. 

The document might contain a header row. Use 'URL' or 'link' as the header for the column containing URLs and 'name' or 'title' for the column with anchor text.

Without a header, the first column must contain the URL, a second might contain an anchor.

The delimiter should be auto-detected. A comma-separated list is preferred.

All URLs must start with HTTP(S). Other protocols or relative URLs are not supported.

A CSV file must respond with a text/csv content-type.

A Plain Link (default)

All other links are treated as direct links to a page.
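
The XML and CSV rules above can be illustrated with a short Python sketch. It is not the plugin's implementation. The function names are hypothetical, the namespace URIs are the ones commonly used for sitemaps and the image extension (an assumption, since this page does not name them), and the delimiter detection is a simple stand-in that counts candidate characters in the first line and falls back to a comma.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed namespaces: sitemap protocol and the image sitemap extension.
LOC_TAGS = (
    "{http://www.sitemaps.org/schemas/sitemap/0.9}loc",
    "{http://www.google.com/schemas/sitemap-image/1.1}loc",
)
URL_HEADERS = {"url", "link"}
ANCHOR_HEADERS = {"name", "title"}


def is_http_url(value):
    """Only absolute HTTP(S) URLs are supported."""
    return value.lower().startswith(("http://", "https://"))


def links_from_sitemap(xml_text):
    """Collect the text of every <loc> and <image:loc> element."""
    links = []
    for element in ET.fromstring(xml_text).iter():
        url = (element.text or "").strip()
        if element.tag in LOC_TAGS and is_http_url(url):
            links.append(url)
    return links


def links_from_csv(csv_text):
    """Return (url, anchor) pairs from a CSV with or without a header row."""
    first_line = csv_text.splitlines()[0] if csv_text.strip() else ""
    # Most frequent candidate wins; a tie (or no match) falls back to comma.
    delimiter = max(",;\t|", key=first_line.count)
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    rows = [row for row in reader if row]
    if not rows:
        return []

    header = [cell.strip().lower() for cell in rows[0]]
    url_col, anchor_col = 0, 1  # headerless layout: URL first, anchor second
    if any(name in URL_HEADERS for name in header):
        url_col = next(i for i, n in enumerate(header) if n in URL_HEADERS)
        anchor_col = next(
            (i for i, n in enumerate(header) if n in ANCHOR_HEADERS), None
        )
        rows = rows[1:]  # skip the header row itself

    links = []
    for row in rows:
        url = row[url_col].strip() if url_col < len(row) else ""
        anchor = ""
        if anchor_col is not None and anchor_col < len(row):
            anchor = row[anchor_col].strip()
        if is_http_url(url):  # drops relative URLs and other protocols
            links.append((url, anchor))
    return links
```

In this sketch, rows whose URL is relative or uses another protocol are skipped silently. Whether the plugin skips or reports such rows is not stated here.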

Mime

If the remote server responds with a correct MIME type (text/xml, text/csv), the extractor should auto-detect the format. However, in the likely event that a server is not configured correctly, you can enforce a MIME type.

Nginx, for example, does not map text/csv in its default settings and returns application/octet-stream, which can be anything.
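
How an enforced MIME type could take precedence over the server's response is sketched below in Python. The function name resolve_format, its parameters, and the returned labels are hypothetical and only serve to illustrate the idea; the plugin's actual detection logic may differ.

```python
# Formats this sketch can recognise from a MIME type (labels are made up).
KNOWN_FORMATS = {
    "text/xml": "xml",
    "text/csv": "csv",
}


def resolve_format(content_type_header, forced_mime=None):
    """Pick a format from an enforced MIME type or the Content-Type header.

    Returns None when the type is not recognised, so that the caller can
    fall back to its default handling.
    """
    # An enforced MIME type wins over whatever the server reported.
    mime = forced_mime or content_type_header or ""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = mime.split(";", 1)[0].strip().lower()
    return KNOWN_FORMATS.get(mime)
```

For a CSV file served as application/octet-stream, resolve_format("application/octet-stream") returns None, while passing forced_mime="text/csv" yields "csv".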

Delete extracted data on plugin save

Plugin settings might impact the extracted data. This option purges all extracted data for the plugin whenever the plugin settings change.