Sometimes it is helpful to have a tool that is able to retrieve some or all pages of one or more websites automatically. Doing this by hand can be very nerve-racking and time-consuming.
This spider scans HTML and WML documents for references to other documents, which are in turn retrieved and scanned only if certain conditions are met, so that it does not end up fetching the whole Internet. Certain elements of a page, e.g. inline images and scripts, may be excluded from retrieval. The spider does not evaluate forms of any kind and therefore performs GET requests only.
Accepting the settings made in this dialog opens a FetchDocs window that shows the progress of the retrieval. A fetch operation may be re-executed later with suitable settings, for example because it was aborted. Note that the specified Directory is treated like a cache in any case: all requests go through it, so documents already stored there are used to satisfy matching requests (for certain restrictions on this, see below).
The following parameters for fetching documents can be specified:
Specifies the URL of a document that should be retrieved initially. Additionally, this field provides a predefined list of some common URLs which is stored in the configuration file.
Defines the location of a proxy-server or gateway that should be used to fetch documents from other servers. Additionally, this field provides a predefined list of some common proxy-server addresses which is stored in the configuration file.
(Links, Images, Scripts, Styles) Determines which kinds of elements of a page should be fetched:
Links
Images
Scripts (<script> elements)
Styles (<link> elements)
Note that scripts and style sheets are not scanned for further references to other documents.
Specifies the name of a new or existing directory into which retrieved documents should be stored.
(cache, dummy, files) Determines the type of the parameter Directory. The following types are supported:
cache
files
dummy
Note that not all fetched documents can be saved to files on the local filesystem (see below).
Specifies a space-separated list of prefixes of URLs to be exclusively retrieved. If not specified, all URLs that match the prefix of the StartURL parameter up to and including the last slash (/) of the path component are retrieved.
Specifies a space-separated list of prefixes of URLs to be excluded from being retrieved, superseding conflicting include prefixes.
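The interplay of the element selection and the include and exclude prefixes can be illustrated with a minimal Python sketch. It is not the spider's actual code: the function and class names are hypothetical, and the mapping of Links to <a> and Images to <img> is an assumption, since only <script> and <link> are named above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed mapping of fetch options to (tag, attribute). The text names only
# <script> and <link>; the <a> and <img> entries are guesses.
ELEMENTS = {
    "Links": ("a", "href"),
    "Images": ("img", "src"),
    "Scripts": ("script", "src"),
    "Styles": ("link", "href"),
}


class ReferenceScanner(HTMLParser):
    """Collects absolute URLs referenced by the enabled element kinds."""

    def __init__(self, page_url, enabled):
        super().__init__()
        self.page_url = page_url
        self.wanted = {ELEMENTS[name] for name in enabled}
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in self.wanted:
                # Resolve relative references against the page's own URL.
                self.found.append(urljoin(self.page_url, value))


def default_include_prefix(start_url):
    """Start URL up to and including the last slash of its path component."""
    parts = urlsplit(start_url)
    path = parts.path[: parts.path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def should_fetch(url, start_url, include="", exclude=""):
    """Apply the space-separated prefix lists; exclude prefixes win."""
    includes = include.split() or [default_include_prefix(start_url)]
    if any(url.startswith(prefix) for prefix in exclude.split()):
        return False
    return any(url.startswith(prefix) for prefix in includes)


def references_to_fetch(html, page_url, start_url,
                        enabled=("Links", "Images"), include="", exclude=""):
    """Scan one page and return the references that pass the prefix filters."""
    scanner = ReferenceScanner(page_url, enabled)
    scanner.feed(html)
    scanner.close()
    return [url for url in scanner.found
            if should_fetch(url, start_url, include, exclude)]
```

With a start URL of http://example.org/docs/index.html and no include list, only references below http://example.org/docs/ pass the filter, and an exclude prefix removes matching URLs even if an include prefix also matches them.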
The following parameters are placed on a separate card and specify HTTP header fields that are sent unmodified with every request; empty or disabled fields are simply ignored. The desired processing of certain dynamic headers can also be configured.
A list of content-types for the Accept: header field, e.g. "text/html, text/vnd.wap.wml, text/plain, */*".
A list of content-encodings for the Accept-Encoding: header field, e.g. "deflate, gzip, compress".
A list of charsets for the Accept-Charset: header field, e.g. "utf-8, iso-8859-1, us-ascii, *".
A list of language codes for the Accept-Language: header field, e.g. "de, en, *".
Value for the User-Agent: header field, e.g. "spider" or "Mozilla/4.0 (spider)".
Enables the processing of cookies. All cookies are treated like session cookies; they are accepted only from the originating server and sent back only to it.
Enables the transmission of Referer: headers to servers. This purely informational request header field contains the URL of the document from which a request originates. Authentication information that may be embedded within server-based URLs (the user:password@ part) is stripped before the URL is used as the referrer.
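How the static header fields and the Referer: header could be combined into one request is sketched below in Python. This is only an illustration under assumptions: the function names and the (enabled, value) representation of a dialog field are hypothetical and do not come from the spider itself.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Remove an embedded user:password@ part; everything else is kept."""
    parts = urlsplit(url)
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def build_headers(fields, referer_url=None, send_referer=True):
    """Build the request headers from the dialog fields.

    fields maps a header name to an (enabled, value) pair (assumed layout).
    Disabled or empty fields are ignored; all others are sent unmodified.
    """
    headers = {
        name: value
        for name, (enabled, value) in fields.items()
        if enabled and value.strip()
    }
    if send_referer and referer_url:
        headers["Referer"] = strip_credentials(referer_url)
    return headers
```

For example, a disabled Accept-Language: field and an empty Accept-Charset: field would both be left out of the resulting dictionary, while a referring URL such as http://user:secret@example.org/a.html would be sent as http://example.org/a.html.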
When no cache is used, retrieved documents cannot always be stored as files on the local filesystem, because the original request URL and all header information are lost in the process. Only responses with status code "200" (OK) are candidates for being saved. A document whose URL has a non-empty query part is not stored either.
The filename for an entity to be stored is created by appending the whole path component of the requesting URL to the specified directory name. A missing basename is replaced by the value index plus an extension that is derived from the content-type. The last modification date of a saved file is set to the value of the Last-Modified: header if it is present and valid.
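The storage rules described above can be sketched in Python as follows. The names are hypothetical, and the small content-type table is an assumption standing in for whatever mapping the spider really uses to derive an extension.

```python
import os
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# Assumed content-type to extension table, for illustration only.
EXTENSIONS = {
    "text/html": ".html",
    "text/vnd.wap.wml": ".wml",
    "text/plain": ".txt",
}


def target_path(directory, url, status, content_type):
    """Return the file path for a response, or None if it is not stored."""
    parts = urlsplit(url)
    # Only "200" responses without a query part are candidates.
    if status != 200 or parts.query:
        return None
    path = parts.path or "/"
    if path.endswith("/"):
        # Missing basename: use "index" plus an extension from the content-type.
        media_type = content_type.split(";")[0].strip().lower()
        path += "index" + EXTENSIONS.get(media_type, "")
    return os.path.join(directory, *path.lstrip("/").split("/"))


def apply_last_modified(path, header_value):
    """Set the file's modification time from a Last-Modified: value.

    Returns False and leaves the file untouched if the header is missing
    or cannot be parsed.
    """
    if not header_value:
        return False
    try:
        when = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return False
    os.utime(path, (when.timestamp(), when.timestamp()))
    return True
```

Note that the host name is not part of the resulting path, since the description mentions only the path component of the URL; the sketch also makes no attempt to guard against unusual path segments such as "..".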
Another problem occurs when a saved file is about to be returned in response to a request. Because of the lost header information, the content-type has to be determined from the filename alone. This generally fails for filename extensions that correspond to dynamically generated output from scripts or programs running on the server such as ".pl", ".php" or ".asp", but also for unusual extensions and for no extension at all. A solution would be to store a document only if there is a definite mapping between the filename and its associated content-type. This is currently not implemented.
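One way such a check could look is sketched below in Python. Since the text states that this is not implemented, the sketch is purely hypothetical: it uses the built-in table of Python's mimetypes module as the filename mapping and stores a document only if guessing from the filename gives back the content-type the server declared.

```python
import mimetypes

# A fresh instance uses only the module's built-in extension table.
_TYPES = mimetypes.MimeTypes()


def has_definite_mapping(filename, content_type):
    """True if the filename alone leads back to the declared content-type."""
    declared = content_type.split(";")[0].strip().lower()
    guessed, encoding = _TYPES.guess_type(filename)
    # Unknown extensions, missing extensions and encoded files (e.g. ".gz")
    # do not qualify.
    return encoding is None and guessed == declared
```

Under this rule a file named index.html served as text/html would be stored, whereas page.php served as text/html, or a file without any extension, would not.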