W3Browse - Dialogs

Fetch Documents

Sometimes it is helpful to have a tool that retrieves some or all pages of one or more websites automatically. Doing this by hand is tedious and time-consuming.

This spider scans HTML and WML documents for references to other documents. Referenced documents are in turn retrieved and scanned, but only if certain conditions are met, so that the spider does not end up fetching the whole Internet. Different elements of a page, e.g. inline images and scripts, may be excluded from being retrieved. The spider does not evaluate forms of any kind and therefore performs GET requests only.

Accepting the settings made within this dialog opens a FetchDocs window that shows the progress of the retrieval. A fetch operation may be re-executed later with suitable settings, for example because it was aborted. Note that the specified Directory is treated like a cache in any case: all requests go through it, so documents that are already stored are used to satisfy matching requests (for certain restrictions on this see below).

The following parameters for fetching documents can be specified:

StartURL (text&list)

Specifies the URL of the document that is retrieved first. This field also provides a predefined list of common URLs, which is stored in the configuration file.

Proxy (checkbox+text&list)

Defines the location of a proxy server or gateway that should be used to fetch documents from other servers. This field also provides a predefined list of common proxy server addresses, which is stored in the configuration file.

FetchOpts (checkboxes: Links, Images, Scripts, Styles)

Determines which kinds of elements of a page should be fetched:

Links: normal hyperlinks,
Images: all sorts of inline images including shortcut icons,
Scripts: all sorts of scripts specified by <script> elements,
Styles: all sorts of style sheets referenced by <link> elements.

Note that scripts and style sheets are not scanned for further references to other documents.
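
The classification of page elements described above can be illustrated with a small Python sketch. It is not the w3browse implementation: the class RefCollector, the function collect_refs and the exact choice of tags and attributes are assumptions made for illustration only, and WML documents are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class RefCollector(HTMLParser):
    """Collects references from an HTML page, grouped by FetchOpts kind.

    Hypothetical helper; the tag/attribute choices are assumptions.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.refs = {"Links": [], "Images": [], "Scripts": [], "Styles": []}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        kind = target = None
        if tag == "a" and a.get("href"):
            kind, target = "Links", a["href"]
        elif tag == "img" and a.get("src"):
            kind, target = "Images", a["src"]
        elif tag == "script" and a.get("src"):
            kind, target = "Scripts", a["src"]
        elif tag == "link" and a.get("href"):
            rel = (a.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                kind, target = "Styles", a["href"]
            elif "icon" in rel:
                # Shortcut icons count as images.
                kind, target = "Images", a["href"]
        if kind:
            # Resolve relative references against the page URL.
            self.refs[kind].append(urljoin(self.base_url, target))


def collect_refs(html, base_url, enabled):
    """Returns only the reference kinds whose checkbox is enabled."""
    parser = RefCollector(base_url)
    parser.feed(html)
    return {kind: urls for kind, urls in parser.refs.items() if kind in enabled}
```

Calling collect_refs(page, url, {"Links", "Images"}) would correspond to checking Links and Images while leaving Scripts and Styles unchecked.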

Directory (text)

Specifies the name of a new or existing directory into which retrieved documents should be stored.

DirType (list: cache, dummy, files)

Determines the type of the parameter Directory. The following types are supported:

cache: the specified directory is a cache, which may already be in use by the same instance of w3browse,
files: retrieved documents are saved to files within the specified directory,
dummy: no documents are stored anywhere and the Directory parameter is ignored.

Note that not all fetched documents can be saved to files on the local filesystem (see below).

Incl. URLs (text)

Specifies a space-separated list of prefixes of URLs to be exclusively retrieved. If not specified, all URLs that match the prefix of the StartURL parameter up to and including the last slash (/) of the path component are retrieved.

Excl. URLs (text)

Specifies a space-separated list of prefixes of URLs to be excluded from being retrieved, superseding conflicting include prefixes.
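
How the include and exclude prefixes interact can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual w3browse code: the function names default_prefix and is_wanted are hypothetical, and plain string prefix comparison is assumed.

```python
from urllib.parse import urlsplit, urlunsplit


def default_prefix(start_url):
    """StartURL up to and including the last slash of its path component."""
    parts = urlsplit(start_url)
    path = parts.path[: parts.path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_wanted(url, start_url, include="", exclude=""):
    """Applies the Incl. URLs and Excl. URLs fields to a candidate URL."""
    # Without include prefixes, fall back to the prefix of the StartURL.
    includes = include.split() or [default_prefix(start_url)]
    excludes = exclude.split()
    # Exclude prefixes supersede conflicting include prefixes.
    if any(url.startswith(prefix) for prefix in excludes):
        return False
    return any(url.startswith(prefix) for prefix in includes)
```

With a StartURL of http://example.org/docs/index.html and both fields left empty, only URLs beginning with http://example.org/docs/ would pass this check.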

The following parameters are placed on a separate card. They specify certain HTTP header fields that are sent unmodified with every request; empty or disabled fields are simply ignored. The desired processing of certain dynamic headers can also be determined.

Types (checkbox+text): A list of content-types for the Accept: header field, e.g. "text/html, text/vnd.wap.wml, text/plain, */*".
Encodings (checkbox+text): A list of content-encodings for the Accept-Encoding: header field, e.g. "deflate, gzip, compress".
Charsets (checkbox+text): A list of charsets for the Accept-Charset: header field, e.g. "utf-8, iso-8859-1, us-ascii, *".
Languages (checkbox+text): A list of language codes for the Accept-Language: header field, e.g. "de, en, *".
UserAgent (checkbox+text&list): Value for the User-Agent: header field, e.g. "spider" or "Mozilla/4.0 (spider)".
Accept Cookies (checkbox): Enables the processing of cookies. All cookies are treated like session cookies; they are accepted only from the originating server and sent back only to that server.
Send Referrer (checkbox): Enables the transmission of Referer: headers to servers. This purely informational request header field contains the URL of the document from which a request originates. Authentication information that may be embedded within server-based URLs (the user:password@ part) is stripped before the URL is used as the referrer.
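
The following Python sketch shows one way the static header fields and the Referer: header could be assembled. It only illustrates the rules listed above; the names FIELD_TO_HEADER, strip_userinfo and build_headers, as well as the (enabled, text) representation of a field, are assumptions and not part of w3browse.

```python
from urllib.parse import urlsplit

# Dialog field name -> HTTP header field name, as listed above.
FIELD_TO_HEADER = {
    "Types": "Accept",
    "Encodings": "Accept-Encoding",
    "Charsets": "Accept-Charset",
    "Languages": "Accept-Language",
    "UserAgent": "User-Agent",
}


def strip_userinfo(url):
    """Removes an embedded user:password@ part from a URL."""
    parts = urlsplit(url)
    host = parts.netloc.rpartition("@")[2]
    return parts._replace(netloc=host).geturl()


def build_headers(settings, send_referrer=False, origin_url=None):
    """Builds the request headers from (enabled, text) pairs per field."""
    headers = {}
    for field, header in FIELD_TO_HEADER.items():
        enabled, text = settings.get(field, (False, ""))
        # Empty or disabled fields are ignored; others are sent unmodified.
        if enabled and text.strip():
            headers[header] = text
    if send_referrer and origin_url:
        headers["Referer"] = strip_userinfo(origin_url)
    return headers
```

A disabled Languages field or an enabled but empty UserAgent field simply produces no header in this sketch.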

Restrictions

When not using a cache, retrieved documents cannot always be stored as files on the local filesystem, because storing them this way loses the original requesting URL as well as all header information. Only responses with status code "200" (OK) are candidates for being saved. A document whose URL has a non-empty query part is not stored either.

The filename for an entity to be stored is created by appending the whole path component of the requesting URL to the specified directory name. A missing basename is replaced by the name "index" plus an extension that is derived from the content-type. The last modification date of a saved file is set to the value of the Last-Modified: header if it is present and valid.
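
The storage rules above can be summarized in a short Python sketch. It is an approximation under stated assumptions: the function names are hypothetical, and the EXTENSIONS table is an invented stand-in, since the actual mapping from content-type to extension is not documented here.

```python
import os
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# Hypothetical content-type -> extension table (assumption for illustration).
EXTENSIONS = {
    "text/html": ".html",
    "text/vnd.wap.wml": ".wml",
    "text/plain": ".txt",
}


def is_storable(url, status):
    """Only status 200 responses for URLs without a query part are saved."""
    return status == 200 and not urlsplit(url).query


def local_path(directory, url, content_type):
    """Appends the URL path to the directory, filling in a missing basename."""
    path = urlsplit(url).path or "/"
    if path.endswith("/"):
        media_type = content_type.split(";")[0].strip().lower()
        path += "index" + EXTENSIONS.get(media_type, "")
    return directory.rstrip("/") + path


def set_mtime(path, last_modified):
    """Sets the file's modification time from a Last-Modified: value.

    Returns False and leaves the file alone if the value is missing or invalid.
    """
    try:
        when = parsedate_to_datetime(last_modified).timestamp()
    except (TypeError, ValueError):
        return False
    os.utime(path, (when, when))
    return True
```

For example, a request for http://example.org/docs/ answered with content-type text/html would map to docs/index.html below the chosen directory in this sketch.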

Another problem occurs when a saved file is about to be returned in response to a request. Because of the lost header information, the content-type has to be determined from the filename alone. This generally fails for filename extensions that correspond to dynamically generated output from scripts or programs running on the server such as ".pl", ".php" or ".asp", but also for unusual extensions and for no extension at all. A solution would be to store a document only if there is a definite mapping between the filename and its associated content-type. This is currently not implemented.
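
The proposed solution, which is not implemented in w3browse, might look roughly like the following Python sketch. The KNOWN_TYPES table and the function names are assumptions made for illustration; a real table would have to be chosen with care.

```python
import posixpath

# Hypothetical table of extensions with a definite content-type (assumption).
KNOWN_TYPES = {
    ".html": "text/html",
    ".htm": "text/html",
    ".wml": "text/vnd.wap.wml",
    ".txt": "text/plain",
    ".css": "text/css",
    ".gif": "image/gif",
    ".png": "image/png",
}


def content_type_for(filename):
    """Guesses the content-type from the filename alone, or returns None."""
    extension = posixpath.splitext(filename)[1].lower()
    return KNOWN_TYPES.get(extension)


def safe_to_store(filename, served_type):
    """True only if the filename maps back to the content-type that was served."""
    media_type = served_type.split(";")[0].strip().lower()
    return content_type_for(filename) == media_type
```

Under this check, a page served as text/html from a URL ending in ".php", or from a name without any extension, would not be written to a file, because its content-type could not be recovered later.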