HTML is probably the most commonly used format for pages on the web. This markup language was originally designed for structural markup of content, but over time it has also been extended (and abused) by the introduction of several presentational elements. Nowadays, the separation of content and layout is becoming more and more important as the number of different applications that make use of HTML documents is constantly growing. This is supported by the increasing use of XHTML, a reformulation of HTML in XML, together with style sheets.
The body of a simple enough and well-structured HTML document is divided into a number of sections. Each section consists of a heading that is followed by a mixture of subsections, paragraphs and different kinds of lists, where the list items themselves are composed of paragraphs as well as sublists. These kinds of components of a document are called block-level elements.
The actual content of an HTML document, as it appears in paragraphs, headings, definition terms and other such places, consists primarily of plain text. In addition, so-called inline (or character-level) elements are used to add structural information, e.g. emphasis or hyperlinks, to fragments of text.
An HTML document also includes certain information about itself and/or its relation to other documents. This meta data consists of at least the title of the document, but a brief description and some keywords should also be given for the sake of search engines. References to style sheets and other related documents can also be specified.
The web page editor is able to read all known formats of HTML documents (see also [HTML]) and to extract the relevant information, stripping off any ballast at the same time. A generated HTML document always uses the XHTML 1.0 format, but without an explicit document type declaration.
The textual representation of a simple HTML document looks like this:
    #Title A Simple HTML Document
    #Lang en-us
    #Meta:description This is an example of a simple HTML document.
    #Meta:keywords example, HTML document, inline markup
    #Meta:date 2004-06-30
    #Link:stylesheet styles/default.css
    #Link:alternate:stylesheet styles/alternate.css Alternate Style

    = A Simple HTML Document =

    This HTML document is just an example and is used to demonstrate some features such as:

      1. Paragraphs, headings and [#inline||inline markup].
      2. A horizontal rule at the end of the document.
      3) Some kinds of lists.

    == [|inline|Inline Markup] ==

    Some examples of inline markup:

      * _emphasis_ and *strong emphasis*
      * 'sample text'
      * [internal://admin/editpage||Link to EditPage]

    ----------
All lines at the beginning of the textual representation of an HTML document up to the first empty line are special and are used to define the meta data of the document. Each line provides a certain piece of information and is split into a sequence of white-space-separated words. The case-insensitive first word or a prefix of it determines the meaning of the rest of the line. The following cases are distinguished:
#Title (whole word)
    The rest of the line is used as the content of the <title> element of the document.
#Lang (whole word)
    The second word is a language code such as en, en-us, de, de-de. It is used to define the lang and xml:lang attributes of the <html> root element.
#Meta: (prefix)
    A <meta> element is created whose name attribute is set to the rest of the first word and whose content attribute is created from the following words.
#Link: (prefix)
    A <link> element is created whose rel attribute is a space-separated list of values that is derived from the rest of the first word, which in turn is a colon-separated (:) list of values. The href attribute is set to the second word and the optional title attribute is created from the third and following words if present.

All lines that contain unknown or invalid meta data definitions are simply ignored.
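The rules above can be sketched in a few lines of Python. This is only an illustration and not the editor's actual code: the function name parse_meta_lines and the dictionary layout are hypothetical, and joining the remaining words with single spaces is an assumption about how the attribute values are built.

```python
def parse_meta_lines(lines):
    """Collect meta data from all lines up to the first empty line."""
    doc = {"title": None, "lang": None, "meta": [], "link": []}
    for line in lines:
        words = line.split()          # white-space-separated words
        if not words:
            break                     # the first empty line ends the meta data
        first, rest = words[0], words[1:]
        key = first.lower()           # the first word is case-insensitive
        if key == "#title":
            doc["title"] = " ".join(rest)
        elif key == "#lang" and rest:
            doc["lang"] = rest[0]
        elif key.startswith("#meta:") and len(first) > 6:
            # name = rest of the first word, content = following words
            doc["meta"].append({"name": first[6:], "content": " ".join(rest)})
        elif key.startswith("#link:") and rest:
            # colon-separated values become a space-separated rel list
            rel = [value for value in first[6:].split(":") if value]
            if rel:
                link = {"rel": " ".join(rel), "href": rest[0]}
                if len(rest) > 1:
                    link["title"] = " ".join(rest[1:])
                doc["link"].append(link)
        # every other line is an unknown or invalid definition: ignored
    return doc
```

With the header of the example document, the line "#Link:alternate:stylesheet styles/alternate.css Alternate Style" would yield a link entry with the rel value "alternate stylesheet", the href value "styles/alternate.css" and the title "Alternate Style".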
The body of an HTML document is built up from all remaining lines that follow the first empty line of the textual representation. Each line is first preprocessed and is then evaluated in the current context of a certain block-level element, starting with a hypothetical non-indented body element.
An empty line or an indented line usually starts a new block-level element, while an outdented line (or the end of the input) always terminates the current element and causes the enclosing element to become current again. That line is then processed once more with respect to the changed context, which may cause the line to become outdented again, and so on until the body element is reached at the latest. The collected (actual) text of a terminated element is finally examined in order to identify inline elements.
During preprocessing of a line, trailing white-space characters are removed and consecutive empty lines are collapsed into one. All leading tabulator characters of a line are finally replaced by eight space characters each.
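A minimal sketch of this preprocessing step, assuming that "leading" tabulators are only those at the very start of a line (tabs that follow spaces are left untouched here). The function name preprocess is hypothetical.

```python
def preprocess(lines):
    """Strip trailing white-space, collapse empty lines, expand leading tabs."""
    result = []
    for raw in lines:
        line = raw.rstrip()                    # remove trailing white-space
        if not line and result and not result[-1]:
            continue                           # collapse consecutive empty lines
        stripped = line.lstrip("\t")
        tabs = len(line) - len(stripped)       # number of leading tabulators
        result.append(" " * (8 * tabs) + stripped)
    return result
```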
The indentation of a line is determined by the total number of space characters at the beginning of a line. The precise and absolute value of the indentation usually does not matter, only its relative value in relation to the indentation of an enclosing element such as a list item is relevant (less than = outdented, equal to = non-indented, greater than = indented). If not otherwise stated in the following sections, a line is generally meant to be already adjusted according to the current context, thus normally having a relative indentation of zero.
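The comparison described above can be sketched as follows. The names indentation and classify are hypothetical, and context_indent stands for the indentation of the enclosing element.

```python
def indentation(line):
    """Total number of space characters at the beginning of a line."""
    return len(line) - len(line.lstrip(" "))


def classify(line, context_indent):
    """Relate the indentation of a line to that of the enclosing element."""
    diff = indentation(line) - context_indent
    if diff < 0:
        return "outdented"
    if diff == 0:
        return "non-indented"
    return "indented"
```

Only the sign of the difference is used, which reflects the statement that the absolute value of the indentation usually does not matter.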
Empty lines are usually only required in order to separate adjacent text blocks into paragraphs. Text blocks that make up the sole content of (definition) list items are already distinguished by this fact, but it is possible to insert additional empty lines in order to add extra spacing between such items. As a rule of thumb, a text block is regarded as a paragraph and marked up accordingly if it is additionally surrounded by empty lines before and after. This is always assumed to be the case at the top-level and hence in the context of the body element.
A paragraph is just a sequence of non-empty lines that cannot be interpreted in any other way, e.g. as terms of a definition list item. Adjacent paragraphs are separated by empty lines. The given division of a paragraph into lines is preserved between editions of the document.
A block of preformatted text starts with an indented line that cannot be interpreted in any other way, e.g. as the start of a list item. All following lines that have an equal or greater indentation than the first line are part of the same block. Intervening empty lines are also included. Every line of the block is finally examined separately for inline elements.
A heading consists of just one line that encloses the actual heading title first by one space character on each side and then by the same sequence of one or more equal signs (=) on both sides. The number of equal signs indicates the importance level of a heading, with level 1 being the most important.
    = Level 1 Heading =
    == Level 2 Heading ==
    ...
    ====== Level 6 Heading ======
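The heading syntax lends itself to a regular expression. The following sketch assumes that the line has already been preprocessed and adjusted to the current context; limiting the level to 6 is an assumption based on the example above, and the name parse_heading is hypothetical.

```python
import re

# Same run of equal signs on both sides, one space next to the title.
HEADING = re.compile(r"^(={1,6}) (.+) \1$")


def parse_heading(line):
    """Return (level, title) for a heading line, otherwise None."""
    match = HEADING.match(line)
    if match is None:
        return None
    return len(match.group(1)), match.group(2)
```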
A horizontal rule consists of just one line that is a sequence of four or more identical characters that are taken from the set "-=*+~_#^".
    ********
    ======
    ----
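A horizontal rule can be recognized with a back-reference that forces all characters of the line to be identical. This is a sketch only, and the name is_horizontal_rule is hypothetical.

```python
import re

# One character from the set, repeated at least three more times.
RULE = re.compile(r"^([-=*+~_#^])\1{3,}$")


def is_horizontal_rule(line):
    """True if the line consists of four or more identical rule characters."""
    return RULE.match(line) is not None
```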
A blockquote section is very similar to an unordered list item. The only two differences are that the bullet consists of a ">" character and that consecutive blockquote sections are not merged in any way.
    A blockquote section follows:

      > This is a rather short block quotation.

    ----------
An item of an ordered or unordered list is introduced by an indented line that starts with a certain so-called bullet and that is further followed by at least one space character. The indentation of that line that results from replacing the bullet by the same number of spaces determines the indentation of a newly created block-level element that is used for the content of the list item. An unindented version of the line is finally processed again in the new context.
The rest of the list item is completed by the usual processing which allows any block-level element to be used in the new context and which also causes the list item to be terminated by an outdented line. That line usually either introduces the next list item or terminates the list. Consecutive list items of the same type are finally merged into a list of that type. The type of a list item is derived from its bullet.
The bullet of an unordered list item consists of just one of the characters "*", "-" or "+".
    An unordered list follows:

      * first item
      * second item
      * third item

    ----------
There are three types of ordered list items available:

  * digits (0, 1, ..., 9)
  * lower-case letters (a, b, ..., z)
  * upper-case letters (A, B, ..., Z)

The bullet of an ordered list item consists of a non-empty sequence of only those characters that belong to the same type, and to which either a "." or ")" character is finally appended. The actual value of the character sequence currently does not matter.
    An ordered list follows:

      1. one
      2) two
      3. three

    Another ordered list follows:

      a) apple
      b. banana
      c) cherry

    ----------
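The bullet rules for ordered and unordered list items can be summarized in one pattern. The sketch below is an illustration under assumptions: the type names ("decimal", "lower-alpha", "upper-alpha") and the function name parse_list_item are hypothetical, and the check that the line is actually indented relative to its context is left to the caller.

```python
import re

BULLET = re.compile(
    r"^( *)"                               # indentation of the line
    r"([*+-]|(?:[0-9]+|[a-z]+|[A-Z]+)[.)])"  # bullet: one type plus . or )
    r"( +)(.*)$"                           # at least one space, then the text
)


def parse_list_item(line):
    """Return (type, content_indent, text) for a list item line, else None."""
    match = BULLET.match(line)
    if match is None:
        return None
    indent, bullet, gap, text = match.groups()
    if len(bullet) == 1:
        kind = "unordered"
    elif bullet[0].isdigit():
        kind = "decimal"
    elif bullet[0].islower():
        kind = "lower-alpha"
    else:
        kind = "upper-alpha"
    # Replacing the bullet by spaces gives the indentation of the content.
    content_indent = len(indent) + len(bullet) + len(gap)
    return kind, content_indent, text
```

Because the bullet must be followed by a space, a line such as "*bold* text" is not mistaken for a list item.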
An item of a definition list consists of one or more definition terms and the actual definition. A definition term is a sequence of non-empty and non-indented lines with the last line being suffixed by two colons (::). The actual definition starts with an indented line that cannot be interpreted as an item of an ordered or unordered list and is processed the same way as the content of such a list item. Consecutive definition list items are finally merged into a definition list.
    A definition list follows:

    First term::
        Definition of the first term.

    Second term::
    Third term::
        Definition of the second and the third term.

    ----------
Multiple definition terms for the same definition cannot be separated by empty lines. When a definition term is spread over multiple lines, all of them but the last cannot end in any of the characters ".", "!", "?" or ":", which are common terminators of sentences. This rule makes it possible to start a definition term immediately after a text block without an intervening empty line, because otherwise the preceding text would become part of the definition term, which is usually not intended in such cases.
The content of a definition term may be empty, in which case the definition term consists only of the terminating two colons (::) and causes the definition term to be effectively omitted. So, if all definition terms for a definition are cancelled that way, the definition list item represents just an indented block and is comparable in its effect to a blockquote section.
    ::
        ::
            This is a double indented paragraph.

    ----------
A so-called compact definition list item consists of a definition term that is followed by the first line of the actual definition on the same line. Both parts must be separated by at least three spaces. The definition term cannot be terminated by the usual two colons (::), but they must be used nevertheless for an empty term. Such a definition term may be regarded as a kind of user-defined bullet of a list item that is a definition in this case. In fact, further processing really follows this model and also causes the procedure to be applied recursively if needed. As a result, it is even possible to emulate simple tables by using compact definition lists. Note also the two empty cells 2.2 and 3.3 in the following example.
    An indented emulated simple table follows:

    ::
        1.1   1.2   1.3   1.4
        2.1   ::    2.3   2.4
        3.1   3.2   ::    3.4

    -------------------------
Another way to introduce an item of a compact definition list is to place the definition term on one line that is immediately followed by the actual definition without an intervening empty line. Such a definition term cannot be terminated by the usual two colons (::).
    A compact definition list follows:

    First term
        Definition of the first term.
    Second term
        Definition of the second term.
    Third term
        Definition of the third term.

    ----------
Finally, both variants of compact definition terms may be preceded by additional definition terms, each of them suffixed by two colons (::) in the usual way.
Normal and compact definition list items can be mixed arbitrarily and consecutive items are merged as usual into a definition list, but only the first item of a definition list determines whether the list as a whole gets the compact attribute or not.
It should be noted that the recognition of compact definition list items (the three-spaces pattern) occurs for all non-indented lines in the current context. Headings and preformatted text are not affected, but in other cases, three or more spaces in a row within the actual text of a line should be avoided so that this rule is not applied unintentionally.
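The three-spaces pattern and its recursive application can be sketched as follows. Both function names are hypothetical, and the sketch assumes a single preprocessed line whose indentation has already been removed; it only illustrates how a table row falls apart into cells and is not the editor's actual procedure.

```python
import re

# A term, at least three spaces, and the first line of the definition.
COMPACT = re.compile(r"^(\S(?:.*?\S)?) {3,}(\S.*)$")


def split_compact(line):
    """Return (term, definition) for a compact item line, otherwise None."""
    match = COMPACT.match(line)
    if match is None:
        return None
    term, definition = match.groups()
    # Two colons alone stand for an empty definition term.
    return ("" if term == "::" else term), definition


def split_cells(line):
    """Apply split_compact repeatedly, as for a row of an emulated table."""
    cells = []
    rest = line
    while True:
        parts = split_compact(rest)
        if parts is None:
            cells.append(rest)
            return cells
        cells.append(parts[0])
        rest = parts[1]
```

Applied to a row such as "2.1   ::    2.3   2.4", the second cell comes out empty, which corresponds to the empty cell 2.2 mentioned above.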
The following character sequences are treated specially when they occur in the actual text of an element and are given here together with their meaning:
_Emphasized text_
    Emphasis (<em>, on input also <i>, <var>, <cite>, <u>).
*Strongly emphasized text*
    Strong emphasis (<strong>, on input also <b>, <dfn>, <th>).
'Sample text'
    Sample text (<code>, on input also <tt>, <samp>, <kbd>). This sequence is not recognized within preformatted text because it is not needed there.
[URL|Fragment|Link text]
    Anchor and/or hyperlink (<a name="Fragment" href="URL">). This sequence is not recognized within link texts. The values of URL and Fragment cannot both be empty at the same time and are also not allowed to contain white-space characters or any character from the set {}[|]<>". Note that a hyperlink that refers to an anchor within the same document is specified (in the usual way) with a URL value such as #Fragment. For further information see also section "URL Syntax".
<localpart@domain>
    E-mail link (<a href="mailto:localpart@domain">). This sequence is not recognized within link texts and preformatted text.

The given definitions can more or less obviously lead to misinterpretations in certain situations. In order to resolve many ambiguities, some additional rules are applied during the processing of text fragments in deciding whether a given character sequence represents valid markup or not. These rules can be described as follows, where the start or end of a text fragment also counts as white-space:
[...] [http://www.aksware.de/||].
[...] '`!"$%&@\/(){}[|]<>=?*+-~#,.;:_^. This rule is used to ensure that embedded special characters are left alone in certain contexts, e.g. in contractions (I'm right, aren't I?), in calculations (1*2+3*4) or in variable names (HTTP_USER_AGENT).
[...] ([]), which means that every opening "[" must be closed by a "]" accordingly, e.g. in [refs.html#html||[HTML]].
[...] (-) and dots (.). The localpart may also contain underscores (_), but the domain must contain at least one dot, e.g. <no_reply@aksware.de>.

The third rule used to be just an exclusion of letters and digits, but in the presence of Unicode, doing that is quite a difficult task. So the sense has been reversed and a set of allowed characters is enumerated instead. This set includes all white-space and printable characters of the US-ASCII character set except letters and digits.
The processing of a text fragment consists of several steps and is performed in a left to right manner:
[...] "|" is searched for in the text fragment.

Not finding a valid start character or prefix simply terminates the processing of a text fragment, while not finding a valid end character causes the start character to be skipped and further processing continues right after it. Note that a text fragment can contain line breaks, which also count as white-space.
All characters that immediately follow the start character and that are equal to it are made part of the final enclosed text. And in a similar way, all characters that immediately precede the end character and that are equal to it are made part of the final enclosed text. Note that these prefixes and suffixes are not processed again in step c2. Both cases do not apply to links. Example:
***A strongly emphasized text that includes two stars on both sides.***
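The effect of this rule on the example can be illustrated with a small sketch. It assumes that the start and end characters of a fragment have already been identified as valid by the preceding steps, so that only the outermost character on each side acts as markup; the function name render_delimited and the direct mapping to output tags are hypothetical simplifications.

```python
import html

# Output elements for the three delimiter characters described earlier.
TAGS = {"_": "em", "*": "strong", "'": "code"}


def render_delimited(fragment):
    """Render a fragment that starts and ends with the same delimiter."""
    if len(fragment) < 3:
        return None
    marker = fragment[0]
    tag = TAGS.get(marker)
    if tag is None or fragment[-1] != marker:
        return None
    # Repeated delimiters next to the outer ones stay in the enclosed text.
    inner = fragment[1:-1]
    return "<{0}>{1}</{0}>".format(tag, html.escape(inner, quote=False))
```

For the example above, the result is a single strong element whose text still carries two stars on both sides.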
The following restrictions in the processing of HTML documents should be taken into account: