The Hypertext Markup Language (HTML) and the Extensible Markup Language (XML) both derive from the Standard Generalized Markup Language (SGML): HTML is defined as an SGML application, and XML as a simplified subset of SGML. SGML was standardized in the mid-1980s (ISO 8879:1986) as a text markup notation for exchanging hierarchically organized electronic documents. SGML consists of two parts, namely the document markup in the form of tags, and a meta-description of a document class called a Document Type Definition (DTD).
DTDs are typically designed for special purposes. Each DTD defines the tag names, tag structures, and hierarchical organization of the documents that conform to it.
HTML, as an application of SGML, consists of a DTD that defines the exact version of HTML being used, and a set of conventions followed by web browsers for rendering the markup on a computer display. Most of HTML concerns how content should be presented, for example which fonts are used and in what size, colors, spacing, line breaks, and so on. The main clients of HTML are real people viewing the pages marked up in this manner.
In contrast, XML is a subset of SGML designed for the exchange of content or application-specific data over the web. The idea is that if two or more people can agree on a common DTD (that is, the markup and structure of a document), they can exchange documents and other information. Put simply, XML can be regarded as a variant of HTML in which you may define your own markup. The main clients of XML are programs that process the content of web pages; although XML can be viewed in a browser, it has nothing to do with presentation.
XML documents are typically grouped according to the DTD that is used. For example, XML documents using the Channel Definition Format (CDF) DTD are used in push media, and XML documents using the Chemical Markup Language (CML) DTD are used to exchange molecular structures.
At the simplest level, pages consist of sequences of characters (called parsed character data or PCDATA) and markup symbols called tags. Tags consist of characters enclosed between less-than (<) and greater-than (>) symbols. The tag contents specify a tag name and, optionally, any number of attributes. Tag names are predefined in HTML (for example, H1, H2, P, FONT, TITLE, etc.) whereas XML tag names are defined according to a DTD. Attributes consist of (name, value) pairs.
The general tag style is as follows:
<name A="abc" B="123">
Here the tag name "name" is followed by the attributes A and B, having the values abc and 123 respectively. Values are always quoted by single or double quotes in XML; HTML values have a more flexible syntax allowing certain values to be unquoted. HTML also allows attributes to have no value, for example:
A hierarchical structure is imposed on a page by collecting tags and parsed character data into elements. We distinguish between comment elements, (non-empty) elements, empty elements, processing instruction elements, and SGML directive elements.
Note that in this manual we diverge from SGML terminology so as to explain the unified view WebL presents to the programmer by mapping names and attributes of elements to piece objects (introduced later) and object fields. The name of a piece is derived from the start tag of the element.
Comment elements specify text to be ignored during document parsing. Comments consist of a single tag in the following style:
<!-- this is the comment text -->
The name of a comment element is "!--". In WebL, the comment element has a field called comment, whose value is the text occurring between the "--" tokens.
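The idea of exposing the comment text as a field can be sketched outside WebL. The following is a minimal illustration using Python's standard html.parser module, not WebL itself; the class and function names (CommentCollector, extract_comments) are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the text of every comment element in a page."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->", whitespace included.
        self.comments.append(data)


def extract_comments(markup):
    """Return the comment texts found in markup, in document order."""
    collector = CommentCollector()
    collector.feed(markup)
    collector.close()
    return collector.comments
```

Note that the surrounding spaces inside the comment are preserved, which matches the goal of keeping a faithful representation of the page.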
Non-empty elements consist of a start tag, any number of nested elements or PCDATA, and a matching end tag. Everything between the start tag and end tag is said to be inside or contained in the element. The general format is as follows:
<tagname A="abc" B="123"> ... </tagname>
The names of the start tag and end tag must match. Note how the end tag starts with a forward slash character. Only the start tag may have attributes. The name and attributes of non-empty elements are those of the start tag.
In the case of attributes with no value, the attribute of the element is set to the empty string.
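As a rough illustration of how a start tag's name and attributes can be mapped onto fields, the sketch below uses Python's standard html.parser module rather than WebL. The names StartTagReader and read_start_tags are hypothetical. Python reports a valueless attribute as None, so the sketch converts it to the empty string to mirror the convention described above.

```python
from html.parser import HTMLParser


class StartTagReader(HTMLParser):
    """Records each start tag as (name, {attribute: value})."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # html.parser yields None for attributes without a value;
        # map that to "" to follow the convention in the text.
        fields = {name: ("" if value is None else value) for name, value in attrs}
        self.elements.append((tag, fields))


def read_start_tags(markup):
    """Return the start tags of markup with their attribute dictionaries."""
    reader = StartTagReader()
    reader.feed(markup)
    reader.close()
    return reader.elements
```

Because the input is treated as HTML, this parser also accepts the unquoted value in B=123 and lowercases the attribute names.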
Empty elements do not have any content, and thus do not require an end tag. They have the format:
<tagname A="abc" B="123"/>
Note the forward slash that ends the tag. The element name and attributes are those of the tag.
Empty elements appear only in XML documents. HTML has something similar to an empty element, but it cannot be distinguished syntactically from a start tag. For example, the HTML markup <br> does not have a corresponding end tag, and thus is equivalent to an empty element. WebL knows about these anomalies by virtue of the HTML DTD.
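The difference between the two kinds of empty element can be sketched as follows, again with Python's standard html.parser module rather than WebL. The set HTML_EMPTY is an assumption: a small hand-picked subset of the elements that the HTML 4 DTD declares EMPTY, standing in for real DTD knowledge. The names EmptyElementFinder and find_empty are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked subset of elements declared EMPTY in the HTML DTD.
HTML_EMPTY = {"br", "hr", "img", "input", "meta", "link"}


class EmptyElementFinder(HTMLParser):
    """Reports empty elements and how each one was recognized."""

    def __init__(self):
        super().__init__()
        self.empty = []

    def handle_startendtag(self, tag, attrs):
        # XML-style empty element: the tag itself ends with "/>".
        self.empty.append((tag, "xml-style"))

    def handle_starttag(self, tag, attrs):
        # Looks like an ordinary start tag; only the DTD tells us it is empty.
        if tag in HTML_EMPTY:
            self.empty.append((tag, "html-dtd"))


def find_empty(markup):
    """Return (tag, how_recognized) for each empty element in markup."""
    finder = EmptyElementFinder()
    finder.feed(markup)
    finder.close()
    return finder.empty
```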
Processing instruction elements give instructions to the page parser to perform special handling of its contents. They are used only in XML, and consist of a single tag:
<?tagname ... ?>
Here the ellipses take the place of the processing instructions. As with comment elements, "?tagname" is defined as the name of the element. The processing instructions (between the question marks) are mapped onto the content field of the element.
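For comparison, the split of a processing instruction into a name and its content can be observed with Python's standard xml.dom.minidom module. This is only an analogy to the mapping described above, not WebL code; the function name processing_instructions is hypothetical.

```python
from xml.dom import minidom


def processing_instructions(xml_text):
    """Return (target, data) for every processing instruction in the document."""
    found = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType == child.PROCESSING_INSTRUCTION_NODE:
                # target is the name after "<?"; data is the remaining text.
                found.append((child.target, child.data))
            walk(child)

    walk(minidom.parseString(xml_text))
    return found
```

Here target plays the role of the element name (without the leading question mark) and data plays the role of the content field.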
SGML directives provide information about the DTD of the page being parsed. They have the form:
<!tagname ... >
Here the ellipses take the place of the directive. As before, "!tagname" is defined as the name of the element, and the element has an attribute called content that stores the directive. The most commonly occurring SGML directive is !DOCTYPE, which specifies the name of the DTD to be used to parse the remainder of the document.
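A minimal sketch of the same name/content split for SGML directives, using Python's standard html.parser module instead of WebL. The names DirectiveReader and read_directives are hypothetical, and splitting at the first space is a simplifying assumption.

```python
from html.parser import HTMLParser


class DirectiveReader(HTMLParser):
    """Records each SGML directive as ("!name", content)."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. 'DOCTYPE HTML PUBLIC ...'.
        name, _, content = decl.partition(" ")
        self.directives.append(("!" + name, content))


def read_directives(markup):
    """Return the SGML directives found in markup."""
    reader = DirectiveReader()
    reader.feed(markup)
    reader.close()
    return reader.directives
```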
Parsing of HTML is made complicated by an SGML feature called optional tags (a feature that has explicitly been left out of XML). The idea is that the DTD often gives enough contextual information to infer that a start or end tag must be present at a certain position in the document. For several HTML elements, either start or end tags are declared to be optional, and should be inserted automatically by the parser. For example, the paragraph <P> element is in fact a non-empty element that has a corresponding </P> end tag. However, most HTML documents do not contain these optional end tags, in which case the HTML parser has to infer where paragraphs end. WebL knows the HTML DTD, and can thus insert optional tags when needed. In general, WebL attempts to make a faithful internal representation of the documents it parses (including spaces and new lines), except for the fact that it inserts optional tags when appropriate. Conversion from the internal format to external format might thus result in slightly different (but equivalent) pages.
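To make the inference of optional end tags concrete, here is a deliberately small sketch in Python (standard html.parser module) that inserts missing </p> end tags. It is not WebL's algorithm. The set BLOCK_TAGS is an assumption standing in for DTD knowledge about which elements cannot appear inside a paragraph, and the sketch drops comments and directives for brevity. The names ParagraphCloser and insert_paragraph_end_tags are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: a subset of block-level elements that implicitly end a paragraph.
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "h1", "h2", "h3"}


class ParagraphCloser(HTMLParser):
    """Re-emits markup, adding the </p> end tags the author left out."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_open = False

    def close_paragraph(self):
        if self.p_open:
            self.out.append("</p>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.close_paragraph()  # a new block ends any open paragraph
        if tag == "p":
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False
        elif tag in BLOCK_TAGS:
            self.close_paragraph()  # the enclosing block ends the paragraph
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def insert_paragraph_end_tags(markup):
    closer = ParagraphCloser()
    closer.feed(markup)
    closer.close()
    closer.close_paragraph()  # end of input also ends an open paragraph
    return "".join(closer.out)
```

As the text notes for WebL, the output of such a step is slightly different from the input but equivalent to it.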
Inserted in the PCDATA stream we often find character entities of the form &...; (where ... stands for a number or an alphanumeric name) denoting special symbols. For example, &lt; and &gt; denote the less-than and greater-than symbols. This encoding is used both to embed special symbols that might be confused with markup, and to provide a human-readable way to represent all Unicode characters. WebL does not perform any translation of character entities when fetching a page, but does provide a built-in called ExpandCharEntities to process them afterwards.
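The effect of expanding character entities can be illustrated with Python's standard html.unescape function. This is an analogy only and says nothing about how ExpandCharEntities is implemented; the wrapper name expand_char_entities is hypothetical.

```python
import html


def expand_char_entities(text):
    """Replace named and numeric character entities with the characters they denote."""
    return html.unescape(text)
```

For instance, expand_char_entities("a &lt; b") yields the string "a < b", and numeric forms such as &#169; and &#xA9; both expand to the copyright sign.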
XML is case-sensitive and HTML is case-insensitive. In the case of XML, WebL keeps all element and attribute names in their original case. In the case of HTML, WebL converts all element names and attribute names to lowercase.
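The same distinction shows up in Python's standard library, which makes for a convenient illustration (it is not WebL code): html.parser lowercases HTML element and attribute names, while xml.etree.ElementTree keeps XML names exactly as written. The helper names html_names and xml_names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Records (element name, [attribute names]) for each start tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append((tag, [name for name, _ in attrs]))


def html_names(markup):
    """Names as seen by an HTML parser: folded to lowercase."""
    collector = NameCollector()
    collector.feed(markup)
    collector.close()
    return collector.names


def xml_names(markup):
    """Names as seen by an XML parser: original case preserved."""
    root = ET.fromstring(markup)
    return [(el.tag, list(el.attrib)) for el in root.iter()]
```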
Many elements have attributes that specify URLs of other documents on the Web. Most of these URLs are specified relative to the document itself. WebL simplifies handling of these URL attributes by resolving them to an absolute URL when the document is fetched. To determine which attributes refer to URLs, WebL uses slightly modified HTML DTDs internally that explicitly denote which attributes of elements contain URLs. No URL resolution is performed for XML documents.
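Resolving relative URL attributes against the document's own URL can be sketched with Python's standard urllib.parse.urljoin. The table URL_ATTRS is an assumption that stands in for the modified DTDs mentioned above and lists only two well-known cases; the example URLs and the names UrlResolver and resolve_url_attributes are likewise hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a tiny stand-in for DTD knowledge about URL-valued attributes.
URL_ATTRS = {("a", "href"), ("img", "src")}


class UrlResolver(HTMLParser):
    """Collects URL attributes, resolved against the page's base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resolved = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in URL_ATTRS and value is not None:
                self.resolved.append((tag, name, urljoin(self.base_url, value)))


def resolve_url_attributes(base_url, markup):
    """Return (tag, attribute, absolute_url) for each known URL attribute."""
    resolver = UrlResolver(base_url)
    resolver.feed(markup)
    resolver.close()
    return resolver.resolved
```

Attributes that are not listed in URL_ATTRS are left alone, which mirrors the reason WebL needs explicit DTD annotations to know where URLs occur.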
A surprisingly large number of pages on the web contain errors. Some of the typical errors encountered include:
Illegal nesting of elements forbidden by the DTD,
Non-hierarchical markup where elements overlap instead of nest,
The DTD specified by the DOCTYPE SGML directive does not match the markup the document contains,
Using tags of unknown name, etc.
WebL tries to take all these problems into account. In SGML terms, WebL is a non-validating processor. WebL only uses DTDs to correct simple mistakes and to add optional tags where needed. WebL also corrects overlapping tags in HTML to ensure that the result is a hierarchically structured document. In general, we try to make as few changes as possible to the page, guided by the DTD. It is thus important to realize that when bad HTML or XML is parsed, the internal representation might not be what you expect from viewing the page source. To give users an idea of what WebL sees, WebL includes a pretty-printing function called Pretty that displays a nicely formatted representation of the parsed web page. We recommend using this tool, as it often reveals the badly formatted markup found on the web.
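One common way to repair overlapping markup is to keep a stack of open elements and, when an end tag does not match the innermost open element, close the intervening elements and reopen them afterwards. The Python sketch below illustrates that general technique with the standard html.parser module. It is not a description of WebL's actual algorithm, and it makes simplifying assumptions: stray end tags are dropped, comments and directives are not re-emitted, and HTML empty elements written without a trailing slash (such as <br>) are not handled. The names OverlapFixer and fix_overlaps are hypothetical.

```python
from html.parser import HTMLParser


class OverlapFixer(HTMLParser):
    """Re-emits markup so that elements nest instead of overlap."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []  # open elements as (name, original start tag text)

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        self.stack.append((tag, text))
        self.out.append(text)

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # <tag/> needs no bookkeeping

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.stack]:
            return  # stray end tag: drop it (simplifying assumption)
        reopened = []
        while self.stack[-1][0] != tag:
            inner = self.stack.pop()  # close elements opened after `tag`
            self.out.append("</%s>" % inner[0])
            reopened.append(inner)
        self.stack.pop()
        self.out.append("</%s>" % tag)
        for inner in reversed(reopened):  # reopen them in original order
            self.stack.append(inner)
            self.out.append(inner[1])

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def fix_overlaps(markup):
    fixer = OverlapFixer()
    fixer.feed(markup)
    fixer.close()
    return "".join(fixer.out)
```

Given the overlapping input <b>bold <i>both</b> italic</i>, the sketch produces <b>bold <i>both</i></b><i> italic</i>, which renders the same way but is properly nested. Well-formed input passes through unchanged.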