After a page is retrieved from the Web and parsed according to its MIME type, the page and its content are accessible for further computation in WebL. The computations that can be performed on a page are determined by the WebL markup algebra.
The markup algebra is based on three concepts: tags, pieces and piece sets. In simple terms, a tag corresponds to a markup tag, a piece identifies a contiguous sub-region of a page, and a piece set is a collection of pieces.
The first step in parsing a Web page is to identify all the markup tags in the page (enclosed between `<' and `>' characters). Each of these is converted into a tag object (a WebL value of type tag). Conceptually, the page then consists of a list of tag objects and text segments (or character data). We can use a simple train-like pictorial representation of a page to illustrate the conversion (see Converting Markup into Tag and PCData Sequences). In the figure, each box represents either a tag or a text segment.
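The conversion can be approximated outside WebL with a short Python sketch. It is not WebL's parser: the names `Tag`, `Text` and `to_sequence` are hypothetical, and the regular expression assumes that every tag is a run of characters between `<` and the next `>`, which ignores cases such as `>` inside comments or quoted attribute values.

```python
import re
from dataclasses import dataclass


@dataclass
class Tag:
    markup: str  # raw markup, including the angle brackets


@dataclass
class Text:
    data: str  # character data found between two tags


def to_sequence(page: str):
    """Split a page into an ordered list of Tag and Text items."""
    items = []
    # The capturing group makes re.split keep the matched tags in its result.
    for chunk in re.split(r"(<[^>]*>)", page):
        if not chunk:
            continue  # re.split yields empty strings between adjacent tags
        if chunk.startswith("<") and chunk.endswith(">"):
            items.append(Tag(chunk))
        else:
            items.append(Text(chunk))
    return items


sequence = to_sequence("<ul><li>one</li><li>two</li></ul>")
for item in sequence:
    print(item)
```

Each printed item corresponds to one box of the train-like picture: six `Tag` items and two `Text` items for this input.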
The WebL model also supports unnamed tags, the purpose of which will become clearer in the next chapter. The equivalent HTML or XML for an unnamed tag is `<>' (which of course does not occur in practice). WebL uses unnamed tags as place markers in a page. As their name suggests, unnamed tags do not have a name or attributes.
A piece is a WebL value type that denotes a region of a page. Each piece refers to two tags: the begin tag, which marks the start of the region, and the end tag, which marks the end of the region. The region includes both the begin and end tags and everything between them in the page. Note that the begin and end tags can be one and the same tag. Another important fact is that pieces never point to text segments.
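A minimal Python model of this definition, under stated assumptions: the page is a flat list of strings in which tags are the items wrapped in angle brackets, and a piece stores list indices as a stand-in for references to tag objects. The names `Piece`, `make_piece`, `region` and `is_tag` are hypothetical and are not part of WebL.

```python
from dataclasses import dataclass


def is_tag(item: str) -> bool:
    """Treat any item wrapped in angle brackets as a tag (assumption)."""
    return item.startswith("<") and item.endswith(">")


@dataclass(frozen=True)
class Piece:
    name: str
    begin: int  # index of the begin tag in the page sequence
    end: int    # index of the end tag; may equal begin


def make_piece(page, name, begin, end):
    """Create a piece, rejecting endpoints that fall on text segments."""
    if begin > end:
        raise ValueError("the begin tag must not come after the end tag")
    if not (is_tag(page[begin]) and is_tag(page[end])):
        raise ValueError("a piece must begin and end on a tag, never on text")
    return Piece(name, begin, end)


def region(page, piece):
    """Return the begin tag, the end tag, and everything between them."""
    return page[piece.begin:piece.end + 1]


page = ["<!-- note -->", "<ul>", "<li>", "one", "</li>", "</ul>"]
item = make_piece(page, "li", 2, 4)
comment = make_piece(page, "!--", 0, 0)  # begin and end are the same tag

print(region(page, item))     # ['<li>', 'one', '</li>']
print(region(page, comment))  # ['<!-- note -->']
```

Calling `make_piece(page, "bad", 3, 4)` raises `ValueError`, because index 3 is a text segment.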
The most common types of pieces are those that correspond to elements in a page. We extend the box diagram notation to include triangles to denote pieces, and lines from pieces to tags to denote begin and end tags (see Piece Notation). To allow the programmer to access element attributes, the attributes of an element's begin tag are copied into field variables of each piece. Thus, a piece is very similar to a value of the object type: it looks and behaves in many ways like an object. Furthermore, we associate the appropriate name with each piece (in this case the names are the strings "!--", "ul", "li", and "li" written above the triangles).
Note how the begin and end tags of the comment piece refer to the same tag object. In accordance with our previous definition, a piece that refers to unnamed begin and end tags is called an unnamed piece (which correspondingly has the empty string as its name).
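The idea that the attributes of a begin tag become fields of the piece can be illustrated in Python. This is a sketch only: `begin_tag_fields` is a hypothetical helper, it relies on the standard library's `html.parser` rather than on WebL's own parser, and it returns the element name separately from the fields so that an attribute called `name` cannot clash with it.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class _BeginTagReader(HTMLParser):
    """Record the name and attributes of the first begin tag seen."""

    def __init__(self):
        super().__init__()
        self.result = None

    def handle_starttag(self, tag, attrs):
        if self.result is None:
            self.result = (tag, dict(attrs))


def begin_tag_fields(markup: str):
    """Return (element name, object whose fields are the tag's attributes)."""
    reader = _BeginTagReader()
    reader.feed(markup)
    reader.close()
    if reader.result is None:
        raise ValueError("no begin tag found in markup")
    name, attrs = reader.result
    return name, SimpleNamespace(**attrs)


name, fields = begin_tag_fields('<a href="index.html" title="Home">')
print(name)          # a
print(fields.href)   # index.html
print(fields.title)  # Home
```

Note that `html.parser` lowercases element and attribute names; whether WebL does the same is not stated in this section.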
As its name indicates, a piece set is a collection of pieces belonging to the same page. It is a set in the sense that a piece can occur at most once in a given piece set (although the same piece can be a member of several piece sets). A piece set is also a list, because the pieces in it are ordered. The ordering is based on the positions of each piece's begin and end tags in the page: we order pieces according to the left-to-right order of their begin tags.
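The combination of set and list behaviour can be sketched in Python as follows. The class names are hypothetical, tag positions are modelled as integer indices into the page, and two pieces are treated as the same piece when their name and positions are equal. Pieces whose begin tags coincide are left in insertion order here; that tie-breaking rule is an assumption of the sketch and is not specified in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    name: str
    begin: int  # position of the begin tag in the page
    end: int    # position of the end tag in the page


class PieceSet:
    """An ordered collection in which each piece occurs at most once."""

    def __init__(self, pieces=()):
        self._pieces = []
        for piece in pieces:
            self.add(piece)

    def add(self, piece):
        if piece in self._pieces:
            return  # set behaviour: ignore duplicates
        self._pieces.append(piece)
        # List behaviour: keep pieces in left-to-right order of begin tags.
        # list.sort is stable, so equal begin positions keep insertion order.
        self._pieces.sort(key=lambda p: p.begin)

    def __iter__(self):
        return iter(self._pieces)

    def __len__(self):
        return len(self._pieces)


pieces = PieceSet([
    Piece("li", 5, 7),
    Piece("ul", 1, 8),
    Piece("li", 2, 4),
    Piece("li", 2, 4),  # duplicate, ignored
])
print([(p.name, p.begin) for p in pieces])  # [('ul', 1), ('li', 2), ('li', 5)]
```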
Piece sets play a very important part in WebL. They form the basis of many of the operations that extract data from web pages.