There are several ways in which piece sets can be created:
Searching for markup elements by explicitly naming interesting elements (for example all the `a' or `li' elements).
Searching for character patterns that match a regular expression.
Searching for stylized sequences of markup patterns.
Searching for segments delimited by explicitly named markup elements (i.e. paragraph extraction).
The Elem function returns a piece set of all elements that match a specific name. The function also allows the search scope to be restricted to a page or piece. Thus, a piece is constructed for each matching begin and end tag pair of a markup element with the indicated name in the indicated scope, and the resulting pieces are collected into a piece set result. For example, the following program fetches a page, calculates a piece set with all the img (image) elements of the page, and proceeds to print out the src attribute of each of those images:
var P = GetURL("http://www.nowhere.com");
As can be seen from the example, the every statement also allows iteration over the elements (the pieces) of a piece set.
The Pat function searches a page for character patterns that match a regular expression. The Pat function ignores the tag objects in a page -- only the pure text stream is searched. For each occurrence of the pattern, a new piece is created. This involves inserting new unnamed tags just in front of and just after each pattern occurrence to keep track of the location. For example, the figure Results of Searching for "WebL" shows how a page looks after searching for the word "WebL". In this figure the unnamed tags are indicated by the boxes marked "<>" and "</>".
The unnamed tags created while searching for character patterns are simply pattern locators -- they are ignored by many operators and functions, and are automatically removed from the page when no longer required (in some sense they are "invisible"). Also, when a page is converted back to string format, the unnamed tags are removed. It is also important to know that unnamed tags are always inserted. Thus, searching for the same pattern twice will cause two nested, unnamed pieces to be inserted into the page. Another way of saying this is that tags are never shared by more than one piece.
The Pat function also supports Perl5 regular expression groups. Groups, indicated with parentheses in Perl5 regular expressions, identify constituent parts of the pattern to be matched. For example, a regular expression matching dates might have groups for the day, month, and year of the date. For each pattern matched in the page, the corresponding piece object is given fields numbered from 1 onwards that contain each of the groups occurring from left to right in the pattern. A field named 0, which contains the complete matched pattern, is also added to the piece. For example, the following code fragment recognizes dates of the form "day-month-year":
The date pattern contains three groups, one for the two-digit day, one for a word representing the month, and one for the digits of the year. Given the occurrence of the string "20-Jan-1998" in a page, the corresponding piece object would look as follows:
See Exceptions for more details on the syntax of Perl5 regular expressions.
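The numbering of group fields can also be illustrated outside WebL. The following is a minimal Python sketch, not WebL code: it uses Python's re module as a stand-in for Pat, and both the date pattern and the helper name match_fields are assumptions chosen for illustration. For each match it builds a mapping from field number to text, with 0 holding the complete match and 1 onwards holding the groups from left to right.

```python
import re

# Assumed pattern for "day-month-year" dates: a two-digit day, a word for
# the month, and digits for the year. Illustration only; this is not the
# pattern used in the WebL manual.
DATE = re.compile(r"(\d\d)-([A-Za-z]+)-(\d+)")


def match_fields(text):
    """Return one dict per match, keyed by group number (0 = whole match)."""
    results = []
    for m in DATE.finditer(text):
        fields = {0: m.group(0)}
        for i in range(1, DATE.groups + 1):
            fields[i] = m.group(i)
        results.append(fields)
    return results


print(match_fields("Released on 20-Jan-1998."))
```

In WebL the same information would be stored as fields on the piece itself rather than in a separate dictionary.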
Once a piece set has been created with the Elem or Pat functions, we can apply WebL operators and functions to the result to perform further computation. For example, by indexing into a piece set with the indexing [] operator, we can extract the nth element of the piece set.
The PCData function returns a piece set of all text segments that are contained in a page or piece. The name PCData is derived from the term "parsed character data", which denotes the text segments on a page, i.e. what is left over when all markup tags are removed from a page. The PCData function is thus complementary to the Elem function, and somewhat related to the Pat function.
As an example, the following program fetches a page and prints out all the text segments occurring on the page (as delimited by markup tags):
var P = GetURL("http://www.nowhere.com");
(The Text function used above will be introduced a little later; it "prints" out the textual content of a piece.) Running this program will typically print a lot of white space; this is because the PCData function regards the empty regions between tags, for example, the area between br and br in the markup
as a distinct text segment. The following program shows how to get rid of these empty regions:
var P = GetURL("http://www.nowhere.com");
Note that the PCData function inserts new unnamed tags just in front of and just after each text segment to keep track of its location. This means that for the markup above, the piece identifying the empty region consists of an unnamed begin tag just after the first br, and an unnamed end tag just before the second br.
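The idea of collecting text segments as delimited by markup tags, and then discarding the blank ones, can be sketched in Python. This is an analogue under stated assumptions, not WebL: it uses the standard html.parser module, the names TextSegments and text_segments are hypothetical, and unlike PCData it inserts no tags into the page and does not report a zero-length region between two adjacent tags.

```python
from html.parser import HTMLParser


class TextSegments(HTMLParser):
    """Collect each run of text that appears between markup tags."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        self.segments.append(data)


def text_segments(html, skip_blank=False):
    """Return the text segments of html, optionally dropping blank ones."""
    parser = TextSegments()
    parser.feed(html)
    parser.close()
    if skip_blank:
        # Trim each segment and keep only those with visible content.
        return [s.strip() for s in parser.segments if s.strip()]
    return parser.segments


page = "<p>Hello</p>\n<br>\n<br><p>World</p>"
print(text_segments(page))
print(text_segments(page, skip_blank=True))
```

The first call includes the white-space-only segments between tags; the second call keeps only the segments that contain visible text.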
HTML generated on-the-fly by web servers often contains highly stylized markup patterns without hierarchical structure. The markup might be a linear sequence of elements following each other. For example, we might expect an H1 element, followed by a sequence of characters, followed by a BR element. We will be using this as our example in the following discussion.
Given a page and a string describing such a sequence (called a sequence pattern), the Seq function will return a piece set with all the occurrences of the sequence in the page. Each piece is delimited by an unnamed tag inserted just before the first element of the sequence and an unnamed tag inserted just after the last element.
A sequence pattern is a list of element names separated by space characters. The intention is to match exactly that sequence of elements on the same element nesting level. It is important to note that sequence patterns do not match nested elements. In our example, whether the H1 element contains other elements is irrelevant.
To match sequences of characters, we use the # symbol. The # symbol will match the longest sequence of characters or unnamed tags at that position in the page. (Unnamed tags are ignored.)
The following will match all the H1, text, BR sequences in our example:
The H1, text, and BR pieces matched in each of the sequences are accessible by indexing the returned pieces (one for each sequence in the page) with integers from 0 onwards. For example, the following code fragment prints details of the matched sequences:
Paragraph search is one of the more complicated WebL page searching techniques; it is seldom used, but performs a useful function that is sometimes required. The purpose of this searching technique is to break up a page or piece into logical paragraphs. Paragraphs, in the WebL world, are longer regions of a page that logically belong together. Paragraphs in WebL should not be confused with HTML paragraphs (marked up with <p> ... </p> elements). Example paragraphs in WebL might be sequences of markup each terminated with a br tag, or the regions between a set of images. WebL allows programmers to define their own meaning of the term paragraph.
To allow WebL programmers to define their own notion of paragraph, we introduce the notion of a paragraph terminator. A paragraph terminator is a tag which denotes the end of a paragraph. For example, the br tag might be designated as a paragraph terminator. It is important to note that identifying a non-empty HTML element such as font as a terminator signifies that both the begin tag <font> and the end tag </font> are to be regarded as paragraph terminators. Typically sets of terminators are used to break a page into paragraphs. For example, we can specify that all br and p tags are regarded as paragraph terminators, or that all tags except i, b, font, and tt are regarded as paragraph terminators.
Breaking a page into paragraphs with a specific set of paragraph terminators then proceeds as follows:
Identify all the paragraph terminators on the page.
Build a result piece set of paragraphs, namely all the regions that appear between successive terminators on the page (bounding terminator tags excluded). This involves the insertion of unnamed tags as placeholders.
Remove from the result piece set all those pieces p that consist of white space only, i.e. applying Markup(p) returns a string containing only ` `, \n, \r, \t, and character 160 (the non-breaking space, written &nbsp; in HTML).
The paragraph search function Para expects a piece or page as its first argument, and a specification of the paragraph terminators as its second argument. The function returns the piece set of paragraphs. The paragraph terminator specification takes the form of a string of tag names, delimited by white space. For example:
var p = Para(page, "br p table li")
indicates the br, p, table, and li elements should be regarded as paragraph terminators.
Sometimes it is more convenient to specify the tags that should not be regarded as paragraph terminators. This is done by making the first element name in the paragraph terminator specification a "-":
var p = Para(page, "- font a b i tt img")
This indicates that all tags except for font, a, b, i, tt, and img should be regarded as paragraph terminators.
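The three steps and the two forms of the terminator specification can be sketched in Python. This is a simplified analogue, not the WebL implementation: the names parse_spec and paragraphs are hypothetical, tags are located with a regular expression (which ignores comments and attribute values containing ">"), and paragraphs are returned as markup strings rather than as pieces delimited by unnamed tags.

```python
import re

# Simplified tag matcher: begin and end tags both count as terminators.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>")
# Characters treated as white space: space, \n, \r, \t, and character 160.
BLANK = " \n\r\t\xa0"


def parse_spec(spec):
    """Return (negate, names) for a spec such as "br p" or "- font a b"."""
    names = spec.lower().split()
    if names and names[0] == "-":
        return True, set(names[1:])
    return False, set(names)


def paragraphs(markup, spec):
    """Split markup at terminator tags and drop white-space-only regions."""
    negate, names = parse_spec(spec)
    regions, start = [], 0
    for m in TAG.finditer(markup):
        # With a leading "-", every tag NOT listed is a terminator.
        is_terminator = (m.group(1).lower() in names) != negate
        if is_terminator:
            regions.append(markup[start:m.start()])
            start = m.end()
    regions.append(markup[start:])
    return [r for r in regions if r.strip(BLANK)]


sample = "one<br>two <b>bold</b><p> </p>three"
print(paragraphs(sample, "br p"))
print(paragraphs(sample, "- b"))
```

For this sample both specifications select the same terminators (br and p), so the two calls return the same three paragraphs; the region between <p> and </p> is dropped because it contains white space only.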
The last example illustrates a very useful application of the Para function. HTML distinguishes between inline elements and block elements. Block elements typically start and end on a fresh line in the displayed web page. Inline elements flow in the text stream and do not typically start or end a fresh line. Sometimes it is necessary to extract the blocks of inline elements that make up the paragraphs of a Web page. As the number of inline HTML 4.0 elements is relatively small, we can accomplish this with the following WebL statement:
Para(page, "- tt i b u s strike big small em strong
dfn code samp kbd var cite acronym a img applet
object font basefont script map q sub sup span
bdo iframe input select textarea label button")
In a similar vein, the Para function can also play a role when extracting text from a Web page. This addresses a problem of the Text function when retrieving the text of a page. For example, applying the Text function to the following page
<li>word A</li><li>word B</li>
results in the text string "word Aword B", where two words unexpectedly flow together. Whether an extra space should be inserted at a word boundary depends on whether a breaking tag is present. The problem can be solved with a script of the following form:
"- tt i b u s strike big small em strong
dfn code samp kbd var cite acronym a img applet
object font basefont script map q sub sup span
bdo iframe input select textarea label button");
except for the fact that pieces with no content are filtered out.
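The underlying idea, inserting a space only where a breaking (non-inline) tag occurs, can be sketched in Python. This is an analogue under stated assumptions rather than the WebL script: it uses the standard html.parser module, the names SpacedText and spaced_text are hypothetical, and as a simplification it also collapses runs of white space in the result.

```python
from html.parser import HTMLParser

# Inline elements that should NOT introduce a word break.
INLINE = set(
    "tt i b u s strike big small em strong dfn code samp kbd var cite "
    "acronym a img applet object font basefont script map q sub sup span "
    "bdo iframe input select textarea label button".split()
)


class SpacedText(HTMLParser):
    """Collect text, adding a space at every non-inline begin or end tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _maybe_break(self, tag):
        if tag not in INLINE:
            self.parts.append(" ")

    def handle_starttag(self, tag, attrs):
        self._maybe_break(tag)

    def handle_endtag(self, tag):
        self._maybe_break(tag)

    def handle_data(self, data):
        self.parts.append(data)


def spaced_text(html):
    """Return the text of html with words separated at breaking tags."""
    parser = SpacedText()
    parser.feed(html)
    parser.close()
    # Collapse runs of white space so repeated breaks yield a single space.
    return " ".join("".join(parser.parts).split())


print(spaced_text("<li>word A</li><li>word B</li>"))
print(spaced_text("w<b>or</b>d"))
```

The list items are separated by a space because li is not an inline element, while the b element in the second call leaves the surrounding word intact.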
Even though the piece searching functions introduced so far already provide powerful ways of extracting pieces from a web page, they might still not be enough. Suppose it is necessary to restrict the contents of a piece set to those elements whose attributes match some criteria. For example, we might be interested in all HTML anchors that point to a specific site. For exactly this purpose, the built-in Select function allows you to filter the contents of a piece set according to a selection function. (The Select function also supports filtering of sets and lists in a similar manner.) The following code fragment illustrates how the Select function might be used in this case:
fun(a) Str_StartsWith(a, "http://site.com") end
Note how the selection function is passed as the second argument to Select. The Select function iterates over the elements of its first argument, repeatedly invoking the selection function to determine whether each element should be included in the result piece set. The selection function must have a single formal argument and must return a boolean value that indicates whether its argument should be included in the result piece set. You are free to specify any selection criteria as long as the result of the function is of type boolean.
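The contract described above (one argument in, a boolean out) can be sketched in Python. This is an illustrative analogue, not WebL's Select: the function name select, the dictionaries standing in for anchor pieces, and the example URLs other than http://site.com are assumptions.

```python
def select(pieces, keep):
    """Return the items of pieces for which keep(item) is True."""
    result = []
    for piece in pieces:
        verdict = keep(piece)
        # Enforce the rule that the selection function returns a boolean.
        if not isinstance(verdict, bool):
            raise TypeError("selection function must return a boolean")
        if verdict:
            result.append(piece)
    return result


# Hypothetical stand-ins for anchor pieces: plain dicts with an href field.
anchors = [
    {"href": "http://site.com/index.html"},
    {"href": "http://elsewhere.org/"},
    {"href": "http://site.com/docs/"},
]
on_site = select(anchors, lambda a: a["href"].startswith("http://site.com"))
print([a["href"] for a in on_site])
```

The order of the input is preserved in the result, and a selection function that returns a non-boolean value is rejected rather than silently coerced.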
Elem | Returns all the elements that are contained (nested) in piece q.
Elem | Returns all the elements with a specific name contained in piece q.
Para | Extracts the paragraphs in p according to the paragraph terminator specification paraspec.
Para | Extracts the paragraphs in p according to the paragraph terminator specification paraspec.
Pat | Returns all the occurrences of a regular expression pattern in page p.
Pat | Returns all the occurrences of a regular expression pattern located inside the piece q.
PCData | Returns the "parsed character data" of the page. This corresponds to the individual sequences of text on the page, as delimited by markup tags.
PCData | Returns the "parsed character data" of the piece. This corresponds to the individual sequences of text inside the piece, as delimited by markup tags.
Seq | Matches all the occurrences of a sequence of elements identified by pattern. See PCData search.
Seq | Matches all the occurrences of a sequence of elements identified by pattern inside the piece p. See PCData search.