All the piece set operators are summarized in "Piece and Piece Set Operators". All piece set operators accept both pieces and piece sets as operands; a piece operand is converted automatically to a piece set with that piece as its only element. Most of the operators are defined formally in "Formal Definitions of Piece Set Operators". The remainder of this section gives an intuitive explanation of the operators with the help of examples. In the examples, X denotes a page, P and Q denote piece sets, and p and q denote elements of P and Q respectively.
Basic piece set manipulation includes the set union, intersection, and exclusion operators.
The set union operator + merges two piece sets into a single piece set, and eliminates duplicate pieces from the result. Example:
Positional operators express relationships between pieces according to their order in a page. Most positional operators have a negated or inverted version that is indicated by an operator symbol written with an exclamation point (!).
The index operator [] extracts the nth element of a piece set P. Pieces are numbered from 0 to Size(P) - 1. Examples:
// Extract the 4th table from a page.
Elem(X, "table")[3]
// Extract the 3rd row of the 4th table.
Elem(Elem(X, "table")[3], "tr")[2]
// Extract the 2nd row of the table containing the
The before operator returns all the elements of P that are before (or not before) any element of Q. Note that this is equivalent to all the elements of P that are before (or not before) the last element of Q. Consequently, we often need to index into Q to reduce it to a single piece. Examples:
// Retrieve all the H2's before the appendix
// (assuming only a single appendix is present).
(Elem(X, "h1") contain Pat(X, "Appendix"))
// Retrieve all the headings from Chapter 4 onwards.
(Elem(X, "h1") contain Pat(X, "Chapter 4"))
// Retrieve all the italic elements except the last.
Elem(X, "i") before Elem(X, "i")
// Retrieve the last italic element.
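The behaviour of before and its negated form can be mimicked outside WebL with a small toy model. The Python sketch below is only an illustration under stated assumptions: a piece is modelled as a hypothetical (begin, end) pair of tag positions, and p counts as "before" q when p ends before q begins. The formal definitions remain the authoritative source.

```python
# Toy model (an assumption, not WebL itself): a piece is a (begin, end)
# pair of tag positions, and p is "before" q when p ends before q begins.

def is_before(p, q):
    return p[1] < q[0]

def before(P, Q):
    """Pieces of P that are before at least one piece of Q."""
    return [p for p in P if any(is_before(p, q) for q in Q)]

def not_before(P, Q):
    """Negated form (!before): pieces of P that are before no piece of Q."""
    return [p for p in P if not any(is_before(p, q) for q in Q)]

# Three non-overlapping italic pieces, in page order.
italics = [(0, 2), (5, 7), (10, 12)]

# Comparing the set with itself keeps everything except the last piece.
all_but_last = before(italics, italics)    # [(0, 2), (5, 7)]

# The negated form keeps only the last piece.
last_only = not_before(italics, italics)   # [(10, 12)]
```

In this model, testing against the whole of Q gives the same answer as testing against its last piece alone, which matches the equivalence noted in the text.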
The directlybefore operator returns the pieces of P that are directly before (or not directly before) any element of Q. A piece p of P is directly before a piece q of Q if no other piece in P appears between p and q. For example, suppose page X contains (excluding the line numbers on the left):
// Retrieve the italics directly before H1's,
Elem(X, "i") directlybefore Elem(X, "h1")
// Retrieve the italics not directly before H1's,
Elem(X, "i") !directlybefore Elem(X, "h1")
// Retrieve all elements directly before H1's,
Elem(X) directlybefore Elem(X, "h1")
// Retrieve the second element directly before H1's,
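The "no other piece in P appears between p and q" condition can be made concrete with a toy model. The Python sketch below is an illustration only: pieces are hypothetical (begin, end) pairs of tag positions, and "between" is taken to mean "after p and before q", which is an assumption about the intended reading.

```python
# Toy model (an assumption, not WebL itself): a piece is a (begin, end)
# pair of tag positions.

def is_before(a, b):
    return a[1] < b[0]

def directly_before(P, Q):
    """Pieces p of P that precede some q of Q with no other piece of P between."""
    result = []
    for p in P:
        for q in Q:
            blocked = any(r != p and is_before(p, r) and is_before(r, q) for r in P)
            if is_before(p, q) and not blocked:
                result.append(p)
                break
    return result

def not_directly_before(P, Q):
    """Negated form (!directlybefore)."""
    hits = directly_before(P, Q)
    return [p for p in P if p not in hits]

italics = [(0, 1), (2, 3), (6, 7)]
headings = [(4, 5), (8, 9)]

nearest = directly_before(italics, headings)      # [(2, 3), (6, 7)]
others = not_directly_before(italics, headings)   # [(0, 1)]
```

The first italic is excluded because the second italic sits between it and either heading.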
The after operator returns all the elements of P that are after (or not after) any element of Q. Note that this is equivalent to all the elements of P that are after (or not after) the first element of Q. Consequently, we often need to index into Q to reduce it to a single piece. Examples:
// Retrieve all the H2's after the appendix
// (assuming only a single appendix is present).
(Elem(X, "h1") contain Pat(X, "Appendix"))
// Retrieve all the headings before Chapter 4
(Elem(X, "h1") contain Pat(X, "Chapter 4"))
// Retrieve all the italic elements except the first.
Elem(X, "i") after Elem(X, "i")
// Retrieve the first italic element.
The directlyafter operator returns the pieces of P that are directly after (or not directly after) any element of Q. A piece p of P is directly after a piece q of Q if no other piece in P appears between p and q. Examples (based on the previous page object X):
// Retrieve the italics directly after H1's,
Elem(X, "i") directlyafter Elem(X, "h1")
// Retrieve the italics not directly after H1's,
// i.e. lines 3, 4, 7, 10, 12.
Elem(X, "i") !directlyafter Elem(X, "h1")
// Retrieve all elements directly after H1's,
Elem(X) directlyafter Elem(X, "h1")
// Retrieve the second element directly after H1's,
The hierarchical operators express relationships between pieces involving their hierarchical nesting in the element parse tree.
The inside operator returns the pieces of P that are nested inside (or not nested inside) any piece of Q. Examples:
// Retrieve all the rows in the third table.
Elem(X, "tr") inside Elem(X, "table")[3]
// Retrieve all the italic elements not in a table.
The contain operator returns the pieces of P that contain (or do not contain) any piece of Q. Examples:
// Retrieve all the level 2 headings with
Elem(X, "h2") contain Elem(X, "i")
// Retrieve all the tables that mention "program".
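The inside and contain operators mirror each other, which a toy model makes easy to see. The Python sketch below is an illustration under assumptions: pieces are hypothetical (begin, end) pairs of tag positions, and nesting is taken to be strict (a piece is not inside itself).

```python
# Toy model (an assumption, not WebL itself): a piece is a (begin, end)
# pair of tag positions; p is inside q when q's tags enclose p's tags.

def is_inside(p, q):
    return q[0] < p[0] and p[1] < q[1]

def inside(P, Q):
    return [p for p in P if any(is_inside(p, q) for q in Q)]

def not_inside(P, Q):
    return [p for p in P if not any(is_inside(p, q) for q in Q)]

def contain(P, Q):
    return [p for p in P if any(is_inside(q, p) for q in Q)]

tables = [(0, 9), (20, 29)]
rows = [(1, 4), (5, 8), (21, 28)]
italics = [(2, 3), (40, 41)]

rows_in_second_table = inside(rows, [tables[1]])    # [(21, 28)]
italics_outside_tables = not_inside(italics, tables)  # [(40, 41)]
tables_with_italics = contain(tables, italics)      # [(0, 9)]
```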
The directlyinside operator returns all the elements of P that are inside (or not inside) any element of Q, and in addition are not inside another element of P. Intuitively this retrieves the "outermost" element of all nested elements. Given a page of the following form:
we can calculate the following:
// All the list items in lists,
// i.e. lines 2, 3, 4, 6, 7, 9.
Elem(X, "li") inside Elem(X, "ul")
// All the list items in the first list,
// i.e. the elements on lines 2, 3, 4-9, 6, 7, 10.
Elem(X, "li") inside Elem(X, "ul")[0]
// All the items directly in the first list,
Elem(X, "li") directlyinside Elem(X, "ul")[0]
// Outermost items in the first list,
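The difference between inside and directlyinside can be illustrated with a toy model of a nested list. The Python sketch below rests on assumptions: pieces are hypothetical (begin, end) pairs of tag positions, and "directly" is read as "no other piece of P lies between p and q in the nesting". The formal definition is authoritative if it differs.

```python
# Toy model (an assumption, not WebL itself): a piece is a (begin, end)
# pair of tag positions.

def is_inside(p, q):
    return q[0] < p[0] and p[1] < q[1]

def directly_inside(P, Q):
    """Pieces p of P inside some q of Q with no other piece of P in between."""
    result = []
    for p in P:
        for q in Q:
            blocked = any(r != p and is_inside(p, r) and is_inside(r, q) for r in P)
            if is_inside(p, q) and not blocked:
                result.append(p)
                break
    return result

outer_list = (0, 19)
# The item (5, 16) holds a nested list whose own items are (7, 8) and (9, 10).
items = [(1, 2), (3, 4), (5, 16), (7, 8), (9, 10), (17, 18)]

all_items = [p for p in items if is_inside(p, outer_list)]   # all six items
outermost = directly_inside(items, [outer_list])
# outermost == [(1, 2), (3, 4), (5, 16), (17, 18)]
```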
The directlycontain operator returns all the elements of P that contain (or do not contain) any element of Q, and in addition do not contain another element of P. Intuitively this retrieves the "innermost" element of all nested elements. Given the page defined previously, we can calculate:
// The lists that contain the first subsection,
// i.e. elements on lines 1-11, 5-8.
Elem(X, "ul") contain Pat(X, "First Subsection")
// The list that directly contains the first subsection
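The "innermost" behaviour of directlycontain can be illustrated in the same spirit. The Python sketch below is a toy model under assumptions: pieces are hypothetical (begin, end) pairs of tag positions, and a piece p directly contains q when no other piece of P sits between them in the nesting.

```python
# Toy model (an assumption, not WebL itself): a piece is a (begin, end)
# pair of tag positions.

def is_inside(p, q):
    return q[0] < p[0] and p[1] < q[1]

def contain(P, Q):
    return [p for p in P if any(is_inside(q, p) for q in Q)]

def directly_contain(P, Q):
    """Pieces p of P containing some q of Q with no other piece of P in between."""
    result = []
    for p in P:
        for q in Q:
            blocked = any(r != p and is_inside(r, p) and is_inside(q, r) for r in P)
            if is_inside(q, p) and not blocked:
                result.append(p)
                break
    return result

lists = [(0, 19), (6, 15)]   # an outer list and a list nested inside it
match = [(7, 8)]             # a pattern match located in the nested list

enclosing = contain(lists, match)           # [(0, 19), (6, 15)]
innermost = directly_contain(lists, match)  # [(6, 15)]
```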
The regional operators construct new pieces to identify parts of a page. (Most other operators return only pieces that already existed before the operator was applied.)
The without operator returns the pieces of P with the parts that overlap pieces of Q "cut" away. This may split a piece of P into several new pieces, with new unnamed tags inserted as necessary. "Operation of P without Q" gives an example in which the word WebL is removed from a sentence; note how unnamed tags are inserted to the left and right of piece A. Examples:
// "Cut" up the second table into its
Elem(X, "table")[1] without Pat(X, `\n`)
// Remove all the bold text from
Elem(X, "p")[0] without Elem(X, "b")
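The cutting performed by without resembles interval subtraction. The Python sketch below is a toy model, not WebL: pieces are modelled as hypothetical half-open (start, end) character ranges over a string, and the new unnamed tags that WebL would insert are represented only by the boundaries of the resulting ranges.

```python
# Toy model (an assumption, not WebL itself): a piece is a half-open
# (start, end) character range over a piece of text.

def without(P, Q):
    """Cut every part covered by a piece of Q out of each piece of P."""
    result = []
    for start, end in P:
        cursor = start
        for qs, qe in sorted(Q):
            if qe <= cursor or qs >= end:
                continue                      # no overlap with what is left
            if qs > cursor:
                result.append((cursor, qs))   # keep the part before the cut
            cursor = max(cursor, qe)          # resume after the cut
        if cursor < end:
            result.append((cursor, end))      # keep the tail
    return result

text = "I like WebL a lot"
sentence = [(0, len(text))]
at = text.find("WebL")
word = [(at, at + len("WebL"))]

remaining = without(sentence, word)
fragments = [text[s:e] for s, e in remaining]   # ["I like ", " a lot"]
```

One sentence piece becomes two pieces, one on each side of the removed word.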
The intersect operator intersects each element of P with all the overlapping pieces of Q. The resulting piece set contains all the parts of P that are in common with pieces of Q. Because parts of pieces of P are cut away by the intersection, new pieces need to be created, and thus new unnamed tags are inserted into the page. Another way of thinking about the operator is that it calculates the overlap between pieces. "Operation of P intersect Q" shows how this is done.
// The parts of a page that are both italic and bold.
Elem(X, "i") intersect Elem(X, "b")
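The overlap calculation can be sketched as pairwise interval intersection. The Python below is a toy model under assumptions, not WebL: pieces are hypothetical half-open (start, end) character ranges, and each non-empty overlap stands for one newly created piece.

```python
# Toy model (an assumption, not WebL itself): a piece is a half-open
# (start, end) character range.

def intersect(P, Q):
    """Return the overlapping part of every pair (p, q) that overlaps."""
    result = []
    for ps, pe in P:
        for qs, qe in Q:
            start, end = max(ps, qs), min(pe, qe)
            if start < end:                 # keep non-empty overlaps only
                result.append((start, end))
    return result

italic = [(0, 10), (30, 40)]
bold = [(6, 15), (50, 60)]

both = intersect(italic, bold)   # [(6, 10)]
```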
The Children function returns all the child pieces of piece p. The children of a piece include all the elements and all the text segments directly contained in the piece. Markup elements that are only partially inside p because of overlap are not regarded as children of p. For example, the children of the following TD element, which contains nested I and B elements:
<td>abc<i><b>def</b></i>ghi<b>jkl</b>mno</td>
are the pieces represented by:
// Everything inside the first table.
// Program to walk recursively through a page.
if Name(x) != "" then // Named piece
PrintLn(Text(x), " parent=", Name(Parent(x)))
var P = NewPage("<td>abc<i><b>def</b></i>
The Parent function returns the direct parent (enclosing) element of piece p. It is implemented by looking at named tags t from right to left starting just before the left tag of p, identifying the piece q that tag t belongs to, and determining if the corresponding end tag of q follows the end tag of p. Example:
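The right-to-left scan described above can be approximated outside WebL. The Python sketch below is a simplified toy model under assumptions: named pieces are hypothetical (begin, end, name) triples of tag positions, and only begin tags are examined, since an end tag met during the scan always belongs to a piece that closes before p and so cannot enclose it.

```python
# Toy model (an assumption, not WebL itself): a named piece is a
# (begin, end, name) triple of tag positions.

def parent(p, pieces):
    """Return the nearest piece that starts before p and ends after p, or None."""
    earlier = [q for q in pieces if q[0] < p[0]]
    # Visit begin tags from right to left, starting just before p's begin tag.
    for q in sorted(earlier, key=lambda q: q[0], reverse=True):
        if q[1] > p[1]:        # q's end tag follows p's end tag
            return q
    return None

# Tag positions for: <td>abc<i><b>def</b></i>ghi<b>jkl</b>mno</td>
td = (0, 7, "td")
i = (1, 4, "i")
b1 = (2, 3, "b")
b2 = (5, 6, "b")
pieces = [td, i, b1, b2]

parent_of_b1 = parent(b1, pieces)   # the I element
parent_of_b2 = parent(b2, pieces)   # the TD element
parent_of_td = parent(td, pieces)   # None, nothing encloses it
```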
The Flatten function returns the union of all elements of P. Intuitively, any two overlapping pieces p and q of P are repeatedly replaced by a single "joined" piece that covers the union of the regions covered by p and q. This also has the effect of removing nested elements of P. New unnamed tags are inserted into the page to create these new pieces. "Flattening a Piece Set" shows how two overlapping pieces are flattened.
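The repeated joining of overlapping pieces corresponds to the familiar interval-merging technique. The Python sketch below is a toy model, not WebL: pieces are hypothetical half-open (start, end) character ranges, and ranges that merely touch are left separate, which is an assumption about how adjacent pieces are treated.

```python
# Toy model (an assumption, not WebL itself): a piece is a half-open
# (start, end) character range.

def flatten(P):
    """Join overlapping or nested ranges into maximal non-overlapping ranges."""
    merged = []
    for start, end in sorted(P):
        if merged and start < merged[-1][1]:
            # Overlaps the previous range: extend it to cover both.
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged

pieces = [(0, 10), (6, 15), (2, 4), (20, 25)]
flat = flatten(pieces)   # [(0, 15), (20, 25)]
```

The nested range (2, 4) disappears into its enclosing range, and the two overlapping ranges become one.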
The Content function returns the content of piece p. The content of a piece is the part of the page between the begin and end tags of p (exclusive). The Content function can also be applied to a page object, in which case it returns a piece that starts at the beginning of the page and ends at the end of the page. In both cases, new unnamed tags are inserted into the page (see "Application of the Content Function"). For example, given a page:
we can calculate the following:
// i.e. "<td>abc<i>def</i></td>".