Piece Set Operators and Functions

All the piece set operators are summarized in the table Piece and Piece Set Operators. All piece set operators accept both pieces and piece sets as operands; a piece operand is converted automatically to a piece set with that piece as its only element. Most of the operators have a formal definition, given in the table Formal Definitions of Piece Set Operators. The remainder of this section gives an intuitive explanation of the operators with the help of examples. In the examples, X denotes a page, P and Q denote piece sets, and p and q denote elements of P and Q respectively.

I. Basic Operators

Basic piece set manipulation includes the set union, intersection, and exclusion operators.

Set Union (P + Q)

The set union operator + merges two piece sets into a single piece set, and eliminates duplicate pieces from the result. Example:

// Retrieve level 1 and level 2 headings from a page

Elem(X, "h1") + Elem(X, "h2")

 

 

Set Exclusion (P - Q)

The set exclusion operator "-" removes all pieces from the left operand that are elements of the right operand. Example:

// Retrieve all level 1 headings except for those

// that contain the word "Figure".

Elem(X, "h1") -

(Elem(X, "h1") contain Pat(X, "Figure"))

 

 

Set Intersection (P * Q)

The set intersection operator * returns the pieces that are elements of both operands. Example:

// Retrieve all the occurrences of the word "WebL"

// written in bold and in italic.

(Pat(X, "WebL") inside Elem(X, "b")) *

(Pat(X, "WebL") inside Elem(X, "i"))
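The three basic operators can also be modelled outside WebL. The following Python sketch is an illustration only: it assumes that a piece is a (begin, end) pair of tag positions, that two pieces are equal when both positions match, and that a piece set is kept in page order. The helper names union, exclusion, and intersection are hypothetical and are not part of WebL.

```python
# Model (assumption): a piece is a (begin, end) pair of tag positions,
# and a piece set is a list of pieces in page order.

def union(P, Q):
    """P + Q: all of P, plus the pieces of Q not already present."""
    return sorted(P + [q for q in Q if q not in P])

def exclusion(P, Q):
    """P - Q: the pieces of P that are not elements of Q."""
    return [p for p in P if p not in Q]

def intersection(P, Q):
    """P * Q: the pieces of P that are also elements of Q."""
    return [p for p in P if p in Q]

P = [(0, 3), (4, 7), (8, 11)]
Q = [(4, 7), (12, 15)]

print(union(P, Q))         # [(0, 3), (4, 7), (8, 11), (12, 15)]
print(exclusion(P, Q))     # [(0, 3), (8, 11)]
print(intersection(P, Q))  # [(4, 7)]
```

The piece (4, 7) occurs in both operands, so it appears once in the union, is dropped by the exclusion, and is the only piece in the intersection.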

 

II. Positional Operators

Positional operators express relationships between pieces according to their order in a page. Most positional operators have a negated or inverted version that is indicated by an operator symbol written with an exclamation point (!).

Indexing P[i]

The index operator [] extracts the element at index i from a piece set P. Pieces are numbered from 0 to Size(P) - 1, so P[0] is the first piece. Examples:

// Extract the fifth table from a page (index 4).

Elem(X, "table") [4]

 

// Extract the third row of the fourth table.

Elem(Elem(X, "table")[3], "tr")[2]

 

// Extract the third row of the table containing the

// word WebL.

var t = Elem(X, "table") contain Pat(X, "WebL");

(Elem(X, "tr") inside t)[2]

 

 

P before/!before Q

The before operator returns all the elements of P that are before (or, for !before, not before) any element of Q. This is equivalent to all the elements of P that are before (or not before) the last element of Q. Consequently, we often need to index into Q to reduce it to a single piece. Examples:

// Retrieve all the H2's before the appendix

// (assuming only a single appendix is present).

Elem(X, "h2") before

(Elem(X, "h1") contain Pat(X, "Appendix"))

 

// Retrieve all the headings from Chapter 4 onwards.

Elem(X, "h1") !before

(Elem(X, "h1") contain Pat(X, "Chapter 4"))

 

// Retrieve all the italic elements except the last.

Elem(X, "i") before Elem(X, "i")

 

// Retrieve the last italic element.

Elem(X, "i") !before Elem(X, "i")
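The claim that comparing against all of Q is the same as comparing against the last element of Q can be checked with a small model. The Python sketch below is an illustration under stated assumptions: a piece is a (begin, end) pair of tag positions, and p is before q when p ends before q begins. The function names are hypothetical.

```python
# Model (assumption): a piece is a (begin, end) pair of tag positions.

def is_before(p, q):
    """p is before q when p ends before q begins."""
    return p[1] < q[0]

def before(P, Q):
    """P before Q: pieces of P that are before any piece of Q."""
    return [p for p in P if any(is_before(p, q) for q in Q)]

def not_before(P, Q):
    """P !before Q: pieces of P that are before no piece of Q."""
    return [p for p in P if not any(is_before(p, q) for q in Q)]

italics = [(0, 1), (2, 3), (4, 5), (6, 7)]  # four non-overlapping pieces

# Comparing against the whole set gives the same result as comparing
# against its last piece only.
same = before(italics, italics) == before(italics, [italics[-1]])

print(same)                          # True
print(before(italics, italics))      # [(0, 1), (2, 3), (4, 5)]
print(not_before(italics, italics))  # [(6, 7)]
```

This mirrors the last two examples: a set compared with itself yields everything except the last piece, and the negated form yields only the last piece.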

 

 

P directlybefore/!directlybefore Q

The directlybefore operator returns the pieces of P that are directly before (or not directly before) any element of Q. A piece p of P is directly before a piece q of Q if p is before q and no other piece of P appears between p and q. For example, given that page X contains the following (the line numbers on the left are not part of the page):

 1  <h1>A</h1>

 2  <i>a</i>

 3  <i>b</i>

 4  <b>c</b>

 5  <h1>B</h1>

 6  <i>d</i>

 7  <i>e</i>

 8  <h1>C</h1>

 9  <i>f</i>

10  <i>g</i>

11  <h1>D</h1>

12  <i>h</i>

13  <i>i</i>

14  <h1>E</h1>

 

we can compute the following:

// Retrieve the italics directly before H1's,

// i.e. lines 3, 7, 10, 13.

Elem(X, "i") directlybefore Elem(X, "h1")

 

// Retrieve the italics not directly before H1's,

// i.e. lines 2, 6, 9, 12.

Elem(X, "i") !directlybefore Elem(X, "h1")

 

// Retrieve all elements directly before H1's,

// i.e. lines 4, 7, 10, 13.

Elem(X) directlybefore Elem(X, "h1")

 

// Retrieve the second element directly before H1's,

// i.e. lines 3, 6, 9, 12.

Elem(X) directlybefore

(Elem(X) directlybefore Elem(X, "h1"))
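The line numbers quoted in these examples can be reproduced with a small model of the sample page. The Python sketch below is an illustration, not WebL's implementation: it assumes the element on line n is the piece (2*n, 2*n + 1), and it implements directlybefore straight from the formal definition. All names are hypothetical.

```python
# Model (assumption): the element on line n of the sample page is the
# piece (2*n, 2*n + 1), so pieces on different lines never overlap.

def piece(line):
    return (2 * line, 2 * line + 1)

def line_of(p):
    return p[0] // 2

def is_before(p, q):
    return p[1] < q[0]

def directlybefore(P, Q):
    """Pieces p of P before some q of Q with no other piece of P in between."""
    return [
        p for p in P
        if any(
            is_before(p, q)
            and not any(is_before(r, q) and is_before(p, r) for r in P)
            for q in Q
        )
    ]

h1 = [piece(n) for n in (1, 5, 8, 11, 14)]
italic = [piece(n) for n in (2, 3, 6, 7, 9, 10, 12, 13)]
everything = [piece(n) for n in range(1, 15)]

first = directlybefore(everything, h1)
second = directlybefore(everything, first)

print([line_of(p) for p in directlybefore(italic, h1)])  # [3, 7, 10, 13]
print([line_of(p) for p in first])                       # [4, 7, 10, 13]
print([line_of(p) for p in second])                      # [3, 6, 9, 12]
```

Line 4 appears only when all elements are considered, because the bold element is not a member of the italic piece set and therefore cannot be returned or block an italic piece.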

 

 

P after/!after Q

The after operator returns all the elements of P that are after (or, for !after, not after) any element of Q. This is equivalent to all the elements of P that are after (or not after) the first element of Q. Consequently, we often need to index into Q to reduce it to a single piece. Examples:

// Retrieve all the H2's after the appendix

// (assuming only a single appendix is present).

Elem(X, "h2") after

(Elem(X, "h1") contain Pat(X, "Appendix"))

 

// Retrieve all the headings up to and including

// Chapter 4.

Elem(X, "h1") !after

(Elem(X, "h1") contain Pat(X, "Chapter 4"))

 

// Retrieve all the italic elements except the first.

Elem(X, "i") after Elem(X, "i")


// Retrieve the first italic element.

Elem(X, "i") !after Elem(X, "i")

 

 

P directlyafter/!directlyafter Q

The directlyafter operator returns the pieces of P that are directly after (or not directly after) any element of Q. A piece p of P is directly after a piece q of Q if p is after q and no other piece of P appears between q and p. Examples (based on the previous page X):

// Retrieve the italics directly after H1's,

// i.e. lines 2, 6, 9, 12.

Elem(X, "i") directlyafter Elem(X, "h1")

 

// Retrieve the italics not directly after H1's,

// i.e. lines 3, 7, 10, 13.

Elem(X, "i") !directlyafter Elem(X, "h1")

 

// Retrieve all elements directly after H1's,

// i.e. lines 2, 6, 9, 12.

Elem(X) directlyafter Elem(X, "h1")

 

// Retrieve the second element directly after H1's,

// i.e. lines 3, 7, 10, 13.

Elem(X) directlyafter

(Elem(X) directlyafter Elem(X, "h1"))
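The same kind of model covers directlyafter and its negation. The Python sketch below is an illustration under the same assumptions as before: the element on line n of the sample page is the piece (2*n, 2*n + 1), and the negated operator is taken to be the complement within P, as the formal definition suggests. All names are hypothetical.

```python
# Model (assumption): the element on line n of the sample page is the
# piece (2*n, 2*n + 1), so pieces on different lines never overlap.

def piece(line):
    return (2 * line, 2 * line + 1)

def line_of(p):
    return p[0] // 2

def is_after(p, q):
    """p is after q when p begins after q ends."""
    return p[0] > q[1]

def directlyafter(P, Q):
    """Pieces p of P after some q of Q with no other piece of P in between."""
    return [
        p for p in P
        if any(
            is_after(p, q)
            and not any(is_after(r, q) and is_after(p, r) for r in P)
            for q in Q
        )
    ]

def not_directlyafter(P, Q):
    """P !directlyafter Q: the pieces of P that directlyafter does not return."""
    direct = directlyafter(P, Q)
    return [p for p in P if p not in direct]

h1 = [piece(n) for n in (1, 5, 8, 11, 14)]
italic = [piece(n) for n in (2, 3, 6, 7, 9, 10, 12, 13)]
everything = [piece(n) for n in range(1, 15)]

second = directlyafter(everything, directlyafter(everything, h1))

print([line_of(p) for p in directlyafter(italic, h1)])      # [2, 6, 9, 12]
print([line_of(p) for p in not_directlyafter(italic, h1)])  # [3, 7, 10, 13]
print([line_of(p) for p in second])                         # [3, 7, 10, 13]
```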

 

P overlap/!overlap Q

The overlap operator returns the pieces of P that overlap (or do not overlap) any element of Q. Example:

// Find all the words that are italic or that

// partially consist of italic text.

Pat(X, `\w+`) overlap Elem(X, "i")
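The word example can be modelled with character ranges. The Python sketch below is an illustration only: it assumes pieces are half-open (start, end) character ranges, that two pieces overlap when they share at least one character (so full containment also counts), and that the italic region is a hypothetical range chosen for the example.

```python
import re

def overlaps(p, q):
    """Half-open character ranges overlap when they share a character."""
    return p[0] < q[1] and q[0] < p[1]

def overlap(P, Q):
    """P overlap Q: pieces of P that overlap any piece of Q."""
    return [p for p in P if any(overlaps(p, q) for q in Q)]

def not_overlap(P, Q):
    """P !overlap Q: pieces of P that overlap no piece of Q."""
    return [p for p in P if not any(overlaps(p, q) for q in Q)]

text = "one two three"
words = [m.span() for m in re.finditer(r"\w+", text)]
italic = [(5, 10)]  # hypothetical italic region covering "wo th"

hits = overlap(words, italic)
misses = not_overlap(words, italic)

print([text[b:e] for b, e in hits])    # ['two', 'three']
print([text[b:e] for b, e in misses])  # ['one']
```

Both "two" and "three" are only partially italic in this model, yet each is returned whole, because the operator selects pieces of P and does not cut them.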

 

III. Hierarchical Operators

The hierarchical operators express relationships between pieces involving their hierarchical nesting in the element parse tree.

P inside/!inside Q

The inside operator returns the pieces of P that are nested inside (or not nested inside) any piece of Q. Examples:

// Retrieve all the rows in the fourth table (index 3).

Elem(X, "tr") inside Elem(X, "table")[3]

 

// Retrieve all the italic elements not in a table.

Elem(X, "i") !inside Elem(X, "table")

 

 

P contain/!contain Q

The contain operator returns the pieces of P that contain (or do not contain) any piece of Q. Examples:

// Retrieve all the level 2 headings with

// italic characters.

Elem(X, "h2") contain Elem(X, "i")

 

// Retrieve all the tables that mention "program".

Elem(X, "table") contain Pat(X, "program")

 

 

P directlyinside/!directlyinside Q

The directlyinside operator returns all the elements of P that are inside (or not inside) any element of Q and that, in addition, are not inside another element of P that is itself inside that element of Q. Intuitively this retrieves the "outermost" element of all nested elements. Given a page of the following form (the line numbers on the left are not part of the page):

 1  <UL>

 2  <LI>First Section</LI>

 3  <LI>Second Section</LI>

 4  <LI>Third Section

 5  <UL>

 6  <LI>First Subsection</LI>

 7  <LI>Second Subsection</LI>

 8  </UL>

 9  </LI>

10  <LI>Fourth Section</LI>

11  </UL>

 

we can calculate the following:

// All the list items in lists,

// i.e. the elements on lines 2, 3, 4-9, 6, 7, 10.

Elem(X, "li") inside Elem(X, "ul")

 

// All the list items in the first list,

// i.e. the elements on lines 2, 3, 4-9, 6, 7, 10.

Elem(X, "li") inside Elem(X, "ul")[0]

 

// All the items directly in the first list,

// i.e. lines 2, 3, 4-9, 10.

Elem(X, "li") directlyinside Elem(X, "ul")[0]

 

// Outermost items in the first list,

// i.e. lines 2, 3, 4-9, 10.

var x = Elem(X, "li") inside Elem(X, "ul")[0];

x !inside x
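The difference between inside and directlyinside on the nested list can be checked with a line-range model. The Python sketch below is an illustration only: it assumes a piece is a (first_line, last_line) pair taken from the sample page and that p is inside q when q starts before p and ends after it. The function names are hypothetical.

```python
# Model (assumption): a piece is (first_line, last_line) on the sample page.

def is_inside(p, q):
    """p is nested strictly inside q."""
    return q[0] < p[0] and p[1] < q[1]

def inside(P, Q):
    return [p for p in P if any(is_inside(p, q) for q in Q)]

def not_inside(P, Q):
    return [p for p in P if not any(is_inside(p, q) for q in Q)]

def directlyinside(P, Q):
    """Pieces of P inside some q, with no other piece of P in between."""
    return [
        p for p in P
        if any(
            is_inside(p, q)
            and not any(is_inside(r, q) and is_inside(p, r) for r in P)
            for q in Q
        )
    ]

ul = [(1, 11), (5, 8)]
li = [(2, 2), (3, 3), (4, 9), (6, 6), (7, 7), (10, 10)]

first_list = [ul[0]]
x = inside(li, first_list)

print(x)                               # all six list items
print(directlyinside(li, first_list))  # [(2, 2), (3, 3), (4, 9), (10, 10)]
print(not_inside(x, x))                # [(2, 2), (3, 3), (4, 9), (10, 10)]
```

The items on lines 6 and 7 are dropped by directlyinside because the item spanning lines 4-9 lies between them and the outer list.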

 

 

P directlycontain/!directlycontain Q

The directlycontain operator returns all the elements of P that contain (or do not contain) any element of Q and that, in addition, do not contain another element of P that itself contains that element of Q. Intuitively this retrieves the "innermost" element of all nested elements. Given the page defined previously, we can calculate:

// The lists that contain the first subsection,

// i.e. elements on lines 1-11, 5-8.

Elem(X, "ul") contain Pat(X, "First Subsection")

 

// The list that directly contains the first subsection

// i.e. element in lines 5-8.

Elem(X, "ul") directlycontain

Pat(X, "First Subsection")

 

// Innermost list that contains the first subsection,

// i.e. element in lines 5-8.

var x = Elem(X, "ul") contain

Pat(X, "First Subsection");

x !contain x
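The "innermost" behaviour can be checked the same way. The Python sketch below is an illustration only: it assumes a piece is a (first_line, last_line) pair on the sample page, with the pattern match modelled as the single line 6. The function names are hypothetical.

```python
# Model (assumption): a piece is (first_line, last_line) on the sample page.

def contains(p, q):
    """p strictly contains q."""
    return p[0] < q[0] and q[1] < p[1]

def contain(P, Q):
    return [p for p in P if any(contains(p, q) for q in Q)]

def not_contain(P, Q):
    return [p for p in P if not any(contains(p, q) for q in Q)]

def directlycontain(P, Q):
    """Pieces of P containing some q, with no other piece of P in between."""
    return [
        p for p in P
        if any(
            contains(p, q)
            and not any(contains(r, q) and contains(p, r) for r in P)
            for q in Q
        )
    ]

ul = [(1, 11), (5, 8)]
pattern = [(6, 6)]  # the text "First Subsection" on line 6

x = contain(ul, pattern)

print(x)                             # [(1, 11), (5, 8)]
print(directlycontain(ul, pattern))  # [(5, 8)]
print(not_contain(x, x))             # [(5, 8)]
```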

 

IV. Regional Operators

The regional operators construct new pieces to identify parts of a page. (Many other operators return only pieces that already existed before the operator was applied.)

P without Q

The without operator returns the pieces of P with the parts that overlap a piece of Q "cut" away. This might involve creating several new pieces from a single piece of P and inserting new unnamed tags as necessary. The figure Operation of P without Q gives an example where the word WebL is removed from a sentence. Note how unnamed tags are inserted to the left and right of piece A. Examples:

// "Cut" up the second table into its

// constituent lines.

Elem(X, "table")[1] without Pat(X, `\n`)

 

// Remove all the bold text from

// the first paragraph.

Elem(X, "p")[0] without Elem(X, "b")
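The cutting behaviour can be pictured as interval subtraction. The Python sketch below is an illustration only, not WebL's implementation: it assumes pieces are half-open (start, end) character ranges and ignores the unnamed tags that WebL inserts into the page. The function name and the sample sentence are hypothetical.

```python
def without(P, Q):
    """Cut every region covered by a piece of Q out of each piece of P."""
    result = []
    for begin, end in P:
        cursor = begin  # start of the part of this piece not yet emitted
        for q_begin, q_end in sorted(Q):
            if q_end <= cursor or q_begin >= end:
                continue  # no overlap with what remains of this piece
            if q_begin > cursor:
                result.append((cursor, q_begin))  # keep the part before the cut
            cursor = max(cursor, q_end)
        if cursor < end:
            result.append((cursor, end))  # keep the part after the last cut
    return result

text = "I like WebL a lot"
start = text.find("WebL")
sentence = [(0, len(text))]
word = [(start, start + len("WebL"))]

pieces = without(sentence, word)
print(pieces)                          # [(0, 7), (11, 17)]
print([text[b:e] for b, e in pieces])  # ['I like ', ' a lot']
```

A single piece of P can yield several result pieces: cutting (2, 4) and (6, 8) out of (0, 10) leaves three pieces in this model.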

 

 

Operation of P without Q

 

P intersect Q

The intersect operator intersects each element of P with all the overlapping pieces of Q. The resulting piece set contains all the parts of P that are in common with pieces of Q. As parts of pieces of P are cut away by the intersection, new pieces need to be created, and thus new unnamed tags are inserted into the page. Another way of thinking about the operator is that it calculates the overlap between pieces. The figure Operation of P intersect Q shows how this is done.

Example:

// The parts of a page that are both italic and bold.

Elem(X, "i") intersect Elem(X, "b")

 

 

Operation of P intersect Q

 

V. Miscellaneous Functions

Children(p)

The Children function returns all the children pieces of piece p. The children of a piece include all the elements directly contained in the piece and all the text segments directly contained in the piece. Markup elements that are only partially inside p because of overlap are not regarded as children of p. For example, the children of the following TD element consisting of nested I and B elements:

<td>abc<i><b>def</b></i>ghi<b>jkl</b>mno</td>

 

are the pieces represented by:

"abc", "<i><b>def</b></i>",

"ghi", "<b>jkl</b>", "mno"

 

Examples:

// Everything inside the first table.

Children(Elem(X, "table")[0])

 

// Program to walk recursively through a page.

var walk = fun(x)

    if Name(x) != "" then // Named piece: descend into its children

        every p in Children(x) do

            walk(p)

        end

    else // Unnamed piece (text segment): print it with its parent's name

        PrintLn(Text(x), " parent=", Name(Parent(x)))

    end

end;

var P = NewPage("<td>abc<i><b>def</b></i>ghi<b>jkl</b>mno</td>", "text/xml");

walk(Elem(P, "td")[0])
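The TD example can be reproduced outside WebL to see which parts count as direct children. The Python sketch below is an illustration only: it uses the standard xml.etree.ElementTree module rather than WebL pieces, and the children helper is a hypothetical stand-in for the Children function that returns strings instead of pieces.

```python
import xml.etree.ElementTree as ET

def children(element):
    """Direct child elements and direct text segments, in page order."""
    parts = []
    if element.text:
        parts.append(element.text)  # text before the first child element
    for child in element:
        # Serialize the child without the text that follows it.
        tail, child.tail = child.tail, None
        parts.append(ET.tostring(child, encoding="unicode"))
        child.tail = tail
        if tail:
            parts.append(tail)  # text between this child and the next
    return parts

td = ET.fromstring("<td>abc<i><b>def</b></i>ghi<b>jkl</b>mno</td>")
result = children(td)
print(result)  # ['abc', '<i><b>def</b></i>', 'ghi', '<b>jkl</b>', 'mno']
```

The nested B element inside the I element is not listed separately, because it is a child of the I element and not of the TD element.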

 

 

Parent(p)

The Parent function returns the direct parent (enclosing) element of piece p. It is implemented by looking at named tags t from right to left starting just before the left tag of p, identifying the piece q that tag t belongs to, and determining if the corresponding end tag of q follows the end tag of p. Example:

// Locate the Parent element of the second table.

Parent(Elem(X, "table")[1])

 

Flatten(P)

The Flatten function returns the union of all elements of P. Intuitively, two overlapping pieces p and q of P are replaced repeatedly by a single "joined" piece that covers the union of the regions that p and q cover. This also has the effect of removing nested elements of P. New unnamed tags are inserted into the page to create these new pieces. The figure Flattening a Piece Set shows how two overlapping pieces are flattened.

 

Flattening a Piece Set
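Flattening behaves like merging overlapping intervals. The Python sketch below is an illustration only, not WebL's implementation: it assumes pieces are half-open (start, end) ranges, that pieces which merely touch are not joined, and it ignores the unnamed tags that WebL inserts. The function name is hypothetical.

```python
def flatten(P):
    """Join overlapping or nested pieces into non-overlapping regions."""
    merged = []
    for begin, end in sorted(P):
        if merged and begin < merged[-1][1]:
            # Overlaps (or is nested in) the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged

flat = flatten([(0, 5), (3, 8), (10, 12), (4, 6)])
print(flat)  # [(0, 8), (10, 12)]
```

Here (0, 5) and (3, 8) overlap and are joined into (0, 8), the nested piece (4, 6) disappears into that region, and (10, 12) is left unchanged.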

 

Content(p)

The Content function returns the content of piece p. The content of a piece is the part of the page between the begin and end tag of p (exclusive). The Content function can also be applied to a page object, in which case a piece is returned that starts at the beginning of the page and ends at the end of the page. In both cases, new unnamed tags are inserted into the page (see the figure Application of the Content Function). For example, given a page X:

<td>abc<i>def</i></td>


we can calculate the following:

// Content of the TD element,

// i.e. "abc<i>def</i>".

Content(Elem(X, "td")[0])


// Content of the whole page,

// i.e. "<td>abc<i>def</i></td>".

Content(X)

 

Application of the Content Function

 

 

Piece and Piece Set Operators

Function

Description

+(q1: piece, q2: piece): pieceset
+(q: piece, s: pieceset): pieceset
+(s: pieceset, q: piece): pieceset
+(s1: pieceset, s2: pieceset): pieceset

Piece set union.

-(q1: piece, q2: piece): pieceset
-(q: piece, s: pieceset): pieceset
-(s: pieceset, q: piece): pieceset
-(s1: pieceset, s2: pieceset): pieceset

Piece set difference.

*(q1: piece, q2: piece): pieceset
*(q: piece, s: pieceset): pieceset
*(s: pieceset, q: piece): pieceset
*(s1: pieceset, s2: pieceset): pieceset

Piece set intersection.

[](s: pieceset, i: int): piece

Indexing into a piece set. Pieces are numbered 0 to Size - 1.

inside(p: piece, q: piece): pieceset
inside(p: pieceset, q: piece): pieceset
inside(p: piece, q: pieceset): pieceset
inside(p: pieceset, q: pieceset): pieceset

All the elements of p that are located inside any element of q.

!inside(p: piece, q: piece): pieceset
!inside(p: pieceset, q: piece): pieceset
!inside(p: piece, q: pieceset): pieceset
!inside(p: pieceset, q: pieceset): pieceset

All the elements of p that are not located inside any element of q.

directlyinside(p: piece, q: piece): pieceset
directlyinside(p: pieceset, q: piece): pieceset
directlyinside(p: piece, q: pieceset): pieceset
directlyinside(p: pieceset, q: pieceset): pieceset

All the elements of p that are directly inside any element of q.

!directlyinside(p: piece, q: piece): pieceset
!directlyinside(p: pieceset, q: piece): pieceset
!directlyinside(p: piece, q: pieceset): pieceset
!directlyinside(p: pieceset, q: pieceset): pieceset

All the elements of p that are not directly inside any element of q.

contain(p: piece, q: piece): pieceset
contain(p: pieceset, q: piece): pieceset
contain(p: piece, q: pieceset): pieceset
contain(p: pieceset, q: pieceset): pieceset

All the elements of p that contain any element of q.

!contain(p: piece, q: piece): pieceset
!contain(p: pieceset, q: piece): pieceset
!contain(p: piece, q: pieceset): pieceset
!contain(p: pieceset, q: pieceset): pieceset

All the elements of p that do not contain any element of q.

directlycontain(p: piece, q: piece): pieceset
directlycontain(p: pieceset, q: piece): pieceset
directlycontain(p: piece, q: pieceset): pieceset
directlycontain(p: pieceset, q: pieceset): pieceset

All the elements of p that directly contain any element of q.

!directlycontain(p: piece, q: piece): pieceset
!directlycontain(p: pieceset, q: piece): pieceset
!directlycontain(p: piece, q: pieceset): pieceset
!directlycontain(p: pieceset, q: pieceset): pieceset

All the elements of p that do not directly contain any element of q.

after(p: piece, q: piece): pieceset
after(p: pieceset, q: piece): pieceset
after(p: piece, q: pieceset): pieceset
after(p: pieceset, q: pieceset): pieceset

All the elements of p that are after any element of q.

!after(p: piece, q: piece): pieceset
!after(p: pieceset, q: piece): pieceset
!after(p: piece, q: pieceset): pieceset
!after(p: pieceset, q: pieceset): pieceset

All the elements of p that are not after any element of q.

directlyafter(p: piece, q: piece): pieceset
directlyafter(p: pieceset, q: piece): pieceset
directlyafter(p: piece, q: pieceset): pieceset
directlyafter(p: pieceset, q: pieceset): pieceset

All the elements of p that follow directly after any element of q.

!directlyafter(p: piece, q: piece): pieceset
!directlyafter(p: pieceset, q: piece): pieceset
!directlyafter(p: piece, q: pieceset): pieceset
!directlyafter(p: pieceset, q: pieceset): pieceset

All the elements of p that do not follow directly after any element of q.

before(p: piece, q: piece): pieceset
before(p: pieceset, q: piece): pieceset
before(p: piece, q: pieceset): pieceset
before(p: pieceset, q: pieceset): pieceset

All the elements of p that precede any element of q.

!before(p: piece, q: piece): pieceset
!before(p: pieceset, q: piece): pieceset
!before(p: piece, q: pieceset): pieceset
!before(p: pieceset, q: pieceset): pieceset

All the elements of p that do not precede any element of q.

directlybefore(p: piece, q: piece): pieceset
directlybefore(p: pieceset, q: piece): pieceset
directlybefore(p: piece, q: pieceset): pieceset
directlybefore(p: pieceset, q: pieceset): pieceset

All the elements of p that are directly before any element of q.

!directlybefore(p: piece, q: piece): pieceset
!directlybefore(p: pieceset, q: piece): pieceset
!directlybefore(p: piece, q: pieceset): pieceset
!directlybefore(p: pieceset, q: pieceset): pieceset

All the elements of p that are not directly before any element of q.

 

overlap(p: piece, q: piece): pieceset
overlap(p: pieceset, q: piece): pieceset
overlap(p: piece, q: pieceset): pieceset
overlap(p: pieceset, q: pieceset): pieceset

All the elements of p that overlap any element in q.

!overlap(p: piece, q: piece): pieceset
!overlap(p: pieceset, q: piece): pieceset
!overlap(p: piece, q: pieceset): pieceset
!overlap(p: pieceset, q: pieceset): pieceset

All the elements of p that do not overlap any element in q.

without(p: piece, q: piece): pieceset
without(p: pieceset, q: piece): pieceset
without(p: piece, q: pieceset): pieceset
without(p: pieceset, q: pieceset): pieceset

All the elements of p where overlap with any element of q has been removed.

intersect(p: piece, q: piece): pieceset
intersect(p: pieceset, q: piece): pieceset
intersect(p: piece, q: pieceset): pieceset
intersect(p: pieceset, q: pieceset): pieceset

All the elements of p that overlap an element in q, each of them repeatedly intersected with all overlapping elements in q.

 

Piece and Piece Set Functions

Function

Description

Children(q: piece): pieceset

Returns a piece set consisting of all the direct children elements of q in the markup parse tree, unioned with pieces representing all the text segments in q (without all the nested text segments).

Parent(q: piece): piece

Returns the element in which q is nested (direct parent in the parse tree).

Flatten(s: pieceset): pieceset

Returns a "flattened" piece set (without any overlappings) of all the parts of the page that piece set s covers.

Content(p: page): piece

Returns a piece that encompasses the whole page p.

Content(q: piece): piece

Returns a piece inside q, representing everything that is inside q excluding the begin tag and end tag of q.

Formal Definitions of Piece Set Operators

Operator

Definition

P + Q

P ∪ { q ∈ Q | ¬∃ p: p ∈ P ∧ p equal q }

P - Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p equal q }

P * Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p equal q }

P inside Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p inside q }

P !inside Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p inside q }

P directlyinside Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p inside q ∧ (¬∃ r: r ∈ P ∧ r inside q ∧ p inside r) }

P !directlyinside Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p inside q ∧ (¬∃ r: r ∈ P ∧ r inside q ∧ p inside r) }

P contain Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p contain q }

P !contain Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p contain q }

P directlycontain Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p contain q ∧ (¬∃ r: r ∈ P ∧ r contain q ∧ p contain r) }

P !directlycontain Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p contain q ∧ (¬∃ r: r ∈ P ∧ r contain q ∧ p contain r) }

P after Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p after q }

P !after Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p after q }

P directlyafter Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p after q ∧ (¬∃ r: r ∈ P ∧ r after q ∧ p after r) }

P !directlyafter Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p after q ∧ (¬∃ r: r ∈ P ∧ r after q ∧ p after r) }

P before Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p before q }

P !before Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p before q }

P directlybefore Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p before q ∧ (¬∃ r: r ∈ P ∧ r before q ∧ p before r) }

P !directlybefore Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p before q ∧ (¬∃ r: r ∈ P ∧ r before q ∧ p before r) }

P overlap Q

{ p ∈ P | ∃ q: q ∈ Q ∧ p overlap q }

P !overlap Q

{ p ∈ P | ¬∃ q: q ∈ Q ∧ p overlap q }

