Regular Expressions

This section summarizes the syntax of Perl5 regular expressions, all of which WebL supports. For a definitive reference, consult the perlre man page that accompanies the Perl5 distribution, or the book Programming Perl, 2nd Edition, from O'Reilly & Associates. Note that, for efficiency reasons, the character set operator [...] works only on 8-bit characters (Unicode characters 0 through 255). Apart from this restriction, all Unicode characters should be usable in the package's regular expressions.

Perl5 regular expressions consist of:

Alternatives separated by |

Quantified atoms (see Quantified Atoms)

Atoms

An atom is a regular expression within parentheses, a character class (e.g. [abcd]), a range (e.g. [a-z]), or one of the patterns listed under Atoms. Special backslashed characters work within a character class, except for backreferences and boundaries; \b means backspace inside a character class. Any other backslashed character matches itself. Expressions within parentheses are matched as subpattern groups and saved for use by certain methods.
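The building blocks above can be sketched in a few lines. This example uses Python's re module purely as a stand-in, on the assumption that it treats these particular Perl-style constructs the same way; it is not WebL code, and the sample strings are made up for illustration.

```python
import re

# Alternatives separated by |, inside parentheses so the choice is saved
# as subpattern group 1.
m = re.search(r"(cat|dog)s?", "three dogs")
whole_match = m.group(0)   # the entire matched text
saved_group = m.group(1)   # the text matched by the parenthesized group

# A character class containing a range.
letters = re.findall(r"[a-c]", "abcdef")

# Inside a character class, \b stands for backspace (character 8),
# not for a word boundary.
has_backspace = re.search(r"[\b]", "x\x08y") is not None
no_backspace = re.search(r"[\b]", "x y") is None
```

The saved group is what functions that report subpattern matches would hand back; the whole match is always available separately.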

Quantified Atoms

Pattern    Description
{n,m}      Match at least n but not more than m times.
{n,}       Match at least n times.
{n}        Match exactly n times.
*          Match 0 or more times.
+          Match 1 or more times.
?          Match 0 or 1 times.

By default, a quantified subpattern is greedy: it matches as many times as possible without causing the rest of the pattern to fail. To make a quantifier match the minimum number of times possible, again without causing the rest of the pattern to fail, place a "?" right after the quantifier (see Quantified Atoms with Minimal Matching). Perl5 extended regular expressions are fully supported (see Perl5 Extended Regular Expressions).
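The difference between greedy and minimal matching is easiest to see on markup-like input. The sketch below uses Python's re module as a stand-in for the Perl5 behavior described here (an assumption; it is not WebL code), and the input strings are invented for the example.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy: .* grabs as much as it can, so the match runs to the LAST </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Minimal: .*? grabs as little as it can, so the match stops at the FIRST </b>.
minimal = re.search(r"<b>.*?</b>", text).group(0)

# The same applies to counted quantifiers.
greedy_count = re.search(r"a{2,4}", "aaaa").group(0)
minimal_count = re.search(r"a{2,4}?", "aaaa").group(0)
```

Minimal quantifiers are usually the better choice when extracting one element at a time from a page, since a greedy pattern can silently swallow several elements.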

Regular Expression Tips

Combining regular expressions and WebL code can be a little confusing. The following tips might help:

When possible, write WebL regular expressions in single back quotes, e.g. `ab\nc`. This switches off escape character expansion and prevents WebL from complaining about illegal escape sequences like "\d".

When matching URLs, keep in mind that "." and "?" do not have a literal meaning in regular expressions. Use "[]" character classes to match these symbols literally, e.g. write "www[.]xyz[.]com" instead of "www.xyz.com".
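Both tips can be illustrated outside WebL. In the sketch below, Python raw strings (r"...") play roughly the role that back-quoted strings play in WebL, in that backslashes reach the regular expression engine unchanged; this analogy is an assumption made for illustration. The host name and query string are the made-up examples from the tip above.

```python
import re

# An unescaped "." matches any character, so this pattern is too permissive.
loose = re.compile(r"www.xyz.com")
# Wrapping "." in a character class makes it match a literal dot only.
strict = re.compile(r"www[.]xyz[.]com")

loose_false_positive = loose.search("wwwAxyzBcom") is not None
strict_rejects = strict.search("wwwAxyzBcom") is None
strict_accepts = strict.search("http://www.xyz.com/") is not None

# An unescaped "?" makes the preceding character optional instead of
# matching a literal question mark, so the first pattern misses the URL.
bad_query = re.search(r"index.html?id=\d+", "index.html?id=42") is None
good_query = re.search(r"index[.]html[?]id=\d+", "index.html?id=42") is not None
```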

 

Quantified Atoms with Minimal Matching

Pattern    Description
{n,m}?     Matches at least n but not more than m times.
{n,}?      Matches at least n times.
{n}?       Matches exactly n times.
*?         Matches 0 or more times.
+?         Matches 1 or more times.
??         Matches 0 or 1 times.

Atoms

Pattern             Description
.                   Matches any character except \n.
^                   Null token matching the beginning of a string or line (i.e. the position right after a newline or at the very beginning of the string).
$                   Null token matching the end of a string or line (i.e. the position right before a newline or at the very end of the string).
\b                  Null token matching a word boundary (\w on one side and \W on the other).
\B                  Null token matching a position that is not a word boundary.
\A                  Matches only at beginning of string.
\Z                  Matches only at end of string (or before newline at the end).
\n                  Newline.
\r                  Carriage return.
\t                  Tab.
\f                  Formfeed.
\d                  Digit [0-9].
\D                  Non-digit [^0-9].
\w                  Word character [0-9a-zA-Z_].
\W                  Non-word character [^0-9a-zA-Z_].
\s                  A whitespace character [ \t\n\r\f].
\S                  A non-whitespace character [^ \t\n\r\f].
\xnn                Hexadecimal representation of character.
\cD                 Matches the corresponding control character.
\nn or \nnn         Octal representation of character unless a backreference.
\1, \2, \3, etc.    Matches whatever the first, second, third, etc. parenthesized group matched. This is called a backreference. If there is no corresponding group, the number is interpreted as an octal representation of a character.
\0                  Matches null character.
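A few of the atoms above, shown in action. As elsewhere, this is a sketch in Python's re module rather than WebL code, assuming the two agree on these atoms. One known difference is handled explicitly: Python's \w and \d are Unicode-aware by default, so the re.ASCII flag is passed to approximate the character ranges listed in the table.

```python
import re

# \d matches digits; + collects runs of them.
numbers = re.findall(r"\d+", "a1b22c333", re.ASCII)

# \w covers letters, digits and the underscore, so an identifier
# containing "_" is a single run of word characters.
is_word = re.fullmatch(r"\w+", "snake_case", re.ASCII) is not None

# \b marks a word boundary and \1 refers back to what group 1 matched,
# which together find an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test", re.ASCII)
doubled_word = doubled.group(1)

# \s matches any of space, tab, newline, carriage return, formfeed.
fields = re.split(r"\s+", "one\ttwo \n three")
```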

 

Perl5 Extended Regular Expressions

Extended Pattern    Description
(?#text)            An embedded comment causing text to be ignored.
(?:regexp)          Groups things like "()" but does not cause the group match to be saved.
(?=regexp)          A zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including the whitespace in the match result.
(?!regexp)          A zero-width negative lookahead assertion. For example, foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match ad, because a is followed by a character that is not b (the d) and a d follows the zero-width assertion.
(?imsx)             One or more embedded pattern-match modifiers. i enables case insensitivity, m enables multiline treatment of the input, s enables single-line treatment of the input, and x enables extended syntax, in which whitespace and comments inside the pattern are ignored.
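The lookahead examples from the table can be checked directly. The sketch below runs them in Python's re module, which is assumed here to treat these extended patterns the same way as Perl5; it is not WebL code, and the test strings are invented.

```python
import re

# (?=...) positive lookahead: the whitespace must follow each word
# but is not part of the match.
words = re.findall(r"\w+(?=\s)", "alpha beta gamma")

# (?!...) negative lookahead: only occurrences of "foo" that are
# not followed by "bar" are reported.
starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# The assertion is zero-width: after it succeeds, "d" is matched at the
# same position the assertion looked at.
matches_ad = re.fullmatch(r"a(?!b)d", "ad") is not None
rejects_abd = re.fullmatch(r"a(?!b)d", "abd") is None

# (?:...) groups without saving, so only the "c" group is captured.
captured = re.search(r"(?:ab)+(c)", "ababc").groups()

# (?i) switches on case-insensitive matching from inside the pattern.
case_insensitive = re.search(r"(?i)webl", "WebL") is not None
```

The last word, "gamma", is missing from the first result because nothing follows it, so the lookahead for whitespace fails there.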

 

