Regular Expressions

This section summarizes the syntax of Perl5 regular expressions, all of which WebL supports. For a definitive reference, consult the perlre man page that accompanies the Perl5 distribution and the book Programming Perl, 2nd Edition, from O'Reilly & Associates. For efficiency reasons, the character set operator [...] works only on 8-bit characters (Unicode characters 0 through 255). Apart from this restriction, all Unicode characters should be usable in the package's regular expressions.

Perl5 regular expressions consist of:

Alternatives separated by |

Quantified atoms (see Quantified Atoms)

Atoms

An atom is a regular expression within parentheses, a character class (e.g., [abcd]), a range (e.g., [a-z]), or one of the patterns listed under Atoms. Special backslashed characters work within a character class, except for backreferences and boundaries; inside a character class, \b means backspace. Any other backslashed character matches itself. Expressions within parentheses are matched as subpattern groups and saved for use by certain methods.
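The building blocks above can be tried out in any Perl5-style engine. The following is a minimal sketch that uses Python's re module as a stand-in, on the assumption that it behaves like WebL's engine for these basic constructs; the sample strings and variable names are made up for illustration.

```python
import re

# Alternatives separated by |, wrapped in a saved (capturing) group.
pattern = re.compile(r"(cat|dog)s?")
m = pattern.search("three dogs")
matched_text = m.group(0)   # the whole match
saved_group = m.group(1)    # the text saved by the parentheses

# A character class and a range.
vowels = re.findall(r"[aeiou]", "regular")
lower_run = re.search(r"[a-z]+", "ABC def GHI").group(0)

# Inside a character class, \b means backspace, not a word boundary.
backspace_found = re.search(r"[\b]", "a\bc") is not None

print(matched_text, saved_group, vowels, lower_run, backspace_found)
```

The saved group is what the text above calls a subpattern group: the engine remembers it separately from the overall match.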

Quantified Atoms

Pattern    Description
{n,m}      Match at least n but not more than m times.
{n,}       Match at least n times.
{n}        Match exactly n times.
*          Match 0 or more times.
+          Match 1 or more times.
?          Match 0 or 1 times.
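As a rough illustration of the quantifiers in this table, the sketch below checks whether a pattern matches an entire string. It uses Python's re module rather than WebL, and the helper name whole_match is hypothetical.

```python
import re

def whole_match(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "a{2,3} on 'aa'":   whole_match(r"a{2,3}", "aa"),    # within 2..3 repetitions
    "a{2,3} on 'aaaa'": whole_match(r"a{2,3}", "aaaa"),  # too many repetitions
    "a{2,} on 'aaaa'":  whole_match(r"a{2,}", "aaaa"),   # no upper limit
    "a{2} on 'aaa'":    whole_match(r"a{2}", "aaa"),     # must be exactly 2
    "ab*c on 'ac'":     whole_match(r"ab*c", "ac"),      # zero b's allowed
    "ab+c on 'ac'":     whole_match(r"ab+c", "ac"),      # at least one b required
    "ab?c on 'abc'":    whole_match(r"ab?c", "abc"),     # optional b present
}

for label, result in checks.items():
    print(label, "->", result)
```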

By default, a quantified subpattern is greedy: it matches as many times as possible without causing the rest of the pattern to fail. To make a quantifier match the minimum number of times possible, again without causing the rest of the pattern to fail, place a "?" right after the quantifier (see Quantified Atoms with Minimal Matching). Perl5 extended regular expressions are fully supported (see Perl5 Extended Regular Expressions).
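The difference between greedy and minimal matching is easiest to see side by side. This is a small sketch in Python's re module, which follows the same Perl5 convention for the trailing "?"; the sample text is invented.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can while still letting the final > match.
greedy = re.search(r"<.+>", text).group(0)

# Minimal: .+? stops at the first position where the final > can match.
minimal = re.search(r"<.+?>", text).group(0)

# The same applies to counted quantifiers.
greedy_digits = re.search(r"\d{2,4}", "123456").group(0)
minimal_digits = re.search(r"\d{2,4}?", "123456").group(0)

print(greedy)
print(minimal)
print(greedy_digits, minimal_digits)
```

The greedy form spans from the first "<" to the last ">", while the minimal form stops at the first tag.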

Regular Expression Tips

Combining regular expressions and WebL code can be confusing at times. The following tips might help:

When possible, write WebL regular expressions in single back quotes, e.g. `ab\nc`. This switches off escape character expansion and prevents WebL from complaining about illegal escape sequences like "\d".

When matching URLs, keep in mind that "." and "?" do not have a literal meaning in regular expressions. Use "[]" character classes to match these symbols literally, e.g. write "www[.]xyz[.]com" instead of "www.xyz.com".
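The URL tip can be checked with a short experiment. The sketch below uses Python's re module, not WebL; Python's raw strings (r"...") play a role loosely comparable to WebL's back-quoted strings, which is an analogy rather than a statement about WebL's implementation. The host and file names are invented.

```python
import re

loose = r"www.xyz.com"        # each "." matches any character
strict = r"www[.]xyz[.]com"   # each "[.]" matches only a literal dot

# The loose pattern accepts a string that is clearly not the intended host.
loose_accepts_wrong = re.fullmatch(loose, "wwwAxyzBcom") is not None
strict_accepts_wrong = re.fullmatch(strict, "wwwAxyzBcom") is not None
strict_accepts_right = re.fullmatch(strict, "www.xyz.com") is not None

# The same trick works for "?" in a query string.
query = r"index[.]html[?]id=\d+"
query_matches = re.fullmatch(query, "index.html?id=42") is not None

print(loose_accepts_wrong, strict_accepts_wrong, strict_accepts_right, query_matches)
```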

 

Quantified Atoms with Minimal Matching

Pattern    Description
{n,m}?     Matches at least n but not more than m times, preferring the fewest.
{n,}?      Matches at least n times, preferring the fewest.
{n}?       Matches exactly n times.
*?         Matches 0 or more times, preferring the fewest.
+?         Matches 1 or more times, preferring the fewest.
??         Matches 0 or 1 times, preferring 0.

 

Atoms

Pattern           Description
.                 Matches any character except \n.
^                 Null token matching the beginning of a string or line (i.e. the
                  position right after a newline or right before the beginning of
                  a string).
$                 Null token matching the end of a string or line (i.e. the
                  position right before a newline or right after the end of a
                  string).
\b                Null token matching a word boundary (\w on one side and \W on
                  the other).
\B                Null token matching a position that is not a word boundary.
\A                Matches only at the beginning of the string.
\Z                Matches only at the end of the string (or before a newline at
                  the end).
\n                Newline.
\r                Carriage return.
\t                Tab.
\f                Formfeed.
\d                Digit [0-9].
\D                Non-digit [^0-9].
\w                Word character [0-9a-z_A-Z].
\W                Non-word character [^0-9a-z_A-Z].
\s                A whitespace character [ \t\n\r\f].
\S                A non-whitespace character [^ \t\n\r\f].
\xnn              Hexadecimal representation of a character.
\cD               Matches the corresponding control character.
\nn or \nnn       Octal representation of a character, unless it is a
                  backreference.
\1, \2, \3, etc.  Matches whatever the first, second, third, etc. parenthesized
                  group matched. This is called a backreference. If there is no
                  corresponding group, the number is interpreted as an octal
                  representation of a character.
\0                Matches the null character.
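A few of the atoms above are sketched below using Python's re module as a stand-in engine. One difference to keep in mind: Python treats ^ and $ as line anchors only when the MULTILINE flag is set, so the sketch sets it explicitly; whether WebL needs an equivalent modifier is not covered here. All sample strings are invented.

```python
import re

# \b keeps "the" from matching inside "other" or "then".
whole_words = re.findall(r"\bthe\b", "the other then the")

# \1 refers back to whatever the first group matched (a doubled word here).
doubled = re.search(r"\b(\w+) \1\b", "the cat sat on the the mat")
doubled_text = doubled.group(0)
doubled_word = doubled.group(1)

# ^ and $ as line anchors (Python needs re.MULTILINE for this behavior).
lines = "one two\nthree four"
first_words = re.findall(r"^\w+", lines, re.MULTILINE)
last_words = re.findall(r"\w+$", lines, re.MULTILINE)

# \s covers space, tab, and newline; \x41 is the character "A".
fields = re.split(r"\s+", "a \t b\nc")
hex_a = re.fullmatch(r"\x41", "A") is not None

print(whole_words, doubled_text, first_words, last_words, fields, hex_a)
```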

 

Perl5 Extended Regular Expressions

Extended Pattern  Description
(?#text)          An embedded comment causing text to be ignored.
(?:regexp)        Groups things like "()" but does not cause the group match to
                  be saved.
(?=regexp)        A zero-width positive lookahead assertion. For example,
                  \w+(?=\s) matches a word followed by whitespace, without
                  including the whitespace in the match result.
(?!regexp)        A zero-width negative lookahead assertion. For example,
                  foo(?!bar) matches any occurrence of "foo" that is not
                  followed by "bar". Remember that this is a zero-width
                  assertion, which means that a(?!b)d will match ad because a is
                  followed by a character that is not b (the d) and a d follows
                  the zero-width assertion.
(?imsx)           One or more embedded pattern-match modifiers. i enables case
                  insensitivity, m enables multiline treatment of the input, s
                  enables single line treatment of the input, and x enables
                  extended whitespace comments.
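The extended patterns can be exercised with the examples given in the descriptions. This sketch again relies on Python's re module, which accepts the same (?...) syntax for these constructs; it is meant as an illustration rather than a statement about WebL's engine, and the sample strings are invented.

```python
import re

# (?=regexp): words followed by whitespace, whitespace not included.
words_before_space = re.findall(r"\w+(?=\s)", "alpha beta gamma")

# (?!regexp): start offsets of each "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# The zero-width nature of the assertion: a(?!b)d matches "ad".
ad_matches = re.fullmatch(r"a(?!b)d", "ad") is not None

# (?:regexp): groups without saving, so only the "c" group is saved.
saved = re.search(r"(?:ab)+(c)", "ababc").groups()

# (?i) as an embedded modifier, and (?#text) as an embedded comment.
case_insensitive = re.search(r"(?i)webl", "WebL") is not None
with_comment = re.search(r"ab(?#this text is ignored)c", "abc").group(0)

print(words_before_space, foo_starts, ad_matches, saved, case_insensitive, with_comment)
```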

 

