Regular Expressions
Here we summarize the syntax of Perl5 regular expressions, all of which is supported by WebL. For a definitive reference, consult the perlre man page that accompanies the Perl5 distribution and the book Programming Perl, 2nd Edition, from O'Reilly & Associates. For efficiency reasons, the character set operator [...] works only on 8-bit characters (Unicode characters 0 through 255). Apart from this restriction, all Unicode characters should be usable in the package's regular expressions.
Perl5 regular expressions consist of:
Alternatives separated by |
Quantified atoms (see Quantified Atoms)
Atoms: a regular expression within parentheses, a character class (e.g. [abcd]), a range (e.g. [a-z]), or one of the patterns listed under Atoms
Special backslashed characters work within a character class, except for backreferences and boundaries. Inside a character class, \b is backspace. Any other backslashed character matches itself. Expressions within parentheses are matched as subpattern groups and saved for use by certain methods.
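The building blocks above can be tried out in a small sketch. It uses Python's re module as a stand-in, on the assumption that it treats these particular constructs the same way as Perl5; it is not WebL code, and the sample strings are made up for illustration.

```python
import re

# Two alternatives separated by |, wrapped in a parenthesized group,
# followed by a second group built from a character range.
pattern = re.compile(r"(cat|dog)s? and ([a-z]+)")

m = pattern.search("two dogs and birds")
whole = m.group(0)    # text matched by the entire pattern
first = m.group(1)    # text saved by the first parenthesized group
second = m.group(2)   # text saved by the second parenthesized group

# Inside a character class, \b means backspace (character 8),
# not a word boundary.
backspace_hit = re.search(r"[\b]", "a\x08b") is not None

print(whole, first, second, backspace_hit)
```

The saved groups are what a host language exposes after a match; here they are read back with group(1) and group(2).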
Quantified Atoms
| Pattern | Description |
|---|---|
| {n,m} | Match at least n but not more than m times. |
| {n,} | Match at least n times. |
| {n} | Match exactly n times. |
| * | Match 0 or more times. |
| + | Match 1 or more times. |
| ? | Match 0 or 1 times. |
By default, a quantified subpattern is greedy: it matches as many times as possible without causing the rest of the pattern to fail. To make a quantifier match the minimum number of times possible, again without causing the rest of the pattern to fail, place a "?" right after the quantifier (see Quantified Atoms with Minimal Matching). Perl5 extended regular expressions are fully supported (see Perl5 Extended Regular Expressions).
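The difference between greedy and minimal matching is easiest to see side by side. The following is a minimal sketch using Python's re module as a stand-in for the Perl5 syntax (an assumption; it is not WebL code), with made-up input strings.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as it can while still letting </b> match,
# so the match runs to the LAST closing tag.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Minimal: .*? grabs as little as it can, so the match stops at the
# FIRST closing tag.
minimal = re.search(r"<b>.*?</b>", text).group(0)

# The same applies to counted quantifiers.
bounded_greedy = re.search(r"\d{2,4}", "123456").group(0)    # "1234"
bounded_minimal = re.search(r"\d{2,4}?", "123456").group(0)  # "12"

print(greedy)
print(minimal)
print(bounded_greedy, bounded_minimal)
```

Minimal matching is usually the better choice when extracting one tagged fragment out of several on the same line.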
Regular Expression Tips
Combining regular expressions and WebL code can sometimes be confusing. The following tips might help:
When possible, write WebL regular expressions in back quotes, e.g. `ab\nc`. This switches off escape character expansion and prevents WebL from complaining about illegal escape sequences like "\d".
When matching URLs, keep in mind that "." and "?" do not have a literal meaning in regular expressions. Use "[]" character classes to match these symbols literally, e.g. write "www[.]xyz[.]com" instead of "www.xyz.com".
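The URL tip can be checked with a short sketch. It uses Python's re module as a stand-in for the Perl5 syntax (an assumption; it is not WebL code), and the host and file names are invented for illustration. Python's raw strings (r"...") play a role loosely comparable to WebL's back quotes in that backslashes are passed through untouched.

```python
import re

loose = r"www.xyz.com"         # "." matches any character except newline
strict = r"www[.]xyz[.]com"    # "[.]" matches only a literal dot

# The loose pattern accepts a string that is not the intended host name.
loose_accepts_junk = re.fullmatch(loose, "wwwaxyzbcom") is not None

# The strict pattern rejects it and still accepts the real host name.
strict_accepts_junk = re.fullmatch(strict, "wwwaxyzbcom") is not None
strict_accepts_host = re.fullmatch(strict, "www.xyz.com") is not None

# "?" is a quantifier, so a literal question mark also needs a class.
query = r"index[.]html[?]id=\d+"
query_ok = re.fullmatch(query, "index.html?id=42") is not None

print(loose_accepts_junk, strict_accepts_junk, strict_accepts_host, query_ok)
```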
Quantified Atoms with Minimal Matching
| Pattern | Description |
|---|---|
| {n,m}? | Matches at least n but not more than m times, as few as possible. |
| {n,}? | Matches at least n times, as few as possible. |
| {n}? | Matches exactly n times. |
| *? | Matches 0 or more times, as few as possible. |
| +? | Matches 1 or more times, as few as possible. |
| ?? | Matches 0 or 1 times, preferring 0. |
Atoms
| Pattern | Description |
|---|---|
| . | Matches any character except \n. |
| ^ | Null token matching the beginning of a string or line (i.e. the position right after a newline or right before the beginning of a string). |
| $ | Null token matching the end of a string or line (i.e. the position right before a newline or right after the end of a string). |
| \b | Null token matching a word boundary (\w on one side and \W on the other). |
| \B | Null token matching a boundary that is not a word boundary. |
| \A | Matches only at beginning of string. |
| \Z | Matches only at end of string (or before newline at the end). |
| \n | Newline. |
| \r | Carriage return. |
| \t | Tab. |
| \f | Formfeed. |
| \d | Digit [0-9]. |
| \D | Non-digit [^0-9]. |
| \w | Word character [0-9a-zA-Z_]. |
| \W | Non-word character [^0-9a-zA-Z_]. |
| \s | A whitespace character [ \t\n\r\f]. |
| \S | A non-whitespace character [^ \t\n\r\f]. |
| \xnn | Hexadecimal representation of character. |
| \cD | Matches the corresponding control character. |
| \nn or \nnn | Octal representation of character unless a backreference. |
| \1, \2, \3, etc. | Matches whatever the first, second, third, etc. parenthesized group matched. This is called a backreference. If there is no corresponding group, the number is interpreted as an octal representation of a character. |
| \0 | Matches null character. |
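A few of the atoms above are shown together in the sketch below: word boundaries, a backreference, the shorthand classes, and a hexadecimal escape. It uses Python's re module as a stand-in for the Perl5 syntax (an assumption; it is not WebL code, and not every atom in the table is accepted by Python). The sample sentences are invented.

```python
import re

# \b anchors the match to whole words; \1 repeats whatever group 1 matched.
doubled_word = re.compile(r"\b(\w+)\s+\1\b")

hit = doubled_word.search("this is is a test")
repeated = hit.group(1)

# "this is" contains the letters "is is" across a word break, but not
# as two whole words, so the bounded pattern finds nothing.
no_hit = doubled_word.search("this is a test")

# Without the boundaries, the same text produces a false positive.
false_positive = re.search(r"(\w+)\s+\1", "this is a test").group(0)

# \d and \s are shorthand classes; \x41 is the character with
# hexadecimal code 41 ("A").
numbers = re.findall(r"\d+", "10 apples,\t200 pears")
hex_hit = re.fullmatch(r"\x41\s\x42", "A B") is not None

print(repeated, no_hit, false_positive, numbers, hex_hit)
```

The false positive illustrates why boundary atoms are worth adding whenever a backreference is meant to compare whole words.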
Perl5 Extended Regular Expressions
| Extended Pattern | Description |
|---|---|
| (?#text) | An embedded comment causing text to be ignored. |
| (?:regexp) | Groups things like "()" but does not cause the group match to be saved. |
| (?=regexp) | A zero-width positive lookahead assertion. For example, \w+(?=\s) matches a word followed by whitespace, without including whitespace in the match result. |
| (?!regexp) | A zero-width negative lookahead assertion. For example foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Remember that this is a zero-width assertion, which means that a(?!b)d will match ad because a is followed by a character that is not b (the d) and a d follows the zero-width assertion. |
| (?imsx) | One or more embedded pattern-match modifiers. i enables case insensitivity, m enables multiline treatment of the input, s enables single line treatment of the input, and x enables extended whitespace comments. |
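The examples given in the table can be run as a quick check. The sketch below uses Python's re module as a stand-in for the Perl5 syntax (an assumption; it is not WebL code), and the input strings are invented for illustration.

```python
import re

# (?=regexp): words followed by whitespace, whitespace not included.
words = re.findall(r"\w+(?=\s)", "one two three")

# (?!regexp): every "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# The assertion consumes nothing, so the d still has to match right after a.
ad_matches = re.fullmatch(r"a(?!b)d", "ad") is not None
abd_matches = re.fullmatch(r"a(?!b)d", "abd") is not None

# (?:regexp): groups for the quantifier but saves nothing, so only the
# plain parenthesized group shows up in the saved groups.
saved = re.search(r"(?:ab)+(c)", "ababc").groups()

# (?i): embedded modifier for case-insensitive matching.
case_insensitive = re.fullmatch(r"(?i)webl", "WebL") is not None

# (?#text): embedded comment, ignored by the matcher.
with_comment = re.fullmatch(r"ab(?#a comment)c", "abc") is not None

print(words, foo_starts, ad_matches, abd_matches, saved)
print(case_insensitive, with_comment)
```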