Introduction

WebL (pronounced "webble") is a web scripting language for processing documents on the World Wide Web. It is well suited for retrieving documents from the web, extracting information from the retrieved documents, and manipulating the contents of documents. In contrast to other general-purpose programming languages, WebL is specifically designed for automating tasks on the web. Not only does the WebL language have built-in knowledge of web protocols like HTTP and FTP, but it also knows how to process documents in plain text, HTML, and XML format.

The flexible handling of structured text markup as found in HTML and XML documents is an important feature of the language. In addition, WebL also supports features that simplify handling of communication failures, the exploitation of replicated documents on multiple web services for reliability, and performing multiple tasks in parallel. WebL also provides traditional imperative programming language features like objects, modules, closures, control structures, etc.

To give a better idea of how WebL can be applied for web task automation, and also what makes WebL different from other languages, it is instructive to discuss the computational model that underlies the language. In addition to conventional features you would expect from most languages, the WebL computation model is based on two new concepts, namely service combinators and markup algebra. For now we can describe these two concepts of WebL in the following way.

Service combinators are a formalism that can provide more reliable access to web resources and services. Very succinctly, service combinators are an exception handling mechanism that is powerful enough to encode robust behavior when communication failures occur. This concept is especially important for performing any reliable computation on the unreliable structures of the web. It often happens that web services are unavailable, suddenly fail, or become unacceptably slow. These are very serious complications for computations that depend so much on the web infrastructure. Although service combinators cannot make a web-based computation completely failure-proof, they do add a certain amount of robustness to programming on the web. Service combinators are discussed in detail in "Service Combinators".
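As a rough illustration of the idea, the following Python sketch mimics two combinators, sequential combination and retry, on ordinary functions. It is not WebL syntax: the helper names `either` and `retry` and the stand-in services `fetch_primary` and `fetch_mirror` are hypothetical and chosen only for this example.

```python
def either(first, second):
    """Sequential combination: run `first`; if it fails, run `second`."""
    def combined():
        try:
            return first()
        except Exception:
            return second()
    return combined


def retry(service, attempts):
    """Retry: call `service` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def combined():
        last_error = None
        for _ in range(attempts):
            try:
                return service()
            except Exception as error:
                last_error = error
        raise last_error
    return combined


# Hypothetical services standing in for page fetches (no network access).
calls = {"primary": 0}


def fetch_primary():
    calls["primary"] += 1
    raise ConnectionError("primary server is down")


def fetch_mirror():
    return "<html>mirror copy</html>"


# Try the primary server three times, then fall back to the mirror.
robust_fetch = either(retry(fetch_primary, 3), fetch_mirror)
page = robust_fetch()
```

The point of the sketch is that failure handling is composed from small pieces rather than written out as nested error-handling code at every fetch.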

Markup algebra is a formalism for extracting information from structured text documents and manipulating those documents. It consists of functions to extract elements and patterns from web documents, operators to manipulate what has been extracted in this manner, and functions to change a page, for example to insert or delete parts. The functions and operators all work on the high-level concept of a parsed web page, and there is little need to do lower-level string manipulation. Markup algebra is discussed in detail in Chapter 4.
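To make pattern extraction more concrete, here is a minimal Python sketch that finds every occurrence of a regular expression in the tag-free text of a page. It only approximates the idea using Python's standard `html.parser` and `re` modules; the names `TextExtractor`, `page_text`, and `pattern` are assumptions made for this example and are not WebL functions.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the character data of a page, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_text(source):
    """Return the text of the page with all markup removed."""
    extractor = TextExtractor()
    extractor.feed(source)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.chunks)


def pattern(source, regex):
    """Return every occurrence of `regex` in the tag-free text of the page."""
    return re.findall(regex, page_text(source))


page = "<p>Price: <b>$12.50</b> (was <i>$15.00</i>)</p>"
prices = pattern(page, r"\$\d+\.\d\d")
```

Because the search runs over the parsed text rather than the raw source, the pattern does not have to account for the `<b>` and `<i>` tags around the prices.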

The purpose of this document is to introduce programmers to the WebL language and its features. Before introducing the language in full, however, we first summarize WebL's main features.

Basic Features

The WebL language and system are designed for rapid prototyping of Web computations. They are well suited for the automation of tasks on the WWW.

WebL's emphasis is on high flexibility and high-level abstractions rather than raw computation speed. It is thus better suited as a rapid prototyping tool than a high-volume production tool.

WebL is implemented as a stand-alone application that fetches and processes web pages according to programmed scripts.

Programming Language

WebL is a high-level, imperative, interpreted, dynamically typed, multi-threaded expression language.

WebL's standard data types include boolean, character, integer (64-bit), double precision floats, Unicode strings, lists, sets, associative arrays (objects), functions, and methods.

WebL has prototype-like objects.

WebL supports fast immutable sets and lists.

WebL has special data types for processing HTML/XML that include pages, pieces (for markup elements), piece sets, and tags.

WebL uses conventional control structures like if-then-else, while-do, repeat-until, try-catch, etc.

WebL has a clean, easy-to-read syntax with C-like expressions and Modula-like control structures.

WebL supports exception handling mechanisms (based on Cardelli & Davies' service combinators) like sequential combination, parallel execution, timeout, and retry. WebL can emulate arbitrarily complex page fetching behaviors by combining services.
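The parallel execution and timeout mechanisms named in the last item can be sketched in Python with the standard `concurrent.futures` module. This is an illustration under stated assumptions, not WebL code: the helpers `parallel` and `timeout` and the stand-in services `slow_server` and `fast_server` are hypothetical.

```python
import concurrent.futures
import time


def timeout(service, seconds):
    """Timeout: fail if `service` does not finish within `seconds`."""
    def combined():
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(service)
        try:
            return future.result(timeout=seconds)
        finally:
            pool.shutdown(wait=False)  # do not block on a slow service
    return combined


def parallel(first, second):
    """Parallel execution: start both services, return the first success."""
    def combined():
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        futures = [pool.submit(first), pool.submit(second)]
        try:
            errors = []
            for future in concurrent.futures.as_completed(futures):
                try:
                    return future.result()
                except Exception as error:
                    errors.append(error)
            raise errors[-1]  # both services failed
        finally:
            pool.shutdown(wait=False)
    return combined


# Hypothetical services standing in for two replicas of the same page.
def slow_server():
    time.sleep(0.5)
    return "slow copy"


def fast_server():
    return "fast copy"


page = parallel(slow_server, fast_server)()

try:
    timeout(slow_server, 0.05)()
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
```

Since each helper takes services and returns a new service, the results can be nested freely, for example a timeout around a parallel fetch from two replicas.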

Protocols Spoken

WebL speaks whatever protocols Java supports, i.e. HTTP, FTP, etc.

WebL can easily fill in web-based forms and navigate between pages.

WebL has HTTP cookie support.

Programmers can define HTTP request headers and inspect response headers.

Programmers can explicitly override mimetypes and DTDs used when parsing Web pages.

Proxy support.

Support for HTTP basic authentication (both client and proxy authentication).
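For readers unfamiliar with what defining request headers and basic authentication involve at the protocol level, the following Python sketch builds a request with custom headers, including a Basic `Authorization` header. It uses Python's standard `urllib.request` and `base64` modules rather than WebL; the helper `basic_auth_header`, the URL, and the credentials are placeholders invented for this example, and no network request is sent.

```python
import base64
import urllib.request


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic `Authorization` header."""
    credentials = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder URL and credentials; the request is only constructed, not sent.
request = urllib.request.Request(
    "http://www.example.com/",
    headers={
        "User-Agent": "example-script/0.1",
        "Authorization": basic_auth_header("alice", "secret"),
    },
)

auth_value = request.get_header("Authorization")
```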

Markup Algebra

WebL 'understands' the HTML, XML, and plain text MIME types.

WebL uses a DTD-based HTML parser for extensibility (HTML 2.0, 3.2, and 4.0 DTDs included).

WebL has relatively robust page parsing that attempts to make a faithful representation of Web pages.

WebL supports a markup algebra for extracting elements and text from pages, and functions for manipulating the content of a page. Extraction functions include extracting all elements of a specific name, all occurrences of Perl 5 regular expressions, and all occurrences of simple element patterns.

Elements and patterns are mapped onto piece objects in WebL, which allow direct access to markup attributes.

Markup algebra makes it easy to express complicated access patterns (for example, "extract all the images in the third row of the table that contains the word 'abc'", and so on).

WebL can handle overlapping elements internally. (Page manipulation is not based on an internal tree-like representation of markup.)

Page manipulation functions include modifying attributes, deleting elements/tags, copying elements/text, and replacing elements/text.

WebL allows programmers to look at both the markup structure of a page and the raw text (without any tags).
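The access pattern mentioned above ("all the images in the third row of the table") can be approximated in Python by recording each element as a piece with begin and end positions, then filtering one set of pieces by containment in another. This is a simplified sketch and not WebL's implementation: `PieceParser`, `elem`, and `inside` are names assumed for this example, and the position counter stands in for whatever internal representation WebL uses.

```python
from html.parser import HTMLParser


class PieceParser(HTMLParser):
    """Records each element as a piece: name, attributes, begin/end positions."""

    def __init__(self):
        super().__init__()
        self.clock = 0    # position counter, advanced at every tag
        self.open = []    # pieces whose end tag has not been seen yet
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        self.clock += 1
        piece = {"name": tag, "attrs": dict(attrs),
                 "begin": self.clock, "end": self.clock}
        self.pieces.append(piece)
        if tag not in ("img", "br", "hr"):  # these elements have no end tag
            self.open.append(piece)

    def handle_endtag(self, tag):
        self.clock += 1
        # Close the most recently opened piece with this name, if any.
        for index in range(len(self.open) - 1, -1, -1):
            if self.open[index]["name"] == tag:
                self.open.pop(index)["end"] = self.clock
                break


def elem(pieces, name):
    """All pieces with the given element name, in document order."""
    return [p for p in pieces if p["name"] == name]


def inside(inner, outer):
    """Pieces of `inner` that lie strictly within some piece of `outer`."""
    return [p for p in inner
            if any(o["begin"] < p["begin"] and p["end"] < o["end"]
                   for o in outer)]


source = """<table>
<tr><td><img src="a.gif"></td></tr>
<tr><td>text</td></tr>
<tr><td><img src="b.gif"><img src="c.gif"></td></tr>
</table>"""

parser = PieceParser()
parser.feed(source)
parser.close()

third_row = elem(parser.pieces, "tr")[2:3]
images = inside(elem(parser.pieces, "img"), third_row)
sources = [piece["attrs"]["src"] for piece in images]
```

Containment is decided by comparing positions rather than by walking a tree, which loosely mirrors the document's remark that page manipulation is not based on a tree-like representation.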

Module Support

Standard modules supplied with WebL include:

File manipulation for writing or downloading pages to disk.

Displaying pages in your web browser, checking which pages are being viewed in Netscape, and instructing Netscape to navigate to a specific URL (Windows only).

Multi-processing with workers, jobs, and job queues.

General string manipulation including Perl 5 regular expression searches.

Routines to split and glue together URLs.

An easily customizable multi-threaded web crawler.

A multi-threaded web server that allows the direct execution of WebL functions with full access to HTTP state.

Java servlet support.

Examples to access information from public services like AltaVista, Yahoo!, etc.
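As an illustration of what splitting and gluing URLs means in practice, the sketch below takes a URL apart, swaps the host, and reassembles it. It uses Python's standard `urllib.parse` module, not the WebL module listed above; the URL and host names are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit

# Split a URL into scheme, network location, path, query, and fragment.
parts = urlsplit("http://www.example.com:8080/docs/index.html?lang=en#top")

# Glue the parts back together after replacing the host with a mirror.
mirror = urlunsplit(parts._replace(netloc="mirror.example.com"))
```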

Java Support and Integration

WebL is written almost entirely in Java. (The Browser access module needs access to a few Windows API calls; WebL is completely portable on UNIX platforms.)

It is possible (though not recommended) to code directly against the WebL API (thus not writing WebL scripts but still using its functionality).

It is very easy to add bridges from WebL to Java code. Java objects can be called directly from WebL code without extending the WebL system (see module Java).

Java extensions are loaded dynamically and it is possible to add and remove builtin functions by editing a standard script.

Applications

WebL is a general-purpose programming language, and can thus be used to build whatever you can imagine. The example chapter of this book gives only a small taste of what is possible with WebL. Some of the things that we at Compaq have built with WebL include:

Web shopping robots,

Page and site validators,

Meta-search engines,

Tools to extract connectivity graphs from the Web and analyze them,

Tools for collecting URLs, host names, word frequency lists, etc.,

Page content analysis and statistics,

Reprocessing of results from public services, for example custom rankings of stocks,

Custom servers and proxy-like entities,

Locating and downloading multi-media content,

and downloading of complete Web sites.

 

