The World Wide Web (WWW) consists of a large number of Web sites, domains or servers that provide services to clients over the Internet. The typical clients of these services are users who retrieve web pages with a software application called a Web browser. In contrast, WebL is a client that fetches pages in an automated manner under the control of a program. The purpose of this section is to show how this works.
Pages are identified by a uniform resource locator or URL. A URL identifies the web site, the location of the page on the web site, the filename of the page, and the Internet transmission protocol required to fetch the page. Much simplified, URLs have the following form:
http://hostname/path/filename.html
Here http refers to the protocol being used, hostname the web site or machine identification, and path the directory on that machine where the page called filename.html is stored.
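The decomposition above can be tried out with a short sketch. It is written in Python purely for illustration (it is not WebL code), and the URL is the simplified placeholder from the text rather than a real address.

```python
from urllib.parse import urlsplit

# Placeholder URL in the simplified form shown above (not a real site).
url = "http://hostname/path/filename.html"

parts = urlsplit(url)

protocol = parts.scheme    # the transmission protocol
host = parts.hostname      # the web site or machine identification

# Split the location into the directory and the filename of the page.
directory, _, filename = parts.path.rpartition("/")

print(protocol, host, directory, filename)
```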
The Hypertext Transfer Protocol (HTTP) transfers the page over the Internet. The basic steps are:
The client establishes a communications link to the web server identified by the hostname.
The client sends an HTTP request to the server. The request consists of the location or filename of the page to retrieve (/path/filename.html), headers, and optional parameters.
The server answers with an HTTP response. The response consists of a status code (indicating success or failure), a status message, headers, and the page data itself (contents of /path/filename.html on the server).
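To make the shape of the two messages concrete, the following Python sketch formats a minimal request and takes apart a canned response. It is only an illustration under stated assumptions: no network link is opened (the first step is skipped), the response text is made up for the example, and the helper names build_request and parse_response are hypothetical.

```python
def build_request(host: str, location: str) -> str:
    """Format a minimal HTTP/1.1 GET request for the given location."""
    lines = [
        f"GET {location} HTTP/1.1",  # request line: method, location, version
        f"Host: {host}",             # request header
        "Connection: close",         # request header
        "",                          # blank line terminates the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw: str):
    """Split a raw response into status code, message, headers, and page data."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, message = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), message, headers, body


request_text = build_request("hostname", "/path/filename.html")

# A canned response standing in for what a server might send back.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 28\r\n"
    "\r\n"
    "<html><body>Hi</body></html>"
)
status, message, headers, page = parse_response(raw_response)
print(status, message, headers["content-type"], len(page))
```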
Request parameters provide additional information to a web server about the requested data. This information is often used to access a special service on the server that generates appropriate responses dynamically, for example by looking up data in a database. Each parameter consists of a parameter name and a value. Parameters are included in the HTTP request in one of two ways:
The HTTP GET request appends the parameters (encoded in a special way) to the URL,
The HTTP POST request appends the parameters to the end of the request.
GET requests issued with parameters are recognized by a question mark (?) followed by the parameters (name-value pairs) appended to the URL. In contrast, the parameters of a POST request travel in the body of the request and do not appear in the URL.
POST requests are the preferred method for transmitting the contents of an HTML fill-in form to a web server. Their main advantage is that larger amounts of data can be submitted than with the GET method. Note however that the GET method is also applicable to fill-in forms, and is typically used when parameters are few and relatively short. The GET method is also the default when no parameters are passed.
WebL supports both the GET and POST methods as built-in functions. These functions accept the request parameters in a WebL object and perform the correct encoding and packing in the HTTP request (either in the URL or at the end of the request).
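The difference between the two packings can be seen without contacting any server. The sketch below uses Python's standard library rather than the WebL built-ins, so it only mirrors the idea; the URL and the parameter names N1 and N2 are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

base_url = "http://hostname/path/search"   # placeholder, never contacted
params = {"N1": "V1", "N2": "V2"}          # placeholder parameters

encoded = urlencode(params)

# GET: the encoded parameters become part of the URL.
get_request = Request(base_url + "?" + encoded)

# POST: the URL stays clean and the parameters go into the request body.
post_request = Request(base_url, data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Constructing a Request object only describes the request; nothing is sent until it is opened.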
To encode the parameters of a GET request, we construct a string "N=V" for each parameter with name N and value V. All parameter strings are then concatenated (separated by & symbols), and a question mark is prepended. The URL of a GET request with parameters will thus have the general form:
http://domainname/path/filename.html?N1=V1&N2=V2&N3=V3
Names consist of alphanumeric characters. Values may contain any character except those that are reserved for URLs. To encode the latter characters, we replace them with a percent sign (%) followed by a two-digit hexadecimal number specifying the ASCII code of the character. In addition, spaces are replaced by plus (+) signs.
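The encoding rules can be written down in a few lines. The Python sketch below is a deliberately conservative simplification: it escapes every character that is not alphanumeric, which is more than strictly required, and it assumes ASCII-only values. The function names encode_value and build_query are hypothetical.

```python
def encode_value(value: str) -> str:
    """Encode a parameter value: keep alphanumerics, turn spaces into '+',
    and replace everything else by '%' plus its two-digit hex ASCII code."""
    out = []
    for byte in value.encode("ascii"):   # assumption: ASCII-only input
        ch = chr(byte)
        if ch.isalnum():
            out.append(ch)
        elif ch == " ":
            out.append("+")
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def build_query(params: dict) -> str:
    """Join the N=V strings with '&' and prepend a question mark."""
    pairs = [f"{name}={encode_value(value)}" for name, value in params.items()]
    return "?" + "&".join(pairs)


query = build_query({"title": "Tom & Jerry", "year": "1940"})
print("http://domainname/path/filename.html" + query)
```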
HTTP request headers give the web server more information about the request itself, the browser that is being used, etc. HTTP response headers give the browser more information about the page that is returned. In contrast to parameters that can be freely picked, headers are pre-defined by the HTTP protocol. A header consists of a name and a value. Although WebL can add request headers and read response headers, scripts seldom need to exercise this control. The main uses of this feature include mimicking a specific web browser model, and retrieving and setting cookies.
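As a rough illustration of both directions, the Python sketch below attaches a request header that imitates a browser and reads headers from a canned response. It does not use the WebL interface; the browser name ExampleBrowser/1.0 and the response headers are invented for the example.

```python
import io
from http.client import parse_headers
from urllib.request import Request

# Request side: mimic a specific browser by setting the User-Agent header.
request = Request(
    "http://hostname/path/filename.html",            # placeholder URL
    headers={"User-Agent": "ExampleBrowser/1.0"},    # invented browser name
)
# urllib stores header names in capitalized form, hence "User-agent".
user_agent = request.get_header("User-agent")

# Response side: headers as they might arrive from a server (made up here).
raw_headers = (
    b"Content-Type: text/html\r\n"
    b"Set-Cookie: session=abc123\r\n"
    b"\r\n"
)
response_headers = parse_headers(io.BytesIO(raw_headers))

content_type = response_headers["Content-Type"]
cookie = response_headers["set-cookie"]   # header lookup ignores case

print(user_agent, content_type, cookie)
```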
One of the important pieces of information returned by an HTTP response is the type of the data that is being retrieved (included in a response header). The MIME type specifies if the data is an HTML page, an XML page, an image, a PostScript file, etc. WebL supports only the MIME types corresponding to what it can parse: plain text, HTML and XML. Attempting to process anything else in WebL causes an exception. A common MIME type is the one that identifies HTML documents, typically written in one of the following forms:
text/html
text/html; charset=iso-8859-1
The charset parameter is optional -- it indicates the character encoding (or content encoding) the document is encoded in. WebL uses the charset parameter (or makes an educated guess as to its value when missing) to determine how pages are converted into an internal Unicode format.
Unfortunately, many web servers do not return the correct MIME type information for certain documents, which makes it impossible for WebL to parse the document. To prevent this from occurring, it is possible to override the MIME type of a document explicitly when using the GetURL and PostURL built-in functions.
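The interplay of MIME type, charset, and override can be sketched as follows. This is Python, not WebL, and it rests on several assumptions: the set of supported types is our guess at concrete MIME names for plain text, HTML and XML, the "educated guess" for a missing charset is reduced to a fixed default, and decode_page with its override_type argument is a hypothetical helper, not the signature of GetURL or PostURL.

```python
from email.message import Message

# Assumed MIME names for the three parseable kinds of data.
SUPPORTED_TYPES = {"text/plain", "text/html", "text/xml"}


def decode_page(content_type: str, data: bytes,
                override_type: str = None,
                default_charset: str = "iso-8859-1"):
    """Return (mime_type, text) or raise ValueError for unsupported types."""
    msg = Message()
    # An explicit override replaces whatever the server reported.
    msg["Content-Type"] = override_type if override_type is not None else content_type
    mime_type = msg.get_content_type()
    if mime_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported MIME type: {mime_type}")
    # Fall back to a fixed default when no charset parameter is present.
    charset = msg.get_content_charset() or default_charset
    return mime_type, data.decode(charset)


# Charset given by the server.
print(decode_page("text/html; charset=UTF-8", "caf\u00e9".encode("utf-8")))
# Server reports a wrong type; the caller overrides it.
print(decode_page("application/octet-stream", b"<p>hello</p>",
                  override_type="text/html"))
```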
Many web servers today use "cookies" to store client-side state. For example, a typical application of cookies is to uniquely identify customers at a web storefront. At startup time WebL knows about no cookies (i.e. the cookie database is empty). As cookies are set by servers during HTTP requests, the cookie database will fill up. Each WebL HTTP request is checked against the cookie database, and if necessary, WebL will return the appropriate cookies to the server. A special WebL module called Cookies allows you to save the cookie database to a file, and reload it at a later time.
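The life cycle of such a cookie database can be imitated in a few lines of Python. The sketch is a strong simplification and does not reflect how the WebL Cookies module is implemented: the CookieDatabase class is hypothetical, cookies are keyed by host only, attributes such as path and expiry are ignored, and the JSON file format is our own choice.

```python
import json
import os
import tempfile
from http.cookies import SimpleCookie


class CookieDatabase:
    """A toy cookie store: host -> {cookie name: cookie value}."""

    def __init__(self):
        self.cookies = {}          # empty at startup

    def store(self, host, set_cookie_header):
        """Record the cookies a server set in a Set-Cookie header."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_header)
        for name, morsel in parsed.items():
            self.cookies.setdefault(host, {})[name] = morsel.value

    def header_for(self, host):
        """Build the Cookie header value to send back to this host."""
        jar = self.cookies.get(host, {})
        return "; ".join(f"{name}={value}" for name, value in jar.items())

    def save(self, filename):
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.cookies, f)

    def load(self, filename):
        with open(filename, encoding="utf-8") as f:
            self.cookies = json.load(f)


db = CookieDatabase()
db.store("shop.example", "customer=42; Path=/")   # invented host and cookie

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.json")
    db.save(path)
    restored = CookieDatabase()
    restored.load(path)

print(restored.header_for("shop.example"))
```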
Once an HTTP request is completed, WebL parses the page data into an internal format that makes it easy to query and manipulate the page.
WebL programmers should have a high-level understanding of how HTML and XML are handled -- to this end the following section gives an overview of basic markup concepts and how they relate to WebL. This background material is a prerequisite for the following chapter on the search algebra and page manipulation features.