WebL's internal representation of a web resource is a page object. The built-in functions GetURL and PostURL fetch a page from the Web and return a page object. The next chapter introduces functions that turn a page object into a string value and back, and that search and manipulate markup.
The GetURL and PostURL functions take a variable number of arguments that specify the URL to be fetched, request parameters, additional headers, and options. (See Functions to Retrieve Web Pages.)
WebL's ability to process different URL protocols such as http, file, and ftp is inherited from the underlying Java implementation (that is, WebL does not provide support for any additional protocols). The URLs most commonly used in WebL are HTTP URLs. Note that page redirects and cookies are handled transparently by WebL, but this default behavior can be overridden if required.
The request parameters passed to the functions are in the form of objects. For example, the URL of a typical AltaVista request has the following form:
http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22Hannes+Marais%22
This can be converted into a call to the GetURL function in the following manner:
GetURL("http://www.altavista.digital.com/cgi-bin/query",
       [. pg="q", what="web", kl="XX", q="\"Hannes Marais\"" .])
WebL automatically packs the request parameters in the way required by the GET and POST protocol variants. In the case of a PostURL request, the programmer must deduce the correct construction of the parameter object from the Web form from which the request originates. (This is beyond the scope of this manual.)
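To make the packing concrete, the following sketch encodes the same four parameters as a GET query string. It is written in Python rather than WebL, purely to illustrate the wire format. The helper name `pack_query` is hypothetical and not part of WebL, and the sketch assumes the standard form-urlencoded rules.

```python
from urllib.parse import urlencode

def pack_query(base_url, params):
    """Append form-urlencoded parameters to base_url, as a GET request would."""
    # urlencode escapes reserved characters (a double quote becomes %22)
    # and encodes spaces as "+"; dicts keep their insertion order.
    return base_url + "?" + urlencode(params)

url = pack_query(
    "http://www.altavista.digital.com/cgi-bin/query",
    {"pg": "q", "what": "web", "kl": "XX", "q": '"Hannes Marais"'},
)
print(url)
```

The printed URL should have the same form as the AltaVista request shown earlier.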
Note that the HTML specification requires that form parameters be submitted in the order in which they appear in the form on the page. It is thus important to list the param object fields in the same order as the fields in the form (recall that object fields are ordered according to definition sequence).
There are two tricks that are sometimes needed when submitting form data. The first trick involves posting multiple parameters that have the same name. An HTML form might allow the user to pick several options from a list (for example, to indicate favorite programming languages). This can be specified with a parameter object such as the following:
[. gender="male", language=["WebL", "Java"] .]
Note the use of a field of type list to indicate the multiple values. For readers familiar with the HTTP specification, the data will be posted as follows in the body of the HTTP request:
gender=male&language=WebL&language=Java
Note how the parameter language appears twice in the submitted data.
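The expansion of a list-valued field into repeated parameters can be sketched as follows. This is a Python illustration of the encoding only, not WebL code, and it assumes the usual form-urlencoded rules. The variable names are chosen for illustration.

```python
from urllib.parse import urlencode

# A list-valued field stands for several parameters with the same name.
form = {"gender": "male", "language": ["WebL", "Java"]}

# doseq=True expands each list element into its own name=value pair.
body = urlencode(form, doseq=True)
print(body)
```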
The second trick is a workaround for the case when the submitted parameters do not match well with the WebL object type. This might be the case, for example, when the server does not conform to the HTTP specification and allows the posting of data in any format. To handle this case, you may pass a string instead of an object as the param argument of PostURL. In such a case you have to encode the parameters correctly yourself (Module Url is useful for this). Continuing the example above, this would be coded as:
"gender=male&language=WebL&language=Java")
Note that the GetURL, Files_GetURL, and Files_PostURL functions also accept a string argument instead of an object for parameters (see also Module Files). In the case of the Get variants of the functions, the parameter string is simply appended to the URL itself (with WebL adding the usual "?" in between).
The GetURL and PostURL functions add extra HTTP header fields to a request when the optional header object is passed as an argument. Headers that might need to be added in this way include the client identification, cookies, and so on. The response headers of a request (with names converted to lowercase) become part of the page object returned by the functions. For example, after the following call, the page object contains fields such as those shown:
var p = GetURL("http://www.digital.com");
"date" = "Fri, 01 May 1998 19:45:47 GMT",
"URL" = "http://www.digital.com/" .]
Some HTTP response headers, such as Set-Cookie, might be repeated several times. In such a case, the value of the header field will be a list of string values, in the order of occurrence in the HTTP response. The page fields of an HTTP response with multiple headers of the same name might thus look as follows:
"URL" = "http://www.digital.com/",
"Set-Cookie" = ["id=123", "pw=abc" ]
The same idea applies when submitting multiple headers with the same name in a request: give the header field a list of string values.
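The grouping of repeated response headers described above can be sketched in Python (not WebL). The helper `collect_headers` is hypothetical and only models the rule: a header seen once keeps a string value, and a repeated header becomes a list in order of occurrence. Lowercasing the names follows the earlier statement that response header names are converted to lowercase.

```python
def collect_headers(pairs):
    """Group (name, value) header pairs into a dict keyed by lowercase name."""
    fields = {}
    for name, value in pairs:
        key = name.lower()
        if key not in fields:
            fields[key] = value                 # first occurrence: plain string
        elif isinstance(fields[key], list):
            fields[key].append(value)           # third or later occurrence
        else:
            fields[key] = [fields[key], value]  # second occurrence: promote to list
    return fields

headers = collect_headers([
    ("Date", "Fri, 01 May 1998 19:45:47 GMT"),
    ("Set-Cookie", "id=123"),
    ("Set-Cookie", "pw=abc"),
])
print(headers)
```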
Unfortunately, many servers return an incorrect MIME type for a page. The incorrect MIME type can be overridden by passing a mimetype field in the options object argument of GetURL and PostURL. The mimetype field must be of type string, and its value must be taken from the list in Supported MIME Types. If the content encoding is known, an optional charset MIME type parameter can be specified (see MIME types). For example, the following overrides the MIME type of page X to plain text:
GetURL(X, nil, nil, [. mimetype="text/plain" .])
A related problem that we often face when processing HTML pages is that the author of the page either gives no indication of which version of HTML is being used, or gives an incorrect one. The HTML version information is part of the DOCTYPE declaration, and identifies the HTML DTD to be used to parse the page. WebL relies on this information to parse an HTML page correctly. For an incorrectly authored page, the WebL programmer can explicitly override the DTD by adding a dtd field to the options object argument. The value of the field should be the officially assigned name of the DTD. For example, the following option values identify HTML 4.0, 3.2, and 2.0:
[. dtd="-//W3C//DTD HTML 4.0//EN" .]
[. dtd="-//W3C//DTD HTML 3.2//EN" .]
[. dtd="-//IETF//DTD HTML//EN" .]
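Before choosing a dtd override, it can help to check what a page actually declares. The following Python sketch (not WebL) extracts the public identifier from a DOCTYPE declaration with a regular expression. The function name `declared_dtd` and the sample page are made up for illustration, and the pattern only covers the common `<!DOCTYPE HTML PUBLIC "...">` form.

```python
import re

# Matches <!DOCTYPE HTML PUBLIC "public identifier" ...> case-insensitively.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)

def declared_dtd(html_text):
    """Return the public identifier from the DOCTYPE declaration, or None."""
    match = DOCTYPE_RE.search(html_text)
    return match.group(1) if match else None

page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">\n<html><body>Hi</body></html>'
print(declared_dtd(page))
print(declared_dtd("<html><body>No declaration</body></html>"))
```

A result of None, or an identifier that does not match the markup actually used, marks a page for which an explicit dtd option may be needed.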
The fields of the option argument to GetURL and PostURL are summarized below (Fields of the option object).
Controls whether moved pages (for example HTTP status code 302) get automatically fetched from their new locations. The default value is "true".
Overrides the character set used to parse the document. Typical values are "ISO-8859-1", "UTF8", etc.
The dtd field overrides the DTD to be used when parsing the page. The value of this field must be a string with the official DTD name as defined in the SGML catalog.
When this flag is set to true, the HTML parser will regard paragraphs (i.e. <p> tags) as empty markup elements instead of the usual <p>...</p> pairs. (The <br> tag is an example of another empty markup element.) This option is sometimes useful when confronted with pages where <p> is used without regard for the HTML specification (for example, the incorrect use of <p> inside <font>, and so on). The default value of this flag is false.
When this flag is set to true, the HTML parser attempts to correct incorrectly nested HTML elements in a page (for example, an H2 placed inside an H1). This has the effect of regularizing badly formatted HTML, at the cost of sometimes unintuitive parses. The default value of this flag is false.
The mimetype field overrides the MIME type to be used when parsing the page. See Supported MIME Types for typical string values this field may assume.
When this flag is set to false, the URLs in the page are not resolved to absolute form. The default is true.
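As an illustration of what resolving URLs to absolute form means, the following Python sketch (not WebL) resolves a few links against the URL of the page that contains them. The base URL and the links are made-up examples, and WebL's own resolution may differ in detail.

```python
from urllib.parse import urljoin

# URL of the fetched page; relative links are interpreted against it.
base = "http://www.digital.com/products/index.html"
links = ["specs.html", "../about.html", "/", "http://www.w3.org/"]

# urljoin leaves absolute URLs unchanged and resolves relative ones.
resolved = [urljoin(base, link) for link in links]
for link, absolute in zip(links, resolved):
    print(link, "->", absolute)
```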