Retrieving Page Objects

WebL's internal representation of a web resource is a page object. The built-in functions GetURL and PostURL fetch a page from the Web, and return a page object. In the next chapter we will introduce functions that will turn a page object into a string value and back, search and manipulate markup in interesting ways, etc.

The GetURL and PostURL functions take a variable number of arguments that specify the URL to be fetched, the request parameters, additional headers, and options. (See Functions to Retrieve Web Pages.)

WebL's ability to process different URL protocols like http, file, and ftp is inherited from the underlying Java implementation (i.e. WebL does not provide support for any additional protocols). The most common URL used in WebL is the one corresponding to the HTTP protocol. Note that page redirects and cookies are handled transparently by WebL, but this default behavior can be overridden if required.

The request parameters passed to the functions are in the form of objects. For example, the URL of a typical AltaVista request has the following form:

http://www.altavista.digital.com/cgi-bin/query?pg=q&what=web&kl=XX&q=%22Hannes+Marais%22

This can be converted into a call to the GetURL function in the following manner:

GetURL(
    "http://www.altavista.digital.com/cgi-bin/query",
    [. pg="q", what="web", kl="XX",
       q="\"Hannes Marais\"" .])
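To check which query string a given set of parameters corresponds to, the packing can be reproduced outside WebL. The following Python sketch is an illustration only and not WebL's implementation; the helper name build_get_url is hypothetical, and it assumes standard form encoding (spaces become "+", double quotes become "%22").

```python
from urllib.parse import urlencode

def build_get_url(base_url, params):
    """Append form-encoded params to base_url, separated by '?'."""
    # A dict keeps insertion order, much like the ordered fields of a WebL object.
    return base_url + "?" + urlencode(params)

url = build_get_url(
    "http://www.altavista.digital.com/cgi-bin/query",
    {"pg": "q", "what": "web", "kl": "XX", "q": '"Hannes Marais"'},
)
print(url)
```

The printed URL should match the AltaVista request shown earlier in this section.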

 

Parameters

WebL automatically packs the request parameters as required by the GET and POST variants of the protocol. For a PostURL request, the programmer must deduce the correct construction of the parameter object from the Web form from which the request originates. (This is beyond the scope of this manual.)

Note that the form-submission rules of the HTML specification require that POST parameters be submitted in the order in which they appear in the form on the page. It is thus important to list the fields of the params object in the same order as the fields in the form (recall that object fields are ordered according to definition sequence).

There are two tricks that are sometimes needed when submitting form data. The first trick involves posting multiple parameters that have the same name. An HTML form might allow the user to pick several options from a list (for example, to indicate his or her favorite programming languages). This can be specified as follows:

PostURL("http://...",
    [. gender="male",
       language=["WebL", "Java"] .])

Note the use of a field of type list to indicate the multiple values. For readers familiar with the HTTP specification, the data will be posted as follows in the body of the HTTP request:

gender=male&language=WebL&language=Java

 

Note how the parameter language appears twice in the submitted data.
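The same request body can be reproduced with Python's standard library as a cross-check. This is an illustrative sketch rather than WebL code; the helper name encode_form is hypothetical, and a list of pairs is used so that the field order is explicit.

```python
from urllib.parse import urlencode

def encode_form(fields):
    """Form-encode (name, value) pairs; a list value yields one name=value pair per element."""
    # doseq=True expands sequence values into repeated parameters.
    return urlencode(fields, doseq=True)

body = encode_form([("gender", "male"), ("language", ["WebL", "Java"])])
print(body)  # gender=male&language=WebL&language=Java
```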

The second trick is a work-around for the case when the submitted parameters do not map well onto the WebL object type. This might be the case, for example, when the server does not conform to the HTTP specification and allows data to be posted in any format. To handle this case, you may pass a string instead of an object as the params argument of PostURL. Of course, in such a case you have to encode the parameters correctly yourself (the Url module is useful for this; see Module Url). Continuing the example above, this would be coded as:

PostURL("http://...",
    "gender=male&language=WebL&language=Java")

Note that the GetURL, Files_GetURL, and Files_PostURL functions also accept a string argument instead of an object for parameters (see also Module Files). In the case of the Get variants of the functions, the parameter string is simply appended to the URL itself (with WebL adding the usual "?" in between).

Headers

The GetURL and PostURL functions add extra HTTP header fields to a request when the optional headers object is passed as an argument. Headers that might need to be added in this way include the client identification, cookies, etc. The response headers of a request (with names converted to lowercase) become part of the page object returned by the functions. For example, the program:

var p = GetURL("http://www.digital.com");
PrintLn(p);

prints the page fields:

[. "server" = "Apache/1.2.4",
   "connection" = "close",
   "date" = "Fri, 01 May 1998 19:45:47 GMT",
   "content-type" = "text/html",
   "URL" = "http://www.digital.com/" .]

Some HTTP response headers, such as Set-Cookie, might be repeated several times. In such a case, the value of the header field will be a list of string values, in the order of occurrence in the HTTP response. The page fields of an HTTP response with multiple headers of the same name might thus look as follows:

[.
    "content-type" = "text/html",
    "URL" = "http://www.digital.com/",
    "set-cookie" = ["id=123", "pw=abc"]
.]
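The folding of repeated headers into a list can be sketched in a few lines of Python. This is only a model of the behavior described above, not WebL's implementation; the function name fold_headers and the input format (a list of name/value pairs) are assumptions made for illustration. Names are lowercased here, following the description of response headers given earlier.

```python
def fold_headers(header_pairs):
    """Collect (name, value) pairs into a dict; repeated names become lists in order of occurrence."""
    fields = {}
    for name, value in header_pairs:
        key = name.lower()  # header names are stored in lowercase
        if key not in fields:
            fields[key] = value                 # first occurrence: plain string
        elif isinstance(fields[key], list):
            fields[key].append(value)           # third or later occurrence
        else:
            fields[key] = [fields[key], value]  # second occurrence: promote to list
    return fields

page_fields = fold_headers([
    ("Content-Type", "text/html"),
    ("Set-Cookie", "id=123"),
    ("Set-Cookie", "pw=abc"),
])
print(page_fields)
```

A header that occurs once stays a plain string, so code that reads such a field has to be prepared for either a string or a list.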

 

The same idea is applied when submitting multiple headers with the same name in a request:

GetURL(url, nil, [.
    HeaderA = "xyz",
    HeaderB = ["id=123", "pw=abc"]
.]);

Overrides

Unfortunately, many servers return an incorrect MIME type for a page. The incorrect MIME type can be overridden by passing a mimetype field in the options object argument of GetURL and PostURL. The mimetype field must be of type string, and its value must be taken from the Supported MIME Types table. If the content encoding is known, an optional charset MIME type parameter can be specified (see MIME types). For example, we can write the following to override the MIME type of page X to be plain text:

GetURL(X, nil, nil, [. mimetype="text/plain" .])

 

Supported MIME Types

MIME Type          Parser Used
text/plain         Plain text
text/html          HTML
text/xml           XML
application/xml    XML

A related problem that we often face when processing HTML pages is that the author of the page either gives no indication of which version of HTML is being used, or gives an incorrect indication. The HTML version information is part of the DOCTYPE declaration, and identifies the HTML DTD to be used to parse the page. WebL relies on this information to parse an HTML page correctly. In the case of an incorrectly authored page, the WebL programmer can explicitly override the DTD by adding a dtd field to the options object argument. The value of the field should be the officially assigned name of the DTD. For example, the following option values identify HTML 4.0, 3.2, and 2.0:

[. dtd="-//W3C//DTD HTML 4.0//EN" .]
[. dtd="-//W3C//DTD HTML 3.2//EN" .]
[. dtd="-//IETF//DTD HTML//EN" .]

The fields of the options argument to GetURL and PostURL are summarized in Fields of the option object.

Functions to Retrieve Web Pages

Function
    Description

GetURL(url: string): page
    Uses the HTTP GET protocol to fetch the resource identified by the URL.

GetURL(url: string, params: {object,string}): page
    The params object/string contains the parameters of a GET that includes a query.

GetURL(url: string, params: {object,string}, headers: object): page
    The headers object specifies the additional headers to include in the GET request.

GetURL(url: string, params: {object,string}, headers: object, options: object): page
    The options object allows, among other things, the overriding of the MIME type and DTD to be used for parsing the page.

PostURL(url: string): page
    Uses the HTTP POST protocol to fetch the resource identified by the URL.

PostURL(url: string, params: {object,string}): page
    The params object/string contains the parameters of a POST to fill in a web form.

PostURL(url: string, params: {object,string}, headers: object): page
    The headers object specifies the additional headers to include in the POST request.

PostURL(url: string, params: {object,string}, headers: object, options: object): page
    The options object allows, among other things, the overriding of the MIME type and DTD to be used for parsing the page.

HeadURL(url: string): page
    Uses the HTTP HEAD protocol to fetch the headers of the resource identified by the URL.

HeadURL(url: string, params: {object,string}): page
    The params object/string contains the parameters of the HEAD request.

HeadURL(url: string, params: {object,string}, headers: object): page
    The headers object specifies the additional headers to include in the HEAD request.

Fields of the option object

Field
    Description

autoredirect
    Controls whether moved pages (for example HTTP status code 302) get automatically fetched from their new locations. The default value is "true".

charset
    Overrides the character set used to parse the document. Typical values are "ISO-8859-1", "UTF8", etc.

dtd
    Overrides the DTD to be used when parsing the page. The value of this field must be a string with the official DTD name as defined in the SGML catalog.

emptyparagaphs
    When this flag is set to true, the HTML parser regards paragraphs (i.e. <p> tags) as empty markup elements instead of the usual <p>...</p> pairs. (The <br> tag is another example of an empty markup element.) This option is sometimes useful when confronted with pages where <p> is used without regard for the HTML specification (for example, the incorrect use of <p> inside <font>). The default value of this flag is false.

fixhtml
    When this flag is set to true, the HTML parser attempts to correct incorrectly nested HTML elements in a page (for example, an H2 placed inside an H1). This has the effect of regularizing badly formatted HTML, at the cost of sometimes unintuitive parses. The default value of this flag is false.

mimetype
    Overrides the MIME type to be used when parsing the page. See Supported MIME Types for typical string values this field may assume.

resolveurls
    When this flag is set to false, the URLs in the page are not resolved to absolute form. The default is true.
