In this example, we illustrate how to build a simple web crawler framework that can easily be customized. The basic idea is to define a generic Crawler object whose methods can be overridden to customize its behavior. Our crawler implementation is provided as standard with WebL in a module called WebCrawler.
First we define the generic Crawler object as follows (only the essential parts of the listing are shown):
var Crawler = [.
   // Pages visited so far (and removed from the queue) and pages waiting in the queue.
   enqueued = [. .],
   // Will contain the farm after the Start method is invoked.
   farm = nil,
   // Methods that should be overridden.
   Visit = meth(s, page) PrintLn(page.URL) end,
   ShouldVisit = meth(s, url) true end,
   Enqueue = meth(s, url)
      // First remove everything following #.
      var pos = Str_IndexOf("#", url);
      ...
      var present = s.enqueued[url] ? false;
      if !present and s.ShouldVisit(url) then
         ...                                    // remember that we have seen the page
         s.farm.Perform(s.ProcessPage(s, url))  // hand the fetch to the worker farm
      end
   end,
   ProcessPage = meth(s, url)
      var page = GetURL(url);  // fetch the page
      s.Visit(page);           // let the crawler process the page
      // Process all the links from this page.
      ...
      on true do PrintLn(url, " err: ", E.msg)
   end,
   Start = meth(s, noworkers)
      s.farm = Farm_NewFarm(noworkers);
   end,
   ...
.];
First we need to keep track of all pages visited so far. We do this with an associative array (that is, a WebL object) whose field names are the URLs encountered so far and whose values are true or false (the enqueued field above). Note that an alternative implementation could use a set instead of an object without any performance penalty.
The Visit and ShouldVisit methods are the two methods that need to be overridden to customize the crawler. The Visit method is called each time a new page is visited, and the ShouldVisit method indicates whether a given URL should be crawled or not.
The Enqueue method adds a URL to the queue of pages to be fetched. Its first task is to strip off any fragment reference (everything following #) from the URL. It then checks whether we have visited the page already; note the use of the ? service combinator to catch the exception that is raised when the URL is not yet present in the enqueued array. If the URL is not present and ShouldVisit approves of it, we remember that we have seen the page and then pass the job of retrieving it to a farm object.
Eventually, when a worker on the farm gets to the job, the ProcessPage method is invoked. After the page is fetched, it calls the Visit method to let the crawler process the page and then enqueues all the anchors found on the page. Any error raised while fetching or processing the page is caught by the on true handler and reported.
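For readers who want to experiment with the same design outside WebL, the following rough sketch expresses the structure in Python; it is not part of WebL or the WebCrawler module, and every name in it is illustrative. A thread pool stands in for the worker farm, a dictionary plays the role of the enqueued object, and visit/should_visit are the hooks meant to be overridden.

# Illustrative Python sketch of the crawler skeleton (not WebL).
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urldefrag

class Crawler:
    def __init__(self, workers=2):
        self.enqueued = {}               # every URL ever enqueued (visited or waiting)
        self.lock = threading.Lock()     # protects enqueued and pending
        self.pending = 0                 # jobs submitted but not yet finished
        self.pool = ThreadPoolExecutor(max_workers=workers)

    # Hooks meant to be overridden, like Visit and ShouldVisit in WebL.
    def visit(self, url, page):
        print(url)

    def should_visit(self, url):
        return True

    def enqueue(self, url):
        url, _ = urldefrag(url)          # remove everything following '#'
        with self.lock:
            if self.enqueued.get(url, False) or not self.should_visit(url):
                return
            self.enqueued[url] = True    # remember that we have seen the page
            self.pending += 1
        self.pool.submit(self._process_page, url)

    def _process_page(self, url):
        try:
            with urllib.request.urlopen(url) as resp:        # fetch the page
                page = resp.read().decode("utf-8", "replace")
            self.visit(url, page)
            # A full crawler would extract the page's anchors here and
            # call self.enqueue() on each absolute link.
        except Exception as e:
            print(url, "err:", e)
        finally:
            with self.lock:
                self.pending -= 1

    def idle(self):
        with self.lock:
            return self.pending == 0

The pending counter is what the idle test relies upon: the crawler is considered idle once every job handed to the pool has finished.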
Now we look at how to create a custom crawler from the generic WebL crawler above. To override the Visit and ShouldVisit methods, we apply the Clone builtin to the generic crawler and to an object of our own that contains the modifications we would like to make:
var MyCrawler = Clone(WebCrawler_Crawler,
   [.
      Visit = meth(s, page)
         var title = Text(Elem(page, "title")[0]) ? "N/A";
         PrintLn(page.URL, " title=", title);
      end,
      ShouldVisit = meth(s, url)
         ...   // restrict the host to `http://www-\w*\.pa\.dec\.com`, and
         Str_EndsWith(url, "(/)|(.html?)")
      end
   .]);
MyCrawler.Start(2);
MyCrawler.Enqueue("http://www-src.pa.dec.com/");
MyCrawler.Enqueue("http://www-wrl.pa.dec.com/");
while !MyCrawler.Idle() do Sleep(10000) end
Our particular implementation of the Visit method extracts and prints the URL and title of the page. The ShouldVisit method restricts crawling to host names of the form "www-*.pa.dec.com" and to URLs that end either in "/" or in ".htm(l)".
The last four lines start up the crawler with two workers and enqueue two starting-point URLs. The final loop checks every 10 seconds whether the workers have become idle, in which case the crawler terminates.
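Carrying the illustrative Python sketch from above to its conclusion (again, nothing here is WebL; the class and method names are assumptions of the sketch), the customization step becomes a subclass that overrides the two hooks, and the main program mirrors the WebL code: start two workers, seed the two URLs, and poll every 10 seconds until the workers fall idle.

# Customizing the illustrative Python sketch: override the two hooks,
# seed the crawler, and poll until the workers are idle.
import re
import time

class MyCrawler(Crawler):
    def visit(self, url, page):
        # Print the URL together with the page title (or "N/A" if absent).
        m = re.search(r"<title>(.*?)</title>", page, re.IGNORECASE | re.DOTALL)
        title = m.group(1).strip() if m else "N/A"
        print(url, "title=", title)

    def should_visit(self, url):
        # Hosts of the form www-*.pa.dec.com, URLs ending in "/" or ".htm(l)".
        return (re.match(r"http://www-\w*\.pa\.dec\.com", url) is not None
                and re.search(r"(/|\.html?)$", url) is not None)

crawler = MyCrawler(workers=2)
crawler.enqueue("http://www-src.pa.dec.com/")
crawler.enqueue("http://www-wrl.pa.dec.com/")
while not crawler.idle():
    time.sleep(10)                     # check every 10 seconds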