Module WebCrawler exports a single object called Crawler that implements a low-performance, multi-threaded web crawler. To use the crawler, the programmer must override the Visit and ShouldVisit methods. The Visit method is called by the crawler each time a page is visited, and the ShouldVisit method should return true if a given URL should be crawled. The crawler is activated by the Start method, which takes as its argument the number of threads that should perform the crawl. At this point the crawler has no pages to crawl yet; pages are inserted into the crawler's queue with the Enqueue method. As each page in the queue is processed, the crawler extracts all the anchor ("<A>") tags in that page and calls the ShouldVisit method to determine whether the page referred to by each anchor should be crawled. The Abort method can be called at any time to terminate the crawl.
The following example implements a web crawler that prints the URL and title of every page visited. The crawl is restricted to pages on the pa.dec.com domain whose URL ends in "/", ".htm", or ".html". The queue is initially seeded with two pages, from which the crawl proceeds in a breadth-first fashion. Note that the program itself merely sleeps while the crawl is in progress; the infinite loop keeps the program alive while the crawler's threads do the work.
import WebCrawler;
var MyCrawler = Clone(WebCrawler_Crawler,
  [. Visit = meth(s, page)
       // Print the URL and title of each visited page. The ? guard
       // falls back to "" if the page has no <TITLE> element (the
       // original fallback value was lost; "" is an assumed default).
       var title = Text(Elem(page, "title")[0]) ? "";
       PrintLn(page.URL, " title=", title)
     end,
     ShouldVisit = meth(s, url)
       // Stay on the pa.dec.com domain and follow only URLs ending in
       // "/", ".htm" or ".html". Str_StartsWith is an assumed name;
       // the call wrapping this regular expression was lost.
       Str_StartsWith(url, `http://www-\w*[.]pa[.]dec[.]com`) and
       Str_EndsWith(url, "(/)|([.]html?)")
     end
  .]);
MyCrawler.Start(2); // Only two threads are used.
MyCrawler.Enqueue("http://www-src.pa.dec.com/");
MyCrawler.Enqueue("http://www-wrl.pa.dec.com/");
while true do Sleep(10000) end
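If the crawl should not run indefinitely, the final loop can be replaced by a bounded wait followed by a call to the Abort method mentioned above. The fragment below is a minimal sketch, not part of the original example; the length of the wait is an arbitrary assumption.

Sleep(60 * 10000);   // wait sixty times the interval used above (assumed limit)
MyCrawler.Abort()    // terminate the crawl and its threads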