The following program calculates the Kincaid score of a set of headline newspaper articles found on the www.news.com web server, and writes a table of the article titles, sorted by score, to the file "kincaid.txt". The Kincaid scoring function is used to judge the reading ease of an English document based on its sentence and word characteristics. The function's output is a reading grade level ranging from 5.5 to 16.5.
Note that this implementation does not calculate the exact Kincaid reading grade, because it takes some shortcuts in counting the sentences and syllables in a page. Web pages also tend to contain many headings and similar fragments, which are not correctly identified as sentences. Web pages differ enough from the Navy manuals (on which the scoring function is based) that the result should be read only as a relative score between similar pages in a corpus.
var letters = Size(Str_Search(txt, "[A-Za-z]"));
var words = Size(Str_Search(txt, "[0-9a-zA-Z']+"));
var syllables = Size(Str_Search(txt, "[aeiouy]+"));
var exceptions = Pat(page, "[ ]([A-Z0-9][.])+");
Replace(exceptions, NewPiece(" X ", "text/plain"));
var sentences = Size(Pat(page, "[.]"));
ARI = 4.71 * (letters / words) +
0.5 * (words / sentences) - 21.43,
Kincaid = 11.8 * (syllables / words) +
0.39 * (words / sentences) - 15.59,
CLF = 5.89 * (letters / words) -
0.3 * (sentences / (words / 100)) - 15.8,
Flesch = 206.835 - 84.6 * (syllables / words) -
1.015 * (words / sentences)
PrintLn(count, " scoring ", s);
sc.Title := Text(Elem(page, "title")[0]);
var FollowLink = fun(page, anchortext)
var dest = (Elem(page, "a") contain
var P = GetURL("http://www.news.com");
var H = FollowLink(P, "(?i)all the headlines");
var A = (Elem(H, "a") directlyafter Pat(H, "•"))
PrintLn(Size(pages), " articles found.");
var res = ScorePageList(pages);
var diff = a.Kincaid - b.Kincaid;
PrintLn(x.Kincaid, " ", x.Title);
s = s + x.Kincaid + " " + x.Title + "\r\n";
Files_SaveToFile("kincaid.txt", s);
Lines 3-24 implement the core of the scoring function. After extracting the text of the page (line 4), we calculate the number of letters (line 5), words (line 6), and syllables (line 7) on the page, using a few simple regular expressions. Lines 9-11 take care of removing any initials that might appear in the page. This is necessary because the number of periods in the page is used as a measure of how many sentences are present, and initials containing periods would skew that count. Finally, lines 13-26 calculate a few common reading scores and return a "score" object with the results to the caller.
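The counting and scoring steps described above can be sketched in Python, which is not the language of the listing. This is a minimal sketch, not the original function: the names count_features and score_text are hypothetical, the input is assumed to be plain text already extracted from the page, and returning None for text without words or periods is an assumption. The regular expressions and the score coefficients are the ones shown in the listing, and the Flesch line uses the standard Flesch Reading Ease formula.

```python
import re

def count_features(txt):
    """Return (letters, words, syllables, sentences) for a plain-text string."""
    letters = len(re.findall(r"[A-Za-z]", txt))
    words = len(re.findall(r"[0-9a-zA-Z']+", txt))
    # Runs of lowercase vowels are a rough proxy for syllables.
    syllables = len(re.findall(r"[aeiouy]+", txt))
    # Mask initials such as " J." so their periods are not counted
    # as sentence ends, then count the remaining periods.
    masked = re.sub(r"[ ]([A-Z0-9][.])+", " X ", txt)
    sentences = masked.count(".")
    return letters, words, syllables, sentences

def score_text(txt):
    """Return a dict of approximate reading scores, or None if undefined."""
    letters, words, syllables, sentences = count_features(txt)
    if words == 0 or sentences == 0:
        return None  # assumption: no score for empty or period-free text
    return {
        "ARI": 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43,
        "Kincaid": 11.8 * (syllables / words) + 0.39 * (words / sentences) - 15.59,
        "CLF": 5.89 * (letters / words) - 0.3 * (sentences / (words / 100)) - 15.8,
        "Flesch": 206.835 - 84.6 * (syllables / words) - 1.015 * (words / sentences),
    }
```

Without the masking step, a name such as "J. Smith" would add one period, and therefore one sentence, to the count.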
Lines 29-46 calculate reading scores for a list of URLs. Lines 36-37 fetch the individual pages and calculate the score. In lines 38-39 we extend the score object with fields to identify the URL and title of the page. The ScorePageList function returns a list of these score objects.
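The loop over URLs might look like the following Python sketch (again, not the language of the listing). The name score_page_list and its fetch and score parameters are hypothetical: fetch(url) is assumed to return a (title, text) pair and score(text) a dictionary of scores. Skipping pages that fail to download is also an assumption, not something the listing confirms.

```python
def score_page_list(urls, fetch, score):
    """Score each URL and tag the result with its URL and page title."""
    results = []
    for count, url in enumerate(urls, start=1):
        print(count, "scoring", url)
        try:
            title, text = fetch(url)
        except OSError:
            continue  # assumption: skip pages that cannot be fetched
        sc = dict(score(text))  # copy so the caller's dict is not modified
        sc["URL"] = url
        sc["Title"] = title
        results.append(sc)
    return results
```

Passing the fetch and score functions as arguments keeps the loop testable without network access.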
The purpose of the GetStories function (lines 54-65) is to retrieve a list of URLs representing newspaper articles. After fetching the root page from news.com (line 56), we follow the link called "All the headlines" to a page that contains all the stories of the day (line 57). Lines 58-59 perform the extraction of the story URLs. We locate all the anchors appearing after a bullet symbol (identified by the "•" character entity) that are not written in the strong font. Lines 60-62 construct a list object of all the URLs found.
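The extraction rule, anchors that directly follow a bullet and are not inside a strong element, can be approximated in Python with the standard html.parser module. This is a rough sketch under stated assumptions: the class BulletLinkParser and the function extract_story_links are hypothetical names, "directly after" is interpreted as "the last non-blank text before the anchor ends with a bullet", and the markup in the usage example is invented for illustration rather than taken from news.com.

```python
from html.parser import HTMLParser

class BulletLinkParser(HTMLParser):
    """Collect hrefs of anchors that follow a bullet and are not in <strong>."""

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.links = []
        self._after_bullet = False
        self._strong_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._strong_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and self._after_bullet and self._strong_depth == 0:
                self.links.append(href)
            self._after_bullet = False  # a bullet applies to one anchor only

    def handle_endtag(self, tag):
        if tag == "strong" and self._strong_depth > 0:
            self._strong_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self._after_bullet = text.endswith("\u2022")

def extract_story_links(html):
    parser = BulletLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, with made-up markup:

```python
sample = ('<p>\u2022 <a href="/a1">One</a><br>'
          '\u2022 <strong><a href="/a2">Two</a></strong><br>'
          '<a href="/nav">Nav</a> &bull; <a href="/a3">Three</a></p>')
print(extract_story_links(sample))  # ['/a1', '/a3']
```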
The main program starts at line 67. First we fetch the stories, and then score them. Lines 70-77 take care of sorting the stories according to score.
Finally, lines 79-84 take care of printing the result and writing it to a file.
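The sorting and output steps can be sketched in Python as follows. The function names format_table and save_table are hypothetical, the score objects are assumed to be dictionaries with "Kincaid" and "Title" keys, ascending order is assumed from the comparison a.Kincaid - b.Kincaid in the listing, and rounding to two decimals is a formatting choice not present in the original.

```python
def format_table(scores):
    """Return one 'score title' line per article, lowest Kincaid score first."""
    ranked = sorted(scores, key=lambda sc: sc["Kincaid"])
    return "".join(f"{sc['Kincaid']:.2f} {sc['Title']}\r\n" for sc in ranked)

def save_table(path, scores):
    """Print the table and write the same text to a file."""
    table = format_table(scores)
    print(table, end="")
    # newline="" keeps the explicit "\r\n" line endings unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(table)
```

Using a sort key instead of a comparison function gives the same ordering as a comparator that returns the difference of the two scores.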