Scraping JavaScript web pages with Perl
Sometimes you want to scrape web pages that contain JavaScript and therefore resist being scraped with Web::Scraper or the like. Imagine some JavaScript code like the following, used to disguise an email address:
function mail() {
    var name = "mail";
    var domain = "example.com";
    var mailto = 'mailto:' + name + '@' + domain;
    document.write(mailto);
}
mail();
One could use something elaborate like Selenium to execute the code within a browser and then extract the address by “conventional” means. There are cases, though, where this isn’t sufficient.
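To see what a static scrape is up against, consider what Web::Scraper actually receives: the raw HTML, in which the mailto link does not exist yet. A minimal sketch of the failing attempt (the contact-page URL is made up for illustration):

use strict;
use warnings;
use URI;
use Web::Scraper;

# The mailto link is only created once a browser runs the script,
# so the static HTML contains no matching anchor at all.
my $mails = scraper {
    process 'a[href^="mailto:"]', 'addrs[]' => '@href';
};
my $res = $mails->scrape(URI->new('http://example.com/contact'));
printf "%d mailto links found\n", scalar @{ $res->{addrs} || [] };   # prints 0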
Enter JavaScript::SpiderMonkey, which allows you to execute JavaScript code on the console, without a browser. The only remaining problem is that the console doesn’t provide some of the properties and methods a browser has, so you have to define them yourself. This happens in lines 11-14, where we define the “document” object and its “write” method. The rest of the code is pretty self-explanatory.
000: use strict;
001: use warnings;
002:
003: use Slurp;
004: use JavaScript::SpiderMonkey;
005:
006: my $js = JavaScript::SpiderMonkey->new();
007: my $code = slurp('mailto.js');
008:
009: $js->init();
010:
011: my $obj = $js->object_by_path("document");  # create a bare "document" object
012:
013: my @write;  # collects everything the script writes
014: $js->function_set("write", sub { push @write, @_ }, $obj);  # document.write() now pushes here
015:
016: my $rc = $js->eval(
017: $code
018: );
019:
020: printf "document.write:\n%s\n", join "\n", @write;
021: printf "Error: %s\n", $@;
022: printf "Return Code: %s\n", $rc;
023:
024: $js->destroy();
The output is:
document.write:
mailto:mail@example.com
Error:
Return Code: 1
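From here, getting at the plain address is just a regex over the captured strings. A small sketch, assuming the writes look like the output above:

# Pull the bare address out of every captured mailto: string.
for my $written (@write) {
    print "Found address: $1\n" if $written =~ /^mailto:(\S+\@\S+)$/;
}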
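To turn this into a real scraper, the slurp of mailto.js would be replaced with the script blocks pulled from a live page. A naive sketch, again with a made-up URL (for anything serious, a proper HTML parser such as HTML::TreeBuilder is the better choice):

use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch the page and crudely cut out all inline script bodies.
my $html = get('http://example.com/contact')
    or die "could not fetch page\n";
my @scripts = $html =~ m{<script[^>]*>(.*?)</script>}sig;
my $code = join "\n", @scripts;   # ready to hand to $js->eval($code)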