Scraping JavaScript web pages with Perl
Sometimes you want to scrape web pages that contain JavaScript and therefore resist being scraped with Web::Scraper or the like. Imagine some JavaScript code like the following, used to disguise an email address:
function mail() {
    var name = "mail";
    var domain = "example.com";
    var mailto = 'mailto:' + name + '@' + domain;
    document.write(mailto);
}
mail();
One could use something elaborate like Selenium to execute the code within a browser and then extract the address by “conventional” means. There are cases, though, where this isn’t sufficient.
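To see what a static scrape is up against, consider what Web::Scraper actually receives: the raw HTML, in which the mailto link does not exist yet. A minimal sketch of the failing attempt (the contact-page URL is made up for illustration):

use strict;
use warnings;
use URI;
use Web::Scraper;

# The mailto link is only created once a browser runs the script,
# so the static HTML contains no matching anchor at all.
my $mails = scraper {
    process 'a[href^="mailto:"]', 'addrs[]' => '@href';
};
my $res = $mails->scrape(URI->new('http://example.com/contact'));
printf "%d mailto links found\n", scalar @{ $res->{addrs} || [] };   # prints 0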
Enter JavaScript::SpiderMonkey, which allows you to execute JavaScript code on the console, without a browser. The only remaining problem is that the console doesn’t provide some of the properties and methods a browser has, so you have to define them yourself. This happens in lines 11-14, where we define the “document” object and its “write” method. The rest of the code is pretty self-explanatory.
000: use strict;
001: use warnings;
002:
003: use Slurp;
004: use JavaScript::SpiderMonkey;
005:
006: my $js = JavaScript::SpiderMonkey->new();
007: my $code = slurp('mailto.js');
008:
009: $js->init();
010:
011: my $obj = $js->object_by_path("document");  # create a bare "document" object
012:
013: my @write;  # collects everything the script writes
014: $js->function_set("write", sub { push @write, @_ }, $obj);  # document.write() now pushes here
015:
016: my $rc = $js->eval(
017: $code
018: );
019:
020: printf "document.write:\n%s\n", join "\n", @write;
021: printf "Error: %s\n", $@;
022: printf "Return Code: %s\n", $rc;
023:
024: $js->destroy();
The output is:
document.write:
mailto:mail@example.com
Error:
Return Code: 1
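From here, getting at the plain address is just a regex over the captured strings. A small sketch, assuming the writes look like the output above:

# Pull the bare address out of every captured mailto: string.
for my $written (@write) {
    print "Found address: $1\n" if $written =~ /^mailto:(\S+\@\S+)$/;
}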
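To turn this into a real scraper, the slurp of mailto.js would be replaced with the script blocks pulled from a live page. A naive sketch, again with a made-up URL (for anything serious, a proper HTML parser such as HTML::TreeBuilder is the better choice):

use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch the page and crudely cut out all inline script bodies.
my $html = get('http://example.com/contact')
    or die "could not fetch page\n";
my @scripts = $html =~ m{<script[^>]*>(.*?)</script>}sig;
my $code = join "\n", @scripts;   # ready to hand to $js->eval($code)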