ReactJS, like many other JavaScript libraries and frameworks, uses client-side code (JavaScript) to render the final HTML. This means that when you, Jaunt, or your browser fetches the HTML source from the server, it doesn't yet contain the final markup the user will see. The browser needs to run the JavaScript program(s) contained in the page in order to generate the content you wish to scrape.
My favorite tool for this kind of job is CasperJS.
It (or rather the PhantomJS tool that CasperJS uses) is a headless browser, meaning it's a version of WebKit (the engine behind Safari, and formerly Chrome) that has been stripped of all the GUI (windows, buttons, menus). What's left is a tool that you can run from a terminal or from your Java program. It won't show any window on the screen, but it will fetch the webpages you ask it to, run any JavaScript they contain, and then respond to your commands, such as "click on this link", "give me that text", "capture a screenshot", and so on.
Let's start with a simple ReactJS example:
We want to scrape the "Hello John" text, but if you look at the plain HTML source (Ctrl+U or Alt+Ctrl+U) you won't see it. On the other hand, if you open the console in your browser and use the following selector, you will get the text:
> document.querySelector('#helloExample .playgroundPreview').textContent
"Hello John"
Here is a simple CasperJS script to do the same thing:
var casper = require("casper").create();
casper.start("http://facebook.github.io/react/index.html", function() {
    this.echo(this.fetchText("#helloExample .playgroundPreview"));
});
casper.run();
You can save it as hello.js and execute it with casperjs hello.js from a terminal, or use the equivalent Java code Runtime.getRuntime().exec(...)
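To make the Java side more concrete, here is a minimal sketch of launching the script and capturing what it prints. It uses ProcessBuilder rather than Runtime.getRuntime().exec(...) because that makes it easy to merge stderr into stdout; either should work. The class name CasperRunner and the method run are made up for illustration, and the sketch assumes casperjs is on your PATH and hello.js is in the working directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CasperRunner {

    /** Runs a command, waits for it to exit, and returns everything it printed. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains both streams.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode + ": " + output);
        }
        return output.toString().trim();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: casperjs is on the PATH; the script path defaults to hello.js.
        String script = args.length > 0 ? args[0] : "hello.js";
        try {
            System.out.println(run(Arrays.asList("casperjs", script)));
        } catch (IOException e) {
            System.err.println("Could not run casperjs: " + e.getMessage());
        }
    }
}
```

Whatever the CasperJS script prints with this.echo(...) comes back as the returned string, so the scraped text can be handed straight to the rest of your Java program.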
Here is a better script that avoids loading images and third-party resources (such as the Facebook button, Twitter button, Google Analytics, and so on), cutting the loading time by half. It also adds a waitForSelector step, so that we don't risk trying to fetch the text before ReactJS has had a chance to create it.
var casper = require("casper").create({
    pageSettings: {
        loadImages: false  // skip images to speed up loading
    }
});
// Abort every request that does not go to the site being scraped.
casper.on('resource.requested', function(requestData, request) {
    if (requestData.url.indexOf("http://facebook.github.io/") !== 0) {
        request.abort();
    }
});
casper.start("http://facebook.github.io/react/index.html", function() {
    // Wait until ReactJS has rendered the element before reading its text.
    this.waitForSelector("#helloExample .playgroundPreview", function() {
        this.echo(this.fetchText("#helloExample .playgroundPreview"));
    });
});
casper.run();
How to install CasperJS
I have had some trouble scraping ReactJS and other modern Javascript pages with the older versions of PhantomJS and CasperJS, so I recommend installing PhantomJS 2.0 and the latest CasperJS from GitHub.
For PhantomJS you can just download the official 2.0 package.
For CasperJS, since it's a Python script, you should be able to check out the latest commit from GitHub and link bin/casperjs onto your PATH. Here's a script for Linux or Mac OS X:
> git clone https://github.com/n1k0/casperjs.git
> cd casperjs
> ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs
You may also want to comment out the line printing "Warning PhantomJS v2.0 ..." from your bin/bootstrap.js file.