But thanks to web 2.0, most of the time nowadays it does not work. Why? Because the data is actually not contained in the HTML itself but put there by a script (usually JavaScript) that your browser runs when you visit the page. Meaning that if you try to get it from R, which is not a browser, you'll see squat. Luckily, enter headless browsers: command-line tools that mimic a browser by executing those scripts for you. Personally I use PhantomJS because it is the easiest one for me, given that I don't really know JavaScript all that well. So, once you have installed PhantomJS, all you need to do is write a little script that reads in the page, lets its JavaScript run, and writes the result out as a plain HTML file. Here, as an example, is a script I wrote a year or two ago for a colleague, which grabs data from the website of the IUCN Red List:
# htmlParse() and xpathSApply() below come from the XML package
library(XML)

# The page you want to process
url <- "https://www.iucnredlist.org/species/1301/511335"
# Name of the intermediary file PhantomJS will write
out <- "species.html"
# A little bit of metaprogramming: sprintf() fills the URL and the
# output file name into the PhantomJS script below
j <- sprintf("var url = '%s';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
    just_wait();
});

// wait 2.5 seconds so the page's JavaScript can finish loading the data,
// then dump the rendered HTML to disk and quit
function just_wait() {
    setTimeout(function() {
        fs.write('%s', page.content, 'w');
        phantom.exit();
    }, 2500);
}", url, out)
# Save that to a file
cat(j, file="scrape.js")
# Ask your shell to run PhantomJS on that script
system("phantomjs scrape.js")
# Read the resulting file back into R
h <- htmlParse(out)
# and pick whatever info you wanted in the first place (here using XPath)
xpathSApply(h, "//div[@id='threats-details']//div[@class='text-body']", xmlValue)
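
If you end up doing this kind of scraping often, it can be handy to wrap the three steps (write the script, run PhantomJS, parse the result) into a single helper. Here is a minimal sketch of how that could look, reusing the XML package from above and assuming the phantomjs binary is on your PATH; the function name render_page and its defaults are mine, not part of the original script:

# Render a page with PhantomJS and return the parsed HTML
# (assumes phantomjs is on your PATH and library(XML) is loaded)
render_page <- function(url, out = tempfile(fileext = ".html"), wait = 2500) {
  js <- sprintf("var url = '%s';
var page = new WebPage();
var fs = require('fs');
page.open(url, function (status) {
    setTimeout(function() {
        fs.write('%s', page.content, 'w');
        phantom.exit();
    }, %d);
});", url, out, wait)
  # write the generated script to a temporary file and run it
  script <- tempfile(fileext = ".js")
  cat(js, file = script)
  system(paste("phantomjs", script))
  # parse the rendered HTML that PhantomJS wrote to disk
  htmlParse(out)
}

# Same extraction as above, in one call
h <- render_page("https://www.iucnredlist.org/species/1301/511335")
xpathSApply(h, "//div[@id='threats-details']//div[@class='text-body']", xmlValue)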