Scraping rendered HTML

2020-09-28

Once in a while (or, you know, every day), when you're doing 'big' data analysis you need to do some web scraping, that is, grab data directly from public websites, particularly if they do not have an API to deliver the data you are interested in. Often, all it takes is to read the HTML page into R (thanks to the XML package) and pick up the data, which in the best-case scenario already lies there as a table.
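To make that best-case scenario concrete, here is a minimal sketch of reading a static table with the XML package. The HTML snippet and its contents are made up for illustration and inlined so the example runs without network access; with a real page you would pass its file name or content to htmlParse instead.

```r
library(XML)

# A tiny static page (hypothetical content, made up for illustration)
html <- "<html><body>
<table>
<tr><th>species</th><th>status</th></tr>
<tr><td>Species A</td><td>LC</td></tr>
<tr><td>Species B</td><td>EN</td></tr>
</table>
</body></html>"

# Parse the string as HTML rather than treating it as a file name
doc <- htmlParse(html, asText = TRUE)

# readHTMLTable returns one data frame per <table> found in the document
tabs <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
tab <- tabs[[1]]

# The same cells can also be picked out one column at a time with XPath
statuses <- xpathSApply(doc, "//table//tr/td[2]", xmlValue)
print(tab)
print(statuses)
```

This only works because the table is already present in the HTML that R receives.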

But thanks to web 2.0, nowadays that often does not work. Why? Because the data is not actually contained in the HTML itself but put there by a script (usually JavaScript) that your browser runs when you load the page. Meaning that if you try to get it from R, which is not a browser, you'll see squat. Luckily, there are headless browsers: command-line tools that mimic browsers by executing this kind of script. Personally I use PhantomJS because it is the easiest for me to use, given I don't really know JavaScript that much. So, once you have installed PhantomJS, all you need to do is write a little script that loads the page, gives its scripts time to run, and outputs the result as a pure HTML page. Here, as an example, is a script I wrote a year or two ago, for a colleague, that grabs data from the website of the IUCN Red List:

# The XML package provides htmlParse(), xpathSApply() and xmlValue()
library(XML)

# The page you want to process
url <- "https://www.iucnredlist.org/species/1301/511335"
# Name of the intermediary file
out <- "species.html"
# A little bit of metaprogramming here:
j <- sprintf("var url = '%s';
var page = new WebPage();
var fs = require('fs'); 

page.open(url, function (status) { 
  just_wait(); 
}); 

function just_wait() { 
  setTimeout(function() { 
    fs.write('%s', page.content, 'w'); 
    phantom.exit(); 
  }, 2500); 
}", url, out)
# Save that to a file
cat(j, file="scrape.js")
# Ask your shell to run PhantomJS on that script
system("phantomjs scrape.js")
# Read in the resulting file
h <- htmlParse(out)
# and pick whatever info you wanted in the first place (here using xpath)
xpathSApply(h, "//div[@id='threats-details']//div[@class='text-body']", xmlValue)
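
If you need this for more than one page, the steps above can be wrapped in a function. What follows is only a sketch under a few assumptions: the names build_script and render_page are hypothetical, the 2500 ms wait is the same arbitrary delay as above, and phantomjs is assumed to be on your PATH. It uses require('webpage').create() to create the page and exits with a non-zero status when the page fails to load, so that R can notice the failure.

```r
# Build the PhantomJS script as a string (hypothetical helper)
build_script <- function(url, out, wait_ms = 2500) {
  sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  } else {
    setTimeout(function () {
      fs.write('%s', page.content, 'w');
      phantom.exit(0);
    }, %d);
  }
});", url, out, as.integer(wait_ms))
}

# Render a page with PhantomJS and return the path of the saved HTML
# (hypothetical helper; assumes phantomjs is installed and on the PATH)
render_page <- function(url, out = tempfile(fileext = ".html"), wait_ms = 2500) {
  bin <- Sys.which("phantomjs")
  if (!nzchar(bin)) stop("phantomjs not found on the PATH")
  # Forward slashes keep Windows paths from being read as JS escape sequences
  out <- normalizePath(out, winslash = "/", mustWork = FALSE)
  js <- tempfile(fileext = ".js")
  on.exit(unlink(js))
  cat(build_script(url, out, wait_ms), file = js)
  status <- system2(bin, shQuote(js))
  if (status != 0) stop("phantomjs exited with status ", status)
  out
}

# Usage (not run here, since it needs PhantomJS and network access):
# h <- XML::htmlParse(render_page("https://www.iucnredlist.org/species/1301/511335"))
```

A fixed delay is the simplest thing that works; if the page is slow to fill in its content, raising wait_ms is the first thing to try.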