WEB SCRAPING WITH R
In this guide, we will teach you how to scrape the web with R from the ground up, starting with the fundamentals.
Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.
Overall, here’s what you are going to learn:
- R web scraping fundamentals
- Handling different web scraping scenarios with R
- Leveraging rvest and Rcrawler to carry out web scraping
Introduction
The first step towards scraping the web with R is understanding HTML and web scraping fundamentals. You'll learn how to display a page's source code in your browser, then develop a feel for the logic of markup languages, which sets you on the path to scraping that information. And, above all, you'll pick up the vocabulary you need to scrape data with R.
We will look at the following basics that'll help you scrape with R:
- HTML Basics
- Parsing HTML data in R
HTML Basics
HTML is behind everything on the web. Our goal here is to briefly understand how syntax rules, browser presentation, tags, and attributes help us learn how to parse HTML and scrape the web for the information we need.
Parsing a webpage using R
With what we know, let's use R to fetch an HTML webpage and see what we get. Keep in mind that so far we only know about HTML page structure and what raw HTML looks like. That's why, in this first piece of code, we will simply fetch a webpage and look at its raw HTML; it is the first step towards scraping the web.
We will use readLines() to read every line of the HTML document and create a flat, line-by-line representation of it.
scrape_url <- "https://www.scrapingbee.com/"

flat_html <- readLines(con = scrape_url)
Now, when you see what flat_html looks like, you should see something like this in your R Console:
[1] "<!DOCTYPE html>" [2] "<html lang=\"en\">" [3] "<head>" [4] " <meta name=\"generator\" content=\"Hugo 0.60.1\"/>" [6] " <meta http-equiv=\"x-ua-compatible\" content=\"ie=edge\"/>" [7] " <title>ScrapingBee - Web Scraping API</title>" [8] " <meta name=\"description\"" [9] " content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>" [10] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, shrink-to-fit=no\"/>" [11] " <meta name=\"twitter:title\" content=\"ScrapingBee - Web Scraping API\"/>" [12] " <meta name=\"twitter:description\"" [13] " content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>" [14] " <meta name=\"twitter:card\" content=\"summary_large_image\"/>" [15] " <meta property=\"og:title\" content=\"ScrapingBee - Web Scraping API\"/>" [16] " <meta property=\"og:url\" content=\"https://www.scrapingbee.com/\" />" [17] " <meta property=\"og:type\" content=\"website\"/>" [18] " <meta property=\"og:image\"" [19] " content=\"https://www.scrapingbee.com/images/cover_image.png\"/>" [20] " <meta property=\"og:description\" content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>" [21] " <meta property=\"og:image:width\" content=\"1200\"/>" [22] " <meta property=\"og:image:height\" content=\"630\"/>" [23] " <meta name=\"twitter:image\"" [24] " content=\"https://www.scrapingbee.com/images/terminal.png\"/>" [25] " <link rel=\"canonical\" href=\"https://www.scrapingbee.com/\"/>" [26] " <meta name=\"p:domain_verify\" content=\"7a00b589e716d42c938d6d16b022123f\"/>"
The whole output would run to a hundred pages, so I've trimmed it for you. But here are a couple of things you can do for fun before I take you further towards scraping the web with R:
- Scrape www.google.com and try to make sense of the information you received
- Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get (a starter sketch follows this list)
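For example, the second exercise takes only a couple of lines; a quick sketch using the same readLines() approach as above:

# Fetch a very simple page and peek at its raw HTML
simple_page <- readLines("https://www.york.ac.uk/teaching/cws/wws/webpage1.html")
head(simple_page)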
In HTML, we have a document hierarchy of tags that looks something like this:
<!DOCTYPE html>
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <h1>My First Heading</h1>
  <p>My first paragraph.</p>
</body>
</html>
But clearly, our output from readLines() discarded the markup structure and hierarchy of the HTML. Given that I just wanted to give you a barebones look at scraping, this code works as a good illustration.
In reality, however, scraping code gets a lot more complicated. Fortunately, there are plenty of libraries that simplify web scraping in R for us, and we will go through some of them in later sections.
First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data through R.
Common web scraping scenarios with R
Access web data using R over FTP
FTP is one of the ways to access data over the web. And with the help of CRAN FTP servers, I’ll show you how you can request data over FTP with just a few lines of code. Overall, the whole process is:
- Save ftp URL
- Save names of files from the URL into an R object
- Save files onto your local directory
Let’s get started now. The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.
library(RCurl)

ftp_url <- "ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/"

get_files <- getURL(ftp_url, dirlistonly = TRUE)
Let's check the names of the files we received in get_files:
> get_files
[1] "BayesMixSurv.pdf\r\nChangeLog\r\nDESCRIPTION\r\nNAMESPACE\r\naliases.rds\r\nindex.html\r\nrdxrefs.rds\r\n"
It turns out that when you download those file names, you get carriage-return representations too. Fortunately, this issue is easy to solve. In the code below, I used str_split() and str_extract_all() from the stringr package to get the HTML file names of interest.
library(stringr)

extracted_filenames <- str_split(get_files, "\r\n")[[1]]

extracted_html_filenames <- unlist(str_extract_all(extracted_filenames, ".+(.html)"))
Let’s print the file names to see what we have now:
> extracted_html_filenames
[1] "index.html"
Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.
Now, all we have to do is write a function that downloads each HTML doc from the web and stores it in a folder.
FTPDownloader <- function(filename, folder, handle) {
  # Create the target folder if it doesn't exist yet
  dir.create(folder, showWarnings = FALSE)
  fileurl <- str_c(ftp_url, filename)
  if (!file.exists(str_c(folder, "/", filename))) {
    # Fetch the file over FTP and write its contents to disk
    file_contents <- try(getURL(fileurl, curl = handle))
    write(file_contents, str_c(folder, "/", filename))
    Sys.sleep(1)  # be polite to the server
  }
}
We are almost there now! All we have to do is download these files to a specified folder on your local drive; we'll save them in a folder called scrapingbee_html. To set up the connection, use getCurlHandle().
Curlhandle <- getCurlHandle(ftp.use.epsv = FALSE)
After that, we'll use the plyr package's l_ply() function.
library(plyr)

l_ply(extracted_html_filenames, FTPDownloader, folder = "scrapingbee_html", handle = Curlhandle)
And, we are done!
I can see that on my local drive I now have a folder named scrapingbee_html, with the index.html file stored in it. But if you don't want to manually go and check the scraped content, use this command to retrieve a list of the HTML files downloaded:
list.files("./scrapingbee_html") [1] "index.html"
That was FTP, but what about retrieving specific data from an HTML webpage? That's what our next section covers.
Handling HTML forms while scraping with R
Imagine you want to scrape information that you can only get by clicking through dropdowns. What would you do in that case?
Well, I'll be jumping a few steps forward and will show you a preview of the rvest package while scraping the historical daily weather page at http://www.weather.gov.sg/climate-historical-daily. Our goal here is to scrape the data from 2016 to 2020.
library(rvest)

html_form_page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

weatherstation_identity <- html_form_page %>%
  html_nodes('button#cityname + ul a') %>%
  html_attr('onclick') %>%
  sub(".*'(.*)'.*", '\\1', .)

weatherdf <- expand.grid(weatherstation_identity,
                         month = sprintf('%02d', 1:12),
                         year = 2016:2020)
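The download step below expects a urlPages vector of file URLs. Here is a minimal sketch of how you might build it from weatherdf, assuming the station CSVs follow a pattern like .../files/dailydata/DAILYDATA_<station>_<yyyymm>.csv; verify the actual download links on the page before running this.

# Hypothetical URL pattern: check it against the real links on the site
urlPages <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
                   weatherdf$Var1, '_', weatherdf$year, weatherdf$month, '.csv')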
Now, we can download those files at scale using lapply().
lapply(urlPages, function(url){download.file(url, basename(url), method = 'curl')})
Note: This is going to download a ton of data once you execute it.
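If you just want to test the pipeline first, you might limit the run to a handful of URLs:

# Try the first three files before committing to the full download
lapply(head(urlPages, 3), function(url) download.file(url, basename(url), method = 'curl'))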
Web scraping using rvest
Inspired by libraries like BeautifulSoup, rvest is probably one of the most popular R packages for scraping the web. While it is simple enough to make scraping with R look effortless, it is powerful enough to handle most scraping tasks.
Let's see rvest in action now. We will scrape information from IMDB, specifically the page for Sharknado (because it is the best movie in the world!): https://www.imdb.com/title/tt8031422/
library(rvest)

sharknado <- read_html("https://www.imdb.com/title/tt8031422/")
Awesome movie, awesome cast! Let's find out who was in the cast of this movie.
sharknado %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()

                                  X1                                X2
1  Cast overview, first billed only: Cast overview, first billed only:
2                        Ian Ziering
3                          Tara Reid
4                   Cassandra Scerbo
5                  Judah Friedlander
6                      Vivica A. Fox
7                   Brendan Petrizzo
8                    M. Steven Felty
9                       Matie Moncea
10                          Todd Rex
11                      Debra Wilson
12                Alaska Thunderfuck
13               Neil deGrasse Tyson
14                     Marina Sirtis
15                       Audrey Latt
16            Ana Maria Varty Mihail
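If you just want the actor names as a plain character vector, a small follow-up sketch is to keep the first column and drop its header row (column names may differ across rvest versions, so check the table output first):

cast <- sharknado %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table() %>%
  .[["X1"]] %>%
  .[-1]   # drop the "Cast overview" header row

cast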