Web scraping with R

Want to scrape the web with R? You’re in the right place!

We will teach you, from the ground up, how to scrape the web with R, covering the fundamentals of web scraping with examples in R.

Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.

Overall, here’s what you are going to learn:

  • R web scraping fundamentals
  • Handling different web scraping scenarios with R
  • Leveraging rvest and Rcrawler to carry out web scraping

Let’s start the journey!

Introduction

The first step towards scraping the web with R is understanding HTML and the fundamentals of web scraping. You’ll learn how to get a browser to display a page’s source code, then build an understanding of how markup languages are structured, which sets you on the path to scraping that information. Above all, you’ll master the vocabulary you need to scrape data with R.

We will look at the following basics that’ll help you scrape with R:

  • HTML Basics
  • Browser presentation
  • Parsing HTML data in R

So, let’s get into it.

HTML Basics

HTML is behind everything on the web. Our goal here is to briefly understand how syntax rules, browser presentation, tags, and attributes help us parse HTML and scrape the web for the information we need.

Browser Presentation

Before we scrape anything using R, we need to know the underlying structure of a webpage. The first thing to notice is that what you see when you open a webpage isn’t the HTML document itself, but the browser’s rendering of the underlying HTML code. You can open any HTML document in a text editor such as Notepad to see that code directly.

HTML tells a browser how to show a webpage, what goes into a headline, what goes into a text, etc. The underlying marked-up structure is what we need to understand to actually scrape it.

For example, here’s what Scrapingpass.com looks like when you see it in a browser.

And, here’s what the underlying HTML looks like for it

Looking at this source code might seem like a lot of information to digest at once, let alone scrape! But don’t worry. The next section shows exactly how to make sense of it.

HTML elements and tags

If you looked carefully at the raw HTML of Scrapingpass.com earlier, you would have noticed things like <title>…</title>, <body>…</body>, etc. Those are the tags that HTML uses, and each tag has its own purpose. For example, the <title> tag tells a browser the title of a web page, and the <body> tag defines the body of an HTML document.

Once you understand those tags, the raw HTML starts talking to you, and you begin to get a feeling for how you would scrape the web using R. All you need to take away from this section is that a page is structured with the help of HTML tags, and that knowing these tags helps you locate and extract information easily while scraping.
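To make the idea of tags concrete, here is a minimal sketch in base R. The HTML string and the helper name extract_tag_text are made up for illustration, and a regular expression like this is only suitable for tiny, well-formed snippets, not for real pages.

```r
# A tiny, hypothetical HTML document held in a single string
html_text <- "<html><head><title>Page Title</title></head><body><h1>My First Heading</h1></body></html>"

# Hypothetical helper: return the text between <tag> and </tag>
# (first occurrence only; assumes the tag carries no attributes)
extract_tag_text <- function(html, tag) {
  pattern <- paste0("<", tag, ">(.*?)</", tag, ">")
  match <- regmatches(html, regexpr(pattern, html, perl = TRUE))
  if (length(match) == 0) return(NA_character_)
  sub(pattern, "\\1", match, perl = TRUE)
}

extract_tag_text(html_text, "title")  # "Page Title"
extract_tag_text(html_text, "h1")     # "My First Heading"
```

For anything beyond a toy example, a proper HTML parser is the safer choice, which is where the rest of this article is heading.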

Parsing a webpage using R

With what we know, let’s use R to scrape an HTML webpage and see what we get. Keep in mind that so far we only know about HTML page structure and what raw HTML looks like. That’s why, with the code below, we will simply fetch a webpage and look at its raw HTML. It is also the first step towards scraping the web.

Earlier in this post, I mentioned that we can even use a text editor to open an HTML document. And in the code below, we will parse HTML in the same way we would parse a text document and read it with R.

I want to scrape the HTML code of Scrapingpass.com and see how it looks. We will use readLines() to map every line of the HTML document and create a flat representation of it.

scrape_url <- "https://www.scrapingpass.com/"
# Read the page line by line into a character vector
flat_html <- readLines(con = scrape_url)

Now, when you see what flat_html looks like, you should see something like this in your R Console:

[1] "<!DOCTYPE html>"
[2] "<html lang=\"en\">"
[3] "<head>"
[4] " <meta name=\"generator\" content=\"Hugo 0.60.1\"/>"
[6] " <meta http-equiv=\"x-ua-compatible\" content=\"ie=edge\"/>"
[7] " <title>Scrapingpass - Web Scraping API</title>"
[8] " <meta name=\"description\""
[9] " content=\"Scrapingpass is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[10] " <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, shrink-to-fit=no\"/>"
[11] " <meta name=\"twitter:title\" content=\"Scrapingpass - Web Scraping API\"/>"
[12] " <meta name=\"twitter:description\""
[13] " content=\"Scrapingpass is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[14] " <meta name=\"twitter:card\" content=\"summary_large_image\"/>"
[15] " <meta property=\"og:title\" content=\"Scrapingpass - Web Scraping API\"/>"
[16] " <meta property=\"og:url\" content=\"https://www.scrapingpass.com/\" />"
[17] " <meta property=\"og:type\" content=\"website\"/>"
[18] " <meta property=\"og:image\""
[19] " content=\"https://www.scrapingpass.com/images/cover_image.png\"/>"
[20] " <meta property=\"og:description\" content=\"Scrapingpass is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[21] " <meta property=\"og:image:width\" content=\"1200\"/>"
[22] " <meta property=\"og:image:height\" content=\"630\"/>"
[23] " <meta name=\"twitter:image\""
[24] " content=\"https://www.scrapingpass.com/images/terminal.png\"/>"
[25] " <link rel=\"canonical\" href=\"https://www.scrapingpass.com/\"/>"
[26] " <meta name=\"p:domain_verify\" content=\"7a00b589e716d42c938d6d16b022123f\"/>"

The whole output would be a hundred pages so I’ve trimmed it for you. But, here’s something you can do to have some fun before I take you further towards scraping the web with R:

  • Scrape www.google.com and try to make sense of the information you received
  • Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get

Remember, scraping is only fun if you experiment with it. So, as we move forward with the blog post, I’d love it if you try out each and every example as you go through them and bring your own twist. Share in the comments if you found something interesting or feel stuck somewhere.

While our output above looks great, it still is something that doesn’t closely reflect an HTML document. In HTML we have a document hierarchy of tags that looks something like

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

But clearly, our output from readLines() discarded the markup structure and hierarchy of HTML. Since I just wanted to give you a barebones look at scraping, this code still works as an illustration.

However, real-world scraping code is a lot more complicated. Fortunately, there are plenty of libraries that simplify web scraping in R for us. We will go through some of these libraries in later sections.
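For contrast, a parser keeps that hierarchy intact. The following is a minimal sketch, assuming the XML package (used later in this article) is installed; the HTML string is a made-up stand-in for a real page.

```r
library(XML)

# Hypothetical page, written without whitespace between tags
html_text <- paste0(
  "<html><head><title>Page Title</title></head>",
  "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)

# asText = TRUE tells htmlParse() the input is HTML content, not a file name
doc <- htmlParse(html_text, asText = TRUE)
root <- xmlRoot(doc)

# Name of the top-level node
root_name <- xmlName(root)

# Names of the nodes directly below the root
top_level <- unname(sapply(xmlChildren(root), xmlName))

# Text of the <h1> that sits directly under <body>
heading <- xpathSApply(doc, "//body/h1", xmlValue)
```

Unlike the flat character vector from readLines(), the parsed document can be queried by position in the tree, which is what the XPath expression "//body/h1" does here.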

First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data through R.

Common web scraping scenarios with R

Access web data using R over FTP

FTP is one of the ways to access data over the web. And with the help of CRAN FTP servers, I’ll show you how you can request data over FTP with just a few lines of code. Overall, the whole process is:

  • Save the FTP URL
  • Save the names of the files at that URL into an R object
  • Save the files to your local directory

Let’s get started now. The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.

library(RCurl)

ftp_url <- "ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/"
# dirlistonly = TRUE returns only the file names in the directory
get_files <- getURL(ftp_url, dirlistonly = TRUE)

Let’s check the name of the files we received with get_files

> get_files
"BayesMixSurv.pdf\r\nChangeLog\r\nDESCRIPTION\r\nNAMESPACE\r\naliases.rds\r\nindex.html\r\nrdxrefs.rds\r\n"

Looking at the string above can you see what the file names are?

The screenshot from the URL shows real file names

It turns out that the listing comes back as a single string, with the file names separated by carriage return and line feed characters (\r\n). That is easy to fix. In the code below, I use str_split() and str_extract_all() from the stringr package to get the HTML file names of interest.

library(stringr)

# Split the listing on the carriage return / line feed separators
extracted_filenames <- str_split(get_files, "\r\n")[[1]]
# Keep only the names that end in .html
extracted_html_filenames <- unlist(str_extract_all(extracted_filenames, ".+\\.html"))

Let’s print the file names to see what we have now:

> extracted_html_filenames
[1] "index.html"

Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.
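If you would rather not depend on stringr, the same clean-up can be sketched in base R. This is an alternative to the code above rather than a replacement for it, and the variable names file_names and html_file_names are chosen only for this example.

```r
# The directory listing exactly as shown above
get_files <- "BayesMixSurv.pdf\r\nChangeLog\r\nDESCRIPTION\r\nNAMESPACE\r\naliases.rds\r\nindex.html\r\nrdxrefs.rds\r\n"

# Split on the literal "\r\n" separator
file_names <- strsplit(get_files, "\r\n", fixed = TRUE)[[1]]

# Keep only the names ending in ".html"
html_file_names <- grep("\\.html$", file_names, value = TRUE)
html_file_names  # "index.html"
```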

Now, all we have to do is write a function that creates a folder and downloads the HTML documents into it.

FTPDownloader <- function(filename, folder, handle) {
  # Create the target folder if it does not exist yet
  dir.create(folder, showWarnings = FALSE)
  fileurl <- str_c(ftp_url, filename)
  # Download only the files we do not already have
  if (!file.exists(str_c(folder, "/", filename))) {
    file_name <- try(getURL(fileurl, curl = handle))
    write(file_name, str_c(folder, "/", filename))
    # Pause between requests to go easy on the server
    Sys.sleep(1)
  }
}

We are almost there now! All that’s left is to download these files to a folder on your local drive. We’ll save them in a folder called scrapingpass_html. To do so, first create a curl handle with getCurlHandle().

Curlhandle <- getCurlHandle(ftp.use.epsv = FALSE)

After that, we’ll use the plyr package’s l_ply() function.

library(plyr)
l_ply(extracted_html_filenames, FTPDownloader, folder = "scrapingpass_html", handle = Curlhandle)

And, we are done!

I can see that on my local drive I have a folder named scrapingpass_html, where the index.html file is stored. But if you don’t want to check the scraped content manually, use this command to list the downloaded HTML files:

list.files("./scrapingpass_html")
[1] "index.html"

That was via FTP, but what about retrieving specific data from an HTML webpage? That’s what our next section covers.

Scraping information from Wikipedia using R

In this section, I’ll show you how to retrieve information from Leonardo Da Vinci’s Wikipedia page https://en.wikipedia.org/wiki/Leonardo_da_Vinci.

Let’s take the basic steps to parse information:

library(XML)

wiki_url <- "https://en.wikipedia.org/wiki/Leonardo_da_Vinci"
wiki_read <- readLines(wiki_url, encoding = "UTF-8")
parsed_wiki <- htmlParse(wiki_read, encoding = "UTF-8")

Leonardo Da Vinci’s Wikipedia HTML has now been parsed and stored in parsed_wiki.

But, let’s say you wanted to see what text we were able to parse. A very simple way to do that would be:

wiki_intro_text <- parsed_wiki["//p"]

By doing that, we have essentially selected every <p> node in the document. And since the result is an XML node set, we can use subsetting rules to access individual paragraphs. For example, let’s pick the 4th element at random. Here’s what you’ll see:

wiki_intro_text[[4]]
<p>Born <a href="/wiki/Legitimacy_(family_law)" title="Legitimacy (family law)">out of wedlock</a> to a notary, Piero da Vinci, and a peasant woman, Caterina, in <a href="/wiki/Vinci,_Tuscany" title="Vinci, Tuscany">Vinci</a>, in the region of <a href="/wiki/Florence" title="Florence">Florence</a>, <a href="/wiki/Italy" title="Italy">Italy</a>, Leonardo was educated in the studio of the renowned Italian painter <a href="/wiki/Andrea_del_Verrocchio" title="Andrea del Verrocchio">Andrea del Verrocchio</a>. Much of his earlier working life was spent in the service of <a href="/wiki/Ludovico_il_Moro" class="mw-redirect" title="Ludovico il Moro">Ludovico il Moro</a> in Milan, and he later worked in Rome, Bologna and Venice. He spent his last three years in France, where he died in 1519. </p>
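The node above still carries its markup. If you only want the text, xmlValue() from the XML package strips the tags. Here is a minimal sketch on a made-up two-paragraph snippet, so it runs without fetching Wikipedia; the variable names are chosen for this example only.

```r
library(XML)

# Hypothetical snippet standing in for a parsed Wikipedia page
html_text <- paste0(
  "<html><body>",
  "<p>Born in <a href=\"/wiki/Vinci,_Tuscany\">Vinci</a>, Leonardo was educated in Florence.</p>",
  "<p>He spent his last three years in France.</p>",
  "</body></html>"
)
doc <- htmlParse(html_text, asText = TRUE)

# Same XPath subsetting as above: every <p> node in the document
paragraphs <- doc["//p"]

# xmlValue() drops the markup and returns only the text of a node
paragraph_text <- sapply(paragraphs, xmlValue)
paragraph_text[[1]]
```

The same call should work on wiki_intro_text from the example above, for instance xmlValue(wiki_intro_text[[4]]).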

Reading text is fun, but let’s do something else: let’s get all the links that exist on this page. We can easily do that with the getHTMLLinks() function:

getHTMLLinks(wiki_read)
[1] "/wiki/Wikipedia:Good_articles"
[2] "/wiki/Wikipedia:Protection_policy#semi"
[3] "/wiki/Da_Vinci_(disambiguation)"
[4] "/wiki/Leonardo_da_Vinci_(disambiguation)"
[5] "/wiki/Republic_of_Florence"
[6] "/wiki/Surname"
[7] "/wiki/Given_name"
[8] "/wiki/File:Francesco_Melzi_-_Portrait_of_Leonardo.png"
[9] "/wiki/Francesco_Melzi"
…

Notice what you see above is a mix of actual links and links to files.
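Separating the two kinds of links takes only a pattern match. The sketch below uses a handful of the links from the output above; treating every path without a colon as an article link is an assumption made for this example, not an official Wikipedia rule.

```r
# A few of the links returned above
links <- c(
  "/wiki/Wikipedia:Good_articles",
  "/wiki/Da_Vinci_(disambiguation)",
  "/wiki/Republic_of_Florence",
  "/wiki/File:Francesco_Melzi_-_Portrait_of_Leonardo.png",
  "/wiki/Francesco_Melzi"
)

# Links into the File: namespace point at images and other media
file_links <- links[grepl("^/wiki/File:", links)]

# Assumed rule: article links live under /wiki/ and contain no colon,
# which excludes namespaces such as File: and Wikipedia:
article_links <- links[grepl("^/wiki/[^:]+$", links)]

length(file_links)     # 1
length(article_links)  # 3
```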

You can also see the total number of links on this page by using the length() function:

length(getHTMLLinks(wiki_read))
[1] 1566

I’ll throw in one more use case here, which is scraping tables off such HTML pages. It is something you’ll encounter quite frequently in web scraping. The XML package in R offers a function named readHTMLTable(), which makes life much easier when it comes to scraping tables from HTML pages.

Leonardo’s Wikipedia page has no HTML tables though, so I will use a different page to show how we can scrape HTML tables from a webpage using R. Here’s the new URL:

https://en.wikipedia.org/wiki/Help:Table

As usual, we will read this URL:

wiki_url1 <- "https://en.wikipedia.org/wiki/Help:Table"
wiki_read1 <- readLines(wiki_url1, encoding = "UTF-8")

Now, let’s see exactly how many tables this webpage has:

length((readHTMLTable(wiki_read1)))
[1] 108

If you look at the page, you’ll disagree with the number “108”. For a closer inspection, I’ll use the names() function to get the names of all 108 tables:

names(readHTMLTable(wiki_read1))
[1] "NULL"
[2] "NULL"
[3] "NULL"
[4] "NULL"
[5] "NULL"
[6] "The table's caption\n"
…

Our suspicion was right: most of the entries are named “NULL” and only a few have a real name. I’ll now read data from one of those tables in R:

readHTMLTable(wiki_read1)$"The table's caption\n"
               V1              V2              V3
1 Column header 1 Column header 2 Column header 3
2    Row header 1          Cell 2          Cell 3
3    Row header A          Cell B          Cell C
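To avoid wading through the “NULL” entries by hand, you can filter the list by name. The sketch below is base R only: the list called tables is a hand-built stand-in for what readHTMLTable() returned above, and keep_named_tables is a hypothetical helper written for this example.

```r
# Hypothetical stand-in for the list returned by readHTMLTable()
tables <- list(
  "NULL" = data.frame(V1 = "layout cell"),
  "NULL" = data.frame(V1 = "another layout cell"),
  "The table's caption\n" = data.frame(
    V1 = c("Row header 1", "Row header A"),
    V2 = c("Cell 2", "Cell B")
  )
)

# Hypothetical helper: keep only the tables that have a real name
keep_named_tables <- function(tables) {
  has_name <- !is.na(names(tables)) & names(tables) != "NULL"
  tables[has_name]
}

named_tables <- keep_named_tables(tables)

# Tidy the names by trimming the trailing newline
names(named_tables) <- trimws(names(named_tables))
names(named_tables)  # "The table's caption"
```

On the real page you would call keep_named_tables(readHTMLTable(wiki_read1)) instead of building the list by hand.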

Here’s how this table looks in HTML

Awesome isn’t it? Imagine being able to access census, pricing, etc data over R and scraping it. Wouldn’t it be fun? That’s why I took a boring one and kept the fun part for you. Try something much cooler than what I did. Here’s an example of table data that you can scrape https://en.wikipedia.org/wiki/United_States_Census

Let me know how it goes for you. But it usually isn’t that straightforward. We have forms and authentication that can block your R code from scraping. And that’s exactly what we are going to learn to get through here.

Handling HTML forms while scraping with R

Often we come across pages that aren’t that easy to scrape. Take a look at the Meteorological Service Singapore’s page (that lack of SSL though :O). Notice the dropdowns here

Imagine if you want to scrape information that you can only get upon clicking on the dropdowns. What would you do in that case?

Well, I’ll jump a few steps ahead and show you a preview of the rvest package while scraping this page. Our goal here is to scrape data from 2016 to 2020.

library(rvest)

html_form_page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

# Station IDs sit in the onclick attribute of each dropdown entry
weatherstation_identity <- html_form_page %>%
  html_nodes('button#cityname + ul a') %>%
  html_attr('onclick') %>%
  sub(".*'(.*)'.*", '\\1', .)

# One row per station / month / year combination
weatherdf <- expand.grid(weatherstation_identity,
                         month = sprintf('%02d', 1:12),
                         year = 2016:2020)

Let’s check what data we have been able to scrape. Here’s what our data frame looks like:

str(weatherdf)
> 'data.frame': 3780 obs. of 3 variables:
 $ Var1 : Factor w/ 63 levels "S104","S105",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month: Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim : Named num 63 12 5
  .. ..- attr(*, "names")= chr "" "month" "year"
  ..$ dimnames:List of 3
  .. ..$ Var1 : chr "Var1=S104" "Var1=S105" "Var1=S109" "Var1=S86" ...
  .. ..$ month: chr "month=01" "month=02" "month=03" "month=04" ...
  .. ..$ year : chr "year=2016" "year=2017" "year=2018" "year=2019" ...

From the data frame above, we can now easily generate URLs that provide direct access to data of our interest.

urlPages <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_', weatherdf$Var1, '_', weatherdf$year, weatherdf$month, '.csv')
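To see how the grid turns into URLs, here is a scaled-down sketch with two stations and two months. The station IDs come from the output above; the column name station and the variable names grid and urls are chosen for this example.

```r
# Two station IDs taken from the output above; the real list has 63
stations <- c("S104", "S105")

# Every combination of station, month and year, one per row
grid <- expand.grid(
  station = stations,
  month = sprintf("%02d", 1:2),
  year = 2016,
  stringsAsFactors = FALSE
)

# Same URL pattern as above: station, then year and month glued together
urls <- paste0(
  "http://www.weather.gov.sg/files/dailydata/DAILYDATA_",
  grid$station, "_", grid$year, grid$month, ".csv"
)

nrow(grid)  # 4
urls[1]     # "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S104_201601.csv"
```

The full grid works the same way: 63 stations times 12 months times 5 years gives the 3780 rows reported by str(weatherdf).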

Now, we can download those files at scale using lapply().

lapply(urlPages, function(url){download.file(url, basename(url), method = 'curl')})

Note: This is going to download a ton of data once you execute it.
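With thousands of files, it is worth skipping files you already have, surviving individual failures, and pausing between requests. The following is one possible sketch, not the only way to do it; download_politely and its arguments are hypothetical names, and the fetch argument exists only so the download step can be swapped out.

```r
# Hypothetical helper: download each URL once, pausing between requests.
# `fetch` defaults to download.file() but can be replaced, e.g. for testing.
download_politely <- function(urls, dest_dir, pause = 1, fetch = download.file) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  status <- character(length(urls))
  for (i in seq_along(urls)) {
    dest <- file.path(dest_dir, basename(urls[i]))
    if (file.exists(dest)) {
      status[i] <- "skipped"  # already on disk from an earlier run
      next
    }
    status[i] <- tryCatch({
      fetch(urls[i], dest)
      "downloaded"
    }, error = function(e) "failed")  # e.g. the file does not exist on the server
    Sys.sleep(pause)  # wait before the next request
  }
  data.frame(url = urls, status = status, stringsAsFactors = FALSE)
}
```

A call such as download_politely(urlPages, "weather_data") would then return a data frame showing which URLs were downloaded, skipped, or failed; the folder name weather_data is just an example.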

Web scraping using Rvest

Inspired by libraries like BeautifulSoup, rvest is probably one of the most popular packages in R that we use to scrape the web. While it is simple enough that it makes scraping with R look effortless, it is complex enough to enable any scraping operation.

Let’s see rvest in action now. I will scrape information about Sharknado from IMDb (because it is the best movie in the world!) https://www.imdb.com/title/tt8031422/

library(rvest)

# read_html() replaces the html() function from older rvest versions
sharknado <- read_html("https://www.imdb.com/title/tt8031422/")

Awesome movie, awesome cast! Let’s find out what was the cast of this movie.

sharknado %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
   X1 X2
1  Cast overview, first billed only: Cast overview, first billed only:
2  Ian Ziering
3  Tara Reid
4  Cassandra Scerbo
5  Judah Friedlander
6  Vivica A. Fox
7  Brendan Petrizzo
8  M. Steven Felty
9  Matie Moncea
10 Todd Rex
11 Debra Wilson
12 Alaska Thunderfuck
13 Neil deGrasse Tyson
14 Marina Sirtis
15 Audrey Latt
16 Ana Maria Varty Mihail

Awesome cast! Probably that’s why it was such a huge hit. Who knows.

Still, there are skeptics of Sharknado. I guess the rating would prove them wrong? Here’s how you extract ratings of Sharknado from IMDB

sharknado %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()
[1] 3.5

I still stand by my words. But I hope you get the point, right? See how easy it is for us to scrape information using rvest, while we were writing 10+ lines of code in much simpler scraping scenarios.
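Live pages change their markup over time, so the selectors above may stop matching. To try the same rvest calls without depending on IMDb, here is a minimal sketch on a made-up HTML snippet; the markup and the role labels are invented for illustration.

```r
library(rvest)

# Hypothetical markup loosely modelled on a movie page
page <- read_html(paste0(
  "<html><body>",
  "<strong><span>3.5</span></strong>",
  "<table><tr><th>Actor</th><th>Role</th></tr>",
  "<tr><td>Ian Ziering</td><td>Role 1</td></tr>",
  "<tr><td>Tara Reid</td><td>Role 2</td></tr></table>",
  "</body></html>"
))

# Same selector as above: the <span> inside a <strong>
rating <- page %>%
  html_node("strong span") %>%
  html_text() %>%
  as.numeric()

# The first <table> on the page, converted to a data frame
cast <- page %>%
  html_node("table") %>%
  html_table()
```

Here rating holds the number from the snippet and cast has one row per actor, with column names taken from the <th> cells.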

Next on our list is Rcrawler.

Web Scraping using Rcrawler

Rcrawler is another R package that helps us harvest information from the web. But unlike rvest, we use Rcrawler for network graph-related scraping tasks a lot more. For example, if you wish to scrape a very large website, you might want to try Rcrawler in a bit more depth.

Note: Rcrawler is more about crawling than scraping.

We will go back to Wikipedia and we will try to find the date of birth, date of death, and other details of scientists.

library(Rcrawler)

list_of_scientists <- c("Niels Bohr", "Max Born", "Albert Einstein", "Enrico Fermi")

# Build one Wikipedia search URL per scientist
target_pages <- paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", list_of_scientists))

# Extract the name, date of birth and date of death with XPath patterns
scientist_data <- ContentScraper(
  Url = target_pages,
  XpathPatterns = c("//th",
                    "//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td",
                    "//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
  PatternsName = c("scientist", "dob", "dod"),
  asDataFrame = TRUE)

The output looks like this:

#  Scientist        dob                                dod
1  Niels Bohr       7 October 1885Copenhagen, Denmark  18 November 1962 (aged 77) Copenhagen, Denmark
2  Max Born         11 December 1882                   5 January 1970 (aged 87)
3  Albert Einstein  14 March 1879                      18 April 1955
4  Enrico Fermi     29 September 1901                  28 November 1954
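Scraped values often need a little cleaning: the first date of birth above has the birthplace glued onto it. A minimal base R sketch for pulling out just the date follows; the pattern assumes dates written as day, month name, and four-digit year, as in this output.

```r
# The dob values exactly as scraped above
dob_raw <- c(
  "7 October 1885Copenhagen, Denmark",
  "11 December 1882",
  "14 March 1879",
  "29 September 1901"
)

# Day, month name, four-digit year
date_pattern <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"

# Keep only the part of each string that matches the pattern
dob_clean <- regmatches(dob_raw, regexpr(date_pattern, dob_raw))
dob_clean[1]  # "7 October 1885"
```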

And that’s it!

You pretty much know everything you need to get started with Web Scraping in R.

Try challenging yourself with interesting use cases and uncover challenges. Scraping the web with R can be really fun!

While this article covers the main aspects of web scraping with R, it does not talk about web scraping without getting blocked.

If you want to learn how to do it, we have written this complete guide, and if you don’t want to take care of this, you can always use our web scraping API.

Happy scraping.
