Web scraping with Ruby
Introduction
This post will cover the main tools and techniques for web scraping in Ruby. We start by building a web scraper with common Ruby HTTP clients and parsing the response. This approach has its limitations, however, and can come with a fair dose of frustration: as manageable as it is to scrape static pages, these tools fall short when it comes to Single Page Applications, whose content is built with JavaScript. As an answer to that, we will then introduce a complete web scraping framework. This article assumes that the reader is familiar with the fundamentals of Ruby and with how the Internet works.
Note: Although there is a multitude of gems, we will focus on the most popular ones, as indicated by their GitHub “used by”, “star” and “fork” attributes. While we won’t be able to cover all the use cases for these tools, we will provide good grounds for you to get started and explore more on your own.
Part I: Static pages
Setup
In order to code along with this part, you will need to install the following gems:
gem install pry      # debugging tool
gem install nokogiri # parsing gem
gem install httparty # HTTP request gem
Moreover, we will use open-uri, net/http and csv, which are part of the standard Ruby library so there’s no need for a separate installation.
I will place all my code in a file called scraper.rb.
Note: My Ruby version is 2.6.1.
Make a request with HTTP clients in Ruby
In this section, we will cover how to scrape Wikipedia with Ruby.
Imagine you want to build the ultimate Douglas Adams fan wiki. You would surely start by getting data from Wikipedia. In order to send a request to any website or web app, you need an HTTP client. Let’s take a look at our three main options: net/http, open-uri and HTTParty. You can use whichever of the clients below you like the most, and it will work with step 2.
Net/HTTP
Ruby’s standard library comes with an HTTP client of its own, namely the net/http gem. In order to make a request to the Douglas Adams Wikipedia page, we first need to prepare the URL. To do so, we will use the URI module, which is also part of the standard Ruby library and is loaded by net/http for us.
require 'net/http' # part of the standard library; also loads the uri module

url = "https://en.wikipedia.org/wiki/Douglas_Adams"
uri = URI.parse(url)
response = Net::HTTP.get_response(uri)
puts response.body
#=> "\n<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Douglas Adams - Wikipedia</title>..."
We are given a string with all the HTML from the page.
Note: If the data comes in a JSON format, you can parse it by adding these two lines:
require 'json'
JSON.parse(response.body)
That’s it – it works! However, the syntax of net/http is a bit clunky and less intuitive than that of HTTParty or open-uri, which are, in fact, just elegant wrappers for net/http.
HTTParty
The HTTParty gem was created to ‘make http fun’. Indeed, with its intuitive and straightforward syntax, the gem has become widely popular in recent years. The following two lines are all we need to make a successful GET request:
require "HTTParty" html = HTTParty.get("https://en.wikipedia.org/wiki/Douglas_Adams") # => "<!DOCTYPE html>\n" + "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n" + "<head>\n" + "<meta charset=\"UTF-8\"/>\n" + "<title>Douglas Adams - Wikipedia</title>\n" + ...
What is returned is an HTTParty::Response object wrapping the HTML of the page (the inspector displays it as a series of concatenated strings).
Note: It is much easier to work with objects. If the response Content-Type is application/json, HTTParty parses the response for us and returns Ruby objects with strings as keys. We can check the content type by running response.headers["content-type"]. If we prefer symbols as keys, we can parse the raw body ourselves:
JSON.parse(response.body, symbolize_names: true)
We can’t, however, do this with Wikipedia’s article pages, as they return text/html.
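To see the automatic parsing in action, we can hit an endpoint that does return JSON. Here is a small, hedged sketch – the Wikipedia REST summary endpoint is used purely for illustration, and its exact response shape may differ:

require 'httparty'

# the REST summary endpoint returns application/json instead of text/html
response = HTTParty.get("https://en.wikipedia.org/api/rest_v1/page/summary/Douglas_Adams")
puts response.headers["content-type"]  # => "application/json; charset=utf-8..."
puts response.parsed_response["title"] # HTTParty has already parsed the body into a Hash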
We wrote an extensive guide about the best Ruby HTTP clients – feel free to check it out.
Open URI
The simplest solution, however, is making a request with the open-uri gem, which also is a part of the standard Ruby library:
require 'open-uri'

html = open("https://en.wikipedia.org/wiki/Douglas_Adams")
# (on Ruby 2.7+ this prints a deprecation warning – use URI.open instead)
# => #<File:/var/folders/zl/8zprgb3d6yn_466ghws8sbmh0000gq/T/open-uri20200525-33247-1ctgjgo>
The return value is a Tempfile containing the HTML – and that’s all we need for the next step, in just one line.
The simplicity of open-uri is already explained in its name. It only sends one type of request, and does it very well: with SSL enabled by default and redirects followed automatically.
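As a quick sketch of what we get back (the attributes come from OpenURI::Meta, which decorates the Tempfile):

require 'open-uri'

page = open("https://en.wikipedia.org/wiki/Douglas_Adams")
puts page.content_type # => "text/html"
puts page.base_uri     # the final URI, after any redirects have been followed
puts page.read[0, 60]  # the Tempfile behaves like an IO, so we can read it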
Parsing HTML with Nokogiri
Once we have the HTML, we need to extract only the parts that are of interest to us. As you probably noticed, each of the approaches in the previous section assigned an html variable. We will now pass it as an argument to the Nokogiri::HTML method.
require 'nokogiri'

doc = Nokogiri::HTML(html)
# => #(Document:0x3fe41d89a238 {
#      name = "document",
#      children = [
#        #(DTD:0x3fe41d92bdc8 { name = "html" }),
#        #(Element:0x3fe41d89a10c {
#          name = "html",
#          attributes = [
#            #(Attr:0x3fe41d92fe00 { name = "class", value = "client-nojs" }),
#            #(Attr:0x3fe41d92fdec { name = "lang", value = "en" }),
#            #(Attr:0x3fe41d92fdd8 { name = "dir", value = "ltr" })],
#          children = [
#            #(Text "\n"),
#            #(Element:0x3fe41d93e7fc {
#              name = "head",
#              children = [ ...
Here the return value is a Nokogiri::HTML::Document – a snapshot of that HTML converted into a tree of nested nodes that we can search and traverse.
The good news is that Nokogiri allows us to use either CSS selectors or XPath expressions to target the desired element. We will use both.
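For example, the same node can be targeted either way, so the two styles are interchangeable:

doc.css("title").text     # => "Douglas Adams - Wikipedia"
doc.xpath("//title").text # => "Douglas Adams - Wikipedia"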
In order to parse through the object, we need to do a bit of detective work with the browser’s DevTools. In the below example, we are using Chrome to inspect whether the desired element has any attached class:
As we see, the elements on Wikipedia do not have classes. Still, we can target them by their tags. For instance, if we wanted to get all the paragraphs, we’d approach it by first selecting all p elements, and then converting them to text:
description = doc.css("p").text
# => "\n\nDouglas Noel Adams (11 March 1952 – 11 May 2001) was an English author, screenwriter, essayist, humorist, satirist and dramatist. Adams was author of The Hitchhiker's Guide to the Galaxy, which originated in 1978 as a BBC radio comedy before developing into a \"trilogy\" of five books that sold more than 15 million copies in his lifetime and generated a television series, several stage plays, comics, a video game, and in 2005 a feature film. Adams's contribution to UK radio is commemorated in The Radio Academy's Hall of Fame.[1]\nAdams also wrote Dirk Gently's...
This approach resulted in a 4,336-word-long string. However, imagine you would like to get only the introductory paragraph and the picture. You could either use a regex or let Ruby do the work with the .split method. As we can see above, the paragraph delimiter (\n) has been preserved, so we can ask Ruby to extract only the first non-empty paragraph:
description = doc.css("p").text.split("\n").find{|e| e.length > 0}
Alternatively, we can strip the leading and trailing whitespace with the .strip method and then simply select the first item:
description = doc.css("p").text.strip.split("\n")[0]
Alternatively, and depending on how the HTML is structured, sometimes an easier way is to traverse the node tree directly (Nokogiri::HTML::Document is built on Nokogiri’s XML node API). To do that, we’d select one of the nodes and dive as deep as necessary:
description = doc.css("p")[1]
#=> #(Element:0x3fe41d89fb84 {
#     name = "p",
#     children = [
#       #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] }),
#       #(Text " (11 March 1952 – 11 May 2001) was an English "),
#       #(Element:0x3fe41e837560 {
#         name = "a",
#         attributes = [
#           #(Attr:0x3fe41e833104 { name = "href", value = "/wiki/Author" }),
#           #(Attr:0x3fe41e8330dc { name = "title", value = "Author" })],
#         children = [ #(Text "author")]
#       }),
#       #(Text ", "),
#       #(Element:0x3fe41e406928 {
#         name = "a",
#         attributes = [
#           #(Attr:0x3fe41e41949c { name = "href", value = "/wiki/Screenwriter" }),
#           #(Attr:0x3fe41e4191cc { name = "title", value = "Screenwriter" })],
#         children = [ #(Text "screenwriter")]
#       }),
Once we’ve found the node of interest, we can call the .children method on it, which will return – you’ve guessed it – more nested node objects. We can iterate over them to get the text we need. Here’s an example of the return values from two child nodes:
doc.css("p")[1].children[0]
#=> #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] })

doc.css("p")[1].children[1]
#=> #(Text " (11 March 1952 – 11 May 2001) was an English ")
Now, let’s locate the picture. That should be easy, right? Well, since there are no classes or ids on Wikipedia, calling doc.css("img") returns 16 elements, and increasing the selector specificity to doc.css("td a img") still does not let us single out the main image. We can, however, look for the image by its alt text and then save its URL:
picture = doc.css("td a img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value
Or we could locate the image using XPath, which also returns eight objects, so we still need to find the correct one:
picture = doc.xpath("/html/body/div[3]/div[3]/div[4]/div/table[1]/tbody/tr[2]/td/a/img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value
While all this is possible to achieve, it is really time-consuming, and a small change in the page’s HTML can result in our code breaking. Hunting for a specific piece of text, with or without regex, oftentimes feels like looking for a few needles in a haystack. On top of that, websites themselves are often not structured in a logical way and do not follow a clear design. That not only prolongs the time a developer spends with DevTools but also results in many exceptions.
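One way to soften the blow is to code a bit defensively. Here is a minimal sketch, assuming we still want the infobox image from the page above:

# at_css returns the first match or nil instead of raising when nothing matches,
# so a layout change produces a readable warning rather than a NoMethodError
img_node = doc.at_css("td a img")
picture  = img_node ? img_node["src"] : nil
warn "Infobox image not found – did the page layout change?" if picture.nil?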
Fortunately, a developer’s experience definitely improves when using a web scraping framework, which not only makes the code cleaner but also has ready-made tools for all occasions.
💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check the documentation here.
Exporting Scraped Data to CSV
Before we move on to covering the complete web scraping framework, let’s just see how to actually use the data we get from a website.
Once you’ve successfully scraped the website, you can save the data as a CSV file, which can then be used in Excel or integrated into, e.g., a mailing platform. It is a popular use case for web scraping. In order to implement this feature, we will use the csv gem from the standard library.
In the same folder, create a separate data.csv file.
csv works best with arrays, so create a data_arr variable and define it as an empty array.
Push the data to the array.
Add the array to the csv file.
Run the scraper and check your data.csv file.
The code:
require 'open-uri'
require 'nokogiri'
require 'csv'

html = open("https://en.wikipedia.org/wiki/Douglas_Adams")
doc = Nokogiri::HTML(html)

data_arr = []

description = doc.css("p").text.split("\n").find{ |e| e.length > 0 }
picture = doc.css("td a img").find{ |picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg") }.attributes["src"].value

data_arr.push([description, picture])

CSV.open('data.csv', "w") do |csv|
  # write each scraped record as its own row
  data_arr.each { |row| csv << row }
end
Part II: A complete Ruby web scraping framework
We have covered scraping static pages with basic tools, which forced us to spend a bit too much time trying to locate specific elements. While these approaches more or less work, they do have their limitations. For instance, what happens when a website depends on JavaScript, as in the case of Single Page Applications or infinite scroll pages? Such web apps usually ship very little initial HTML, so scraping them with Nokogiri alone would not bring the desired results.
In this case, we can use a framework that works with JavaScript-rendered sites. The friendliest and best-documented one is by far Kimurai, which builds on Nokogiri for parsing and on Capybara for imitating user interaction with the website. Apart from a plethora of helper methods that make web scraping easy and pleasant, it works out of the box with Headless Chrome, Firefox and PhantomJS.
In this part of the article, we will scrape a job listing web app. First, we will do it statically by just visiting different URL addresses and then, we will introduce some JS action.
Kimurai Setup
In order to scrape dynamic pages, you need to install a couple of tools – below you will find the list with the macOS installation commands:
Chrome and Firefox: brew cask install google-chrome firefox
chromedriver: brew cask install chromedriver
geckodriver: brew install geckodriver
PhantomJS: brew install phantomjs
Kimurai gem: gem install kimurai
In this tutorial, we will use a simple Ruby file but you could also create a Rails app that would scrape a site and save the data to the database.
Static page scraping
Let’s start with what Kimurai considers a bare minimum: a class with options for the scraper and a parse method:
require 'kimurai'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  def parse(response, url:, data: {})
  end
end
As you see above, we use the following options:
@name: you can name your scraper whatever you wish, or omit it altogether if your scraper consists of just one file;
@start_urls: an array of start URLs, which will be processed one by one inside the parse method;
@engine: the engine used for scraping; in this tutorial we are using Selenium with Headless Chrome; if you don't know which engine to choose, check this description of each one.
Let’s talk about the parse method now. It is the default start method for the scraper and it accepts the following arguments:
response: the Nokogiri::HTML::Document object, which we know from the prior part of this post;
url: a string, which can either be passed to the method manually or will otherwise be taken from the @start_urls array;
data: a storage hash for passing data between requests (see the sketch below).
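To make the data: argument more concrete, here is a hedged sketch of passing state between requests with Kimurai's request_to helper; the parse_job handler name and the selectors are illustrative, not taken from the finished scraper below:

def parse(response, url:, data: {})
  response.css("td#resultsCol h2 a").each do |link|
    job_url = "https://indeed.com" + link["href"]
    # carry the listing title over to the next request via data:
    request_to :parse_job, url: job_url, data: { title: link["title"] }
  end
end

def parse_job(response, url:, data: {})
  puts "#{data[:title]} – #{url}"
end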
Just like when we used Nokogiri directly, here you can also parse the response using CSS selectors or XPath. If you’re not very familiar with XPath, here is a practical guide to XPath for web scraping. In this part of the tutorial, we will use both.
Before we move on, we need to introduce the browser object. Every instance method of the scraper has access to a Capybara::Session object. Although it is usually not necessary to use it (the response already contains the whole rendered page), it is the browser object that allows you to interact with the website – for example, to click buttons or fill out forms.
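As a rough sketch of what such interaction can look like (the field name and button label below are hypothetical, not taken from the real page):

def parse(response, url:, data: {})
  browser.fill_in "q", with: "software engineer" # type into a search field
  browser.click_button "Find Jobs"               # trigger the JS-driven search
  doc = browser.current_response                 # re-read the freshly rendered HTML
  puts doc.css("title").text
end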
Now would be a good time to have a look at the Page structure:
Since we are only interested in the job listings, it is convenient to see whether they are grouped within a component – in fact, they are all nested in td#resultsCol. After locating that, we do the same with each of the listings. Below you will also see a helper method scrape_page and a @@jobs = [] array, which will be our storage for all the jobs we scrape.
require 'kimurai'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
      # code to get only the listings
    end
  end

  def parse(response, url:, data: {})
    scrape_page
    @@jobs
  end
end

JobScraper.crawl!
Let’s inspect the page again to check the selectors for title, description, company, location, salary, requirements and the slug for the listing:
With this knowledge, we can scrape an individual listing.
def scrape_page
  doc = browser.current_response
  returned_jobs = doc.css('td#resultsCol')
  returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
    # scraping individual listings
    title = char_element.css('h2 a')[0].attributes["title"].value.gsub(/\n/, "")
    link = "https://indeed.com" + char_element.css('h2 a')[0].attributes["href"].value.gsub(/\n/, "")
    description = char_element.css('div.summary').text.gsub(/\n/, "")
    company = char_element.css('span.company').text.gsub(/\n/, "")
    location = char_element.css('div.location').text.gsub(/\n/, "")
    salary = char_element.css('div.salarySnippet').text.gsub(/\n/, "")
    requirements = char_element.css('div.jobCardReqContainer').text.gsub(/\n/, "")

    # creating a job object
    job = {title: title, link: link, description: description, company: company, location: location, salary: salary, requirements: requirements}

    # adding the object only if it is unique
    @@jobs << job if !@@jobs.include?(job)
  end
end
Instead of creating an object, we could also create an array, depending on what data structure we’d need later:
job = [title, link, description, company, location, salary, requirements]
As the code currently stands, we only get the first 15 results, i.e., just the first page. In order to get data from the next pages, we can visit subsequent URLs:
def parse(response, url:, data: {})
  # scrape the first page
  scrape_page

  # the next page link starts at 20, so the counter is initially set to 2
  num = 2

  # visit the next page and scrape it
  10.times do
    browser.visit("https://www.indeed.com/jobs?q=software+engineer&l=New+York,+NY&start=#{num}0")
    scrape_page
    num += 1
  end

  @@jobs
end
Last but not least, we could create a JSON or CSV file by adding one of these snippets to the parse method to store the scraped data:
CSV.open('jobs.csv', "w") do |csv|
  @@jobs.each { |job| csv << job.values }
end

or:

File.open("jobs.json", "w") do |f|
  f.write(JSON.pretty_generate(@@jobs))
end
Dynamic page scraping with Selenium and Headless Chrome
To bring in JavaScript interaction, we actually won’t change much in our current code, except that instead of visiting different URLs, we will use Selenium with Headless Chrome to imitate a user’s interaction and click the button that takes us to the next page.
def parse(response, url:, data: {})
  10.times do
    # scrape the current page
    scrape_page
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end
To this end we use two methods:
find(): finds an element in the current session by its XPath;
click: simulates the user clicking the found element.
We added two puts statements to see whether our scraper actually moves forward:
As you see, we successfully scraped the first page but then we encountered an error:
element click intercepted: Element <span class="pn">...</span> is not clickable at point (329, 300).
Other element would receive the click: <input autofocus="" name="email" type="email" id="popover-email" class="popover-input-locationtst">
(Selenium::WebDriver::Error::ElementClickInterceptedError)
(Session info: headless chrome=83.0.4103.61)
We could either investigate the response to read the HTML and try to understand why the page looks different, or we could use a more telling tool: a screenshot of the page.
def parse(response, url:, data: {})
  10.times do
    # scrape the current page
    scrape_page

    # take a screenshot of the page
    browser.save_screenshot

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click

    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end
Now, as the code runs, we get screenshots of every page it encounters. This is the first page:
And here is the second page:
Aha! A popup! After running this test a couple of times and inspecting the errors closely, we know that it comes in two versions and that, sadly, it is not clickable. However, we can always refresh the page – the session, along with the jobs we have already collected, is preserved, and the popup disappears. Let’s then add a safeguard:
def parse(response, url:, data: {})
  10.times do
    scrape_page

    # if the popup is present, escape it by refreshing the page
    # (css returns a NodeSet, so we check .any? – an empty NodeSet is still truthy)
    if browser.current_response.css('div#popover-background').any? ||
       browser.current_response.css('div#popover-input-locationtst').any?
      browser.refresh
    end

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click

    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺 CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end
Finally, our scraper works without a problem, and after ten rounds, we end up with 155 job listings:
Here’s the full code of our dynamic scraper:
require 'kimurai'
require 'csv'
require 'json'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
      title = char_element.css('h2 a')[0].attributes["title"].value.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2 a')[0].attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.summary').text.gsub(/\n/, "")
      company = char_element.css('span.company').text.gsub(/\n/, "")
      location = char_element.css('div.location').text.gsub(/\n/, "")
      salary = char_element.css('div.salarySnippet').text.gsub(/\n/, "")
      requirements = char_element.css('div.jobCardReqContainer').text.gsub(/\n/, "")

      # job = [title, link, description, company, location, salary, requirements]
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary, requirements: requirements}

      @@jobs << job if !@@jobs.include?(job)
    end
  end

  def parse(response, url:, data: {})
    10.times do
      scrape_page

      if browser.current_response.css('div#popover-background').any? ||
         browser.current_response.css('div#popover-input-locationtst').any?
        browser.refresh
      end

      browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click

      puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
      puts "🔺 🔺 🔺 🔺 🔺 CLICKED NEXT BUTTON 🔺 🔺 🔺 🔺 "
    end

    CSV.open('jobs.csv', "w") do |csv|
      @@jobs.each { |job| csv << job.values }
    end

    File.open("jobs.json", "w") do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end

    @@jobs
  end
end

jobs = JobScraper.crawl!
Alternatively, you could also replace the crawl! method with parse!, which allows you to use the return value and print out the @@jobs array:
jobs = JobScraper.parse!(:parse, url: "https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY")

pp jobs
Conclusion
Web scraping is definitely one of the most powerful skills a developer can acquire, as it helps you quickly access, aggregate, and process data coming from various sources. It can feel satisfying or daunting, depending on the tools you use: some can see through poor web design decisions and deliver only the data you need in a matter of seconds. I definitely do not wish anyone to spend hours trying to scrape a page just to learn that its developers were inconsistent in how they approached web development. Look for tools that make your life easier!
While this whole article tackles the main aspects of web scraping with Ruby, it does not cover how to scrape without getting blocked.
If you want to learn how to do it, we have written this complete guide, and if you don’t want to take care of this, you can always use our web scraping API.
Happy Scraping.
Abhishek Kumar