What is Web scraping?
Web scraping is a popular method of automatically collecting data from different websites. It allows you to obtain the data quickly without having to browse through numerous pages and copy and paste the information.
Afterwards, the data is exported to a CSV file as structured information. Scraping tools are also able to keep the collected data up to date when it changes.
There are numerous applications, websites, and browser plugins that let you parse information quickly and efficiently. It is also possible to build your own web scraper – this is not as hard as it might seem.
How to do web scraping using Ruby
Having considered the variety of web scraping tools and the possible ways to use the scraped data, let's now talk about creating your own custom tool.
Tools
Ruby provides a wide range of ready-made tools for performing typical operations.
They allow developers to rely on official and reliable solutions instead of reinventing the wheel. For Ruby web scraping, you will need to install the following gems on your computer (an installation sketch follows the list):
- Nokogiri is an HTML, SAX, and RSS parser that gives access to elements via XPath and CSS3 selectors. This gem can be used not only for web scraping but also for processing different kinds of XML files.
- HTTParty is a client for RESTful services that sends HTTP requests to the scraped pages and automatically parses JSON and XML responses into Ruby objects.
- Pry is a tool used for debugging. It will help us inspect the data parsed from the scraped pages.
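If the gems are not on your machine yet, a minimal way to install them (assuming RubyGems is already set up; this command is not part of the original steps) is from the terminal:
gem install nokogiri httparty pry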
Web scraping is quite a simple operation and, in general, there is no need to install the Rails framework for it.
Having installed the required gems, you are now ready to learn how to build a web scraper. Let's proceed!
Step 1: Creating the scraping file.
Create the directory where the application data will be stored. Then add a blank file named after the application and save it to that folder. Let's call it "web_scraper.rb".
In this file, include the Nokogiri, HTTParty, and Pry gems with these require statements:
require 'nokogiri'
require 'httparty'
require 'pry'
Step 2: Sending the HTTP request.
Create a variable and send the HTTP request to the page you are going to scrape:
page = HTTParty.get('https://www.iana.org/domains/reserved')
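As an optional sanity check (not part of the original steps), you can confirm the request succeeded before parsing – HTTParty exposes the status code and the raw body:
puts page.code         # HTTP status, e.g. 200 on success
puts page.body[0, 200] # first 200 characters of the raw HTML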
Step 3. Launching Nokogiri
The aim of this stage is to convert the page into Nokogiri objects for further parsing. Create a new variable named "parsed_page" and set it to the Nokogiri method that converts the HTML data into objects – you will use it throughout the process.
parsed_page = Nokogiri::HTML(page)
Pry.start(binding)
Save your file and run it again. Evaluate the "parsed_page" variable to retrieve the requested page as a set of Nokogiri objects.
In the same folder, create an HTML file (let's call it "output.html") and save the result of the parsed_page command there. You will be able to refer to this document later.
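The article does not show the exact command for this step; a minimal sketch, assuming the file is called "output.html" as above, is:
# Write the parsed markup to disk so it can be inspected later in a browser or editor
File.open('output.html', 'w') { |f| f.write(parsed_page.to_html) }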
Before proceeding, exit Pry in the terminal.
Step 4. Parsing
Now you need to extract all the required list items. To do this, pick the appropriate CSS selector and pass it to Nokogiri. You can locate the selector by viewing the page's source code:
array = parsed_page.css('h2').map(&:text)
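To confirm the extraction worked, you can inspect the result in the same Pry session (a quick check that is not part of the original snippet):
puts array.length  # number of extracted items
puts array         # the text of every matched <h2> element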
Once the parsing is complete, it's necessary to export the parsed data to a CSV file so it doesn't get lost.
Step 5. Export
Having parsed the information, you need to finish the scraping and convert the data into a structured table. Return to the terminal and execute the following commands:
require 'csv'
CSV.open('reserved.csv', 'w') { |csv| array.each { |item| csv << [item] } }
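Putting all the steps together, web_scraper.rb might look roughly like this. This is a sketch assembled from the snippets above; the CSV-writing loop is one reasonable completion of the truncated command, not necessarily the author's exact code:
require 'nokogiri'
require 'httparty'
require 'pry'  # used interactively in the steps above; not strictly needed here
require 'csv'

# Step 2: fetch the page to scrape
page = HTTParty.get('https://www.iana.org/domains/reserved')

# Step 3: convert the HTML into a Nokogiri document
parsed_page = Nokogiri::HTML(page)

# Step 4: extract the text of every <h2> element
array = parsed_page.css('h2').map(&:text)

# Step 5: write each extracted item as one row of reserved.csv
CSV.open('reserved.csv', 'w') do |csv|
  array.each { |item| csv << [item] }
end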