We will be using the following libraries to demonstrate scraping with Node.js:

  1. Simplecrawler
  2. Cheerio
  3. Puppeteer
  4. Playwright
  5. HTTP Client — Axios, Unirest & Superagent
  6. Nightmare

We will also mention the most important things to keep in mind during data extraction.

  1. Simplecrawler

Simplecrawler is designed to provide a basic, flexible, and robust API for crawling websites. It was written to archive, analyze, and search some very large websites and has happily chewed through many thousands of pages and written tens of gigabytes to disk without issue. It has a flexible queue system that can be frozen to disk and defrosted.

So, the first thing to do is install Simplecrawler.

npm install --save simplecrawler

I have created a scraper.js file in my folder. Inside that file, write:

var Crawler = require("simplecrawler");
var crawler = new Crawler("https://books.toscrape.com/");

We supply the constructor with a URL that indicates which domain to crawl and which resource to fetch first. You can configure three things before scraping the website.

a) Request Interval

crawler.interval = 10000; // Ten seconds

b) Concurrency of requests

crawler.maxConcurrency = 3;

c) Crawl depth (how many levels of links to fetch)

crawler.maxDepth = 1; // Only first page is fetched (with linked CSS & images)
// Or:
crawler.maxDepth = 2; // First page and discovered links from it are fetched

This library also provides many more properties, which can be found in its documentation.
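
For illustration, a few other options from the library's README look like the sketch below; treat the exact property names as something to verify against the version you install.

// These property names are taken from Simplecrawler's README;
// double-check them against your installed version before relying on them.
crawler.userAgent = "my-scraper/1.0"; // Custom User-Agent header
crawler.respectRobotsTxt = true;      // Honour robots.txt rules
crawler.timeout = 30000;              // Give up on a request after 30 seconds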

You'll also need to set up event listeners for the events you want to listen for. crawler.fetchcomplete and crawler.complete are good places to start.

crawler.on("fetchcomplete", function(queueItem, responseBuffer, response) {
  console.log("I just received %s (%d bytes)", queueItem.url,      responseBuffer.length);
  console.log("It was a resource of type %s", response.headers['content-type']);
});
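
For example, the complete event fires once the queue has been exhausted:

crawler.on("complete", function() {
    // All discovered resources have been fetched; the queue is empty.
    console.log("Crawl finished, the queue is empty.");
});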

Then, when you're satisfied and ready to go, start the crawler! It will run through its queue, finding linked resources on the domain to download, until it can't find any more.

crawler.start();

  2. Cheerio

Cheerio is a library used to parse HTML and XML documents. You can use jQuery syntax on the downloaded data. Cheerio removes all the DOM inconsistencies and browser cruft from the jQuery library, revealing its truly gorgeous API. You can filter the data you want using selectors. Cheerio works with a very simple, consistent DOM model. As a result, parsing, manipulating, and rendering are incredibly efficient. Preliminary end-to-end benchmarks suggest that Cheerio is about 8x faster than JSDOM.

Example

Type the following code to extract the desired text.

const cheerio = require('cheerio');
const axios = require('axios');

async function main() {
    var scraped_data = await axios.get("https://books.toscrape.com/");
    const $ = cheerio.load(scraped_data.data);
    var name = $(".page_inner").first().find("a").text();
    console.log(name);
    // Books to Scrape
}

main();

First, we make an HTTP request to the website and store the response in scraped_data. Then we load it into Cheerio and use the class name to get the data.
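
As a rough sketch, you could also grab every book title on the page with the same $ object; the article.product_pod structure and the title attribute are assumptions about the site's current markup.

// Loop over each book card and print the title attribute of its link.
// "article.product_pod h3 a" is an assumption about books.toscrape.com's markup.
$("article.product_pod h3 a").each(function () {
    console.log($(this).attr("title"));
});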

  3. Puppeteer

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium. You can also switch to non-headless mode to watch the execution live. It removes the dependency on any external driver to run the operation. Puppeteer provides better control over Chrome.

Example

In your scraper.js file, write the following code.

const puppeteer = require('puppeteer');

async function main() {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();
    var results = await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000);
    await browser.close();
    console.log(results);
}

main();

Let’s break it down line by line:

First, we launch our browser and set headless mode to false. This allows us to watch exactly what's going on:

const browser = await puppeteer.launch({headless: false});

Then, we create a new page in our browser:

const page = await browser.newPage();

Next, we navigate to the books.toscrape.com URL:

var results = await page.goto('https://books.toscrape.com/');

A delay of 1000 milliseconds has been added. While not always necessary, this will ensure everything on the page loads:

await page.waitForTimeout(1000);

Finally, after everything is done, we close the browser and print our result.

await browser.close();
console.log(results);

The setup is complete.
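
Note that page.goto resolves to an HTTP response object rather than the page's HTML. If you want the rendered markup, a minimal sketch (placed inside main() before browser.close()) could look like this:

// Inside main(), before await browser.close():
const html = await page.content();   // fully rendered HTML of the current tab
console.log(html.length);            // e.g. how many characters were downloaded
console.log(await page.title());     // the page's <title> text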

  4. Playwright

Playwright is a Node.js library to automate Chromium, Firefox, and WebKit with a single API, very similar to Puppeteer. Playwright is built to enable cross-browser web automation that is evergreen, capable, reliable, and fast, covering everything from automating tasks and testing web applications to data extraction.

Example

We will build a simple scraper to demonstrate the use of Playwright. We'll scrape the first book from this URL.

Building a scraper

Creating a scraper with Playwright is surprisingly easy, even if you have no previous scraping experience. If you understand JavaScript and CSS, it will be a piece of cake.

In your project folder, create a file called scraper.js (or choose any other name) and open it in your favorite code editor. First, we'll confirm that Playwright is correctly installed and working by running a simple script.

// Import the playwright library into our scraper.
const playwright = require('playwright');

async function main() {
    // Open a Chromium browser. We use headless: false
    // to be able to watch what's going on.
    const browser = await playwright.chromium.launch({
        headless: false
    });
    // Open a new page / tab in the browser.
    const page = await browser.newPage();
    // Tell the tab to navigate to the book's product page.
    var results = await page.goto('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
    // Pause for 10 seconds, to see what's going on.
    await page.waitForTimeout(10000);
    // Turn off the browser to clean up after ourselves.
    await browser.close();
}

main();

The results variable holds the response from page.goto. Now you can use Cheerio to get all the information you need.
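
For instance, a minimal sketch (added inside main() before browser.close() in the script above) that hands the rendered HTML to Cheerio could look like this; the ".product_main h1" selector is an assumption about the book page's markup.

// Place this inside main(), before await browser.close().
// Requires: const cheerio = require('cheerio'); at the top of the file.
const html = await page.content();           // fully rendered HTML of the tab
const $ = cheerio.load(html);
console.log($('.product_main h1').text());   // assumed selector for the book title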

Clicking buttons is extremely easy with Playwright. By prefixing text= to the string you're looking for, Playwright will find the element containing that string and click it. It will also wait for the element if it hasn't been rendered on the page yet. This is a big advantage over Puppeteer. Once you have clicked, you have to wait for the page to load and then use Cheerio to get the information you're looking for.
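
As a hedged sketch, clicking a link by its text and waiting for the next page might look like this when placed inside the async main() from the script above; the URL and the "next" link text are illustrative assumptions about books.toscrape.com's catalogue pages.

// Assumed example: a catalogue page with a pagination link whose text is "next".
await page.goto('https://books.toscrape.com/catalogue/page-1.html');
await page.click('text=next');                // find the element containing "next" and click it
await page.waitForLoadState('networkidle');   // wait for the next page to finish loading
const nextPageHtml = await page.content();    // hand this HTML to Cheerio if needed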

  5. HTTP Clients

An HTTP client can be used to send requests to a server and retrieve its responses. We'll discuss three libraries that can be used to make an HTTP request to the server or web page you're trying to scrape.

  • Axios

Axios is a promise-based HTTP client for both the browser and Node.js. It will provide us with the entire HTML code of the target website. Making a request using Axios is quite simple and easy.

 

var axios = require('axios');

async function main() {
    try {
        var scraped_data = await axios.get("https://books.toscrape.com/");
        console.log(scraped_data.data);

        //......//
    } catch (err) {
        console.log(err);
    }
}

main();

  • Unirest

Unirest is a set of lightweight HTTP libraries available in multiple languages, built and maintained by Kong, who also maintain the open-source API gateway Kong. Using Unirest is similar to using Axios. You can use it as an alternative to Axios.

var unirest = require('unirest');

async function main() {
    try {
        var scraped_data = await unirest.get("https://books.toscrape.com/");
        console.log(scraped_data.body);

        //......//
    } catch (err) {
        console.log(err);
    }
}

main();

  • Superagent

SuperAgent is a small, progressive client-side HTTP request library, and a Node.js module with the same API, supporting many high-level HTTP client features. It has an API similar to Axios and supports promise and async/await syntax.

const superagent = require('superagent');

async function main() {
    try {
        var scraped_data = await superagent.get("https://books.toscrape.com/");
        console.log(scraped_data.text);

        //......//
    } catch (err) {
        console.log(err);
    }
}

main();

  6. Nightmare

Nightmare is a high-level browser automation library from Segment. It uses Electron (the same Google Chrome-derived framework that powers the Atom text editor), which is similar to PhantomJS but about twice as fast and a bit more modern. It was originally designed for automating tasks across sites that don't have APIs, but is most often used for UI testing and crawling.

Nightmare is a good choice over Puppeteer if you don't like the heavy bundle the latter comes with.

Once Nightmare is installed, we'll find Scrapingpass's website link through the DuckDuckGo search engine.

const Nightmare = require('nightmare');
const nightmare = Nightmare();

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'Scrapingpass')
  .click('#search_button_homepage')
  .wait('#links .result__a')
  .evaluate(() => document.querySelector('#links .result__a').href)
  .end()
  .then((link) => {
    console.log('Scrapingpass Web Link:', link);
    // Scrapingpass Web Link: https://www.scrapingpass.com/
  })
  .catch((error) => {
    console.error('Search failed:', error);
  });

Now, let's go line by line. First, we create an instance of Nightmare. Then we open the DuckDuckGo search engine using .goto. Then we select the search bar using its selector and set the value of the search box to "Scrapingpass" using .type. Once all this is done, we submit the search by clicking the search button. Nightmare waits until the first result link has loaded and then uses the DOM to get the value of its href attribute. After receiving the link, it prints it to the console.

Conclusion

So, these were some open-source web scraping tools and libraries that you can use for your web scraping projects. If you just want to focus on data collection, you can always use a web scraping API.
