In this post, we’ll learn to scrape sites using browser automation with JavaScript. We’ll be using Puppeteer for this.

How to proceed

Generally, web scraping is split into two parts:

  • Fetching data by making an HTTP request
  • Extracting important data by parsing the HTML DOM
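
Before reaching for a full browser, the second part can be sketched in plain JavaScript. The snippet below is a minimal, hypothetical sketch: the HTML string stands in for a fetched page, and `extractBetween` is an illustrative helper (a real scraper should use a proper DOM parser, which is exactly what Puppeteer will give us):

```javascript
// A hard-coded HTML fragment standing in for a page fetched over HTTP.
const html = '<h1>A Light in the Attic</h1><p class="price_color">£51.77</p>';

// Extract the text between an opening and closing tag pair.
// Good enough for a sketch; brittle against real-world HTML.
function extractBetween(source, openTag, closeTag) {
  const rest = source.split(openTag)[1];
  return rest ? rest.split(closeTag)[0] : null;
}

const title = extractBetween(html, '<h1>', '</h1>');
const price = extractBetween(html, '<p class="price_color">', '</p>');
console.log({ title, price }); // { title: 'A Light in the Attic', price: '£51.77' }
```

With Puppeteer, fetching and parsing happen inside a real browser instead, so pages that build their content with JavaScript also work.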

Libraries & Tools

  1. Puppeteer
  2. Node.js

We are getting book titles and prices from this website, a fake bookstore set up specifically to help people practice scraping.

Setup

Our setup is pretty simple. Just create a folder and install Puppeteer. To create the folder and install the library, run the commands below:

mkdir scraper
cd scraper
npm i puppeteer --save

Now, create a file inside that folder with any name you wish. I’m using xyz.js.

Preparing the Food

Now, insert the following boilerplate code in xyz.js:

const puppeteer = require('puppeteer');
let scrape = async () => { 
  // Actual Scraping goes Here…
  // Return a value
}; 
scrape().then((value) => { 
  console.log(value);  // Success!
});

Something important to notice is that our scrape() function is an async function and makes use of the ES2017 async/await features. Because this function is asynchronous, when it’s called it returns a Promise. When the async function finally returns a value, the Promise will resolve.

Since we’re using an async function, we can use the await expression, which pauses the function’s execution and waits for the Promise to resolve before moving on. It’s okay if none of this makes sense immediately. It’ll become clearer as we continue with the tutorial.
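To see this in action, here is a tiny standalone sketch (the function name `getValue` is just illustrative) showing that calling an async function returns a Promise, and that `await` unwraps the resolved value:

```javascript
// Any async function returns a Promise, even when it returns a plain value.
const getValue = async () => 42;

const pending = getValue();
console.log(pending instanceof Promise); // true

// `await` pauses the surrounding async function until the Promise resolves.
(async () => {
  const value = await getValue();
  console.log(value); // 42
})();
```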

We can test the above code by adding a line to the scrape function. Try it out:

let scrape = async () => {
  return 'test';
};

Now run node xyz.js in the console. You should see test returned. Perfect, our returned value is being logged to the console. Now we can start filling out our scrape function.

Step 1: Setup

The first thing we need to do is create an instance of the browser, open a new page, and navigate to a URL.

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  await page.waitFor(1000);
  // Scrape goes here; we'll define `result` in Step 2
  browser.close();
  return result;
};

The setup is complete. Now, let’s scrape!

Step 2: Scraping

We are going to scrape the book title and its price.

Looking at the Puppeteer API, we can find the method that lets us get the HTML out of the page.

In order to retrieve these values, we’ll use the page.evaluate() method. This method allows us to use built-in DOM selectors like querySelector().

The first thing we’ll do is create our page.evaluate() function and save the returned value to a variable named result:

const result = await page.evaluate(() => {
  // return something
});

Within our function, we can select the elements we want. We’ll use the Google Developer Tools to figure this out.

As you can see in the Elements panel, the title is just an h1 element. We can now select this element with the following code:

let title = document.querySelector('h1');

Since we want the text contained within this element, we need to add .innerText. Here’s what the final code looks like:

let title = document.querySelector('h1').innerText;

Similarly, we can select the price by right-clicking and inspecting the element:

As you can see, our price has a class of price_color. We can use this class to select the element and its inner text:

let price = document.querySelector('.price_color').innerText;

Now that we have the text that we’d like, we will return it in an object:

return {
  title,
  price
}

We’re now selecting the title and price, saving them to an object, and returning the value of that object to the result variable. Here’s what it looks like when it’s all put together:

const result = await page.evaluate(() => {
  let title = document.querySelector('h1').innerText;
  let price = document.querySelector('.price_color').innerText;
  return {
    title,
    price
  };
});

The only thing left to do is return our result so it can be logged to the console:

return result;

Here’s what your final code should look like:

const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  await page.waitFor(1000);
  const result = await page.evaluate(() => {
    let title = document.querySelector('h1').innerText;
    let price = document.querySelector('.price_color').innerText;
    return {
      title,
      price
    };
  });
  browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});

You can now run your Node file by typing the following into the console:

node xyz.js
// { title: 'A Light in the Attic', price: '£51.77' }

You should see the title and price of the selected book printed to the screen.

Conclusion

In this article, we learned how to scrape data using Node.js and Puppeteer, an approach that works even on sites that render their content with JavaScript.
