Block resources with Puppeteer

Introduction 

Puppeteer is one of the most widely used tools for web scraping and automation. There are a few ways to block resources in Puppeteer. In this article, we’ll go over the various methods we can use to block or intercept specific network requests in our automation scripts.

Why block resources

First of all, let’s take a look at why we might want to block certain resources. When we write automation scripts with Puppeteer, our page will often load resources that aren’t necessary for our use case.

For instance, suppose we are writing a web scraper that crawls a few e-commerce sites to find a specific product and its price. The pages of these sites will contain several product images that we aren’t interested in. The pages may also load external JavaScript libraries to track user behavior and collect analytics. We don’t need any of these resources, yet they slow down our script. Therefore, we can block them from loading to make our script run faster.

Blocking resources with Puppeteer’s request interception API

Puppeteer has a native API method called setRequestInterception that lets us block requests. This is the most straightforward way to block resources. For example:

    // index.js
    const puppeteer = require('puppeteer');
    async function main() {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.setRequestInterception(true);
        page.on('request', (request) => {
            // Block PNG and JPG images by URL extension
            if (request.url().endsWith('.png') || request.url().endsWith('.jpg')) {
                request.abort();
            } else {
                // Every request that is not aborted must be continued explicitly
                request.continue();
            }
        });
        await page.goto('https://www.scrapingbee.com/');
        await new Promise((resolve) => setTimeout(resolve, 5000)); // wait for 5 seconds
        await browser.close();
    }
    main();

Puppeteer provides us with a request listener. It is a function that lets us tap into the browser’s HTTP requests. We listen to all the requests made by the browser and block those whose URL ends with an image extension. As you can see, the request.url() function returns the URL of the request. We check whether that URL ends with an image extension such as .png or .jpg, and abort the request if it does. Once request interception is enabled, every other request has to be continued explicitly, otherwise it will stall.

Now, we can run the code with the node index.js command.
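
One limitation of the endsWith() check above is that it misses image URLs that carry a query string, such as logo.png?v=2. The following is a minimal sketch of a stricter check, assuming we only care about the path part of the URL. The helper name hasImageExtension and the list of extensions are illustrative choices and are not part of Puppeteer.

```javascript
// Hypothetical helper (not part of Puppeteer): decides whether a URL points
// to an image by looking only at the extension of its path.
const IMAGE_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif', '.webp'];

function hasImageExtension(requestUrl) {
    let pathname;
    try {
        // URL is a global class in Node.js; pathname excludes the query
        // string and the fragment.
        pathname = new URL(requestUrl).pathname.toLowerCase();
    } catch (err) {
        return false; // not a parseable absolute URL
    }
    return IMAGE_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}

// Assumed usage inside the request listener from the example above:
// page.on('request', (request) => {
//     if (hasImageExtension(request.url())) {
//         request.abort();
//     } else {
//         request.continue();
//     }
// });
```

Because the check works on the parsed path, a URL such as /page?img=logo.png is not treated as an image, while /logo.png?v=2 is.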

Blocking resources by type

There are times when we want to block specific resources by type. For instance, we might want to block all media files (e.g. mp3, mp4). Video and audio files take significant time to load and slow down the script. Therefore, it’s a good idea to block these resources when we are writing a web scraper. Below is an example of blocking resources by type.

    const puppeteer = require('puppeteer');
    async function main() {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.setRequestInterception(true);
        page.on('request', (request) => {
            const url = request.url();
            if (url.endsWith('.png') || url.endsWith('.jpg')) {
                // Block images by URL extension
                request.abort();
            } else if (request.resourceType() === 'media') {
                // Block audio and video; Puppeteer reports both as 'media'
                request.abort();
            } else {
                request.continue();
            }
        });
        await page.goto('https://ca.news.yahoo.com/');
        await new Promise((resolve) => setTimeout(resolve, 5000)); // wait for 5 seconds
        await browser.close();
    }
    main();

In the above example, we call the resourceType() function on the request object. This function returns the resource type of the request. We check whether the resource type indicates an audio or video file (Puppeteer reports both with the resource type media) and, if it does, we block the request.
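
When several types need to be blocked, the chain of if/else branches can be replaced by a lookup in a Set. The following is a minimal sketch under that assumption. The names BLOCKED_TYPES and createBlockingHandler are hypothetical, and the selection of types is only an example; 'media', 'image' and 'font' are among the values resourceType() can return.

```javascript
// Example selection of resource types to block.
const BLOCKED_TYPES = new Set(['media', 'image', 'font']);

// Hypothetical helper: builds a listener suitable for page.on('request', ...).
// It only relies on the request exposing resourceType(), abort() and continue().
function createBlockingHandler(blockedTypes) {
    return (request) => {
        if (blockedTypes.has(request.resourceType())) {
            return request.abort();
        }
        return request.continue();
    };
}

// Assumed usage, after `await page.setRequestInterception(true)`:
// page.on('request', createBlockingHandler(BLOCKED_TYPES));
```

Keeping the blocked types in one place makes it easy to adjust the list per site without touching the listener itself.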

Blocking resources with plugins

Next, let’s take a look at how we can block resources using Puppeteer plugins. If you’re not familiar with Puppeteer plugins, I highly recommend taking a look at the puppeteer-extra project. puppeteer-extra is a wrapper for Puppeteer that lets you use various useful plugins and libraries with Puppeteer.

To get started, we first need to install puppeteer-extra. We can do this with the following command.

npm i puppeteer-extra

Next, we’ll install the Block Resources plugin with the following command.

npm i puppeteer-extra-plugin-block-resources

Let’s create a new script and set up the Block Resources plugin with Puppeteer.

   const puppeteer = require('puppeteer-extra');
   const blockResourcesPlugin = require('puppeteer-extra-plugin-block-resources')();
   puppeteer.use(blockResourcesPlugin);
   
   async function withPlugIn() {
       const browser = await puppeteer.launch({
           headless: false
       });
       const page = await browser.newPage();
       // Block media (audio/video) and script requests
       blockResourcesPlugin.blockedTypes.add('media');
       blockResourcesPlugin.blockedTypes.add('script');
       await page.goto('http://www.youtube.com', {waitUntil: 'domcontentloaded'});
       await new Promise((resolve) => setTimeout(resolve, 5000)); // wait for 5 seconds
       await browser.close();
   }
   withPlugIn();

Notice that in the code above we use the use() function to add the plugin to Puppeteer. The Block Resources plugin makes it easy to block resources. All we have to do is call blockedTypes.add() with the appropriate parameter. In the example above, we are blocking JavaScript and media resources.
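
If different pages need different blocking rules, the blocked types can be swapped between navigations. The following is a minimal sketch that assumes blockedTypes behaves like a standard JavaScript Set, as the add() calls above suggest. The helper setBlockedTypes is hypothetical and is not provided by the plugin.

```javascript
// Hypothetical helper: replaces the set of blocked types on a plugin instance.
// Assumption: `plugin.blockedTypes` is a standard Set, as used above.
function setBlockedTypes(plugin, types) {
    plugin.blockedTypes.clear();
    for (const type of types) {
        plugin.blockedTypes.add(type);
    }
    return plugin;
}

// Assumed usage with the plugin instance from the example above:
// setBlockedTypes(blockResourcesPlugin, ['image', 'media']);
// await page.goto('https://www.scrapingbee.com/');
// setBlockedTypes(blockResourcesPlugin, ['script']);
// await page.goto('https://ca.news.yahoo.com/');
```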

The official docs for the Block Resources plugin are a good place to start if you would like to learn more about it.
