Introduction 

In this article, we’ll discuss how to efficiently download files with Puppeteer. You may have to explicitly specify a download location, download multiple files at the same time, and so on.

Unfortunately, these use cases aren’t well documented. That’s why this article is here to share some of the tips and tricks that can be used while working with Puppeteer. We’ll go through several practical examples and take a deep dive into Puppeteer’s APIs used for file download.

Downloading an image by simulating a button click

In the first example, we’ll take a look at a simple scenario where we automate a button click to download an image. We’ll open up a URL in a new browser tab. Then we’ll find the download button on the page. Finally, we’ll click on the download button.

We can use the following script to automate the download process:


    const puppeteer = require('puppeteer');
    async function simplefileDownload() {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.goto(
            'https://unsplash.com/photos/tn57JI3CewI', 
            { waitUntil: 'networkidle2' }
        );
        await page.click('._2vsJm');
    }
    simplefileDownload();

The headless option is set to false. This allows us to watch the automation in real time. We are creating a new instance of Puppeteer, then opening up a new tab with the given URL. Finally, we are using the click() function to simulate the button click. 

We can avoid the default download path by explicitly specifying the path in our script. Let’s update our script to set the path:

    const puppeteer = require('puppeteer');
    const path = require('path');
    const downloadPath = path.resolve('./download');
    async function simplefileDownload() {
        const browser = await puppeteer.launch({
            headless: false
        });
        
        const page = await browser.newPage();
        await page.goto(
            'https://unsplash.com/photos/tn57JI3CewI', 
            { waitUntil: 'networkidle2' }
        );
        
        // Page.setDownloadBehavior is a Chrome DevTools Protocol command;
        // page._client is a private API and may change between Puppeteer versions.
        await page._client.send('Page.setDownloadBehavior', {
            behavior: 'allow',
            downloadPath: downloadPath 
        });
        await page.click('._2vsJm');
    }
    simplefileDownload();

Replicating the download request

Next, let’s take a look at how we can download files by making an HTTP request. Instead of simulating clicks, we are going to find the image source. Here’s how it’s done:

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    const https = require('https');
    
    async function downloadWithLinks() {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.goto(
            'https://unsplash.com/photos/tn57JI3CewI', 
            { waitUntil: 'networkidle2' }
        );
        const imgUrl = await page.$eval('._2UpQX', img => img.src);
        
        https.get(imgUrl, res => {
            const stream = fs.createWriteStream('somepic.png');
            res.pipe(stream);
            stream.on('finish', () => {
                stream.close();
            })
        })
        await browser.close();
    }
    downloadWithLinks();

We are finding the image’s DOM node directly on the page and getting its src property. The src property is a URL. We are making an HTTPS GET request to this URL and using Node’s native fs module to write the file stream to our local file system. This method can only be used if the file we want to download has its src exposed in the DOM.

Why would we use this method instead of simulating clicks? Well, first of all, this method is often faster: we can simultaneously download multiple files. Let’s say a page features a few images and we want to download all of them. Clicking one of these images takes the user to a new page, and from there the user can download that image. To download the next image, the user has to go back to the previous page. This is tedious. We can write a script that mimics this behavior, but we don’t need to if the first page has the image URLs exposed in the DOM.

The following script will gather all the image sources and download them:

    const puppeteer = require('puppeteer');
    const fs = require('fs');
    const https = require('https');
    
    async function downloadMultiple() {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.goto(
            'https://unsplash.com/t/wallpapers', 
            { waitUntil: 'networkidle2' }
        );
        const imgUrls = await page.$$eval('._2UpQX', imgElms => {
            const urls = [];
            imgElms.forEach(elm => {
                urls.push(elm.src);
            })
            return urls;
        });
    
        imgUrls.forEach((url , index) => {
            https.get(url, res => {
                const stream = fs.createWriteStream(`download-${index}.png`);
                res.pipe(stream);
                stream.on('finish', () => {
                    stream.close();
                })
            })
        });
        await browser.close();
    }
downloadMultiple();

Downloading multiple files in parallel

In this next part, we’ll dive into some more advanced concepts. We’ll discuss parallel downloading. Downloading small files is straightforward. However, if you have to download multiple large files, things start to get complicated. Node.js at its core is a single-threaded system: Node has a single event loop and can only execute one process at a time. Therefore, if we have to download 10 files, each 1 gigabyte in size and each requiring about 3 minutes to download, then with a single process we’ll have to wait 10 x 3 = 30 minutes for the task to finish. This is not performant at all.

Our CPU cores can run multiple processes at the same time, and we can fork multiple child processes in Node. Child processes are how Node.js handles parallel programming. We can combine the child_process module with our Puppeteer script and download files in parallel.

The code snippet below is a simple example of running parallel downloads with Puppeteer.

// main.js 
    
    const fork = require('child_process').fork;
    const ls = fork("./child.js");
    ls.on('exit', (code)=>{
        console.log(`child_process exited with code ${code}`);
    });
    ls.on('message', (msg) => {
        ls.send('https://unsplash.com/photos/AMiglZWQSQQ');
        ls.send('https://unsplash.com/photos/TbEqd-GNC5w');
        ls.send('https://unsplash.com/photos/FiVujM6egyU');
        ls.send('https://unsplash.com/photos/yGBJB6lHYVw');
    });

We have two files in this solution; the first one is main.js. In this file we initiate our child process and send it the URLs of the images we want to download. For every URL the child receives, a new download is initiated.

    // child.js
    
    const puppeteer = require('puppeteer');
    const path = require('path');
    const downloadPath = path.resolve('./download');
    
    process.on('message', async (url)=> {
        console.log("CHILD: url received from parent process", url);
        await download(url)
    });
    
    process.send('Executed');
    
    async function download(url) {
        const browser = await puppeteer.launch({
            headless: false
        });
        const page = await browser.newPage();
        await page.goto(
            url, 
            { waitUntil: 'networkidle2' }
        );
        // Download logic goes here
        await page._client.send('Page.setDownloadBehavior', {
            behavior: 'allow',
            downloadPath: downloadPath 
        });
        await page.click('._2vsJm');
        // 
    }

In the child.js file, most of the download functionality is implemented. Note that the download function here is almost identical to our earlier solution. The only difference is the process.on function, which receives messages from the parent process and initiates a download for each URL. Because the child runs in its own process, this work happens outside the parent’s event loop. Our main script sends four URLs, so the child launches four instances of Chrome and downloads the files in parallel.
