The introduction of the Fetch API changed how JavaScript developers make HTTP calls: they no longer have to download third-party packages just to make an HTTP request. That was great news for frontend developers, but because fetch could only be used in the browser, backend developers still had to rely on third-party packages. Then node-fetch came along, aiming to bring the same fetch API that browsers support to Node.js. In this article, we will take a look at how node-fetch can be used to help you scrape the web!
What is the Fetch API?
Fetch is a specification that aims to standardize requests, responses, and everything in between, which the standard calls fetching (hence the name). The browser fetch API and node-fetch are implementations of this specification. The biggest and most important difference between fetch and its predecessor XHR is that fetch is built around Promises. This means that developers no longer have to deal with callback hell, messy code, and the extremely verbose API that XHR has.
There are a few more technical differences as well: for example, when a request returns with an HTTP status code 404, the promise returned from the fetch call does not get rejected.
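Here is a minimal sketch of that behaviour (the URL is only a placeholder): the promise still resolves on a 404, and it is up to you to inspect response.ok or response.status.

const fetch = require('node-fetch');

const checkStatus = async () => {
  // Placeholder URL that presumably returns a 404
  const response = await fetch('https://example.com/does-not-exist');

  // The promise resolved even though the server answered with an error status
  console.log(response.ok);     // false for any non-2xx status
  console.log(response.status); // e.g. 404
};

checkStatus();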
node-fetch brings all of this to the server side. Developers no longer have to learn a different API, its terminology, or how fetching actually happens behind the scenes to perform HTTP requests from the server. It’s as simple as running npm install node-fetch and writing HTTP requests almost the same way you would in a browser.
Scraping the web with node-fetch and cheerio
To get the ball rolling, you first need to install cheerio alongside node-fetch. node-fetch lets you get the HTML of any page, but because the result is just a bunch of text, you need some tooling to extract what you need from it. cheerio helps with that: it provides a very intuitive jQuery-like API that lets you extract data from the HTML you received with node-fetch.
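Both packages can be installed from npm in one go:

npm install node-fetch cheerio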
For the purpose of this article, we will scrape Reddit:
const fetch = require('node-fetch');

const getReddit = async () => {
  const response = await fetch('https://reddit.com/');
  const body = await response.text();
  console.log(body); // prints a chock full of HTML richness
  return body;
};
fetch has a single mandatory argument: the resource URL. When fetch is called, it returns a promise that resolves to a Response object as soon as the server responds with the headers. At this point, the body is not yet available. The returned promise resolves whether or not the request succeeded at the HTTP level; it is only rejected on network errors such as connectivity issues, which means the promise can resolve even if the server responds with a 500 Server Error. The Response class implements the Body mixin, which exposes the body as a ReadableStream and provides a convenient set of promise-based methods for consuming that stream.
Body.text() is one of them, and since Response implements Body, all the methods that Body has can be used by a Response instance. Calling any of these methods returns a promise that eventually resolves to the data.
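As a quick illustration, here is a sketch of the two most commonly used Body methods, text() and json(); the API URL is hypothetical and only there to show the shape of the calls.

const fetch = require('node-fetch');

const demoBodyMethods = async () => {
  const htmlResponse = await fetch('https://example.com/');
  const html = await htmlResponse.text(); // resolves to the raw HTML string

  // Hypothetical JSON endpoint, purely for illustration
  const apiResponse = await fetch('https://api.example.com/items');
  const data = await apiResponse.json(); // resolves to the parsed JSON body

  return { html, data };
};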
With this data, which in this case is HTML text, we can use cheerio to create a DOM and then query it to extract what interests you. For example, if you want a list of all the posts in the feed, you could get the selector for the post titles (using your browser’s dev tools) and then use cheerio like this:
const fetch = require('node-fetch');
const cheerio = require('cheerio');

const getReddit = async () => {
  // get html text from reddit
  const response = await fetch('https://reddit.com/');
  // using await to ensure that the promise resolves
  const body = await response.text();

  // parse the html text and extract titles
  const $ = cheerio.load(body);
  const titleList = [];

  // using CSS selector
  $('._eYtD2XCVieq6emjKBH3m').each((i, title) => {
    const titleNode = $(title);
    const titleText = titleNode.text();
    titleList.push(titleText);
  });

  console.log(titleList);
};

getReddit();
cheerio.load() parses any HTML text into a queryable DOM. cheerio then provides various methods to extract parts of the constructed DOM, one of which is each(); this method lets you iterate over a list of nodes. How do we know we get a list? We’re looking for the titles of the posts on Reddit’s home page; currently, the class name of one such title is _eYtD2XCVieq6emjKBH3m, but it may change in the future.
By iterating over the list with each(), you get each HTML element, which you can feed back to cheerio to extract the text out of each title.
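Because class names like the one above tend to change, here is a hedged sketch of the same idea written as a small reusable helper; it uses cheerio’s map() instead of each(), and the 'h3' selector passed in at the end is purely an assumption for illustration — inspect the page with your browser’s dev tools to find the selector that actually matches.

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const getTitles = async (url, selector) => {
  const response = await fetch(url);
  const body = await response.text();
  const $ = cheerio.load(body);

  // .map() is an alternative to .each() when you just want an array back
  return $(selector)
    .map((i, el) => $(el).text())
    .get();
};

// 'h3' is an assumed selector, used only as an example
getTitles('https://reddit.com/', 'h3').then(console.log);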
This process is fairly intuitive and works for any website, as long as the website in question does not have anti-scraping mechanisms that throttle, limit, or outright block your scraper. While these can often be worked around, the effort and dev time required to do so may simply be unaffordable. This guide can help you out in such cases!
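One small, common tweak is sending browser-like request headers, since some sites reject requests that carry no User-Agent at all. Below is a minimal sketch of passing custom headers through fetch’s options object; the header values are examples only, not a guaranteed way past any particular anti-scraping system.

const fetch = require('node-fetch');

const getWithHeaders = async (url) => {
  const response = await fetch(url, {
    headers: {
      // Example values only
      'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)',
      'Accept': 'text/html',
    },
  });
  return response.text();
};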
Making fetch requests in parallel
At times you may want to make multiple fetch calls to different URLs at the same time. Doing them one after the other will ultimately lead to poor performance and long wait times for your end users.
To solve this problem, you should parallelize your code. Sending HTTP requests consumes very few of your computer’s resources; it takes time only because your computer sits idle, waiting for the server to respond. We call those kinds of tasks “I/O-bound”, as opposed to tasks that are slow because they consume a lot of computing power, which are “CPU-bound”.
“I/O-bound” tasks can be efficiently parallelized with promises. And since fetch is promise-based, you can use Promise.all to make multiple fetch calls at the same time, like this:
const newProductsPagePromise = fetch('https://some-website.com/new-products');
const recommendedProductsPagePromise = fetch('https://some-website.com/recommended-products');

// Returns a promise that resolves to a list of the results
Promise.all([newProductsPagePromise, recommendedProductsPagePromise]);
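To actually consume the results, you can await the combined promise and then read each body; here is a sketch along those lines (the URLs are the same placeholders as above).

const fetch = require('node-fetch');

const getBothPages = async () => {
  // Both requests go out immediately; we await only once, on the combined promise
  const [newProductsResponse, recommendedResponse] = await Promise.all([
    fetch('https://some-website.com/new-products'),
    fetch('https://some-website.com/recommended-products'),
  ]);

  // The bodies can be read in parallel as well
  const [newProductsHtml, recommendedHtml] = await Promise.all([
    newProductsResponse.text(),
    recommendedResponse.text(),
  ]);

  return { newProductsHtml, recommendedHtml };
};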
Conclusion
With that, you’ve just mastered node-fetch for web scraping. Although fetch is great for simple use cases, it can get difficult to get right when you have to deal with Single Page Applications that use JavaScript to render most of their pages. Challenging tasks such as scraping concurrently are also left for you to handle, as node-fetch is simply an HTTP client like any other.
The other benefit of using node-fetch is that it’s much more efficient than using a headless browser. Even for simple tasks like submitting a form, headless browsers are slow and use a lot of server resources.
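For comparison, submitting a form with node-fetch boils down to a single POST request. Here is a minimal sketch; the URL and field names are placeholders, not a real login endpoint.

const fetch = require('node-fetch');

const submitForm = async () => {
  // URLSearchParams produces an application/x-www-form-urlencoded body,
  // which is what a classic HTML form would send
  const params = new URLSearchParams();
  params.append('username', 'john');   // placeholder field
  params.append('password', 'secret'); // placeholder field

  const response = await fetch('https://example.com/login', {
    method: 'POST',
    body: params,
  });
  return response.status;
};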
Now that you’ve mastered node-fetch, give ScrapingBee a try: you get the first 1,000 requests for free. Check out the getting started guide here!
Scraping the web is challenging, given that anti-scraping mechanisms get smarter by the day. And even if you manage to do it, getting it right can be quite a tedious task. ScrapingBee lets you skip the noise and focus on what matters most: the data.
Abhishek Kumar