Web scraping is the automation of data collection from the web. This usually means deploying a “crawler” that automatically searches the web and scrapes data from selected pages. Collecting data through scraping is often much faster than manual data-gathering, and it may be mandatory if the website provides no API. Scraping methods vary depending on how the website displays its data.
One way to display content is through a one-page website, also referred to as a single-page application (SPA). Single-page applications have become a trend, and with the implementation of infinite scrolling techniques, programmers can develop SPAs that let users scroll forever. If you’re an avid social media user, you’ve probably experienced this feature on platforms like Instagram, Twitter, Facebook, and Pinterest.
While a one-page website is good for user experience (UX), it can make your attempts to extract data seem more complicated. But there’s no need to worry: thanks to Puppeteer, you’ll be able to scrape data from infinite scrolling pages by the end of this article.
Prerequisites & Goals
- Some experience writing ES6 JavaScript.
- A proper understanding of promises and some experience with async/await.
- Node.js installed on your development machine.
What is Infinite Scrolling?
Before you plan to scrape data from a never-ending timeline, it’s worth asking: what exactly is infinite scrolling?
Infinite scrolling is a web-design technique that loads content continuously as the user scrolls down the page. There’s an Infinite Scroll JavaScript plugin that automatically adds the next page, preventing a full page load. The first version was created in 2008 by Paul Irish and was a breakthrough in web development. The plugin uses Ajax to pre-fetch content from the next page and then add it to the current page. There are many other ways to serve infinite scrolling content, such as API endpoints that incrementally deliver more data, processing data from multiple endpoints before injecting anything into the webpage, or real-time data delivery through WebSockets.
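To make the endpoint-based approach concrete, here is a minimal client-side sketch. Everything in it is hypothetical and for illustration only: a paginated `/api/posts?page=N` endpoint returning JSON, and a `<div id="feed">` container in the page.

```javascript
// Illustrative infinite scroll: when the user nears the bottom of the
// page, fetch the next page of data and append it to the feed.
let page = 1;
let loading = false;

window.addEventListener('scroll', async () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.scrollHeight - 200;
  if (!nearBottom || loading) return;

  loading = true; // avoid firing several requests per scroll event
  const response = await fetch(`/api/posts?page=${page}`);
  const posts = await response.json(); // assume [{ text: '...' }, ...]
  for (const post of posts) {
    const div = document.createElement('div');
    div.className = 'blog-post';
    div.innerText = post.text;
    document.querySelector('#feed').appendChild(div);
  }
  page += 1;
  loading = false;
});
```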
Advantages
1.) Discovery Applications
It is almost a must-have feature for discovery applications/interfaces. If users don’t know exactly what to look for, they can browse an immense amount of content to find the one thing they like.
2.) Mobile Devices
Since mobile devices have much smaller screens, infinite scrolling can create a much more pleasant UX.
3.) User Engagement
Since new results are always loading onto the page, users are drawn into the application.
Disadvantages
1.) Poor for Page Performance
Page loading is vital for UX. As a user scrolls further down a page, more content has to load on the same page. As a result, page performance becomes increasingly slow.
2.) Poor for Item Search and Location
Users may get to a particular point in the stream where they can’t bookmark their location. If they leave the site, they’ll lose all their progress, which degrades the UX.
3.) Loss of Footers
A vital part of applications that contains easily accessible, essential information is now gone.
Now that you know a bit more about this content presentation style and its uses in development, you can better understand how to scrape data from infinite scrolling interfaces. That’s where Puppeteer comes into play.
What is Puppeteer?
It can take time to understand and reverse-engineer an app’s data delivery, and websites may take different approaches to building infinite scrolling content. However, you won’t have to worry about any of that today, all thanks to Puppeteer. And, no, not the kind that works puppets. Puppeteer is a headless Chrome Node.js API that lets you emulate scrolling on the page and retrieve the desired data from the rendered elements.
Puppeteer allows you to behave almost exactly as if you were in your regular browser, except programmatically and without an interface.
Here are some examples of what you can do (a short sketch of the first one follows the list):
- Generate screenshots and PDFs of pages.
- Crawl a SPA and generate pre-rendered content.
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment – Run your tests directly within the latest version of Chrome using the newest JavaScript and browser features.
- Capture a timeline trace of your site to assist diagnose performance issues.
- Test Chrome Extensions.
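As a quick taste of the first item on that list, here is a minimal sketch of screenshot and PDF generation (the example.com URL is just a placeholder):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' }); // capture the viewport
  await page.pdf({ path: 'example.pdf', format: 'A4' }); // headless mode only
  await browser.close();
})();
```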
How to Scrape Infinite Scrolling Websites Using Puppeteer
Assuming you already have npm installed, create a folder to store your Puppeteer project.
```bash
mkdir infinite-scroll
cd infinite-scroll
npm install --save puppeteer
```
By using npm, you’re installing both Puppeteer and a version of the Chromium browser used by Puppeteer. On Linux machines, Puppeteer might require some additional dependencies. This lets you save time by jumping straight into writing the scraping script. Open your go-to text editor and create a scrape-infinite-scroll.js file. In that file, copy in the following code:
```javascript
// Puppeteer will not run without these lines
const fs = require('fs');
const puppeteer = require('puppeteer');
```
These first couple of lines are boilerplate configuration. Next, you’ll create a function for the items you’d like to scrape. Open up your browser console and examine the page HTML to work out the selector for your extracted elements.
```javascript
function extractItems() {
  /* For extractedElements, you are selecting the tag and class that
     hold your desired information, then choosing the desired child
     element you would like to scrape from. In this case, you are
     selecting the "<div class=blog-post />" inside "<div class=container />".
     See below: */
  const extractedElements = document.querySelectorAll('#container > div.blog-post');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}
```
The next function is scrapeItems. This function controls the actual scrolling and extraction by using page.evaluate to repeatedly scroll down the page, extracting any items from the injected extractItems function, until at least itemCount items have been scraped.
```javascript
async function scrapeItems(
  page,
  extractItems,
  itemCount,
  scrollDelay = 800,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await page.waitForTimeout(scrollDelay);
    }
  } catch (e) {}
  return items;
}
```
This last chunk of code handles launching the Chromium browser and navigating to the page, as well as setting the number of items you’re scraping and where that data goes.
```javascript
(async () => {
  // Set up Chromium browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  page.setViewport({ width: 1280, height: 926 });

  // Navigate to the example page.
  await page.goto('https://mmeurer00.github.io/infinite-scroll-example/');

  // Auto-scroll and extract desired items from the page. Currently set to extract ten items.
  const items = await scrapeItems(page, extractItems, 10);

  // Save extracted items to a new file.
  fs.writeFileSync('./items.txt', items.join('\n') + '\n');

  // Close the browser.
  await browser.close();
})();
```
It’s important to include everything you need for item extraction within the extractItems function’s definition. The following line:
```javascript
items = await page.evaluate(extractItems);
```
will serialize the extractItems function before evaluating it in the browser’s context, meaning its lexical environment becomes unavailable during execution.
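In practice, that means any variable the function closes over in your Node.js script simply won’t exist when it runs in the page. A minimal sketch of the pitfall and the usual workaround (the `selector` variable is hypothetical):

```javascript
// BROKEN: `selector` lives in the Node.js scope, so it is undefined
// by the time extractItems is serialized and run inside the browser.
const selector = '#container > div.blog-post';
function extractItems() {
  return [...document.querySelectorAll(selector)].map((el) => el.innerText);
}

// WORKS (inside an async function): page.evaluate forwards extra
// arguments to the page, so pass the value in explicitly instead.
const items = await page.evaluate(
  (sel) => [...document.querySelectorAll(sel)].map((el) => el.innerText),
  selector,
);
```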
When finished, your file should look similar to:
```javascript
// Puppeteer will not run without these lines
const fs = require('fs');
const puppeteer = require('puppeteer');

function extractItems() {
  /* For extractedElements, you are selecting the tag and class that
     hold your desired information, then choosing the desired child
     element you would like to scrape from. In this case, you are
     selecting the "<div class=blog-post />" inside "<div class=container />".
     See below: */
  const extractedElements = document.querySelectorAll('#container > div.blog-post');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}

async function scrapeItems(
  page,
  extractItems,
  itemCount,
  scrollDelay = 800,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      await page.waitForTimeout(scrollDelay);
    }
  } catch (e) {}
  return items;
}

(async () => {
  // Set up Chromium browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  page.setViewport({ width: 1280, height: 926 });

  // Navigate to the example page.
  await page.goto('https://mmeurer00.github.io/infinite-scroll-example/');

  // Auto-scroll and extract desired items from the page. Currently set to extract ten items.
  const items = await scrapeItems(page, extractItems, 10);

  // Save extracted items to a new file.
  fs.writeFileSync('./items.txt', items.join('\n') + '\n');

  // Close the browser.
  await browser.close();
})();
```
Run the script with:
```bash
node scrape-infinite-scroll.js
```
That command opens the demo page in the Chromium browser and scrolls until ten `#container > div.blog-post` items are loaded, saving the text from the extracted items in `./items.txt`. By running:
```bash
open ./items.txt
```
you’ll have access to all of your scraped data.
What happens if the script is unable to extract the number of items indicated? Well, Puppeteer’s functions that evaluate JavaScript on the page, like page.waitForFunction, have a default 30-second timeout. Here, that function waits for the page height to increase after each scroll. So, as long as the page keeps loading more items, the script keeps going through the while loop, only breaking out and throwing an error when the height does not change for 30 seconds, or whatever custom timeout you set.
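If you want that cutoff to be shorter or longer, `page.waitForFunction` accepts an options object with a `timeout` field in milliseconds. A minimal sketch, with the 5-second value chosen purely for illustration:

```javascript
// Give up after 5 seconds instead of the default 30 if the page
// height stops growing (e.g. the feed has no more items to load).
await page.waitForFunction(
  `document.body.scrollHeight > ${previousHeight}`,
  { timeout: 5000 },
);
```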
Alternative Scraping Methods for Infinite Scrolling
While Puppeteer can decrease your workload, it may not always be the best approach to scraping, depending on your use case. An alternative and less extensive way to scrape is through Cheerio. Cheerio is an npm library, also called “jQuery for Node”, that lets you scrape data with a lightweight framework. Cheerio works with the raw HTML passed to it, and works best when the data you want to parse can be fetched directly from a URL.
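Here is a minimal sketch of that approach. Note the assumptions: the URL is a placeholder, you’ve run `npm install cheerio axios`, and, since Cheerio never executes the page’s JavaScript, it only sees items present in the initial HTML, not ones injected by infinite scrolling.

```javascript
const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
  // Fetch the raw HTML; Cheerio parses it without running any scripts.
  const { data: html } = await axios.get('https://example.com/blog');
  const $ = cheerio.load(html);
  // Same selector idea as the Puppeteer example above.
  const items = $('#container > div.blog-post')
    .map((i, el) => $(el).text())
    .get();
  console.log(items);
})();
```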
Conclusion
Thanks to Puppeteer, you can now extract data from infinite scrolling applications quickly and efficiently. While it may not be what you use in every situation, the script from this article should serve as a starting point for emulating human-like scrolling on an application.
Abhishek Kumar