How to execute JavaScript with Scrapy?

Most modern websites use a client-side JavaScript framework such as React, Vue or Angular. Scraping data from a dynamic website without server-side rendering often requires executing JavaScript code.

In this article, we will compare the most popular solutions for executing JavaScript with Scrapy, look at how to scale headless browsers, and introduce an open-source integration with the ScrapingBee API for JavaScript support and proxy rotation.

Scraping dynamic websites with Scrapy

Scraping client-side rendered websites with Scrapy used to be painful. It often meant inspecting API requests in the browser's network tools or extracting data from JavaScript variables embedded in the page source. While these hacks may work on some websites, the resulting code is generally harder to understand and maintain than traditional XPath selectors. But to scrape client-side data directly from the HTML, you first need to execute the JavaScript code.
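
For example, a common version of that hack is pulling a JSON blob out of a JavaScript variable embedded in a script tag. Here is a minimal sketch, assuming a hypothetical page that assigns its data to window.__INITIAL_STATE__ (the variable name and keys are placeholders):

import json
import re

def parse(self, response):
    # Hypothetical: the page assigns its data to a JavaScript variable
    # inside a <script> tag, e.g. window.__INITIAL_STATE__ = {...};
    raw = response.xpath(
        '//script[contains(text(), "__INITIAL_STATE__")]/text()'
    ).get() or ''
    match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});', raw, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        yield {'products': data.get('products')}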

Scrapy middlewares for headless browsers

A headless browser is a web browser without a graphical user interface. We will use three libraries to execute JavaScript with Scrapy: scrapy-selenium, scrapy-splash, and scrapy-scrapingbee.

All three libraries are integrated as a Scrapy downloader middleware. Once configured in your project settings, instead of yielding a normal Scrapy Request from your spiders, you yield a SeleniumRequest, SplashRequest, or ScrapingBeeRequest.

Executing JavaScript in Scrapy with Selenium

Locally, you can interact with a headless browser from Scrapy using the scrapy-selenium middleware. Selenium is a framework for interacting with browsers, commonly used for testing applications, web scraping, and taking screenshots.

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')  # geckodriver must be on your PATH
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # run the browser without a GUI

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

In your spiders, you can then yield a SeleniumRequest.

from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse)
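
Putting it together, here is a minimal spider sketch; the quotes.toscrape.com JavaScript demo page is used only as an example target, and the spider name and selectors are illustrative:

import scrapy
from scrapy_selenium import SeleniumRequest


class QuotesSpider(scrapy.Spider):
    name = 'quotes-js'

    def start_requests(self):
        # The page is rendered by the headless browser before parse() is called.
        yield SeleniumRequest(url='https://quotes.toscrape.com/js/', callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote span.text::text').getall():
            yield {'text': quote}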

Selenium allows you to interact with the browser in Python and JavaScript. The driver object is accessible from the Scrapy response. Sometimes it can be useful to inspect the HTML code after you click on a button. Locally, you can set up a breakpoint with an ipdb debugger to inspect the HTML response.

def parse(self, response):
    # scrapy-selenium exposes the Selenium driver on the request meta.
    driver = response.request.meta['driver']
    driver.find_element_by_id('show-price').click()

    # Drop into a debugger to inspect the HTML rendered after the click.
    import ipdb; ipdb.set_trace()
    print(driver.page_source)

Otherwise, Scrapy XPath and CSS selectors are accessible from the response object to select data from the HTML.

def parse(self, response):
    title = response.selector.xpath(
        '//title/text()'
    ).extract_first()

SeleniumRequest takes some additional arguments such as wait_time to wait before returning the response, wait_until to wait for an HTML element, screenshot to take a screenshot, and script to execute a custom JavaScript script.
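
For example, a request sketch using those options, based on the scrapy-selenium README; the element id and the scroll script are placeholders:

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse,
    wait_time=10,  # wait up to 10 seconds for the condition below
    wait_until=EC.element_to_be_clickable((By.ID, 'show-price')),
    screenshot=True,  # PNG bytes end up in response.meta['screenshot']
    script='window.scrollTo(0, document.body.scrollHeight);',
)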

On production, the main issue with scrapy-selenium is that there is no trivial way to set up a Selenium grid to have multiple browser instances running on remote machines.

Using Scrapy cache and concurrency to scrape faster

Scrapy uses Twisted under the hood, an asynchronous networking framework. Twisted makes Scrapy fast and able to scrape multiple pages concurrently. However, to execute JavaScript code you need to resolve requests with a real browser or a headless browser. There are two challenges with headless browsers: they are slower and hard to scale.

Executing JavaScript in a headless browser and waiting for all network calls can take several seconds per page. When scraping multiple pages, this makes the scraper significantly slower. Fortunately, Scrapy provides caching to speed up development and concurrent requests for production runs.

Locally, while developing a scraper you can use Scrapy’s built-in cache system. It will make subsequent runs faster as the responses are stored on your computer in a hidden folder .scrapy/httpcache. You can activate the HttpCacheMiddleware in your project settings:

HTTPCACHE_ENABLED = True
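
A few related cache settings can be tuned as well; these are standard Scrapy options, and the values below are only illustrative:

HTTPCACHE_DIR = 'httpcache'  # stored under the hidden .scrapy folder
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors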

Another issue with headless browsers is that they consume memory for each request. On production, you need an environment that can handle multiple browsers. To make several requests concurrently, you can increase the CONCURRENT_REQUESTS setting in your project (Scrapy's default is 16):

CONCURRENT_REQUESTS = 32

When using ScrapingBee, remember to set concurrency according to your ScrapingBee plan.

Conclusion

I compared three Scrapy middlewares to render and execute JavaScript with Scrapy. Selenium allows you to drive all major browsers from Python but can be hard to scale. Splash can be run locally with Docker or deployed to Scrapinghub, but it relies on a custom browser implementation and you have to write scripts in Lua. ScrapingBee uses the latest Chrome headless browser, allows you to execute custom scripts in JavaScript, and also provides proxy rotation for the hardest websites to scrape.
