Python Libraries for Web Scraping

Web scraping is the process of extracting structured and unstructured data from the internet with the help of programs and exporting it into a useful format.

  1. Requests (HTTP for Humans) Library for Web Scraping

Let's start with the most basic Python library for web scraping. Requests lets us make HTTP requests to a website's server to retrieve the data on its pages.

Requests is a Python library for making various kinds of HTTP requests, such as GET and POST. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.

However, the Requests library doesn't parse the HTML data it retrieves. If we want to do that, we need libraries like lxml and Beautiful Soup.
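Here is a quick sketch of a basic GET request with Requests; it assumes the requests package is installed (pip install requests) and uses the Books to Scrape demo site that the later examples in this article also target:

import requests

response = requests.get('https://books.toscrape.com/')
# 200 indicates success; response.text holds the raw HTML as a string
print(response.status_code)
print(response.text[:200])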

Let's take a look at the advantages and disadvantages of the Requests library.

Advantages:

  • Simple
  • Basic/Digest Authentication
  • International Domains and URLs
  • Chunked Requests
  • HTTP(S) Proxy Support

Disadvantages:

  • Retrieves only static content of a page
  • Can’t be used for parsing HTML
  • Can’t handle websites made purely with JavaScript
  2. lxml Library for Web Scraping

We know the Requests library cannot parse the HTML retrieved from a web page. Therefore, we need lxml, a high-performance, blazingly fast, production-quality HTML and XML parsing Python library.

It combines the speed and power of element trees with the simplicity of Python, and it works well when we're scraping large datasets. The combination of Requests and lxml is very common in web scraping. lxml also allows you to extract data from HTML using XPath and CSS selectors.
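Here is a minimal sketch of that combination, pulling book titles from the Books to Scrape demo site with an XPath expression; the article/h3/a structure is an assumption based on that site's markup:

import requests
from lxml import html

page = requests.get('https://books.toscrape.com/')
tree = html.fromstring(page.content)
# the title attribute of each book link holds the full book name
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')
print(titles[:5])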

Let's take a look at the advantages and disadvantages of the lxml library.

Advantages:

  • Faster than most of the parsers out there.
  • Lightweight
  • Uses element trees
  • Pythonic API

Disadvantages:

  • Does not work well with poorly designed HTML
  • The official documentation isn’t very beginner-friendly
  3. Beautiful Soup Library for Web Scraping

Beautiful Soup is probably the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents.

One of the primary reasons the Beautiful Soup library is so popular is that it's easy to work with and well suited to beginners. We can also combine Beautiful Soup with other parsers like lxml.
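As a minimal sketch (assuming both beautifulsoup4 and lxml are installed), here is Beautiful Soup parsing a small HTML snippet with the lxml parser:

from bs4 import BeautifulSoup

html_doc = '<html><body><h1>Hello, Soup</h1></body></html>'
soup = BeautifulSoup(html_doc, 'lxml')
# navigate the parse tree directly by tag name
print(soup.h1.get_text())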

Advantages:

  • Requires only a few lines of code
  • Great documentation
  • Easy for beginners to learn
  • Robust
  • Automatic encoding detection

Disadvantages:

  • Slower than lxml
  4. Selenium Library for Web Scraping

There is a limitation to all of the Python libraries we've discussed so far: we cannot easily scrape data from dynamically populated websites. This happens because the data on the page is sometimes loaded through JavaScript. That's where Selenium comes into play.

Selenium is a Python library originally made for automated testing of web applications. Although it wasn't made for web scraping, the data science community turned that around pretty quickly.

If time and speed aren't a priority for you, then you can definitely use Selenium.
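As a minimal sketch, here is Selenium opening a page in headless Chrome and reading its title; this assumes Chrome and a matching ChromeDriver are installed, which is covered later in this article:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://books.toscrape.com/')
# the title is only available after the page (and its JavaScript) has loaded
print(driver.title)
driver.quit()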

Advantages:

  • Beginner-friendly
  • Automated web scraping
  • Can scrape dynamically populated sites 
  • Automates web browsers
  • Can do almost anything on a web page that a person can

Disadvantages:

  • Very slow
  • Difficult to set up
  • High CPU and memory usage
  • Not ideal for large projects
  5. Scrapy

Scrapy isn't just a library; it's a whole web scraping framework created by the co-founders of Scrapinghub, Pablo Hoffman and Shane Evans. It's a full-fledged web scraping solution that does all the work for you.

Scrapy provides spider bots that can crawl multiple websites and extract data. With Scrapy, you can create your own spider bots, host them on Scrapy Hub, or expose them as an API. It lets you build fully functional spiders in a matter of minutes. You can also create pipelines using Scrapy.

The neatest thing about Scrapy is that it's asynchronous: it can make multiple HTTP requests simultaneously.

You can also add plugins to Scrapy to enhance its functionality. Although Scrapy can't handle JavaScript the way Selenium does, you can pair it with a library called Splash, a lightweight browser. With Splash, Scrapy can even extract data from dynamic websites.
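To give a feel for the framework, here is a minimal spider sketch for the Books to Scrape demo site; the CSS selector reflects that site's markup, and the file name is arbitrary (save it as, say, books_spider.py and run it with scrapy runspider books_spider.py):

import scrapy

class BooksSpider(scrapy.Spider):
  name = 'books'
  start_urls = ['https://books.toscrape.com/']

  def parse(self, response):
    # yield the title of every book on the page as a scraped item
    for title in response.css('article.product_pod h3 a::attr(title)').getall():
      yield {'title': title}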

Advantages:

  • Asynchronous
  • Excellent documentation
  • Various plugins
  • Create custom pipelines and middlewares
  • Low CPU and memory usage
  • Well designed architecture
  • A plethora of online resources available

Disadvantages:

  • Steep learning curve
  • Overkill for simple jobs
  • Not beginner-friendly

Modern Python Web Scraping Using Multiple Libraries

Web scraping is the act of extracting data from websites across the web. Other names for web scraping are web crawling or web extraction. It's a simple process, with a website URL as the initial target.

Python is a general-purpose language with many uses ranging from web development to AI, machine learning, and much more. You can perform web scraping with Python by taking advantage of some libraries and tools available on the internet.

We will go through some popular tools and services we can use with Python to scrape a web page. The tools we'll discuss include:

Beautiful Soup, Requests, Selenium, Scrapy.

Create a project directory and navigate into it. Open your terminal and run the commands below.

mkdir python_scraper
cd python_scraper
  1. Python Web Scraping using Beautiful Soup

Beautiful Soup is a library that pulls data out of HTML and XML files. It works best with parsers, providing elegant ways of navigating, searching, and modifying the parse tree.

Open your terminal and run the command below:

pip install beautifulsoup4

With Beautiful Soup installed, we can start writing the scraping code.

We are scraping https://books.toscrape.com/ for demonstration purposes.

We want to extract the title of every book and display the titles on the terminal. The first step in scraping a website is understanding its HTML layout.

The book title we want is inside an <a> tag, inside an <h3> tag, inside an <article> tag, and finally inside the <ol> element.

To scrape and get the book titles, let's create a new Python file and call it beautiful_soup.py.

When done, add the following code to the beautiful_soup.py file:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url_to_scrape = 'https://books.toscrape.com/'
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')
# get book title
for data in html_soup.select('ol'):
  for title in data.find_all('a'):
    print(title.get_text())

In the above code snippet, we open our webpage with the help of the urlopen() method. The read() method reads the whole page and assigns its contents to the page_html variable. We then parse the page using html.parser, which helps us work with the HTML code in its nested structure.

Next, we use the select() method provided by the BS4 library to get the <ol> element. We loop through the HTML elements inside the <ol> element to get the <a> tags, which contain the book names. Finally, on each run of the loop, we print out the text inside each <a> tag with the help of the get_text() method.

You can execute the file using the terminal by running the command below.

python beautiful_soup.py

Now let's get the prices of the books too.

The price of the book is inside a <p> tag, which is itself inside a <div> tag. As you'll see, there's more than one <p> tag and more than one <div> tag on the page. To get the right element with the book price, we'll use CSS class selectors; lucky for us, each class is unique for its tag.

Below is the code snippet to get the price of every book; add it at the bottom of the file:

# get book prices
for price in html_soup.find_all('p', class_='price_color'):
  print(price.get_text())

Your completed code should look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url_to_scrape = 'https://books.toscrape.com/'
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')
# get book title
for data in html_soup.select('ol'):
  for title in data.find_all('a'):
    print(title.get_text())
# get book prices
for price in html_soup.find_all('p', class_='price_color'):
  print(price.get_text())
  2. Python Web Scraping with Requests

Requests is an elegant HTTP library. It allows you to send HTTP requests without the need to manually add query strings to your URLs. In this example we'll use requests-html, which builds on Requests and adds HTML parsing support.

To use it, we first need to install it. Open your terminal and run the command below:

pip3 install requests_html

Once you've installed it, create a new Python file for the code. Let's name the file:

requests_scrape.py

Now add the code below inside the created file:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://books.toscrape.com/')
get_books = r.html.find('.row')[2]
# get book title
for title in get_books.find('h3'):
  print(title.text)
# get book prices
for price in get_books.find('.price_color'):
  print(price.text)

In this code snippet, the first line imports HTMLSession from the requests_html library and instantiates it. We then use the session to perform a GET request to the Books to Scrape URL.

After performing the GET request, we get the Unicode representation of the HTML content of the Books to Scrape website. From the HTML content, we take the element with the class row located at index 2, which contains the list of books, and assign it to the get_books variable.

To get the prices of the books, we only change which element the find() method looks for within the HTML content. Luckily, the price is inside a <p> tag with a unique class, price_color, that doesn't appear anywhere else. We loop through the HTML content and print out the text content of each matching tag.

Execute the code by running the following command in your terminal:

python requests_scrape.py

This prints the titles and prices to the terminal.

  3. Python Web Scraping with Selenium

Selenium is a web-based automation tool. Its primary purpose is testing web applications, but it can still do well in web scraping.

We are going to import various tools to help us scrape.

We will be using the Chrome browser, and for this, we need the Chrome web driver to work with Selenium.

Download the Chrome web driver using either of the following methods:

  1. You can download it directly from the link below
  2. Or by using a Linux machine

When you're done, create a new Python file; let's call it selenium_scrape.py.

Add the following code to the file:

from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)
container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')
# get book titles
titles = container.find_elements(By.TAG_NAME, 'a')
for title in titles:
  print(title.text)

In the above code, we first import a web driver from Selenium, which will control Chrome. Selenium requires a driver to interface with the chosen browser.

We then specify the driver we want to use, which is Chrome. It takes the path to the ChromeDriver executable and navigates to the site URL. Because we haven't launched the browser in headless mode, the browser appears, and we can see what it's doing.

The container variable holds the element found by the XPath of the <ol> tag that contains the book titles. Selenium provides methods for locating elements by XPath, tag name, class name, and more. You can read more in the Selenium documentation on locating elements.

To get the XPath of a tag, inspect the elements, find the tag that has the book title, and right-click on it. A dropdown menu will appear; select Copy, then select Copy XPath.

From the container variable, we can then find the titles by tag name and loop through them to print all the titles as text.

The output will be the list of book titles printed on the terminal.

Now, let's change the file to get the book prices by adding the following code after the get-book-titles code:

prices = container.find_elements(By.CLASS_NAME, 'price_color')
for price in prices:
  print(price.text)

Next, we want to access more data by clicking the next button and collecting the books from the other pages.

Change the file to resemble the one below:

from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)
def get_books_info():
  container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div[2]/ol')
  titles = container.find_elements(By.TAG_NAME, 'a')
  for title in titles:
    print(title.text)
  prices = container.find_elements(By.CLASS_NAME, 'price_color')
  for price in prices:
    print(price.text)
  # follow the link to the next page of results
  next_page = driver.find_element_by_link_text('next')
  next_page.click()
for x in range(5):
  get_books_info()
driver.quit()

We have created the get_books_info() function. It will run several times to scrape data from successive pages, in this case, 5 times.

We then use the find_element_by_link_text() method to get the element containing the link to the next page.

Next, we call click() to take us to the next page. We scrape data and print it out on the console; we repeat this 5 times because of the range function. After 5 successful scrapes, the driver.quit() method closes the browser.

You can choose how to store the data, either as a JSON file or in a CSV file.
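As a minimal sketch of the CSV option (the books list below is a hypothetical sample of the rows you might collect while scraping):

import csv

books = [('A Light in the Attic', '£51.77')]  # hypothetical scraped rows
with open('books.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(['title', 'price'])
  writer.writerows(books)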
