Python Libraries for Web Scraping
Web scraping is the process of extracting structured and unstructured data from the internet with the assistance of programs and exporting it into a useful format.
Requests (HTTP for Humans) Library for Web Scraping
Let’s start with the most basic Python library for web scraping. Requests lets us make HTTP requests to a website’s server to retrieve the data on its pages.
Requests is a Python library used for making various kinds of HTTP requests, such as GET and POST. Because of its simplicity and ease of use, it comes with the motto “HTTP for Humans.”
However, the Requests library doesn’t parse the HTML data it retrieves. If we want to do that, we need libraries like lxml and Beautiful Soup.
Let’s look at the advantages and drawbacks of the Requests Python library.
Advantages:
- Basic/Digest Authentication
- International Domains and URLs
- Chunked Requests
- HTTP(S) Proxy Support
Drawbacks:
- Retrieves only the static content of a page
- Can’t be used for parsing HTML
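As a quick illustration of the features above, here is a minimal sketch of a GET request with Requests. The URL is the BooksToScrape demo site used later in this article, and the query parameter is a hypothetical example added for illustration:

```python
import requests

# build a GET request with query parameters; Requests encodes the
# query string for us instead of us appending it to the URL by hand
params = {"page": 1}
response = requests.get("https://books.toscrape.com/", params=params)

print(response.status_code)                  # HTTP status, e.g. 200
print(response.headers.get("Content-Type"))  # e.g. text/html
html = response.text                         # the raw, unparsed HTML string
```

Note that `html` is just a string here; Requests stops at retrieval, which is exactly why the parsing libraries below exist.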
lxml Library for Web Scraping
We know the Requests library cannot parse the HTML retrieved from a web page. Therefore, we need lxml, a high-performance, blazingly fast, production-quality HTML and XML parsing Python library.
It combines the speed and power of element trees with the simplicity of Python. It works well when we’re scraping large datasets. The combination of Requests and lxml is very common in web scraping. lxml also allows you to extract data from HTML using XPath and CSS selectors.
Let’s look at the advantages and drawbacks of the lxml Python library.
Advantages:
- Faster than most parsers out there
- Uses element trees
- Pythonic API
Drawbacks:
- Does not work well with poorly designed HTML
- The official documentation isn’t very beginner-friendly
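To make this concrete, here is a small sketch of lxml parsing an HTML fragment with XPath. The fragment is invented to mimic the book-list markup used later in this article:

```python
from lxml import html

# parse a small HTML fragment into an element tree
snippet = """
<ol class="row">
  <li><h3><a title="Book One">Book One</a></h3></li>
  <li><h3><a title="Book Two">Book Two</a></h3></li>
</ol>
"""
tree = html.fromstring(snippet)

# XPath: collect the title attribute of every <a> inside the list
titles = tree.xpath("//ol[@class='row']//a/@title")
print(titles)  # ['Book One', 'Book Two']
```

The same `xpath()` call works on a full page fetched with Requests, which is why the two libraries are so often paired.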
Beautiful Soup Library for Web Scraping
BeautifulSoup is probably the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents.
One of the primary reasons the Beautiful Soup library is so popular is that it’s easy to work with and well suited to beginners. We can also combine Beautiful Soup with other parsers like lxml.
Advantages:
- Requires only a few lines of code
- Great documentation
- Easy for beginners to learn
- Automatic encoding detection
Drawbacks:
- Slower than lxml
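Combining Beautiful Soup with the lxml parser, for instance, is a one-argument change. A minimal sketch, falling back to the built-in parser when lxml is not installed:

```python
from bs4 import BeautifulSoup

snippet = "<ol><li><a>Book One</a></li><li><a>Book Two</a></li></ol>"

# use the faster lxml parser backend when available,
# otherwise fall back to Python's built-in html.parser
try:
    soup = BeautifulSoup(snippet, "lxml")
except Exception:
    soup = BeautifulSoup(snippet, "html.parser")

titles = [a.get_text() for a in soup.find_all("a")]
print(titles)  # ['Book One', 'Book Two']
```

The navigation API (`find_all`, `select`, `get_text`) stays the same whichever parser backend is chosen.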
Selenium Library for Web Scraping
Selenium is a Python library originally made for automated testing of web applications. Although it wasn’t made for web scraping originally, the data science community turned that around pretty quickly.
If time and speed are not a priority for you, then you can definitely use Selenium.
Advantages:
- Automated web scraping
- Can scrape dynamically populated sites
- Automates web browsers
- Can do almost anything on a web page that a person can
Drawbacks:
- Very slow
- Difficult to set up
- High CPU and memory usage
- Not ideal for large projects
Scrapy Library for Web Scraping
Scrapy isn’t just a library; it’s an entire web scraping framework created by the co-founders of Scrapinghub, Pablo Hoffman and Shane Evans. It’s a full-fledged web scraping solution that does all the heavy lifting for you.
Scrapy provides spider bots that can crawl multiple websites and extract data. With Scrapy, you can create your own spider bots, host them on Scrapy Hub, or expose them as an API. It lets you build fully functional spiders in a matter of minutes. You can also create pipelines with Scrapy.
The best thing about Scrapy is that it’s asynchronous: it can make multiple HTTP requests simultaneously.
Advantages:
- Excellent documentation
- Various plugins
- Create custom pipelines and middlewares
- Low CPU and memory usage
- Well-designed architecture
- A plethora of available online resources
Drawbacks:
- Steep learning curve
- Overkill for simple jobs
- Not beginner-friendly
Modern Python Web scraping using multiple libraries
Web scraping is the act of extracting data from websites across the web. Other names for web scraping are web crawling or web extraction. It is a straightforward process, with a website URL as the initial target.
Python is a general-purpose language. It has many uses, ranging from web development to AI and machine learning, and much more. You can perform web scraping with Python by taking advantage of some libraries and tools available on the web.
We will go through some popular tools and services we can use with Python to scrape a web page. The tools we’ll discuss include:
Beautiful Soup, Requests, Selenium, Scrapy.
Create a project directory and navigate into the directory. Open your terminal and run the commands below.
mkdir python_scraper
cd python_scraper
Python Web Scraping using Beautiful Soup
Beautiful Soup is a library that pulls data out of HTML and XML files. It works on top of a parser, providing elegant ways of navigating, searching, and modifying the parse tree.
Open your terminal and run the command below:
pip install beautifulsoup4
With Beautiful Soup installed, we can start writing the scraper.
We are scraping the BooksToScrape website (https://books.toscrape.com/) for demonstration purposes.
We want to extract the title of every book and display it on the terminal. The first step in scraping a website is understanding its HTML layout.
What we want is the book title, which sits inside an <a> tag, nested within the <ol> element that lists the books.
To scrape and get the book titles, let’s create a new Python file and call it beautiful_soup.py.
When done, add the following code to the beautiful_soup.py file:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = "https://books.toscrape.com/"
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')

# get book title
for data in html_soup.select('ol'):
    for title in data.find_all('a'):
        print(title.get_text())
In the above code snippet, we open the web page with the help of the urlopen() method. The read() method reads the whole page and assigns its contents to the page_html variable. We then parse the page using html.parser, which represents the HTML code in a nested fashion.
Next, we use the select() method provided by the BS4 library to get the <ol> element. We loop through the HTML elements inside the <ol> element to get the <a> tags, which contain the book names. Finally, on each iteration we print out the text inside the <a> tags with the help of the get_text() method.
You can execute the file from the terminal by running the command below.
python beautiful_soup.py
Now let’s get the prices of the books too.
The price of each book is inside a <p> tag, itself inside a <div>. As you can see, there is more than one <p> tag and more than one <div> on the page. To get the right element with the book price, we’ll use CSS class selectors; luckily, each class is unique to its tag.
Below is the code snippet to get the price of every book; add it at the bottom of the file:
# get book prices
for price in html_soup.find_all("p", class_="price_color"):
    print(price.get_text())
Your completed code should look like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url_to_scrape = "https://books.toscrape.com/"
request_page = urlopen(url_to_scrape)
page_html = request_page.read()
request_page.close()
html_soup = BeautifulSoup(page_html, 'html.parser')

# get book title
for data in html_soup.select('ol'):
    for a in data.find_all('a'):
        print(a.get_text())

# get book prices
for price in html_soup.find_all("p", class_="price_color"):
    print(price.get_text())
Python Web Scraping with Requests
Requests is an elegant HTTP library. It allows you to send HTTP requests without the need to manually append query strings to your URLs.
To use it we first need to install it. Note that here we install the requests-html package, which builds on Requests and adds HTML parsing. Open your terminal and run the command below:
pip3 install requests_html
Once it has installed, create a new Python file for the code.
Now add the code below inside the created file:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://books.toscrape.com/')
get_books = r.html.find('.row')[2]

# get book title
for title in get_books.find('h3'):
    print(title.text)

# get book prices
for price in get_books.find('.price_color'):
    print(price.text)
In this code snippet, the first line imports HTMLSession from the requests_html library, and we then instantiate it. We use the session to perform a GET request to the BooksToScrape URL.
After performing the GET request, we get the Unicode representation of the HTML content of the BooksToScrape website. From the HTML content, we get the element with the class row, located at index 2, which contains the list of books, and assign it to the get_books variable.
To get the price of each book, we only change which element the find() method should look for within the HTML content. Luckily, the price is inside a <p> tag with a unique class, price_color, that is not used anywhere else. We loop through the matching elements and print out the text content of each.
Execute the file with Python from your terminal, and the output will be printed.
Python Web Scraping with Selenium
Selenium is a web-based automation tool. Its primary purpose is testing web applications, but it can still do well at web scraping.
We are going to import various tools to help us with scraping.
We will be using the Chrome browser, and for this we need the Chrome web driver (ChromeDriver) to work with Selenium.
Download the Chrome web driver using either of the following methods:
- You can download it directly from the link below
- Or by using a Linux machine
When you’re done, create a new Python file; let’s call it selenium_scrape.py.
Add the following code to the file:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div/ol')

# get book titles
titles = container.find_elements(By.TAG_NAME, 'a')
for title in titles:
    print(title.text)
In the above code, we first import the web driver from selenium, which will control Chrome. Selenium requires a driver to interface with a particular browser.
We then specify the driver we want to use, which is Chrome. It takes the path to the ChromeDriver binary and navigates to the site URL. Because we have not launched the browser in headless mode, the browser window appears and we can see what it is doing.
The container variable holds the element located by the XPath of the <ol> tag that wraps the book titles. Selenium provides methods for locating elements by XPath, tag name, class name, and more; you can read more in the Selenium documentation on locating elements.
To get the XPath of the tag, inspect the element: find the tag that has the book title and right-click on it. In the dropdown menu that appears, select Copy, then select Copy XPath.
From the container variable, we can then find the titles by tag name and loop through them to print all the titles as text.
The output will be as shown below:
Now let’s change the file to get the book prices, by adding the following code after the get-book-titles code.
prices = container.find_elements(By.CLASS_NAME, 'price_color')
for price in prices:
    print(price.text)
Next, we want to access more data by clicking the next button and collecting the books from the other pages.
Change the file to resemble the one below:
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://books.toscrape.com/'
driver = webdriver.Chrome('/home/marvin/chromedriver')
driver.get(url)

def get_books_info():
    container = driver.find_element_by_xpath('//*[@id="default"]/div/div/div/div/section/div/ol')

    # get book titles
    titles = container.find_elements(By.TAG_NAME, 'a')
    for title in titles:
        print(title.text)

    # get book prices
    prices = container.find_elements(By.CLASS_NAME, 'price_color')
    for price in prices:
        print(price.text)

    # follow the link to the next page
    next_page = driver.find_element_by_link_text('next')
    next_page.click()

for x in range(5):
    get_books_info()

driver.quit()
We have created the get_books_info function. It will run several times to scrape data from several pages, in this case 5 times.
We then use the find_element_by_link_text() method to get the element containing the link to the next page.
Next, we call click() to take us to the next page. We scrape the data and print it out on the console, repeating this 5 times because of the range function. After 5 successful scrapes, the driver.quit() method closes the browser.
You can choose how to store the data, either in a JSON file or in a CSV file.
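As a sketch of that last step, the print calls in the examples above can be replaced by collecting the results into a list and writing it out with Python's standard library. The sample titles and prices below are placeholders standing in for scraped data:

```python
import csv
import json

# placeholder data; in practice, append one dict per book inside the
# scraping loops shown above instead of printing
books = [
    {"title": "A Light in the Attic", "price": "£51.77"},
    {"title": "Tipping the Velvet", "price": "£53.74"},
]

# store as CSV
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(books)

# store as JSON
with open("books.json", "w", encoding="utf-8") as f:
    json.dump(books, f, ensure_ascii=False, indent=2)
```

CSV suits flat tabular data like this; JSON is the better choice once items gain nested fields.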