Web scraping with Python

In this post, we are going to learn web scraping with Python. We'll use Python to scrape Yahoo Finance, which is a great source of stock-market data, and code a scraper for it. With that scraper you'll be able to pull stock data for any company from Yahoo Finance.

We'll also use a web scraping tool, the Scrapingdog API, which helps us scrape dynamic websites through a pool of rotating proxies so that we don't get blocked. It also provides a CAPTCHA-clearing facility and uses headless Chrome to render dynamic pages.

Requirements

Generally, web scraping is divided into two parts, shown together in the short sketch after this list:

  1. Fetching data by making an HTTP request.
  2. Extracting important data by parsing the HTML DOM.
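
Put together, the two parts look like this (a minimal sketch; the URL here is just a placeholder for illustration):

import requests
from bs4 import BeautifulSoup

# Part 1: fetch the raw HTML by making an HTTP GET request
html = requests.get("https://example.com").text

# Part 2: parse the HTML DOM and extract what we need
soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)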

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Requests lets you send HTTP requests very easily.
  3. A web scraping tool (Scrapingdog) to extract the HTML code of the target URL.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup & Requests. To create the folder and install the libraries, type the commands given below.

mkdir scraper
pip install beautifulsoup4
pip install requests
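
If you prefer to keep these dependencies isolated from the rest of your system, a standard virtual environment works too (optional; this is plain Python tooling, not specific to this tutorial):

cd scraper
python -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate
pip install beautifulsoup4 requests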

Now, create a file inside that folder with any name you like. I'm using scraping.py.

First, you have to sign up for the Scrapingdog API. It will provide you with 1000 free credits. Then just import Beautiful Soup & Requests in your file, like this:

from bs4 import BeautifulSoup
import requests

Here is the list of fields we'll be extracting:

  • Previous Close
  • Open
  • Bid
  • Ask
  • Day’s Range
  • 52 Week Range
  • Volume
  • Avg. Volume
  • Market Cap
  • Beta
  • PE Ratio
  • EPS
  • Earnings Date
  • Forward Dividend & Yield
  • Ex-Dividend Date
  • 1y Target Est
  • Yahoo Finance

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL to get the raw HTML data. We'll scrape Yahoo Finance using the Requests library, as shown below.

r = requests.get("https://api.scrapingdog.com/scrape?api_key=<your-api-key>&url=https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch").text

This will provide you with the HTML code of that target URL.
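
If you want to fail fast when the request is blocked or the URL is wrong, you can check the status code before parsing (a small sketch; requests.get returns a Response object whose .text attribute is the body we used above):

response = requests.get("https://api.scrapingdog.com/scrape?api_key=<your-api-key>&url=https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch")
if response.status_code != 200:
    raise RuntimeError("Request failed with status " + str(response.status_code))
r = response.text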

Now, you have to use BeautifulSoup to parse the HTML.

soup = BeautifulSoup(r, 'html.parser')

Now, on the whole page, we have four “tbody” tags. We are interested in the first two because we currently don't need the data available inside the third & fourth “tbody” tags.


First, we'll find all those “tbody” tags using the variable “soup”.

alldata = soup.find_all("tbody")
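
You can quickly confirm how many “tbody” tags the page has (a quick sanity check; a count of 4 is what the quote page showed at the time of writing):

print(len(alldata))   # expected: 4 on the Yahoo Finance quote page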


As you'll notice, each of the first two “tbody” tags has 8 “tr” tags, and every “tr” tag has two “td” tags.

try:
    table1 = alldata[0].find_all("tr")
except IndexError:
    table1 = None
try:
    table2 = alldata[1].find_all("tr")
except IndexError:
    table2 = None

Now, each “tr” tag has two “td” tags: the first td tag holds the name of the property, and the other one holds the value of that property. It's something like a key-value pair.

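For example, a single row can be read like this (a small illustration, assuming “table1” was found above):

row = table1[0]                    # the first "tr" of the first table
cells = row.find_all("td")         # its two "td" tags
print(cells[0].text, "->", cells[1].text)   # e.g. Previous Close -> 2,317.80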

Now, we are going to declare a list and a dictionary before starting the for loop.

l = {}
u = list()

To keep the code simple, I will run two separate “for” loops, one for each table. First, for “table1”:

for i in range(0, len(table1)):
    try:
        table1_td = table1[i].find_all("td")
    except:
        table1_td = None
    l[table1_td[0].text] = table1_td[1].text
    u.append(l)
    l = {}

What we've done here is store all the td tags in a variable “table1_td”, then store the text of the first & second td tags in a “dictionary” as a key-value pair, and then push the dictionary into the list. Since we don't want to store duplicate data, we empty the dictionary at the end of each iteration. Similar steps will be followed for “table2”.

for i in range(0, len(table2)):
    try:
        table2_td = table2[i].find_all("td")
    except:
        table2_td = None
    l[table2_td[0].text] = table2_td[1].text
    u.append(l)
    l = {}
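
If you prefer, both loops can be collapsed into one (an equivalent, slightly more defensive variant; it skips any row that doesn't have exactly two “td” tags):

u = []
for row in (table1 or []) + (table2 or []):
    cells = row.find_all("td")
    if len(cells) == 2:
        u.append({cells[0].text: cells[1].text})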

Then, at the end, when you print the list “u” (wrapped under a “Yahoo finance” key), you get a JSON response like this:
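
A sketch of that final print, assuming you wrap the list under a “Yahoo finance” key (the json module is used here only for pretty-printing):

import json

print(json.dumps({"Yahoo finance": u}, indent=1))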

{
 "Yahoo finance": [
  {
   "Previous Close": "2,317.80"
  },
  {
   "Open": "2,340.00"
  },
  {
   "Bid": "0.00 x 1800"
  },
  {
   "Ask": "2,369.96 x 1100"
  },
  {
   "Day's Range": "2,320.00 - 2,357.38"
  },
  {
   "52 Week Range": "1,626.03 - 2,475.00"
  },
  {
   "Volume": "3,018,351"
  },
  {
   "Avg. Volume": "6,180,864"
  },
  {
   "Market Cap": "1.173T"
  },
  {
   "Beta (5Y Monthly)": "1.35"
  },
  {
   "PE Ratio (TTM)": "112.31"
  },
  {
   "EPS (TTM)": "20.94"
  },
  {
   "Earnings Date": "Jul 23, 2020 - Jul 27, 2020"
  },
  {
   "Forward Dividend & Yield": "N/A (N/A)"
  },
  {
   "Ex-Dividend Date": "N/A"
  },
  {
   "1y Target Est": "2,645.67"
  }
 ]
}

We now have an array of Python objects containing the financial data of the company Amazon. In this way, we can scrape the data from any website.
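
For reference, here is the whole flow stitched together (the same logic as above; <your-api-key> is still a placeholder you would replace with your own key):

import json
import requests
from bs4 import BeautifulSoup

url = ("https://api.scrapingdog.com/scrape?api_key=<your-api-key>"
       "&url=https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch")
r = requests.get(url).text
soup = BeautifulSoup(r, "html.parser")

# Only the first two "tbody" tags hold the summary data we want
u = []
for table in soup.find_all("tbody")[:2]:
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:          # name/value pairs only
            u.append({cells[0].text: cells[1].text})

print(json.dumps({"Yahoo finance": u}, indent=1))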

Conclusion

In this article, we saw how we can scrape data using a data scraping tool & BeautifulSoup, regardless of the type of website.
