In this post, we are going to scrape data from LinkedIn using Python and a web scraping tool. We are going to extract the Name, Website, Industry, Company Size, Number of Employees, Headquarters Address, and Specialties.
Procedure
Generally, web scraping is split into two parts (sketched below):
- Fetching data by making an HTTP request
- Extracting important data by parsing the HTML DOM
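For instance, a minimal sketch of these two steps against a placeholder page (example.com stands in for any target here and is not part of this tutorial) looks like this:

import requests
from bs4 import BeautifulSoup

# Step 1: fetch the raw HTML with an HTTP GET request
html = requests.get('https://example.com').text

# Step 2: parse the HTML DOM and extract the data you need
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)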
Libraries & Tools
- Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- Requests allows you to send HTTP requests very easily.
- Pandas provides fast, flexible, and expressive data structures.
- A web scraping tool to extract the HTML code of the target URL.
Setup
Our setup is pretty simple. Just create a folder and install Beautiful Soup, requests, and pandas. To create the folder and install the libraries, type the commands given below. I'm assuming that you have already installed Python 3.x.
mkdir scraper
pip install beautifulsoup4
pip install requests
pip install pandas
Now, create a file inside that folder with any name you wish. I'm using xyz.py.
from bs4 import BeautifulSoup
import requests
import pandas as pd
What we are going to scrape
We are scraping the "about" page of Google from LinkedIn.
Preparing the Food
Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL to get the raw HTML data. If you're not familiar with the scraping tool, I would urge you to go through its documentation. We'll use requests to make the HTTP GET request. Since we are scraping a company page, I have set "type" to company and "linkId" to google/about/, which can be found in the company's LinkedIn URL.
r = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=google/about/').text
This will provide you with the HTML code of the target URL.
Please use your Scrapingpass API key while making the above requests.
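If you prefer, the same request can be written with requests' params argument instead of a hand-built query string; this is only an optional restyling of the call above, using the same three parameters (requests will URL-encode the values for you):

params = {
    'api_key': 'YOUR-API-KEY',
    'type': 'company',
    'linkId': 'google/about/',
}
r = requests.get('https://api.scrapingpass.com/linkedin/', params=params).text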
Now, you have to use BeautifulSoup to parse the HTML.
soup = BeautifulSoup(r, 'html.parser')
l = {}
u = list()
So, we'll use the soup object to extract the company name first.
try:
    l["Company"] = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}).text.replace("\n", "")
except:
    l["Company"] = None
I have replaced \n with an empty string.
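Note that replace("\n", "") only removes newlines. If you also want to trim surrounding spaces, a slightly more defensive variant of the same block (a sketch, not the original code) could use BeautifulSoup's get_text(strip=True) with an explicit None check:

heading = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"})
# get_text(strip=True) drops newlines and leading/trailing whitespace in one go
l["Company"] = heading.get_text(strip=True) if heading is not None else None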
Now, we'll focus on extracting the Website, Industry, Company Size, Headquarters (Address), Type, and Specialties.
All of the above properties (except Company Size) are stored in dd tags with the class shown below.
allProp = soup.find_all("dd", {"class": "org-page-details__definition-text t-14 t-black--light t-normal"})
Now, we'll extract the properties from the allProp list one by one.
try:
    l["website"] = allProp[0].text.replace("\n", "")
except:
    l["website"] = None

try:
    l["Industry"] = allProp[1].text.replace("\n", "")
except:
    l["Industry"] = None

try:
    l["Address"] = allProp[2].text.replace("\n", "")
except:
    l["Address"] = None

try:
    l["Type"] = allProp[3].text.replace("\n", "")
except:
    l["Type"] = None

try:
    l["Specialties"] = allProp[4].text.replace("\n", "")
except:
    l["Specialties"] = None
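Since those five blocks differ only in the list index and the dictionary key, the same extraction can be sketched as a loop; this is an optional refactor, not part of the original tutorial:

# Keys in the same order as their dd tags appear in allProp
fields = ["website", "Industry", "Address", "Type", "Specialties"]
for i, field in enumerate(fields):
    try:
        l[field] = allProp[i].text.replace("\n", "")
    except (IndexError, AttributeError):  # fewer dd tags than expected, or tag missing
        l[field] = None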
Now, we’ll scrape Company Size.
As you can see, Company Size is stored in the class "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl" with tag dd.
try:
    l["Company Size"] = soup.find("dd", {"class": "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl"}).text.replace("\n", "")
except:
    l["Company Size"] = None
Now, we'll push dictionary l to list u. Then we'll create a data frame from list u using pandas.
u.append(l)
df = pd.json_normalize(u)
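Note that every record in u is a flat dictionary, so plain pd.DataFrame(u) would build the same frame; json_normalize mainly matters when records contain nested objects:

df = pd.DataFrame(u)  # equivalent here because the records have no nesting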
Now, finally saving our data to a CSV file.
df.to_csv('linkedin.csv', index=False, encoding='utf-8')
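As an optional sanity check, you can read the file back with pandas to confirm the row was written:

print(pd.read_csv('linkedin.csv'))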
We have successfully scraped a LinkedIn company page. Similarly, you can also scrape a profile.
Complete Code
from bs4 import BeautifulSoup
import requests
import pandas as pd

r = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=google/about/').text
soup = BeautifulSoup(r, 'html.parser')

u = list()
l = {}

try:
    l["Company"] = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}).text.replace("\n", "")
except:
    l["Company"] = None

allProp = soup.find_all("dd", {"class": "org-page-details__definition-text t-14 t-black--light t-normal"})

try:
    l["website"] = allProp[0].text.replace("\n", "")
except:
    l["website"] = None

try:
    l["Industry"] = allProp[1].text.replace("\n", "")
except:
    l["Industry"] = None

try:
    l["Company Size"] = soup.find("dd", {"class": "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl"}).text.replace("\n", "")
except:
    l["Company Size"] = None

try:
    l["Address"] = allProp[2].text.replace("\n", "")
except:
    l["Address"] = None

try:
    l["Type"] = allProp[3].text.replace("\n", "")
except:
    l["Type"] = None

try:
    l["Specialties"] = allProp[4].text.replace("\n", "")
except:
    l["Specialties"] = None

u.append(l)
df = pd.json_normalize(u)
df.to_csv('linkedin.csv', index=False, encoding='utf-8')
print(df)
Conclusion
In this article, we learned how to scrape data from LinkedIn using a proxy scraper and Python. You can scrape a profile too, but read the docs before trying it.
Abhishek Kumar