In this post, we are going to scrape data from LinkedIn using Python and a web scraping tool. We are going to extract the Name, Website, Industry, Company Size, Number of Employees, Headquarters Address, and Specialties.

Procedure

Generally, web scraping is split into two parts (a minimal sketch of both follows the list):

  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM
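
Here is what those two parts look like together, using example.com as a stand-in target:

import requests
from bs4 import BeautifulSoup

# Part 1: fetch the raw HTML with an HTTP GET request
html = requests.get('https://example.com').text

# Part 2: parse the HTML DOM and extract the data we need
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)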

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Requests allows you to send HTTP requests very easily.
  3. Pandas provides fast, flexible, and expressive data structures.
  4. A web scraping tool to extract the HTML code of the target URL.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup, Requests, and Pandas. To create the folder and install the libraries, type the commands given below. I'm assuming that you already have Python 3.x installed.

mkdir scraper
pip install beautifulsoup4
pip install requests
pip install pandas
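
Optionally, if you want to keep these libraries isolated from the rest of your system, you can create a virtual environment inside the folder before running pip (a standard Python step, not specific to this tutorial):

python -m venv env
source env/bin/activate   # on Windows: env\Scripts\activate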

Now, create a file inside that folder with any name you like. I'm using xyz.py.

from bs4 import BeautifulSoup
import requests
import pandas as pd

What we are going to scrape

We are scraping the "about" page of Google from LinkedIn.

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we should make a GET request to the target URL to get the raw HTML data. If you're not familiar with the scraping tool, I would urge you to go through its documentation. We'll use Requests to make the HTTP GET request. Since we are scraping a company page, I have set "type" to company and "linkId" to google/about/, which is the part you can find in the company's LinkedIn URL.

r = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=google/about/').text

This will provide you with the HTML code of the target URL.

Please use your own Scrapingpass API key when making the above request.
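
Since the API call can fail (a bad key, an exhausted quota, a network error), it's worth checking the response status before parsing. Here is a defensive variant of the same request, using standard Requests error handling:

response = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=google/about/')
response.raise_for_status()   # raises an exception on 4xx/5xx instead of silently parsing an error page
r = response.text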

Now, you have to use BeautifulSoup to parse the HTML. We'll also set up a dictionary l to hold the scraped properties and a list u to collect them later.

soup = BeautifulSoup(r, 'html.parser')
l = {}
u = list()

So, we'll use the soup variable to extract the company name.

try:
    l["Company"] = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}).text.replace("\n", "")
except:
    l["Company"] = None

Notice that I have replaced \n (the newline character) with an empty string.
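
Since this find-and-clean pattern repeats for every property, you could wrap it in a small helper. This is just a sketch; extract_text is a hypothetical name, not part of Beautiful Soup:

def extract_text(tag):
    # Hypothetical helper: return the tag's text without newlines, or None if the tag wasn't found
    try:
        return tag.text.replace("\n", "")
    except AttributeError:
        return None

l["Company"] = extract_text(soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}))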

Now, we'll focus on extracting the Website, Industry, Company Size, Headquarters (Address), Type, and Specialties.

All of the above properties (except Company Size) are stored in dd tags with the class "org-page-details__definition-text t-14 t-black--light t-normal":

allProp = soup.find_all("dd", {"class": "org-page-details__definition-text t-14 t-black--light t-normal"})

Now, we'll extract the properties one by one from the allProp list.

try:
    l["website"] = allProp[0].text.replace("\n", "")
except:
    l["website"] = None
try:
    l["Industry"] = allProp[1].text.replace("\n", "")
except:
    l["Industry"] = None
try:
    l["Address"] = allProp[2].text.replace("\n", "")
except:
    l["Address"] = None
try:
    l["Type"] = allProp[3].text.replace("\n", "")
except:
    l["Type"] = None
try:
    l["Specialties"] = allProp[4].text.replace("\n", "")
except:
    l["Specialties"] = None
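
Indexing allProp by position is fragile: if LinkedIn drops or reorders a field for a given company, every later value shifts into the wrong key. A more defensive sketch, assuming the details section lays each label out as a dt followed by its dd value, matches labels to values instead:

props = {}
for dt in soup.find_all("dt"):
    dd = dt.find_next_sibling("dd")   # the value that follows this label, if any
    if dd is not None:
        props[dt.text.strip()] = dd.text.replace("\n", "").strip()
# e.g. props.get("Website"), props.get("Industry"), props.get("Headquarters")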

Now, we’ll scrape Company Size.

As you can see, Company Size is stored in the class "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl" with the tag dd.

try:
    l["Company Size"] = soup.find("dd", {"class": "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl"}).text.replace("\n", "")
except:
    l["Company Size"] = None

Now, we'll push the dictionary l into the list u and then create a DataFrame from u using Pandas.

u.append(l)
df = pd.json_normalize(u)
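
As an aside, the same flow scales to several companies: parse each page into its own dictionary, collect the dictionaries in a list, and normalize the list once. A sketch, where scrape_company and company_ids are hypothetical names (only the Company field is shown; the other properties follow the same pattern):

def scrape_company(link_id):
    # Fetch and parse one company page, returning its dictionary of properties
    html = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=' + link_id).text
    soup = BeautifulSoup(html, 'html.parser')
    record = {}
    try:
        record["Company"] = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}).text.replace("\n", "")
    except:
        record["Company"] = None
    return record

company_ids = ['google/about/', 'microsoft/about/']   # hypothetical linkId values
df = pd.json_normalize([scrape_company(cid) for cid in company_ids])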

Now, finally saving our data to a CSV file.

df.to_csv('linkedin.csv', index=False, encoding='utf-8')

We have successfully scraped a LinkedIn company page. Similarly, you can also scrape a Profile.

Complete Code

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Fetch the raw HTML of the company page through the Scrapingpass API
r = requests.get('https://api.scrapingpass.com/linkedin/?api_key=YOUR-API-KEY&type=company&linkId=google/about/').text
soup = BeautifulSoup(r, 'html.parser')

u = list()
l = {}

# Company name
try:
    l["Company"] = soup.find("h1", {"class": "org-top-card-summary__title t-24 t-black truncate"}).text.replace("\n", "")
except:
    l["Company"] = None

# All remaining properties except Company Size, in page order
allProp = soup.find_all("dd", {"class": "org-page-details__definition-text t-14 t-black--light t-normal"})
try:
    l["website"] = allProp[0].text.replace("\n", "")
except:
    l["website"] = None
try:
    l["Industry"] = allProp[1].text.replace("\n", "")
except:
    l["Industry"] = None

# Company Size lives in its own class
try:
    l["Company Size"] = soup.find("dd", {"class": "org-about-company-module__company-size-definition-text t-14 t-black--light mb1 fl"}).text.replace("\n", "")
except:
    l["Company Size"] = None
try:
    l["Address"] = allProp[2].text.replace("\n", "")
except:
    l["Address"] = None
try:
    l["Type"] = allProp[3].text.replace("\n", "")
except:
    l["Type"] = None
try:
    l["Specialties"] = allProp[4].text.replace("\n", "")
except:
    l["Specialties"] = None

# Collect the record and save it as CSV
u.append(l)
df = pd.json_normalize(u)
df.to_csv('linkedin.csv', index=False, encoding='utf-8')
print(df)
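
To run the scraper, execute the file from the folder you created:

python xyz.py

This prints the DataFrame and writes linkedin.csv next to the script.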

Conclusion

In this article, we saw how to scrape data from LinkedIn using a proxy scraper and Python. You can scrape a Profile too, but read the docs before trying it.
