We are going to scrape data behind authentication with Python. Using a requests session, we will scrape LinkedIn, which is an excellent source of public data for lead generation, sentiment analysis, jobs, and more. We'll code a scraper that can fetch person profiles, jobs, company profiles, etc.
Requirements
Generally, web scraping is split into two parts:
- Fetching data by making an HTTP request
- Extracting important data by parsing the HTML DOM
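As a minimal illustration of the two steps, here is a sketch that parses a hardcoded page instead of a live fetch, so it runs offline (the HTML snippet is made up for the example):

```python
import requests
from bs4 import BeautifulSoup

# Step 1: fetch - an HTTP GET (commented out to keep this example offline):
# html = requests.get("https://example.com").text
html = "<html><body><h1>Example Domain</h1></body></html>"

# Step 2: parse - extract data from the HTML DOM.
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").text)  # → Example Domain
```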
Libraries and tools
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Requests lets you send HTTP requests very easily.
Setup
The setup is pretty simple. Just create a folder and install Beautiful Soup & requests.
mkdir scraper
pip install beautifulsoup4
pip install requests
Now, create a file inside that folder with any name you like. I'm using xyz.py.
First, you need to sign up for a LinkedIn account. Then just import Beautiful Soup and requests in your file, like this:
from bs4 import BeautifulSoup
import requests
We just want to get the HTML of a profile.
Session
We will use a Session object from requests to persist the user session. The session is later used to make the requests.
All cookies will then persist in the session for every subsequent request. Meaning, if we sign in, the session will remember us and use those cookies for all future requests we make.
client = requests.Session()
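To see the cookie persistence without hitting the network, we can set a cookie on the session by hand and inspect what the session would attach to a request. The cookie name and value here are dummies, and prepare_request only builds the request without sending it:

```python
import requests

client = requests.Session()

# Simulate a cookie that a login response would normally set on the session.
client.cookies.set("li_at", "dummy-session-token", domain="www.linkedin.com")

# Prepare (but don't send) a request to see what the session would attach.
req = requests.Request("GET", "https://www.linkedin.com/in/someone")
prepared = client.prepare_request(req)
print(prepared.headers.get("Cookie"))  # → li_at=dummy-session-token
```

Every further request prepared through the same session carries that cookie automatically, which is exactly what keeps us signed in after the login POST.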
Preparation
Now, we have all the ingredients in place to build a scraper. Let's open the developer tools, go to the Network tab, and log in so we can catch the URL.
email = "******@*****"
password = "******"
HOMEPAGE_URL = 'https://www.linkedin.com'
LOGIN_URL = 'https://www.linkedin.com/checkpoint/lg/login-submit'
Paste your own email and password.
The developer tools will show the login form posting to https://www.linkedin.com/checkpoint/lg/login-submit, so let's save that as LOGIN_URL. This is where our first request will go.
You will notice from the developer tools that logging in also requires a CSRF token. It takes other fields too, but for this tutorial we'll consider the CSRF token only.
CSRF Token
We will make an HTTP request to HOMEPAGE_URL and then use BeautifulSoup to extract the CSRF token from the response.
html = client.get(HOMEPAGE_URL).content
soup = BeautifulSoup(html, "html.parser")
csrf = soup.find('input', {'name': 'loginCsrfParam'}).get('value')
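To see the extraction step in isolation, here is the same soup.find pattern run against a small hardcoded snippet. The markup below is a simplified stand-in for LinkedIn's real login form, and the token value is made up:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the hidden CSRF input on the LinkedIn homepage.
html = """
<form action="/checkpoint/lg/login-submit" method="post">
  <input type="hidden" name="loginCsrfParam" value="abc-123-token">
  <input type="text" name="session_key">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the hidden input by its name attribute and read its value.
csrf = soup.find('input', {'name': 'loginCsrfParam'}).get('value')
print(csrf)  # → abc-123-token
```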
Now we've received the CSRF token. The only job left is to log in and scrape the profile.
Login
We will log in by making a POST request to LOGIN_URL:
login_information = {
    'session_key': email,
    'session_password': password,
    'loginCsrfParam': csrf,
}
client.post(LOGIN_URL, data=login_information)
Now you're basically done with the login part. You have made the request to sign in, and all other requests you make in the same script will be treated as signed in.
Scrape profile
s = client.get('https://www.linkedin.com/in/rbranson').text
print(s)
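Rather than printing the raw HTML, you would normally feed it back into BeautifulSoup and pull out the fields you need. A sketch against a hardcoded snippet (the markup is illustrative, not LinkedIn's actual profile structure):

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the profile HTML returned by client.get(...).text
s = """
<html><head><title>Richard Branson | LinkedIn</title></head>
<body><h1>Richard Branson</h1></body></html>
"""

soup = BeautifulSoup(s, "html.parser")
print(soup.title.text)           # → Richard Branson | LinkedIn
print(soup.find("h1").text)      # → Richard Branson
```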
Conclusion
We have now seen how to scrape data behind a login using a session and BeautifulSoup, an approach that works regardless of the type of website.
Abhishek Kumar