Python, one of the most popular languages in the data world, has not been overlooked when it comes to web scraping. This walkthrough with the Scrapy Python library should get you well on your way to becoming a data master.
For this walkthrough, we’ll scrape data from Lonely Planet, a travel guide website, specifically their experiences section. We’ll extract this data and store it in various formats such as JSON, CSV, and XML. The data can then be analyzed and used to plan our next trip!
Set up
To get started, we’ll need to install the Scrapy library. Remember to isolate your Python dependencies by using virtual environments. Once you’ve set up a virtual environment and activated it, run:
pip install scrapy
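If you don’t have a virtual environment yet, a typical setup on Linux/macOS looks like this (on Windows, activate with venv\Scripts\activate instead):
python3 -m venv venv
source venv/bin/activate
pip install scrapy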
Initializing
With these two steps complete, we should be able to set up the web crawler.
Run the command scrapy startproject “projectName”. In this walkthrough, the project is called trips.
This creates a Scrapy project with the default project structure.
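The generated layout for a project called trips typically looks like this:
trips/
    scrapy.cfg
    trips/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py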
We’ll create a file in the spiders folder and name it “destinations.py”. This will contain most of the logic for our web scraper.
The source code in the destinations.py file will look like this:
from scrapy import Request, Spider

from ..items import TripsItem


class DestinationsCrawl(Spider):
    name = 'destinations'
    items = TripsItem()
    allowed_domains = ['lonelyplanet.com']
    url_link = 'https://www.lonelyplanet.com/europe/activities'
    start_urls = [url_link]

    # defaults keep the command-line arguments optional
    def __init__(self, name=None, continent=None, **kwargs):
        self.continent = continent
        super().__init__(name=name, **kwargs)

    def start_requests(self):
        if self.continent:  # taking input from command line parameters
            url = f'https://www.lonelyplanet.com/{self.continent}/activities'
            yield Request(url, self.parse)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

    def parse(self, response):
        experiences = response.css("article.rounded.shadow-md")
        items = TripsItem()
        for experience in experiences:
            items["name"] = experience.css(
                'h2.text-xl.leading-tight::text').extract()
            items["experience_type"] = experience.css(
                'span.mr-4::text').extract()
            items["price"] = experience.css("span.text-green::text").extract()
            items["duration"] = experience.css(
                "p.text-secondary.text-xs::text").extract()
            items["description"] = experience.css(
                "p.text-sm.leading-relaxed::text").extract()
            items["link"] = f'https://{self.allowed_domains[0]}{experience.css("a::attr(href)").extract()[0]}'
            yield items
The first few lines are the library imports and the item class we’ll need to create a functional web scraper.
from scrapy import Request, Spider
from ..items import TripsItem
Setting up a custom proxy
We’ll define a config file in the same directory as destinations.py. This will contain the essential credentials needed to access the rotating proxy service.
# don't keep this in version control, use a tool like python-decouple
# and store sensitive data in .env file
API_KEY = 'your_scraping_dog_api_key'
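As the comment suggests, one way to keep the key out of version control is to load it from a .env file with python-decouple. A minimal sketch, assuming the package is installed and a .env file contains a SCRAPING_API_KEY entry (the variable name is just an example):
# config.py
from decouple import config

# read the key from .env (or the environment) instead of hard-coding it
API_KEY = config('SCRAPING_API_KEY')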
This is the file that will host the Scrapingdog API key. We’ll need to set up a custom middleware in Scrapy to proxy our requests through the rotating proxy pool. From the project’s folder structure, we notice there’s a middlewares.py file. We’ll write our middleware there.
from w3lib.http import basic_auth_header

from .spiders.config import API_KEY


class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.scrapingpass.com:8081"
        request.headers['Proxy-Authorization'] = basic_auth_header(
            'scrapingpass', API_KEY)
Finally, we’ll register the middleware in our settings file.
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'trips.middlewares.CustomProxyMiddleware': 350,
    'trips.middlewares.TripsDownloaderMiddleware': 543,
}
With this configuration, all our scraping requests have access to the proxy pool.
Let’s take a closer look at destinations.py.
class DestinationsCrawl(Spider):
    name = 'destinations'
    items = TripsItem()
    allowed_domains = ['lonelyplanet.com']
    url_link = 'https://www.lonelyplanet.com/europe/activities'
    start_urls = [url_link]

    # defaults keep the command-line arguments optional
    def __init__(self, name=None, continent=None, **kwargs):
        self.continent = continent
        super().__init__(name=name, **kwargs)

    def start_requests(self):
        if self.continent:  # taking input from command line parameters
            url = f'https://www.lonelyplanet.com/{self.continent}/activities'
            yield Request(url, self.parse)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)
The DestinationsCrawl class inherits from Scrapy’s Spider class. This class will be the blueprint of our web scraper, and we’ll specify the crawler’s logic in it.
The name variable specifies the name of our web scraper; this name will be used later when we want to execute the scraper.
The url_link variable points to the default URL we want to scrape. The start_urls variable is a list of default URLs. This list is used by the default implementation of start_requests() to create the initial requests for our spider. We’ll override this method, however, to take in command-line arguments and make our web scraper a bit more dynamic.
Since we’re inheriting from the Spider class, we have access to the start_requests() method. This method returns an iterable of Requests from which the Spider will begin to crawl; subsequent requests are generated successively from these initial ones. In short, all requests start here in Scrapy. By passing the continent name on the command line, the spider’s initializer captures it, and we can then use this variable to build the target link, essentially creating a reusable web scraper.
Remember, all our requests are proxied, because the CustomProxyMiddleware is executed on every request.
def parse(self, response):
    experiences = response.css("article.rounded.shadow-md")
    items = TripsItem()
    for experience in experiences:
        items["name"] = experience.css(
            'h2.text-xl.leading-tight::text').extract()
        items["experience_type"] = experience.css(
            'span.mr-4::text').extract()
        items["price"] = experience.css("span.text-green::text").extract()
        items["duration"] = experience.css(
            "p.text-secondary.text-xs::text").extract()
        items["description"] = experience.css(
            "p.text-sm.leading-relaxed::text").extract()
        items["link"] = f'https://{self.allowed_domains[0]}{experience.css("a::attr(href)").extract()[0]}'
        yield items
From Scrapy’s documentation:
The parse method is responsible for processing the response and returning scraped data and/or more URLs to follow.
This means the parse method can manipulate the data received from the target website we want to scrape. By taking advantage of patterns in the web page’s underlying code, we can gather unstructured data, then process and store it in a structured format.
By identifying the patterns in the web page’s code, typically recurring HTML elements, we can automate data extraction. We’ll use a browser extension called SelectorGadget to quickly identify the HTML elements we need. Optionally, we can use the browser’s developer tools to inspect elements.
We’ll notice that the destinations are contained within article elements with the classes rounded and shadow-md. Scrapy has some pretty handy CSS selectors that ease the capturing of these targets. Hence, experiences = response.css("article.rounded.shadow-md") retrieves all the elements that meet these criteria.
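To sanity-check a selector before wiring it into the spider, Scrapy’s interactive shell comes in handy. For example (the class names reflect the page markup at the time of writing and may have changed since):
scrapy shell "https://www.lonelyplanet.com/europe/activities"
>>> experiences = response.css("article.rounded.shadow-md")
>>> len(experiences)                 # number of matching cards
>>> experiences[0].css("h2.text-xl.leading-tight::text").extract_first()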
We’ll then loop through all those elements, extracting additional attributes from their child elements: the name of the trip, its type, price, duration, description, and a link to its page on the Lonely Planet website.
Before proceeding, let’s address the “TripsItem()” class we imported at the start of the script.
import scrapy


class TripsItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    experience_type = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()
    duration = scrapy.Field()
    link = scrapy.Field()
After successfully crawling the web page, we need to store the data in a structured format. These Item objects are containers that collect the scraped data. We map the collected values to these fields, and from the field types in our items object, CSV, JSON, and XML files can be generated. For more information, please check the Scrapy documentation.
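As an alternative to passing -o on the command line (shown below), Scrapy 2.1 and later can also configure feed exports directly in settings.py. A minimal sketch:
# settings.py
FEEDS = {
    'destinations.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}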
Finally, let’s run our crawler. To extract the data in CSV format, we can run:
scrapy crawl destinations -a continent=asia -a name=asia -o asia.csv
The -a flag passes arguments that are used in our scraper’s __init__ method; this feature is what makes our scraper dynamic. However, since the arguments are optional, one can do without them and run the crawler as-is:
scrapy crawl destinations -o europe.csv
For other file types, we can run:
scrapy crawl destinations -a continent=africa -a name=africa -o africa.json
scrapy crawl destinations -a continent=pacific -a name=pacific -o pacific.xml
With this data, you can now automate your trip planning.
Some websites have a robots.txt file, which indicates whether or not the website allows scraping. Scrapy lets you ignore these rules by setting ROBOTSTXT_OBEY = False in the settings.py file.
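For reference, this is a single line in the project’s settings.py:
# settings.py
ROBOTSTXT_OBEY = False  # ignore robots.txt rules for this project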
Abhishek Kumar