Written By : Vivek Sharma
Data is not limited to tables and text. Advances in technology now let us process large collections of images, video, and audio in a fraction of the time it once took, and the range of applications for such data keeps growing. Building a solution that relies on features in images, whether to extract information from them or to draw conclusions from them, requires a large amount of image data.
A well-known saying in machine learning holds that ‘a machine learning model cannot outperform its training data.’ It is therefore important to obtain the best possible image data collection services for your model if you want it to perform well.
In today’s example, we will show how to build a scraper for a static webpage. We will scrape images of yoga postures from ‘https://greatist.com/’ and save them using urllib.
Webpage for scraping
We start by fetching the content of the webpage, for which we need the requests library. We store the URL in a variable named ‘url’ and pass it to the requests.get() function. Once we have the response, we parse the page with BeautifulSoup, which converts the HTML into a tree-based structure, using the code below.
import requests
from bs4 import BeautifulSoup
url = 'https://greatist.com/move/common-yoga-poses#easy'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
Next we use BeautifulSoup to search for the tags that hold the images. We loop over the images, collect each source link, and save each image as a .jpg file. Each image sits in a picture tag that contains a source tag, and the srcset attribute of the source tag holds the link. Prepending 'http:' and dropping everything after the '?' gives the actual URL of the image, as in the loop below.
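from urllib.request import urlretrieve
for picture in soup.find_all('picture', class_='css-16pk1is'):
    # prepend the scheme and drop the query string to get the image URL
    link = 'http:' + picture.find('source')['srcset'].split('?')[0]
    # use the last path segment as the file name
    name = link.split('/')[-1]
    urlretrieve(link, name)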
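The link-cleaning step in the loop above can also be written with urllib.parse, which may be a little more robust if a srcset value carries a descriptor such as "1x" or already includes a scheme. This is only a sketch under assumptions: the helper name image_url_and_name, its default_scheme parameter, and the example.com address are hypothetical and not part of the original page, and the sketch defaults to 'https' where the loop above prepends 'http:'.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def image_url_and_name(srcset_value, default_scheme="https"):
    """Return (clean_url, file_name) for the first candidate in a srcset value.

    Hypothetical helper for illustration; default_scheme is an assumption.
    """
    # A srcset candidate looks like "<url> <descriptor>"; keep only the URL.
    raw_url = srcset_value.split()[0].rstrip(",")
    # For a protocol-relative URL ("//host/path") the default scheme is used.
    parts = urlsplit(raw_url, scheme=default_scheme)
    # Rebuild the URL without the query string and fragment.
    clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # The last path segment serves as the file name.
    file_name = posixpath.basename(parts.path)
    return clean_url, file_name


# Made-up srcset value, used only to show the shape of the result.
link, name = image_url_and_name("//i0.example.com/images/pose.jpg?w=1155 1x")
print(link, name)
```

Inside the loop, the two assignments could then be replaced by a single call such as link, name = image_url_and_name(picture.find('source')['srcset']).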
Here we use urlretrieve() from the urllib.request module to save each image as a .jpg file. The function takes the URL of the image and the name of the file to save it under.
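As a minimal sketch of how urlretrieve() behaves, the snippet below saves a file into a separate folder and returns the path it was written to. The save_image helper and the folder name are hypothetical, and a local file:// URL stands in for a real image link so the example can run without network access.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def save_image(link, folder):
    """Download link into folder, naming the file after the last URL segment.

    Hypothetical helper for illustration only.
    """
    os.makedirs(folder, exist_ok=True)
    name = link.split('/')[-1]
    target = os.path.join(folder, name)
    # urlretrieve returns a (local_filename, headers) tuple.
    saved_path, _headers = urlretrieve(link, target)
    return saved_path


# Offline demonstration: a local file exposed through a file:// URL
# plays the role of the remote image.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "pose.jpg"
    source.write_bytes(b"placeholder bytes, not a real image")
    saved = save_image(source.as_uri(), os.path.join(tmp, "images"))
    copied_bytes = Path(saved).read_bytes()
```

With a real image link, the call would look the same: save_image(link, 'images') would write the file into an images folder next to the script.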