Web Scraping

10 tips to avoid getting blocked while web scraping

Understanding Web Scraping:

Web scraping refers to the extraction of content from a website by retrieving its underlying HTML code and the data stored in its database. Web scraping can also be used for illegal purposes, such as undercutting prices and stealing copyrighted content. It can be done manually, but that is extremely monotonous work. To speed up the process, automated web scraping tools can be used, which cost less and work far more quickly.

How a web scraper works:

First, the web scraper is given one or more URLs. The scraper then loads the complete HTML code of each page. Next, it extracts either all of the data on the page or only the specific data the user has selected. Finally, the scraper presents the collected data in a usable format.
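To make this flow concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries; the URL and the CSS selector are illustrative placeholders, not part of the description above.

```python
import requests
from bs4 import BeautifulSoup

# 1. The scraper is given a URL (placeholder for illustration).
url = "https://example.com/products"

# 2. Load the complete HTML of the page.
response = requests.get(url, timeout=10)
response.raise_for_status()

# 3. Extract all the data, or only the data the user selected (hypothetical selector).
soup = BeautifulSoup(response.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.select("h2.title")]

# 4. Present the collected data in a usable format (here, one item per line).
for title in titles:
    print(title)
```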

Best Web Scraping tools:

Here are the top 8 web scraping tools:

  1. ParseHub
  2. Scrapy
  3. OctoParse
  4. Scraper API
  5. Mozenda
  6. Webhose.io
  7. Content Grabber
  8. Common Crawl

Web scraping without getting blocked:

Web scraping can be a difficult task because most popular sites actively try to prevent developers from scraping them, using a variety of techniques such as IP address detection, CAPTCHAs, and HTTP request header checking. However, there are counter-strategies developers can use to avoid these blocks, allowing them to build web scrapers that are nearly impossible to detect. Here are 10 tips on how to scrape a website without getting blocked:

  • IP ROTATION

Using the same IP for every request is the easiest way for anti-scraping mechanisms to catch you red-handed: you will be blocked quickly. So you must use a new IP address for every scraping request. Ideally, have a pool of at least 10 IPs to rotate through before making HTTP requests. To avoid getting blocked, you can use proxy-rotating services like Scrapingpass.

For websites that have advanced bot-detection systems, you have to use mobile proxies. These proxies give you access to millions of IPs, which is helpful when scraping millions of pages over a longer period of time.
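A minimal sketch of IP rotation with the requests library; the proxy addresses below are placeholders you would replace with IPs from your own pool or from a rotating-proxy provider.

```python
import random
import requests

# Placeholder proxy pool -- in practice these come from your proxy provider.
PROXIES = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
    "http://user:pass@203.0.113.12:8080",
]

def fetch(url):
    # Pick a different proxy for each request so no single IP sends all the traffic.
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch("https://example.com")  # placeholder URL
print(response.status_code)
```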

  • SET A REAL USER AGENT

The User-Agent request header is a character string that lets servers and network peers identify the application, operating system, and version of the requesting user agent. Some websites block requests whose User-Agent doesn't belong to a major browser, and many won't serve their content at all if no user agent is set.

Anti-scraping mechanisms apply much the same technique here that they use when banning IPs: if you send every request with the same User-Agent, you will be banned in no time. The solution is simple: create a list of User-Agents and rotate through it.
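A minimal sketch of User-Agent rotation with requests; the strings below are examples of real-browser User-Agents and should be kept up to date.

```python
import random
import requests

# A small pool of browser User-Agent strings (illustrative; keep them current).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Send a different User-Agent on each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

print(fetch("https://httpbin.org/user-agent").json())
```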

  • KEEP RANDOM INTERVALS IN BETWEEN

A web scraper is easy to detect when it sends exactly one request every second, all day long; no actual human being would ever use a website like that. To avoid this, program your bot to sleep periodically between scraping requests. A random pause of around 10 to 20 seconds before continuing makes your bot look more human. Avoid making many simultaneous requests; scrape a small number of pages at a time. Better still, use an auto-throttling mechanism that automatically adjusts the crawling speed based on the load of the website you are crawling.
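A minimal sketch of random delays between requests; the 10-20 second window follows the figure above, and the URLs are placeholders. Scrapy users can get similar behaviour from its built-in AutoThrottle extension (AUTOTHROTTLE_ENABLED = True in settings.py).

```python
import random
import time
import requests

urls = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Sleep a random 10-20 seconds so requests don't arrive at a fixed rhythm.
    time.sleep(random.uniform(10, 20))
```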

  • SET OTHER REQUEST HEADERS

Authentic web browsers send a whole host of headers, which websites can check carefully to block your web scraper. To make your scraper look like a real browser, navigate to https://httpbin.org/anything and copy the headers your browser sends there. Headers such as "Accept", "Accept-Language", and "Upgrade-Insecure-Requests" make your requests look like they come from a real browser, which should be enough to avoid detection by the vast majority of websites.
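A sketch of sending a browser-like header set with requests; the values below mirror what a typical Chrome session reports to httpbin.org/anything, so verify them against your own browser before relying on them.

```python
import requests

# Header set copied from a real browser session (values are illustrative).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get("https://httpbin.org/anything", headers=HEADERS, timeout=10)
print(response.json()["headers"])  # confirm which headers the server actually received
```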

  • HEADLESS BROWSER

The trickiest websites to scrape detect things like browser extensions, cookies, and JavaScript execution to determine whether a request comes from a real user or a bot. To scrape these websites, you may need to run your own headless browser.

Browser-automation tools such as Selenium and Puppeteer provide APIs to control browsers and scrape websites, and a lot of effort goes into making these browsers hard to detect. You can even use browserless services that open a browser instance on their servers rather than adding load to yours.
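A minimal headless-browser sketch using Selenium with Chrome; it assumes Selenium 4+ (which manages the driver automatically) and a locally installed Chrome, and the URL is a placeholder.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)
    html = driver.page_source  # full DOM after JavaScript execution
finally:
    driver.quit()
```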

  • ROBOTS.TXT

The robots.txt file tells search engine crawlers which pages or files they can or can't request from a site. It is mainly used to avoid overloading a website with requests and provides standard rules for crawlers, so check it before scraping.
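A minimal sketch of checking robots.txt before scraping, using Python's standard-library urllib.robotparser; the site URL and bot name are hypothetical.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()

# Ask whether our (hypothetical) bot may fetch a given path.
if rp.can_fetch("MyScraperBot", "https://example.com/products/"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - skip this page")
```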

Basically, an anti-scraping mechanism works on one fundamental question: is the visitor a bot or a human? To decide, it checks criteria such as the following:

  1. If you are scraping pages faster than a human possibly could, you will fall into the "bot" category.
  2. Following the same pattern while scraping, for example going through every page of the target domain collecting only images or links.
  3. Using the same IP address for a long period of time.
  4. A missing User-Agent.

Keep these points in mind and you will be able to scrape just about any website without getting blocked.

  • CHANGE IN SCRAPING PATTERN AND DETECT WEBSITE CHANGE

Generally, humans perform random, non-continuous actions as they browse a site, while web scraping bots are programmed to crawl in the same pattern every time. As mentioned earlier, some websites have strong anti-scraping mechanisms that will detect such a bot and ban it permanently.

To keep your bot from being caught, incorporate some random clicks, mouse movements, and other random actions on the page that make the spider look like a human.

Another problem is that many websites change their layouts for various reasons, and when that happens your scraper will fail to return the data you expect. To overcome this, you should have a proper monitoring system that detects layout changes and notifies you, so that your scraper can be updated accordingly; a minimal sketch follows.
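One simple way to monitor for layout changes is to check that the CSS selectors the scraper depends on still match something, and flag the page if they don't. The selectors and URL below are hypothetical examples.

```python
import requests
from bs4 import BeautifulSoup

# Selectors the scraper relies on (hypothetical examples).
EXPECTED_SELECTORS = ["h2.product-title", "span.price", "div.pagination"]

def missing_selectors(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    # Report every expected selector that no longer matches anything on the page.
    return [sel for sel in EXPECTED_SELECTORS if not soup.select(sel)]

missing = missing_selectors("https://example.com/products")  # placeholder URL
if missing:
    print("Layout may have changed; selectors not found:", missing)
```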

  • CAPTCHA SOLVING SERVICES

A lot of websites use reCAPTCHA from Google, which lets you through only if you pass a test. If you pass it within a certain time frame, the site concludes that you are a real human being and not a bot. If you are scraping a website at a large scale, it will eventually block you and start showing captcha pages instead of web pages.

There are services to get past these limitations, such as 2Captcha. It is a captcha-solving service that provides solutions for almost all known captcha types via a simple API, letting you get past captchas without any human involvement in activities like data parsing, web scraping, and web automation.
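A sketch of the general submit-and-poll flow that 2Captcha documents (its in.php and res.php endpoints); the parameter names should be verified against their current documentation, and the API key, site key, and page URL are placeholders.

```python
import time
import requests

API_KEY = "YOUR_2CAPTCHA_KEY"            # placeholder
SITE_KEY = "TARGET_SITE_RECAPTCHA_KEY"   # placeholder: reCAPTCHA site key from the page
PAGE_URL = "https://example.com/login"   # placeholder

# 1. Submit the captcha job (endpoint/params follow 2Captcha's documented flow; verify).
submit = requests.post("http://2captcha.com/in.php", data={
    "key": API_KEY, "method": "userrecaptcha",
    "googlekey": SITE_KEY, "pageurl": PAGE_URL, "json": 1,
}, timeout=30).json()
job_id = submit["request"]

# 2. Poll until a solution token is ready, then submit it with the target form.
while True:
    time.sleep(10)
    result = requests.get("http://2captcha.com/res.php", params={
        "key": API_KEY, "action": "get", "id": job_id, "json": 1,
    }, timeout=30).json()
    if result.get("status") == 1:
        token = result["request"]  # g-recaptcha-response token
        break
    if result.get("request") != "CAPCHA_NOT_READY":
        raise RuntimeError("2Captcha error: " + str(result))

print(token)
```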

  • HONEYPOT TRAPS

A honeypot is basically an application that imitates the behavior of a real system in order to detect hacking or web scraping. Some websites install honeypots in the form of invisible links that bots and web scrapers will see but a normal user never will. Before following a link, check whether it has the "display: none" or "visibility: hidden" CSS property set. If it does, avoid following it; otherwise you will be identified as a programmatic scraper and end up getting blocked. A sketch follows.
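A minimal sketch of skipping honeypot links: it drops anchors whose inline style hides them. Note that real pages may also hide links via external stylesheets, which a simple check like this won't catch; the URL is a placeholder.

```python
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://example.com", timeout=10).text, "html.parser")  # placeholder URL

def looks_like_honeypot(tag):
    style = (tag.get("style") or "").replace(" ", "").lower()
    # Skip links hidden with inline CSS -- a normal user would never see or click them.
    return "display:none" in style or "visibility:hidden" in style

safe_links = [a["href"] for a in soup.find_all("a", href=True) if not looks_like_honeypot(a)]
print(safe_links)
```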

  • GOOGLE CACHE

Google keeps cached copies of many websites, so you can make a request to the cached copy rather than to the website itself.

But keep in mind that this approach should only be used for websites that do not hold sensitive information. LinkedIn, for example, does not allow Google to cache its data. Google also refreshes its cached copy of a website at an interval that depends on the site's popularity.
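A minimal sketch of requesting Google's cached copy instead of the live site; the cache URL pattern below is the webcache.googleusercontent.com form Google has historically used, and the target URL is a placeholder.

```python
import requests

target = "https://example.com/some-page"  # placeholder
cache_url = "https://webcache.googleusercontent.com/search?q=cache:" + target

response = requests.get(cache_url, timeout=10)
print(response.status_code)
# The response body is Google's cached snapshot of the page, not the live site.
```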
