Web scraping

Web scraping is the process of using bots to extract content and data from a website.

Unlike screen scraping, which only copies the pixels displayed onscreen, web scraping extracts the underlying HTML code and, with it, data stored in a database. The scraper can then replicate the entire website’s content elsewhere. Web scraping is used by digital businesses that rely on data harvesting.
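To make that concrete, here is a minimal sketch of the idea using nothing but core PHP (the URL is just an example, and a real scraper needs far more care, as discussed later):

<?php
// Fetch a page's underlying HTML over HTTP using core PHP only
// (requires allow_url_fopen to be enabled).
$html = file_get_contents('https://books.toscrape.com/');
// Show the first 200 characters of the raw markup a scraper would parse.
echo substr($html, 0, 200);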

Legitimate use cases include:

  • Search engine bots crawling a site, analyzing its content, and then ranking it.
  • Price comparison sites deploying bots to auto-fetch prices and product descriptions from allied seller websites.
  • Market research companies using scrapers to pull data from forums and social media.

Web scraping with PHP

This question is fairly old but still ranks very highly in Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn’t been mentioned yet but predates many of the other tools listed here, apart from Simple HTML DOM.

The toolkit includes TagFilter, which I prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
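For readers unfamiliar with the approach, the following toy tokenizer illustrates the general state-machine idea. It is purely illustrative and is not TagFilter’s actual API:

<?php
// Toy illustration of streaming HTML tokenization (NOT TagFilter's API):
// walk the input once, switching between a "tag" state and a "text" state.
function tokenize(string $html): array
{
    $tokens = [];
    $pos = 0;
    $len = strlen($html);
    while ($pos < $len) {
        if ($html[$pos] === '<') {
            // Tag state: consume up to the closing '>'.
            $end = strpos($html, '>', $pos);
            if ($end === false) break; // truncated tag at end of stream
            $tokens[] = ['tag', substr($html, $pos + 1, $end - $pos - 1)];
            $pos = $end + 1;
        } else {
            // Text state: consume up to the next '<'.
            $end = strpos($html, '<', $pos);
            if ($end === false) $end = $len;
            $tokens[] = ['text', substr($html, $pos, $end - $pos)];
            $pos = $end;
        }
    }
    return $tokens;
}

print_r(tokenize('<p>Hello <b>world</b></p>'));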

To answer the first question, “Is there any simple way to do this without external libraries/classes?” the answer is no. HTML is quite complex and there’s nothing built into PHP that’s particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently, and you’ll find many uses for such a library beyond scraping.
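For context, here is roughly what the closest built-in option looks like. PHP’s bundled DOM extension can parse HTML, but real-world markup floods it with warnings unless you explicitly silence libxml, which hints at why a dedicated library is the more practical choice:

<?php
// PHP's bundled DOM extension, the closest "built-in" option.
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$doc = new DOMDocument();
$doc->loadHTML('<div><p>Unclosed tags<li>and quirks');
echo $doc->saveHTML(); // libxml's best-effort repair, wrapped in a full document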

Also, a very good web scraper toolkit will have three major, highly polished components/capabilities:

  1. Data retrieval:

This is making an HTTP(S) request to a server and pulling down data. A good web scraping library will also allow large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server, as well as inspect the raw data being sent and received on the wire. Some web servers are extremely picky about input, so the ability to accurately replicate a browser is handy.
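As a rough sketch of the streaming idea (plain cURL here, not the toolkit’s own API; the URL and filename are placeholders):

<?php
// Stream a large download straight to disk instead of buffering it in RAM.
$fp = fopen('large-file.bin', 'wb');                   // placeholder filename
$ch = curl_init('https://example.com/large-file.bin'); // placeholder URL
curl_setopt($ch, CURLOPT_FILE, $fp);            // write the body directly to the file handle
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_exec($ch);
curl_close($ch);
fclose($fp);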

  2. Data extraction:

This is finding pieces of content inside retrieved HTML and pulling them out, usually to store them in a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and ASP.NET output, where odd things show up such as a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags such as the ASP.NET HTML table elements that some overpaid government employees made is also very nice to have. Also, in your case, the ability to early-terminate both the data retrieval and the data extraction after reading in 50KB, or as soon as you find what you’re looking for, is a plus; that could be useful if someone submits a URL to a 500MB file.
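As a rough sketch of early termination (again plain cURL, not the toolkit’s API; the URL and the 50KB cutoff are just examples):

<?php
// Abort a transfer once 50KB has been read, e.g. to guard against huge files.
$limit = 50 * 1024;
$received = 0;
$body = '';
$ch = curl_init('https://example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$received, &$body, $limit) {
    $received += strlen($chunk);
    $body .= $chunk; // the final chunk may push $body slightly past the limit
    // Returning a byte count that differs from strlen($chunk) makes cURL abort.
    return $received > $limit ? 0 : strlen($chunk);
});
curl_exec($ch); // reports a write error once the limit trips, which is fine here
curl_close($ch);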

  3. Data manipulation:

This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or other email, downloading content for offline viewing, or preparing content for transport to another service that’s finicky about input. The ability to build a custom HTML-style template language is also a nice bonus.
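As a rough sketch of the sanitization case using PHP’s bundled DOM extension (not the toolkit’s API):

<?php
// Strip <script> tags from user-submitted HTML before storing it.
$dirty = '<p>Hello</p><script>alert("xss")</script>';
libxml_use_internal_errors(true);            // tolerate malformed input quietly
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $dirty . '</div>'); // wrap so there is a single root node
// Copy the node list first: removing from a live DOMNodeList while iterating skips nodes.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}
echo $doc->saveHTML($doc->getElementsByTagName('div')->item(0)); // <div><p>Hello</p></div>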

Obviously, Ultimate Web Scraper Toolkit does all of the above and more. It was also relatively simple to stand the clients on their heads and make WebServer and WebSocketServer classes.

Web scraping in PHP with a headless browser

A headless browser is a browser without a graphical user interface. Headless browsers let you use your terminal to load a web page in an environment similar to a regular web browser. This allows you to write code to control the browsing session, as we have just done in the previous steps.

In modern web development, most developers use JavaScript web frameworks. These frameworks generate the HTML code inside the browser. In other cases, AJAX is used to dynamically load content. In the previous examples, we used a static HTML page, so the output was consistent. In dynamic cases, where JavaScript and AJAX are used to generate the HTML, the output of the DOM tree may differ greatly, causing our scrapers to fail. Headless browsers come into the picture to handle such issues on modern websites.

A library that we’ll use for a headless browser is the Symfony Panther PHP library. You can use the library to scrape websites and run tests using real browsers. Additionally, it provides the same methods as the Goutte library, so you can use it in place of Goutte.

We have already done plenty of scraping, so let’s try something different: we’ll load an HTML page and take a screenshot of it.

Install Symfony Panther with the following command:

composer require symfony/panther

Create a new PHP file; let’s call it panther_requests.php.

Add the following code to the file:

<?php
# scraping Books to Scrape: https://books.toscrape.com/
require 'vendor/autoload.php';

$httpClient = \Symfony\Component\Panther\Client::createChromeClient();
// for a Firefox client, use the line below instead:
// $httpClient = \Symfony\Component\Panther\Client::createFirefoxClient();

// get the response
$response = $httpClient->get('https://books.toscrape.com/');

// take a screenshot and store it in the current directory
$response->takeScreenshot($saveAs = 'books_scrape_homepage.jpg');

// let's display some book titles
$response->getCrawler()->filter('.row li article h3 a')
    ->each(function ($node) {
        echo $node->text() . PHP_EOL;
    });

For this code to run on your system, you need to install the drivers for Chrome or Firefox, depending on which client you used in your code. Fortunately, Composer can do this for you automatically. Execute the following command in your terminal to install and detect the drivers:

composer require --dev dbrekelmans/bdi && vendor/bin/bdi detect drivers

Now you can run the PHP file from your terminal. It will take a screenshot of the web page and store it in the current directory, then display a list of titles from the website.
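For example, using the file name from above:

php panther_requests.php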

 
