Web scraping with PHP
Introduction
In this article, we’ll look at some ways to scrape the web with PHP. Keep in mind that there is no single “best way”: each approach has its use case, depending on what you need, how you like to do things, and what you want to achieve.
As an example, we will try to get a list of people that share the same birthday, as you can see, for instance, on famousbirthdays.com.
-
HTTP Requests
When it comes to browsing the web, the most commonly used communication protocol is HTTP, the Hypertext Transfer Protocol. It specifies how participants on the World Wide Web can communicate with each other: there are servers hosting resources and clients requesting resources from them.
Your browser is such a client – when we enable the developer console, select the “Network” tab, and open the famous example.com, we can watch the request being sent and the response coming back.
In its most basic form, a request looks like this:
GET / HTTP/1.1
Host: www.example.com
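The server answers in the same line-based format: a status line, some response headers, an empty line, and then the body. Abridged, a response to the request above looks roughly like this (the exact headers and lengths will vary):

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1256

<!doctype html>
<html>
...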
Let’s try to recreate what the browser just did for us.
fsockopen()
We usually don’t see this lower-deck communication, but just for the sake of it, let’s create this request with the most basic tool PHP has to offer: fsockopen():
<?php # fsockopen.php

// In HTTP, lines have to be terminated with "\r\n" because of
// backward compatibility reasons
$request = "GET / HTTP/1.1\r\n";
$request .= "Host: www.example.com\r\n";
$request .= "\r\n"; // We need to add a last new line after the last header

// We open a connection to www.example.com on the port 80
$connection = fsockopen('www.example.com', 80);

// The information stream can flow, and we can write and read from it
fwrite($connection, $request);

// As long as the server returns something to us...
while (!feof($connection)) {
    // ... print what the server sent us
    echo fgets($connection);
}

// Finally, close the connection
fclose($connection);
And indeed, if you put this code snippet into a file fsockopen.php and run it with php fsockopen.php, you will see the same HTML that you get when you open http://example.com in your browser.
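By the way, fsockopen() opens a plain TCP connection, which is fine for HTTP on port 80. If we wanted to talk to the HTTPS version of the site instead, we’d ask PHP for a TLS-wrapped socket via the ssl:// transport on port 443 – the rest of the script can stay the same:

// Same request as before, but encrypted: a TLS socket on port 443
$connection = fsockopen('ssl://www.example.com', 443);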
cURL
Composing HTTP requests by hand is instructive, but tedious. PHP’s cURL extension takes care of the protocol details for us:
<?php # curl.php

// Initialize a cURL handle ($ch = cURL handle)
$ch = curl_init();

// Set the URL
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com');

// Set the HTTP method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// Return the response instead of printing it out
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// Send the request and store the result in $response
$response = curl_exec($ch);

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// Close the cURL handle to free up system resources
curl_close($ch);
To follow a website redirect, all we need is a curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);, and there are many more options available to accommodate further needs.
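For instance, if the target server is slow or expects a browser-like client, a few more options come in handy (a small selection, not an exhaustive list):

// Give up if the whole transfer takes longer than 30 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 30);

// Identify ourselves with a custom User-Agent header
curl_setopt($ch, CURLOPT_USERAGENT, 'my-scraper/1.0');

// When following redirects, follow at most 5 of them
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);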
-
Strings, regular expressions, and Wikipedia
If we inspect the source of the Wikipedia page for December 10 (https://en.wikipedia.org/wiki/December_10) around its “Births” section, we can see that:
- There’s an <h2> header element containing <span id="Births" …>Births</span> (only one element on the whole page should™ have an ID named “Births”).
- The header is immediately followed by an unordered list (<ul>).
- Each list item (<li>…</li>) contains a year, a dash, a name, a comma, and a teaser of what the given person is known for.
This is something we can work with. As a first step, let’s fetch the page’s HTML with file_get_contents():
<?php # wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

echo $html;
The important thing is that we know where we should start looking: we’re only interested in the part starting at id="Births" and ending at the closing </ul> of the list right after it:
<?php # wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

$start = stripos($html, 'id="Births"');
$end = stripos($html, '</ul>', $offset = $start);
$length = $end - $start;

$htmlSection = substr($html, $start, $length);

echo $htmlSection;
Let’s use a regular expression to load all list items into an array so that we can handle each item one by one:
preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];

foreach ($listItems as $item) {
    echo "{$item}\n\n";
}
Finally, the name is located within the <a> element that follows the dash. Let’s grab ’em all, and we’re done.
<?php # wikipedia.php

$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');

$start = stripos($html, 'id="Births"');
$end = stripos($html, '</ul>', $offset = $start);
$length = $end - $start;

$htmlSection = substr($html, $start, $length);

preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];

echo "Who was born on December 10th\n";
echo "=============================\n\n";

foreach ($listItems as $item) {
    preg_match('@(\d+)@', $item, $yearMatch);
    $year = (int) $yearMatch[0];

    preg_match('@;\s<a\b[^>]*>(.*?)</a>@i', $item, $nameMatch);
    $name = $nameMatch[1];

    echo "{$name} was born in {$year}\n";
}
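Running the script with php wikipedia.php prints one line per list item. Abridged, and depending on the current state of the Wikipedia article, the output will look something like this:

Who was born on December 10th
=============================

Ada Lovelace was born in 1815
Emily Dickinson was born in 1830
...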
-
Guzzle, XML, XPath, and IMDb
Guzzle is a popular HTTP Client for PHP that makes it easy and enjoyable to send HTTP requests. It provides you with an intuitive API, extensive error handling, and even the possibility of extending its functionality with middleware. This makes Guzzle a powerful tool that you don’t want to miss. You can install Guzzle from your terminal with composer require guzzlehttp/guzzle.
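One example of that error handling: unlike file_get_contents(), Guzzle throws an exception when a request fails, so a 404 or a connection problem can’t slip through unnoticed. A minimal sketch – the file name and URL are just illustrations:

<?php # guzzle-errors.php

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\TransferException;

$client = new Client();

try {
    $response = $client->get('https://www.example.com/does-not-exist');
} catch (TransferException $e) {
    // Failed requests (HTTP 4xx/5xx responses as well as network
    // errors) end up here instead of silently producing an empty result
    echo 'Request failed: ' . $e->getMessage() . PHP_EOL;
}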
In our new script, we’ll first fetch the page with Guzzle, convert the returned HTML string into a DOMDocument object and initialize an XPath parser with it:
<?php # imdb.php

require 'vendor/autoload.php';

$httpClient = new \GuzzleHttp\Client();
$response = $httpClient->get('https://www.imdb.com/search/name/?birth_monthday=12-10');

$htmlString = (string) $response->getBody();

// HTML is often wonky, this suppresses a lot of warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

$xpath = new DOMXPath($doc);
Let’s have a closer look at the HTML of the result page:
- The list is contained in a <div class="lister-list"> element
- Each direct child of this container is a <div> with a lister-item mode-detail class attribute
- Finally, the name can be found within an <a> within an <h3> within a <div> with a lister-item-content class
If we look closer, we can make it even simpler and skip the child divs and class names: there is only one <h3> in a list item, so let’s target that directly:
$links = $xpath->evaluate('//div[@class="lister-list"][1]//h3/a');

foreach ($links as $link) {
    echo $link->textContent . PHP_EOL;
}
- //div[@class="lister-list"][1] returns the first ([1]) div with an attribute named class that has the exact value lister-list
- Within that div, from all <h3> elements (//h3), return all anchors (<a>)
- We then iterate through the result and print the text content of the anchor elements
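The anchors carry more than just their text, of course. If we also wanted each person’s profile URL, we could read the href attribute from the same nodes – a small sketch building on the $xpath object from above, assuming IMDb’s links are relative paths like /name/nm0000123/:

$links = $xpath->evaluate('//div[@class="lister-list"][1]//h3/a');

foreach ($links as $link) {
    // DOMElement::getAttribute() returns the raw attribute value
    $name = $link->textContent;
    $url  = 'https://www.imdb.com' . $link->getAttribute('href');

    echo "{$name} => {$url}" . PHP_EOL;
}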
-
Headless Browsers
For a purely static HTML page, the document we download and the DOM tree the browser builds might not differ much, but the more JavaScript is embedded in the HTML source, the more likely it is that the resulting DOM tree is very different. When a website uses AJAX to dynamically load content, or when even the complete HTML is generated dynamically with JavaScript, we cannot access it by just downloading the original HTML document from the server.
This is where so-called headless browsers come into play. A headless browser is a browser engine without a graphical user interface; it can be controlled programmatically, much like the HTTP clients we used before.
Symfony Panther is a standalone library that provides the same APIs as Goutte, a popular PHP scraping library – which means it can serve as a drop-in replacement for existing Goutte scripts. A nice feature is that it can use an already existing installation of Chrome or Firefox on your computer, so you don’t need to install additional software.
Since we have already achieved our goal of getting the birthdays from IMDb, let’s conclude our journey by taking a screenshot of the page that we so diligently parsed.
After installing Panther with composer require symfony/panther, we could, for example, write our script like this:
<?php # screenshot.php

require 'vendor/autoload.php';

$client = \Symfony\Component\Panther\Client::createFirefoxClient();
// or
// $client = \Symfony\Component\Panther\Client::createChromeClient();

$client
    ->get('https://www.imdb.com/search/name/?birth_monthday=12-10')
    ->takeScreenshot($saveAs = 'screenshot.jpg');
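One caveat with dynamic pages: the screenshot is taken as soon as navigation finishes, which may be before JavaScript has rendered everything. Panther’s waitFor() lets us block until a given selector appears first – a hedged sketch, assuming the result list keeps its lister-list class:

$client->get('https://www.imdb.com/search/name/?birth_monthday=12-10');

// Wait (up to a timeout) until the result list is present in the DOM...
$client->waitFor('.lister-list');

// ...and only then take the screenshot
$client->takeScreenshot('screenshot.jpg');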
Conclusion
We’ve learned about several ways to scrape the web with PHP today. Still, there are a few topics that we haven’t spoken about – for example, website providers like their sites to be seen in a browser and often frown upon being accessed programmatically.
- If we load 50 result pages in quick succession, IMDb could interpret this as unusual behavior and block our IP address from further accessing their website.
- Many websites have rate limiting in place to prevent Denial-of-Service attacks.
- Depending on the country you live in and where a server is located, some sites might not be available from your computer.
- Managing headless browsers for different use cases can take a toll on you and your computer.
Abhishek Kumar