Web Scraping with the HTML Agility Pack

Introduction 

For any project that pulls content from the web in C# and parses it into a usable format, you'll likely end up using the HTML Agility Pack. The Agility Pack is the standard for parsing HTML content in C# because it has several methods and properties that make working with the DOM convenient. Rather than writing your own parsing engine, the HTML Agility Pack gives you everything you need to find specific DOM elements, traverse child and parent nodes, and retrieve text and attributes (e.g., HREF links) within specified elements.

The first step, after you create your C# .NET project, is to install the HTML Agility Pack. To install the Agility Pack, you need NuGet. NuGet is available in the Visual Studio interface under Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution (alternatively, run Install-Package HtmlAgilityPack from the Package Manager Console). In this window, search for HTML Agility Pack and install it into your solution's dependencies. After installing it, you'll see the dependency in your solution, and you'll find it referenced in your using statements. If you do not see the reference in your using statements, add the following line to each code file where you use the Agility Pack:

```c#
using HtmlAgilityPack;
```

Pull HTML from a Web Page Using Native C# Libraries

With the Agility Pack dependency installed, you can now practice parsing HTML. For this tutorial, we'll use Hacker News. It's a good example since it's a dynamic page with a list of popular links that can be read by viewers. We'll take the top 10 links on Hacker News, parse the HTML, and place it into a JSON object.

Before you scrape a page, you should understand its structure and take a look at the code behind the page. We're using Chrome, but this feature is available in Firefox and Edge as well. Right-click and inspect the element for the first link on Hacker News. You'll notice that links are contained within a table, and each title is listed in a table row with specific class names. These class names can be used to pull content from each DOM element when you scrape the page.

The title class contains the element for the main title that displays on the page, and the rank class displays the title's rank. The storylink and score classes also contain important information about the link that we can add to the JSON object.

We also want to target the specific DOM element attributes that contain the information we need. The <a> and <span> elements contain the content we want, and the Agility Pack can pull them from the DOM and expose their content.

Now that we understand the page's DOM structure, we can write code that pulls the home page for Hacker News. Before starting, add the following using statements to your code (System.Linq and System.Collections.Generic are included because the parsing code below relies on LINQ and List<T>):

```c#
using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;
```

With the using statements in place, you can write a small method that dynamically pulls any web page and loads it into a variable named response. Here is an example of pulling a web page in C# using the native libraries:

```c#
string fullUrl = "https://news.ycombinator.com/";
var response = CallUrl(fullUrl).Result;

private static async Task<string> CallUrl(string fullUrl)
{
    // Request the page over TLS 1.3 and return the raw HTML as a string.
    HttpClient client = new HttpClient();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;
    client.DefaultRequestHeaders.Accept.Clear();
    var response = client.GetStringAsync(fullUrl);
    return await response;
}
```

You can confirm that the web page content was pulled by setting a breakpoint and using the HTML Visualizer to inspect the content.
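If you prefer a quick check without the debugger, a trivial sketch like the following (assuming a using System; directive) prints the size and the start of the downloaded HTML to the console:

```c#
// Illustrative sanity check: confirm that HTML actually came back.
Console.WriteLine($"Downloaded {response.Length} characters");
Console.WriteLine(response.Substring(0, Math.Min(response.Length, 300)));
```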

Parsing HTML Using Agility Pack

With the HTML loaded into a variable, you can now use the Agility Pack to parse it. You have two main options:

  • Use XPath and SelectNodes
  • Use LINQ

LINQ is useful when you want to search through nodes to find specific content. The XPath option is specific to the Agility Pack and is what most developers use to iterate through several elements. We're going to use LINQ to get the top 10 stories, then XPath to parse the child elements, get specific properties for each one, and load everything into a JSON object. We use a JSON object because it's a universal format that can be used across platforms, APIs, and programming languages. Most systems support JSON, so it's an easy way to hand data to external applications if you need to.
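For comparison, here is a minimal sketch of the pure XPath route (option 1), run against the response string returned by CallUrl() above. SelectNodes returns null when nothing matches, so the null-conditional operator guards against an empty result:

```c#
// Hedged sketch of option 1: select the 'athing' rows with an XPath expression
// and SelectNodes instead of LINQ's Descendants()/Where().
HtmlDocument xpathDoc = new HtmlDocument();
xpathDoc.LoadHtml(response);
var topRows = xpathDoc.DocumentNode
    .SelectNodes("//tr[contains(@class, 'athing')]")
    ?.Take(10)
    .ToList();
```

The rest of the tutorial sticks with the LINQ version.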

We'll create a new method to parse the HTML:

```c#
private void ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var programmerLinks = htmlDoc.DocumentNode.Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing")).Take(10).ToList();
}
```

This code loads the HTML into the Agility Pack's HtmlDocument object. Using LINQ, we pulled all tr elements whose class name contains athing. The Take() method tells the LINQ query to take only the top 10 from the list. LINQ makes it much easier to pull a specific number of elements and load them into a generic list.

We don't want every element within each table row, so we need to iterate through each item and use the Agility Pack to pull only the story title, URL, rank, and score. Note that on Hacker News the score sits in the row that follows each title row, so we select it from the following sibling tr. We'll add this functionality to the ParseHtml() method since it's part of the parsing process.

The following code adds that functionality inside a foreach loop:

```c#
private void ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var programmerLinks = htmlDoc.DocumentNode.Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing")).Take(10).ToList();

    foreach (var link in programmerLinks)
    {
        var rank = link.SelectSingleNode(".//span[@class='rank']").InnerText;
        var storyName = link.SelectSingleNode(".//a[@class='storylink']").InnerText;
        var url = link.SelectSingleNode(".//a[@class='storylink']").GetAttributeValue("href", string.Empty);
        // The score lives in the row that follows each 'athing' row; job postings have no score.
        var score = link.SelectSingleNode("./following-sibling::tr[1]//span[@class='score']")?.InnerText ?? string.Empty;
    }
}
```

The above code iterates through the top 10 links on Hacker News and gets the information we want, but it doesn't do anything with that information yet. We now need to create a JSON object to contain it. Once we have a JSON object, we can pass it to anything we want – another method in our code, an API on an external platform, or another application that can ingest JSON.

The easiest way to create a JSON object is to serialize it from a class. You can create the class within the same namespace you've been using for the code in the previous examples.

In this example, the code we've been writing is in the namespace Scraper.Controllers. Your namespace is probably different from ours, but you can find it at the top of your file under the using statements. Create the HackerNewsItems class in the same file you're using for this tutorial, then return to the ParseHtml() method, where we'll create the object.
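Here is a minimal sketch of the HackerNewsItems class; the property names simply mirror the values assigned in ParseHtml() below, so rename them to suit your own conventions if you prefer:

```c#
// Simple container for one Hacker News entry; lowercase property names match
// the assignments in ParseHtml() and the keys that will appear in the JSON.
public class HackerNewsItems
{
    public string rank { get; set; }
    public string title { get; set; }
    public string url { get; set; }
    public string score { get; set; }
}
```

Serializing the list of items to JSON uses the Newtonsoft.Json (Json.NET) package, so install it from NuGet and add the following using statement: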

```c#
using Newtonsoft.Json;
```

With the HackerNewsItems class created, we can now add JSON code to the parsing method to build the JSON object. Take a look at the ParseHtml() method now:

```c#
private string ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var programmerLinks = htmlDoc.DocumentNode.Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing")).Take(10).ToList();

    List<HackerNewsItems> newsLinks = new List<HackerNewsItems>();

    foreach (var link in programmerLinks)
    {
        var rank = link.SelectSingleNode(".//span[@class='rank']").InnerText;
        var storyName = link.SelectSingleNode(".//a[@class='storylink']").InnerText;
        var url = link.SelectSingleNode(".//a[@class='storylink']").GetAttributeValue("href", string.Empty);
        // The score lives in the row that follows each 'athing' row; job postings have no score.
        var score = link.SelectSingleNode("./following-sibling::tr[1]//span[@class='score']")?.InnerText ?? string.Empty;

        HackerNewsItems item = new HackerNewsItems();
        item.rank = rank;
        item.title = storyName;
        item.url = url;
        item.score = score;
        newsLinks.Add(item);
    }

    string results = JsonConvert.SerializeObject(newsLinks);

    return results;
}
```

You've now pulled the top 10 news links from Hacker News and created a JSON object. Here is the full code from start to finish, with the final JSON object contained in the linkList variable:

```c#
string fullUrl = "https://news.ycombinator.com/";
var response = CallUrl(fullUrl).Result;
var linkList = ParseHtml(response);

private static async Task<string> CallUrl(string fullUrl)
{
    HttpClient client = new HttpClient();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;
    client.DefaultRequestHeaders.Accept.Clear();
    var response = client.GetStringAsync(fullUrl);
    return await response;
}

private string ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var programmerLinks = htmlDoc.DocumentNode.Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing")).Take(10).ToList();

    List<HackerNewsItems> newsLinks = new List<HackerNewsItems>();

    foreach (var link in programmerLinks)
    {
        var rank = link.SelectSingleNode(".//span[@class='rank']").InnerText;
        var storyName = link.SelectSingleNode(".//a[@class='storylink']").InnerText;
        var url = link.SelectSingleNode(".//a[@class='storylink']").GetAttributeValue("href", string.Empty);
        // The score lives in the row that follows each 'athing' row; job postings have no score.
        var score = link.SelectSingleNode("./following-sibling::tr[1]//span[@class='score']")?.InnerText ?? string.Empty;

        HackerNewsItems item = new HackerNewsItems();
        item.rank = rank;
        item.title = storyName;
        item.url = url;
        item.score = score;
        newsLinks.Add(item);
    }

    string results = JsonConvert.SerializeObject(newsLinks);
    return results;
}
```
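Once linkList holds the serialized JSON, you can hand it to whatever comes next. As a trivial illustration (assuming using System; and using System.IO; directives), you could print it or save it to disk:

```c#
// Illustrative only: print the JSON and write it to a local file.
Console.WriteLine(linkList);
File.WriteAllText("hackernews.json", linkList);
```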

Pull HTML Using Selenium and a Chrome Browser Instance

In some cases, you'll need to use Selenium with a browser to pull HTML from a page. This is because some websites rely on client-side code to render results. Since client-side code executes after the browser loads the HTML and scripts, the previous example won't get the results you need. To emulate the way code loads in a browser, you can use a library named Selenium. Selenium lets you pull HTML from a page using your browser executable, and you can then parse that HTML with the Agility Pack the same way we did above.

Selenium is available from NuGet (the Selenium.WebDriver package, plus a ChromeDriver that matches your installed Chrome version). After installing it, add the following using statements to your file:

```c#
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
```

The following code launches a headless Chrome instance and loads the page:

```c#
string fullUrl = "https://news.ycombinator.com/";
var options = new ChromeOptions()
{
    BinaryLocation = "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe"
};

options.AddArguments(new List<string>() { "headless", "disable-gpu" });
var browser = new ChromeDriver(options);
browser.Navigate().GoToUrl(fullUrl);
var linkList = ParseHtml(browser.PageSource);
```

Notice in the code above that the same ParseHtml() method is used; this time we simply pass the Selenium page source as the argument. The BinaryLocation property points to the Chrome executable, but your path might be different, so make sure it points to the correct location in your own code. By reusing the same method, you can switch between loading HTML directly with native C# libraries and loading client-side content, and parse either one without writing new code for each case.
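If you want to flip between the two loading strategies without duplicating code, a hypothetical helper along these lines (the name and defaults are illustrative, and it assumes Chrome is installed in its default location) wraps both approaches behind one call:

```c#
// Hypothetical helper: use the native HttpClient path for static pages and a
// headless Chrome instance for pages that need client-side rendering.
private static async Task<string> GetHtml(string url, bool useBrowser)
{
    if (!useBrowser)
    {
        return await CallUrl(url);
    }

    var options = new ChromeOptions();
    options.AddArguments(new List<string>() { "headless", "disable-gpu" });
    using (var browser = new ChromeDriver(options))
    {
        browser.Navigate().GoToUrl(url);
        return browser.PageSource;
    }
}
```

For example, ParseHtml(await GetHtml(fullUrl, useBrowser: true)) would parse the Selenium-rendered page.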

The full code to perform the request and parse HTML is below:

```c#
string fullUrl = "https://news.ycombinator.com/";
var options = new ChromeOptions()
{
    BinaryLocation = "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe"
};

options.AddArguments(new List<string>() { "headless", "disable-gpu" });
var browser = new ChromeDriver(options);
browser.Navigate().GoToUrl(fullUrl);
var linkList = ParseHtml(browser.PageSource);

private string ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    var programmerLinks = htmlDoc.DocumentNode.Descendants("tr")
            .Where(node => node.GetAttributeValue("class", "").Contains("athing")).Take(10).ToList();

    List<HackerNewsItems> newsLinks = new List<HackerNewsItems>();

    foreach (var link in programmerLinks)
    {
        var rank = link.SelectSingleNode(".//span[@class='rank']").InnerText;
        var storyName = link.SelectSingleNode(".//a[@class='storylink']").InnerText;
        var url = link.SelectSingleNode(".//a[@class='storylink']").GetAttributeValue("href", string.Empty);
        // The score lives in the row that follows each 'athing' row; job postings have no score.
        var score = link.SelectSingleNode("./following-sibling::tr[1]//span[@class='score']")?.InnerText ?? string.Empty;

        HackerNewsItems item = new HackerNewsItems();
        item.rank = rank;
        item.title = storyName;
        item.url = url;
        item.score = score;
        newsLinks.Add(item);
    }

    string results = JsonConvert.SerializeObject(newsLinks);

    return results;
}
```

The code still parses the HTML and converts it to a JSON object from the HackerNewsItems class, but now the HTML is parsed after being loaded in a headless browser. In this example we used headless Chrome with Selenium, but Selenium also has drivers for headless Firefox available from NuGet.
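A minimal sketch of the Firefox variant (assuming the Selenium Firefox driver and geckodriver are installed from NuGet, plus a using OpenQA.Selenium.Firefox; directive) looks very similar:

```c#
// Hedged sketch: headless Firefox in place of Chrome; ParseHtml() is reused unchanged.
var firefoxOptions = new FirefoxOptions();
firefoxOptions.AddArgument("--headless");
var firefoxBrowser = new FirefoxDriver(firefoxOptions);
firefoxBrowser.Navigate().GoToUrl(fullUrl);
var firefoxLinks = ParseHtml(firefoxBrowser.PageSource);
```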

So far we've used LINQ and XPath to select elements by class name, but the Agility Pack's creators have said that CSS selector support is coming.
