Web scraping with C#
Introduction
C# remains a popular backend programming language, and you might find yourself needing it to scrape a web page. In this article, we will cover scraping with C# by making an HTTP request, parsing the result, and extracting the information that you want to save. This method works for basic scraping, but you will sometimes come across single-page applications whose content is rendered by client-side JavaScript frameworks such as Vue.js, which require a different approach. We’ll also cover scraping these pages using PuppeteerSharp, Selenium WebDriver, and Headless Chrome.
Making an HTTP Request to a Web Page in C#
Imagine that you have a scraping project where you need to scrape Wikipedia for information on famous programmers. Wikipedia has a page with a list of famous programmers with links to each profile page. You can scrape this list and add it to a CSV file to save for future review and use. This is just one simple example of what you can do with web scraping, but the general concept is to find a site that has the information you need, use C# to scrape the content, and store it for later use. In more complex projects, you can crawl pages using the links found on a top category page.
Using .NET HTTP Libraries to Retrieve HTML
.NET Core introduced asynchronous HTTP request libraries to the framework. These libraries are native to .NET, so no additional libraries are needed for basic requests. Before you make the request, you need to build the URL and store it in a variable. Because we already know the page that we want to scrape, a simple URL variable can be added to the HomeController’s Index() method. The HomeController Index() method is the default call when you first open an MVC web application.
Add the following code to the Index() method in the HomeController file:
```csharp
public IActionResult Index()
{
    string url = "https://en.wikipedia.org/wiki/List_of_programmers";

    return View();
}
```
The .NET HTTP libraries return the request as an asynchronous task, so it’s easiest to put the request logic in its own static async method. Add the following method to the HomeController file:
```csharp
private static async Task<string> CallUrl(string fullUrl)
{
    HttpClient client = new HttpClient();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;
    client.DefaultRequestHeaders.Accept.Clear();
    var response = client.GetStringAsync(fullUrl);

    return await response;
}
```
The following code is what your Index() method should contain (for now):
```csharp
public IActionResult Index()
{
    string url = "https://en.wikipedia.org/wiki/List_of_programmers";
    var response = CallUrl(url).Result;

    return View();
}
```
The code to make the HTTP request is now complete. We haven’t parsed the response yet, but this is a good time to run the code and confirm that the Wikipedia HTML is returned rather than an error. Make sure you set a breakpoint in the Index() method at the following line:
`return View();`
This will ensure that you can use the Visual Studio debugger UI to view the results.
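If you prefer not to step through with the debugger, another quick sanity check is to log a slice of the response. The snippet below is a minimal sketch that assumes it is placed inside the Index() method after CallUrl() returns; it simply writes the first part of the downloaded HTML to the debug output window.

```csharp
// Minimal sketch (assumes it runs inside Index() after `var response = CallUrl(url).Result;`):
// print the first 500 characters of the response so you can confirm HTML came back.
var preview = response.Length > 500 ? response.Substring(0, 500) : response;
System.Diagnostics.Debug.WriteLine(preview);
```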
Parsing the HTML
With the HTML retrieved, it’s time to parse it. HTML Agility Pack is a common choice, but you may have your own preference; even LINQ alone can be used to query HTML. For this example, and for ease of use, we’ll use the Agility Pack, which is available as the HtmlAgilityPack NuGet package.
We will parse the document in its own method in the HomeController, so create a new method named ParseHtml() and add the following code to it:
```csharp
private List<string> ParseHtml(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    var programmerLinks = htmlDoc.DocumentNode.Descendants("li")
        .Where(node => !node.GetAttributeValue("class", "").Contains("tocsection")).ToList();

    List<string> wikiLink = new List<string>();

    foreach (var link in programmerLinks)
    {
        if (link.FirstChild.Attributes.Count > 0)
            wikiLink.Add("https://en.wikipedia.org" + link.FirstChild.Attributes[0].Value);
    }

    return wikiLink;
}
```
In the above code, a generic list of strings is built from the parsed HTML: the links to famous programmers on the selected Wikipedia page. We use LINQ to filter out the table of contents links, leaving only the list items that link to programmer profiles. In the foreach loop, we use .NET’s native functionality to read the first anchor tag in each list item, which contains the link to the programmer’s profile. Because Wikipedia uses relative links in the href attribute, we manually build the absolute URL so that each link in the exported list can be opened directly.
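If you would rather not concatenate strings by hand, the same absolute URL can be built with System.Uri, which also handles edge cases such as duplicated slashes. The snippet below is a small sketch of that alternative; the relativeHref value is just an illustrative example standing in for the href read from an anchor tag.

```csharp
// Sketch of an alternative to string concatenation: let System.Uri resolve
// the relative href against the Wikipedia base address.
var baseUri = new Uri("https://en.wikipedia.org");
string relativeHref = "/wiki/Ada_Lovelace"; // example value taken from an anchor tag
string absoluteUrl = new Uri(baseUri, relativeHref).ToString();
// absoluteUrl == "https://en.wikipedia.org/wiki/Ada_Lovelace"
```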
Exporting Scraped Data to a File
The code above downloads the Wikipedia page and parses the HTML, giving us a generic list of links from the page. Next, we need to export those links to a CSV file. We’ll make another method named WriteToCsv() to write the data from the generic list to a file. The following code is the full method; it writes the extracted links to a file named “links.csv” on the local disk.
```csharp
private void WriteToCsv(List<string> links)
{
    StringBuilder sb = new StringBuilder();

    foreach (var link in links)
    {
        sb.AppendLine(link);
    }

    System.IO.File.WriteAllText("links.csv", sb.ToString());
}
```
The above code is all it takes to write data to a file on local storage using native .NET framework libraries.
The full HomeController code for this scraping section is below:
```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;
using HtmlAgilityPack;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
using System.Net;
using System.Text;
using System.IO;

namespace ScrapingBeeScraper.Controllers
{
    public class HomeController : Controller
    {
        private readonly ILogger<HomeController> _logger;

        public HomeController(ILogger<HomeController> logger)
        {
            _logger = logger;
        }

        public IActionResult Index()
        {
            string url = "https://en.wikipedia.org/wiki/List_of_programmers";
            var response = CallUrl(url).Result;
            var linkList = ParseHtml(response);
            WriteToCsv(linkList);

            return View();
        }

        [ResponseCache(Duration = 0, Location = ResponseCacheLocation.None, NoStore = true)]
        public IActionResult Error()
        {
            return View(new ErrorViewModel { RequestId = Activity.Current?.Id ?? HttpContext.TraceIdentifier });
        }

        private static async Task<string> CallUrl(string fullUrl)
        {
            HttpClient client = new HttpClient();
            ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;
            client.DefaultRequestHeaders.Accept.Clear();
            var response = client.GetStringAsync(fullUrl);

            return await response;
        }

        private List<string> ParseHtml(string html)
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            var programmerLinks = htmlDoc.DocumentNode.Descendants("li")
                .Where(node => !node.GetAttributeValue("class", "").Contains("tocsection")).ToList();

            List<string> wikiLink = new List<string>();

            foreach (var link in programmerLinks)
            {
                if (link.FirstChild.Attributes.Count > 0)
                    wikiLink.Add("https://en.wikipedia.org" + link.FirstChild.Attributes[0].Value);
            }

            return wikiLink;
        }

        private void WriteToCsv(List<string> links)
        {
            StringBuilder sb = new StringBuilder();

            foreach (var link in links)
            {
                sb.AppendLine(link);
            }

            System.IO.File.WriteAllText("links.csv", sb.ToString());
        }
    }
}
```
Part II: Scraping Dynamic JavaScript Pages
In the previous section, the data was readily available to our scraper because the server returned fully constructed HTML, just as a browser would receive it. Newer JavaScript frameworks such as Vue.js render pages with dynamic JavaScript code. When a page uses this type of technology, a basic HTTP request won’t return HTML to parse. Instead, you need to render the JavaScript in a browser and extract the data from the resulting page.
Dynamic JavaScript isn’t the only issue. Some sites detect whether JavaScript is enabled or evaluate the User-Agent value sent by the browser. The User-Agent header tells the web server what type of browser is accessing its pages. Basic scraper code sends no User-Agent by default, and many web servers return different content depending on the User-Agent value. Some web servers also use JavaScript to detect when a request does not come from a human user.
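As a partial mitigation for the User-Agent issue, you can have HttpClient send a browser-like User-Agent header explicitly. The snippet below is a small sketch of how that could be wired into the earlier CallUrl() approach; the header string is only an illustrative example, not a value required by any particular site.

```csharp
// Sketch: send an explicit, browser-like User-Agent with HttpClient.
// (Assumes this code runs inside an async method; the header value is only an example.)
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.UserAgent.ParseAdd(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36");
string html = await client.GetStringAsync("https://en.wikipedia.org/wiki/List_of_programmers");
```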
Using Selenium with Headless Chrome
If you don’t want to use Puppeteer, you can use Selenium WebDriver. Selenium is a common tool in automated testing of web applications because, in addition to rendering dynamic JavaScript, it can emulate human actions such as clicking a link or button. To use this solution, go to NuGet and install the Selenium.WebDriver and Selenium.WebDriver.ChromeDriver packages.
Add the following namespaces to the using statements:
```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
```
Now you can add the code that opens the page and extracts all of its links. The following code demonstrates how to collect the links and add them to a generic list.
```csharp
public async Task<IActionResult> Index()
{
    string fullUrl = "https://en.wikipedia.org/wiki/List_of_programmers";
    List<string> programmerLinks = new List<string>();

    var options = new ChromeOptions()
    {
        BinaryLocation = "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe"
    };

    options.AddArguments(new List<string>() { "headless", "disable-gpu" });

    var browser = new ChromeDriver(options);
    browser.Navigate().GoToUrl(fullUrl);

    // Collect every anchor tag on the rendered page and store its href value.
    var links = browser.FindElements(By.TagName("a"));

    foreach (var url in links)
    {
        programmerLinks.Add(url.GetAttribute("href"));
    }

    // Close the headless browser once scraping is finished.
    browser.Quit();

    return View();
}
```
Notice that the Selenium calls themselves are synchronous, so if you have a large pool of links and many actions to take on each page, the request will block until the scraping completes.
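One way to keep those blocking Selenium calls from tying up the request thread is to push the work onto the thread pool. The sketch below is a minimal, illustrative example of that idea, assuming the same ChromeOptions setup shown above; for long-running scrapes, a dedicated background service is usually a better fit.

```csharp
// Minimal sketch: run the blocking Selenium work on a thread-pool thread
// so callers can await the result instead of blocking on it.
private static Task<List<string>> ScrapeLinksAsync(string fullUrl, ChromeOptions options)
{
    return Task.Run(() =>
    {
        var links = new List<string>();

        using (var browser = new ChromeDriver(options))
        {
            browser.Navigate().GoToUrl(fullUrl);

            foreach (var element in browser.FindElements(By.TagName("a")))
            {
                links.Add(element.GetAttribute("href"));
            }
        }

        return links;
    });
}
```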
Conclusion
Web scraping is a powerful tool for developers who need to obtain large amounts of data from a web application. With pre-packaged dependencies, you can turn a difficult process into only a few lines of code.