Using the HtmlAgilityPack to parse HTML in ASP.NET

Hardly a week goes by without someone asking a question in the ASP.NET forums about parsing HTML for one purpose or another. Mostly, the questions are couched in terms of 'finding values' or similar, prompting responses from the community that recommend one regular expression pattern or another, treating HTML as a string of text with no structure or rules. In fact, HTML is a structured document format with a set of very clearly defined rules, which means that it can easily be parsed given the right tool. My favourite tool for parsing HTML is the HtmlAgilityPack.

The HtmlAgilityPack (HAP) has been around for some time now, and is available via NuGet. You can install it by running the following command in the Package Manager Console:

Install-Package HtmlAgilityPack

HAP accepts HTML as a string, file, stream or TextReader object. The HTML is loaded into an HtmlDocument object using the Load method for streams, files and the TextReader option, and the LoadHtml method for loading HTML represented as a string. The two most commonly used methods are those that load a file or string:

using System.Net;       // WebClient
using HtmlAgilityPack;  // HtmlDocument

var html = new HtmlDocument();
html.Load(@"C:\HtmlDocs\test.html"); // load a file
html.LoadHtml(new WebClient().DownloadString("http://www.somedomain.com")); // load a string
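
The other two inputs mentioned above, a stream and a TextReader, go through the same Load method. The following is a minimal sketch rather than anything taken from the HAP documentation: the LoadSketch class, its method names and the inline markup are made up for illustration, and it assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class LoadSketch
{
    // Hypothetical helper: loads markup through a TextReader
    // and returns the text content of the document.
    public static string TextFromReader(string markup)
    {
        var html = new HtmlDocument();
        using (var reader = new StringReader(markup))
        {
            html.Load(reader);
        }
        return html.DocumentNode.InnerText;
    }

    // Hypothetical helper: loads markup through a Stream. The MemoryStream
    // stands in for a file stream or a response stream.
    public static string TextFromStream(string markup)
    {
        var html = new HtmlDocument();
        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(markup)))
        {
            html.Load(stream, Encoding.UTF8);
        }
        return html.DocumentNode.InnerText;
    }

    public static void Main()
    {
        const string markup = "<p>Hello from HAP</p>";
        Console.WriteLine(TextFromReader(markup));
        Console.WriteLine(TextFromStream(markup));
    }
}
```

Passing the encoding explicitly to the stream overload avoids relying on whatever default the library would otherwise pick.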

Querying the DOM

Once you have loaded the HTML to be parsed, you can access it via the DocumentNode property of the HtmlDocument which returns the root element. From there, you can use LINQ (or XPath) to query the document, or more specifically, the collection of HtmlNode objects returned by the Descendants() method:

var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.asp.net")); 
var root = html.DocumentNode;
var nodes = root.Descendants();
var totalNodes = nodes.Count();

The code above returns the total number of HtmlNode objects found in the document. That total includes text and comment nodes as well as HTML elements. You can filter them in a number of ways. For example, you can pass a tag name to the Descendants method to filter by that tag. The following snippet queries the document for anchor tags and unordered lists:

var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.asp.net")); 
var root = html.DocumentNode;
var anchors = root.Descendants("a");
var unorderedLists = root.Descendants("ul");
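
Because the parameterless Descendants() call walks every node, not just elements, a count taken from it is usually larger than the number of tags in the page. The sketch below shows one way to restrict the count to elements using the NodeType property. The NodeTypeSketch class, its method names and the inline markup are assumptions made for illustration only.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NodeTypeSketch
{
    // Counts every descendant node: elements, text nodes and comments.
    public static int CountAllNodes(string markup)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);
        return html.DocumentNode.Descendants().Count();
    }

    // Counts element nodes only.
    public static int CountElements(string markup)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);
        return html.DocumentNode
            .Descendants()
            .Count(n => n.NodeType == HtmlNodeType.Element);
    }

    public static void Main()
    {
        const string markup = "<ul><li><a href=\"/one\">One</a></li><!-- note --></ul>";
        Console.WriteLine(CountAllNodes(markup));  // elements plus the text and comment nodes
        Console.WriteLine(CountElements(markup));  // the ul, li and a elements only
    }
}
```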

You can further refine your search by filtering on the value of a particular attribute. This example searches for all elements whose class attribute is exactly "common-post":

var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://www.asp.net")); 
var root = html.DocumentNode;
var commonPosts = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("common-post"));
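
One thing to watch with the Equals comparison is that it only matches when the class attribute holds that single value. An element such as the h2 with class="common-header-underline transform-none" in the markup later in this article carries two classes, so an exact comparison against either name would miss it. The following sketch splits the attribute into tokens instead. HasCssClass and CountByClass are hypothetical helpers written for this example, not members of HAP, and the inline markup is invented.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ClassFilterSketch
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Hypothetical helper: true when the node's class attribute
    // contains className as one of its whitespace-separated tokens.
    public static bool HasCssClass(HtmlNode node, string className)
    {
        return node.GetAttributeValue("class", "")
            .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className);
    }

    // Hypothetical helper: counts the nodes in the markup that carry the class.
    public static int CountByClass(string markup, string className)
    {
        var html = new HtmlDocument();
        html.LoadHtml(markup);
        return html.DocumentNode.Descendants().Count(n => HasCssClass(n, className));
    }

    public static void Main()
    {
        const string markup =
            "<div class=\"common-post\">one</div>" +
            "<div class=\"common-post featured\">two</div>" +
            "<div class=\"other\">three</div>";
        // Both of the first two divs carry the common-post class.
        Console.WriteLine(CountByClass(markup, "common-post"));
    }
}
```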

Locating a specific piece of content

One of the uses of the HAP is for locating specific pieces of content in an HTML document. The following example will demonstrate how to obtain the number of points I have been awarded as displayed on my profile page at the www.asp.net site:

The first step is to examine the relevant HTML. I have only included a small section containing the content I am after, which is the p element that reports the number of points:

<div class="module-common">
    <h2 class="common-header-underline transform-none">
        Community Recognition
        <span class="recognition-new-rules"><a href="/t/2024428.aspx">New Rules</a></span>
    </h2>
    <div class="module-profile-recognition">
        <h3>Mikesdotnetting</h3>
        <div class="post-rating All-Star"></div>
        <div class="clear"></div>
        <p>Has 164330 points and achieved the <strong>All-Star</strong> level</p>
        <a href="http://www.asp.net/community/recognition/hall-of-fame">Hall of Fame</a><span class="separator">&#124;</span><a href="http://www.asp.net/community/recognition">About</a><span class="separator">&#124;</span><a href="javascript:;" data-uitype="reputation-history" data-username="Mikesdotnetting">Details</a>
        <table>
            <thead>
                <tr><th>Location</th><th style="width:60%;">Activity</th><th style="width:10%;text-align:right">Points</th></tr>
            </thead>
            <tbody id="reputation-activities-container">
                <tr>
                    <td colspan="3" style="width:100%;height:65px;" class="busy"></td>
                </tr>
            </tbody>
        </table>
    </div>
</div>

The content I want to target is located in a p element with no distinguishing features, such as an id or a class attribute. There are a number of other p elements within the document, so targeting them all won't be helpful. The best strategy is to target a single, easily identifiable element, and then to navigate from there. There are a couple of fairly obvious candidates: a div with a class of "post-rating" and another with a class of "module-profile-recognition". If I were creating a tool to regularly parse the same live page, I would generally avoid targeting elements by class because, even though there may only be one on the page today (as is the case for both potential targets in this instance), more could be added in future. Therefore, any assumption about the number of elements with a given class is brittle. Id attributes, on the other hand, should be unique.

Having provided that warning, here's the code that starts with the element with a class of "module-profile-recognition":

var html = new HtmlDocument();
html.LoadHtml(new WebClient().DownloadString("http://forums.asp.net/members/Mikesdotnetting.aspx")); 
var root = html.DocumentNode;
var p = root.Descendants()
    .Where(n => n.GetAttributeValue("class", "").Equals("module-profile-recognition"))
    .Single()
    .Descendants("p")
    .Single();
var content = p.InnerText;

The Descendants method returns a collection. Since there is only one element matching the "module-profile-recognition" class selector, it is safe to use the Single method to return it. Bear in mind that Single throws an InvalidOperationException if the sequence contains no matching element, or more than one. Then you can use the Descendants method to return all the elements nested at any depth within the div that match the p selector. Again, there is only one, so it is safe to use the Single method to return the only paragraph. Finally, the text content is obtained via the InnerText property. An alternative is the InnerHtml property, which returns the markup inside the element as well as the text. Once you have the text content, you can use a regular expression to extract just the number:

var points = Regex.Match(content, @"\d+").Value;
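
The Value property is a string, and it is empty if the text contains no digits at all. If you need the points as a number, something along the following lines would do the conversion. This is a sketch under stated assumptions: PointsSketch and ExtractPoints are names invented for this example, and the sample text is copied from the paragraph in the markup above.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class PointsSketch
{
    // Hypothetical helper: returns the first run of digits in the text
    // as an int, or null when there are no digits or the value does not fit.
    public static int? ExtractPoints(string content)
    {
        var match = Regex.Match(content ?? "", @"\d+");
        if (!match.Success)
        {
            return null;
        }

        int points;
        return int.TryParse(match.Value, NumberStyles.None, CultureInfo.InvariantCulture, out points)
            ? points
            : (int?)null;
    }

    public static void Main()
    {
        var points = ExtractPoints("Has 164330 points and achieved the All-Star level");
        Console.WriteLine(points.HasValue ? points.Value.ToString() : "no points found");
    }
}
```

Note that the pattern stops at the first non-digit, so if the page ever formatted the number with a thousands separator, only the leading group of digits would be captured.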

Summary

This is a brief introduction to the HtmlAgilityPack, which is my recommended tool for parsing HTML. It provides a familiar LINQ to Objects API which makes working with the library pretty easy. If you need to parse or manipulate HTML, this is the only tool you need. Full documentation is available from the project's CodePlex site. Since it's a chm file, you will need to unblock it before you can use it. You do this by right-clicking on the file and going to its properties, then clicking the Unblock button.


8 Comments

- Andi

Did you manage to use the HtmlAgility pack for an ASP.NET 5 application?

- José

Why does this article have the tag HTML5? What does that mean? That the HTML Agility Pack is able to parse HTML5 documents correctly? Does it follow a parsing specification (W3C or WHATWG)? Is it standards compliant?

Note: the simple sum was too hard to solve.

- Andrea

For ajax page ?

- Mike

@Andrea,

Huh?

- Bruce

Hi Mike
Thanks for your helpful blog. Nicely explained! But one thing I do not understand is, what if I want to get more than one paragraph? That wouldn't work with .Single ...
I would be really happy if you could help me!

- Mike

@Bruce,

If you want to get all paragraphs, use Where instead of FirstOrDefault or Single.

- John

Your lines that create a new HtmlDocument do not work in VS 2015 C#. Get better results with WebClient - and simpler to use.

- Frank

When you go to the codeplex site to view the documentation you get this response:
This project does not have documentation yet. Visit the Discussions tab to ask questions.
