The HTML Agility Pack is an excellent .NET library for parsing HTML. You can read about it and download it from CodePlex at http://htmlagilitypack.codeplex.com/. If you need to parse HTML pages, this is the library that you want to add to your app. Its biggest drawback is decidedly its giant lack of documentation. So that others don't have to bruise their foreheads on their keyboards like I did all day, here are my lessons learned.

Using

Just to get this out of the way, don't forget to start at the beginning: add using HtmlAgilityPack; to your .cs file.

Get Page

Download the page and load it into an HtmlDocument:

HtmlDocument page = new HtmlWeb().Load("http://www.codefall.io");

Time to Parse!

This is where the lack of documentation starts to really become painful.

Get a Single Element

var title = page.DocumentNode.SelectSingleNode("//h1[@id='site-title']");

// 'site-title' will match against <h1 id="site-title"> and <h1 id='site-title'>;
// that is, the single and double quotes in the page's HTML are interchangeable.

title is now an HtmlNode object that has a few useful built-in properties.

string titleText = title.InnerText;
// Returns the text only (e.g. Title String by Some Dude)

string titleHtml = title.InnerHtml;
// Returns the text and HTML inside the node (e.g. Title String by <strong>Some Dude</strong>)

string titleFullHtml = title.OuterHtml;
// Returns the text and HTML of the entire node, including its own tags
// (e.g. <h1 id="site-title">Title String by <strong>Some Dude</strong></h1>)

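You can also replace the HTML that was captured. These are properties rather than methods, so you assign to them. InnerHtml is the settable one (InnerText and OuterHtml are read-only):

title.InnerHtml = "New Title";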

Get an Array of Elements

This works nearly identically to the above, except that it returns a collection (an HtmlNodeCollection) of every matching element.

var titles = page.DocumentNode.SelectNodes("//h2[@class='entry-title']");
// Returns all titles on the home page of this site as a collection.

To process the list just iterate through it.

List<string> pageTitles = new List<string>();
foreach (var title in titles)
{
    pageTitles.Add(title.InnerText);
}

Now you have a list of strings of all of the titles on the home page.
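
The same loop can also be written with LINQ, since the collection returned by SelectNodes can be enumerated like any other list of nodes. This is only a minimal sketch: the GetEntryTitles helper and the sample HTML are made up for illustration, and the page is loaded from a string with LoadHtml so that it runs without a network connection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TitleListExample
{
    // Hypothetical helper: collects the text of every <h2 class="entry-title">.
    public static List<string> GetEntryTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = doc.DocumentNode.SelectNodes("//h2[@class='entry-title']");
        if (titles == null)
        {
            // SelectNodes returns null (not an empty collection) when nothing matches.
            return new List<string>();
        }

        // Project each node to its trimmed text.
        return titles.Select(t => t.InnerText.Trim()).ToList();
    }

    static void Main()
    {
        // Made-up sample markup standing in for a real page.
        string html =
            "<h2 class='entry-title'>First Post</h2>" +
            "<h2 class='entry-title'> Second Post </h2>" +
            "<h2 class='other'>Not a title</h2>";

        foreach (string title in GetEntryTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

With the sample markup, only the two entry-title headings end up in the list; the third heading has a different class and is skipped.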

Digging Deep

The biggest thing missing from every example I found was how to dig down through a document.

Take this example, for instance.

var firstDiv = page.DocumentNode.SelectSingleNode("//div");
// Returns the first div on the page, as it should. But digging into it with "//" does not work the way you might expect.

var firstDivTitle = firstDiv.SelectSingleNode("//h2");
// A leading "//" searches from the document root, so this returns the first <h2> on the entire page, not the one in the div.

Thankfully, while the above was really confusing to me, the solution is incredibly simple: put the whole path in one query.

var firstDiv = page.DocumentNode.SelectSingleNode("//div");
// Returns the page's first div.

var firstDivTitle = page.DocumentNode.SelectSingleNode("//div//h2");
// Returns the first <h2> on the page that sits inside a div.
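
Another way to stay inside an element you have already pulled is a relative XPath: a path that starts with a dot (.//h2) searches only the descendants of the node it is called on, while a path that starts with // goes back to the document root. Here is a minimal sketch of that, assuming made-up sample HTML and a hypothetical FirstDivHeading helper:

```csharp
using System;
using HtmlAgilityPack;

static class RelativeXPathExample
{
    // Hypothetical helper: returns the text of the first <h2> inside the
    // first <div>, or null when either one is missing.
    public static string FirstDivHeading(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var firstDiv = doc.DocumentNode.SelectSingleNode("//div");
        if (firstDiv == null)
        {
            return null;
        }

        // ".//h2" matches <h2> descendants of firstDiv only;
        // "//h2" would restart the search at the document root.
        var heading = firstDiv.SelectSingleNode(".//h2");
        return heading?.InnerText;
    }

    static void Main()
    {
        // Made-up sample markup: one heading outside any div, one in each div.
        string html =
            "<h2>Page heading</h2>" +
            "<div><p>intro</p><h2>Heading in first div</h2></div>" +
            "<div><h2>Heading in second div</h2></div>";

        Console.WriteLine(FirstDivHeading(html));
    }
}
```

With this sample, the helper prints the heading from the first div rather than the page-level heading that comes before it. It also differs from //div//h2 when the first div has no <h2> at all: the relative query returns null instead of jumping ahead to a later div.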

You can also use this with ids and classes.

var pageLinks = page.DocumentNode.SelectNodes("//div[@id='main']//h2[@class='entry-title']//a");

foreach (var link in pageLinks)
{
    Console.WriteLine(link.Attributes["href"].Value);
    // This shows how to grab the value of an attribute.
    // It can be any HTML attribute, such as alt, href, or src.
}
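
One caveat with the indexer: link.Attributes["href"] is null when the element has no such attribute, so reading .Value from it throws. The sketch below uses GetAttributeValue with a default value instead. The GetHrefs helper and the sample HTML are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkHrefExample
{
    // Hypothetical helper: returns the href of every <a> that has one.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();
        var links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            return hrefs; // no <a> elements at all
        }

        foreach (var link in links)
        {
            // Falls back to "" instead of throwing when href is missing.
            string href = link.GetAttributeValue("href", "");
            if (href != "")
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }

    static void Main()
    {
        // Made-up sample markup: the middle anchor has no href.
        string html =
            "<a href='/one'>One</a>" +
            "<a name='anchor'>No href</a>" +
            "<a href='/two'>Two</a>";

        foreach (string href in GetHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

An alternative is to filter in the XPath itself with //a[@href], which only matches anchors that carry the attribute.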

Notes:

  • The biggest thing is the drill down: a query that starts with // always searches the whole document, even when you call it on an already pulled element. Either write the full path in one query (//div//h2) or start the path with a dot (.//h2) to stay inside the element.
  • You can grab attribute values from the HTML.
  • Regex free! By the end I didn't need any regex for the parsing; I was able to grab everything using just the built-in tools.
  • Print it out! It took me far longer than it should have because I couldn't print to the console. I finally put the result in a variable passed to the view so that I could see what it was grabbing. That's how I figured out that my query on an already pulled element was still matching against the whole page.
  • If nothing is matched, null is returned, which means that reading a property such as InnerText from the result will throw a NullReferenceException at runtime.
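
For the last note, one way to guard against a missing match is the null-conditional operator together with a fallback value. This is just a sketch: the GetTitleOrDefault helper, the fallback text, and the sample HTML are assumptions for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class NullSafeExample
{
    // Hypothetical helper: returns the site title text, or the fallback
    // when the page has no <h1 id="site-title">.
    public static string GetTitleOrDefault(string html, string fallback)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var title = doc.DocumentNode.SelectSingleNode("//h1[@id='site-title']");

        // title is null when nothing matched; ?. skips the property read
        // and ?? supplies the fallback.
        return title?.InnerText ?? fallback;
    }

    static void Main()
    {
        // Made-up sample markup: one page with the title, one without.
        string withTitle = "<h1 id='site-title'>Title String</h1>";
        string withoutTitle = "<p>No heading here</p>";

        Console.WriteLine(GetTitleOrDefault(withTitle, "(no title)"));
        Console.WriteLine(GetTitleOrDefault(withoutTitle, "(no title)"));
    }
}
```

The first call prints the heading text and the second prints the fallback, instead of crashing on the page that has no matching element.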

About Author

Siva Katir

Senior Software Engineer working at PlayFab in Seattle, Washington.