Extract data from HTML page with XPath and Linq

As developer, we usually need to extract data from a html web page in our projects, but most of the time we try to find another solution due to the complexity of the task. Indeed, the classical approach would be to use regex or similar and try to find out our informations in the page structure. But there’s something really helpfull, the HTMLAgilityPack. I’ll give you 2 example, the first one for a basic overview of the HTMLAgilityPack and a second one with much more details about the process I follow for that kind of tasks.

So, HTMLAgilityPack allow us to download a webpage and then consult the content with his methods or with the provided XPath API. To install it, just download the NuGet package from VS2010 >= or grab the project from the codeplex page.

Once done, you can use it to parse and extract data from html page.

using HtmlAgilityPack;

Then you can use it to download the page :

string url = "http://www.simpleweb.org/";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);

Or you can use LoadHtml(string htmlCode) on the HtmlDocument object to work on raw html code contained in memory.

Once done, you can write XPath query to retreive data. For this example, I want to retreive the url of the wiki (second link in the left menu).
According to the google chrome debugger, the quicker and optimzed XPath to this property is : (thank you chrome :p)


For those not familiar with XPath, it’s really simple. This query just mean, from the root (//) all tags (*) with id attribute set to “top option” ([@id=”top_option”]) then select the second div (div[2]) then select span and within the span select link (/span/a). That give us the following Linq query :

var result = from link in htmlDoc.DocumentNode.SelectNodes("//*[@id='top_option']/div[2]/span/a")
                         select link.Attributes["href"].Value;

The intellisense will be once again your best friends as this assembly is really well designed and intuitive. So here I select all the nodes who match the XPath query and then select the href attribute value.

Let’s take another example, I want to get an object tree which represent the whole menu. Just before that, I would like to mention something about XPath. This is a great query language… if you know it ! I mean, you can write really powerful but complex query using it, and I always think about the next one who’ll need to understand and/or update the query. So I usually prefer if I have to share it or in a collaborative context to write multiple steps/subqueries to make it more “developer friendly”. Even some simple query can seems really complex for someone not familiar with it. That said, let’s write the query to rebuild the full menu tree of the same page. This time I’ll also share the steps I follow for that kind of task.

First, analyze the page structure. We can see that the menu is contained within a td (/html/body/table/tbody/tr[3]/td/table/tbody/tr/td[1]). This td contain 7 div, but only the 3 first one interest us. To know that, I use the Chrome debugger, when you place your move hover a tag, Chrome will highlight in the page the corresponding area.

<div style="float:left;" id="top_option" class="aTopLink">
		<span><a class="aTopLink" href="/">Home</a></span>
		<span><a class="aTopLink" href="/wiki/">Wiki</a></span>
<div style="float:left;" id="my_menu" class="sdmenu">
    <div class="collapsed">
		<a href="/ietf/mibs/">Browser</a>
                <a href="/ietf/mibs/validate/">MIB Validation</a>
                <a href="/ietf/mibs/byEncoding?encoding=txt">Plain text</a>
                <a href="/ietf/mibs/byEncoding?encoding=highlighting">Syntax Highlighting</a>
                <a href="/ietf/mibs/byEncoding?encoding=html">HTML encoding</a>
		<a href="/ietf/enterprise/">Vendor MIBs</a>
		<a href="/ietf/mibs/search/">Search</a>
	    <div class="collapsed">
		<a href="/ietf/rfcs/rfcbynumber.php" class="current">By Number</a>
		<a href="/ietf/rfcs/rfcbytopic.php">By Topic</a>
		<a href="/ietf/rfcs/rfcbystatus.php">By Status</a>
		<a href="/ietf/rfcs/rfcbymodule.php">By Module</a>
                <a href="/ietf/rfcs/complete/">Complete</a>
		<a href="/ietf/rfcs/">Search</a>
            <div class="collapsed">
                <a href="/software/select_obj.php?orderBy=package&amp;oldorderBy=package&amp;orderDir=asc&amp;free=free">Freely available</a>
                <a href="/software/select_obj.php?orderBy=package&amp;oldorderBy=package&amp;orderDir=asc&amp;com=com">Commercial</a>
                <a href="/software/">Search</a>
            <div class="collapsed">
                <a href="/wiki/Events">Conferences</a>
                <a href="/wiki/Cfp">Call for Papers</a>
            <div class="collapsed">
                <a href="/wiki/Books">Books</a>
                <a href="/wiki/Journals">Journals</a>
                <a href="/wiki/Theses">Theses</a>
            <div class="collapsed">
                <a href="/wiki/Internet_Management_Tutorials">Slides</a>
                <a href="/wiki/Video_tutorials_on_Internet_management">Podcasts</a>
                <a href="/wiki/Exercises_in_Internet_Management">Exercises</a>
                <a href="/wiki/Other_tutorials">Other tutorials</a>
		<a href="/tutorials/demo/snmp/v1/">SNMPv1 demo</a>
		<a href="/tutorials/demo/snmp/v2c/">SNMPv2c demo</a>
		<a href="/tutorials/demo/snmp/v3/">SNMPv3 demo</a>
<div style="float:left;" id="bottom_option" class="aTopLink">
        <span><a class="aTopLink" href="/wiki/Podcasts">Podcasts</a></span>
        <span><a class="aTopLink" href="http://www.simpleweb.org/wiki/Traces">Traffic Traces</a></span>
        <span><a class="aTopLink" href="/ifip/">IFIP WG6.6</a></span>
        <span><a class="aTopLink" href="mailto:simpleweb@simpleweb.org">Contact</a></span>

To represent it we will have the following object model in C# :

public class Node
    public List<Node> Childs { get; set; }
    public string Name { get; set; }
    public string Url { get; set; }

And the query will be the following :

var result = (from topNode in htmlDoc.DocumentNode.SelectNodes("//div[@id='top_option']/div/span/a")
                select new Node()
                                Name = topNode.InnerText,
                                Url = topNode.Attributes["href"].Value
                from bottomNode in htmlDoc.DocumentNode.SelectNodes("//div[@id='bottom_option']/div/span/a")
                select new Node()
                            Name = bottomNode.InnerText,
                            Url = bottomNode.Attributes["href"].Value
            from groupNode in htmlDoc.DocumentNode.SelectNodes("//*[@id='my_menu']/div[*]")
            let groupChilds = (from nodes in groupNode.SelectNodes("a")
                                select new Node()
                                                Name = nodes.InnerText,
                                                Url = nodes.Attributes["href"].Value
            select new Node()
                            Name = groupNode.SelectSingleNode("span").InnerText,
                            Childs = groupChilds.ToList()

We use 3 query with union. As said, I tried to decompose a bit the query, the two first are the same, select the links and create nodes with the innerText of the a tag and the href attribute. Then we have the centre of the menu with hierarchy. For this, we select each div representing a group, then we create a first list of node like for the two first request based on the link contained within the div, and we create the node object with the list we’ve just created as child’s object list and the span content as Name.

As you can see, we can easily write complex requests in no time (this one took me about 30minutes to write, test and include in this article). HTMLAgilityPack is a great tool that allow us to use XPath and finally address a great answer to the critical html data parsing issue !

See you next time and keep it bug free ! 🙂

One thought on “Extract data from HTML page with XPath and Linq

  1. Pingback: URL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s