Parse HTML using the HTML Agility Pack




If you haven’t yet heard of or explored the HTML Agility Pack, it is well worth a look. I have been using this library for quite some time to extract links and tags, and it works very well. As described on the project site:

HTML Agility Pack (HAP) is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you actually don't HAVE to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

Sample applications of the HAP are:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.
  • Web scanners. You can easily get to img/src or a/href values with a handful of XPath queries.
  • Web scrapers. You can easily scrape any existing web page into an RSS feed, for example, with just an XSLT file serving as the binding. An example of this is provided.
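
To make the "web scanner" case concrete, here is a minimal sketch of pulling a/href and img/src values out of a deliberately sloppy HTML string. It assumes the HtmlAgilityPack NuGet package is referenced in the project; the sample markup and the LinkScanner class name are made up for illustration and are not taken from the HAP documentation.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical example class; only the HtmlAgilityPack types are real.
class LinkScanner
{
    static void Main()
    {
        // Deliberately malformed sample markup: unclosed <p>, unquoted and
        // single-quoted attribute values.
        string html =
            "<html><body>" +
            "<p>First paragraph <a href='a.html'>A</a>" +
            "<p>Second paragraph <a href=b.html>B</a>" +
            "<img src=\"logo.png\">" +
            "</body></html>";

        // Build the DOM from the string.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: every <a> element that carries an href attribute.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                // The second argument is the default used if the attribute is missing.
                Console.WriteLine("link:  " + link.GetAttributeValue("href", string.Empty));
            }
        }

        // XPath: every <img> element that carries a src attribute.
        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images != null)
        {
            foreach (HtmlNode image in images)
            {
                Console.WriteLine("image: " + image.GetAttributeValue("src", string.Empty));
            }
        }
    }
}
```

The null checks matter: SelectNodes hands back null when the query matches nothing, so iterating over the result directly can throw on pages without links or images.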

Download it here







About The Author

Suprotim Agarwal, ASP.NET Architecture MVP, works as an Architect Consultant and provides consultancy on how to design and develop Web applications.

Suprotim is also the founder of and primary contributor to DevCurry, DotNetCurry and SQLServerCurry. He has also written an e-book, 51 Recipes using jQuery with ASP.NET Controls.

Follow him on twitter @suprotimagarwal

comments

6 Responses to "Parse HTML using the HTML Agility Pack"
  1. anirudha gupta said...
    December 26, 2009 at 11:46 AM

    HAP is not good; I always use Regex where I would need HAP.

  2. Suprotim Agarwal said...
    December 26, 2009 at 6:51 PM

    Regex is powerful, no doubt about that, but Regex is not a solution for parsing HTML.

  3. Alex said...
    December 27, 2009 at 1:37 AM

    You should take a look at the Data Extracting SDK (http://extracting.codeplex.com/)

  4. anirudha gupta said...
    December 27, 2009 at 6:55 AM
    This comment has been removed by the author.
  5. Josh said...
    February 7, 2011 at 4:46 AM

    I tried using HAP in one of my projects, but performance slowed down dramatically, whereas with regex it was a piece of cake...

    For the curious: I am trying to crawl approx. 1m domains to extract some selected information from links found on the homepage and on other pages linked from the homepage.

  6. Suprotim Agarwal said...
    February 7, 2011 at 7:04 PM

    Can you share the Regex that you used?

 

Copyright © 2009-2014 All Rights Reserved for DevCurry.com by Suprotim Agarwal | Terms and Conditions