Parsing HTML with Groovy and HtmlCleaner

HTML found on the web is often invalid and difficult to parse. Several HTML cleaning utilities convert invalid HTML into well-formed XML, which is easier to work with. Two of these are TagSoup and HtmlCleaner.

TagSoup has a much nicer syntax when used with Groovy, but I decided to try HtmlCleaner because it reportedly produces better results and is also used by WebHarvest, an open source web scraping project.

For this example, let’s parse the Groovy website’s home page.

import org.htmlcleaner.*
 
def address = 'http://groovy.codehaus.org/'
 
// Clean any messy HTML
def cleaner = new HtmlCleaner()
def node = cleaner.clean(address.toURL())
 
// Convert from HTML to XML
def props = cleaner.getProperties()
def serializer = new SimpleXmlSerializer(props)
def xml = serializer.getXmlAsString(node)
 
// Parse the XML into a document we can work with
def page = new XmlSlurper(false,false).parseText(xml)
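
To try the same three steps without a network connection, they can be run against an inline string instead of a URL. This is only a sketch under stated assumptions: the `@Grab` coordinates and version are one way to fetch HtmlCleaner and may need adjusting, the sample markup is made up, and `getAsString` is assumed to be the equivalent of `getXmlAsString` in newer HtmlCleaner releases.

```groovy
@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.29')  // assumed coordinates and version
import org.htmlcleaner.HtmlCleaner
import org.htmlcleaner.SimpleXmlSerializer
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this line

// Made-up markup with an unquoted attribute and unclosed tags
def html = '<html><body><p class=intro>Hello <b>world</body></html>'

// Clean the messy HTML held in the string
def cleaner = new HtmlCleaner()
def node = cleaner.clean(html)

// Serialize the cleaned tree to XML, then parse it as before
def xml = new SimpleXmlSerializer(cleaner.properties).getAsString(node)
def page = new XmlSlurper(false, false).parseText(xml)

println xml          // the repaired markup
println page.name()  // name of the root element
```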

Creating the XmlSlurper with XmlSlurper(false, false) disables validation and namespace awareness. This prevents the error:

The prefix "xml" cannot be bound to any namespace other than its usual namespace; neither can the namespace for "xml" be bound to any prefix other than "xml".
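
To make the second constructor flag more concrete, here is a small sketch of what namespace awareness changes. It does not reproduce the error above; it only shows how element names differ between the two modes. The `root` and `x:item` elements and the `urn:example` namespace are made up for illustration.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this line

// Hypothetical snippet with one prefixed element
def xml = '<root xmlns:x="urn:example"><x:item>1</x:item></root>'

// Constructor arguments are (validating, namespaceAware)
def plain = new XmlSlurper(false, false).parseText(xml)
def aware = new XmlSlurper(false, true).parseText(xml)

// Without namespace awareness the element keeps its raw name, prefix included
def plainName = plain.children()[0].name()

// With namespace awareness the prefix is resolved and only the local name remains
def awareName = aware.children()[0].name()

println "namespace-unaware: ${plainName}"
println "namespace-aware:   ${awareName}"
```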

Once the HTML has been parsed, you can use Groovy’s GPath syntax to extract information from it:

def logo = page.body.div[1].div.div.table.tbody.tr.td.p.a.span.img.@src

This navigates its way down to the logo image on the page and extracts the logo URL.
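
A positional path like the one above breaks as soon as the page layout changes. As a possible alternative, the sketch below searches the whole tree for the image instead of walking to it step by step. The inline markup and the `logo` id are made up for illustration and are not taken from the real Groovy home page.

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this line

// Hypothetical, already-cleaned markup standing in for the real page
def xml = '''<html><body>
  <div><p>Intro</p></div>
  <div><p><a href="/"><span><img id="logo" src="/images/logo.png"/></span></a></p></div>
</body></html>'''

def page = new XmlSlurper(false, false).parseText(xml)

// Positional navigation, in the style of the post: explicit but fragile
def byPath = page.body.div[1].p.a.span.img.@src

// Depth-first search: the first img element whose id attribute is "logo"
def img = page.'**'.find { it.name() == 'img' && it.@id == 'logo' }
def byId = img.@src

println byPath
println byId
```

The depth-first version keeps working if extra wrapper elements appear around the image, at the cost of visiting every node in the document.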


2 Comments

  1. Richard
    Posted May 12, 2010 at 11:11 pm

    Try using Crouton, it’s simple, light and efficient:

    <cfset Crouton = createObject("component", "Crouton").init()>
    <cfset strXHTML = Crouton.Parse(urlToParse)>

    http://sourceforge.net/projects/cfsynergy/files/Crouton.cfc/Crouton.zip/

    or you can try jTidy

    http://jtidy.riaforge.org/

    Neither needs the extra overhead that comes with Groovy. Also, HTMLCleaner might need CF8+ because of its JRE 1.4+ dependencies.

  2. hyunjungsoh
    Posted June 16, 2011 at 11:03 am

    Wow! This was helpful! :)
    thanks