HTML found on the web is often invalid and difficult to parse. Several HTML cleaning utilities convert this invalid HTML into well-formed XML, which is much easier to work with. Two of these are TagSoup and HtmlCleaner.
TagSoup has a much nicer syntax when used with Groovy, but I decided to try HtmlCleaner because it is reported to produce better results and is also used in the open source web-scraping project Web-Harvest.
For this example, let’s parse the Groovy website’s home page.
import org.htmlcleaner.*

def address = 'http://groovy.codehaus.org/'

// Clean any messy HTML
def cleaner = new HtmlCleaner()
def node = cleaner.clean(address.toURL())

// Convert from HTML to XML
def props = cleaner.getProperties()
def serializer = new SimpleXmlSerializer(props)
def xml = serializer.getXmlAsString(node)

// Parse the XML into a document we can work with
def page = new XmlSlurper(false, false).parseText(xml)
Creating the XmlSlurper with new XmlSlurper(false, false) disables validation and namespace awareness. This prevents the error:
The prefix "xml" cannot be bound to any namespace other than its usual namespace; neither can the namespace for "xml" be bound to any prefix other than "xml".
Once the HTML has been parsed, you can use Groovy's GPath syntax to extract information from the document:
def logo = page.body.div.div.div.table.tbody.tr.td.p.a.span.img.@src
This navigates its way down to the logo image on the page and extracts the logo's URL.
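A hard-coded path like the one above breaks as soon as the page layout changes. GPath also supports a depth-first search via '**', which lets you find the element without spelling out every ancestor. A minimal sketch, using a small literal document rather than the live Groovy site (the markup and src value here are assumptions for illustration):

// Parse a small, already-clean fragment (no HtmlCleaner needed for this demo)
def html = '<html><body><div><img src="/images/logo.png"/></div></body></html>'
def page = new XmlSlurper(false, false).parseText(html)

// '**' walks the whole tree depth-first, so no hard-coded path is required
def logo = page.'**'.find { it.name() == 'img' }?.@src

The same find could match on the @src attribute's contents (for example, anything containing 'logo') if the page has several images.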