HTML Data Extraction

I’ve been looking for a decent tool for HTML data mining (aka web-based data mining, aka screen scraping), with no real success.
I wanted to extract data from some 350+ HTML files and load it into a DB.

It sounded like something a SourceForge project would already do… but after a couple of days of searching, it looks like a solution based on an article by IBM folks (Web Based Data Mining) is the closest I can get.

The code is a bit dated (2001), but the main approach still holds:

  • Tidy up the HTML (JTidy).
  • Parse the HTML/XHTML content to get a DOM.
  • Parse the XSL stylesheet containing the templates (which use XPath to select the data).
  • Apply the transformation (using a javax.xml.transform Transformer).
  • Write the output to a file (XML?).
  • Upload the data from the file to the DB.
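
To make the middle steps concrete, here is a minimal sketch of the parse-and-transform part using only the JDK's own XML classes. It is not the code from the IBM article. The class name ExtractSketch, the sample table and the sample stylesheet are all made up for illustration, and the input is assumed to be already well-formed; in a real run the DOM would come from JTidy instead of DocumentBuilder, and the result would go to a file rather than a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExtractSketch {

    // Hypothetical input: a page that has already been tidied into well-formed markup.
    public static final String SAMPLE_XHTML =
        "<html><body><table>"
      + "<tr><td>Widget</td><td>9.99</td></tr>"
      + "<tr><td>Gadget</td><td>19.50</td></tr>"
      + "</table></body></html>";

    // Hypothetical stylesheet: XPath picks the table cells, the template emits plain XML.
    public static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" indent=\"no\"/>"
      + "<xsl:template match=\"/\">"
      + "<items><xsl:for-each select=\"//table/tr\">"
      + "<item><name><xsl:value-of select=\"td[1]\"/></name>"
      + "<price><xsl:value-of select=\"td[2]\"/></price></item>"
      + "</xsl:for-each></items>"
      + "</xsl:template></xsl:stylesheet>";

    public static String transform(String xhtml, String xsl) throws Exception {
        // Parse the (already tidy) markup into a DOM.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));

        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Apply the transformation and collect the output as a string.
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform(SAMPLE_XHTML, SAMPLE_XSL));
    }
}
```

The sample markup carries no XHTML namespace, so the unprefixed XPath expressions match. If the tidied pages declare the XHTML namespace, the stylesheet would need a prefix bound to it.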

I’m still trying to get it to work, but I’m making good progress.

I also tried using Butterfly XML Editor, but couldn’t get it to apply XSL transformations.