I've been looking for a decent tool for HTML data mining (aka web-based data mining, aka screen scraping), with no real success.
I wanted to extract data from some 350+ HTML files and upload them into a DB.
It sounded like something a SourceForge application would already do, but after a couple of days of searching, a solution based on an article by IBM folks (Web Based Data Mining) looks like the closest I can get.
The code is a bit outdated (2001), but the main idea remains the same:
– Tidy up the HTML (JTidy).
– Parse the HTML/XHTML Content to get a DOM.
– Parse the XSL stylesheet containing the extraction templates (with XPath).
– Apply the transformation (using a javax.xml.transform Transformer).
– Write the output to a file (XML?).
– Upload the data from the file to the DB.
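
To make the middle steps concrete, here is a minimal sketch that uses only the JDK's javax.xml classes. It is not the code from the IBM article. The class name ScrapeSketch, the stylesheet, and the sample table are all made up for illustration. It also assumes the HTML has already been tidied into well-formed XHTML, so the JTidy step and the DB upload are left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ScrapeSketch {
    // Hypothetical stylesheet: turns every table row into a <record> element.
    // The XPath expressions assume the input elements are in no namespace.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<records><xsl:for-each select='//table/tr'>"
      + "<record name='{td[1]}' price='{td[2]}'/>"
      + "</xsl:for-each></records>"
      + "</xsl:template></xsl:stylesheet>";

    // Takes already-tidied XHTML and returns the extracted data as an XML string.
    static String extract(String xhtml) throws Exception {
        // Parse the XHTML content to get a DOM.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

        // Parse the XSL stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));

        // Apply the transformation; the result is collected in memory here.
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page standing in for one of the tidied HTML files.
        String xhtml = "<html><body><table>"
            + "<tr><td>Widget</td><td>9.99</td></tr>"
            + "</table></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

To write to a file instead of a string, a StreamResult can wrap a java.io.File. One thing to watch for: if the tidied input declares the XHTML namespace on its html element, unprefixed XPath steps like //table/tr will not match, and the stylesheet needs a prefix bound to that namespace.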
I'm still trying to get it to work, but I'm making good progress.
I also tried the Butterfly XML Editor, but couldn't get it to apply XSL transformations.