I’ve been looking for a decent tool for HTML data mining (aka web-based data mining, aka screen scraping), with no real success.
I wanted to extract data from some 350+ HTML files and upload it into a DB.
Sounded like something a SourceForge project would already do… BUT, after a couple of days of searching, it looks like a solution based on an article by IBM folks (Web Based Data Mining) is the closest I can get.
The code is a bit outdated (2001) but the main theme remains the same:
- Tidy up the HTML (JTidy).
- Parse the HTML/XHTML Content to get a DOM.
- Parse the XSL stylesheet containing the templates (with XPath expressions).
- Apply the transformation (using the javax.xml.transform Transformer).
- Write the output to a file (XML?).
- Upload the data from the file to the DB.
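To make the middle steps concrete, here is a minimal sketch of the parse-and-transform part using only the JDK's XML classes. It is not the code from the IBM article. It assumes the page has already been tidied into well-formed XHTML (the JTidy step is left out), and the class name `ScrapeSketch`, the sample page and the stylesheet are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ScrapeSketch {

    // Hypothetical stylesheet: turns every table row into an <item> element,
    // using XPath (td[1], td[2]) to pick out the two cells.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<items><xsl:for-each select='//table/tr'>"
      + "<item name='{td[1]}' price='{td[2]}'/>"
      + "</xsl:for-each></items>"
      + "</xsl:template></xsl:stylesheet>";

    // Parses already-tidied XHTML into a DOM, applies the stylesheet,
    // and returns the resulting XML as a string.
    static String transform(String xhtml, String xsl) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page standing in for one of the tidied HTML files.
        String page = "<html><body><table>"
                    + "<tr><td>Widget</td><td>9.99</td></tr>"
                    + "</table></body></html>";
        System.out.println(transform(page, XSL));
    }
}
```

For real files, swapping the `StringWriter` for a `FileWriter` covers the "write the output to a file" step. One thing to watch: if the tidied document declares the XHTML namespace, unprefixed XPath steps like `//table/tr` will not match and the stylesheet needs a namespace prefix.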
I’m still trying to get it to work, but I’m making good progress.
I also tried using Butterfly XML Editor but couldn’t manage to make it apply XSL transformations.