HTML Data Extraction

General May 3, 2005 Tamer Salama

I’ve been looking for a decent tool for HTML data mining (aka web-based data mining, aka screen scraping), with no real success.
I wanted to extract data from 350+ HTML files and load it into a DB.

It sounded like the kind of thing a SourceForge project would do… BUT, after spending a couple of days looking around, it seems that a solution based on an article by IBM folks (Web Based Data Mining) is the closest I can get.

The code is a bit outdated (2001) but the main theme remains the same:

Tidy up the HTML (JTidy).
Parse the HTML/XHTML content to get a DOM.
Parse the XSL stylesheet containing the templates (with XPath expressions).
Apply the transformation (using a javax.xml.transform Transformer).
Write the output to a file (XML?).
Upload the data from the file to the DB.
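
To make the middle steps concrete, here is a minimal sketch of steps 2 to 5 using only the JDK's built-in XML APIs. It is not the code from the IBM article. It assumes the HTML has already been tidied into well-formed XHTML (the JTidy step is left out), and the class name ScrapeSketch, the sample table, and the items/item output format are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ScrapeSketch {
    // Hypothetical input: stands in for a page that was already tidied into XHTML.
    static final String SAMPLE_XHTML =
        "<html><body><table>"
        + "<tr><td>Widget</td><td>9.99</td></tr>"
        + "<tr><td>Gadget</td><td>19.50</td></tr>"
        + "</table></body></html>";

    // Hypothetical stylesheet: XPath picks the table cells, the template emits one <item> per row.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<items><xsl:for-each select='//tr'>"
        + "<item name='{td[1]}' price='{td[2]}'/>"
        + "</xsl:for-each></items>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xhtml, String xsl) {
        try {
            // Step 2: parse the (already well-formed) XHTML into a DOM.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document dom = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Step 3: parse the XSL stylesheet into a Transformer.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));

            // Steps 4 and 5: apply the transformation and collect the XML output.
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException
                 | IOException | TransformerException e) {
            throw new IllegalStateException("Extraction failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_XHTML, SAMPLE_XSL));
    }
}
```

To write to a file instead of a string, the StreamResult can wrap a java.io.File. One thing to watch: if the tidied document declares the XHTML namespace, the XPath expressions in the stylesheet need a matching namespace prefix, otherwise they will match nothing.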

I’m still trying to get it to work, but I’m making good progress.

I also tried the Butterfly XML Editor, but couldn’t get it to apply XSL transformations.

Tags: java
