
HTML Data Extraction

I've been looking for a decent tool for HTML data mining (also known as web-based data mining or screen scraping), without much success.
I want to extract data from 350+ HTML files and load it into a database.

It sounded like the kind of thing a SourceForge project would already do. But after a couple of days of searching, the closest I can get looks like a solution based on an article by IBM folks, Web Based Data Mining.

The code is a bit dated (2001), but the overall approach still holds:

– Tidy up the HTML (JTidy) so that it is well-formed.
– Parse the resulting HTML/XHTML content to get a DOM.
– Load the XSL stylesheet, whose templates use XPath to select the data.
– Apply the transformation (using a javax.xml.transform Transformer).
– Write the output to a file (XML?).
– Upload the data from the file to the DB.
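
As a rough sketch of the middle steps (parse to a DOM, load the stylesheet, apply the transformation), something like the following should work with nothing but the JDK's javax.xml classes. It is not the code from the IBM article. The class name, the sample page, and the stylesheet are all made up for illustration, and it assumes the input has already been tidied into well-formed XHTML with no default namespace, so the JTidy step and the DB upload are left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name; a sketch, not the article's implementation.
class HtmlExtractSketch {

    // Hypothetical stylesheet: each table row becomes a <record> element,
    // with XPath picking the first and second cells.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<records>"
      + "<xsl:for-each select='//table/tr'>"
      + "<record>"
      + "<name><xsl:value-of select='td[1]'/></name>"
      + "<value><xsl:value-of select='td[2]'/></value>"
      + "</record>"
      + "</xsl:for-each>"
      + "</records>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses already-tidied XHTML into a DOM and applies the stylesheet to it.
    static String extract(String xhtml, String xsl) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XSLT expects a namespace-aware DOM
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document dom = builder.parse(new InputSource(new StringReader(xhtml)));

        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));

        // A StringWriter keeps the sketch self-contained; a file-backed
        // StreamResult would cover the "write the output to a file" step.
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page standing in for one of the tidied HTML files.
        String page = "<html><body><table>"
            + "<tr><td>alpha</td><td>1</td></tr>"
            + "<tr><td>beta</td><td>2</td></tr>"
            + "</table></body></html>";
        System.out.println(extract(page, XSL));
    }
}
```

If the tidied pages do declare the XHTML namespace, the XPath expressions in the stylesheet would need a prefix bound to that namespace, otherwise `//table/tr` matches nothing.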

I'm still trying to get it to work, but I'm making good progress.

I also tried the Butterfly XML Editor, but couldn't get it to apply XSL transformations.

Posted in Java.

One Response


  1. Mainak says

    I am interested in this and have to build something similar. Do you have an idea of the current state of things in this area? I would very much like to know what progress you've made.
