Auto Tagging / Term Extraction

Remember the idea I once had? (As if I have dedicated readers 🙂) Well, it has resurfaced.

As I was reviewing itoot, I wanted to know what itoot bloggers are really talking about. So, another mashup? Pretty close. My idea was to aggregate the feeds, parse them, pass their contents to an auto-tagging service/API, then present the result as a tag cloud. Ruby to the rescue. After looking for the quickest shortcuts, here’s what I did:

  1. Getting the feed URLs (directly from the adorable Firefox Web Developer plugin).
  2. Using a FeedTools Ruby script to parse the feed contents and generate a unified Atom feed.
  3. For the rest (auto-tagging the unified feed + tag cloud).
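
The aggregation step above can be sketched roughly as follows. This is not the FeedTools script I used; it is a minimal stand-in built on Python's standard library, and the function names (`entry_texts`, `merge_feeds`) are made up for illustration. It assumes the feeds have already been downloaded as XML strings and only handles plain RSS 2.0 and Atom.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom elements live in this namespace


def entry_texts(feed_xml):
    """Return title + body text for every post in an RSS 2.0 or Atom feed."""
    root = ET.fromstring(feed_xml)
    texts = []
    # RSS 2.0 keeps posts in <item>; Atom keeps them in namespaced <entry>.
    for item in root.iter("item"):
        parts = [item.findtext("title", ""), item.findtext("description", "")]
        texts.append(" ".join(p for p in parts if p))
    for entry in root.iter(ATOM + "entry"):
        parts = [
            entry.findtext(ATOM + "title", ""),
            entry.findtext(ATOM + "summary", ""),
            entry.findtext(ATOM + "content", ""),
        ]
        texts.append(" ".join(p for p in parts if p))
    # Feeds often embed HTML in the body; drop the tags, keep the words.
    return [re.sub(r"<[^>]+>", " ", t).strip() for t in texts]


def merge_feeds(feed_xmls):
    """Concatenate the posts of several feeds into one flat list of texts."""
    merged = []
    for xml_text in feed_xmls:
        merged.extend(entry_texts(xml_text))
    return merged
```

The merged list of texts is what would then be handed to whatever auto-tagging service sits at the end of the pipeline.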

Here’s the result. Do I like it? Absolutely not. It seems that the tag cloud was based only on the first few “English” feeds.

So, are there any “better” auto-tagging services/APIs? Here are my finds:

According to Ryan King, Yahoo term extraction (which is “sometimes” used by Zoomclouds as its backend) seems to use statistical analysis to rank the extracted terms. Yes! I agree. Tagging (as on delicious) is a human activity, and automating it does not serve what it was really intended for. BUT in the case at hand, tags (aka terms, aka keywords) are an intermediate product that serves some functional need rather than a presentable end product.
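
To make "statistical" concrete, here is a rough sketch of the simplest thing that could count as term extraction: plain word frequency with a stop-word filter, scaled into tag-cloud font sizes. I don't know what Yahoo or Zoomclouds actually do internally; the stop-word list, the function names and the pixel range below are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny English-only stop-word list (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this"}


def extract_terms(texts, top_n=10):
    """Rank words by raw frequency across all texts."""
    counts = Counter()
    for text in texts:
        # \w+ is Unicode-aware in Python 3, so Arabic words are tokenized too.
        words = re.findall(r"\w+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


def cloud_sizes(terms, min_px=12, max_px=36):
    """Map each (term, count) pair to a font size between min_px and max_px."""
    if not terms:
        return {}
    hi = max(count for _, count in terms)
    lo = min(count for _, count in terms)
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {term: min_px + (count - lo) * (max_px - min_px) // span
            for term, count in terms}


terms = extract_terms(["Ruby feeds and Ruby tags", "Tags for the Ruby cloud"], top_n=2)
print(cloud_sizes(terms))
```

Since the stop-word list is English-only, common function words in any other language would be counted as terms, which is one reason a frequency-based approach needs per-language tuning.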

Now, back to my “evolved” idea. Using the technology at hand, I can use RSS to get information, come up with some BI tool that does some text processing and correlates information about organizations, products, people, and events (nodes?), then present this information to the end user using some neat visualization. Something along the lines of a central semantic web 🙂
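
One crude way to picture the "correlate" part: treat each known name as a node and count how often two names show up in the same post. The sketch below assumes the list of names is supplied by hand and uses naive substring matching; recognizing organizations, products and people automatically is the hard part and is not covered here. `cooccurrence_edges` is a hypothetical helper, not an existing tool.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(posts, entities):
    """Count, for each pair of entity names, the posts mentioning both."""
    edges = Counter()
    for text in posts:
        lowered = text.lower()
        # Naive substring match; sorted so each pair has one canonical order.
        present = sorted(e for e in entities if e.lower() in lowered)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


posts = [
    "Yahoo and Zoomclouds both extract terms",
    "Zoomclouds builds tag clouds",
    "Yahoo term extraction with Zoomclouds",
]
edges = cooccurrence_edges(posts, ["Yahoo", "Zoomclouds"])
print(edges)
```

The resulting weighted pairs are the kind of data a graph visualization could draw as nodes and links.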

The technology still lacks the part that does the text processing and correlation of information. If I settle for an auto-tagging or term extraction service to cover this part, then building such a service sounds like a good idea, especially if the system covers some non-English languages (Arabic, to name one), is pluggable (REST), and offers some output format options.
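
As for "output format options", the response side of such a service could be as small as the sketch below: the same ranked terms rendered as JSON or XML depending on a format parameter. The field and element names (`term`, `weight`, `terms`) are my own assumptions, not taken from any existing API.

```python
import json
import xml.etree.ElementTree as ET


def render_terms(terms, fmt="json"):
    """Serialize (term, weight) pairs as JSON or XML."""
    if fmt == "json":
        # ensure_ascii=False keeps Arabic (or any non-ASCII) terms readable.
        return json.dumps([{"term": t, "weight": w} for t, w in terms],
                          ensure_ascii=False)
    if fmt == "xml":
        root = ET.Element("terms")
        for t, w in terms:
            node = ET.SubElement(root, "term", weight=str(w))
            node.text = t
        return ET.tostring(root, encoding="unicode")
    raise ValueError(f"unsupported format: {fmt}")


print(render_terms([("ruby", 3), ("مدونة", 2)], fmt="json"))
print(render_terms([("ruby", 3)], fmt="xml"))
```

A REST endpoint would then only need to map something like a `format` query parameter onto `fmt`.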

I feel the idea is so obvious that it must have been handled one way or another. Can you point out any?

Check out AlchemyAPI for another term extraction solution.

Offers a REST API (JSON/XML/RDF output) and weighted, relevancy-ranked results, supports 8 languages other than English, etc.

30,000 queries a day for free, commercial use/SLA available.