Friday, February 10, 2006

Automatic Tagging of micro contents

Tagging is in.

Automatic Tagging is out.

Everybody admires Flickr because it is fun to manually tag links or pictures; because it is a social activity of sharing; because Tagging makes it easier to find things (an alternative to filing systems); and because lots of people, rather than a few experts, decide the fate of the Web.

It seems that almost nobody cares about Automatic Tagging. Those who work on it are like generals without soldiers. But the amount of manual Tagging is too small compared to the number of new documents added to the Web every day, so the best way to organize this chaotic Web is through Automatic Tagging.

I believe that Automatic Tagging of micro contents is possible and that it will eventually lead to better search engines that not only search but also find. When this happens, Automatic Tagging will be in and manual Tagging will be out.

For those of you who want to know more about this issue here are some excerpts I collected lately – some of them with QTsearch.

Research has shown that web-based retrieval of documents in other domains (i.e. news, business, etc.) can be improved by Automatic Tagging of the documents using SGML or XML-based markup schemes designed to capture the semantic meaning of terms. Therefore, automatic SGML or XML-based semantic tagging of genealogical documents, combined with a search engine capable of reading the tags, should go a long way toward addressing the problems plaguing web-based genealogy research. While tagging of structured data may also be beneficial, the tagging that I propose to do will be most useful for unstructured text-based documents, especially when combined with a fielded search interface. By fielded search interface, I mean a search form that provides separate slots for surname, given name, location, etc. Many of us are accustomed to filling in such templates when we do searches on the web, but unfortunately most search engines out there cannot make full use of the disambiguation provided by such forms because the documents are not similarly tagged.
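To make the idea of a fielded search over tagged documents a little more concrete, here is a minimal Python sketch. The tag names (person, given, surname, location) and the sample sentence are my own assumptions for illustration; the excerpt does not specify a markup scheme. It shows how a plain keyword search confuses the occupation "smith" with the surname "Smith", while a search against the tagged fields does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tag names below are assumptions, not a real genealogy schema.
TAGGED_DOC = (
    "<doc><person><given>John</given> <surname>Baker</surname></person> "
    "worked as a smith in <location>Salem</location>.</doc>"
)

def plain_match(xml_text, word):
    """Untagged keyword search: looks for the word anywhere in the text."""
    root = ET.fromstring(xml_text)
    return word.lower() in "".join(root.itertext()).lower()

def field_values(xml_text, field):
    """Collect the lowercased text of every element named `field`."""
    root = ET.fromstring(xml_text)
    return [(el.text or "").strip().lower() for el in root.iter(field)]

def fielded_match(xml_text, **query):
    """Fielded search: every slot in the query must match a tag of the same name."""
    return all(
        value.lower() in field_values(xml_text, field)
        for field, value in query.items()
    )

print(plain_match(TAGGED_DOC, "smith"))                           # keyword hit on the occupation
print(fielded_match(TAGGED_DOC, surname="Smith"))                 # no such surname in the tags
print(fielded_match(TAGGED_DOC, surname="Baker", location="Salem"))
```

The fielded version only works because the document carries the same slots as the search form, which is exactly the point the excerpt makes.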

We can stop overload by eliminating useless and irrelevant information or by helping people become more efficient. Manual tagging is labor-intensive and expensive. A Forrester Research report estimates that it costs up to $50 to tag a large document. Companies that have employed Automatic Tagging include Northern Light, whose search engine (which is no longer publicly available) placed search results in "folders," and Vivísimo, which uses document clustering that lets searchers organize information dynamically without the need to construct and maintain taxonomies.


The Semantic Web is gaining popularity for its ability to support information interchange and sharing between machines. Such ability is possible when Web pages are properly annotated. For newly developed Semantic Web resources, such annotation can be done manually or with the help of sophisticated authoring tools. However, it is not practical to semantically annotate existing Web pages because there are so many of them. To bridge the gap between the Semantic Web and the World Wide Web, we propose a machine learning approach to automatically generate semantic markup for traditional Web pages. The proposed method applies the self-organizing map algorithm to cluster training Web pages and conducts a text mining process to discover the words to be tagged and their semantic descriptions. Preliminary experiments show that our method may successfully generate semantic markup for Web pages.

Adam is having an interesting experience with the Automatic Tagging service Tagyu. Insert some sample text (blog post, news story, etc.) and it tags it for you.
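For readers who wonder what "insert some text and get tags back" might look like in the simplest case, here is a rough Python sketch. This is not how Tagyu works (the excerpt says nothing about its method); it is only a naive word-frequency tagger, and the stopword list and the `suggest_tags` name are my own assumptions.

```python
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real tagger would need a far larger one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "that", "this", "with", "as", "are", "be",
}

def suggest_tags(text, max_tags=3):
    """Suggest the most frequent non-stopword words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]

sample = "Search engines index the web. Better search engines tag the web automatically."
print(suggest_tags(sample))
```

Counting words is obviously a long way from understanding a text, which is why the more serious approaches quoted here rely on clustering, text mining or linguistic analysis.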

AUTASYS is a menu-driven Automatic Tagging and lemmatising system that analyses English texts at word-class level with the Lancaster-Oslo-Bergen (LOB) tagset, the International Corpus of English (ICE) tagset, and the "skeleton" tagset (SKELETON), which is the set of base tags from ICE without features. The tagged text can be subsequently lemmatised so that each lexical item is reduced to its base form as presented in a dictionary.

AUTASYS: Automatic Tagging and Cross-Tagset Mapping. In Comparing English World Wide: The International Corpus of English, ed. by Sidney Greenbaum. Oxford: Oxford University Press. pp 110-124.

The first major step in Automatic Tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence final punctuation followed by a capital letter). This procedure is not so straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression, as well as the beginning of a sentence).
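The segmentation step described in this excerpt can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the abbreviation list is hypothetical and tiny, and the rule is simply "sentence-final punctuation followed by whitespace and a capital letter, unless the word is a known abbreviation".

```python
import re

# Hypothetical abbreviation list; a real tagger would use a much larger lexicon.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split at sentence-final punctuation followed by a capital letter,
    skipping full stops that belong to a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the full stop is an abbreviation mark, not a boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Split on spaces, keep abbreviations whole, detach other punctuation."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        tokens.extend(re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", chunk))
    return tokens

text = "Dr. Smith arrived. He met Mr. Jones at noon."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Even this toy version shows the ambiguity the excerpt warns about: a sentence that really does end in "etc." followed by a capitalized word would be wrongly glued to the next one, and a capitalized name gives no clue by itself about where a sentence starts.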

Manual tagging is not a solution for me. So here I introduce the Automatic Tagging system. Look around and you will notice that it is already implemented on my blog. Generally, the concept is almost like how Jonathon Snook adds tagging for FontSmack.

In this paper a fully Automatic Tagging system for the dialogue texts in the London-Lund corpus, LLC, will be presented. The units that receive tags are "turns"; a collection of (not necessarily connected) tone units (the basic record in the corpus) that one speaker produces while being either the "floor holder" or the "listener"; the quoted concepts are defined below. The tags constitute a classification of each turn according to "type of turn".

Automatic Tagging and document generation solution for XML data! 

