Wednesday, October 26, 2005

Homonyms, synonyms, unique words and relevance

Matthew Koll's insight about needles and haystacks is bright but not clear enough. It is a beautiful metaphor, but I found it hard to translate into concrete and useful distinctions. So here's my attempt to simplify the problem of searching by machines.

Searching is simple. Finding is complicated.

Computers are simple. Language is complicated.

Computers look for a match between words in a query and words in a result. They don't care about the meaning of these words.
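A minimal sketch of what such literal matching looks like, assuming a toy matcher that only compares lowercase words. The function names and the sample sentences are invented for illustration and do not describe how any real search engine works.

```python
import re

def words(text):
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def matches(query, document):
    """True when every word of the query appears literally in the document."""
    return words(query) <= words(document)

# The word itself is in the text, so the machine finds it.
print(matches("tinnitus", "Tinnitus is a ringing in the ears"))

# The meaning is in the text but the word is not, so the machine misses it.
print(matches("elders", "Services for senior citizens"))
```

The matcher never looks at meaning. It only checks whether the same string of letters shows up on both sides.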

The simplest case is when there is one unique word with one unique meaning.
For example: 'tinnitus'. Most search engines will find most of these words without any problem. I tried a list of 100 such words on QTSaver and it found 99 of them.

Then there is the case of one meaning that has many words (synonyms).
For example: 'elders', 'old', and 'senior citizens'.
The computer will find one match or many matches.
It can find one right match and then there is no problem.
It can find one wrong match and then the user feels frustration.
It can find many matches, of which one is right and the others are wrong, and then the user feels confusion.

Then there is the case of many meanings that share one word (homonyms).
For example: 'apple' from the tree and 'Apple' the company.

The computer can find a right match, a wrong match, or both. And again, if it finds a right one there is no problem, and if it finds wrong ones the user is either frustrated or confused.
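Both cases can be sketched with a tiny word index, assuming a made-up corpus of three documents. The document names, their texts, and the helper functions are hypothetical and serve only to show the effect.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus; the texts are invented for illustration.
DOCS = {
    "fruit": "An apple a day keeps the doctor away",
    "company": "Apple released a new computer",
    "care": "Services for senior citizens in the city",
}

def build_index(docs):
    """Map each lowercase word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents that contain the exact word."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(DOCS)

# One word, many meanings: the fruit and the company come back together.
print(search(INDEX, "apple"))

# One meaning, many words: nothing comes back, although "care" is about elders.
print(search(INDEX, "elders"))
```

The index returns too much for 'apple' and too little for 'elders', and in both cases it has no way of knowing.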

Search engines try to tackle these problems with advanced search, by suggesting words to refine the query, by clustering, by ranking, etc.

The more they try, the more sophisticated they get.
The more sophisticated they get, the higher the users' expectations of finding what they look for.
The higher the expectations, the deeper the frustration when they don't find it.

Search engines are very clever machines but human language is very stubborn. Even if you send every query to a human search expert there will always be unanswered queries. Why? Because the complexity of language is too much for the simplicity of machines.

To add a little color to this post I looked for the word 'homonym' in QTSaver and got
Alan Cooper's wonderful insight into the nature of homonyms:

I consider homonyms to be the prime numbers of the English language. Like
primes, they cannot be predicted by any rules of grammar or diction. In the way
that you can't search the number line for primes, you cannot systematically
search the dictionary for homonyms. You just have to find them, like Easter Eggs
in the dictionary.
