Sunday, January 22, 2006

False Positives

1. Spider
Today I searched Google for the words "spider identification" and got a site about identifying brown spiders.
I meant to search for "search engine spider identification", but I was so sure that a "spider" is a computer program that I forgot that this word is a homonym, a word with multiple meanings. This amused me, and I thought it would be nice to collect false positives on this web page and update it whenever I find new ones. Readers are invited to add their own false positives in the comments.

2. Home
On Wikipedia I found an example of a false positive right in the definition of the term "False Positive".

In computer database searching, false positives are documents that are retrieved by a search despite their irrelevance to the search question. False positives are common in full text searching, in which the search algorithm examines all of the text in all of the stored documents in an attempt to match one or more search terms supplied by the user.
Most false positives can be attributed to the deficiencies of natural language, which is often ambiguous: the term "home," for example, may mean "a person's dwelling" or "the main or top-level page in a Web site." The false positive rate can be reduced by using a controlled vocabulary, but this solution is expensive because the vocabulary must be developed by an expert and applied to documents by trained indexers.
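A minimal sketch of how this happens, assuming a made-up three-document corpus (the document IDs, texts, and subject terms below are invented for illustration, and the functions are hypothetical helpers, not any real search engine's API). A plain full-text match on "home" returns both senses, while a lookup against hand-assigned subject terms returns only the intended one:

```python
# Hypothetical mini-corpus; IDs and texts are invented for illustration.
DOCUMENTS = {
    "real-estate": "Buying a home: how to choose a dwelling for your family",
    "web-design": "Every site needs a clear link back to the home page",
    "gardening": "Growing tomatoes on a balcony",
}

# Hypothetical controlled vocabulary: subject terms assigned by a human indexer.
SUBJECT_INDEX = {
    "real-estate": {"dwelling"},
    "web-design": {"home page"},
    "gardening": {"horticulture"},
}


def full_text_search(query, documents):
    """Return IDs of documents whose text contains every query term."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in documents.items():
        # Crude tokenization: lowercase, drop colons, split on whitespace.
        words = set(text.lower().replace(":", " ").split())
        if all(term in words for term in terms):
            hits.append(doc_id)
    return hits


def controlled_search(subject, index):
    """Return IDs of documents tagged with the given subject term."""
    return [doc_id for doc_id, subjects in index.items() if subject in subjects]


# Someone looking for dwellings also gets the web-design page: a false positive.
print(full_text_search("home", DOCUMENTS))
# The controlled vocabulary separates the two senses, at the cost of manual tagging.
print(controlled_search("dwelling", SUBJECT_INDEX))
```

The second lookup only works because somebody tagged every document by hand, which is the expense the quoted definition mentions.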


3. Pray and 'Bare feet'
Beware of homonyms:
Two words are homonyms if they are pronounced or spelled the same way but have different meanings. A good example is 'pray' and 'prey'. If you look up information on a 'praying mantis', you'll find facts about a religious insect rather than one that seeks out and eats others. 'Bare feet' and 'bear feet' are two very different things! If you use the wrong word to describe your search you will find interesting, but wrong, results.


4. Apple
If you enter the word "apple" into Google search looking for the tree or the fruit of that tree, you'll have to scan a few hundred results about the company by that name before you find what you asked for.

5. organization and color
Aside from cultural differences there are spelling differences as well. American spellings differ from British ones; you might miss your answer by searching only for organisation (and not organization) or only for color (and not colour).
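One possible way around this is to expand the query with known spelling variants before searching. The sketch below assumes a deliberately tiny, hand-written table of spelling pairs; the table and the `expand_query` helper are hypothetical and only cover the two words mentioned above:

```python
# Hypothetical table of (American, British) spelling pairs; a real one would be much larger.
SPELLING_VARIANTS = [
    ("organization", "organisation"),
    ("color", "colour"),
]


def expand_query(term):
    """Return the term together with its known spelling variant, if any."""
    term = term.lower()
    for american, british in SPELLING_VARIANTS:
        if term in (american, british):
            return [american, british]
    # Unknown words are searched as typed.
    return [term]


print(expand_query("colour"))
print(expand_query("spider"))
```

A search would then be run once per returned spelling and the results merged.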

6. Polish/polish
Homonyms can affect your search: China/china or Polish/polish.


7. Police
If you type in police, you get a lot of pages about the rock group.

8. Football
Free text searching is likely to retrieve many documents that are not relevant to the search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language; for example, in the United States, football refers to what is called American football outside the U.S.; throughout the rest of the world, football refers to what Americans call soccer. A search for football may retrieve documents that are about two completely different sports.


9. Cobra
http://www.sims.berkeley.edu/courses/is141/f05/lectures/jpedersen.pdf
Jan Pedersen, Chief Scientist, Yahoo! Search, wrote on 19 September 2005 about The Four Dimensions of Search Engine Quality, and in the chapter about Handling Ambiguity showed nine pictures of different things called "cobra" (snake, car, helicopter, etc.).


10. China
Quickly finding documents is indeed easy. Finding relevant documents, however, is a challenge that information retrieval (IR) researchers have been addressing for more than 40 years. The numerous ambiguities inherent in natural language make this search problem incredibly difficult. For example, a query about "China" can refer to either a country or dinnerware.


11. Cook
http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&pageNumber=5&catID=2
An intelligent search program can sift through all the pages of people whose name is "Cook" (sidestepping all the pages relating to cooks, cooking, the Cook Islands and so forth),

12. cat/cats
http://en.wikipedia.org/wiki/Folksonomy
Phenomena that may cause problems include polysemy, words which have multiple related meanings (a window can be a hole or a sheet of glass); synonymy, multiple words with the same or similar meanings (tv and television, or Netherlands/Holland/Dutch); and plural words (cat and cats).
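Two of these problems, synonyms and plurals, can be partly handled by normalizing words before they are compared. This is only a rough sketch: the synonym table is hypothetical, and the plural rule just strips a trailing "s", which is far cruder than a real stemmer and will mangle words such as "bus". It does nothing for polysemy, since "window" stays "window" whichever meaning was intended.

```python
# Hypothetical synonym table mapping a variant to one canonical form.
SYNONYMS = {
    "tv": "television",
    "holland": "netherlands",
}
# Canonical forms are left alone so "netherlands" is not treated as a plural.
CANONICAL = set(SYNONYMS.values())


def normalize(word):
    """Map a word to a canonical form: lowercase, resolve synonyms, strip a plural "s"."""
    word = word.lower()
    if word in SYNONYMS:
        return SYNONYMS[word]
    if word in CANONICAL:
        return word
    # Naive plural rule: drop one trailing "s", but keep words ending in "ss".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for word in ["cats", "cat", "TV", "Holland", "Netherlands", "glass"]:
    print(word, "->", normalize(word))
```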

1 comment:

zeevveez said...

http://web.syr.edu/~mdtaffet/GENTECH_Scholarship_Proposal.htm
People researching the SEE family tend to find documents relating to vision in their search results.
People researching GREEN county often find documents that mention the color green.
People researching the WILL family find many documents in which the only mention of “Will” is as a legal document transferring property to heirs.
In my own searching on the DRESSER family, way too often I find documents that mention hair dressers, leather dressers, bone dressers and other occupations that contain this term; I have even found documents that refer to dresser as a piece of furniture in someone’s home – it is quite rare to find the actual surname.