Eric Brill, a senior researcher at Microsoft, has been trying to figure out how to take search engines to the next level by getting them to answer your question instead of providing links to pages that might have the answer.
His possible answer: "noisy channels," in which statistical analysis is used to transform what a user types into a search box into something an application can use to ferret out the answer to the question (sort of similar to the way spell checkers now suggest correct spellings for whatever gibberish you've just typed). To test the approach, Brill and Radu Soricut of USC collected zillions of Web-based FAQs, since these tend to provide short answers to specific questions. But as they write (that link brings up a Word doc), first they had to find the FAQs:
... If one poses the simple query "FAQ" to an existing search engine, one can observe that roughly 85% of the returned URL strings corresponding to genuine FAQ pages contain the substring "faq", while virtually all of the URLs that contain the substring "faq" are genuine FAQ pages. It follows that, if one has access to a large collection of the Web's existent URLs, a simple pattern-matching for "faq" on these URLs will have a recall close to 85% and precision close to 100% on returning FAQ URLs from those available in the collection. Our URL collection contains approximately 1 billion URLs, and using this technique we extracted roughly 2.7 million URLs containing the (uncased) string "faq", which amounts to roughly 2.3 million FAQ URLs to be used for collecting question/answer pairs. ...
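The filtering step the excerpt describes can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names and the four sample URLs are made up, and the sample is hand-labeled purely to show how precision and recall would be measured for an uncased "faq" substring match.

```python
def contains_faq(url: str) -> bool:
    """Case-insensitive ("uncased") substring test for "faq"."""
    return "faq" in url.lower()


def precision_recall(labeled_urls):
    """Score the substring filter on (url, is_genuine_faq) pairs."""
    true_pos = false_pos = false_neg = 0
    for url, is_faq in labeled_urls:
        matched = contains_faq(url)
        if matched and is_faq:
            true_pos += 1
        elif matched:
            false_pos += 1
        elif is_faq:
            false_neg += 1
    matched_total = true_pos + false_pos
    genuine_total = true_pos + false_neg
    precision = true_pos / matched_total if matched_total else 0.0
    recall = true_pos / genuine_total if genuine_total else 0.0
    return precision, recall


# Hypothetical hand-labeled sample (invented URLs, not data from the paper).
sample = [
    ("http://example.com/faq.html", True),
    ("http://example.org/support/FAQ/index.htm", True),
    ("http://example.net/help/questions.html", True),  # genuine FAQ, no "faq" in URL
    ("http://example.com/products/list.html", False),
]

precision, recall = precision_recall(sample)
```

On this toy sample every matched URL is a genuine FAQ, while one genuine FAQ is missed because its URL lacks the substring. That is the same trade-off the excerpt reports at Web scale: very high precision, with some recall given up.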
