You may have heard of the Web Bot Project. It was an application that crawled news articles, blogs, forums, and other forms of Internet conversation, looking for specific keywords. Its creators, Clif High and George Ure, initially built it to look for stock market trends.
After claims that Web Bot had predicted the 2004 Indonesian earthquake and Hurricane Katrina months before they happened, High went all Art Bell/George Noory, creating a website where he posts a whole lot of nonsensical babble pretending to be predictions.
No, I don't think much of the Web Bot project.
But the idea of looking through existing conversation for patterns and emerging trends isn't invalid. Researchers at Microsoft and an Israeli research institute have created software that attempts to predict outbreaks of disease and violence based on two decades of New York Times articles and other online data.
Microsoft Research has already produced a number of interesting products, and its partner in this one, the Technion-Israel Institute of Technology in Haifa, Israel, is no slouch, either. Its alumni include AI pioneer Saul Amarel; Andrei Broder, the developer of CAPTCHA technology; Andi Gutmans, the developer of PHP and co-founder of Zend Technologies; and Dadi Perlmutter, the chief product officer of Intel.
This kind of data mining has a decent track record. For example, reports of droughts in Angola in 2006 triggered a warning about possible cholera outbreaks in the country, because outbreaks had followed droughts before. A second warning was issued in early 2007 after news reports of large storms in Africa, because outbreaks had followed such storms before, too.
In similar tests involving forecasts of disease, violence, and significant numbers of deaths, the system’s warnings were correct between 70 and 90 percent of the time, Kira Radinsky, a researcher at the Technion-Israel Institute, told MIT Technology Review.
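To make the drought-then-cholera idea concrete, here is a toy sketch in Python. It is not the researchers' system, which I haven't seen. The event log, the function names `follow_rate` and `should_warn`, the one-year window, and the 0.5 threshold are all made up for illustration. It simply counts how often one kind of event was followed by another in the same region, and warns when that has happened often enough.

```python
from collections import defaultdict

# Hypothetical toy history of (year, region, event) records, invented for illustration.
HISTORY = [
    (1990, "A", "drought"),
    (1991, "A", "cholera"),
    (1995, "B", "drought"),
    (1996, "B", "cholera"),
    (1998, "C", "drought"),
    (2000, "C", "storm"),
]


def follow_rate(history, cause, effect, window=1):
    """Fraction of `cause` events followed by `effect` in the same
    region within `window` years. Returns 0.0 if `cause` never occurred."""
    by_region = defaultdict(list)
    for year, region, event in history:
        by_region[region].append((year, event))

    causes = 0
    followed = 0
    for events in by_region.values():
        for year, event in events:
            if event != cause:
                continue
            causes += 1
            # Did the effect show up in this region soon after the cause?
            if any(e == effect and 0 < y - year <= window for y, e in events):
                followed += 1
    return followed / causes if causes else 0.0


def should_warn(history, new_event, effect, threshold=0.5):
    """Warn if `effect` followed `new_event` at least `threshold` of the time."""
    return follow_rate(history, new_event, effect) >= threshold


if __name__ == "__main__":
    print(follow_rate(HISTORY, "drought", "cholera"))
    print(should_warn(HISTORY, "drought", "cholera"))
```

In this made-up log, cholera followed two of the three droughts, so a new drought report would trigger a warning. A real system would presumably need far more than raw counts, which is where the extra data sources come in.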
In addition to using 22 years of New York Times archives – from 1986 to 2007 – the project also turns to other web sources. One source, according to Radinsky, is DBpedia, a crowd-sourced effort to extract structured information from Wikipedia and make it available on the web. Also in use are WordNet, a lexical database that groups words into sets of synonyms to help software work out their meaning, and OpenCyc, a database of general knowledge. Radinsky said the predictive application will eventually use more than 90 data sources in total.
Microsoft doesn’t have plans to commercialize the research as yet, but the project will continue. Personally, I hope it winds up in Bing someday. That would certainly give it a competitive edge.