You might be digitizing books on the Web without knowing it, thanks to this stealthy anti-spam technology

You know those pesky but necessary CAPTCHA boxes whose squiggly letters and digits you need to retype to make use of certain parts of sites such as Yahoo, Wikipedia and PayPal?

A computer scientist from Carnegie Mellon wants to replace many of those boxes with anti-spam boxes of his own, which would help digitize the text of books and other printed materials and make it searchable. As a bonus, the system could help companies better secure their Web sites.

The idea resembles projects such as SETI@Home, the famous grid supercomputer project for detecting signs of extraterrestrial life from deep space. Organizers of SETI@Home convinced computer users all over the world to let their computers’ idle CPU cycles be used to process information for the ET hunt.

In Luis von Ahn’s project, he and his team are persuading organizations to replace the CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) security boxes on their Web sites with what the assistant professor of computer science calls reCAPTCHA boxes. Instead of retyping random numbers and letters, visitors would retype text that optical character recognition (OCR) systems have trouble deciphering when they are used to digitize books and other printed materials. The transcribed text would then go toward the digitization of the printed material on behalf of the Internet Archive project.
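
How the retyped answers are combined is not described here. One plausible approach, sketched below in Python purely as an assumption, is to collect several visitors' transcriptions of the same scanned word and accept a reading only once enough of them agree. The function name `consensus` and the thresholds `min_votes` and `min_share` are hypothetical and are not taken from the reCAPTCHA project.

```python
from collections import Counter


def consensus(transcriptions, min_votes=3, min_share=0.6):
    """Return the agreed reading of a scanned word, or None if agreement is too weak.

    Hypothetical sketch: the thresholds are illustrative assumptions,
    not values used by any real service.
    """
    # Normalize answers so trivial differences (case, stray spaces) do not split the vote.
    # Lowercasing is a simplification; a real system would likely preserve case.
    cleaned = [t.strip().lower() for t in transcriptions if t.strip()]
    if len(cleaned) < min_votes:
        return None  # too few answers to trust any of them yet
    word, count = Counter(cleaned).most_common(1)[0]
    # Accept the most common answer only if a clear majority typed it.
    return word if count / len(cleaned) >= min_share else None


answers = ["morning", "Morning ", "moming", "morning"]
print(consensus(answers))
```

In this sketch, three of the four answers agree after normalization, so the word would be accepted; with only two answers, the function would hold off and return `None`.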

“I think it’s a brilliant idea: using the Internet to correct OCR mistakes,” said Brewster Kahle, director of the Internet Archive, in a statement. “This is an example of why having open collections in the public domain is important. People are working together to build a good, open system.”

Von Ahn says people are estimated to solve more than 60 million CAPTCHAs a day, amounting to 150,000 or more man-hours of work that could be put to use for the digitization effort. His team is working with Intel to offer a Web-based service that lets Webmasters adopt reCAPTCHAs to secure their sites.
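
As a rough check on those figures, the implied time spent per CAPTCHA can be worked out directly. The short Python sketch below uses only the two numbers von Ahn cites and treats them as exact, although both are stated as lower bounds.

```python
# Figures cited by von Ahn (both are "or more" estimates, treated as exact here)
captchas_per_day = 60_000_000
man_hours_per_day = 150_000

# Convert hours to seconds, then divide by the number of CAPTCHAs solved
seconds_per_captcha = man_hours_per_day * 3600 / captchas_per_day
print(seconds_per_captcha)  # 9.0
```

That works out to roughly nine seconds of human attention per CAPTCHA.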

An audio version, which would help transcribe radio programs and could be used by blind Web users, is also in the works.
