You know those pesky but necessary CAPTCHA boxes whose squiggly letters and digits you need to retype to make use of certain parts of sites such as Yahoo, Wikipedia and PayPal?
A computer scientist from Carnegie Mellon is looking to replace many of those boxes with anti-spam boxes of his own for the purpose of helping to digitize and make searchable the text from books and other printed materials. To boot, the system could help companies better secure their Web sites.
The idea is somewhat along the lines of projects like the famous SETI@Home grid supercomputer project for detecting signs of extraterrestrial life from deep space. Organizers of SETI@Home convinced computer users all over the world to allow their computers’ CPU cycles to be used to process information for the ET hunt when the systems weren’t otherwise being used.
But in the case of Luis von Ahn’s project, he and his team are convincing organizations to replace the CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) security boxes on their Web sites with what the assistant professor of computer science calls reCAPTCHA boxes. Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The transcribed text would then go toward the digitization of the printed material on behalf of the Internet Archive project.
“I think it’s a brilliant idea — using the Internet to correct OCR mistakes,” said Brewster Kahle, director of the Internet Archive, in a statement. “This is an example of why having open collections in the public domain is important. People are working together to build a good, open system.”
Von Ahn says it is estimated that people solve 60 million-plus CAPTCHAs a day, amounting to 150,000 or more man-hours of work that can be put to use for the digitization effort. His team is working with Intel to offer a Web-based service enabling Webmasters to adopt reCAPTCHAs to secure their sites.
An audio version is also in the works. It would help transcribe radio programs and could be used by blind Web users.
"they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials."
This is nonsense, from start to finish.
If the computer couldn't OCR the text to begin with, how will it know if you've completed the CAPTCHA correctly?
from -> http://recaptcha.net/learnmore.html
Direct quote:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
Easy, they give you two words, one of which they already know. You decode both.
From TFA:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
From the website:
But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct.
>This is nonsense, from start to finish.
>If the computer couldn't OCR the text to begin with, how will it know if you've completed the CAPTCHA correctly?
Answer: It could provide the same material to multiple persons who are logging in at the same time. If your response is even close to other responses, then you're in. The accepted translation would be the one that gets answered the most.
That was going to be my approach - the statistical one.
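For anyone who wants to see the statistical approach spelled out, here is a rough Python sketch. It is only an illustration of "the answer given most often wins", not how reCAPTCHA is actually built. The function names and the `min_votes` / `min_share` thresholds are made up for the example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def accepted_transcription(answers, min_votes=3, min_share=0.6):
    """Return the most common answer, or None if agreement is too weak.

    min_votes and min_share are hypothetical thresholds chosen for this sketch.
    """
    votes = Counter(normalize(a) for a in answers if a.strip())
    if not votes:
        return None
    text, count = votes.most_common(1)[0]
    total = sum(votes.values())
    if count >= min_votes and count / total >= min_share:
        return text
    return None


# Three of four people agree once case and spacing are ignored.
print(accepted_transcription(["morning", "Morning ", "morning", "mourning"]))
```

The catch, as pointed out elsewhere in the thread, is that a pure vote has no answer to check against for the first few people who see a given word.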
Not nonsense
Thanks for helping everybody out with your useless criticism. Go to ReCAPTCHA's home page if you care about how they're dealing with this problem.
If the computer couldn't OCR
It will probably work the same way Mechanical Turk works. It will give the captcha to several people, and if they all get the same answer, it's right. It will be obvious if one of them is a robot. For high-traffic sites, they could probably get several people to solve it within seconds. For slower sites, though, it might necessitate letting a few people in "for free" before you know the answer.
But you have a good point. It's far from foolproof.
Presumably by whether you agree with the majority of other people who do the CAPTCHA.
Make the verification text two words: one unknown, one known. The known word is the one that allows access; the unknown one is the word being verified. Once a word has been answered the same way by several people, it becomes a known word.
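A minimal Python sketch of that two-word idea, under some loud assumptions: the class name, the image ids, and the `PROMOTE_AFTER = 3` threshold are all invented for illustration, and the real service surely differs in its details.

```python
from collections import Counter, defaultdict

PROMOTE_AFTER = 3  # hypothetical: matching answers needed before a word counts as known


class WordPairChecker:
    def __init__(self, known):
        # image id -> verified text (stored lowercase)
        self.known = {k: v.strip().lower() for k, v in known.items()}
        # image id -> tally of answers given for a not-yet-known word
        self.votes = defaultdict(Counter)

    def submit(self, known_id, known_answer, unknown_id, unknown_answer):
        """Grant access based only on the known word; record the other as a vote."""
        if known_answer.strip().lower() != self.known[known_id]:
            return False  # failed the control word, so ignore the other answer
        guess = unknown_answer.strip().lower()
        if unknown_id not in self.known and guess:
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= PROMOTE_AFTER:
                self.known[unknown_id] = guess  # promote to a known word
        return True


checker = WordPairChecker({"img1": "upon"})
for _ in range(3):
    checker.submit("img1", "Upon", "img2", "morning")
print(checker.known)
```

Note that access depends only on the control word, so nobody is locked out for misreading a word the system itself cannot read.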