Deciphering old texts, one woozy, curvy word at a time

By Guy Gugliotta, New York Times | Updated: 11 June 2012 16:40 IST
In the old days, anybody interested in watching a movie would have to brave the traffic, stand in long queues, probably days in advance, and buy a ticket to the movie. Not anymore. Now, you just visit one of the many movie-ticket Web sites and book a ticket up to two hours before your show starts.

But before taking the money, the Web site might first present the reader with two sets of wavy, distorted letters and ask for a transcription. These things are called Captchas, and only humans can read them. Captchas ensure that robots do not hack secure Web sites.

What Web readers do not know, however, is that they have also been enlisted in a project to transform an old book, magazine, newspaper or pamphlet into an accurate, searchable and easily sortable computer text file.

One of the wavy words quite likely came from a digitized image from an old, musty text, and while the original page has already been scanned into an online database, the scanning programs made a lot of mistakes. Mets fans and other Web site users are correcting them. Buy a ticket to the ballgame, help preserve history.

The set of software tools that accomplishes this feat is called reCaptcha and was developed by a team of researchers led by Luis von Ahn, a computer scientist at Carnegie Mellon University.

Its pilot project was to clean up the digitized archive of The New York Times. Today it has become the principal method used by Google to authenticate text in Google Books, its vast project to digitize and disseminate rare and out-of-print texts on the Internet.

Digitization is normally a three-stage process: create a photographic image of the text, also known as a bitmap; encode the text in a compact, easily handled and searchable form using optical character recognition software, commonly called O.C.R.; and, finally, correct the mistakes.

Today's technology makes the first two steps relatively straightforward. The third, however, can be extremely difficult. For vintage 19th-century texts in English, O.C.R. programs mess up or miss 10 percent to 30 percent of the words. Only humans can fix the errors. The standard method, called key and verify, uses two transcribers to type the text independently and compares the results. This is time-consuming and extremely expensive.

But in 2006, Dr. von Ahn's team figured out a way around this obstacle. The ubiquitous Captchas, familiar to even the most casual Web user, were the perfect tools. Captchas, short for "completely automated public Turing test to tell computers and humans apart," are impossible for machines to decipher, but easy for humans. (The test is named for the British computer pioneer Alan Turing.)

Dr. von Ahn's group estimated that humans around the world decode at least 200 million Captchas per day, at 10 seconds per Captcha. This works out to more than 500,000 hours per day -- a lot of applied brainpower being spent on what Dr. von Ahn regards as a fundamentally mindless exercise.
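
The arithmetic behind that estimate is easy to check. The following is a minimal sketch that uses only the two figures quoted in the estimate; the function and variable names are illustrative, not part of reCaptcha.

```python
# Figures quoted in the article (the estimate by Dr. von Ahn's group).
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10


def human_hours_per_day(captchas: int, seconds_each: int) -> float:
    """Total human time spent solving Captchas per day, in hours."""
    return captchas * seconds_each / 3600  # 3,600 seconds in an hour


hours = human_hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

Two billion seconds a day comes to roughly 556,000 hours, which is the half-million-hour figure in round numbers.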

"So we asked, 'Can we do something useful with this time?' " Dr. von Ahn recalled in a telephone interview. Instead of making Captchas out of random words printed in a woozy way, why not ask Web users to translate problem words from archival texts?

By Dr. von Ahn's estimate, reCaptcha is being used by 70 percent to 90 percent of Web sites that have Captchas -- including Ticketmaster, Facebook and local bank branches.

Google bought Dr. von Ahn's start-up in 2009 -- he will not say how much it paid -- and put it to work on Google Books. He says "several million" words are being translated every day.

The Times, published since 1851, had already optically transcribed its archive when it contacted Dr. von Ahn. Robert Larson, the company's vice president for search products, said the paper had "looked at various ways" to edit the text, "but Luis's method was faster and cheaper."

Page images, particularly those printed before 1900, are loaded with smudges, stains, watermarks and crooked type, all of which give O.C.R. programs fits. To fix the errors, Dr. von Ahn uses a number of programs, which when applied in the proper sequence magically transform troubled passages into easy-to-read prose.

The first step is done in-house. Two different O.C.R. programs scan the photographic image. Both will make mistakes, but not necessarily the same mistakes.

ReCaptcha flags as "suspicious" any word that is deciphered differently by the two programs or that does not appear in an English dictionary. The dictionary catches words that are misspelled the same way by both O.C.R. programs. Other programs examine the words on either side of the suspect word and make another guess based on that analysis.
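
The flagging rule can be sketched in a few lines. This is an illustration only, not reCaptcha's actual code: it assumes the two O.C.R. outputs are already aligned word for word, and the name `flag_suspicious`, the sample words and the tiny dictionary are all hypothetical.

```python
def flag_suspicious(ocr_a, ocr_b, dictionary):
    """Return the positions of words that should be sent to human readers.

    A word is flagged when the two O.C.R. readings disagree, or when
    they agree on something that is not in the dictionary.
    Assumes ocr_a and ocr_b are aligned, equal-length word lists.
    """
    suspicious = []
    for position, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        a, b = a.lower(), b.lower()
        if a != b or a not in dictionary:
            suspicious.append(position)
    return suspicious


# Hypothetical example: both programs misread "quick" the same way,
# and they disagree on "brown".
dictionary = {"the", "quick", "brown", "fox"}
ocr_a = ["the", "qnick", "brown", "fox"]
ocr_b = ["the", "qnick", "brovvn", "fox"]
print(flag_suspicious(ocr_a, ocr_b, dictionary))  # prints [1, 2]
```

The second word shows why the dictionary check matters: the two programs agree with each other, so only the dictionary lookup catches the error.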

Then each suspicious word is turned into a Captcha. It is crucial to understand that the Captcha is a distorted version of the word as printed in the original photographic image. It is not made from the O.C.R.'s imagined translation, which is often unintelligible. The unknown word is then paired with a second Captcha word whose correct translation is already known. This is the "control."

Several Web users seeking entry to secure sites are then given both words and asked to decipher them separately.

A correct answer for the control word proves that the user is a human and not a machine. Answers for the unknown word are compared with the O.C.R. guesses and the context analysis. If the system is satisfied that the answer is correct, then the game is over.

Dr. von Ahn acknowledged that some words cannot be transcribed, usually because the original text is torn or damaged in some other way. If enough users fail to identify an unknown, the word is deemed to be indecipherable and is marked as such.
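
The checking loop described above might look something like the following. This is a rough sketch under stated assumptions, not the system's real logic: the half-vote given to each O.C.R. guess, the acceptance score of 2.5 and the six-attempt limit are all assumed values chosen for illustration, and `resolve_unknown` is a hypothetical name.

```python
from collections import Counter


def resolve_unknown(responses, control_truth, ocr_guesses=(),
                    accept_score=2.5, max_attempts=6):
    """Decide the fate of one unknown word.

    responses: (control_answer, unknown_answer) pairs, one per Web user.
    Returns ("accepted", word), ("indecipherable", None) or
    ("pending", None) if more answers are needed.
    All thresholds and weights are assumptions made for this sketch.
    """
    votes = Counter()
    for guess in ocr_guesses:
        votes[guess.lower()] += 0.5  # assumed: an O.C.R. guess is half a vote

    attempts = 0
    for control_answer, unknown_answer in responses:
        # A wrong control word means the user is treated as a machine,
        # so the answer for the unknown word is ignored.
        if control_answer.strip().lower() != control_truth.lower():
            continue
        attempts += 1
        votes[unknown_answer.strip().lower()] += 1.0
        word, score = votes.most_common(1)[0]
        if score >= accept_score:
            return ("accepted", word)
        if attempts >= max_attempts:
            return ("indecipherable", None)
    return ("pending", None)


# Hypothetical example: two humans agree with one of the O.C.R. guesses.
responses = [("upon", "morning"), ("upon", "morning"), ("xxxx", "spam")]
result = resolve_unknown(responses, "upon", ocr_guesses=["morning", "rnorning"])
print(result)  # prints ('accepted', 'morning')
```

In this example the word is accepted after the second human answer, so the third response, which fails the control word, is never counted.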

ReCaptcha also fails badly on cursive, Dr. von Ahn said, adding that "nobody reads handwriting anymore." And reCaptcha so far translates only English words, even though many reCaptcha Web sites have overseas clients whose users are not necessarily English speakers.

With all these constraints, reCaptcha nevertheless achieves an accuracy rate above 99 percent, which compares favorably with professional human transcribers. And Dr. von Ahn is convinced that performance will improve with experience, of which there will be no shortage.

"We'll be going for a long time," he said. "There's a lot of printed material out there."
