Google Uses OCR to Index Scanned PDF Files

Google has started to index the full text of "scanned" PDF files using a technique called OCR (optical character recognition). "Every day, people all over the world post scanned documents online -- everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world," says Evin Levey.

The great thing about the new feature is that you won't notice it unless you look for it, yet it improves the quality of Google's search results. Google doesn't mention how many of the 300 million indexed PDF files were converted into text, but you can see some examples if you search for [repairing aluminium wiring] or [Steady success in a volatile world] and click on "View as HTML".


Google sponsors an open-source OCR system called OCRopus, and it's likely that Google used it for indexing PDF files from the web. "OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities. (...) It's initially intended for high-throughput, high-volume document conversion efforts. We expect that it will also be an excellent OCR system for many other applications."
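
To get a feel for what "pluggable" means in that description, here is a toy sketch in Python. It is not OCRopus code and does not use its API: every name in it (OcrPipeline, drop_blank_rows, read_region, snap_to_vocabulary, VOCABULARY) is made up for illustration, the "scanned page" is just a list of noisy strings standing in for an image, and the "language model" is a simple closest-word lookup rather than a statistical model. The only point is that each of the three stages is a separate function that could be swapped out.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Callable, List

# Hypothetical stage signatures for illustration; not OCRopus's real interfaces.
LayoutAnalyzer = Callable[[List[str]], List[str]]  # page rows -> text-line regions
CharRecognizer = Callable[[str], str]              # line region -> raw characters
LanguageModel = Callable[[str], str]               # raw characters -> corrected text

# Tiny stand-in vocabulary; a real system would use a statistical model.
VOCABULARY = ["repairing", "aluminium", "wiring", "steady", "success"]


def drop_blank_rows(page: List[str]) -> List[str]:
    """Toy layout analysis: keep only the rows that contain something."""
    return [row.strip() for row in page if row.strip()]


def read_region(region: str) -> str:
    """Toy recognizer: the 'image' is already noisy text, so only normalize case."""
    return region.lower()


def snap_to_vocabulary(text: str) -> str:
    """Toy language model: replace each token with its closest known word, if any."""
    corrected = []
    for token in text.split():
        match = get_close_matches(token, VOCABULARY, n=1, cutoff=0.7)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


@dataclass
class OcrPipeline:
    """Chains three interchangeable stages: layout, recognition, language model."""
    layout: LayoutAnalyzer
    recognizer: CharRecognizer
    language_model: LanguageModel

    def run(self, page: List[str]) -> str:
        regions = self.layout(page)
        raw_lines = [self.recognizer(region) for region in regions]
        return "\n".join(self.language_model(line) for line in raw_lines)


# Assemble the pipeline from the toy stages and run it on a fake "scanned" page.
pipeline = OcrPipeline(drop_blank_rows, read_region, snap_to_vocabulary)
scanned_page = ["", "Rep4iring alum1nium wiring", "   ", "Ste4dy success"]
print(pipeline.run(scanned_page))
```

Replacing any one stage, say a different recognizer for another script, only means passing a different function to OcrPipeline; the other two stages stay untouched.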
