The Quality of Google Book Search


Paul Duguid wrote an interesting article about Google Book Search in which he assessed the quality of the indexed editions and of the search results by searching for Laurence Sterne's "Tristram Shandy", an 18th-century novel. Duguid found that the Harvard edition of the book had many quality problems, including text that wasn't scanned properly. Google Book Search also doesn't distinguish between the volumes of a book, so it's hard to tell that the Stanford edition is actually the second volume.
Google may or may not be sucking the air out of other digitization projects, but like Project Gutenberg before, it is certainly sucking better–forgotten versions of classic texts from justified oblivion and presenting them as the first choice to readers. (...) The Google Books Project is no doubt an important, in many ways invaluable, project. It is also, on the brief evidence given here, a highly problematic one. Relying on the power of its search tools, Google has ignored elemental metadata, such as volume numbers. The quality of its scanning (and so we may presume its searching) is at times completely inadequate. The editions offered (by search or by sale) are, at best, regrettable. Curiously, this suggests to me that it may be Google's technicians, and not librarians, who are the great romanticisers of the book. Google Books takes books as a storehouse of wisdom to be opened up with new tools. They fail to see what librarians know: books can be obtuse, obdurate, even obnoxious things. As a group, they don't submit equally to a standard shelf, a standard scanner, or a standard ontology.

Patrick Leary, the author of the article Googling the Victorians (PDF), has a pragmatic response, as seen on O'Reilly Radar:
Mass digitization is all about trade-offs. All mass digitizing programs compromise textual accuracy and bibliographical meta-data so that they can afford to include many more texts at a reasonable cost in money and time. All texts in mass digitization collections are corrupt to some degree. Everything else being equal, the more limited the number of texts included in a digital collection, the more care can be lavished on each text. Assessing the balance of value involved in this trade-off, I think, is one of the main places where we part company. You conclude, on the basis of your inspection of these two volumes, that the corruption of texts like Tristram Shandy makes Google Books a "highly problematic" way of getting at the meanings of the books it includes. By contrast, while acknowledging how unfortunate are some of the problems you mention, I believe that the sheer scale of the project and the power of its search function together far outweigh these "problematic" elements.

When scanning and indexing millions of books, it's difficult to assess the quality of each edition, and Google Book Search's main goal is to let you discover books you can borrow or buy later on. Still, Google could add an option to rate the quality of each digitized book, or build algorithms that detect flaws and differences between editions. That way, the next time you search for Tristram Shandy, all the editions would be clustered and the best one would come up first.
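To make the clustering idea concrete, here is a minimal sketch in Python. It is not how Google Book Search works: the `Edition` record, the user ratings, and the rule of grouping editions by a normalized title are all assumptions made up for illustration. A real system would need better signals, such as OCR confidence or volume metadata.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Edition:
    """Hypothetical record for one scanned edition."""
    title: str
    library: str
    ratings: List[int] = field(default_factory=list)  # user ratings, 1 to 5


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def average_rating(edition: Edition) -> float:
    """Mean user rating; unrated editions score 0.0 and sort last."""
    if not edition.ratings:
        return 0.0
    return sum(edition.ratings) / len(edition.ratings)


def cluster_and_rank(editions: List[Edition]) -> Dict[str, List[Edition]]:
    """Group editions by normalized title, best-rated first in each group."""
    clusters = defaultdict(list)
    for edition in editions:
        clusters[normalize_title(edition.title)].append(edition)
    for group in clusters.values():
        group.sort(key=average_rating, reverse=True)
    return dict(clusters)


# Made-up sample data: the ratings are invented for the example.
editions = [
    Edition("Tristram Shandy", "Harvard", [2, 1]),
    Edition("Tristram Shandy.", "Stanford", [4, 5]),
]
clusters = cluster_and_rank(editions)
best = clusters["tristram shandy"][0]
print(best.library)
```

Both sample editions land in the same cluster despite the stray period in one title, and the edition with the higher average rating is listed first.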
