Detecting Near-duplicate Documents

Approximately 30% of the pages on the web are (near) duplicates. Google has a patent for some improved duplicate and near duplicate detection techniques.

"From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content."

One idea might be indexing the keywords in the documents and comparing the percentage of terms shared by the two documents, but that highly inefficient.

Or you can try to compute the edit distance (Damerau-Levenshtein distance) between the two documents. The edit distance between two input strings is the minimum cost of a sequence of edit operations (substitution of a symbol in another incorrect symbol, insertion of an extraneous symbol, deletion of a symbol, and transpositions ) needed to change one input string into the other string.

A much better method for detecting duplicate and near-duplicate documents involve generating "fingerprints" (hashes) for elements (paragraphs, sentences, words, shingles) of documents. Two documents would be considered to be near-duplicates if they share more than a predetermined number of fingerprints.

A k-shingle is a sequence of k consecutive words from a documents. If S(A) is the set of shingles contained by A, we can compute the resemblance of A and B like this: |S(A)VS(B)| divided by |S(A)US(B)|. The problem is that the intersection is hard to compute, so it has to be estimated.

Learn more from Andrei Broder's course at Princeton University [PDF, html version].

"Search without a box" - A chat with Andrei Broder

Labels

Web Search Gmail Google Docs Mobile YouTube Google Maps Google Chrome User interface Tips iGoogle Social Google Reader Traffic Making Devices cpp programming Ads Image Search Google Calendar tips dan trik Google Video Google Translate web programming Picasa Web Albums Blogger Google News Google Earth Yahoo Android Google Talk Google Plus Greasemonkey Security software download info Firefox extensions Google Toolbar Software OneBox Google Apps Google Suggest SEO Traffic tips Book Search API Acquisitions InOut Visualization Web Design Method for Getting Ultimate Traffic Webmasters Google Desktop How to Blogging Music Nostalgia orkut Google Chrome OS Google Contacts Google Notebook SQL programming Google Local Make Money Windows Live GDrive Google Gears April Fools Day Google Analytics Google Co-op visual basic Knowledge java programming Google Checkout Google Instant Google Bookmarks Google Phone Google Trends Web History mp3 download Easter Egg Google Profiles Blog Search Google Buzz Google Services Site Map for Ur Site game download games trick Google Pack Spam cerita hidup Picasa Product's Marketing Universal Search FeedBurner Google Groups Month in review Twitter Traffic AJAX Search Google Dictionary Google Sites Google Update Page Creator Game Google Finance Google Goggles Google Music file download Annoyances Froogle Google Base Google Latitude Google Voice Google Wave Google Health Google Scholar PlusBox SearchMash teknologi unik video download windows Facebook Traffic Social Media Marketing Yahoo Pipes Google Play Google Promos Google TV SketchUp WEB Domain WWW World Wide Service chord Improve Adsence Earning jurnalistik sistem operasi AdWords Traffic App Designing Tips and Tricks WEB Hosting linux How to Get Hosting Linux Kernel WEB Errors Writing Content award business communication ubuntu unik