Detecting Near-duplicate Documents

Approximately 30% of the pages on the web are duplicates or near-duplicates. Google has a patent for some improved duplicate and near-duplicate detection techniques.

"From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content."

One idea might be to index the keywords in the documents and compare the percentage of terms shared by the two documents, but that is highly inefficient.
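To make the keyword idea concrete, here is a minimal Python sketch. It assumes that "percentage of terms shared" means shared terms divided by all distinct terms in either document; the function names `terms` and `shared_term_percentage` and the simple tokenizer are illustrative choices, not something the original method specifies.

```python
import re


def terms(text):
    """Return the set of lowercased word tokens in text (a naive tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_term_percentage(doc_a, doc_b):
    """Percentage of distinct terms that appear in both documents.

    Assumption: the denominator is the number of distinct terms found in
    either document; other definitions are possible.
    """
    a, b = terms(doc_a), terms(doc_b)
    if not a and not b:
        return 100.0  # two empty documents are treated as identical
    return 100.0 * len(a & b) / len(a | b)


if __name__ == "__main__":
    print(shared_term_percentage("the quick brown fox", "the quick red fox"))
```

Every pair of documents has to be compared this way, and word order is ignored entirely, which is part of why this approach scales poorly and can be fooled by pages that reuse the same vocabulary.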

Alternatively, you can compute the edit distance (Damerau-Levenshtein distance) between the two documents. The edit distance between two input strings is the minimum cost of a sequence of edit operations (substituting one symbol for another, inserting a symbol, deleting a symbol, or transposing two adjacent symbols) needed to change one string into the other.
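The definition above can be sketched with dynamic programming. Note that this is the restricted variant usually called optimal string alignment, in which each substring is edited at most once; it is not the unrestricted Damerau-Levenshtein distance. Every operation is assumed to cost 1, and the name `osa_distance` is just an illustrative choice.

```python
def osa_distance(s, t):
    """Edit distance with substitution, insertion, deletion and
    adjacent transposition, each costing 1 (optimal string alignment)."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i symbols of s and first j of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i symbols
    for j in range(n + 1):
        d[0][j] = j  # insert all j symbols
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent symbols
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


if __name__ == "__main__":
    print(osa_distance("kitten", "sitting"))
    print(osa_distance("ab", "ba"))
```

The table has (m + 1) x (n + 1) cells, so the work grows with the product of the two document lengths, and it has to be repeated for every pair of documents being compared.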

A much better method for detecting duplicate and near-duplicate documents involves generating "fingerprints" (hashes) for elements (paragraphs, sentences, words, shingles) of documents. Two documents would be considered to be near-duplicates if they share more than a predetermined number of fingerprints.
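As a rough illustration of the fingerprint idea, the sketch below hashes each paragraph of a document and counts how many hashes two documents have in common. The choice of paragraphs as the element, SHA-1 as the hash, blank lines as paragraph separators, and the `min_shared` threshold are all assumptions made for this example.

```python
import hashlib


def fingerprints(document):
    """Return one SHA-1 fingerprint per non-empty paragraph.

    Paragraphs are assumed to be separated by blank lines; case and
    whitespace are normalized so trivial edits do not change the hash.
    """
    prints = set()
    for paragraph in document.split("\n\n"):
        normalized = " ".join(paragraph.lower().split())
        if normalized:
            prints.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return prints


def near_duplicates(doc_a, doc_b, min_shared=2):
    """True if the documents share more than min_shared fingerprints.

    min_shared is a hypothetical threshold; a real system would tune it.
    """
    shared = fingerprints(doc_a) & fingerprints(doc_b)
    return len(shared) > min_shared


if __name__ == "__main__":
    a = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    b = "First paragraph.\n\nSecond   PARAGRAPH.\n\nThird paragraph.\n\nAn ad."
    print(near_duplicates(a, b))
```

Because each document is reduced to a set of fixed-size hashes once, comparing two documents becomes a set intersection rather than a character-by-character alignment.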

A k-shingle is a sequence of k consecutive words from a document. If S(A) is the set of shingles contained in A, we can compute the resemblance of A and B like this: |S(A) ∩ S(B)| divided by |S(A) ∪ S(B)|. The problem is that computing the intersection exactly is expensive for large collections of documents, so it has to be estimated.
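The following Python sketch computes the resemblance exactly from the shingle sets, and then estimates it with min-wise hashing, which is one common estimation approach and is assumed here rather than taken from the text above. The shingle size `k=3`, the number of hash functions `num_hashes=64`, and the use of seeded SHA-1 as a stand-in for a family of hash functions are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-shingles (tuples of k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(doc_a, doc_b, k=3):
    """Exact resemblance: |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    sa, sb = shingles(doc_a, k), shingles(doc_b, k)
    if not sa and not sb:
        return 1.0  # neither document is long enough to have a shingle
    return len(sa & sb) / len(sa | sb)


def min_hash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the smallest hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            hashlib.sha1(f"{seed}:{' '.join(s)}".encode("utf-8")).digest()
            for s in shingle_set
        ))
    return signature


def estimated_resemblance(doc_a, doc_b, k=3, num_hashes=64):
    """Estimate resemblance as the fraction of matching signature slots."""
    sa, sb = shingles(doc_a, k), shingles(doc_b, k)
    if not sa or not sb:
        return 1.0 if sa == sb else 0.0
    sig_a = min_hash_signature(sa, num_hashes)
    sig_b = min_hash_signature(sb, num_hashes)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / num_hashes


if __name__ == "__main__":
    a = "a rose is a rose is a rose"
    b = "a rose is a flower which is a rose"
    print(resemblance(a, b))            # exact value, here 3/7
    print(estimated_resemblance(a, b))  # approximation of the same value
```

The appeal of the estimate is that each document is summarized once by a short, fixed-length signature, so two documents can be compared without touching their full shingle sets. Using more hash functions makes the estimate closer to the exact resemblance at the cost of a longer signature.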

Learn more from Andrei Broder's course at Princeton University [PDF, html version].

"Search without a box" - A chat with Andrei Broder
