
Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Presenter: Tsai Tzung Ruei

Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke

SIGIR. 2008

National Yunlin University of Science and Technology


Outline

Motivation
Objective
Methodology
Experiments
Conclusion
Comments

Motivation

Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

Objective

Even when exact duplicates are avoided during the collection of Web archives, near duplicates frequently slip into the corpus. The objective is to detect these near duplicates robustly and efficiently.

Methodology

SPOT SIGNATURE EXTRACTION

MATCHING


(Figure: overview diagram; labels: Web, Database, document)


Methodology

SPOT SIGNATURE EXTRACTION

A = {a_j(d_j, c_j)}: a set of antecedents a_j, each with a spot distance d_j and a chain length c_j

Example

Antecedents: a(1,2), an(1,2), the(1,2), and is(1,2)

“At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.”

Result

S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
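The extraction step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `spot_signatures`, the tokenizer, and the short `STOPWORDS` list are all assumptions made for this example. It treats each antecedent as an anchor and chains the next non-stopword tokens that follow it, using the spot distance and chain length given on the slide.

```python
import re

# Antecedents from the slide, each mapped to (spot distance, chain length).
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = set(ANTECEDENTS) | {"to", "that", "of", "and", "for", "on", "from", "at"}


def spot_signatures(text, antecedents=ANTECEDENTS, stopwords=STOPWORDS):
    """Return spot signatures in document order as 'antecedent:word:word' strings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    signatures = []
    for i, token in enumerate(tokens):
        if token not in antecedents:
            continue
        distance, chain_length = antecedents[token]
        # Non-stopword tokens that follow the antecedent.
        content = [t for t in tokens[i + 1:] if t not in stopwords]
        # Take every distance-th content word until the chain is full.
        chain = content[distance - 1::distance][:chain_length]
        if len(chain) == chain_length:
            signatures.append(":".join([token] + chain))
    return signatures


SAMPLE = (
    "At a rally to kick off a weeklong campaign for the South Carolina primary, "
    "Obama tried to set the record straight from an attack circulating widely on "
    "the Internet that is designed to play into prejudices against Muslims and "
    "fears of terrorism."
)

for signature in spot_signatures(SAMPLE):
    print(signature)
```

Under the assumed stopword list, the sample sentence should yield the seven signatures of the slide's result set. A different stopword list would change which words are skipped and therefore which chains are produced.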


Methodology

SPOT SIGNATURE MATCHING

Jaccard Similarity for Sets

Generalization for Multi-Sets
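As a small sketch of the two measures named on this slide: the Jaccard similarity of two sets is |A ∩ B| / |A ∪ B|, and the usual multi-set generalization sums, over all signatures, the minimum of the two frequencies and divides by the sum of the maxima. The function names below are hypothetical, and returning 1.0 for two empty inputs is an assumption of this example.

```python
from collections import Counter


def jaccard_set(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_multiset(a, b):
    """Multi-set Jaccard: sum of min frequencies over sum of max frequencies."""
    a, b = Counter(a), Counter(b)
    # Counter's | keeps the max count per key, & keeps the min count per key.
    union_size = sum((a | b).values())
    if union_size == 0:
        return 1.0  # assumption: two empty multi-sets count as identical
    return sum((a & b).values()) / union_size


print(jaccard_set({"a", "b", "c"}, {"b", "c", "d"}))
print(jaccard_multiset({"a": 2, "b": 1}, {"a": 1, "b": 1, "c": 1}))
```

`Counter` accepts either a list of signatures or a mapping from signature to frequency, so both forms work as input to `jaccard_multiset`.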


Methodology

SPOT SIGNATURE MATCHING

(Figure: spot signatures are split into partitions; labels: Inverted Index Pruning, Jaccard Similarity for Sets)


Methodology

Optimal Partitioning
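The idea behind partitioning by length can be illustrated with a short sketch. Because the intersection of two signature multi-sets is no larger than the smaller one and the union is no smaller than the larger one, the similarity is bounded by min(|A|, |B|) / max(|A|, |B|). Documents whose lengths differ too much can therefore never reach the threshold τ. The construction below is one simple way to build length partitions so that any pair that could reach τ lies in the same or in adjacent partitions; it is an assumption of this example and may differ from the paper's exact construction. The names `partition_boundaries` and `partition_index` are hypothetical.

```python
import math
from bisect import bisect_right
from fractions import Fraction


def length_bound(len_a, len_b):
    """Upper bound on Jaccard similarity that follows from the two lengths alone."""
    return Fraction(min(len_a, len_b), max(len_a, len_b))


def partition_boundaries(max_length, tau):
    """Return lower boundaries p_0 < p_1 < ... covering lengths 1..max_length.

    Partition k holds lengths in [p_k, p_(k+1) - 1], with
    p_(k+1) = floor(p_k / tau) + 1 (illustrative construction).
    """
    tau = Fraction(str(tau))  # exact arithmetic avoids float rounding at boundaries
    boundaries = [1]
    while boundaries[-1] <= max_length:
        boundaries.append(math.floor(boundaries[-1] / tau) + 1)
    return boundaries


def partition_index(length, boundaries):
    """Index of the partition whose range contains the given length."""
    return bisect_right(boundaries, length) - 1


boundaries = partition_boundaries(20, "0.8")
print(boundaries)
print(partition_index(13, boundaries), partition_index(14, boundaries))
```

With this layout, a matcher only needs to compare a document against documents in its own partition and the neighboring one; every other pair is ruled out by `length_bound`.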

Methodology

Example

d1 = {s1:5, s2:4, s3:4}, with |d1| = 13
d2 = {s1:8, s2:4}, |d2| = 12
d3 = {s1:4, s2:5, s3:5}, |d3| = 14
τ = 0.8
δ1 = 0
δ2 = |d1| − |d3| = −1
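The three example documents can be checked by hand with a brute-force sketch. It does not reproduce the δ bookkeeping of the inverted index pruning shown on the slide; it only applies the length bound min(|A|, |B|) / max(|A|, |B|) as a cheap filter and then computes the exact multi-set Jaccard similarity for each surviving pair. The function name `near_duplicates` is hypothetical, and documents are assumed to be non-empty.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Example documents from the slide: signature -> frequency.
DOCS = {
    "d1": Counter(s1=5, s2=4, s3=4),
    "d2": Counter(s1=8, s2=4),
    "d3": Counter(s1=4, s2=5, s3=5),
}
TAU = Fraction(4, 5)  # threshold 0.8, kept exact to avoid float rounding


def near_duplicates(docs, tau):
    """Return (name_a, name_b, similarity) for every pair with similarity >= tau."""
    results = []
    for (name_a, a), (name_b, b) in combinations(docs.items(), 2):
        len_a, len_b = sum(a.values()), sum(b.values())
        # Length-based upper bound: skip pairs that can never reach tau.
        if Fraction(min(len_a, len_b), max(len_a, len_b)) < tau:
            continue
        # Exact multi-set Jaccard: sum of min counts over sum of max counts.
        similarity = Fraction(sum((a & b).values()), sum((a | b).values()))
        if similarity >= tau:
            results.append((name_a, name_b, similarity))
    return results


for name_a, name_b, similarity in near_duplicates(DOCS, TAU):
    print(name_a, name_b, float(similarity))
```

All three pairs pass the length filter here (12/13, 13/14, and 12/14 are each at least 0.8), but only d1 and d3 reach the threshold: their similarity is 12/15 = 0.8, while d1 and d2 score 9/16 and d2 and d3 score 8/18.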



Experiments

Gold Set of Near Duplicate News Articles
  SpotSigs vs. Shingling
  Choice of Spot Signatures
  SpotSigs vs. Hashing

TREC WT10g
  SpotSigs vs. Hashing

Experiments

Gold Set of Near Duplicate News Articles

(Figures: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing)


Experiments

TREC WT10g

(Figure: SpotSigs vs. Hashing)


Conclusion

MAJOR CONTRIBUTION

SpotSigs proved to provide both more robust signatures and highly efficient deduplication compared with various state-of-the-art approaches.

FUTURE WORK

Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.


Comments

Advantage

The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient.

Drawback

…..

Application

Information retrieval
