Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections

Presenter: Tsai Tzung Ruei Authors: Martin Theobald, Jonathan Siddharth, and Andreas Paepcke

SIGIR. 2008

國立雲林科技大學 (National Yunlin University of Science and Technology)

Outline

Motivation
Objective
Methodology
Experiments
Conclusion
Comments

Motivation

Detecting near-duplicate documents and records in large data sets is a long-standing problem. Syntactically, near duplicates are pairs of items that are very similar along some dimensions, but different enough that simple byte-by-byte comparisons fail.

Objective

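Exact duplicates are easy to avoid during the collection of Web archives, but near duplicates frequently slip into the corpus. The objective is to detect these near duplicates robustly and efficiently in large Web collections.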

Methodology

[Figure: system overview; labels: Web, document, SPOT SIGNATURE EXTRACTION, MATCHING, Database]

Methodology

SPOT SIGNATURE EXTRACTION

Spot signatures are defined by a set A = {a_j(d_j, c_j)}, where a_j is an antecedent, d_j its spot distance, and c_j its chain length.

Example

Antecedents: a(1,2), an(1,2), the(1,2) and is(1,2)

“At a rally to kick off a weeklong campaign for the South Carolina primary, Obama tried to set the record straight from an attack circulating widely on the Internet that is designed to play into prejudices against Muslims and fears of terrorism.”

Result

S = {a:rally:kick, a:weeklong:campaign, the:south:carolina, the:record:straight, an:attack:circulating, the:internet:designed, is:designed:play}
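The extraction in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `spot_signatures`, the regex tokenizer, and the small stopword list are assumptions made for this sketch, and the spot distance d is read here as "take every d-th non-stopword token after the antecedent".

```python
import re

def spot_signatures(text, antecedents, stopwords):
    """Return spot signatures as 'antecedent:tok1:...:tokc' strings.

    antecedents maps each antecedent word to (spot distance d, chain length c).
    Stopwords are skipped when building the chain (assumed behavior).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    signatures = []
    for i, tok in enumerate(tokens):
        if tok not in antecedents:
            continue
        d, c = antecedents[tok]
        chain, seen = [], 0
        for nxt in tokens[i + 1:]:
            if nxt in stopwords:
                continue
            seen += 1
            if seen % d == 0:          # keep every d-th non-stopword
                chain.append(nxt)
                if len(chain) == c:
                    break
        if len(chain) == c:            # drop incomplete chains at the end
            signatures.append(":".join([tok] + chain))
    return signatures

# Hypothetical stopword list, just large enough for the slide's sentence.
STOPWORDS = {"a", "an", "the", "is", "to", "that", "of", "for",
             "from", "on", "into", "and", "at", "off"}
ANTECEDENTS = {"a": (1, 2), "an": (1, 2), "the": (1, 2), "is": (1, 2)}

SENTENCE = ("At a rally to kick off a weeklong campaign for the South "
            "Carolina primary, Obama tried to set the record straight from "
            "an attack circulating widely on the Internet that is designed "
            "to play into prejudices against Muslims and fears of terrorism.")

for sig in spot_signatures(SENTENCE, ANTECEDENTS, STOPWORDS):
    print(sig)
```

Under these assumptions the sketch prints the seven signatures listed in the slide's result, in document order.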

Methodology

SPOT SIGNATURE MATCHING

Jaccard Similarity for Sets

Generalization for Multi-Sets
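The formulas on this slide did not survive extraction. As a hedged sketch, assuming the standard definitions: for sets, sim(A, B) = |A ∩ B| / |A ∪ B|; for multi-sets, the sum of per-signature minimum frequencies divided by the sum of per-signature maximum frequencies. The function names below are illustrative only.

```python
from collections import Counter

def jaccard_sets(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_multisets(a, b):
    """Multi-set generalization: sum of min counts over sum of max counts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

A = Counter({"a:rally:kick": 2, "the:record:straight": 1})
B = Counter({"a:rally:kick": 1, "the:record:straight": 1, "is:designed:play": 1})

print(jaccard_sets(set(A), set(B)))   # 2 shared / 3 distinct signatures
print(jaccard_multisets(A, B))        # (1 + 1 + 0) / (2 + 1 + 1) = 0.5
```

When every frequency is 0 or 1, the multi-set form reduces to the set form.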

Methodology

SPOT SIGNATURE MATCHING

[Figure: matching pipeline; labels: SPOT SIGNATURE, partition (three shown), Inverted Index Pruning, Jaccard Similarity for Sets]

Methodology

Optimal Partitioning
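The slide body for this step is missing, so the following is only a simplified sketch of length-based partitioning, not the paper's optimal partitioning procedure. It rests on one fact that holds for multi-set Jaccard: sim(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), where |A| is the total signature count. If each partition boundary is at least 1/τ times the previous one, documents two or more partitions apart cannot reach the threshold τ, so only the same and the adjacent partition need to be compared. All function names are hypothetical.

```python
import bisect
import math

def length_bound(len_a, len_b):
    """Upper bound on multi-set Jaccard using only total signature counts."""
    return min(len_a, len_b) / max(len_a, len_b)

def partition_boundaries(max_len, tau):
    """Boundaries b_0 < b_1 < ... with b_{k+1} >= b_k / tau (geometric growth)."""
    bounds = [1]
    while bounds[-1] <= max_len:
        bounds.append(max(bounds[-1] + 1, math.ceil(bounds[-1] / tau)))
    return bounds

def partition_of(length, bounds):
    """Index k such that bounds[k] <= length < bounds[k + 1]."""
    return bisect.bisect_right(bounds, length) - 1

bounds = partition_boundaries(max_len=14, tau=0.8)
print(bounds)
print(partition_of(13, bounds), partition_of(14, bounds))
```

The paper's version additionally balances partitions; this sketch only shows why a length-based split is safe.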

Methodology

Inverted Index Pruning

Example

d1 = {s1:5, s2:4, s3:4}, |d1| = 13
d2 = {s1:8, s2:4}, |d2| = 12
d3 = {s1:4, s2:5, s3:5}, |d3| = 14
τ = 0.8
δ1 = 0
δ2 = |d1| − |d3| = −1

[Figure: matching pipeline; labels: SPOT SIGNATURE, partition (three shown), Inverted Index Pruning, Jaccard Similarity for Sets]
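To make the example concrete, here is a simplified sketch of candidate generation over an inverted index, followed by a length-ratio filter and an exact multi-set Jaccard check. It does not reproduce the incremental δ1/δ2 bounds shown on the slide; the function names and the pair-ordering trick are assumptions of this sketch.

```python
from collections import defaultdict

def multiset_jaccard(a, b):
    """Sum of min counts over sum of max counts (multi-set Jaccard)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0

def near_duplicates(docs, tau):
    """Return {(doc_a, doc_b): similarity} for pairs with similarity >= tau."""
    # Inverted index: signature -> names of documents containing it.
    index = defaultdict(list)
    for name, sig in docs.items():
        for s in sig:
            index[s].append(name)

    pairs = {}
    for name, sig in docs.items():
        # Only documents sharing a signature are candidates; other > name
        # ensures each unordered pair is examined once.
        candidates = {o for s in sig for o in index[s] if o > name}
        for other in sorted(candidates):
            la, lb = sum(sig.values()), sum(docs[other].values())
            if min(la, lb) / max(la, lb) < tau:
                continue  # length bound already rules this pair out
            sim = multiset_jaccard(sig, docs[other])
            if sim >= tau:
                pairs[(name, other)] = sim
    return pairs

docs = {
    "d1": {"s1": 5, "s2": 4, "s3": 4},
    "d2": {"s1": 8, "s2": 4},
    "d3": {"s1": 4, "s2": 5, "s3": 5},
}
print(near_duplicates(docs, tau=0.8))
```

With the slide's data, d1 and d3 have similarity 12/15 = 0.8 and meet τ, while d1 and d2 (9/16) and d2 and d3 (8/18) fall below it.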

Experiments

Gold Set of Near Duplicate News Articles
  SpotSigs vs. Shingling
  Choice of Spot Signatures
  SpotSigs vs. Hashing

TREC WT10g
  SpotSigs vs. Hashing

Experiments

Gold Set of Near Duplicate News Articles

[Figures: SpotSigs vs. Shingling; Choice of Spot Signatures; SpotSigs vs. Hashing]

Experiments

TREC WT10g

[Figure: SpotSigs vs. Hashing]

Conclusion

MAJOR CONTRIBUTION

SpotSigs provides both more robust signatures and highly efficient deduplication compared to various state-of-the-art approaches.

FUTURE WORK

Future work will focus on efficient access to disk-based index structures, as well as generalizing the bounding approach toward other metrics such as Cosine.

Comments

Advantage: The SpotSigs deduplication algorithm runs “right out of the box” without the need for further tuning, while remaining exact and efficient.

Drawback: …..

Application: information retrieval