
Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University

  • Slide 1
  • Slide 2
  • Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University {huiyang, callan}@cs.cmu.edu
  • Slide 3
  • Presentation Outline: Introduction, Problem Definition, System Architecture, Feature-based Document Retrieval, Similarity-based Clustering, Evaluation and Experimental Results, Related Work, Conclusion and Demo
  • Slide 4
  • Introduction - I U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations. Some popular regulations attract hundreds of thousands of comments from the general public. In the late 1990s, the USDA's national organic standard required manually sorting over 250,000 public comments. In 2004, the EPA's proposed Mercury rule (USEPA-OAR-2002-0056) attracted over 530,000 email messages. Very labor-intensive.
  • Slide 5
  • Introduction - II Things become worse now: many online form letters are available, written by special interest groups, and modifying an electronic form letter is extremely easy. Special interest groups build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action, and provide information and tools to help each individual have the greatest possible impact once a group is assembled. Examples: Moveon.org, http://www.moveon.org; GetActive, http://www.getactive.org
  • Slide 6
  • Introduction - III Public comments will be near-duplicates if created from the same form letter. Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter. Goals: recognize near-duplicates and organize them, find the information added by an individual, and find the unique comments. Our research focuses on recognizing and organizing near-duplicates by text mining and clustering, as well as handling large amounts of data.
  • Slide 7
  • Problem Definition - I What is a near-duplicate? Pugh declared that two documents are near-duplicates if they have more than r features in common. Conrad et al. stated that two documents are near-duplicates if they share more than 80% of the terminology defined by human experts. Our definition is based on the ways near-duplicates are created.
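For concreteness, a minimal sketch of how these two earlier definitions could be tested, assuming documents are reduced to sets of features or expert terms; the exact feature sets and the precise overlap measures intended by Pugh and by Conrad et al. are assumptions here:

```python
def share_more_than_r_features(features_a: set, features_b: set, r: int) -> bool:
    """Pugh-style test: near-duplicates if the documents share more than r features."""
    return len(features_a & features_b) > r

def share_expert_terminology(terms_a: set, terms_b: set, expert_terms: set,
                             threshold: float = 0.8) -> bool:
    """Conrad et al.-style test (one possible reading): near-duplicates if the
    expert-defined terminology occurring in the two documents overlaps by more
    than the threshold."""
    a, b = terms_a & expert_terms, terms_b & expert_terms
    union = a | b
    return bool(union) and len(a & b) / len(union) > threshold
```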
  • Slide 8
  • Problem Definition - II Sources of public comments: from scratch (unique comments) or based on a form letter (exact- or near-duplicates). A category-based definition: Block edit, Key block, Minor change, Minor change + block edit, Block reordering, Exact.
  • Slide 9
  • Block Edit
  • Slide 10
  • Key Block
  • Slide 11
  • Minor Change
  • Slide 12
  • Minor Change + Block Edit
  • Slide 13
  • Block Reordering
  • Slide 14
  • Exact
  • Slide 15
  • Presentation Outline: Introduction, Problem Definition, System Architecture, Feature-based Document Retrieval, Similarity-based Clustering, Evaluation and Experimental Results, Related Work, Conclusion and Demo
  • Slide 16
  • System Architecture
  • Slide 17
  • Feature-based Document Retrieval - I Goal: obtain a duplicate-candidate set for each seed document and avoid working on the entire dataset. Steps: each seed document is broken into chunks, and the most informative words are selected from each chunk, i.e., a text span around the term t* with the minimal document frequency in the chunk (sketched below).
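A minimal sketch of this chunking step, assuming a chunk is a fixed-size block of terms and the most informative feature is a small text span around the chunk's minimum-document-frequency term t*; the chunk size, window width, and the df lookup are illustrative assumptions, not values from the slides:

```python
def select_query_features(doc_terms, df, chunk_size=50, window=4):
    """Break a seed document into chunks and, for each chunk, return a small
    text span around the term with the lowest document frequency (t*)."""
    features = []
    for start in range(0, len(doc_terms), chunk_size):
        chunk = doc_terms[start:start + chunk_size]
        if not chunk:
            continue
        # t*: the rarest term in the chunk (minimal document frequency)
        i, _ = min(enumerate(chunk), key=lambda x: df.get(x[1], 1))
        lo, hi = max(0, i - window), i + window + 1
        features.append(chunk[lo:hi])   # text span around t*
    return features
```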
  • Slide 18
  • Feature-based Document Retrieval - II Metadata extraction by information extraction: email senders, receivers, signatures, docket IDs, delivered dates, email relayers.
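The slide only names the fields extracted. As an illustration, the header fields (not signatures or docket IDs) could be pulled from a raw email message with the Python standard library; the actual information-extraction component used by the system is not described here:

```python
from email import message_from_string

def extract_metadata(raw_email: str) -> dict:
    """Pull basic header metadata from a raw email message."""
    msg = message_from_string(raw_email)
    return {
        "sender": msg.get("From"),
        "receiver": msg.get("To"),
        "date": msg.get("Date"),
        "relayers": msg.get_all("Received", []),
        # Docket IDs such as USEPA-OAR-2002-0056 could be matched in the body
        # with a pattern; signature extraction needs heuristics and is omitted.
    }
```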
  • Slide 19
  • Feature-based Document Retrieval - III Query Formulation #AND ( docketoar.20020056 router.moveon #OR(standards proposed by will harm thousands unborn children for coal plants should other cleaner alternative by 90 by with national standards available pollution control) )
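The structured query above nests the required metadata terms under #AND and the informative text spans under #OR. A hypothetical sketch of assembling such a query string (the operator syntax is copied from the example above; the helper name and argument layout are invented):

```python
def build_query(metadata_terms, feature_spans):
    """Assemble a structured #AND/#OR query from required metadata terms and
    the text spans extracted from the seed document."""
    spans = " ".join(" ".join(span) for span in feature_spans)
    return "#AND( {} #OR( {} ) )".format(" ".join(metadata_terms), spans)

# Example:
# build_query(["docketoar.20020056", "router.moveon"],
#             [["standards", "proposed", "by"], ["will", "harm", "thousands"]])
```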
  • Slide 20
  • Slide 21
  • Presentation Outline: Introduction, Problem Definition, System Architecture, Feature-based Document Retrieval, Similarity-based Clustering, Evaluation and Experimental Results, Related Work, Conclusion and Demo
  • Slide 22
  • Similarity-based Clustering - I Document dissimilarity is based on Kullback-Leibler (KL) divergence. KL divergence, a distributional similarity measure, is one way to measure the similarity of one document (one unigram distribution) given another. Clustering algorithm: soft, non-hierarchical clustering (partitioning); single-pass clustering with carefully selected seed documents each time; close to K-means, but with no need to define K beforehand.
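A small sketch of the KL-divergence dissimilarity between two documents treated as unigram distributions; the add-one smoothing over a shared vocabulary is an assumption, since the slides do not specify the smoothing scheme:

```python
import math
from collections import Counter

def kl_divergence(doc_a_tokens, doc_b_tokens):
    """KL divergence D(P_a || P_b) between smoothed unigram models of two docs."""
    vocab = set(doc_a_tokens) | set(doc_b_tokens)
    ca, cb = Counter(doc_a_tokens), Counter(doc_b_tokens)
    na = len(doc_a_tokens) + len(vocab)
    nb = len(doc_b_tokens) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (ca[w] + 1) / na      # P(w | doc_a), add-one smoothed
        q = (cb[w] + 1) / nb      # P(w | doc_b), add-one smoothed
        kl += p * math.log(p / q)
    return kl                      # 0 when the two distributions are identical
```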
  • Slide 23
  • Similarity-based Clustering - II (diagram: duplicate candidates partitioned into Dup Set 1 and Dup Set 2)
  • Slide 24
  • Adaptive Thresholding The cut-off threshold should differ across clusters. Documents in a cluster are sorted by their document-centroid similarity scores, and the sorted scores are sampled at 10-document intervals. If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval.
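A sketch of this adaptive thresholding under the reading above: scores are sorted in descending order, sampled every 10 documents, and the cut-off moves to the start of the first interval where the drop exceeds 5% of the initial threshold. The function and parameter names are illustrative:

```python
def adaptive_cutoff(scores, initial_threshold, interval=10, tolerance=0.05):
    """Return a cluster-specific cut-off from sorted (descending)
    document-centroid similarity scores."""
    cutoff = initial_threshold
    for start in range(0, len(scores) - interval, interval):
        drop = scores[start] - scores[start + interval]
        if drop > tolerance * initial_threshold:
            cutoff = scores[start]   # new cut-off at the beginning of the interval
            break
    return cutoff
```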
  • Slide 25
  • Does Feature-based Document Retrieval Help? It is fairly efficient for large clusters: it cuts the number of documents that need to be clustered from the size of the entire dataset down to a reasonable number (536,975 -> 10,995 documents). It is bad for small clusters (especially those containing only a single unique document), so feature-based retrieval is disabled after most of the big clusters have been found; we assume that most of the remaining unclustered documents are unique and apply only similarity-based clustering to them.
  • Slide 26
  • Presentation Outline: Introduction, Problem Definition, System Architecture, Feature-based Document Retrieval, Similarity-based Clustering, Evaluation and Experimental Results, Related Work, Conclusion and Demo
  • Slide 27
  • Evaluation Methodology - I Clustering performance is difficult to evaluate: there is a lack of manpower to produce ground truth for a large dataset. Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset. Assessors: two graduate research assistants, who manually organized the documents into clusters of documents that they felt were near-duplicates, and manually went through one of the experimental clustering results pair by pair (comparing document-centroid pairs).
  • Slide 28
  • Evaluation Methodology - II Class j vs. cluster i, with \(n_{ij}\) documents of class j in cluster i, cluster size \(n_i\), class size \(n_j\), and \(n\) documents in total. F-measure: \(p_{ij} = n_{ij}/n_i\), \(r_{ij} = n_{ij}/n_j\), \(F_{ij} = \frac{2\,p_{ij} r_{ij}}{p_{ij} + r_{ij}}\), \(F_j = \max_i F_{ij}\), \(F = \sum_j \frac{n_j}{n} F_j\). Purity: \(\rho_i = \max_j p_{ij}\), \(\mathrm{Purity} = \sum_i \frac{n_i}{n} \rho_i\). Pairwise measures: Folkes and Mallows index \(FM = a/\sqrt{(a+b)(a+c)}\); Kappa \(\kappa = \frac{P(A) - P(E)}{1 - P(E)}\), where \(P(A) = (a+d)/m\) and \(P(E) = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}\), with a, b, c, d the pairwise agreement/disagreement counts and m the number of document pairs.
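For reference, a short sketch computing the F-measure and Purity from a cluster-by-class contingency matrix, following the standard size-weighted definitions given above; variable names are illustrative:

```python
def f_and_purity(n):
    """F-measure and Purity from a contingency matrix n[i][j] = number of
    documents of class j assigned to cluster i."""
    n_i = [sum(row) for row in n]            # cluster sizes
    n_j = [sum(col) for col in zip(*n)]      # class sizes
    total = sum(n_i)
    f_overall, purity = 0.0, 0.0
    for j, nj in enumerate(n_j):
        best_f = 0.0
        for i, ni in enumerate(n_i):
            if n[i][j] == 0:
                continue
            p = n[i][j] / ni                 # precision p_ij
            r = n[i][j] / nj                 # recall r_ij
            best_f = max(best_f, 2 * p * r / (p + r))   # F_ij
        f_overall += (nj / total) * best_f   # F = sum_j (n_j / n) F_j
    for row in n:
        purity += max(row) / total           # Purity = sum_i (n_i / n) max_j p_ij
    return f_overall, purity
```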
  • Slide 29
  • Experimental Results - I
  • Slide 30
  • Experimental Results - II
  • Slide 31
  • Conclusion Large-volume working set; duplicate definition and automatic evaluation; feature-based duplicate-candidate retrieval; similarity-based clustering; improved efficiency.
  • Slide 32
  • Related Work - I Duplicate detection in other domains: databases [Bilenko and Mooney 2003], to find records referring to the same entity but possibly in different representations; electronic publishing [Brin et al. 1995], to detect plagiarism or to identify different versions of the same document; web search [Chowdhury et al. 2002] [Pugh 2004], for more efficient web crawling, effective ranking of search results, and easier archiving of web documents.
  • Slide 33
  • Related Work - II Fingerprinting: compute a compact description of a document, then do pairwise comparison of document fingerprints. Shingling [Broder et al.] represents a document as a series of simple numeric encodings of n-term windows, retains every m-th shingle to produce a document sketch, and can combine sketches into super shingles. Selective fingerprinting [Heintze] selects a subset of the substrings to generate fingerprints. A statistical approach [Chowdhury et al.] uses n high-idf terms; it improved accuracy over shingling and is efficient (one-fifth of the time of shingling). Fingerprint reliability in a dynamic environment [Conrad et al.] considers the time factor on the Web.
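An illustrative sketch of the shingling idea summarized above (not the authors' system): hash every n-term window, keep a deterministic sample of the shingles as the document sketch, and compare sketches by overlap. The values of n and m, the hash choice, and the sampling rule are assumptions:

```python
import hashlib

def document_sketch(tokens, n=5, m=4):
    """Encode every n-term window as a number and keep a 1/m sample as the sketch."""
    shingles = []
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        h = int(hashlib.md5(window.encode("utf-8")).hexdigest(), 16)
        shingles.append(h)
    # keep shingles whose value is 0 mod m -- a simple deterministic sampling
    return {h for h in shingles if h % m == 0}

def resemblance(sketch_a, sketch_b):
    """Jaccard overlap of two sketches, used to flag near-duplicates."""
    if not (sketch_a or sketch_b):
        return 0.0
    return len(sketch_a & sketch_b) / len(sketch_a | sketch_b)
```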
  • Slide 34
  • References
    M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2003), Washington, D.C., August 2003.
    S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of SIGMOD 1995, pages 398-409. ACM Press, May 1995.
    A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 '97, pages 391-404. Elsevier Science, April 1997.
    J. Callan. E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004.
    A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002.
    J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46, 1960.
    J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM '03, pages 443-452. ACM Press, Nov. 2003.
    N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191-200, Nov. 1996.
    W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003.
  • Slide 35
  • Demo http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/