DETECTING NEAR-DUPLICATES FOR WEB CRAWLING
Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma
Presentation by: Fernando Arreola





6/20/2011 | Detecting Near-Duplicates for Web Crawling

Outline
- De-duplication
- Goal of the Paper
- Why is De-duplication Important?
- Algorithm
- Experiment
- Related Work
- Tying it Back to Lecture
- Paper Evaluation
- Questions


De-duplication
- The process of eliminating near-duplicate web documents in a generic crawl

Challenge of near-duplicates:
- Identifying exact duplicates is easy: use checksums
- How to identify near-duplicates?
  - Near-duplicates are identical in content but differ in small areas: ads, counters, and timestamps
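A checksum comparison makes the contrast concrete: a one-character change (a counter, a timestamp) produces a completely different digest, so exact-duplicate detection says nothing about near-duplicates. A minimal sketch (the `checksum` helper and the two page strings are invented for illustration):

```python
import hashlib

def checksum(page: str) -> str:
    """Exact-duplicate test: identical content yields identical digests."""
    return hashlib.sha256(page.encode("utf-8")).hexdigest()

# Two near-duplicate pages: only the hit counter differs.
a = "Breaking news story. Visitors: 1024"
b = "Breaking news story. Visitors: 1025"

print(checksum(a) == checksum(a))  # True: exact duplicate is caught
print(checksum(a) == checksum(b))  # False: the near-duplicate is invisible to checksums
```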


Goal of the Paper
- Present a near-duplicate detection system that improves web crawling
- The near-duplicate detection system includes:
  - The simhash technique: transforms a web page into an f-bit fingerprint
  - A solution to the Hamming Distance Problem: given an f-bit fingerprint, find all fingerprints in a given collection that differ from it in at most k bit positions


Why is De-duplication Important?
Eliminating near-duplicates:
- Saves network bandwidth: content similar to previously crawled content need not be crawled
- Reduces storage cost: content similar to previously crawled content need not be stored in the local repository
- Improves quality of search indexes: the local repository used for building search indexes is not polluted by near-duplicates


Algorithm: Simhash Technique
1. Convert the web page to a set of features using information retrieval techniques (e.g. tokenization, phrase detection)
2. Give a weight to each feature
3. Hash each feature into an f-bit value
4. Maintain an f-dimensional vector, with every dimension starting at 0
5. Update the f-dimensional vector with the weight of each feature:
   - If the i-th bit of the hash value is zero, subtract the feature's weight from the i-th vector component
   - If the i-th bit of the hash value is one, add the feature's weight to the i-th vector component
6. The vector will have positive and negative components; the sign (+/-) of each component gives one bit of the fingerprint


Algorithm: Simhash Technique (cont.)
Very simple example:
- One web page with text: "Simhash Technique"
- Reduced to two features:
  - Simhash -> weight = 2
  - Technique -> weight = 4
- Hash features to 4 bits:
  - Simhash -> 1101
  - Technique -> 0110


Algorithm: Simhash Technique (cont.)
Start with an all-zero vector: [0, 0, 0, 0]


Algorithm: Simhash Technique (cont.)
Apply the Simhash feature (hash = 1101, weight = 2):

  feature's f-bit value:   1     1     0     1
  calculation:            0+2   0+2   0-2   0+2
  vector:                  2     2    -2     2


Algorithm: Simhash Technique (cont.)
Apply the Technique feature (hash = 0110, weight = 4):

  feature's f-bit value:   0     1     1     0
  calculation:            2-4   2+4  -2+4   2-4
  vector:                 -2     6     2    -2


Algorithm: Simhash Technique (cont.)
Final vector: [-2, 6, 2, -2]
- Signs of the vector components are -, +, +, -
- Final 4-bit fingerprint = 0110
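The steps above can be checked in code. A minimal sketch that takes pre-hashed (hash, weight) feature pairs (a real implementation would extract and hash the features itself) and reproduces the slide example:

```python
def simhash(hashed_features, f):
    """Simhash over pre-hashed features: each (hash, weight) pair votes on an
    f-dimensional vector (+weight where the hash bit is 1, -weight where it
    is 0); the signs of the components become the fingerprint bits."""
    v = [0] * f
    for h, weight in hashed_features:
        for i in range(f):
            v[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(f):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

# Slide example: "Simhash" hashes to 1101 (weight 2), "Technique" to 0110 (weight 4).
print(format(simhash([(0b1101, 2), (0b0110, 4)], f=4), "04b"))  # -> 0110
```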


Algorithm: Solution to Hamming Distance Problem
Problem: given an f-bit fingerprint F, find all fingerprints in a given collection that differ from F in at most k bit positions.
Solution:
1. Create tables containing the fingerprints; each table has a permutation (π_i) and a small integer (p_i) associatedated with it
2. Apply the permutation associated with each table to its fingerprints
3. Sort the tables
4. Store the tables in the main memory of a set of machines
5. Iterate through the tables in parallel:
   - Find all permuted fingerprints whose top p_i bits match the top p_i bits of π_i(F)
   - For the fingerprints that matched, check whether they differ from π_i(F) in at most k bits
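A minimal sketch of one table build and probe, assuming 8-bit fingerprints and the two permutations used in the example on the following slides (the helper names `build_table` and `query` are invented for illustration):

```python
def build_table(fingerprints, permute):
    """One table: permute every fingerprint, then sort (sorting is what makes
    the top-p-bit probe a fast range scan in the real system)."""
    return sorted(permute(fp) for fp in fingerprints)

def query(table, permute, p, F, k, f=8):
    """Return the permuted fingerprints whose top p bits match those of
    permute(F) and that differ from it in at most k bit positions."""
    pF = permute(F)
    top = pF >> (f - p)
    return [g for g in table
            if g >> (f - p) == top and bin(g ^ pF).count("1") <= k]

# The two permutations from the slide example (f = 8):
swap_halves = lambda x: ((x & 0x0F) << 4) | (x >> 4)  # swap last four bits with first four
front_two = lambda x: ((x & 0x03) << 6) | (x >> 2)    # move last two bits to the front

# Table 1 over the first four example fingerprints, probed with F = 0100 1101:
t1 = build_table([0b11000101, 0b11111111, 0b01011100, 0b01111110], swap_halves)
print([format(g, "08b") for g in query(t1, swap_halves, 3, 0b01001101, 3)])  # -> ['11000101']
```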


Algorithm: Solution to Hamming Distance Problem (cont.)
Simple example:
- F = 0100 1101
- k = 3
- Collection of 8 fingerprints
- Create two tables

Fingerprints:
1100 0101
1111 1111
0101 1100
0111 1110
1111 1110
0010 0001
1111 0101
1101 0010


Algorithm: Solution to Hamming Distance Problem (cont.)

Table 1 (p = 3; π1 = swap last four bits with first four), built from the first four fingerprints:
0101 1100
1111 1111
1100 0101
1110 0111

Table 2 (p = 3; π2 = move last two bits to the front), built from the last four fingerprints:
1011 1111
0100 1000
0111 1101
1011 0100


Algorithm: Solution to Hamming Distance Problem (cont.)

After sorting each table:

Table 1 (p = 3; π1 = swap last four bits with first four):
0101 1100
1100 0101
1110 0111
1111 1111

Table 2 (p = 3; π2 = move last two bits to the front):
0100 1000
0111 1101
1011 0100
1011 1111


Algorithm: Solution to Hamming Distance Problem (cont.)

Probe each sorted table with the top p = 3 bits of the permuted query fingerprint:
F = 0100 1101
π1(F) = 1101 0100 -> top bits 110 match 1100 0101 in Table 1
π2(F) = 0101 0011 -> top bits 010 match 0100 1000 in Table 2


Algorithm: Solution to Hamming Distance Problem (cont.)
With k = 3, only the matching fingerprint in the first table is a near-duplicate of F:
- Table 1: π1(F) = 1101 0100 vs. 1100 0101 -> differ in 2 bits (<= 3: near-duplicate)
- Table 2: π2(F) = 0101 0011 vs. 0100 1000 -> differ in 4 bits (> 3: not a near-duplicate)


Algorithm: Compression of Tables
1. Store the first fingerprint in a block (1024 bytes)
2. XOR the current fingerprint with the previous one
3. Append to the block the Huffman code for the position of the most significant 1-bit
4. Append to the block the bits after the most significant 1-bit
5. Repeat steps 2-4 until the block is full

Comparing to the query fingerprint:
- Use the last fingerprint (the key) of each block and perform interpolation search to find and decompress the appropriate block
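The delta scheme can be sketched as follows; this sketch stores the 1-bit positions raw where the paper Huffman-codes them, and assumes the fingerprints in a block are sorted and distinct (the function names are invented for illustration):

```python
def compress_block(sorted_fps):
    """Store the first fingerprint whole; for each successor, record the
    position of the most significant 1-bit of its XOR with the predecessor,
    plus the bits after that 1-bit."""
    first, deltas, prev = sorted_fps[0], [], sorted_fps[0]
    for fp in sorted_fps[1:]:
        x = fp ^ prev
        msb = x.bit_length() - 1                    # position of most significant 1-bit
        deltas.append((msb, x & ((1 << msb) - 1)))  # remaining low-order bits
        prev = fp
    return first, deltas

def decompress_block(first, deltas):
    """Invert the delta scheme by re-assembling each XOR and chaining it."""
    fps = [first]
    for msb, rest in deltas:
        fps.append(fps[-1] ^ ((1 << msb) | rest))
    return fps

fps = [0b01001000, 0b01111101, 0b10110100, 0b10111111]
print(decompress_block(*compress_block(fps)) == fps)  # True: round-trips
```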


Algorithm: Extending to Batch Queries
Problem: want near-duplicates for a batch of query fingerprints, not just one.
Solution: use the Google File System (GFS) and MapReduce
1. Create two files: file F has the collection of fingerprints; file Q has the query fingerprints
2. Store the files in GFS, which breaks them up into chunks
3. Use MapReduce to solve the Hamming Distance Problem for each chunk of F against all queries in Q
   - MapReduce creates one task per chunk; the chunks are processed in parallel
   - Each task produces as output the near-duplicates it found
4. Produce a sorted file from the output of all tasks, removing duplicates if necessary
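The batch design can be simulated in a few lines; here the chunking and the sequential loop stand in for GFS and MapReduce, and the helper names are invented for illustration:

```python
def map_chunk(chunk, queries, k):
    """Map phase, one task per chunk of file F: emit (query, fingerprint)
    for every near-duplicate pair found in this chunk."""
    return [(q, fp) for q in queries for fp in chunk
            if bin(q ^ fp).count("1") <= k]

def batch_near_duplicates(collection, queries, k, chunk_size=4):
    """Split the collection into chunks (GFS does this in the real system),
    run the map tasks (MapReduce would run them in parallel), then merge
    the per-task outputs into one sorted, de-duplicated result."""
    chunks = [collection[i:i + chunk_size]
              for i in range(0, len(collection), chunk_size)]
    results = []
    for chunk in chunks:          # sequential here; parallel under MapReduce
        results.extend(map_chunk(chunk, queries, k))
    return sorted(set(results))
```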


Experiment: Parameters
- 8 billion web pages used
- k varied from 1 to 10
- Pairs manually tagged as follows:
  - True positives: differ slightly
  - False positives: radically different pairs
  - Unknown: could not be evaluated


Experiment: Results
Accuracy:
- Low k value -> many false negatives
- High k value -> many false positives
- Best value -> k = 3:
  - 75% of near-duplicates reported
  - 75% of reported cases are true positives

Running time:
- Solution to the Hamming Distance Problem: O(log(p))
- Batch query + compression: a 32 GB file and 200 tasks run in under 100 seconds


Related Work
- Clustering related documents: detect near-duplicates to show related pages
- Data extraction: determine the schema of similar pages to obtain information
- Plagiarism: detect pages that have borrowed from each other
- Spam: detect spam before the user receives it


Tying it Back to Lecture
Similarities:
- Indicated the importance of de-duplication to save crawler resources
- Gave a brief summary of several uses for near-duplicate detection

Differences:
- Lecture focus: breadth-first look at algorithms for near-duplicate detection
- Paper focus: in-depth look at the simhash and Hamming Distance algorithms, including how to implement them and their effectiveness


Paper Evaluation: Pros
- Thorough step-by-step explanation of the algorithm implementation
- Thorough explanation of how the conclusions were reached
- Included brief descriptions of how to improve the simhash + Hamming Distance algorithm: categorize web pages before running simhash, create an algorithm to remove ads or timestamps, etc.


Paper Evaluation: Cons
- No comparison:
  - How much more effective or faster is it than other algorithms?
  - By how much did it improve the crawler?
- Limited batch queries to a specific technology:
  - The implementation required use of GFS
  - An approach not restricted to a certain technology might be more applicable


    Any Questions?
