
Public Review for

Precise Detection of Content Reuse in the Web

C. Ardi, J. Heidemann

This paper presents a new methodology for discovering and detecting content reuse in the web, an open problem given the scale of the web. Via some smart engineering based on Apache Spark, the proposed solution is able to scale linearly with the size of the dataset. Via experiments, the authors show that the approach is robust, correctly detecting fully and partially duplicated pages as well as bad neighbourhoods, which are sets of pages that share a common URL prefix. The reviewer found the paper to be extremely well written, with a solid and thorough experimental part. In addition, the authors are sharing both the code and the data to let other researchers replicate and extend the work. Overall, the paper offers nice insights and smart solutions to a still open problem.

Public review written by Marco Mellia

Politecnico di Torino, Italy


Precise Detection of Content Reuse in the Web

Calvin Ardi, USC/Information Sciences Institute

[email protected]

John Heidemann, USC/Information Sciences Institute

[email protected]

ABSTRACT
With vast amounts of content online, it is not surprising that unscrupulous entities "borrow" from the web to provide content for advertisements, link farms, and spam. Our insight is that cryptographic hashing and fingerprinting can efficiently identify content reuse for web-size corpora. We develop two related algorithms, one to automatically discover previously unknown duplicate content in the web, and the second to precisely detect copies of discovered or manually identified content. We show that bad neighborhoods, clusters of pages where copied content is frequent, help identify copying in the web. We verify our algorithm and its choices with controlled experiments over three web datasets: Common Crawl (2009/10), GeoCities (1990s–2000s), and a phishing corpus (2014). We show that our use of cryptographic hashing is much more precise than alternatives such as locality-sensitive hashing, avoiding the thousands of false positives that would otherwise occur. We apply our approach in three systems: discovering and detecting duplicated content in the web, searching explicitly for copies of Wikipedia in the web, and detecting phishing sites in a web browser. We show that general copying in the web is often benign (for example, templates), but 6–11% is commercial or possibly commercial. Most copies of Wikipedia (86%) are commercialized (link farming or advertisements). For phishing, we focus on PayPal, detecting 59% of PayPal phish even without taking on intentional cloaking.

CCS CONCEPTS
• Information systems → Spam detection; Web mining; Extraction, transformation and loading; Data cleaning; • Security and privacy → Phishing.

KEYWORDS
content reuse, content duplication, duplicate detection, phishing

1 INTRODUCTION
A vast amount of content is online, easily accessible, and widely utilized today. User-generated content fills many sites, sometimes non-commercial like Wikipedia, but more often commercial like Facebook and Yelp, where it supports billions of dollars in advertising. However, sometimes unscrupulous entities repackage this content, wrapping their commercial content around this previously published information to make a quick profit.

There are several recent examples of misleading reuse of content. Content farming involves reposting copies of Wikipedia or discussion forums to garner revenue from new advertisements, or to fill out link farms that support search-engine "optimization". E-book content farming republishes publicly available information as e-books to attract naive purchasers and spam the e-book market. (Tools like Autopilot Kindle Cash can mass-produce dozens of "books" in hours.) Review spamming posts paid reviews that are often fake and use near-duplicate content to boost business rankings.

The common thread across these examples is that they gather and republish publicly available information for commercial gain.

Our goal is to develop methods that efficiently find duplication in large corpora, and to show this approach has multiple applications. We show that our method (§7) efficiently finds instances of mass-republishing on the Internet: for example, sites that use Wikipedia content for advertising (§8.1). While copying Wikipedia is explicitly allowed, bulk copying of copyrighted content is not. Even when allowed, content farming does little to enhance the web, and review spamming and e-book content farming degrade the impact of novel reviews and books, much as click fraud degrades legitimate advertising. Our approach can also detect phishing sites that use duplicated content to spoof users (§8.2).

Our insight is that cryptographic hashing can provide an effective approach to duplication detection, and scales well to very large datasets. A hash function takes arbitrary content as input and produces a statistically unique, simple, fixed-length bitstring. We build lists of hashes of all documents (or "chunks", subparts of documents) in web-size corpora, allowing very rapid detection of content reuse. Although minor changes to content result in different hashes, we show that copying can often be identified in the web by finding the same chunks across documents. Economically, spammers seek the greatest amount of profit with minimal work: we see that current spammers usually do not bother to obfuscate their copying. (If our work forces them to hide, we at least increase their effort.) Our work complements prior work in semantic fingerprints [24, 31, 44] and locality-sensitive hashing [28]. Such approaches provide approximate matching, reducing false negatives at the cost of some false positives. While semantic hashing is ideal for applications such as computer forensics, where false positives are manageable, our approach is relevant to duplicate detection in web-size corpora, where precise matching without false positives is essential, since even a tiny rate of false positives overwhelms true positives when applied to millions of documents on the web (§6.5). We also explore blind (automated) discovery of duplicated content.

We evaluate our approach on several very large, real-world datasets. We show that blind discovery can automatically find previously unknown duplicated content in general web-scale corpora (§7), evaluating Common Crawl (2.86×10⁹ files) and GeoCities (26.7×10⁶ files). While most general duplication is benign (such as templates), we show that 6–11% of widespread duplication on the web is for commercial gain. We also show that expert-labeled datasets can be used with our approach to efficiently search web-size corpora or to quickly search new pages on the web. We demonstrate bulk searches by looking for copies of Wikipedia on the web (§8.1), finding that most copies of Wikipedia (86%) are commercialized (link farming or advertisements). We also show that our approach can detect phishing in web pages (§8.2), demonstrating a Chrome plugin and evaluating it with a targeted dataset, finding 59% of PayPal phish even without taking measures to defeat intentional cloaking (for example, source-code obfuscation).


Contributions: The contribution of this paper is to show that hash-based methods can blindly discover content duplication and then detect this duplication in web-size corpora. The novelty of this work is not the use of hashing (a long-existing technique), but design choices in adapting hashing (with chunking and cleaning, §4.4) to scale to discovery (§4.2) and detection (§4.3) across web-size datasets, and to operate robustly in the face of minor changes. In particular, our approach to discovery can be thought of as a form of semi-supervised machine learning. One important step is our use of the hierarchical nature of the web to find clusters of copied content ("bad neighborhoods", §7.3). We show that this approach applies not only to discovering general duplication (§7) and identifying bulk copying in the web (§8.1), but also to detecting phishing activity (§8.2). We support our work with validation of the correctness of discovery and detection in our methodology (§6) and evaluation of bad neighborhood detection's robustness to random document changes (§6.4). Our data and code are available for research reproducibility (§5.3).

2 PROBLEM STATEMENT
Replicating web content is easy. Some individuals bulk copy high-quality content from Wikipedia or Facebook to overlay advertisements, or to back-fill for link farms. Others reproduce selected content to impersonate high-value sites for phishing. We seek to develop new approaches to address two problems. First, we want to automatically discover content that is widely duplicated, or duplicated at large scale in a few places. Second, given a list of known duplicated content, we want to detect where such content is duplicated. We next define these two problems more precisely.

Consider a corpus C of files f. Interesting corpora, such as a crawl of the Internet, are typically far too large to permit manual examination. We assume the corpus consists of semi-structured text; we use minimal understanding of the semantics of the text to break it into chunks c_f by choosing basic, structural delimiters (without attempting to infer the meaning of the content itself). Each file is identified by a URL; we can exploit the hierarchical structure in the path portion of the URL, or treat URLs as a flat space identified only by the sitename portion.

Our first problem is discovery. In discovery, our goal is to discover a labeled dataset L consisting of content of interest we believe to be copied. The simplest way to determine L is for an expert to examine C and manually identify it, thus building an expert-labeled dataset L from content in C. (The corpus C used to build L can be the same as or different than the corpus later used in detection; for now we use the same C in discovery and detection.) Although not possible in general, semi-automated labeling is suitable for some problems (§8) where one can build L independently from known information.

Alternatively, we show how to discover L through a blind discovery process, without external knowledge. We explore this approach to discover content that is widely duplicated in the web (§7).

The detection process finds targets T in the corpus C that duplicate portions of the labeled dataset L. In addition to finding individual files that show high levels of copying, we also exploit the hierarchical grouping of documents in C to find bad neighborhoods N, defined as clusters of content sharing the same URL hierarchy where many files appear to be duplicated.

3 RELATED WORK
There is significant prior work in detection of duplicated content to reduce storage or network use and to find near-duplicate content for plagiarism detection or information retrieval, and in phish detection.

Storage & Network Optimization: Content duplication detection serves many purposes and several fields have revolved around the idea. Data deduplication can be used to efficiently store similar or identical pieces of data once [30, 35, 47]. Our work shares some of the same fundamental ideas through the use of cryptographic hashing and chunking to effectively find duplicate or similar files. Deduplication systems have explored, for example, chunking files into both fixed- and variable-sized blocks and hashing those chunks to find and suppress duplicate chunks in other files. In our chunking methods, we consider variable-sized blocks delimited by HTML tags, leveraging the structure provided by the markup language. While they target the application of storage optimization of files, focusing on archival hardware and systems, our applications target duplicate detection, at both the file and neighborhood level, of content often copied for commercial gain.

The same concept can be used in a network to reduce the amount of data transferred between two nodes [34, 41]. Network operators can reduce WAN bandwidth use by hashing transmitted packets (or chunks of packets) in real time and storing this data in a cache. Network users can then see improved download times when retrieving data matched in the cache (for example, when multiple users download the same file or visit the same webpage). Our work also uses the idea of hashing to detect duplicates, but with different applications. While their work looks at suppressing redundant, duplicate downloads from the web, our work looks at finding where duplication exists on the web. Their application forces efficient, streaming processing and a relatively small corpus (caches are less than 100 GB to minimize overhead), while our web analysis is suitable for offline processing with corpora larger than 1 PB.

Plagiarism Detection is a very different class of application. Storage and network optimization requires exact reproduction of original contents. Existing approaches to plagiarism detection in documents emphasize semantic matching, as plagiarism is also concerned with subtle copying of concepts, in addition to exact text reuse. Plagiarism detection makes use of stylometric features [19, 40], measuring writing structure and styles, in addition to text statistics (counts of words and parts-of-speech). Our work aims to answer the question of whether massive duplication exists at web scale using syntactic methods; we do not attempt to infer semantic equivalence of the content.

Because plagiarism detection is typically done over small- to moderate-sized corpora (comparing an essay or homework assignment to ∼1000 others), long runtimes (minutes to sometimes hours per document [17]) and a relatively large rate of false positives (precision = 0.75, for example [19]) are tolerable. Manual review can address false positives, and with a relatively small corpus, the absolute number of false positives can be manageable even if the rate is not small. In our applications, our parallelized processing enables us to maintain good performance, even as the corpus grows (§5.2). Additionally, we require high precision in detecting reuse since with large corpora (10⁹ documents or more), even a small false positive rate quickly makes human review impractical (§6.5).


Information Retrieval: Document similarity and detection is at the heart of the field of information retrieval (IR). Approaches in IR have explored duplicate detection to improve efficiency and the precision of answers [11, 24, 31, 42]. Our use of cryptographic hashing has high precision at the cost of lower recall by missing mutated files.

Broder et al. [10] develop a technique called "shingling" (today known as n-grams) to generate unique contiguous subsequences of tokens in a document and cluster documents that are "roughly the same". They use this technique to find, collapse, and ignore near-duplicates when searching for information (to avoid showing users the same content multiple times). In our applications, we specifically look for content matches and require new approaches (cryptographic hashes) to avoid overwhelming numbers of false positives from approximate matching (§6.5).

SpotSigs [44] and Chiu et al. [12] use n-grams in different applications to search for similarities and text reuse with approximate matching. SpotSigs extends the use of n-grams by creating signatures of only semantic (meaningful) text in documents, ignoring markup language details (like HTML tags). Their system for approximate matching is quadratic (O(n²)) in its worst case, but it can trade off runtime performance against the similarity threshold. Our work looks for precise content matches in quasilinear time (O(n log n)). In one of our applications, phish detection, we leverage the details of HTML to enable precise detection of phish.

Chiu et al. build an interface within a web browser to interact with their back-end, enabling users to query for sentence-level similarities between their visited page and other pages in a larger corpus. We use precise matching of paragraph-level chunks, with applications in detecting widespread duplication across the web. We distinguish between the discovery and detection sides of the problem, allowing us to better separate the problems of finding "what content is duplicated" and finding "where the content is being duplicated". In Chiu et al., "discovery" is a manual search query performed by the user, while in our work we can perform discovery as an automated process.

Cho et al. [13] use sentence-level hashing of text to search for duplicated content in order to improve Google's web crawler and search engine. By identifying duplicates with hashing, a web crawler becomes more efficient as it avoids visiting clusters of similar pages. Because the crawler avoids and suppresses duplicates, the quality of search results is improved, with more diverse sites being returned in response to a user's query. Our work complements this prior work with different chunking strategies and different applications in measurements and anti-phishing. While Cho et al. extract, chunk, and hash only textual information (no markup), we look at paragraph-level chunking of a document's native format, finding it to be effective in duplicate detection. Rather than avoid and suppress duplicates, we focus on precisely finding and identifying clusters of similar pages to measure content reuse on the web (for commercial gain or otherwise) and detect phishing websites as part of an anti-phishing strategy.

Zhang et al. [48] build a near-duplicate detection technique by chunking documents at the sentence level and matching their signatures across a variety of English and Chinese datasets (1.69–50.2×10⁶ documents, 11–490 GB). We chunk documents at the paragraph level, and compare the performance of matching at the file and paragraph level. We focus on applications in detecting commercialized duplication and web phishing, showing that our techniques can scale to web-size corpora (§5, 2.86×10⁹ documents, 99 TB). Their signature creation also leverages prior work, using shingles [10], SpotSigs [44], and I-Match (SHA-1) [14], preferring I-Match for its efficiency. Our work validates SHA-1's performance, which we leverage to achieve precise and efficient detection.

Henzinger [24] compares the performance of algorithms that use shingling and Charikar's locality-sensitive hashing (LSH). While LSH achieves better precision than shingling, combining the two provides even higher precision. Exploration of LSH is an interesting possible complement to our use of cryptographic hashing: although the objective of our paper is not a survey of algorithms, we briefly compare LSH and cryptographic hashing in §6.5.

Yang and Callan [46] develop a system that implements a clustering algorithm using document metadata as constraints to group near-duplicates together in EPA and DOT document collections. They exploit constraints in document metadata; we instead focus on general datasets that provide no such metadata.

Kim et al. [26] develop an approximate matching algorithm for overlaps and content reuse detection in blogs and news articles. A search query is compared to sentence-level signatures for each document, with returned results being within some Euclidean distance √d of each other. Their system, tested on corpus sizes of 1–100×10³ documents, balances the trade-off between a higher true positive rate (recall) and lower precision with their algorithm's quadratic runtime (O(n²)). They also optimize part of their processing by using incremental updates for new content. We focus on precise matching in quasilinear time (O(n log n)) on larger chunks (at the paragraph level) of content reuse. In our applications, we look at detecting large-scale content reuse on web-scale corpora (≥10⁹ documents), requiring high precision to avoid being overwhelmed with false positives (which would require costly post-processing).

Phish Detection: We summarize prior work here, from our previous, more detailed review [6]. Machine learning has been used to detect phish, by converting a website's content [23] or URL and domain properties [29] into a set of features to train on. Other approaches measure the similarity of phish and original sites by looking at their content and structure: similarities can be computed based on the website's visual features like textual content, styles, and layout [27]. Many of these approaches use approximate matching, which runs the risk of producing false positive detections, and machine learning techniques have high computational requirements (that would make them difficult to run on clients). Our use of precise content matching helps avoid false positives, runs efficiently in clients, and can provide a first pass that complements heavier approaches.


4 METHODOLOGY
We next describe our general approach to detecting content reuse. Although we have designed the approach for web-like corpora, it also applies to file systems or other corpora containing textual content like news sources.

4.1 Overview
Our general approach is to compute a hash for each data item, then use these hashes to find identical objects. In this section we present our workflow and address the discovery and detection phases of our approach.

Collecting the Data:

(0) Crawl the web, or use an existing web crawl, and correct acquisition errors (§4.4.1).

(1) For each file f in corpus C, compute a hash of the whole file f: H(f), and

(2) Split f into a vector of chunks c_f = {c_{f,1}, ..., c_{f,n}} and hash each chunk H(c_{f,i}) to form a chunk hash vector¹ H(c_f).

Discovery: (§4.2)

(3) Populate the labeled dataset with files L_f or chunks L_c by either:
(a) informed discovery: seeding it with known content a priori, or
(b) blind discovery: (i) identifying the most frequently occurring files or chunks in C as suspicious, after (ii) discarding known common but benign content (stop-chunk removal, §4.4.2).

Detection: (§4.3)

(4) Simple Object Matching: Given a labeled dataset of hashed chunks or files L, find all objects in C whose hash is in L. This results in target (suspicious) files and chunks: T_f and T_c.

(5) Partial Matching: To identify files containing partial matches, we use the chunk hash vectors to compute the ratio of target chunks to total file content:

contains(L_c, f) = |L_c ∩ H(c_f)| / |H(c_f)|

If contains(L_c, f) is greater than a threshold, we consider f to be a partial target in T.

(6) Bad Neighborhood Detection: Apply stop-chunk removal (§4.4.2), then for each neighborhood N = {f_{N,1}, ..., f_{N,n}} where the files share a hierarchical relationship, compute the overall badness ratio of labeled content matches to total content:

badness(N) = ( Σ_{f ∈ N} contains(L_c, f) ) / |N|

If badness(N) is greater than a threshold, we consider N a bad neighborhood in T_N. (A brief code sketch of steps 5 and 6 follows below.)

The thresholds for partial matching and bad neighborhood detection are configurable; we set the default threshold to one standard deviation over the mean. We elaborate on our choice and how to select a threshold in §4.2.

¹While the vector contains an ordering of hashed chunks, we do not currently use the order.
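The partial-matching and bad-neighborhood steps reduce to simple set arithmetic over chunk hash vectors. The following minimal Python sketch is ours, not the authors' released implementation; the function names and the data layout (a set of labeled hashes, a list of per-file hash vectors) are illustrative assumptions.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    """SHA-1 digest, the hash function used throughout the paper."""
    return hashlib.sha1(data).hexdigest()

def contains_ratio(labeled_chunk_hashes: set, chunk_hash_vector: list) -> float:
    """Step 5: fraction of a file's chunk hashes that appear in L_c."""
    if not chunk_hash_vector:
        return 0.0
    matched = sum(1 for h in chunk_hash_vector if h in labeled_chunk_hashes)
    return matched / len(chunk_hash_vector)

def badness(labeled_chunk_hashes: set, neighborhood_files: list) -> float:
    """Step 6: average contains() ratio over all files in a neighborhood N."""
    if not neighborhood_files:
        return 0.0
    ratios = [contains_ratio(labeled_chunk_hashes, v) for v in neighborhood_files]
    return sum(ratios) / len(ratios)

# A file or neighborhood is flagged when its ratio exceeds a configurable
# threshold (the paper's default: one standard deviation over the mean).
```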

4.2 Discovery
Discovery is the process of building a labeled dataset of items we wish to find in the corpus during detection. We can do this with an informed or blind process.

With informed discovery (Step 3a), an expert provides labeled content of interest L, perhaps by exploring C manually, or using external information. As one example, we know that Wikipedia is widely copied, and so we seed L with a snapshot of Wikipedia (§8.1). One could also seed L with banking websites to identify phishing sites that reproduce this content (we seed L with PayPal in §8.2).

One can also identify L through a blind discovery process (Step 3b) that automatically finds widely duplicated content. Blind discovery is appropriate when an expert is unavailable, or if the source of copying is unknown. We first populate L_f and L_c with the most frequently occurring files or chunks in the corpus. We set the discovery threshold depending on the dataset size and the type of object being identified. For example, one would set the threshold higher when the dataset is larger. We looked at the ROC curves (a plot of the true positive rate against the false positive rate) and found a trade-off between false positives (FP) and true positives (TP). There was no strong knee in the curve, thus we picked thresholds with a reasonable balance of FP to TP. In the Common Crawl dataset of 40.5×10⁹ chunks, we set the threshold to 10⁵.

Additionally, in our discovery process we expect to find trivial content that is duplicated many times as part of the web publishing process: the empty file, or a chunk consisting of an empty paragraph, or the reject-all robots.txt file. These will inevitably show up very often and litter L: while common, they are not very significant or useful indicators of mass duplication. To make blind discovery more useful, we remove this very common but benign content using stop-chunk removal, described in §4.4.2.

Given a threshold, all chunks c in the corpus C whose number of duplicates exceeds the threshold and that are not "stop chunks" are automatically labeled and added to the labeled dataset L:

L := { c ∈ C : duplicates(c) > threshold, c ∉ stop chunks }
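Concretely, blind discovery is a count-and-threshold over chunk (or file) hashes. A minimal single-machine sketch under that reading follows; the real pipeline performs the same group-by as MapReduce jobs over the full corpus, and the stop-chunk set and threshold value are inputs rather than fixed constants.

```python
from collections import Counter

def blind_discovery(chunk_hashes, threshold, stop_chunk_hashes=frozenset()):
    """Return L_c: hashes whose duplicate count exceeds `threshold`,
    excluding known-benign stop chunks (see §4.4.2)."""
    counts = Counter(chunk_hashes)        # duplicates(c) for every chunk hash
    return {h for h, n in counts.items()
            if n > threshold and h not in stop_chunk_hashes}

# Example (Common Crawl scale uses MapReduce, but the logic is the same):
#   labeled_chunks = blind_discovery(all_chunk_hashes, threshold=10**5)
```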

We next look at properties of the discovery process. An important property of discovery is that it is not distributive: analysis must consider the entire corpus. Parts of discovery are inherently parallelizable and allow for distributed processing by dividing the corpus among various workers; we use MapReduce to parallelize the work (§5). However, to maximize detection, the final data join and thresholding must consider the full corpus. Given an example threshold of 1000, consider a corpus C = C_1 ∪ C_2 and an object j = j_1 = j_2 such that duplicates(j) = duplicates(j_1) + duplicates(j_2), with j_1 ∈ C_1, duplicates(j_1) = 1000 and j_2 ∈ C_2, duplicates(j_2) = 100. Object j only exceeds the threshold in the complete corpus (with duplicates(j) = 1100), not when considering only j_1 or j_2.

Discovery runtime is O(n log n) and performance on a moderate-size Hadoop cluster is reasonable (hours). We look at the runtime performance to understand which part of discovery dominates the computation time and, if possible, identify areas for improvement. After we hash all the desired objects (O(n)), we sort and count all hashes (O(n log n)), and cull objects (O(n)) whose number of duplicates does not exceed the threshold. Discovery's performance is dominated by sorting, leading to an overall performance of O(n log n).


4.3 Detection
In the detection phase, we find our targets T at varying levels of granularity in the corpus C by looking for matches with our labeled dataset L.

In simple object matching, our targets T are exact matches of a chunk c or file f in L. Given L, we find all chunks or files in C whose hash is in L and add them to the set of targets T. We can then analyze T to understand if objects in L are being duplicated in C and how often they are duplicated. While those statistics are relevant, we expect that duplication happens often and would like to further understand the details of where duplication happens.

Algorithmic performance of detection is O(mn log n), where m is the size of the labeled data L, and n the size of the corpus C. Since |L| ≪ |C| (the corpus is large, with millions or billions of pages, only a fraction of which are labeled as candidates of copying), performance is dominated by O(n log n) because of sorting. With optimized sorting algorithms (such as those in Hadoop), our approach scales to handle web-sized corpora.
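One way to see why the sort dominates is to organize detection as a sort-merge join between corpus hashes and labeled hashes. The sketch below is a simplified, single-node reading with hypothetical input shapes; the actual system relies on Hadoop's distributed sort.

```python
def detect_matches(corpus, labeled_hashes):
    """corpus: iterable of (hash, url) pairs; labeled_hashes: iterable of hashes.
    Returns the (hash, url) pairs whose hash appears in L.
    Sorting costs O(n log n) and O(m log m); the merge scan is O(n + m)."""
    corpus_sorted = sorted(corpus)                 # sort by hash, then url
    labels_sorted = sorted(set(labeled_hashes))
    matches, i = [], 0
    for h, url in corpus_sorted:
        # advance the label pointer until it reaches (or passes) this hash
        while i < len(labels_sorted) and labels_sorted[i] < h:
            i += 1
        if i < len(labels_sorted) and labels_sorted[i] == h:
            matches.append((h, url))
    return matches
```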

We also consider partial file matching. Rather than look at whole objects, we can detect target files that partially duplicate content from elsewhere based on the number of bad chunks they contain. Partial matches are files that belong in T_p because they contain part of L. Containment examines the chunk hash vector H(c_f) of each file to see what fraction of its chunks are in L.

Finally, we use bad neighborhood detection to look beyond identification of individual files. Examination of "related" files allows detection of regions where large numbers of related files each have a duplicated copy. For example, finding a copy of many Wikipedia pages might lead to a link farm which utilized Wikipedia to boost its credibility or search engine ranking.

We define a neighborhood based on the hierarchical relationship of files in the corpus. A neighborhood N is defined by a URL prefix p; it consists of all files f ∈ C where p(f) = p(N).

Many sites have shallow hierarchies, so in the worst case each site is a neighborhood. For example, while people might easily create domains and spread content across them, the duplicated content would be detected as matches and reveal a cluster of neighborhoods (or sites) containing duplicated content. However, for complex sites with rich user content (e.g., GeoCities), individuals may create distinct neighborhoods. Each site will have neighborhoods at each level of the hierarchy. For arxiv.org/archive/physics/, we would consider three neighborhoods: arxiv.org/archive/physics/, arxiv.org/archive/, and arxiv.org/.
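In code, enumerating a page's neighborhoods is just walking its URL path prefixes. The helper below is a hypothetical illustration, not from the released code; it assumes URLs with an optional scheme whose final dotted component is a file name.

```python
def neighborhoods(url: str) -> list:
    """Return every URL-prefix neighborhood containing `url`, e.g.
    'arxiv.org/archive/physics/paper.html' ->
    ['arxiv.org/', 'arxiv.org/archive/', 'arxiv.org/archive/physics/']."""
    path = url.split('://', 1)[-1]            # drop an optional scheme
    parts = path.rstrip('/').split('/')
    if len(parts) > 1 and '.' in parts[-1]:   # heuristically drop a file name
        parts = parts[:-1]
    site, dirs = parts[0], parts[1:]
    prefixes = [site + '/']
    for d in dirs:
        prefixes.append(prefixes[-1] + d + '/')
    return prefixes
```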

We assess the quality of a neighborhood by applying partial matching to all chunks in the neighborhood N, using contains(L_c, N) as in Step 5, and add N to the set of targets T if the result is greater than a threshold. Like the chunk hash vector for files, the neighborhood chunk hash vector will have duplicated components when there are multiple copies of the same chunk in the neighborhood. Because neighborhood analysis is done over a larger sample, when we find regions that exceed our detection threshold, the result is less likely to be an outlier and more likely to reveal a set of files with suspicious content. We next look at properties of the detection process.

Unlike discovery, the detection process is parallelizable when processing distinct neighborhoods N (neighborhoods that do not share the same URL prefix). This parallelizable property allows us to process many neighborhoods simultaneously without affecting whether a particular neighborhood is detected as "bad" or not.

Given C_1 and C_2, we assert that

detected(L, C_1 ∪ C_2) = detected(L, C_1) ∪ detected(L, C_2).

This holds true because C_1 and C_2 share no neighborhoods: given some neighborhood N ∈ C_1, N ∉ C_2. As we showed earlier, runtime performance is O(n log n) because of the sort during the join. However, since neighborhoods are independent and numerous, we get "easy" parallelism. With p processors, we get runtime O((n log n)/p).
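Because neighborhoods are disjoint, the badness computation can be farmed out to independent workers with no coordination. A minimal sketch using Python's multiprocessing follows; the paper's implementation instead parallelizes with Hadoop/MapReduce, and the data layout here (a dict from URL prefix to per-file chunk-hash vectors) is an assumption.

```python
from functools import partial
from multiprocessing import Pool

def score_neighborhood(labeled_hashes, item):
    """Worker: badness(N) = mean over files of the fraction of labeled chunks."""
    prefix, files = item            # files: list of per-file chunk-hash vectors
    ratios = [sum(h in labeled_hashes for h in chunks) / len(chunks)
              for chunks in files if chunks]
    return prefix, (sum(ratios) / len(ratios) if ratios else 0.0)

def detect_bad_neighborhoods(neighborhoods, labeled_hashes, threshold, workers=8):
    """Each neighborhood is scored independently, so splitting the corpus
    across workers cannot change whether any neighborhood is detected."""
    with Pool(workers) as pool:
        scored = pool.map(partial(score_neighborhood, labeled_hashes),
                          list(neighborhoods.items()))
    return {prefix for prefix, score in scored if score > threshold}

# Call this under `if __name__ == "__main__":` on platforms that spawn
# worker processes.
```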

4.4 Cleaning the Data
We do two types of cleaning over the data: first we identify recursion errors that result in false duplication from the crawling process, and then we eliminate common, benign features with stop-chunk removal and whitespace normalization. We evaluate the effectiveness of these methods in §6.1.

4.4.1 Detecting and Handling Recursion Errors. Crawling the real-world web is a perilous process, with malformed HTML, crawler traps, and other well understood problems [9, 25]. We detect and remove crawler artifacts that appear in both Common Crawl and GeoCities. Our main concern is recursion errors, where a loop in the web graph duplicates files with multiple URLs; such results would skew our detection of copied data. We see this problem in both datasets and use heuristics based on how often a URL path component is repeated, removing a URL from processing if it is determined to be a recursion error. We evaluate these heuristics in §6.1, finding that they have a very low false positive rate in detecting crawler problems and are sufficient to avoid false positives in our duplication detection. Future work may refine these heuristics to reduce the number of false negatives in recursion-error removal.
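One simple form of this heuristic flags URLs whose path repeats any single component many times. The sketch below is illustrative; the repetition limit is a placeholder parameter, not the exact rule from the paper.

```python
from collections import Counter
from urllib.parse import urlparse

def looks_like_recursion_error(url: str, max_repeats: int = 3) -> bool:
    """Heuristically flag crawler recursion errors such as
    http://example.com/a/b/a/b/a/b/a/b/ where path components repeat."""
    components = [c for c in urlparse(url).path.split('/') if c]
    if not components:
        return False
    most_repeated = Counter(components).most_common(1)[0][1]
    return most_repeated > max_repeats

# Example: looks_like_recursion_error("http://example.com/a/b/a/b/a/b/a/b/")
# returns True; such URLs are dropped before hashing.
```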

4.4.2 Stop Chunk Removal. We see many common idioms in both files and chunks. We call these stop chunks, analogous to stop words in natural language processing (such as "a", "and", and "the"). For chunks, these include the empty paragraph (<p></p>), or a single-space paragraph (<p>&nbsp;</p>). For files, examples are the empty file, or a reject-all robots.txt file. These kinds of common, benign idioms risk skewing our results.

We remove stop chunks before applying bad neighborhood detection, and use both manual and automated methods of generating lists of stop chunks. We find that automated generation, while less accurate, is sufficient.

We manually generated lists of stop chunks for duplicate detection in the web (§7). In Common Crawl, the list of 226 chunks is short enough to allow manual comparison; if the list becomes too large, we can apply Bloom filters [8] to support efficient stop-chunk removal.

For automated generation, we apply the heuristic of treating all short chunks as stop chunks. In our evaluation of expert-identified content (§8), we discard chunks shorter than 100 characters. We also compare manual and automated generation: we previously manually labeled 226 (45%) of the top 500 chunks in Common Crawl as benign. By discarding chunks shorter than 100 characters, we automatically label 316 (63%) as benign: 222 benign and 94 non-benign from our manually labeled list. The non-benign chunks that are automatically labeled aren't necessarily "bad": typically they are basic layout building blocks used in web templates or editors. Thus we find the trade-off acceptable: manual generation is more accurate, but automatic generation is sufficient for applications using expert-identified content.

4.5 Chunking and Hashing
Chunking text and data into non-overlapping segments has been done in natural language processing (NLP) [1, 37], in disk deduplication optimization [35, 47], and in information retrieval [36]. We chunk all content in our corpora, breaking at the HTML delimiters <p> (paragraph) and <div> (generic block-level component), tags that are used to structure documents. We could also chunk on other tags for tables, frames, and inline frames, but find that our chosen delimiters are sufficiently effective.

Hash function: Unlike prior work with hashing from natural language processing, we use cryptographic hashing to summarize data. We employ the SHA-1 [18, 32] cryptographic hash function for its precision: identical input always produces the same output, and different input yields a different output. Cryptographic hashing is used in disk deduplication [35], but most prior work considering duplicate detection uses locality-sensitive [15, 28] and semantic [38] hashes. We use cryptographic hashing to eliminate the small false positive rate seen in other schemes: we show in §6.5 that even a tiny rate of false positives is magnified by large corpora.
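A minimal sketch of the chunking and hashing just described; this is a simplification (the real pipeline parses crawl archives and tolerates malformed HTML), and the regular expression plus the optional 100-character cutoff for short chunks are our reading of the description above and of §4.4.2.

```python
import hashlib
import re

# Break documents at the opening <p> and <div> tags used to structure them.
CHUNK_DELIMITER = re.compile(r'(?i)<(?:p|div)\b')

def chunk_html(html: str, min_len: int = 100) -> list:
    """Split a page into chunks at <p>/<div> boundaries, normalize whitespace,
    and drop short chunks (heuristic stop-chunk removal)."""
    chunks = (' '.join(c.split()) for c in CHUNK_DELIMITER.split(html))
    return [c for c in chunks if len(c) >= min_len]

def chunk_hash_vector(html: str) -> list:
    """SHA-1 digest of each chunk, in document order: the vector H(c_f)."""
    return [hashlib.sha1(c.encode('utf-8')).hexdigest() for c in chunk_html(html)]

def file_hash(raw: bytes) -> str:
    """Whole-file hash H(f)."""
    return hashlib.sha1(raw).hexdigest()
```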

5 DATASETS AND IMPLEMENTATION

5.1 Datasets and Relevance
This paper uses three web datasets: Common Crawl, GeoCities, and our own phishing site corpus. We use the Common Crawl crawl-002 dataset (C_cc), collected in 2009/2010 and publicly provided by the Common Crawl Foundation [20], to represent recent web data. crawl-002 is freely available on Amazon S3 and includes 2.86×10⁹ items (26 TB compressed, 99 TB uncompressed). Most of its data is HTML or plain text, with some supporting textual material (CSS, JavaScript, etc.); it omits images.

As a second dataset, we use the GeoCities archive (C_g) crawled by the Archive Team [4] just before the GeoCities service was shuttered by Yahoo! in October 2009. The dataset was compiled between April–October 2009 and contains around 33×10⁶ files (650 GB compressed) including documents, images, MIDI files, etc. in various languages. Although the content is quite old, having been generated well before its compilation in 2009, it provides a relatively complete snapshot of diverse, user-generated content.

We generate the third dataset of phish (C_p) by extracting the top-level webpages (HTML) from a stream of 2374 URLs of suspected phishing sites provided by PhishTank [33], a crowd-sourced anti-phishing service, over two days (2014-Sep-24 and 2014-Sep-25). (Our other datasets, Common Crawl and GeoCities, are not suitable for phish detection since phish lifetimes are often only hours to days, leaving very little time to crawl phish.) From the collected URL stream, we automatically crawl the suspect URLs and manually classify each as phish or otherwise. We ignore phishing sites that were removed by the time we crawl, discarding about 20% of the stream.

Dataset Relevance: As the web evolves quickly, its content and structure also evolve. Most GeoCities content dates from the late 1990s to the early 2000s, Common Crawl is from 2009–2010, and our phishing dataset is from 2014. Does evaluation of our techniques over these older datasets apply to the current web? We strongly believe it does, for two reasons: our datasets fulfill the requirement for content classification, and we show that our approach easily adapts to today's and tomorrow's web.

First, the key requirement for classification of content in a corpus is that the corpus be diverse and large enough to approach real-world size and diversity. Both GeoCities and Common Crawl satisfy the diversity requirement. While GeoCities is perhaps small relative to the current web, we believe it is large enough to provide diversity. Our phishing dataset is intentionally small because it addresses a more focused problem; it shows considerable diversity in that domain.

Second and more importantly, the web will always be different tomorrow, and will continue to change over the next ten years. The increasingly dynamic and personalized nature of the web will modify the edges (like recommended links) but leave the core content unchanged and still detectable with hashing. We show that our approach works well over many years of web pages with only modest changes (for example, adding the use of <div> in addition to <p> to identify chunks). Our largest change was to shift from static web content to crawling a browser-parsed DOM in our phishing study (§8.2); while conceptually straightforward, its implementation is quite different. This change allows us to accommodate dynamically generated, JavaScript-only web content. We believe that this range of ages in our datasets strongly suggests that our approach (perhaps with similar modest changes) will generalize to future web practices, whatever they may be.

5.2 Implementation
We implement our methods and post-processing on cloud computing services (Amazon EC2) and a local cluster of 55 commodity PCs running Apache Hadoop [2]. Processing was done with custom MapReduce programs [16], Apache Pig [3], and GNU Parallel [43]. Our current cluster can ingest data at a rate of around 9 TB/hour.

We initially hash files and chunks in Common Crawl (C_cc) on EC2 in 11.5×10³ compute hours (18 real hours, ∼$650), producing 2.86×10⁹ file hashes and 40.5×10⁹ chunk hashes along with backreferences to the original dataset (1.8 TB of metadata).

We use our local cluster to hash GeoCities (C_g) in 413 compute hours (1.5 real hours), producing 33×10⁶ file hashes and 184×10⁶ chunk hashes along with backreferences (4.7 GB of metadata).

Overall performance is good as the corpus grows to the size of a large sample of the web. Although the theoretical bound on processing is the O(n log n) sort of hashes, in practice performance is dominated by scanning the data, an O(n) process that parallelizes well with MapReduce. We see linear runtime as uncompressed dataset size ranges from 5 GB to 5 TB (we omit this graph due to space). We expect processing 225 TB of uncompressed data (Common Crawl, Feb. 2019, CC-MAIN-2019-09) on the same EC2 setup used earlier to take 41 real hours (scaling linearly from the 18 real hours measured for the 99 TB crawl-002) and $1300.


5.3 Reproducibility
Our research is reproducible, with all code available as open source (when possible), and all data available from us or provided with pointers to other public sources (for data generated by others).

Source code and data used for validation, discovering and detecting duplication of web content, and detecting clones of Wikipedia are available at https://ant.isi.edu/mega and https://github.com/cardi/content-reuse-detection. Source code for our AuntieTuna anti-phishing plugin and data used in our anti-phishing application are available at https://ant.isi.edu/software/antiphish and https://github.com/cardi/auntietuna. Instructions for reproducibility are included in their respective repositories.

6 VALIDATION
We next validate our design choices, showing the importance of cleaning and the correctness of our methodology.

6.1 Do Our Cleaning Methods Work?
Initial analysis of our raw data is skewed by crawler errors, and identification of bad neighborhoods can be obscured by common benign content. We next show that our cleaning methods from §4.4 are effective.

We have reviewed our data and taken steps to confirm that recursion errors do not skew discovery of duplicates. While only 1% of all 913×10⁶ neighborhoods in Common Crawl are the result of recursion errors, removing the obvious errors is helpful although not essential. Details of recursion error removal and its validation are omitted here due to space, but are in our technical report [5].

We next describe validation that our stop-chunk removal process (§4.4.2) is effective. To identify stop chunks, we manually examine the 500 most frequently occurring chunks in Common Crawl and identify 226 as benign. These chunks occur very frequently in the dataset as a whole, accounting for 35% of all chunks that occur ≥10⁵ times. To verify that we do not need to consider additional frequent chunks, we also examine the next 200 and identify only 43 as benign, showing diminishing returns (these 43 account for only 1% of all chunks that occur ≥10⁵ times). We therefore stop with the benign list of 226 chunks found in the top 500 most frequent, as it is sufficient to avoid false positives due to benign data.

To demonstrate the importance of stop-chunk removal, we compare bad neighborhood detection with and without stop-chunk removal. Stop chunks dilute some pages and skew the quality of the badness ratio; if we do not remove stop chunks in Common Crawl (913×10⁶ neighborhoods), we would detect 1.88×10⁶ (2.35%) more "bad" neighborhoods than the 79.9×10⁶ bad neighborhoods we find after stop-chunk removal. These additional 1.88×10⁶ "bad" neighborhoods are false positives, mainly consisting of detected stop chunks, which would dilute the results above the detection threshold and reduce detection's precision: (true positives)/(true positives + false positives).

6.2 Can We Discover Known Files and Chunks?
We next turn to the correctness of our approach. We begin by validating and verifying that hashing can detect specific content in spite of the background "noise" of millions of web pages with the following experiment.

Duplicated full files: We first consider a spammer that duplicates a file many times to provide content for thousands of parked domains. To emulate this scenario, we take a known website (blog.archive.org as of 2013-Aug-22) containing roughly 4000 pages or files and duplicate the entire site from d = 100 to 5000 times. For each duplication we generate a unique, emulated website, process that data with steps 0–2 of our methodology, and merge this with our full processed data.

We then build our labeled dataset via blind discovery. Our blind discovery process populates the labeled dataset with the most frequently occurring content. In Common Crawl (C_cc), our blind discovery threshold is 10³ (the threshold is set as described in §4.2): all files that have more than 10³ duplicates are labeled.

Figure 1 shows the results of this experiment against the C_cc file-frequency distribution. This frequency-occurrence graph shows the number of occurrences (y-axis) a file object has, given the number of times it has been duplicated (x-axis). Our discovery threshold is marked by a red dotted line at x = 10³; all the content (indicated by points) past the threshold is added to the labeled dataset. Duplicating the entire blog moves it from unique, unduplicated content (a grey dot in the top left) to an outlying point with 4000 pages occurring 5000 times (indicated by a labeled black triangle at (x, y) = (5000, 4000)). We see that the point passes our threshold and we have discovered our injected and massively duplicated blog. This change from the top left to an outlier above and further right on the graph represents what happens when spammers duplicate parts of the web.

Spammers may duplicate files fewer times. To consider this scenario, we change the number of duplications d to values less than in our previous example. The blue circles represent the injected site had the entire site (4000 files) been duplicated different numbers of times (d = 100, 250, 1000). When the injected site has been duplicated ≤10³ times (three blue circles on and to the left of the red threshold line), that site and its corresponding files will not be automatically discovered; all points right of the red threshold line (the black triangle at x = 5000) will. Note that even with fewer duplications (blue circles left of the red threshold), the injected files duplicated fewer times will be visibly obvious outliers on the graph and may be detected with manual analysis or more sensitive automation (using additional analysis of the corpus or an iterative search to determine an optimal threshold as defined in §4.2).

Partially duplicated pages: The above experiment shows our ability to track duplicated files, but spammers almost always add to the duplicated content to place their own links or advertisements. We therefore repeat our study of duplicating files by duplicating an entire website d = 5000 times, but add a different paragraph to the beginning and end of each duplicated page to represent unique advertisements attached to each page. Since each page varies here, file-level analysis will detect nothing unusual, but chunk-level analysis will show outliers. The size distribution of pages is skewed and appears heavy-tailed (mean: 48 chunks, median: 23, max: 428). Our discovery threshold is increased from 10³ to 10⁵, because the number of chunks in C_cc is much larger than the number of pages.

Figure 2 shows chunk-level evaluation of this scenario, with each dot representing a particular chunk. The red dotted line at x = 10⁵ marks our discovery threshold: all 6000 chunks to the right of this line are discovered, added to the labeled dataset, and further analyzed.


Figure 1: File-level discovery of injected duplicates (black triangles) in C_cc, compared to file frequency (grey dots); axes: number of occurrences vs. number of duplicates, file-level granularity, Common Crawl (2.86×10⁹ files). j: JavaScript, r: robots.txt, E: empty file.

Figure 2: Chunk-level discovery of injected duplicates (black triangles) in C_cc, compared to chunk distribution (grey dots); axes: number of occurrences vs. number of duplicates, chunk-level granularity, Common Crawl (40.5×10⁹ chunks).

Figure 3: Percentage of chunks discovered in blog.archive.org given the number of times it is duplicated; axes: fraction of chunks discovered (threshold > 10⁵) vs. number of times all chunks are duplicated, Common Crawl (40.5×10⁹ chunks). We discover 78% of chunks when we duplicate all chunks 5000 times.

We now see more evidence of duplicated chunked content, shown by a cluster of black triangles (as opposed to a single outlying point) corresponding to the 1.2×10⁹ chunks that make up the duplicated content of blog.archive.org (originally 242×10³ chunks). The light grey dots correspond to all the existing chunks in C_cc.

We see that many of the chunks that make up the pages of blog.archive.org pass our defined threshold, and we discover 78% of the total distinct chunks. Similar to the previous experiment, we can "control" where the points are distributed by varying the number of times we duplicate the site. If all the chunks in the site had fallen below the threshold, we would not have automatically discovered the site via our blind discovery process.

Hashing a finer-grained object in our discovery process allows us to discover more content that has been duplicated. File-level discovery returns a binary result: either we discover the file or not. Chunk-level discovery allows us to discover varying percentages of content depending on how many times it was duplicated. Figure 3 shows how many chunks from blog.archive.org are discovered given the number of times all chunks have been duplicated. When we duplicate the website d = 5000 times (black triangles in Figure 2 and the point marked by the red dotted line in Figure 3), we discover 78% of the chunks. (Trivially, we discover 100% of the chunks when we duplicate the site ≥10⁵ times.)

Our simple threshold detects some but not all of the duplicated chunks that were injected. The duplicated content (black triangles) in Figure 2 consists of clear outliers from most of the traditional content (grey dots), suggesting a role for manual examination. This experiment shows that chunk-level analysis is effective even though only portions of pages change. We next look at the effects of content mutation more systematically.

6.3 Can We Detect Specific Bad Pages?
Having shown that we can discover known files and chunks, we next validate our detection mechanism by finding known targets T and understanding the conditions in which our mechanism fails. Given our labeled dataset curated by an expert (L_expert) and one built via blind discovery (L_blind), can we detect bad pages? Furthermore, as we increasingly mutate each page, at what point can we no longer detect it?

To evaluate our bad page detection mechanism, we continue our prior example where we rip and duplicate blog.archive.org; this set of pages becomes our injected corpus C_i. We mutate C_i in a consistent manner that can be applied to all pages in C_i to get a resulting C_i'. We can categorize each mutation as one of the following:

+ Add additional content, such as ads or link spam
∆ Modify existing content by rewriting links
− Remove content such as headers, copyright notices, footers, or the main body of the page

We build both L_expert and L_blind from C_i (as described in §6.2), then run the detection process to see if pages in C_i' are detected.

We continue mutating C_i' (e.g., C_i'', ..., C_i^(n)) to understand the kinds and amount of mutations that the detection process can handle. While we utilize a copy of blog.archive.org to build L and C_i, our results for each mutation experiment are consistent with other L because we mutate each of the 4626 pages. For each experiment, we have the base site C_i and apply n independent mutations to each page, resulting in C_i^(n).

In our first mutation experiment, we continuously add content to a page such that the page is diluted with non-target content and we do not detect it (due to the badness ratio not reaching a particular threshold). Figure 4 shows the performance with both L_expert (green) and L_blind (blue). The bottom x-axis shows the number of chunks added per page relative to the average number of chunks per page in C_i (cpp = 48). The y-axis shows the average badness ratio per page (averaged over all 4626 pages in C_i). The badness threshold is labeled on each graph at 0.144 (we describe its computation in a later section). We perform 10 runs over C_i at each x value and take the average. We omit error bars when the standard error is < 0.01 for clarity (Figure 4 in particular has no error bars).

This experiment shows that we can tolerate an additional 3.4× (using L_blind) or 4.5× (using L_expert) the mean number of chunks per page (cpp) in each labeled dataset and still detect duplicated content. These tolerances are visually represented in Figure 4, where the blue (L_blind) or green (L_expert) dotted lines meet with the red dotted line (badness threshold): points to the left of the blue or green lines


Figure 4: Effects of continuously adding chunks on pages; average badness of a page vs. number of added chunks per page (relative to cpp and absolute), blog.archive.org. The blind L (blue) and expert L (green) curves cross the badness threshold at 3.4× and 4.5× cpp, respectively.

Figure 5: Effects of continuously deleting chunks on pages (20 mutation rounds); average badness of a page and average number of chunks per page vs. number of deleted chunks per page (relative to cpp and absolute), blog.archive.org, for expert L (green) and blind L (blue).

Figure 6: Effects of continuously changing chunks on pages; average badness of a page vs. number of random changes per page (relative to cpp and absolute), blog.archive.org. The blind L (blue) and expert L (green) curves cross the badness threshold at roughly 1.8× and 2.0× cpp, respectively.

on the x-axis will have an average badness above the detection threshold on the y-axis. This behavior is not surprising: if we were to dilute the content with many other unrelated chunks, the average badness would asymptotically approach 0.

We next continuously delete content randomly; deleting content will increase the badness ratio, but the page may be overlooked because the number of total chunks on the page will be smaller. Users of hashing might require a minimum number of chunks per page before applying the badness ratio. Figure 5 shows the average badness of a page given the number of chunks we delete per page. Using L_expert (green), we see that the ratio is always 1.0: deleting chunks does not affect the badness because the entire page is bad regardless. With L_blind (blue), the average badness of a page initially increases and then stabilizes up to a certain point as we increase the number of deleted chunks per page. Pages with a small number of total chunks have, on average, a lower badness ratio until each such page is eventually removed from the population, which results in a higher average badness as pages with a larger number of total chunks survive deletion. In this experiment, our detection mechanism on average handles all 400 deletions per page.

Similarly, we see a large variance in badness at the tail of the graph because the population of pages in C_i'(n) (after mutation) decreases. As we increase the number of deleted chunks per page, the average number of chunks per page (orange) falls. Pages also cease to exist after all their chunks have been deleted; we see in Figure 5 that the average number of chunks per page increases as the population of pages decreases. This behavior is expected: as a trivial example, consider a page with only two chunks, only one of which is in L; the badness of the page is 0.5. If we delete the bad chunk, the badness falls to 0, but if we delete the other, the badness increases to 1. Thus, depending on the chunks we delete, the badness of a page will fluctuate.

In our final experiment, we continuously modify content to the point where we can no longer detect it (e.g., if every chunk is modified at least once, our detection algorithm will fail). We consider a stream of mutations: we randomly pick a chunk to modify and change one random character in that chunk, with replacement (in successive mutations, the same chunk can be modified again). Figure 6 shows the average badness of a page given the number of random changes with replacement. We see an exponential drop in the average badness of the page as we linearly increase the number of random changes (with replacement) per page.

Figure 7: Number of pages detected as bad after continuously changing chunks on pages in blog.archive.org (x-axis: relative and absolute number of random changes per page; y-axes: number and fraction of pages detected as bad, for L_expert in green and L_blind in blue).

On average, our bad page detection mechanism handles 1.8× cpp (L_blind) and 2.0× cpp (L_expert) changes before the page falls below the threshold.

To show that we can tolerate 3.4× cpp mutations, we look at the performance of our bad page detection mechanism. Figure 7 shows how many pages we detect as bad given the number of random changes per page in C_i. In the perfect case (such as using L_expert on an unmodified site), we detect all 4600 pages in C_i as bad. While L_expert performs much better initially (detecting between 300 and 700 more pages than L_blind), we see both lines eventually converge.

We can detect known bad pages up to a certain degree of mutation. Our validation experiments show that we can handle between 1.8–4.5× cpp mutations on C_i, depending on the type of mutation and the labeled dataset we utilize. While utilizing L_expert slightly increases the number of mutations we can tolerate (compared to using L_blind), L_expert contains over 4.8× the number of entries (|L_expert| = 21×10^3, |L_blind| = 4.4×10^3). We next transition to the validation of detecting known bad neighborhoods.


Figure 8: Effects of continuously adding chunks in a neighborhood (x-axis: relative and absolute number of added chunks per page; y-axis: average badness of the neighborhood; tolerances of 4.4× cpp for L_blind and 5.2× cpp for L_expert are marked against the badness threshold).

Figure 9: Effects of continuously deleting chunks in a neighborhood (x-axis: relative and absolute number of deleted chunks per page; left y-axis: average badness of the neighborhood; right y-axis: number of chunks in the neighborhood, ×1000).

Figure 10: Effects of continuously changing chunks in a neighborhood (x-axis: relative and absolute number of random changes per page; y-axis: average badness of the neighborhood; tolerances of 5.3× cpp for L_blind and 5.4× cpp for L_expert are marked against the badness threshold).

6.4 Can We Detect Known Bad Neighborhoods?

Given our success finding bad pages, we next validate the robustness of detecting known bad neighborhoods. Recall that a neighborhood contains a set of pages that share a common URL prefix. As with pages, we evaluate both expert and blind labeled datasets, and change a known target to evaluate the sensitivity of our detection mechanism.

We evaluate our detection mechanism by designing a mutation experiment with an example neighborhood N. The goal of our experiment is to understand the degree of change before our detection process fails. We continue to use the same neighborhood N (blog.archive.org) and the same approach as in the previous section (§6.3), with the following change: we mutate all pages in N in a consistent manner to get a resulting N'; n mutations result in N'(n). We then run the bad neighborhood detection process to see if N'(n) is detected.
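A sketch of the neighborhood-level computation (ours, for illustration; the precise prefix definition follows §4) groups pages by a common URL prefix and pools their chunks before computing badness:

import hashlib
from collections import defaultdict
from urllib.parse import urlparse

def sha1_hex(chunk: str) -> str:
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def neighborhood_key(url: str, depth: int = 1) -> str:
    """Common URL prefix: host plus the first `depth` path components."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p][:depth]
    return parsed.netloc + "/" + "/".join(parts)

def neighborhood_badness(pages, labeled_hashes, depth=1):
    """pages: iterable of (url, [chunk, ...]); returns {prefix: badness ratio}."""
    bad, total = defaultdict(int), defaultdict(int)
    for url, chunks in pages:
        key = neighborhood_key(url, depth)
        for c in chunks:
            total[key] += 1
            if sha1_hex(c) in labeled_hashes:
                bad[key] += 1
    return {k: bad[k] / total[k] for k in total}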

We see similar results in the performance of bad neighborhood detection compared to bad page detection. Figures 8 through 10 show the bad neighborhood detection performance using both L_expert (green) and L_blind (blue) for add, delete, and modify operations, respectively. We compare the relative number of mutated chunks per page in N (cpp) against the resulting badness ratio of the neighborhood after mutation (N'(n)). We use a fixed badness threshold as described in §7.3. We again take the average over 10 runs over N at each x value and omit error bars when the standard error is < 0.01.

Our experiments show that we can tolerate between 4.4–5.4× cpp mutations, and that bad neighborhood detection is much more robust than bad page detection: on average our process can handle 2.7–3.0× more modifications per page than bad page detection. Analysis of the neighborhood is much more robust because we consider the badness across a collection of pages and have a larger population of content to work with; considering only a single page when calculating badness is much more susceptible to fluctuation and not as robust to mutation because of its smaller magnitude.

We have now validated the mechanisms that we will use in two applications: content reuse detection over web content using the blind process, and detection of expert-identified content in the web.

Table 1: Performance of expert-identified detection on the phish corpus using cryptographic and locality-sensitive hashing.

True Nature of Page    Classified As                    Crypto    LSH
PayPal Phish           PayPal Phish (TP)                    43     43
                       Missed PayPal (FN)                   42     42
Non-PayPal             Misclassified PP Phish (FP)           0     10
                       Non-PayPal (TN)                    1803   1793
Total                                                     1888   1888

6.5 Cryptographic vs. Locality-Sensitive Hashes

Our work uses cryptographic hash functions to minimize the impact of false positives that result from locality-sensitive and semantic hashing. To quantify this trade-off, we next compare bad page detection with SHA-1 (our approach) to the use of Nilsimsa [15], a locality-sensitive hashing algorithm designed for anti-spam detection. We use a corpus C_p of 2374 suspected phish (as described in §5) and build a labeled dataset L from current and recent PayPal U.S., U.K., and France home pages (Sep. 2014, plus Jan. 2012 to Aug. 2013 from archive.org).

We process the datasets, chunking on <p> and <div> tags and computing hashes of each chunk in C_p and L with both SHA-1 and Nilsimsa. We then use L to detect PayPal phish in C_p. For detection with Nilsimsa, we use a matching threshold of 115 (0 being the fuzziest and 128 an exact match), a relatively conservative value.
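The cryptographic-hash side of this comparison can be sketched as below (our simplification: a regex split on <p>/<div> stands in for the rendered-DOM chunking, and the detection rule is exact SHA-1 set membership; the Nilsimsa side would instead compare digests approximately, which we omit here).

import hashlib
import re

TAG_SPLIT = re.compile(r"<\s*/?\s*(?:p|div)\b[^>]*>", re.IGNORECASE)

def chunks_of(html: str):
    """Split HTML into chunks on <p> and <div> boundaries, dropping empties."""
    for piece in TAG_SPLIT.split(html):
        piece = piece.strip()
        if piece:
            yield piece

def sha1_hex(chunk: str) -> str:
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def build_labeled_set(pages):
    """pages: iterable of HTML strings (e.g., the PayPal home pages)."""
    return {sha1_hex(c) for page in pages for c in chunks_of(page)}

def matching_chunks(html: str, labeled_hashes) -> int:
    """Number of chunks in html whose SHA-1 appears in L."""
    return sum(1 for c in chunks_of(html) if sha1_hex(c) in labeled_hashes)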

Table 1 compares the confusion matrix when using SHA-1 (Crypto) and Nilsimsa (LSH) independently in detection. Both algorithms detect (TP = 43) and miss (FN = 42) the same number of PayPal phish. However, Nilsimsa has false positives, misclassifying 10 pages as PayPal phish, while SHA-1 misclassifies none.

Even very low, non-zero false-positive rates (0 < FPR < 1%) are bad when used against web-size corpora, since false positives in a large corpus will overwhelm true positives. For Common Crawl with 2.86×10^9 files, Nilsimsa's very low 0.55% FPR at threshold 115 could result in 15.7×10^6 false positives (upper bound)!
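The arithmetic behind that estimate is simple (our back-of-the-envelope check, not code from the paper): multiply the corpus size by the false-positive rate.

corpus_files = 2.86e9   # files in Common Crawl
fpr = 0.0055            # Nilsimsa FPR at threshold 115 on the phish corpus
print(f"upper bound on false positives: {corpus_files * fpr:.3g}")
# -> 1.57e+07, i.e., about 15.7 million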


Table 2: Categories of the top 100 distinct chunks in C_cc.

Description          |c|   Type
Common (benign)       68   Benign
Templates             17   Benign
  e-Commerce           8   Benign
  Other                9   Benign
Misc.                 15   Benign
Total                100

Table 3: Classification of a sample of 100 distinct chunks with more than 10^5 occurrences in C_cc.

Description                 |c|   Type
Misc.                         4   Benign
JavaScript                    2   Benign
  escaped                     1   Benign
  other                       1   Benign
Templates                    83   Benign
  navigation                 17   Benign
  forms                      32   Benign
  social                      4   Benign
  other                      30   Benign
Commercial                    6
  spam                        1   Malicious
  JavaScript advertising      3   Ambiguous
  JavaScript tracking         2   Ambiguous
Possibly Commercial           5   Ambiguous
Total                       100

LSH's design for approximate matching makes some false positives inevitable. A challenge with any LSH is finding the "right" threshold for each particular dataset to minimize the FPR. The number of false positives can differ greatly with small variations in the threshold. We exhaustively studied the parameter space for Nilsimsa and our phishing dataset. A threshold of 128 forces exact matching, causing no false positives, but also makes the algorithm equivalent to cryptographic hashing. Our initial parameter choice was 115; all thresholds from 120 down to 115 give a very low but non-zero false-positive rate, from 0.33% to 0.55%. At a threshold of 114, the false-positive rate doubles (1.11%), and as we continue to decrease the threshold, the FPR grows rapidly. Matching with thresholds from 128 down to 120 is like exact matching with no false positives, in which case our analysis is needed to evaluate its performance. Although we find some thresholds with no false positives, in general, exhaustive search of the parameter space is not possible, and no one value will be "correct" across varying inputs.

The problem of false positives overwhelming rare targets is known as the base rate fallacy, and it is a recognized barrier to the use of imprecise detection in security problems [7, 39]. This problem motivates our use of cryptographic hashing.

7 ANALYSIS OF BLIND DISCOVERY OF WEB COPYING

We next study the application of blind discovery to duplication of web content, and use this application to understand our approach.

7.1 Why is File-level Discovery Inadequate?

We first consider file-level discovery on both datasets. File-level comparisons are overly sensitive to small mutations; we use them to establish a baseline against which to evaluate chunk-level comparisons.

Figure 1 (grey dots) shows the long-tail distribution of the frequency of file-level hashes in Common Crawl (2.86×10^9 files, C_cc). We look at both the top 50 most frequently occurring files and a sample of 40 random files that have more than 10^3 occurrences, and find only benign content (e.g., JavaScript libraries, robots.txt). We see the same results with GeoCities (C_g), where common files include the GeoCities logo, colored bullets, and similar benign elements.
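File-level discovery reduces to hashing each file whole and counting duplicates, as in this sketch (ours, for illustration); a one-character mutation changes the whole-file hash, which is why this baseline is so sensitive.

import hashlib
from collections import Counter

def file_hash(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def file_level_counts(files):
    """files: iterable of raw file contents (bytes); returns hash -> occurrence count."""
    return Counter(file_hash(f) for f in files)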

7.2 How Does Chunking Affect Discovery?

We expect the greater precision of chunk-level analysis to be more effective. We next consider chunking (§4.1) of textual files (HTML, plaintext, JavaScript) by paragraphs (i.e., the literal <p> tag).

Figure 2 shows the frequency-occurrence distribution of the 40.5×10^9 chunks in Common Crawl (C_cc). Again, we see a heavy-tailed distribution: 40% of chunks are unique, but ∼3.7×10^9 distinct chunks appear more than 10^5 times. The most common chunk is the empty paragraph (<p>).
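The frequency-occurrence distribution can be tabulated as in the sketch below (ours; at Common Crawl scale this is a MapReduce/Spark job, but a Counter conveys the idea): count occurrences per chunk hash, then count how many distinct hashes occur exactly k times.

import hashlib
from collections import Counter

def sha1_hex(chunk: str) -> str:
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def frequency_occurrence(all_chunks):
    """all_chunks: iterable of chunk strings drawn from every page."""
    per_chunk = Counter(sha1_hex(c) for c in all_chunks)  # hash -> number of duplicates
    return Counter(per_chunk.values())                    # duplicates -> number of distinct hashes

# dist[1] is the number of unique chunks; dist[k] is the number of distinct
# chunks that appear exactly k times.
dist = frequency_occurrence(["<p>", "<p>", "hello", "world"])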

Chunking's precision reveals several different kinds of duplication: affiliate links, JavaScript ads, analytics, and scripts, with benign content dominating the list. Table 2 classifies the 100 most frequent chunks. After common web idioms (the empty paragraph, etc.), templates from software tools or web pages begin to appear.

Again, we turn to a random sample of the tail of the graph to understand what makes up duplicated content. We draw a sample of 100 chunks from those with more than 10^5 occurrences and classify them in Table 3.

This sample begins to show common web components that support monetization of websites. JavaScript occurs occasionally (7%) and is used for advertising via Google AdSense (3%) and for user tracking and analytics (2%). We sampled one instance of spam in which an article from The Times (London) was copied and an advertising snippet for travel insurance was inserted into the article. Other snippets were potentially spam-like or linked to a scam (5%), but were ambiguous enough to qualify as non-malicious (if poorly designed) attempts at legitimate monetization.

We also find instances of potentially malicious escaped JavaScript; decoding one reveals an email address (obfuscated via JavaScript to throw off spammers). Most content we discovered consists of elements of sites that make heavy use of templates (83%), such as navigation elements, headers, and footers. Given an L_o of the most frequently occurring content, this is not surprising: thousands of pages containing such template elements would naturally show up at the tail of the distribution.


Figure 11: Chunk-level discovery on C_g (97×10^6 chunks, after heuristic and stop-chunk removal). The distribution shows many unique items and few highly duplicated items; annotated examples include escaped JavaScript (eval(unescape(...))), Google AdSense, and the filtered <P> chunk.

Figure 12: Frequency of badness of neighborhoods in C_cc, as a histogram (bars) and CDF (lines); bad prefixes (8.88%) lie above the badness threshold.

Table 4: Classification of a sample of 40 bad neighborhoods from C_cc. (*) indicates overlap between categories.

Description      |N|      %
Ads*               2    5.0
Blog*             19   47.5
Empty              1    2.5
Forms              1    2.5
Forum              1    2.5
"Suspect"          0    0.0
JavaScript         2    5.0
Templated/CMS     17   42.5
Total*            43    100

We confirm our results over a second dataset with chunk-level discovery on C_g (GeoCities) in Figure 11. We see a similar distribution overall, and find templates and JavaScript similar to those in C_cc.

We have discovered and examined the kinds of content duplicated in C_cc. Chunking identifies frequent duplication, but not bad behavior. However, we can now use the results to build a labeled dataset of objects L_o. We next utilize L_o in our detection mechanism to identify and detect areas where copying runs rampant.

7.3 Are There Bad Neighborhoods in the Real World?

Chunking is successful at identifying bad chunks and pages, but duplication for profit can draw on many related pages to maximize commercial potential. Detection at the individual page level can result in false positives, so we would prefer to detect groups of related pages that show a significant amount of copied content. We now shift our focus to detecting bad neighborhoods.

In Common Crawl: To look for bad neighborhoods, we utilize the top 2121 common distinct chunks from C_cc as our labeled dataset L_c (from §4.2), and identify bad neighborhoods in the full dataset using the algorithm in §4, step 6. C_cc contains 900×10^6 neighborhoods. Our detection threshold uses the mean and standard deviation across all neighborhoods.

As one would hope, most neighborhoods N ∈ C_cc are not bad (91%). Figure 12 shows a combined histogram and CDF of the bad content ratio of all neighborhoods. We observe that 79.8×10^6 prefixes (9%) out of 900×10^6 would be classified as bad neighborhoods: neighborhoods with badness > 0.163 (since µ_N,cc = 0.04 and σ_N,cc = 0.123, and the threshold is µ_N,cc + σ_N,cc).
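The threshold computation is a one-liner over the per-neighborhood badness ratios; the sketch below is ours and assumes the population standard deviation (the text does not specify sample vs. population). With the values reported for C_cc it gives 0.04 + 0.123 = 0.163.

import statistics

def neighborhood_threshold(badness_by_prefix):
    values = list(badness_by_prefix.values())
    return statistics.fmean(values) + statistics.pstdev(values)

def bad_neighborhoods(badness_by_prefix):
    t = neighborhood_threshold(badness_by_prefix)
    return {prefix for prefix, b in badness_by_prefix.items() if b > t}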

To understand the nature of the neighborhoods we identify as employing common content, we extract a sample of 40 neighborhoods from the 19.6×10^6 that are above the threshold and classify them in Table 4. We find 82.5% of the sampled sites to be benign: mostly blogs, forums, or newspapers that make heavy use of templates. Only 13% of the content is clearly for profit: either spam, or search-engine optimization with ads.

Our results show that there is duplication in the web: our approach discovers it through a blind process and then detects the broad kinds of copying that exist. Our approach is best at finding content that uses templates or uniform, repeated structure. Most content with this structure is benign, but we find that a fraction of it is spam. The prevalence of templates in the sites we detect is a direct result of obtaining L via our blind process, since the definition of L is commonly reused content. This observation suggests that a labeled dataset more focused on malicious content (not just duplicated content) would improve the yield, as we explore in §8 with an expert-provided L.

In GeoCities: We repeat this experiment on GeoCities and obtain similar results. A random sample of 100 of the top 5% of worst neighborhoods found 40% to be link farms (clusters of ad-centric pages), while the other 60% were benign (usually templates). These results confirm what we find in Common Crawl. We believe the higher rate of copying in link farms reflects the greater susceptibility of search engines to duplicated content at this earlier time.

8 APPLICATIONS WITH EXPERT-IDENTIFIED CONTENT

We next look at two systems that use our approach with expert-identified content instead of blind discovery. Expert identification is useful when the target content is known but the locations of copies are unknown.

8.1 Detecting Clones of Wikipedia for Profit

We first explore finding copies of Wikipedia on the web. Although Wikipedia's license allows duplication [45], and we expect sharing across sibling sites (Wiktionary, Wikiquote, etc.), other copies are of little benefit to users and often serve mainly to generate advertisement revenue, support link farms, or dilute spam.

We next consider a copy of Wikipedia (from June 2008 [21], in English) as our labeled dataset L and use it to understand whether Wikipedia is copied wholesale or just in parts.

Wikipedia becomes an L_c of 75.0×10^6 distinct chunks of length more than 100 characters (we treat shorter chunks as stop chunks, §4.4.2); we then search for this content in the Common Crawl corpus (C_cc, Nov. 2009 to Apr. 2010). Utilizing L_c, we identify bad neighborhoods in C_cc using the algorithm described in §4.
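Assembling L_c from the dump can be sketched as follows (our illustration, not the released pipeline): hash every chunk, but drop chunks of 100 characters or fewer as stop chunks.

import hashlib

MIN_CHUNK_LEN = 100  # chunks this short or shorter are treated as stop chunks (§4.4.2)

def sha1_hex(chunk: str) -> str:
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def build_wikipedia_labels(chunks):
    """chunks: iterable of text chunks extracted from the Wikipedia dump."""
    return {sha1_hex(c) for c in chunks if len(c) > MIN_CHUNK_LEN}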


Table 5: Classification of the top 40 bad neighborhoods in C_cc, L = Wikipedia.

Description               |N|    %   Type
Wikipedia Clones/Rips      31   78
  "Wikipedia Ring"         13        Profit
  Reference Sites           5        Profit
  Ads                      10        Profit
  Fork                      1        Ambig.
  Unknown                   2        Ambig.
Search Engine Optim.        3    8
  e-Commerce                2        Profit
  Stock Pumping             1        Profit
Wikipedia/Wikimedia         5   13   Benign
Site using MediaWiki        1    3   Benign
Total                      40  100

The length of time between the crawl dates of L and C_cc may bias our detection's true positive rate in a particular direction. To understand Wikipedia's rate of change: during the 16–22 months between L and C_cc, Wikipedia added 1.45–1.86×10^6 pages per month (an increase from 9.00×10^6 to 14.6×10^6 pages), encompassing 2.52–3.46 GB of edits per month [22]. Thus, if sites in C_cc copy from a more recent version of Wikipedia than L, we would expect that to bias our detection's true positive rate to be lower.

Our detection mechanism finds 136×10^3 target neighborhoods of path length 1 (almost 2% of the 68.9×10^6 neighborhoods in C_cc) that include content chunks of length > 100 from Wikipedia. To understand how and why more than 100×10^3 sites copy parts of Wikipedia, we focus our analysis on neighborhoods that duplicate more than 1000 chunks from Wikipedia. We look at the 40 neighborhoods with the largest number of bad chunks and classify them in Table 5. We find 5 Wikimedia affiliates, including Wikipedia, Wikibooks, and Wikisource. More interestingly, we find 34 instances of duplicated content on third-party sites: 31 sites rip Wikipedia wholesale, and the remaining 3 subtly utilize content from Wikipedia for search-engine optimization (SEO).

Almost all of the 31 third-party sites significantly copying Wikipedia do so to promote commercial interests. One interesting example was a "Wikipedia Ring": a group of 13 site rips of Wikipedia, with external links on articles that lead to other sites in the ring. In addition to the intra-ring links, each site placed an advertisement on every page to generate revenue. Other clones are similar, sometimes with the addition of other content. Finally, we also observe Wikipedia content used to augment stock-pumping promotions or to draw visitors to online gambling.

Our study of Wikipedia suggests that our approach is very accurate, at least for bulk copies. All neighborhoods in our sample of the tail of the distribution were copies of Wikipedia, and only one site was a false positive (because it uses MediaWiki, an open-source wiki application). All others were true positives.

We have shown that, starting from a labeled dataset (Wikipedia), our approach detects dozens of copies across the entire web, and that most of the bulk copies exist for monetization. We next shift from bulk copying of Wikipedia to targeted copying in phishing sites.

Figure 13: Implementation diagram of the AuntieTuna anti-phishing plugin.

8.2 Detecting Phishing Sites

Phishing websites attempt to trick users into giving up information (passwords or banking details) with replicas of legitimate sites; these sites often duplicate content, making them detectable with our methods. We adapt our system with an expert-labeled dataset built from common targets (such as real banking pages). In our prototype browser extension, AuntieTuna (implementation diagram in Figure 13, described in our prior work [6]), users first identify pages they care about (manually or automatically, via Trust on First Use) to build a custom labeled dataset (Discovery, §4.2). AuntieTuna then checks each page the user visits for potential phish (Detection, §4.3). As an alternative system implementation, phish could be detected centrally by crawling URLs found in email or website spam, then testing each as a potential phish with our method.

To evaluate our approach to detecting phish, we examine PayPal phishing and build a labeled dataset L_pp from PayPal home pages, chunking on both <p> and <div>, resulting in 311 distinct chunks longer than 25 characters. Our corpus C_p is drawn from a stream of 2374 suspected phish collected over 2 days (Oct. 2014) from PhishTank [33], a crowd-sourced anti-phishing site. All operations are done on the DOM content (as rendered by the browser). We compare each suspected phish against L_pp using our algorithm (§4), which we have shown to be robust to minor changes (§6), with a detection threshold of one or more chunks.
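The per-page check reduces to counting matching chunk hashes against a small threshold, as in this sketch (ours, not the AuntieTuna extension code, which runs on the rendered DOM in the browser):

import hashlib

def sha1_hex(chunk: str) -> str:
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()

def is_suspected_phish(page_chunks, labeled_hashes, threshold=1):
    """Flag a page if at least `threshold` of its chunks hash into L_pp."""
    hits = sum(1 for c in page_chunks if sha1_hex(c) in labeled_hashes)
    return hits >= threshold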

To establish ground truth, we manually examine C_p and identify 85 phish that use text content from PayPal. Our mechanism detects 50 (58.8%) of these pages: 43 directly copy from the original (including those requiring whitespace normalization), and an additional 7 obfuscate their copies with JavaScript. Table 6 classifies the techniques each phish uses. Our targeted dataset and precise approach prevent false positives (specificity is 100%), although we see that 40% of phish copy too little content from the original for us to detect (in this study we ignore image-based phish). This evaluation shows that our hash-based detection can be part of an anti-phishing scheme, complementing other techniques.


Table 6: Classification of phish in C_p, L_pp = PayPal.

Description                                   Num. Pages      %
Candidates                                          2374
  Unavailable                                        486
  Ripped                                            1888
    Other                                           1764           TN = 1764
    PayPal (image-based, removed)                     39
    PayPal                                            85  100.0    FP = 0
      Successfully detected                           50   58.8    TP = 50
        Direct copies                                 35
        Whitespace normalization                       8
        JavaScript obfuscation                         7
      Custom-styled with minor PayPal content         35   41.2    FN = 35

9 CONCLUSION

In this paper, we developed a method to discover previously unknown duplicated content and to precisely detect that or other content in a web-size corpus. We also showed how to exploit hierarchy in the corpus to identify bad neighborhoods, improving robustness to random document changes. We verified that our approach works with controlled experiments, then used it to explore duplication in a recent web crawl with both an informed and an uninformed discovery process. Although most duplicated content is benign, we show that our approach does detect duplication as it is used in link farms, webpage spam, and phishing websites.

ACKNOWLEDGMENTS

This material is based on research sponsored by the Air Force Research Laboratory under agreement number FA8750-18-2-0280 and by the Defense Advanced Research Projects Agency (DARPA) under contract W911NF-18-C-0020. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

REFERENCES

[1] Steven Abney. 1991. Parsing by Chunks. Principle-based Parsing.
[2] Apache. [n. d.]. Hadoop. http://hadoop.apache.org.
[3] Apache. [n. d.]. Pig. http://pig.apache.org.
[4] ArchiveTeam. 2009. GeoCities. http://archiveteam.org/index.php/GeoCities.
[5] Calvin Ardi and John Heidemann. 2014. Web-scale Content Reuse Detection (extended). Technical Report ISI-TR-692. USC/ISI.
[6] Calvin Ardi and John Heidemann. 2016. AuntieTuna: Personalized Content-based Phishing Detection. In Proceedings of the NDSS Workshop on Usable Security (USEC '16). https://doi.org/10.14722/usec.2016.23012
[7] Stefan Axelsson. 2000. The Base-rate Fallacy and the Difficulty of Intrusion Detection. ACM Trans. Inf. Syst. Secur. 3, 3 (Aug. 2000), 186-205. https://doi.org/10.1145/357830.357849
[8] Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422-426. https://doi.org/10.1145/362686.362692
[9] Sergey Brin and Lawrence Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference. 107-117. https://doi.org/10.1016/S0169-7552(98)00110-X
[10] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the Web. In Selected papers from the sixth international conference on World Wide Web. 1157-1166. http://dl.acm.org/citation.cfm?id=283554.283370
[11] Moses S. Charikar. 2002. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing (STOC '02). ACM, New York, NY, USA, 380-388. https://doi.org/10.1145/509907.509965
[12] Stanford Chiu, Ibrahim Uysal, and W. Bruce Croft. 2010. Evaluating text reuse discovery on the web. In Proceedings of the third symposium on Information interaction in context (IIiX '10). 299-304. https://doi.org/10.1145/1840784.1840829
[13] Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina. 2000. Finding Replicated Web Collections. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). ACM, New York, NY, USA, 355-366. https://doi.org/10.1145/342009.335429
[14] Abdur Chowdhury, Ophir Frieder, David Grossman, and Mary Catherine McCabe. 2002. Collection Statistics for Fast Duplicate Document Detection. ACM Trans. Inf. Syst. 20, 2 (April 2002), 171-191. https://doi.org/10.1145/506309.506311
[15] Ernesto Damiani, Sabrina De Capitani di Vimercati, Stefano Paraboschi, and Pierangela Samarati. 2004. An Open Digest-based Technique for Spam Detection. ISCA PDCS 2004 (2004), 559-564. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.6185
[16] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI '04). USENIX Association, Berkeley, CA, USA. http://dl.acm.org/citation.cfm?id=1251254.1251264
[17] Heinz Dreher. 2007. Automatic Conceptual Analysis for Plagiarism Detection. Journal of Issues in Informing Science and Information Technology 4 (2007), 601-614. https://doi.org/10.28945/3141
[18] D. Eastlake 3rd and P. Jones. 2001. US Secure Hash Algorithm 1 (SHA1). RFC 3174. http://www.ietf.org/rfc/rfc3174.txt Updated by RFCs 4634, 6234.
[19] Sven Meyer zu Eissen and Benno Stein. 2006. Intrinsic Plagiarism Detection. In Proceedings of the European Conference on Information Retrieval (ECIR '06).
[20] Common Crawl Foundation. [n. d.]. Common Crawl. http://commoncrawl.org.
[21] Wikimedia Foundation. 2008. Static HTML Dump of Wikipedia. http://dumps.wikimedia.org/other/static_html_dumps/2008-06/en/
[22] Wikimedia Foundation. 2019. Wikimedia Statistics. https://stats.wikimedia.org/v2/#/en.wikipedia.org [Online; accessed 2019-Feb-27].
[23] R. Gowtham and Ilango Krishnamurthi. 2014. PhishTackle: a Web Services Architecture for Anti-phishing. Cluster Computing 17, 3 (Sept. 2014), 1051-1068. https://doi.org/10.1007/s10586-013-0320-5
[24] Monika Henzinger. 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, 284-291. https://doi.org/10.1145/1148170.1148222
[25] Allan Heydon and Marc Najork. 1999. Mercator: A Scalable, Extensible Web Crawler. World-Wide Web Journal 2, 4 (Dec. 1999), 219-229. https://doi.org/10.1023/A:1019213109274
[26] Jong Wook Kim, K. Selçuk Candan, and Junichi Tatemura. 2009. Efficient overlap and content reuse detection in blogs and online news articles. In Proceedings of the 18th international conference on World Wide Web (WWW '09). ACM, 81-90. https://doi.org/10.1145/1526709.1526721
[27] Wenyin Liu, Xiaotie Deng, Guanglin Huang, and Anthony Y. Fu. 2006. An antiphishing strategy based on visual similarity assessment. IEEE Internet Computing 10, 2 (2006), 58-65. https://doi.org/10.1109/MIC.2006.23
[28] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. 2007. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, Vienna, Austria, 950-961. http://dl.acm.org/citation.cfm?id=1325958
[29] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. 2009. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, 1245-1254. https://doi.org/10.1145/1557019.1557153
[30] J. Malhotra and J. Bakal. 2015. A survey and comparative study of data deduplication techniques. In 2015 International Conference on Pervasive Computing (ICPC). 1-5. https://doi.org/10.1109/PERVASIVE.2015.7087116
[31] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. 2007. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (WWW '07). ACM, 141-150. https://doi.org/10.1145/1242572.1242592
[32] National Institute of Standards and Technology. 2008. Secure Hash Standard (SHS). Federal Information Processing Standard (FIPS) 180-3. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[33] OpenDNS. [n. d.]. PhishTank. http://www.phishtank.com
[34] Himabindu Pucha, David G. Andersen, and Michael Kaminsky. 2007. Exploiting similarity for multi-source downloads using file handprints. In Proceedings of the 4th USENIX conference on Networked systems design & implementation (NSDI '07). USENIX Association, Berkeley, CA, USA. https://www.usenix.org/legacy/event/nsdi07/tech/full_papers/pucha/pucha.pdf
[35] Sean Quinlan and Sean Dorward. 2002. Venti: A New Approach to Archival Data Storage. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02). USENIX Association, Berkeley, CA, USA, Article 7. http://dl.acm.org/citation.cfm?id=1083323.1083333
[36] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. 2004. Automatic Detection of Fragments in Dynamically Generated Web Pages. In Proceedings of the 13th International Conference on World Wide Web (WWW '04). ACM, New York, NY, USA, 443-454. https://doi.org/10.1145/988672.988732
[37] Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking using Transformation-Based Learning. ACL Third Workshop on Very Large Corpora, cmp-lg/9505040 (1995), 82-94. http://arxiv.org/abs/cmp-lg/9505040
[38] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. Int. J. Approx. Reasoning 50, 7 (July 2009), 969-978. https://doi.org/10.1016/j.ijar.2008.11.006
[39] Bruce Schneier. 2005. Why Data Mining Won't Stop Terror. Wired Magazine (9 March 2005). https://schneier.com/essays/archives/2005/03/why_data_mining_wont.html
[40] Antonio Si, Hong Va Leong, and Rynson W. H. Lau. 1997. CHECK: a document plagiarism detection system. In Proceedings of the 1997 ACM symposium on Applied computing (SAC '97). ACM, 70-77. https://doi.org/10.1145/331697.335176
[41] Neil T. Spring and David Wetherall. 2000. A Protocol-independent Technique for Eliminating Redundant Network Traffic. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '00). ACM, New York, NY, USA, 87-95. https://doi.org/10.1145/347059.347408
[42] Benno Stein, Moshe Koppel, and Efstathios Stamatatos. 2007. Plagiarism Analysis, Authorship Identification, and Near-duplicate Detection (PAN '07). 4 pages. https://doi.org/10.1145/1328964.1328976
[43] O. Tange. 2011. GNU Parallel: The Command-Line Power Tool. ;login: The USENIX Magazine 36, 1 (Feb. 2011), 42-47. http://www.gnu.org/s/parallel
[44] Martin Theobald, Jonathan Siddharth, and Andreas Paepcke. 2008. SpotSigs: robust and efficient near duplicate detection in large web collections. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '08). ACM, 563-570. https://doi.org/10.1145/1390334.1390431
[45] Wikipedia. 2017. Reusing Wikipedia Content. https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content [Online; accessed 23-Feb-2019].
[46] Hui Yang and Jamie Callan. 2006. Near-duplicate detection by instance-level constrained clustering. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, 421-428. https://doi.org/10.1145/1148170.1148243
[47] Lawrence You, Kristal Pollack, and Darrell D. E. Long. 2005. Deep Store: An Archival Storage System Architecture. In Proceedings of the 21st International Conference on Data Engineering (ICDE '05). https://doi.org/10.1109/ICDE.2005.47
[48] Qi Zhang, Yue Zhang, Haomin Yu, and Xuanjing Huang. 2010. Efficient Partial-duplicate Detection Based on Sequence Matching. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10). ACM, New York, NY, USA, 675-682. https://doi.org/10.1145/1835449.1835562
