3
Statistics (Preview)
Collection                           # of copies   # of pages
TUCOWS                               360           1052
LDP (Linux Documentation Project)    143           1359
Apache manual                        125           115
Java manual                          59            552
Mars Pathfinder                      49            498
More than 48% of pages have copies!
4
Reasons for replication
Actual replication
  - Simple copying or mirroring
Apparent replication
  - Aliases (multiple site names)
  - Symbolic links
  - Multiple mount points
6
Outline
Definitions
  - Web graph, collection
  - Identical collection
  - Similar collection
Algorithm
Applications
Results
8
Identical web collection
Collection: induced subgraph
Identical collection: one-to-one (equi-size)
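The two definitions above can be illustrated with a minimal Python sketch. It assumes that "one-to-one" means a bijection between the pages of two equal-sized collections that also preserves the links of each induced subgraph; that reading, the function names, and the page names are illustrative assumptions, and content equality of the mapped pages is assumed to be checked elsewhere.

```python
def induced_links(pages, links):
    """Links of the induced subgraph: both endpoints lie inside `pages`."""
    return {(s, d) for s, d in links if s in pages and d in pages}


def identical_under(mapping, coll_a, coll_b, links):
    """Check whether `mapping` makes coll_a and coll_b identical collections.

    Assumed reading: equal size, one-to-one page mapping, and every link
    inside coll_a maps onto a link inside coll_b (and vice versa).
    """
    if len(coll_a) != len(coll_b):
        return False
    # With equal sizes, covering both sides implies the mapping is a bijection.
    if set(mapping) != set(coll_a) or set(mapping.values()) != set(coll_b):
        return False
    mapped = {(mapping[s], mapping[d]) for s, d in induced_links(coll_a, links)}
    return mapped == induced_links(coll_b, links)
```

The link from `a1` to a page outside the collection is ignored in the usage below, because only the induced subgraph is compared:

```python
links = {("a1", "a2"), ("b1", "b2"), ("a1", "x")}
identical_under({"a1": "b1", "a2": "b2"}, {"a1", "a2"}, {"b1", "b2"}, links)
```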
9
Collection similarity
Coincides with intuitively similar collections
Computable similarity measure
11
Page content similarity
Fingerprint-based approach (chunking)
  - Shingles [Broder et al., 1997]
  - Sentence [Brin et al., 1995]
  - Word [Shivakumar et al., 1995]
Many interesting issues
  - Threshold value
  - Iceberg query
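As a rough illustration of the fingerprint-based approach, the sketch below chunks a page into word-level shingles, hashes each chunk, and compares two pages by the overlap of their fingerprint sets. The shingle width of 4, the truncated MD5 hash, the Jaccard overlap, and the 0.8 threshold are all assumptions chosen for the example, not values taken from the slides.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprints(text, k=4):
    """Hash every shingle to a 64-bit integer fingerprint."""
    return {
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }


def resemblance(text_a, text_b, k=4):
    """Jaccard overlap of the two fingerprint sets, a value in [0, 1]."""
    fa, fb = fingerprints(text_a, k), fingerprints(text_b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def similar(text_a, text_b, threshold=0.8, k=4):
    """Treat two pages as copies when the overlap reaches the threshold."""
    return resemblance(text_a, text_b, k) >= threshold
```

The threshold value controls how much editing a copy may undergo before it stops counting as a replica, which is one of the issues listed above.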
16
Essential property
[Figure: pages a (set Ra) linked to pages b (set Rb)]
|Ra| = Ls = Ld = |Rb|
Ls: # of pages in Ra that link to pages in Rb
Ld: # of pages in Rb linked to from pages in Ra
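The property can be checked directly on small inputs. The sketch below assumes Ra and Rb are given as sets of page ids and links as (source, destination) pairs; the function names are hypothetical.

```python
def link_counts(ra, rb, links):
    """Return (|Ra|, Ls, Ld, |Rb|) for the links going from Ra to Rb."""
    crossing = [(s, d) for s, d in links if s in ra and d in rb]
    ls = len({s for s, _ in crossing})  # distinct source pages in Ra
    ld = len({d for _, d in crossing})  # distinct destination pages in Rb
    return len(ra), ls, ld, len(rb)


def has_essential_property(ra, rb, links):
    """True when |Ra| = Ls = Ld = |Rb|."""
    size_a, ls, ld, size_b = link_counts(ra, rb, links)
    return size_a == ls == ld == size_b
```

With three pages on each side and one link per pair, the counts are (3, 3, 3, 3) and the property holds. If one of the three links is removed, the counts become (3, 2, 2, 3) and the check fails.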
17
Essential property
[Figure: pages a (set Ra) and pages b (set Rb)]
|Ra| Ls = Ld |Rb|
Ls: # of pages in Ra that link to pages in Rb
Ld: # of pages in Rb linked to from pages in Ra
18
Algorithm
Based on the property we identified
Input: set of pages collected from the web
Output: set of similar collections
Complexity: O(n log n)
19
Algorithm
Step 1: Similar page identification (iceberg query)
  - 25 million pages
  - Fingerprint computation: 44 hours
  - Replicated page computation: 10 hours
[Figure: web pages -> Step 1 -> table (Rid, Pid)]
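An iceberg query computes an aggregate over many groups and keeps only the groups above a threshold. The sketch below shows the idea on page pairs: count the fingerprints two pages share and keep the pairs that reach `min_shared`. It is a simple in-memory version for illustration only; the function name, the input layout, and the threshold are assumptions, and it is not the technique used to process the 25 million pages.

```python
from collections import defaultdict
from itertools import combinations


def similar_page_pairs(page_fingerprints, min_shared=2):
    """Return {(pid_a, pid_b): shared_count} for pairs above the threshold.

    page_fingerprints maps a page id to the set of its chunk fingerprints.
    """
    # Invert the input: fingerprint -> pages that contain it.
    pages_by_fp = defaultdict(set)
    for pid, fps in page_fingerprints.items():
        for fp in fps:
            pages_by_fp[fp].add(pid)

    # Count shared fingerprints per page pair.
    shared = defaultdict(int)
    for pids in pages_by_fp.values():
        for pair in combinations(sorted(pids), 2):
            shared[pair] += 1

    # Keep only the "tip of the iceberg".
    return {pair: n for pair, n in shared.items() if n >= min_shared}
```

Pages that never share a fingerprint are never paired, so the full set of page pairs is not enumerated.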
20
Algorithm
Step 2: link structure check
[Figure: tables R1 (Rid, Pid), Link (Pid, Pid), and R2 (Rid, Pid), where R2 is a copy of R1]
Group by (R1.Rid, R2.Rid)
Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
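The join and group-by of this step can be sketched in plain Python. The sketch assumes that each page id belongs to exactly one Rid and that the two counts are taken over distinct source and destination pages; both points are assumptions, as are the function and parameter names.

```python
from collections import defaultdict


def link_structure(cluster_rows, link_rows):
    """Group links by (source Rid, destination Rid).

    cluster_rows: iterable of (rid, pid) rows, playing the role of R1 and R2.
    link_rows:    iterable of (source pid, destination pid) rows.
    Returns {(rid_a, rid_b): (|Ra|, Ls, Ld, |Rb|)}.
    """
    rid_of = {}
    size = defaultdict(int)
    for rid, pid in cluster_rows:
        rid_of[pid] = rid
        size[rid] += 1

    sources = defaultdict(set)
    dests = defaultdict(set)
    for s, d in link_rows:
        if s in rid_of and d in rid_of:  # join Link with R1 and R2
            key = (rid_of[s], rid_of[d])
            sources[key].add(s)
            dests[key].add(d)

    return {
        key: (size[key[0]], len(sources[key]), len(dests[key]), size[key[1]])
        for key in sources
    }
```

A relational engine would typically implement the same grouping by sorting, which is consistent with the O(n log n) complexity stated earlier; the dictionary-based version here is only meant to make the data flow visible.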
21
Algorithm
Step 3:
S = {}
For every (|Ra|, Ls, Ld, |Rb|) in step 2
    If (|Ra| = Ls = Ld = |Rb|)
        S = S U {<Ra, Rb>}
Union-Find(S)
Step 2-3: 10 hours
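The pseudocode above can be written as a short runnable sketch. It assumes the step 2 output is a dictionary keyed by (Ra, Rb) with (|Ra|, Ls, Ld, |Rb|) tuples as values; that layout and the helper names are illustrative, and the union-find shown is a basic version with path halving.

```python
def find(parent, x):
    """Return the representative of x's group, creating it if needed."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def group_clusters(stats):
    """Merge every pair <Ra, Rb> that satisfies |Ra| = Ls = Ld = |Rb|."""
    parent = {}
    for (ra, rb), (size_a, ls, ld, size_b) in stats.items():
        if size_a == ls == ld == size_b:
            union(parent, ra, rb)

    groups = {}
    for rid in list(parent):
        groups.setdefault(find(parent, rid), set()).add(rid)
    return sorted(sorted(group) for group in groups.values())
```

Pairs that fail the test never enter the union-find structure, so they do not appear in any output group.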
22
Experiment
25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
=> Total number of pages: 35,000 + 15,000 random pages
Result: 180 collections
  - 149 “good” collections
  - 31 “problem” collections
25
Application (web crawling)
Before experiment: 48%
With our technique: 13%
[Figure: initial crawl, crawled pages, offline copy detection, replication info, second crawl]
27
Related work
Collection similarity
  - Altavista [Bharat et al., 1999]
Page similarity
  - COPS [Brin et al., 1995]: sentence
  - SCAM [Shivakumar et al., 1995]: word
  - Altavista [Broder et al., 1997]: shingle