
Advanced Algorithms for Massive Datasets



Page 1: Advanced Algorithms for Massive Datasets

Advanced Algorithms for Massive Datasets

The power of “failing”

Page 7: Advanced Algorithms for Massive Datasets

TTT 2

Page 8: Advanced Algorithms for Massive Datasets

Not perfectly true but...

Page 9: Advanced Algorithms for Massive Datasets

[Plot: false positive rate (y-axis, 0 to 0.1) versus number of hash functions k (x-axis, 0 to 10)]

m/n = 8, Opt k = 8 ln 2 = 5.545...

We do have an explicit formula for the optimal k: k = (m/n) ln 2
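
The curve on this slide can be reproduced numerically. The sketch below assumes the standard approximation for the false positive rate of a Bloom filter, (1 - e^(-kn/m))^k, with m bits, n keys and k hash functions; the function names are illustrative and do not come from the slides.

```python
import math


def false_positive_rate(k: float, bits_per_key: float) -> float:
    """Approximate false positive rate (1 - e^(-k*n/m))^k, with bits_per_key = m/n."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k


def optimal_k(bits_per_key: float) -> float:
    """Real-valued k that minimises the rate above: k = (m/n) * ln 2."""
    return bits_per_key * math.log(2)


if __name__ == "__main__":
    bits_per_key = 8  # m/n = 8, as on the slide
    for k in range(1, 11):
        print(k, round(false_positive_rate(k, bits_per_key), 4))
    print("optimal k:", round(optimal_k(bits_per_key), 3))
```

Since k must be an integer in practice, one would pick whichever of the two neighbouring integers gives the lower rate; for m/n = 8 that is k = 6.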

Page 12: Advanced Algorithms for Massive Datasets

Other advantage: no key storage

Page 13: Advanced Algorithms for Massive Datasets

Crawling

What data structure should we use to keep track of the visited URLs of a crawler?

URLs are long

The check should be very fast

Small errors are acceptable (a false positive ≈ a page not crawled)

Bloom Filter over crawled URLs
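
A minimal sketch of a Bloom filter over crawled URLs is given below. The class, the `should_crawl` helper, and the use of SHA-256 with double hashing to derive the k positions are all assumptions made for illustration; the slides do not prescribe a particular hash family or sizing.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Bit array of m bits probed by k hash positions; stores no keys."""

    def __init__(self, m_bits: int, k_hashes: int) -> None:
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing (an assumption of this sketch): derive k positions
        # from two 64-bit slices of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


# Hypothetical sizing: about 8 bits per URL for 1000 URLs, k = 6.
visited = BloomFilter(m_bits=8 * 1000, k_hashes=6)


def should_crawl(url: str) -> bool:
    """Return True and mark the URL if the filter has not seen it yet."""
    if url in visited:
        return False  # possibly a false positive: the page is skipped
    visited.add(url)
    return True
```

Only the bit array is kept in memory, so the length of the URLs does not affect the space used, and a false positive costs at most one page that is never crawled.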

Page 14: Advanced Algorithms for Massive Datasets

Anti-virus detection

D is a dictionary of virus checksums of some given length z. For each position i, check…

Brute-force check: O( |D| * |F| ) time

Trie check: O( z * |F| ) time

Better solution?

Build a BF on D.

Check whether F[i,i+z-1] ∈ D: if the BF answers YES, then “warn the user” or explicitly scan D

[Diagram: file F with a window of length z from position i to i+z]

O( k * |F| ) time

or even better...
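
The Bloom-filter scan described on this slide can be sketched as follows. All names, the filter size, and the SHA-256 based hashing are assumptions for illustration. Note that this sketch rehashes every window from scratch, which costs O(z) per position; the O( k * |F| ) bound presumably assumes each hash of a window can be computed in constant time. The "explicit scan of D" is stood in for by a Python set lookup.

```python
import hashlib
from typing import List, Set


def positions(window: bytes, m: int, k: int) -> List[int]:
    """k bit positions for a window, via double hashing over one SHA-256 digest."""
    digest = hashlib.sha256(window).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


def build_filter(signatures: Set[bytes], m: int, k: int) -> int:
    """Build the Bloom filter on D, stored as an integer bitmask."""
    bits = 0
    for sig in signatures:
        for p in positions(sig, m, k):
            bits |= 1 << p
    return bits


def scan(data: bytes, signatures: Set[bytes], z: int, m: int = 1024, k: int = 6) -> List[int]:
    """Return every position i such that data[i:i+z] is a signature in D."""
    bits = build_filter(signatures, m, k)
    hits = []
    for i in range(len(data) - z + 1):
        window = data[i:i + z]
        if all((bits >> p) & 1 for p in positions(window, m, k)):
            # The filter answered YES: confirm against D to rule out a false positive.
            if window in signatures:
                hits.append(i)
    return hits
```

For example, `scan(b"xxEVILyyBAD!zz", {b"EVIL", b"BAD!"}, z=4)` reports positions 2 and 8. Most windows of a clean file are rejected by the filter alone, so the expensive check against D runs only on the rare YES answers.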

Page 16: Advanced Algorithms for Massive Datasets

Upper bounds

Page 17: Advanced Algorithms for Massive Datasets

Upper bounds

Page 18: Advanced Algorithms for Massive Datasets

Recurring minimum for improving the estimate

+ 2 SBF