# Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)


• Slide 1
• Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)
• Slide 2
• A scenario
  • Incoming IPs: 131.107.65.14, 18.9.22.69, 80.97.56.20, ...
  • Frequency table (IP: frequency): 131.107.65.14: 3; 18.9.22.69: 2; 80.97.56.20: 2; 128.112.128.81: 9; 127.0.0.1: 8; 257.2.5.7: 0; 7.8.20.13: 1
  • Challenge: compute something on the table, using small space.
  • Examples of "something": number of distinct IPs, max frequency, other statistics.
• Slide 3
• Sublinear: a panacea?
  • Sub-linear space algorithm for solving the Travelling Salesperson Problem? Sorry, perhaps a different lecture.
  • Even very simple problems are hard to solve sublinearly. Ex: what is the count of distinct IPs seen?
  • Will settle for approximate algorithms: $(1+\epsilon)$-approximation, i.e. true answer $\le$ output $\le (1+\epsilon) \cdot$ (true answer)
  • Randomized: the above holds with probability 95%
  • A quick and dirty way to get a sense of the data
  • Frequency table (IP: frequency): 131.107.65.14: 3; 18.9.22.69: 2; 80.97.56.20: 2; 128.112.128.81: 9; 127.0.0.1: 8; 257.2.5.7: 0; 8.3.20.12: 1
• Slide 4
• Streaming data
  • Data through a router
  • Data stored on a hard drive, or streamed remotely
  • More efficient to do a linear scan on a hard drive
  • Working memory is the (smaller) main memory
• Slide 5
• Application areas
  • Data can come from:
  • Network logs, sensor data
  • Real-time data
  • Search queries, served ads
  • Databases (query planning)
• Slide 6
• Problem 1: # distinct elements
  • Stream: 2, 5, 7, 5, 5
  • Frequency table (i: frequency): 2: 1; 5: 3; 7: 1
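For comparison with the sketches on the following slides, here is a minimal exact baseline in Python. It builds the full frequency table for the slide's example stream, so its memory grows with the number of distinct elements, which is what the streaming algorithms try to avoid. The function name `exact_frequencies` is an illustrative choice, not something from the slides.

```python
from collections import Counter


def exact_frequencies(stream):
    """Exact frequency table: one counter per distinct element (linear space)."""
    return Counter(stream)


stream = [2, 5, 7, 5, 5]           # example stream from the slide
freq = exact_frequencies(stream)

print(dict(freq))                  # {2: 1, 5: 3, 7: 1}
print(len(freq))                   # 3 distinct elements
```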
• Slide 7
• Distinct Elements: idea 1 [Flajolet-Martin85, Alon-Matias-Szegedy96]
  • Algorithm DISTINCT:
    • Initialize: minHash = 1; hash function h into [0,1]
    • Process (int i): if (h(i) < minHash) minHash = h(i);
    • Output: 1/minHash - 1
  • Example elements: 2, 7, 5
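Algorithm DISTINCT above can be sketched in Python as follows. The intuition is that the minimum of n independent uniform values on [0,1] has expectation 1/(n+1), so 1/minHash - 1 is a rough estimate of the number of distinct elements n. Several things here are assumptions rather than part of the slides: the hash `h` is simulated with SHA-256 and a hypothetical `seed` parameter, the class name `Distinct` is illustrative, and the slide's `h(index)` is read as `h(i)`. A single hash function gives a high-variance estimate, so this is only an illustration of the mechanics.

```python
import hashlib


def h(i, seed=0):
    """Hypothetical hash into (0, 1]: SHA-256 of (seed, i), scaled to the unit interval."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


class Distinct:
    """Keeps only the smallest hash value seen so far."""

    def __init__(self, seed=0):
        self.seed = seed
        self.min_hash = 1.0

    def process(self, i):
        x = h(i, self.seed)
        if x < self.min_hash:
            self.min_hash = x

    def output(self):
        # Rough estimate of the number of distinct elements.
        return 1.0 / self.min_hash - 1.0


with_repeats = Distinct()
for item in [2, 5, 7, 5, 5]:
    with_repeats.process(item)

no_repeats = Distinct()
for item in [2, 5, 7]:
    no_repeats.process(item)

# Repeated elements hash to the same value, so they never change the state.
print(with_repeats.output() == no_repeats.output())
```

Because the state depends only on the set of elements seen, duplicates in the stream do not affect the estimate.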
• Slide 8
• Distinct Elements: idea 2
  • ZEROS(x): presumably the number of zeros after the binary point before the first 1; for x = 0.0000001100101 (in binary) that is 6
  • Algorithm DISTINCT (idea 1):
    • Initialize: minHash = 1; hash function h into [0,1]
    • Process (int i): if (h(i) < minHash) minHash = h(i);
    • Output: 1/minHash - 1
  • Algorithm DISTINCT (idea 2):
    • Initialize: minHash2 = 0; hash function h into [0,1]
    • Process (int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(i));
    • Output: 2^minHash2
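The second variant can be sketched the same way. Instead of the minimum hash value itself, it keeps only a small integer, the largest number of leading binary zeros seen, and reports a power of two. As before, the SHA-256-based `h`, its `seed` parameter, and the class name `DistinctZeros` are assumptions made for illustration; `zeros` implements the leading-zeros reading of ZEROS(x), and the slide's `h(index)` is read as `h(i)`.

```python
import hashlib
import math


def h(i, seed=0):
    """Hypothetical hash into (0, 1]: SHA-256 of (seed, i), scaled to the unit interval."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def zeros(x):
    """Number of zeros after the binary point before the first 1, for 0 < x < 1."""
    _, exponent = math.frexp(x)    # x = m * 2**exponent with 0.5 <= m < 1
    return -exponent


class DistinctZeros:
    """Keeps only an integer: the most leading zeros seen in any hash value."""

    def __init__(self, seed=0):
        self.seed = seed
        self.min_hash2 = 0

    def process(self, i):
        x = h(i, self.seed)
        # x < 2**-k holds exactly when x has at least k leading zeros.
        if x < 1.0 / 2**self.min_hash2:
            self.min_hash2 = zeros(x)

    def output(self):
        return 2**self.min_hash2


# The slide's example value, written out in binary.
x = int("0000001100101", 2) / 2**13
print(zeros(x))                    # 6

sketch = DistinctZeros()
for item in [2, 5, 7, 5, 5]:
    sketch.process(item)
print(sketch.output())             # a power of two
```

The stored state is just the exponent, so it needs far fewer bits than a full hash value, at the cost of an estimate that is only accurate up to a constant factor.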
• Slide 9
• Problem 2: max count
  • Problem: compute the maximum frequency of an element in the stream
  • Bad news: hard to distinguish whether an element repeated (max = 1 vs 2)
  • Good news: can find heavy hitters, i.e. elements with frequency > (total frequency) / s, using space proportional to s
  • Stream: 2, 5, 7, 5, 5
  • Frequency table (i: frequency): 2: 1; 5: 3; 7: 1
• Slide 10
• Heavy Hitters: CountMin
  • Stream: 2, 5, 7, 5, 5
  • Algorithm CountMin:
    • Initialize (r, L): array Sketch[L][w]; L hash functions h[L], into {0, ..., w-1}
    • Process (int i): for(j=0; j
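The CountMin pseudocode is cut off in this transcript at the start of the update loop, so the following Python sketch is based on the standard CountMin construction rather than on the slide itself. Each element increments one cell in each of the L rows, and its frequency is estimated by the minimum over those L cells. The SHA-256-based row hashes, the `estimate` method, and the parameter values in the usage lines are assumptions for illustration.

```python
import hashlib


class CountMin:
    """L rows of w counters; row j uses its own hash function into {0, ..., w-1}."""

    def __init__(self, w, L):
        self.w = w
        self.L = L
        self.sketch = [[0] * w for _ in range(L)]

    def _h(self, j, i):
        # Hypothetical j-th hash function, derived from SHA-256 of (j, i).
        digest = hashlib.sha256(f"{j}:{i}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def process(self, i):
        # Increment one counter per row.
        for j in range(self.L):
            self.sketch[j][self._h(j, i)] += 1

    def estimate(self, i):
        # Collisions can only add to a counter, so the minimum over rows
        # never underestimates the true frequency.
        return min(self.sketch[j][self._h(j, i)] for j in range(self.L))


cm = CountMin(w=16, L=3)
stream = [2, 5, 7, 5, 5]           # example stream from the slides
for item in stream:
    cm.process(item)

for item in (2, 5, 7):
    print(item, cm.estimate(item))
```

The estimates are upper bounds on the true frequencies (1, 3 and 1 for this stream); how tight they are depends on w and L and on how the hash functions happen to collide.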
