Data Mining Distributed Streams Edo Liberty Principal Scientist Amazon Web Services

Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet



Data Mining Distributed Streams

Edo Liberty, Principal Scientist, Amazon Web Services

Single machine data processing

[Diagram: The World, Data, Computation, Result]

Distributed storage

[Diagram: The World, Data (×4), Computation, Result]

Distributed compute (map/reduce, MPI, …)

[Diagram: The World, Data+Compute (×8), Computation, Result]

Distributed model (indexes, databases, Spark, …)

[Diagram: The World, Data+Compute (×8), Computation, Query, Result]

207 big-data infographics (a meta infographic)

The streaming model

[Diagram: The World, Compute, Sketch, Query Algorithm, Query, Result]

The distributed streaming model

[Diagram: The World, Compute+Sketch (×4), Merge+Sketch, Query Algorithm, Query, Result]

The streaming model (more accurately)

• O(n) items
• O(polylog(n)) space
• O(polylog(n)) computation per item

Example stream: 1 7 8 1 0 1 7 7

[Diagram: Iterator, Computation, Sketch, Result]

Communication complexity

[Diagram: two Iterators over the stream 1 7 8 1 0 1 7 7, Sketch, Result]

What can we do in this model?

Items (words, IP-addresses, events, clicks, ...)
• Item frequencies
• Approximate quantiles
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
• Sampling

Vectors (text documents, images, example features, ...)
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)

Matrices (text corpora, recommendations, ...)
• Covariance matrix estimation
• Low rank approximation
• Sparsification

Graphs* (social networks, communications, ...)
• Connectivity
• Cut sparsification
• Weighted matching


Frequency Counting

Misra, Gries. Finding repeated elements, 1982.

Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.

Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.

The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.

Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.

Charikar, Chen, Farach-Colton. Finding frequent items in data streams, 2002.

Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.

Problem Definition

A stream of $n$ items. $f(x)$ is the number of times item $x$ appears, e.g. $f(x) = 5$ for the pictured item.

Goal: an estimate $f'$ with $|f' - f| < \varepsilon n$

Can we do better than sampling?

Sample $\ell = O(1/\varepsilon^2)$ items. If the pictured item appears 3 times in the sample, then $f'(x) = 3 \cdot n/\ell$.
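The sampling baseline can be sketched as follows. This is only an illustration: the function name `sample_estimate` and the choice to sample with replacement are assumptions, not taken from the slides.

```python
import random


def sample_estimate(stream, x, ell, rng):
    """Estimate f(x) from ell uniform samples (with replacement, by assumption)."""
    n = len(stream)
    sample = [stream[rng.randrange(n)] for _ in range(ell)]
    # Scale the number of hits in the sample back up to the stream length.
    return sample.count(x) * n / ell
```

An item seen 3 times in a sample of size `ell` gets the estimate `3 * n / ell`, as on the slide.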

[Animation: the counting algorithm running with parameter $\ell$; in the final frame one pictured item has $f' = 0$ and another has $f' = 2$.]

Analysis

Assume we delete $t$ times.

First fact: $f'(x) \le f(x)$

Second fact: $f'(x) \ge f(x) - t$

Therefore: $|f'(x) - f(x)| \le t$

Analysis

We delete $\ell$ different items every time!

Third fact: $t \le n/\ell$

We get that: $|f'(x) - f(x)| < \varepsilon n$

When: $\ell = 1/\varepsilon$ (much better than sampling!) ∎
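A minimal sketch of a counter-based algorithm in the style of Misra and Gries, written so that the three facts of the analysis can be checked directly. The deletion rule used here (decrement every counter as soon as $\ell$ distinct items are held) is an assumption chosen to match "we delete $\ell$ different items every time"; the function name and return values are hypothetical.

```python
def misra_gries(stream, ell):
    """Return (counters, t): approximate counts f' and the number of deletions t."""
    counters = {}
    deletions = 0
    for x in stream:
        counters[x] = counters.get(x, 0) + 1
        if len(counters) >= ell:
            # Delete one copy of each of the ell different items currently held.
            deletions += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, deletions
```

On the example stream 1 7 8 1 0 1 7 7 with `ell = 3`, this deletes twice and ends with a count of 1 for each of the items 1 and 7, whose exact counts are 3.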

Analysis

Items' exact probability: $p(x) = f(x)/n$

Approximate probability: $p'(x) = f'(x)/n$

We get: $|p'(x) - p(x)| \le 1/\ell$

If $\ell = 10{,}000$ we get only a 0.01% error in our estimations.

We would need 10 billion samples to get the same accuracy!

Email threads

A simple email thread (that's not very hard to do…)

Threading Machine Generated Email

Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.

Streaming quantiles

Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.

Munro, Paterson. Selection and sorting with limited storage.

Greenwald, Khanna. Space-efficient online computation of quantile summaries.

Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.

Greenwald, Khanna. Quantiles and equidepth histograms over streams.

Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.

Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.

Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.

Problem Definition

A stream of $n$ items. $R(x)$ is the rank of $x$, on an axis marked $0$, $n/2$, $n$; e.g. $R(x) = 0.6 \cdot n$ for the pictured item.

Goal: an estimate $R'$ with $|R' - R| < \varepsilon n$

Sampling $O(1/\varepsilon^2)$ values gives this. Can we do better?

The basic buffer idea

Buffer of size $k$

• Stores $k$ stream entries: 1 0 3 5 4 7
• The buffer sorts $k$ stream entries: 0 1 3 4 5 7
• Deletes every other item
• And outputs the rest with double the weight: 0 3 5

Sorted buffer: 0 1 3 4 5 7. Either half of the items can be kept with double the weight.

• If $R(x) = 2$: $R'(x) = 2$ whichever half is kept.
• If $R(x) = 5$: $R'(x) = 4$ or $R'(x) = 6$, depending on which half is kept.

Repeat $n/k$ times until the end of the stream.

$|R'(x) - R(x)| < n/k$

[Diagram: stream entries 1 0 3 5 5 feeding the buffer; rank axis marked $0$, $n/2$, $n$]
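The buffer steps can be sketched in a few lines. The names `compact`, `weighted_rank` and `sketch_stream` are hypothetical, and the sketch assumes an even buffer size $k$ that divides the stream length.

```python
def compact(buffer):
    """Sort the buffer, delete every other item, keep the rest with weight 2."""
    ordered = sorted(buffer)
    return [(value, 2) for value in ordered[::2]]


def weighted_rank(weighted_items, x):
    """Approximate rank R'(x): total weight of the kept items smaller than x."""
    return sum(weight for value, weight in weighted_items if value < x)


def sketch_stream(stream, k):
    """Compact the stream one buffer of k entries at a time (n/k compactions)."""
    kept = []
    for start in range(0, len(stream), k):
        kept.extend(compact(stream[start:start + k]))
    return kept
```

For the buffer 1 0 3 5 4 7 this keeps 0, 3 and 5 with weight 2. Each compaction shifts any rank by at most 1, so after $n/k$ compactions the error is at most $n/k$.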

Manku-Rajagopalan-Lindsay (MRL) sketch

Buffers of size $k$, arranged in $\log_2(n)$ levels, for a stream of $n$ items.

$|R'(x) - R(x)| \le n \log_2(n)/k$

Manku-Rajagopalan-Lindsay (MRL) sketch

If we set $k = \log_2(n)/\varepsilon$

We get $|R'(x) - R(x)| \le \varepsilon n$

And we maintain only $\log_2^2(n)/\varepsilon$ items from the stream!
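A simplified hierarchy of buffers in the spirit of the MRL sketch might look like the following. It is a sketch under assumptions (even $k$, deterministic compaction, hypothetical class name `CompactorHierarchy`) and not the published algorithm.

```python
class CompactorHierarchy:
    """Stack of buffers of size k; items at level h carry weight 2**h."""

    def __init__(self, k):
        self.k = k          # buffer size, assumed even
        self.levels = [[]]  # levels[h] is the buffer at height h

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        # A full buffer is sorted and every other item moves up one level.
        while len(self.levels[h]) >= self.k:
            if h + 1 == len(self.levels):
                self.levels.append([])
            ordered = sorted(self.levels[h])
            self.levels[h + 1].extend(ordered[::2])
            self.levels[h] = []
            h += 1

    def rank(self, x):
        """Approximate rank: weighted count of the stored items smaller than x."""
        return sum((2 ** h) * sum(1 for v in buf if v < x)
                   for h, buf in enumerate(self.levels))

    def size(self):
        return sum(len(buf) for buf in self.levels)
```

Each level contributes at most $n/k$ rank error and there are about $\log_2(n)$ levels, which is where the bound $n \log_2(n)/k$ comes from.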

Greenwald-Khanna (GK) sketch

It gets $|R'(x) - R(x)| \le \varepsilon n$

And maintains only $O(\log(n)/\varepsilon)$ items from the stream!

Uses a completely different construction

Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)

Buffers of size $k$, arranged in $\log(1/\varepsilon)$ levels; start sampling after $O(1/\varepsilon^2)$ items.

Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.

Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)

$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$

Example: buffer 5 7 with $x$ between them, so $R(x) = 1$. Keeping 5 gives $R'(x) = 2$; keeping 7 gives $R'(x) = 0$.

Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.
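The randomized compaction can be illustrated by choosing at random which half of the sorted buffer survives. The function names below are hypothetical, and the fair coin is an assumption consistent with $E[R'(x)] = R(x)$.

```python
import random


def compact_with_offset(buffer, offset):
    """Keep positions offset, offset + 2, ... of the sorted buffer, weight 2 each."""
    ordered = sorted(buffer)
    return [(value, 2) for value in ordered[offset::2]]


def random_compact(buffer, rng):
    """Pick the surviving half with a fair coin, so the rank estimate is unbiased."""
    return compact_with_offset(buffer, rng.randrange(2))


def weighted_rank(weighted_items, x):
    """Approximate rank: total weight of the kept items smaller than x."""
    return sum(weight for value, weight in weighted_items if value < x)
```

For the buffer 5 7 and an $x$ between them, offset 0 keeps 5 and gives $R'(x) = 2$, offset 1 keeps 7 and gives $R'(x) = 0$; the average is the true rank 1.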

Lang, Karnin, Liberty (1)

Exponentially shrinking buffers

Reduces space usage to $\sqrt{\log(1/\varepsilon)}/\varepsilon$ items from the stream.
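One way to picture exponentially shrinking buffers is a geometric capacity schedule. The shrink factor `c = 2/3`, the floor of 2 items per buffer, and the function name are all assumptions made for illustration; the slides do not state them.

```python
import math


def level_capacities(k, num_levels, c=2 / 3):
    """Capacity per level: k at the top, shrinking by a factor c per level below."""
    top = num_levels - 1
    return [max(2, math.ceil(k * c ** (top - h))) for h in range(num_levels)]
```

With `k = 200` and 5 levels the capacities are 40, 60, 89, 134 and 200, so the total stays within a constant multiple of `k` instead of growing with the number of levels.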

Lang, Karnin, Liberty (2)

Exponentially decreasing buffer sizes, combined with a GK sketch

Reduces space usage to $\log\log(1/\varepsilon)/\varepsilon$ items from the stream.

Some experimental results

[Figure: two plots titled "Lazy KLL versus (Sketch Library and Two Variants)". x-axis of both: Number of Items in Randomly Permuted Stream (100 to 1e+06). y-axes: Error (0 to 0.01) and Space Used For Storing Samples (0 to 4000). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]

Count Distinct (Demo Only)

Assume you need to estimate the number of unique numbers in a file.

>> head data.csv
0
1
0
3
0
2
3
7
3
2

In this one, row i takes a value from [0, i] uniformly at random.
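A file with this distribution could be produced along the following lines. The function name `write_data` and the fixed seed are assumptions; the slides do not show how data.csv was generated.

```python
import random


def write_data(path, num_rows, seed=0):
    """Write num_rows lines; row i holds an integer drawn uniformly from [0, i]."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for i in range(num_rows):
            # randint is inclusive on both ends, matching the interval [0, i].
            f.write(f"{rng.randint(0, i)}\n")
```

Calling `write_data("data.csv", 10_000_000)` would give a file of the size used in the demo.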

Some stats: there are 10,000,000 such numbers in this ~76 MB file.

>> time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s

Reading the file takes ~1/10 of a second. We don't foresee IO being an issue.

To count the number of distinct items you might try this:

>> sort data.csv | uniq | wc -l

However, it is faster to have sort "uniqify" while sorting.

>> sort data.csv -u | wc -l

>> time sort data.csv -u | wc -l
5001233

real 2m37.071s
user 2m36.587s
sys 0m0.376s

Still, most of the time is spent on comparing strings....

>> sort data.csv -u -n -S100% | wc -l

>> time sort data.csv -u -n | wc -l
5001233

real 0m11.809s
user 0m11.587s
sys 0m0.228s

This is much better!

This is the way to do this with the sketching library:

>> sketch uniq data.csv

>> time sketch uniq data.csv
Estimate    : 4974249
Upper Bound : 5116569
Lower Bound : 4835874

real 0m1.527s
user 0m1.506s
sys 0m0.152s

Too fast to use the system monitor UI...

It uses ~32k of memory!
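To give a feel for how a small-memory distinct counter can work, here is a k-minimum-values (KMV) estimator. It is a well-known technique swapped in for illustration; nothing here says it is the algorithm inside the sketching library, and the names `hash01` and `kmv_estimate` are hypothetical.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    members = set()  # hash values currently kept
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    return (k - 1) / -heap[0]
```

Memory stays at `k` hash values no matter how long the stream is, and the relative error shrinks roughly like $1/\sqrt{k}$.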

Thank you!