
Data Mining Distributed Streams

Edo Liberty, Principal Scientist, Amazon Web Services

Single machine data processing
[Diagram labels: The World, Data, Computation, Result]

Distributed storage
[Diagram labels: The World, Data (×4), Computation, Result]

Distributed compute (map/reduce, MPI, …)
[Diagram labels: The World, Data + Compute (×8), Computation, Result]

Distributed model (indexes, databases, Spark…)
[Diagram labels: The World, Data + Compute (×8), Computation, Query, Result]

207 big-data infographics (a meta infographic)

The streaming model
[Diagram labels: The World, Compute, Sketch, Query Algorithm, Query, Result]

The distributed streaming model
[Diagram labels: The World, Compute + Sketch (×4), Merge + Sketch, Query Algorithm, Query, Result]

The streaming model (more accurately)
[Diagram labels: Iterator over the items 1 7 8 1 0 1 7 7, Computation, Sketch, Result]
• O(n) items
• O(polylog(n)) space
• O(polylog(n)) computation per item

Communication complexity
[Diagram labels: two Iterators over the items 1 7 8 1 0 1 7 7, Sketch, Result]

What can we do in this model?

Items (words, IP addresses, events, clicks, ...)
• Item frequencies
• Approximate quantiles
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
• Sampling

Vectors (text documents, images, example features, ...)
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)

Matrices (text corpora, recommendations, ...)
• Covariance matrix estimation
• Low rank approximation
• Sparsification

Graphs* (social networks, communications, ...)
• Connectivity
• Cut sparsification
• Weighted matching


Frequency Counting

Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
Charikar, Chen, Farach-Colton. Finding frequent items in data streams, 2002.
Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.

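Problem Definition

A stream of $n$ items; $f(x)$ is the number of times item $x$ appears (in the illustrated example, $f(x) = 5$).
Goal: an estimate $f'(x)$ with $|f'(x) - f(x)| < \varepsilon n$.

Sampling $\ell = O(1/\varepsilon^2)$ items: an item that appears 3 times in the sample gets the estimate $f'(x) = 3 \cdot n/\ell$.
Can we do better than sampling?

[Illustration: a summary of size $\ell$ maintained over the stream, with example estimates $f'(x) = 0$ and $f'(x) = 2$.]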

Analysis

Assume we delete $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Therefore: $|f'(x) - f(x)| \le t$

Analysis

We delete $\ell$ different items every time!
Third fact: $t \le n/\ell$
When $\ell = 1/\varepsilon$ we get that $|f'(x) - f(x)| < \varepsilon n$ (much better than sampling!)
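The counting scheme analyzed above can be sketched in Python in the style of Misra and Gries. This is a minimal illustration, assuming the summary holds at most $\ell$ counters and that a deletion step decrements every counter and discards the arriving item. The names `misra_gries` and `estimate` are illustrative and do not come from the slides.

```python
def misra_gries(stream, ell):
    """Summarize `stream` using at most `ell` counters.

    Returns the counters and t, the number of deletion steps performed.
    """
    counters = {}
    deletions = 0  # "t" in the analysis
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < ell:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop x itself.
            deletions += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, deletions


def estimate(counters, x):
    """Estimated frequency f'(x); never larger than the true f(x)."""
    return counters.get(x, 0)
```

Each deletion step removes one occurrence of $\ell$ distinct tracked items plus the arriving item, which is why the number of deletions cannot exceed $n/\ell$.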

Analysis

Items' exact probability: $p(x) = f(x)/n$
Approximate probability: $p'(x) = f'(x)/n$
We get: $|p'(x) - p(x)| \le 1/\ell$

If $\ell = 10{,}000$ we get only a 0.01% error in our estimations.
We would need 10 billion samples to get the same accuracy!

Email threads

A simple email thread (that's not very hard to do…)

Threading Machine Generated Email

Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.


Streaming quantiles

Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
Munro, Paterson. Selection and sorting with limited storage.
Greenwald, Khanna. Space-efficient online computation of quantile summaries.
Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
Greenwald, Khanna. Quantiles and equidepth histograms over streams.
Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.
Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.

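Problem Definition

A stream of $n$ items; $R(x)$ is the rank of $x$, i.e. how many stream items are smaller than $x$ (in the illustrated example, $R(x) = 0.6 \cdot n$ on an axis running from $0$ through $n/2$ to $n$).
Goal: an estimate $R'(x)$ with $|R'(x) - R(x)| < \varepsilon n$.
Sampling $O(1/\varepsilon^2)$ values gives this. Can we do better?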

The basic buffer idea

Buffer of size $k$ (example stream: 1 0 3 5 4 7)
• Stores $k$ stream entries
• The buffer sorts the $k$ stream entries (0 1 3 4 5 7)
• Deletes every other item
• And outputs the rest with double the weight (0 3 5)

The basic buffer idea

Sorted buffer: 0 1 3 4 5 7. The output is either 0 3 5 or 1 4 7, each item with weight 2.
• For a query $x$ with $R(x) = 2$: $R'(x) = 2$ in both cases.
• For a query $x$ with $R(x) = 5$: $R'(x) = 4$ or $R'(x) = 6$.

The basic buffer idea

Repeat $n/k$ times until the end of the stream.
$|R'(x) - R(x)| < n/k$
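The buffer steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming $k$ is even and divides the stream length, and taking the rank of $x$ to be the number of items strictly smaller than $x$. The function names and the `offset` parameter are illustrative.

```python
def compact(buffer, offset=0):
    """Sort the buffer and keep every other item (survivors carry weight 2)."""
    return sorted(buffer)[offset::2]


def rank(items, x, weight=1):
    """Weighted count of the items strictly smaller than x."""
    return weight * sum(1 for item in items if item < x)


def buffered_rank_estimate(stream, k, x):
    """Compact each block of k stream items on its own and add up the estimates."""
    total = 0
    for start in range(0, len(stream), k):
        block = stream[start:start + k]
        total += rank(compact(block), x, weight=2)
    return total
```

With the slide's buffer 1 0 3 5 4 7, `compact` returns 0 3 5. Each compaction changes the rank of any query by at most 1, so $n/k$ compactions change it by at most $n/k$ in total.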

Manku-Rajagopalan-Lindsay (MRL) sketch

$\log_2(n)$ buffers of size $k$
$|R'(x) - R(x)| \le n \log_2(n)/k$

Manku-Rajagopalan-Lindsay (MRL) sketch

If we set $k = \log_2(n)/\varepsilon$
We get $|R'(x) - R(x)| \le \varepsilon n$
And we maintain only $\log_2^2(n)/\varepsilon$ items from the stream!
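A stack of such buffers can be sketched as follows. This is a simplified illustration in the spirit of the MRL construction, not the published algorithm: it assumes an even buffer size $k$, lets a full buffer at level $h$ pass its survivors to level $h + 1$, and gives items at level $h$ the weight $2^h$. The class and method names are hypothetical.

```python
class CompactorSketch:
    """Stack of buffers of capacity k; items at level h have weight 2**h."""

    def __init__(self, k):
        if k < 2 or k % 2 != 0:
            raise ValueError("k must be an even integer >= 2")
        self.k = k
        self.levels = [[]]

    def update(self, item):
        """Insert one stream item, compacting any buffer that fills up."""
        self.levels[0].append(item)
        h = 0
        while len(self.levels[h]) >= self.k:
            if h + 1 == len(self.levels):
                self.levels.append([])
            # Sort, keep every other item, and promote the survivors.
            survivors = sorted(self.levels[h])[::2]
            self.levels[h] = []
            self.levels[h + 1].extend(survivors)
            h += 1

    def rank(self, x):
        """Estimated number of stream items strictly smaller than x."""
        return sum((2 ** h) * sum(1 for item in level if item < x)
                   for h, level in enumerate(self.levels))

    def size(self):
        """Number of stream items currently stored."""
        return sum(len(level) for level in self.levels)
```

Every compaction halves the item count and doubles the weight, so the total weight stays equal to the number of items inserted, and only about $\log_2(n/k)$ levels are ever created.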

Greenwald-Khanna (GK) sketch

It gets $|R'(x) - R(x)| \le \varepsilon n$
And maintains only $O(\log(n)/\varepsilon)$ items from the stream!
Uses a completely different construction

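Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)

• $\log(1/\varepsilon)$ buffers of size $k$
• Start sampling after $O(1/\varepsilon^2)$ items
• Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.

Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)

• $R'(x)$ is a random variable now and $E[R'(x)] = R(x)$
• Example: for the buffer 5 7 and a query $x$ with $R(x) = 1$, compaction yields $R'(x) = 2$ or $R'(x) = 0$.
• Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.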
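The unbiasedness claim $E[R'(x)] = R(x)$ can be checked on a small buffer by enumerating both outcomes of a randomized compaction. The sketch below assumes an even-sized buffer in which the even-indexed or the odd-indexed items survive with probability 1/2 each. The function names are illustrative.

```python
from fractions import Fraction


def compaction_outcomes(buffer):
    """Both equally likely results of compacting the sorted buffer."""
    ordered = sorted(buffer)
    return [ordered[0::2], ordered[1::2]]


def expected_rank(buffer, x):
    """Return (E[R'(x)], the two possible values of R'(x))."""
    estimates = [2 * sum(1 for item in kept if item < x)
                 for kept in compaction_outcomes(buffer)]
    return Fraction(sum(estimates), len(estimates)), estimates
```

For the buffer 5 7 and a query between the two items, the two outcomes give estimates 2 and 0, whose average is the true rank 1.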

Lang, Karnin, Liberty (1)

• Exponentially shrinking buffers
• Reduces space usage to $\sqrt{\log(1/\varepsilon)}/\varepsilon$ items from the stream.

Lang, Karnin, Liberty (2)

• Exponentially decreasing buffer sizes, plus a GK sketch
• Reduces space usage to $\log\log(1/\varepsilon)/\varepsilon$ items from the stream.

Some experimental results

[Plot: Lazy KLL versus (Sketch Library and Two Variants). x-axis: Number of Items in Randomly Permuted Stream (100 to 1e+06). y-axis: Error (0 to 0.01). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]

[Plot: Lazy KLL versus (Sketch Library and Two Variants). x-axis: Number of Items in Randomly Permuted Stream (100 to 1e+06). y-axis: Space Used For Storing Samples (0 to 4000). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]

Count Distinct (Demo Only)

Assume you need to estimate the number of unique numbers in a file.

>> head data.csv
0
1
0
3
0
2
3
7
3
2

In this one, row i takes a value from [0, i] uniformly at random.

Some stats: there are 10,000,000 such numbers in this ~76 MB file.

>> time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s

Reading the file takes ~1/10 seconds. We don't foresee IO being an issue.
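A file with the property described above (row i holds a value drawn uniformly from [0, i]) could be generated along these lines. This is only a guess at how such data might be produced. The function name, the seed, and the choice of Python's `random` module are assumptions, not part of the demo.

```python
import random


def generate_rows(num_rows, seed=0):
    """Row i gets an integer drawn uniformly at random from [0, i]."""
    rng = random.Random(seed)
    # randint is inclusive on both ends, matching the closed interval [0, i].
    return [rng.randint(0, i) for i in range(num_rows)]


def write_rows(path, rows):
    """Write one value per line, as in the data.csv shown in the demo."""
    with open(path, "w") as handle:
        for value in rows:
            handle.write(f"{value}\n")
```

Calling `write_rows("data.csv", generate_rows(10_000_000))` would produce a file of the same shape as the one in the demo, though not with the same values.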

To count the number of distinct items you might try this:

>> sort data.csv | uniq | wc -l

However, it is faster to "uniqify" while sorting.

>> sort data.csv -u | wc -l

>> time sort data.csv -u | wc -l
5001233

real 2m37.071s
user 2m36.587s
sys 0m0.376s

Still, most of the time is spent on comparing strings....

>> sort data.csv -u -n -S100% | wc -l

>> time sort data.csv -u -n | wc -l
5001233

real 0m11.809s
user 0m11.587s
sys 0m0.228s

This is much better!

This is the way to do this with the sketching library:

>> sketch uniq data.csv

>> time sketch uniq data.csv
Estimate : 4974249
Upper Bound : 5116569
Lower Bound : 4835874

real 0m1.527s
user 0m1.506s
sys 0m0.152s

Too fast to use the system monitor UI...

It uses ~32k of memory!
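To give a feel for how a small summary can estimate a distinct count, here is a minimal k-minimum-values (KMV) sketch in Python. This is a generic textbook technique and is not claimed to be what the `sketch uniq` command runs internally. The SHA-1 based hash, the default `k = 1024`, and all names are assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(item):
    """Map an item to a pseudo-random value in [0, 1] (illustrative hash choice)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def kmv_estimate(stream, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is the largest retained hash
    members = set()  # hashes currently retained, used to skip duplicates
    for item in stream:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    # If the k-th smallest of d uniform hashes is v, then d is roughly (k - 1) / v.
    return (k - 1) / -heap[0]
```

The summary keeps only k hash values regardless of the stream length. A larger k trades memory for a tighter estimate.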

Thank you!
