Data Mining Distributed Streams
Edo Liberty, Principal Scientist, Amazon Web Services
Single machine data processing
[Diagram labels: The World, Data, Computation, Result]
Distributed storage
[Diagram labels: The World, four Data nodes, Computation, Result]
Distributed compute (map/reduce, MPI, …)
[Diagram labels: The World, eight Data+Compute nodes, Computation, Result]
Distributed model (indexes, databases, Spark…)
[Diagram labels: The World, eight Data+Compute nodes, Computation, Query, Result]
207 big-data infographics (a meta infographic)
The streaming model
[Diagram labels: The World, Compute, Sketch, Query, Algorithm, Result]
The distributed streaming model
[Diagram labels: The World, four Compute+Sketch nodes, Merge+Sketch, Query, Algorithm, Result]
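The Compute+Sketch and Merge+Sketch boxes can be illustrated with a toy summary. This is only a sketch under stated assumptions: the class `MinMaxCountSketch` and the helper `sketch_stream` are hypothetical names invented for illustration, and count/min/max stand in for a real sketch because they merge exactly.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class MinMaxCountSketch:
    """Toy mergeable summary (hypothetical): count, min and max of the items seen."""
    count: int = 0
    lo: float = float("inf")
    hi: float = float("-inf")

    def update(self, item):
        # Compute+Sketch: constant work and constant space per item.
        self.count += 1
        self.lo = min(self.lo, item)
        self.hi = max(self.hi, item)

    def merge(self, other):
        # Merge+Sketch: combine two summaries without revisiting the data.
        return MinMaxCountSketch(
            self.count + other.count,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )


def sketch_stream(stream):
    """Summarize one stream (or one partition of a distributed stream)."""
    sketch = MinMaxCountSketch()
    for item in stream:
        sketch.update(item)
    return sketch


# Three machines each see a part of the stream 1 7 8 1 0 1 7 7.
partitions = [[1, 7, 8], [1, 0], [1, 7, 7]]
merged = reduce(MinMaxCountSketch.merge, (sketch_stream(p) for p in partitions))
single = sketch_stream([x for p in partitions for x in p])
```

Merging the per-partition summaries gives the same answer as summarizing the whole stream on one machine, which is the property the distributed model relies on.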
The streaming model (more accurately)
• Items: $O(n)$
• Space: $O(\mathrm{polylog}(n))$
• Computation per item: $O(\mathrm{polylog}(n))$
[Diagram labels: Iterator over the stream 1 7 8 1 0 1 7 7, Computation, Sketch, Result]
Communication complexity
[Diagram labels: two Iterators over the stream 1 7 8 1 0 1 7 7, Sketch, Result]
What can we do in this model?
Items (words, IP-addresses, events, clicks, ...)
• Item frequencies
• Approximate quantiles
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
• Sampling
Vectors (text documents, images, example features, ...)
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)
Matrices (text corpora, recommendations, ...)
• Covariance matrix estimation
• Low rank approximation
• Sparsification
Graphs* (social networks, communications, ...)
• Connectivity
• Cut sparsification
• Weighted matching
Frequency Counting
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
Charikar, Chen, Farach-Colton. Finding frequent items in data streams, 2002.
Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.
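Problem Definition
[Figure: a stream of $n$ items; the highlighted item $x$ has frequency $f(x) = 5$]
Goal: an estimate $f'$ with $|f'(x) - f(x)| < \varepsilon n$
Can we do better than sampling?
Sampling $\ell = O(1/\varepsilon^2)$ items: an item that appears 3 times in the sample gets $f'(x) = 3 \cdot n/\ell$
[Figure: $\ell$ counters updated as the stream is read; example estimates $f'(x) = 0$ and $f'(x) = 2$]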
Analysis
Assume we delete $t$ times
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Therefore: $|f'(x) - f(x)| \le t$
We delete $\ell$ different items every time!
Third fact: $t \le n/\ell$
Analysis
We get that: $|f'(x) - f(x)| < \varepsilon n$
When: $\ell = 1/\varepsilon$ (much better than sampling!)
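The counter scheme analyzed above can be sketched as follows. This is a minimal version in the spirit of Misra and Gries, written so that every delete step removes exactly `ell` different items, as the analysis assumes. The function name `misra_gries` and the example stream are illustrative assumptions, not the author's code.

```python
import random
from collections import Counter


def misra_gries(stream, ell):
    """Return (approximate counts f', number of delete steps t).

    Fewer than `ell` counters survive each step (illustrative sketch).
    """
    counters = Counter()
    deletions = 0
    for x in stream:
        counters[x] += 1
        if len(counters) == ell:
            # Delete step: decrement ell different items by one each.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
            deletions += 1
    return counters, deletions


# Example stream (made up for illustration): one heavy item among lighter ones.
rng = random.Random(0)
stream = [i % 7 for i in range(1000)] + [3] * 500
rng.shuffle(stream)

ell = 5
n = len(stream)
exact = Counter(stream)
approx, t = misra_gries(stream, ell)
errors = {x: exact[x] - approx[x] for x in exact}
```

For every item the error `exact[x] - approx[x]` lies between 0 and `t`, and `t` is at most `n / ell`, matching the three facts.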
Analysis
Items' exact probability: $p(x) = f(x)/n$
Approximate probability: $p'(x) = f'(x)/n$
We get: $|p'(x) - p(x)| \le 1/\ell$
If $\ell = 10{,}000$ we get only a 0.01% error in our estimations.
We would need 10 billion samples to get the same accuracy!
Email threads
A simple email thread (that's not very hard to do…)
Threading Machine Generated Email
Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.
Streaming quantiles
Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
Munro, Paterson. Selection and sorting with limited storage.
Greenwald, Khanna. Space-efficient online computation of quantile summaries.
Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
Greenwald, Khanna. Quantiles and equidepth histograms over streams.
Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.
Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.
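Problem Definition
[Figure: a stream of $n$ items on a rank axis marked 0, n/2, n; the highlighted item $x$ has rank $R(x) = 0.6 \cdot n$]
Goal: an estimate $R'$ with $|R'(x) - R(x)| < \varepsilon n$
Sampling $O(1/\varepsilon^2)$ values gives this; can we do better?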
The basic buffer idea
Buffer of size $k$
Stores $k$ stream entries
The buffer sorts $k$ stream entries
Deletes every other item
And outputs the rest with double the weight
[Figure: the stream entries 1, 0, 3, 5, 4, 7 fill the buffer and are sorted to 0, 1, 3, 4, 5, 7; the output is 0, 3, 5]
[Figure: rank of a query $x$ before and after this step. For $R(x) = 2$ both possible outputs give $R'(x) = 2$; for $R(x) = 5$ they give $R'(x) = 4$ or $R'(x) = 6$]
Repeat $n/k$ times until the end of the stream
$|R'(x) - R(x)| < n/k$
[Figure: rank axis marked 0, n/2, n]
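The buffer step (sort, delete every other item, output the rest with double weight) can be sketched in a few lines. The function names `compact` and `rank` and the `offset` parameter are assumptions made for illustration; the data is the six-entry example from the slides.

```python
from bisect import bisect_right


def compact(buffer, offset=0):
    """Sort the buffer and keep every other item, starting at index `offset`.

    The surviving items are meant to carry double weight.
    """
    return sorted(buffer)[offset::2]


def rank(items, x, weight=1):
    """Weighted number of items that are <= x."""
    return weight * bisect_right(sorted(items), x)


buffer = [1, 0, 3, 5, 4, 7]
kept_even = compact(buffer, offset=0)   # keeps 0, 3, 5
kept_odd = compact(buffer, offset=1)    # keeps 1, 4, 7

# Rank of two queries before and after compaction.
true_rank_2 = rank(buffer, 2)                # R(x) = 2
true_rank_6 = rank(buffer, 6)                # R(x) = 5
est_rank_2 = [rank(kept_even, 2, weight=2), rank(kept_odd, 2, weight=2)]
est_rank_6 = [rank(kept_even, 6, weight=2), rank(kept_odd, 6, weight=2)]
```

One compaction changes any rank by at most one, so repeating it for each of the $n/k$ buffers keeps the total error below $n/k$.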
Manku-Rajagopalan-Lindsay (MRL) sketch
$\log_2(n)$ buffers of size $k$
$|R'(x) - R(x)| \le n \log_2(n)/k$
If we set $k = \log_2(n)/\varepsilon$
We get $|R'(x) - R(x)| \le \varepsilon n$
And we maintain only $\log_2^2(n)/\varepsilon$ items from the stream!
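A hierarchy of such buffers can be sketched as below. This is a simplified illustration in the spirit of the MRL construction, not the published algorithm: the class name `BufferHierarchy`, the choice to always keep the even-indexed items, and the assumption that `k` is even are all mine.

```python
import math
import random


class BufferHierarchy:
    """Stack of buffers of size k; items at level h carry weight 2**h (sketch)."""

    def __init__(self, k):
        assert k >= 2 and k % 2 == 0, "this sketch assumes an even buffer size"
        self.k = k
        self.levels = [[]]

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        # A full buffer is sorted, halved, and its survivors move one level up.
        while len(self.levels[h]) >= self.k:
            survivors = sorted(self.levels[h])[::2]
            self.levels[h] = []
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(survivors)
            h += 1

    def rank(self, x):
        """Estimated number of stream items that are <= x."""
        return sum(
            (2 ** h) * sum(1 for v in level if v <= x)
            for h, level in enumerate(self.levels)
        )

    def stored_items(self):
        return sum(len(level) for level in self.levels)


# Example: a randomly permuted stream of the numbers 0..n-1 (true rank of x is x + 1).
n, k = 10_000, 32
values = list(range(n))
random.Random(1).shuffle(values)

sketch = BufferHierarchy(k)
for v in values:
    sketch.update(v)

max_error = max(abs(sketch.rank(x) - (x + 1)) for x in range(0, n, 97))
bound = n * math.log2(n) / k
```

In this run the sketch stores far fewer than `n` items while `max_error` stays within the $n \log_2(n)/k$ bound stated above.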
Greenwald-Khanna (GK) sketch
It gets $|R'(x) - R(x)| \le \varepsilon n$
And maintains only $O(\log(n)/\varepsilon)$ items from the stream!
Uses a completely different construction
Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)
$\log(1/\varepsilon)$ buffers of size $k$
Start sampling after $O(1/\varepsilon^2)$ items
Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.
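Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)
$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$
[Figure: a buffer holding 5 and 7; a query $x$ with $R(x) = 1$ gets $R'(x) = 2$ or $R'(x) = 0$]
Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.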
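The unbiasedness claim can be checked on the two-item example. This is a minimal sketch assuming the random choice is a fair coin between keeping the even-indexed and the odd-indexed items of the sorted buffer; the function names are hypothetical.

```python
import random
from bisect import bisect_right


def compact(buffer, offset):
    """Keep every other item of the sorted buffer, starting at `offset` (0 or 1)."""
    return sorted(buffer)[offset::2]


def random_compact(buffer, rng=random):
    """Randomized compaction: pick the offset with a fair coin flip."""
    return compact(buffer, rng.randrange(2))


def rank(items, x, weight=1):
    """Weighted number of items that are <= x."""
    return weight * bisect_right(sorted(items), x)


# The example from the figure: buffer 5, 7 and a query x between them.
buffer = [5, 7]
x = 6
true_rank = rank(buffer, x)                                        # R(x) = 1
outcomes = [rank(compact(buffer, o), x, weight=2) for o in (0, 1)]  # 2 and 0
expected_rank = sum(outcomes) / 2                                   # E[R'(x)]
```

Averaging the two equally likely outcomes recovers the true rank, which is the sense in which $E[R'(x)] = R(x)$; the same holds for any even-sized buffer.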
Lang, Karnin, Liberty (1)
Exponentially shrinking buffers
Reduces space usage to $\sqrt{\log(1/\varepsilon)}/\varepsilon$ items from the stream.
Lang, Karnin, Liberty (2)
Exponentially decreasing buffer sizes
GK sketch
Reduces space usage to $\log\log(1/\varepsilon)/\varepsilon$ items from the stream.
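To see why exponentially decreasing buffer sizes save space, the capacities can be tabulated. The sketch below assumes a geometric shrink factor `c = 2/3` and a minimum capacity of 2; both values and the function name `buffer_capacities` are illustrative assumptions, not taken from the slides.

```python
import math


def buffer_capacities(k, num_levels, c=2 / 3):
    """Capacity of each level from bottom (index 0) to top.

    The top level holds k items and each level below it is smaller by a
    factor c (assumed value), never dropping under 2 items.
    """
    top = num_levels - 1
    return [max(2, math.ceil(k * c ** (top - h))) for h in range(num_levels)]


k, num_levels = 200, 20
capacities = buffer_capacities(k, num_levels)
shrinking_total = sum(capacities)    # grows like k / (1 - c), not k * num_levels
fixed_total = k * num_levels         # all buffers of size k, for comparison
```

With these assumed parameters the shrinking buffers need at most `3 * k + 2 * num_levels` slots in total, while equal-sized buffers need `k * num_levels`.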
Some experimental results
[Figure: Lazy KLL versus (Sketch Library and Two Variants). x-axis: Number of Items in Randomly Permuted Stream (100 to 1e+06); y-axis: Error (0 to 0.01). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL]
[Figure: Lazy KLL versus (Sketch Library and Two Variants). x-axis: Number of Items in Randomly Permuted Stream (100 to 1e+06); y-axis: Space Used For Storing Samples (0 to 4000). Series: Sketch Library, Variant 1, Variant 2, Lazy KLL]
Count Distinct (Demo Only)
Assume you need to estimate the number of unique numbers in a file.
>> head data.csv
0
1
0
3
0
2
3
7
3
2
In this one, row i takes a value from [0, i] uniformly at random.
Some stats: there are 10,000,000 such numbers in this ~76MB file.
>> time wc -lc data.csv
10000000 76046666 data.csv
real 0m0.101s
user 0m0.072s
sys 0m0.021s
Reading the file takes ~1/10 of a second. We don't foresee IO being an issue.
To count the number of distinct items you might try this:
>> sort data.csv | uniq | wc -l
However, it is faster to "uniqify" while sorting.
>> sort data.csv -u | wc -l
>> time sort data.csv -u | wc -l
5001233
real 2m37.071s
user 2m36.587s
sys 0m0.376s
Still, most of the time is spent on comparing strings....
>> sort data.csv -u -n -S 100% | wc -l
>> time sort data.csv -u -n | wc -l
5001233
real 0m11.809s
user 0m11.587s
sys 0m0.228s
This is much better!
This is the way to do this with the sketching library
>> sketch uniq data.csv
>> time sketch uniq data.csv
Estimate : 4974249
Upper Bound : 5116569
Lower Bound : 4835874
real 0m1.527s
user 0m1.506s
sys 0m0.152s
Too fast to use the system monitor UI...
It uses ~32k of memory!
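The slides do not say which algorithm the sketching library uses. As a stand-in, the idea of counting distinct items in small memory can be sketched with a k-minimum-values (KMV) estimator: hash every item to [0, 1), remember only the `k` smallest distinct hash values, and estimate the count from the k-th smallest. The function names, `k = 1024`, and the smaller synthetic file are assumptions for illustration.

```python
import hashlib
import heapq
import random


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) with a fixed hash."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64


def kmv_estimate(stream, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is the largest retained hash
    members = set()  # hashes currently retained
    for item in stream:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)          # fewer than k distinct items: exact count
    return (k - 1) / (-heap[0])   # k-th smallest hash sits near k / (distinct + 1)


# Synthetic data in the style of the demo: row i takes a value from [0, i].
rng = random.Random(0)
n = 100_000
rows = [rng.randint(0, i) for i in range(n)]

exact = len(set(rows))
estimate = kmv_estimate(rows)
relative_error = abs(estimate - exact) / exact
```

The estimator keeps only `k` numbers regardless of the file size, and its typical relative error shrinks roughly like $1/\sqrt{k}$.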
Thank you!