Data Mining Distributed Streams

Edo Liberty, Principal Scientist, Amazon Web Services

Single machine data processing
[Diagram labels: The World, Computation, Result]

Distributed storage
[Diagram labels: The World, Data (four stores), Computation, Result]

Distributed compute (map/reduce, MPI, …)
[Diagram labels: The World, Data + Compute nodes, Computation, Result]

Distributed model (indexes, databases, Spark, …)
[Diagram labels: The World, Data + Compute nodes, Computation, Query, Result]

207 big-data infographics (a meta infographic)

The streaming model
[Diagram labels: The World, Compute, Sketch, Query Algorithm, Query, Result]

The distributed streaming model
[Diagram labels: The World, Compute + Sketch, Merge + Sketch, Query Algorithm, Query, Result]

The streaming model (more accurately)
[Diagram labels: Iterator over the stream 1 7 8 1 0 1 7 7, Computation, Sketch, Result]
• O(n) items
• O(polylog(n)) space
• O(polylog(n)) computation per item

Communication complexity
[Diagram labels: two Iterators over the stream 1 7 8 1 0 1 7 7, Sketch, Result]

What can we do in this model?

Items (words, IP addresses, events, clicks, ...)
• Item frequencies
• Approximate quantiles
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
• Sampling

Vectors (text documents, images, example features, ...)
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)

Matrices (text corpora, recommendations, ...)
• Covariance matrix estimation
• Low rank approximation
• Sparsification

Graphs* (social networks, communications, ...)
• Connectivity
• Cut sparsification
• Weighted matching

Frequency Counting

Misra, Gries. Finding repeated elements, 1982.

Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.

Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.

The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.

Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.

Charikar, Chen, Farach-Colton. Finding frequent items in data streams, 2002.

Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.

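Problem Definition

[Illustration: a stream of items; the pictured item x has exact frequency $f(x) = 5$]

Goal: an estimate $f'(x)$ such that $|f'(x) - f(x)| < \varepsilon n$

Sampling: with $\ell = O(1/\varepsilon^2)$ samples, the pictured item gets the estimate $f'(x) = 3 \cdot n/\ell$

Can we do better than sampling?

[Illustration: estimates for two pictured items, $f'(x) = 0$ and $f'(x) = 2$]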

Analysis

Assume we delete $t$ times.

First fact: $f'(x) \le f(x)$

Second fact: $f'(x) \ge f(x) - t$

Therefore: $|f'(x) - f(x)| \le t$

We delete $\ell$ different items every time!

Third fact: $t \le n/\ell$

Analysis

We get that: $|f'(x) - f(x)| < \varepsilon n$

When: $\ell = 1/\varepsilon$ (much better than sampling!)
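The slides analyse the counter-based summary without spelling out its update rule. A minimal Python sketch of the standard Misra-Gries procedure, which is assumed here to be the algorithm under analysis, could look as follows. The function name `misra_gries` and the parameter `ell` (the number of counters, $\ell$ in the slides) are illustrative choices, not names from the talk.

```python
from collections import Counter

def misra_gries(stream, ell):
    """Approximate item counts f' using at most `ell` counters (illustrative)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < ell:
            counters[x] = 1
        else:
            # A "delete" step: discard the new item and one copy of
            # every tracked item, dropping counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# The example stream shown on the streaming-model slides.
stream = [1, 7, 8, 1, 0, 1, 7, 7]
n = len(stream)
ell = 2
approx = misra_gries(stream, ell)
exact = Counter(stream)

# Each estimate undercounts by at most the number of delete steps t <= n / ell.
errors = {x: exact[x] - approx.get(x, 0) for x in exact}
print(approx, errors)
```

In this variant every delete step removes $\ell + 1$ distinct items (the $\ell$ tracked ones plus the incoming one), so the number of delete steps is at most $n/(\ell + 1)$, which is within the $t \le n/\ell$ bound stated above.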

Analysis

Items' exact probability: $p(x) = f(x)/n$

Approximate probability: $p'(x) = f'(x)/n$

We get: $|p'(x) - p(x)| \le 1/\ell$

If $\ell = 10{,}000$ we get only a 0.01% error in our estimations.

We would need 10 billion samples to get the same accuracy!
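In the distributed streaming model, summaries built on separate machines have to be merged before a query is answered. The following is a minimal sketch of one way to merge two counter summaries, following the construction in the "Mergeable summaries" paper by Agarwal et al. that the deck cites in its quantiles section. The helper names `summarize` and `merge`, the parameter `ell`, and the example data are all hypothetical.

```python
from collections import Counter

def summarize(stream, ell):
    """Counter summary with at most `ell` counters (Misra-Gries style, illustrative)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < ell:
            counters[x] = 1
        else:
            # Delete step: decrement every counter and drop the zeros.
            counters = {y: c - 1 for y, c in counters.items() if c > 1}
    return counters

def merge(a, b, ell):
    """Merge two summaries into one that again has at most `ell` counters."""
    total = Counter(a)
    total.update(b)  # add counters of matching items
    if len(total) <= ell:
        return dict(total)
    # Subtract the (ell + 1)-th largest count and keep only positive counters.
    cut = sorted(total.values(), reverse=True)[ell]
    return {x: c - cut for x, c in total.items() if c > cut}

# Two machines each see half of a skewed stream (made-up data).
stream = [i % 3 for i in range(600)] + list(range(100, 400))
left, right = stream[::2], stream[1::2]
ell = 10
merged = merge(summarize(left, ell), summarize(right, ell), ell)

exact = Counter(stream)
n = len(stream)
worst = max(exact[x] - merged.get(x, 0) for x in exact)
print(len(merged), worst, n / ell)
```

The merged summary keeps the same kind of guarantee as a single-machine summary: estimates never exceed the true counts, and the undercount stays within $n/\ell$ for the combined stream length $n$.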

Email threads

A simple email thread (that's not very hard to do…)

Threading Machine Generated Email

Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013.

Streaming quantiles

Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.

Munro, Paterson. Selection and sorting with limited storage.

Greenwald, Khanna. Space-efficient online computation of quantile summaries.

Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.

Greenwald, Khanna. Quantiles and equi-depth histograms over streams.

Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.

Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.

Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.

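Problem Definition

[Illustration: a rank axis from 0 to n with n/2 marked; the pictured value x has rank $R(x) = 0.6 \cdot n$]

Goal: an estimated rank $R'(x)$ such that $|R'(x) - R(x)| < \varepsilon n$

Sampling $O(1/\varepsilon^2)$ values gives this guarantee. Can we do better?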

The basic buffer idea

[Illustration: a buffer of size k receiving the stream entries 1 0 3 5 4 7]

• Stores k stream entries
• The buffer sorts k stream entries
• Deletes every other item
• And outputs the rest with double the weight

[Illustration: the items 1 5 4 7 with example ranks: $R(x) = 2$ gives $R'(x) = 2$; $R(x) = 5$ gives $R'(x) = 4$ or $R'(x) = 6$]

The basic buffer idea

Repeat $n/k$ times until the end of the stream

$|R'(x) - R(x)| < n/k$
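The buffer steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rank of x is taken to be the number of stream items less than or equal to x, k is assumed to be even, and the compaction always keeps the items at odd positions of the sorted buffer. The names `compact`, `buffer_summary` and `estimated_rank` are hypothetical.

```python
import random

def compact(buffer):
    """Sort the buffer and delete every other item (keeps odd positions)."""
    return sorted(buffer)[1::2]

def buffer_summary(stream, k):
    """Compact every full buffer of k items; return (value, weight) pairs."""
    summary, buffer = [], []
    for x in stream:
        buffer.append(x)
        if len(buffer) == k:
            # Survivors are output with double the weight.
            summary.extend((v, 2) for v in compact(buffer))
            buffer = []
    # Items left in a partially filled buffer keep weight 1.
    summary.extend((v, 1) for v in buffer)
    return summary

def estimated_rank(summary, x):
    """Weighted number of summary items that are <= x."""
    return sum(w for v, w in summary if v <= x)

# The six entries from the slide, with k = 6.
small = buffer_summary([1, 0, 3, 5, 4, 7], k=6)

# A longer stream: a random permutation of 0..n-1, so the true rank of x is x + 1.
n, k = 1000, 10
stream = list(range(n))
random.Random(0).shuffle(stream)
summary = buffer_summary(stream, k)
max_error = max(abs(estimated_rank(summary, x) - (x + 1)) for x in range(n))
print(small, max_error, n / k)
```

Each compaction changes the estimated rank of any x by at most 1, and there are n/k compactions, which is where the n/k error bound comes from. Note that this basic scheme only halves the data; it does not yet give a small summary.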

Manku-Rajagopalan-Lindsay (MRL) sketch

[Illustration: $\log_2(n)$ buffers of size $k$]

$|R'(x) - R(x)| \le n \log_2(n)/k$

If we set $k = \log_2(n)/\varepsilon$

We get $|R'(x) - R(x)| \le \varepsilon n$

And we maintain only $\log_2^2(n)/\varepsilon$ items from the stream!
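A chain of buffers in the spirit of the MRL construction can be sketched as below. This is a simplified illustration, not the published algorithm: it assumes an even buffer size k, a deterministic compaction that keeps odd positions, and rank defined as the number of items less than or equal to x. The class name `BufferChain` and its methods are hypothetical.

```python
import math
import random

class BufferChain:
    """Levels of buffers; items at level h carry weight 2**h (illustrative)."""

    def __init__(self, k):
        assert k % 2 == 0, "this sketch assumes an even buffer size"
        self.k = k
        self.levels = [[]]

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        # A full buffer is compacted and its survivors move one level up.
        while len(self.levels[h]) >= self.k:
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(sorted(self.levels[h])[1::2])
            self.levels[h] = []
            h += 1

    def rank(self, x):
        """Weighted number of stored items that are <= x."""
        return sum((2 ** h) * sum(1 for v in level if v <= x)
                   for h, level in enumerate(self.levels))

    def size(self):
        return sum(len(level) for level in self.levels)

# Random permutation of 0..n-1, so the true rank of x is x + 1.
n, k = 4096, 32
stream = list(range(n))
random.Random(0).shuffle(stream)

sketch = BufferChain(k)
for x in stream:
    sketch.update(x)

max_error = max(abs(sketch.rank(x) - (x + 1)) for x in range(n))
print(sketch.size(), max_error, n * math.log2(n) / k)
```

A compaction at level h shifts any rank estimate by at most $2^h$, and level h is compacted at most $n/(k 2^h)$ times, so each level contributes at most $n/k$ error; with about $\log_2(n)$ levels this gives the $n \log_2(n)/k$ bound on the slide.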

Greenwald-Khanna (GK) sketch

Uses a completely different construction

It gets $|R'(x) - R(x)| \le \varepsilon n$

And maintains only $O(\log(n)/\varepsilon)$ items from the stream!

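Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)

[Illustration: $\log(1/\varepsilon)$ buffers of size $k$]

Start sampling after $O(1/\varepsilon^2)$ items

Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.

Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)

$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$

[Illustration: $R(x) = 1$ with $R'(x) = 2$ or $R'(x) = 0$]

Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.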
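The unbiasedness claim can be checked by hand on a single buffer. The sketch below assumes the randomization is a fair coin that decides whether the even or the odd positions of the sorted buffer survive, and that rank means the number of items less than or equal to x; the function names are illustrative.

```python
def compact(buffer, offset):
    """Keep every other item of the sorted buffer, starting at `offset` (0 or 1)."""
    return sorted(buffer)[offset::2]

def estimated_rank(kept, x):
    """Each surviving item counts with weight 2."""
    return 2 * sum(1 for v in kept if v <= x)

def true_rank(buffer, x):
    return sum(1 for v in buffer if v <= x)

buffer = [1, 0, 3, 5, 4, 7]

# Average the two equally likely outcomes to get the expectation exactly.
expected = {
    x: (estimated_rank(compact(buffer, 0), x)
        + estimated_rank(compact(buffer, 1), x)) / 2
    for x in buffer
}
print(expected)
```

For x = 0 the true rank is 1 while the two possible estimates are 2 and 0, matching the illustration above; their average is exactly 1.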

Lang, Karnin, Liberty (1)

Exponentially shrinking buffers

Reduces space usage to $\sqrt{\log(1/\varepsilon)}/\varepsilon$ items from the stream.

Lang, Karnin, Liberty (2)

Exponentially decreasing buffer sizes

[Illustration: buffers combined with a GK Sketch]

Reduces space usage to $\log\log(1/\varepsilon)/\varepsilon$ items from the stream.

Some experimental results

[Figure, two panels: Lazy KLL versus (Sketch Library and Two Variants). x-axis: Number of Items in Randomly Permuted Stream, from 100 to 1e+06. Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]

Count Distinct (Demo Only)

Assume you need to estimate the number of unique numbers in a file.

>> head data.csv
0
1
0
3
0
2
3
7
3
2

In this one, row i takes a value from [0, i] uniformly at random.

Some stats: there are 10,000,000 such numbers in this ~76 MB file.

>> time wc -lc data.csv
10000000 76046666 data.csv

real 0m0.101s
user 0m0.072s
sys 0m0.021s

Reading the file takes ~1/10 of a second. We don't foresee IO being an issue.

To count the number of distinct items you might try this:

>> sort data.csv | uniq | wc -l

However, it is faster to have sort "uniqify" while sorting.

>> sort data.csv -u | wc -l

>> time sort data.csv -u | wc -l
5001233

real 2m37.071s
user 2m36.587s
sys 0m0.376s

Still, most of the time is spent on comparing strings....

>> sort data.csv -u -n -S100% | wc -l

>> time sort data.csv -u -n | wc -l
5001233

real 0m11.809s
user 0m11.587s
sys 0m0.228s

This is much better!

This is the way to do this with the sketching library:

>> sketchuniq data.csv

>> time sketchuniq data.csv
Estimate : 4974249
Upper Bound: 5116569
Lower Bound: 4835874

real 0m1.527s
user 0m1.506s
sys 0m0.152s

Too fast to use the system monitor UI...

It uses ~32k of memory!
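The slides do not show how `sketchuniq` works internally. As a hedged illustration of how a small, fixed-size summary can estimate a distinct count, here is a K-minimum-values (KMV) sketch in Python. It is a swapped-in textbook technique, not necessarily what the sketching library uses, and the names `hash01` and `kmv_estimate`, the choice of SHA-1, and the parameter `k` are all assumptions.

```python
import hashlib
import heapq

def hash01(item):
    """Map an item to a pseudo-random float in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha1(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(stream, k=1024):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    members = set()  # hashes currently kept, used to skip duplicates
    for item in stream:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: the count is exact
    # The k-th smallest of d uniform values sits near k / d.
    return (k - 1) / (-heap[0])

# Made-up data: 200,000 rows holding 50,000 distinct values.
rows = (i % 50000 for i in range(200000))
estimate = kmv_estimate(rows, k=1024)
print(round(estimate))
```

The summary holds only k hash values no matter how long the stream is, and the relative error shrinks roughly like $1/\sqrt{k}$, which is the general trade-off behind estimates with upper and lower bounds like the ones printed by the demo.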

Thank you!
