21
Streaming Data Susan B. Davidson CIS 700: Advanced Topics in Databases MW 1:30-3 Towne 309 http://www.cis.upenn.edu/~susan/cis700/homepage.html

Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

Streaming Data

SusanB.DavidsonCIS700:AdvancedTopicsinDatabases

MW1:30-3

Towne309

http://www.cis.upenn.edu/~susan/cis700/homepage.html

Page 2: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Manymodernapplicationsrequirelong-running,continuousqueriesoverunboundedstreamsofdata• Networkmonitoring

• Financialanalysis• Manufacturing

• Sensornetworks

• Contrastthiswiththetraditionaldatabasesettingofone-timequeriesoverfinitestoreddatasets.

Streaming Data

2

Page 3: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Supposewearecollectingoceansurfacetemperature/surfaceheightdatausingfloatingsensors• Millionsofsensorseachsendingbackastreamattherateof10readingspersecond

• Thiscouldeasilybecomeseveralterabtyesofdataperday,cannotbekeptinworkingstorage.

• Samplecontinuous(“standing”)queries• Outputanalertwheneverthetemperatureexceeds25degreescentigrade

• Producetheaverageofthe24mostrecentreadings• Producethehighesttemperaturerecorded,oraveragetemperatureoverallrecordings

Example: oceanography

3

Page 4: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Websitesreceivesstreamsofvarioustypes• Googlereceivesseveralhundredmillionsearchqueriesperday

• Yahoo!acceptsbillionsof“clicks”perday• Manyinterestingthingscanbelearned

• Anincreaseinquerieslike“sorethroat”signalsthespreadofviruses.

• Asuddenincreaseintheclickrateforalinkcouldindicatesomenewsconnectedtothatpage,oritcouldmeanthatthelinkisbrokenandneedstoberepaired.

• Oneapproachtohandlead-hocqueriesistostoreaslidingwindowofeachstream• Allinputsinlastttimeunits

• Lastkinputs

Example: Web sites

4

Page 5: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

Data-stream management system architecture

5

From “Mining of Massive Datasets” By Leskovec, Rajaraman and Ullman.

Page 6: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Streamsdeliverelementsrapidly,andelementsmustbeprocessedinrealtime

• Algorithmsshouldbeexecutedinmainmemory

• Theremaybemanystreams

• Evenifeachstreamcaneasilybeexecutedinmainmemory,thecombinationmayexceedavailablememory

• Commontechniques:approximation,hashing

• Forcomplex,ad-hocqueries:storeaslidingwindowofeachstreamintheworkingstore.

• Mostrecentnelementsofastream,forsomen

• Alltheelementsthatarrivedwithinthelastttimeunits

Issues in stream processing

6

Page 7: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Databasemanagementsystemsperspective• StanfordSTREAMproject

• NiagaraCQ(WisconsinandOGI)

• Algorithmicperspective• Approximationandhashingtechniquesarecommonlyexplored.

• Datastreamprocessingengines(e.g.Discretizedstream,Drizzle,Flink)

Several directions have been taken…

7

Page 8: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Twodatatypes:• Stream:unboundedbagofpairs(s,t)wheresisatupleandtisatimestamp

• Relation:time-varyingbagoftuples,R(t)denotesaninstantaneousrelation

STREAM: the Stanford project

8

Page 9: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Stream-to-relationoperatorsarebasedonslidingwindows:• Tuple-based:R(t)containstheNtuplesofstreamSwiththelargesttimestamps<=t

• Time-based:R(t)containsalltuplesofSwithtimestampsbetweent-wandt.

• Partitionedslidingwindow:partitionsSintodifferentsubstreamsbasedonequalityofattributesA1,…,Akandcomputesatuple-basedslidingwindowofsizeNindependentlyoneachsubstream,thentaketheunionofthewindowstoproducetheoutputrelation

CQL: Stream-to-relation operators

9

Page 10: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Istream:contains(s,t)whenevertuplesisinsertedintoRattimet

• Dstream:contains(s,t)whenevertuplesisdeletedfromRattimet

• Rstream:contains(s,t)whenevertuplesisinR(t),i.e.everycurrenttupleofRisstreamedateverytimeinstant

CQL: Relation-to-stream operators

10

SelectIstream(*)FromS[Rowsunbounded]WhereS.A>10ormoreintuitively:Select*FromSWhereS.A>10

Page 11: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

CQL, more examples

11

Select * From S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10 Select Istream(S1.A) From S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10 Select Rstream(S.A, R.B) From S [Now], R Where S.A=R.A

Page 12: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Queryplansconsistof• Operators,whichperformtheactualprocessing• Queues,whichbuffertuplesastheymovebetweenoperators• Synopses,whichstoreoperatorstate

• Eachoperatorreadsfromoneormoreinputqueues,processestheinput,andwritesoutputtoanoutputqueue.

• Allqueuesenforcenondecreasingtimestamps.• Synopsesstorestatethatmayberequiredforfutureevaluationofanoperator,andaresharedbetweenoperatorswheneverpossible.• E.g.materializethecontentsofaslidingrelationortheresultofasubquery

Query plans for continuous queries

12

Page 13: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

Operators in CQL query plans

13

Page 14: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

Example

14

Select * from S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10

Tuple-based windows do not commute with filter conditions.

Page 15: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Noveloptimizations• Synopsissharing• Constraintsonstreams• Operatorscheduling

• Monitoringandadaptivequeryprocessing• Profilercollectsandmaintainsstatisticsaboutstreamandplancharacteristics,e.g.constraints

• Reoptimizerensuresthatplansandmemorystructuresareefficientforcurrentcharacteristics,e.g.joinorders,adding/deletingsubresultcaches

• Approximation

Performance in a time-varying landscape

15

Page 16: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Use“stubs”toindexintosharedsynopses• Withinaquery,e.q.Synopsis1andSynopsis3

• Acrossqueries,e.g.

Optimization: synopsis sharing

16

Select * from S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10 Select A, Max(B) From S1 [Rows 200] Group By A

Page 17: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Referentialintegritykonamany-onejoin:boundkonthedelaybetweenthearrivalofatupleonthe“many”streamandthearrivalofitsjoining“one”tupleontheotherstream.

• Ordered-arrivalk-constraintonastreamattributeA:boundkontheamountofreorderinginvaluesofA.• ForanytuplesinstreamS,foralltupless’thatarriveatleastk+1elementsafters,itmustbetruethats’:A>=s:A.

• Clustered-arrivalk-constraintonastreamattributeA:boundkonthedistancebetweenanytwoelementsthathavethesamevalueofA.

Optimizations: constraints

17

Page 18: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Datastreamsmaybeburstywithpeaksduringwhichsystemresourcesareexhausted.• CPU-limited:dataarrivalrateissohighthatthereisinsufficientCPUtimetoprocesseachstreamelement

• Memory-limited:totalstaterequiredforallqueriesmayexceedavailablememory

• Solution:degradeaccuracybyprovidingapproximateanswersduringloadspikes• CPU-limited:dropelementsbeforetheyareprocessed

• Memory-limited:selectivelyretainsomestateanddiscardtherest

Approximation

18

Page 19: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Introducesamplingoperatorsthatprobabilisticallydropstreamelementsastheyareinputtothequeryplan.

• Forexample,supposethereissetofslidingwindowaggregationqueries• Goal:sampletheinputstominimizethemaximumrelativeerroracrossallqueries,i.e.keeptherelativeerrorthesameforallqueries

• Assume:foreachQi,knowthemeanandstandarddeviationofinputstreamvaluesaswellasthewindowsize(canbecollectedbytheprofiler)

• ThencanusetheHoeffdinginequalitytoderiveaboundontheprobabilitythattherelativeerrorexceedsagiventhresholdforagivensamplingrate.

CPU-limited computation

19

Selectavg(temp)FromSensorReadings[Range5minutes]

Page 20: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Severaloptimizationsareaimedatminimizingmemorydevotedtoqueuesandsynopsissizes(e.g.synopsissharing,operatorscheduling,usingconstraints),butmemorymaystillbealimitation• Spillingtodiskmaynotbefeasibleasitistooslow

• Focusonreducingsynopsis• Introducinganewwindoworshrinkinganexistingwindow

• Maintainingasampleoftheintendedsynopsiscontent

• Usinghistogramsorwaveletswhenthesynopsisisusedforaggregation

• UsingBloomfiltersforduplicateelimination,setdifferenceorsetintersection

• Loweringk-valuesforknownk-constraints

Memory-limited computation

20

Page 21: Streaming Data · 2018. 3. 10. · over streaming data • Cannot directly apply relational semantics, need to introduce streamà relation and relationà stream operators • Optimizing

• Manymodernapplicationsrequirecontinuousqueriesoverstreamingdata

• Cannotdirectlyapplyrelationalsemantics,needtointroducestreamàrelationandrelationàstreamoperators

• Optimizingcontinuousqueriesrequiresanewsetoftricks• Sharingstateandcomputationwithinandacrossqueryplans• Usinginferredconstraintsondatastreams• Adaptivequeryprocessing• Load-sheddingandapproximations

Conclusions

21