Streaming Data
Susan B. Davidson
CIS 700: Advanced Topics in Databases
MW 1:30-3
Towne 309
http://www.cis.upenn.edu/~susan/cis700/homepage.html
• Many modern applications require long-running, continuous queries over unbounded streams of data
  • Network monitoring
  • Financial analysis
  • Manufacturing
  • Sensor networks
• Contrast this with the traditional database setting of one-time queries over finite stored data sets.
Streaming Data
• Suppose we are collecting ocean surface temperature/surface height data using floating sensors
  • Millions of sensors, each sending back a stream at the rate of 10 readings per second
  • This could easily become several terabytes of data per day, which cannot be kept in working storage.
• Sample continuous (“standing”) queries
  • Output an alert whenever the temperature exceeds 25 degrees centigrade
  • Produce the average of the 24 most recent readings
  • Produce the highest temperature recorded, or the average temperature over all recordings
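The standing queries above can each be answered with a small, fixed amount of state per stream. The following is a minimal sketch in Python, not part of the original slides; the class name `OceanStandingQueries` and its parameters are hypothetical and chosen only for illustration.

```python
from collections import deque


class OceanStandingQueries:
    """Answers the sample standing queries incrementally, one reading at a time."""

    def __init__(self, alert_threshold=25.0, window_size=24):
        self.alert_threshold = alert_threshold
        self.recent = deque(maxlen=window_size)  # only the most recent readings
        self.max_temp = None                     # highest temperature seen so far
        self.count = 0                           # number of readings seen so far
        self.total = 0.0                         # running sum for the overall average

    def on_reading(self, temp):
        """Process one reading and return the current answer to each query."""
        self.recent.append(temp)  # the deque silently evicts the oldest reading
        self.count += 1
        self.total += temp
        if self.max_temp is None or temp > self.max_temp:
            self.max_temp = temp
        return {
            "alert": temp > self.alert_threshold,
            "window_avg": sum(self.recent) / len(self.recent),
            "max": self.max_temp,
            "overall_avg": self.total / self.count,
        }
```

Only the windowed average needs to remember individual readings; the alert, the maximum and the overall average need constant state, which is why they remain feasible even when the full stream cannot be stored.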
Example: oceanography
• Web sites receive streams of various types
  • Google receives several hundred million search queries per day
  • Yahoo! accepts billions of “clicks” per day
• Many interesting things can be learned
  • An increase in queries like “sore throat” signals the spread of viruses.
  • A sudden increase in the click rate for a link could indicate some news connected to that page, or it could mean that the link is broken and needs to be repaired.
• One approach to handle ad-hoc queries is to store a sliding window of each stream
  • All inputs in the last t time units
  • Last k inputs
Example: Web sites
Data-stream management system architecture
From “Mining of Massive Datasets” By Leskovec, Rajaraman and Ullman.
• Streams deliver elements rapidly, and elements must be processed in real time
• Algorithms should be executed in main memory
• There may be many streams
  • Even if each stream can easily be processed in main memory, the combination may exceed available memory
• Common techniques: approximation, hashing
• For complex, ad-hoc queries: store a sliding window of each stream in the working store.
  • Most recent n elements of a stream, for some n
  • All the elements that arrived within the last t time units
Issues in stream processing
• Database management systems perspective
  • Stanford STREAM project
  • NiagaraCQ (Wisconsin and OGI)
• Algorithmic perspective
  • Approximation and hashing techniques are commonly explored.
• Data stream processing engines (e.g. Discretized Streams, Drizzle, Flink)
Several directions have been taken…
• Two data types:
  • Stream: unbounded bag of pairs (s, t) where s is a tuple and t is a timestamp
  • Relation: time-varying bag of tuples; R(t) denotes an instantaneous relation
STREAM: the Stanford project
• Stream-to-relation operators are based on sliding windows:
  • Tuple-based: R(t) contains the N tuples of stream S with the largest timestamps <= t
  • Time-based: R(t) contains all tuples of S with timestamps between t-w and t.
  • Partitioned sliding window: partitions S into different substreams based on equality of attributes A1,…,Ak and computes a tuple-based sliding window of size N independently on each substream, then takes the union of the windows to produce the output relation
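The three window definitions can be written down directly as functions that compute R(t) from a finite prefix of a stream. This is a sketch of the semantics only, assuming a stream is a Python list of (s, timestamp) pairs in arrival order; the function names are hypothetical, and a real engine would maintain windows incrementally rather than rescanning the stream. How ties between equal timestamps are broken is not specified on the slide; here later arrivals win.

```python
from collections import defaultdict


def tuple_window(stream, t, n):
    """Tuple-based window: the n tuples with the largest timestamps <= t."""
    if n <= 0:
        return []
    eligible = [(s, ts) for s, ts in stream if ts <= t]
    eligible.sort(key=lambda pair: pair[1])  # stable sort keeps arrival order on ties
    return [s for s, _ in eligible[-n:]]


def time_window(stream, t, w):
    """Time-based window: all tuples with timestamps between t - w and t."""
    return [s for s, ts in stream if t - w <= ts <= t]


def partitioned_window(stream, t, n, key):
    """Partitioned window: a tuple-based window of size n per substream, unioned."""
    substreams = defaultdict(list)
    for s, ts in stream:
        substreams[key(s)].append((s, ts))  # split on equality of the key attributes
    result = []
    for sub in substreams.values():
        result.extend(tuple_window(sub, t, n))
    return result
```

For example, with `key=lambda s: s[0]` the partitioned window keeps the n most recent tuples for every distinct value of the first attribute, so a rarely seen value is not pushed out by a frequent one.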
CQL: Stream-to-relation operators
• Istream: contains (s, t) whenever tuple s is inserted into R at time t
• Dstream: contains (s, t) whenever tuple s is deleted from R at time t
• Rstream: contains (s, t) whenever tuple s is in R(t), i.e. every current tuple of R is streamed at every time instant
CQL: Relation-to-stream operators
Select Istream(*) From S [Rows unbounded] Where S.A > 10
or, more intuitively:
Select * From S Where S.A > 10
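The relation-to-stream operators can be sketched as bag differences between consecutive instantaneous relations. The code below is an illustration under stated assumptions, not the STREAM implementation: `relation_at` is a hypothetical function returning R(t) as a list of hashable tuples, `times` is the sequence of time instants considered, and the relation is taken to be empty before the first instant.

```python
from collections import Counter


def istream(relation_at, times):
    """Emit (s, t) for every tuple s that is in R(t) but was not in the previous R."""
    out, prev = [], Counter()
    for t in times:
        cur = Counter(relation_at(t))
        for s, n in (cur - prev).items():  # bag difference: newly inserted tuples
            out.extend([(s, t)] * n)
        prev = cur
    return out


def dstream(relation_at, times):
    """Emit (s, t) for every tuple s that was in the previous R but is not in R(t)."""
    out, prev = [], Counter()
    for t in times:
        cur = Counter(relation_at(t))
        for s, n in (prev - cur).items():  # bag difference: deleted tuples
            out.extend([(s, t)] * n)
        prev = cur
    return out


def rstream(relation_at, times):
    """Emit (s, t) for every tuple s in R(t), at every time instant t."""
    return [(s, t) for t in times for s in relation_at(t)]


# The filter query above: S [Rows unbounded] Where S.A > 10, over a toy stream S.
S = [((5,), 1), ((12,), 2), ((20,), 3)]


def filtered_unbounded(t):
    return [s for s, ts in S if ts <= t and s[0] > 10]


inserted = istream(filtered_unbounded, [1, 2, 3])
```

Because an unbounded window never deletes tuples, Istream over it emits each qualifying tuple exactly once, at its arrival time, which is why the shorter form of the query reads so naturally.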
CQL, more examples
Select * From S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10
Select Istream(S1.A) From S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10
Select Rstream(S.A, R.B) From S [Now], R Where S.A=R.A
• Query plans consist of
  • Operators, which perform the actual processing
  • Queues, which buffer tuples as they move between operators
  • Synopses, which store operator state
• Each operator reads from one or more input queues, processes the input, and writes output to an output queue.
• All queues enforce nondecreasing timestamps.
• Synopses store state that may be required for future evaluation of an operator, and are shared between operators whenever possible.
  • E.g. materialize the contents of a sliding relation or the result of a subquery
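The three plan components can be illustrated with a toy window operator. This is a rough sketch and not the STREAM code: the class names are hypothetical, and representing window changes as ("+", s) and ("-", s) elements on the output queue is an assumption made here for illustration.

```python
from collections import deque


class TimestampedQueue:
    """Buffers (element, timestamp) pairs and rejects out-of-order timestamps."""

    def __init__(self):
        self._items = deque()
        self._last_ts = None

    def push(self, element, ts):
        if self._last_ts is not None and ts < self._last_ts:
            raise ValueError("timestamps must be nondecreasing")
        self._last_ts = ts
        self._items.append((element, ts))

    def pop(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


class RowsWindowOperator:
    """A [Rows n] window (n >= 1); its synopsis holds the current window contents."""

    def __init__(self, n, in_queue, out_queue):
        self.n = n
        self.in_queue = in_queue
        self.out_queue = out_queue
        self.synopsis = deque()  # operator state kept for future evaluation

    def run(self):
        """Drain the input queue, reporting insertions and expirations downstream."""
        while len(self.in_queue):
            s, ts = self.in_queue.pop()
            if len(self.synopsis) == self.n:
                expired = self.synopsis.popleft()
                self.out_queue.push(("-", expired), ts)  # oldest tuple leaves the window
            self.synopsis.append(s)
            self.out_queue.push(("+", s), ts)            # new tuple enters the window
```

A downstream operator, such as a join or an aggregate, would read the output queue in the same way, so operators only ever communicate through queues.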
Query plans for continuous queries
Operators in CQL query plans
Example
Select * from S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10
Tuple-based windows do not commute with filter conditions.
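A small worked example shows why the filter S1.A>10 cannot simply be pushed below a tuple-based window. The sketch uses a window of 3 rows instead of 1000 to keep the data short; the helper names and the sample values are made up for illustration.

```python
def rows_window(values, n):
    """The n most recent values, in arrival order (n >= 1)."""
    return values[-n:]


def keep_large(values, threshold=10):
    """The filter condition A > threshold."""
    return [a for a in values if a > threshold]


s1 = [12, 3, 15, 4, 5]  # values of S1.A in arrival order

# What the query asks for: window first, then filter.
window_then_filter = keep_large(rows_window(s1, 3))  # [15]

# Filter pushed below the window: older tuples sneak back in.
filter_then_window = rows_window(keep_large(s1), 3)  # [12, 15]
```

Filtering first changes which tuples count as "the last N", so the two plans disagree. A time-based window does not have this problem, because membership depends only on each tuple's own timestamp.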
• Novel optimizations
  • Synopsis sharing
  • Constraints on streams
  • Operator scheduling
• Monitoring and adaptive query processing
  • Profiler collects and maintains statistics about stream and plan characteristics, e.g. constraints
  • Reoptimizer ensures that plans and memory structures are efficient for current characteristics, e.g. join orders, adding/deleting subresult caches
• Approximation
Performance in a time-varying landscape
• Use “stubs” to index into shared synopses
  • Within a query, e.g. Synopsis1 and Synopsis3
  • Across queries, e.g.
Optimization: synopsis sharing
Select * from S1 [Rows 1000], S2 [Range 2 minutes] Where S1.A=S2.A and S1.A>10
Select A, Max(B) From S1 [Rows 200] Group By A
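The two queries above both window the same stream S1, so one stored copy of the last 1000 tuples can serve both. The following is a minimal sketch of that idea, assuming a "stub" is simply a view that exposes a suffix of a shared store; the class names `SharedSynopsis` and `Stub` are hypothetical and the actual STREAM data structures may differ.

```python
from collections import deque


class SharedSynopsis:
    """Single physical store holding the largest window any query needs."""

    def __init__(self, capacity):
        self.tuples = deque(maxlen=capacity)

    def insert(self, s):
        self.tuples.append(s)  # the oldest tuple is evicted once capacity is reached


class Stub:
    """A per-operator view exposing only the most recent `rows` tuples of the store."""

    def __init__(self, store, rows):
        if not 1 <= rows <= store.tuples.maxlen:
            raise ValueError("a stub cannot see more than the shared store keeps")
        self.store = store
        self.rows = rows

    def contents(self):
        return list(self.store.tuples)[-self.rows:]


store = SharedSynopsis(1000)
join_view = Stub(store, 1000)  # used by the join over S1 [Rows 1000]
agg_view = Stub(store, 200)    # used by the aggregate over S1 [Rows 200]

for i in range(1500):
    store.insert((i % 3, i))   # toy tuples (A, B)

# Select A, Max(B) From S1 [Rows 200] Group By A, evaluated on the stub.
max_b_by_a = {}
for a, b in agg_view.contents():
    max_b_by_a[a] = max(b, max_b_by_a.get(a, b))
```

Memory use is governed by the largest window alone; the second query adds only a stub, not another 200 stored tuples.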
• Referential integrity k-constraint on a many-one join: bound k on the delay between the arrival of a tuple on the “many” stream and the arrival of its joining “one” tuple on the other stream.
• Ordered-arrival k-constraint on a stream attribute A: bound k on the amount of reordering in values of A.
  • For any tuple s in stream S, for all tuples s' that arrive at least k+1 elements after s, it must be true that s'.A >= s.A.
• Clustered-arrival k-constraint on a stream attribute A: bound k on the distance between any two elements that have the same value of A.
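The ordered-arrival and clustered-arrival definitions can be checked mechanically on a finite prefix of a stream. The sketch below takes the list of A values in arrival order; the function names are hypothetical, and for clustered arrival "distance" is read as the difference in arrival positions, which is an assumption since the slide does not define it.

```python
def satisfies_ordered_arrival(values, k):
    """True if every value is >= all values that arrived at least k+1 elements earlier."""
    prefix_max = None  # maximum of values[0 .. j-k-1] for the current position j
    for j, v in enumerate(values):
        i = j - (k + 1)
        if i >= 0:
            prefix_max = values[i] if prefix_max is None else max(prefix_max, values[i])
            if v < prefix_max:
                return False
    return True


def satisfies_clustered_arrival(values, k):
    """True if any two elements with the same value are at most k positions apart."""
    first_seen = {}
    for pos, v in enumerate(values):
        if v in first_seen and pos - first_seen[v] > k:
            return False
        first_seen.setdefault(v, pos)  # remember only the first occurrence
    return True
```

With k = 0 the ordered-arrival check reduces to "A is nondecreasing". The value of such constraints is that an operator can discard state early: once an element is more than k positions old, no later element can still need it.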
Optimizations: constraints
• Data streams may be bursty with peaks during which system resources are exhausted.
  • CPU-limited: data arrival rate is so high that there is insufficient CPU time to process each stream element
  • Memory-limited: total state required for all queries may exceed available memory
• Solution: degrade accuracy by providing approximate answers during load spikes
  • CPU-limited: drop elements before they are processed
  • Memory-limited: selectively retain some state and discard the rest
Approximation
• Introduce sampling operators that probabilistically drop stream elements as they are input to the query plan.
• For example, suppose there is a set of sliding window aggregation queries
  • Goal: sample the inputs to minimize the maximum relative error across all queries, i.e. keep the relative error the same for all queries
  • Assume: for each Qi, we know the mean and standard deviation of input stream values as well as the window size (can be collected by the profiler)
  • Then we can use the Hoeffding inequality to derive a bound on the probability that the relative error exceeds a given threshold for a given sampling rate.
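To make the last step concrete, the sketch below applies the standard Hoeffding bound for the mean of n bounded samples, P(|estimate - mean| >= eps) <= 2 exp(-2 n eps^2 / range^2). It is an illustration under simplifying assumptions that are not from the slides: values are assumed to lie in an interval of known width `value_range` (the slide instead mentions the mean and standard deviation), the number of sampled elements is approximated by `sampling_rate * window_size`, and the mean is nonzero. The function names are hypothetical.

```python
import math


def hoeffding_error_probability(sampling_rate, window_size, mean, value_range, rel_error):
    """Upper bound on P(relative error of the sampled average > rel_error)."""
    n = sampling_rate * window_size  # approximate number of sampled elements
    eps = rel_error * abs(mean)      # relative threshold turned into an absolute one
    bound = 2.0 * math.exp(-2.0 * n * eps ** 2 / value_range ** 2)
    return min(1.0, bound)


def min_sampling_rate(window_size, mean, value_range, rel_error, delta):
    """Smallest sampling rate for which the bound above is at most delta (capped at 1)."""
    eps = rel_error * abs(mean)
    n = value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2)
    return min(1.0, n / window_size)
```

Solving the bound for the sampling rate gives each query the rate it needs for a target error, and a load shedder can then pick the common error level at which the total sampled load fits the available CPU.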
CPU-limited computation
Select avg(temp) From SensorReadings [Range 5 minutes]
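A sampling operator placed in front of this query might look like the following sketch. It assumes readings are (timestamp in seconds, temp) pairs and that each element is kept independently with a fixed probability; the function names and the explicit `rng` parameter are hypothetical choices made so the example is reproducible.

```python
import random


def sample_stream(elements, rate, rng):
    """Sampling operator: keep each element independently with probability `rate`."""
    for e in elements:
        if rng.random() < rate:
            yield e


def approx_window_avg(readings, now, range_seconds, rate, rng):
    """Approximate avg(temp) over [Range range_seconds], computed on sampled input."""
    kept = sample_stream(readings, rate, rng)  # elements are dropped before the window
    temps = [temp for ts, temp in kept if now - range_seconds <= ts <= now]
    return sum(temps) / len(temps) if temps else None


readings = [(ts, 20.0 + (ts % 7)) for ts in range(0, 600, 2)]  # toy data
estimate = approx_window_avg(readings, now=598, range_seconds=300, rate=0.2,
                             rng=random.Random(0))
```

An average needs no rescaling after sampling, because numerator and denominator shrink together. A sum or count would have to be divided by the sampling rate to stay unbiased.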
• Several optimizations are aimed at minimizing memory devoted to queues and synopsis sizes (e.g. synopsis sharing, operator scheduling, using constraints), but memory may still be a limitation
  • Spilling to disk may not be feasible as it is too slow
• Focus on reducing synopsis size
  • Introducing a new window or shrinking an existing window
  • Maintaining a sample of the intended synopsis content
  • Using histograms or wavelets when the synopsis is used for aggregation
  • Using Bloom filters for duplicate elimination, set difference or set intersection
  • Lowering k-values for known k-constraints
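As one example from the list above, a Bloom filter can replace the exact "seen so far" set that duplicate elimination would otherwise keep as its synopsis. The sketch below is a generic Bloom filter, not code from any of the systems mentioned; the default sizes are arbitrary and deriving the hash functions from SHA-256 is an implementation choice made here for simplicity.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def distinct(stream, bloom):
    """Approximate duplicate elimination using constant memory."""
    for item in stream:
        if not bloom.might_contain(item):
            bloom.add(item)
            yield item
```

The approximation is one-sided: a duplicate is never emitted, but a false positive can occasionally suppress an element that was in fact new. Memory stays fixed at `num_bits` regardless of how many distinct elements arrive.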
Memory-limited computation
• Many modern applications require continuous queries over streaming data
• Cannot directly apply relational semantics; need to introduce stream-to-relation and relation-to-stream operators
• Optimizing continuous queries requires a new set of tricks
  • Sharing state and computation within and across query plans
  • Using inferred constraints on data streams
  • Adaptive query processing
  • Load-shedding and approximations
Conclusions