1
Making Every Bit Count in Wide Area Analytics
Ariel Rabkin
Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai
2
Global Systems Have Global Data
3
The Rise of Big Distributed Data
• CDNs:
  – Akamai has ~20 million requests per second
  – CloudFlare has about 300 MB/s of logs, volume doubles every 4 months
• Sensor data (e.g., power grid, highways)
• Smart camera networks
4
Trends
[Figure: amount per dollar over time, with curves for data volumes and wide-area bandwidth]
5
Analyzing Low-rate Events is Easy
Server Crashed!
Alert me when server crashes!
6
High-rate Events can be Costly
Every minute, compute request counts by URL
[Figure: multiple streams of requests]
7
Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes
Choice of summaries is made statically, up front
• Buyer’s remorse: chose to collect unnecessary and expensive data
• Analyst’s remorse: summaries insufficient for analysis, with no way to retroactively get more data
8
Local Storage!
Every minute, compute request counts by URL
[Figure: multiple streams of requests, each feeding "Local Aggregation and Storage"]
9
Challenge: Bandwidth Scarcity
I want the request count for every URL every second.
I can’t do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?
Can I get the top 1000 URLs every second?
I can do that for 900 KB/sec.
Great, do it!
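The exchange above amounts to checking a query's cost against a bandwidth budget. The following is a minimal sketch of that check, not the system's actual cost model. The function names are hypothetical, and the 900-byte entry size is an assumption derived from the slide's figure of 900 KB/sec for 1000 URLs per second.

```python
def stream_cost(entries: int, bytes_per_entry: int, period_s: float) -> float:
    """Bytes per second needed to ship `entries` records every `period_s` seconds."""
    return entries * bytes_per_entry / period_s


def fits(cost_bps: float, budget_bps: float) -> bool:
    """True if the stream fits within the available bandwidth."""
    return cost_bps <= budget_bps


# Assumption: 900 KB/sec for 1000 URLs each second implies ~900 bytes per entry.
BYTES_PER_ENTRY = 900
BUDGET_BPS = 12_000_000  # 12 MB/sec, from the slide

# Rank cutoff: only the top 1000 URLs, every second.
top_1000_cost = stream_cost(1000, BYTES_PER_ENTRY, 1.0)

# Hypothetical URL count chosen so the full stream costs roughly 100 MB/sec.
all_urls_cost = stream_cost(111_111, BYTES_PER_ENTRY, 1.0)
```

Under these assumptions the rank cutoff fits the budget and the full per-URL stream does not. Changing the frequency corresponds to raising `period_s`.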
10
Challenge: Varying Scarcity
[Figure: bandwidth over time, with curves for needed and available bandwidth]
Can do
First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
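The policy above (lengthen the aggregation window first, then apply a rank cutoff) can be sketched as a small function. This is an illustration under simplifying assumptions: the function name and parameters are hypothetical, and the cost model assumes the number of distinct URLs does not grow as the window lengthens.

```python
def choose_degradation(n_urls, bytes_per_entry, available_bps, max_window_s=30):
    """Return (window_s, top_k) such that the stream fits in available_bps.

    Step 1: lengthen the aggregation window, one second at a time, up to max_window_s.
    Step 2: if that is still too expensive, keep only the top-k URLs.
    """
    for window_s in range(1, max_window_s + 1):
        if n_urls * bytes_per_entry / window_s <= available_bps:
            return window_s, n_urls  # no rank cutoff needed
    # Largest k whose cost at the maximum window fits the budget.
    top_k = int(available_bps * max_window_s // bytes_per_entry)
    return max_window_s, min(top_k, n_urls)


# Moderate scarcity: a longer window alone is enough.
mild = choose_degradation(10_000, 900, 1_000_000)
# Severe scarcity: the window is maxed out and a rank cutoff is applied.
severe = choose_degradation(1_000_000, 900, 12_000_000)
```

Re-running the function whenever the available bandwidth changes gives a policy that degrades only as far as current conditions require.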
12
Data Processing Requirements
• Aggregatable
• Merge-able
• Reducible
[Figure: diagrams of the properties above; recoverable labels: "Data + Data = Merged Representation", "Data", "Data", "Stored Data", "+", "=", "Update"]
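One possible reading of the three requirements, sketched with per-URL counts held in Python's `collections.Counter`. The function names and the choice of a top-k cutoff as the reduction are assumptions for illustration, not definitions taken from the slides.

```python
from collections import Counter


def update(stored: Counter, url: str, count: int = 1) -> None:
    """Aggregatable: fold a new observation into the stored data in place."""
    stored[url] += count


def merge(a: Counter, b: Counter) -> Counter:
    """Merge-able: combine two summaries into one merged representation."""
    return a + b


def reduce_top_k(data: Counter, k: int) -> Counter:
    """Reducible: shrink a summary, here by keeping only the k largest counts."""
    return Counter(dict(data.most_common(k)))


# Hypothetical per-site summaries.
site_a = Counter({"/index": 3, "/about": 1})
site_b = Counter({"/index": 5, "/faq": 2})

update(site_a, "/about")          # site_a now counts "/about" twice
merged = merge(site_a, site_b)    # counts for "/index" are summed across sites
small = reduce_top_k(merged, 2)   # at most two URLs survive
```

Sums are a convenient example because updating, merging, and reducing all reuse the same aggregation function.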
13
                                    High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)         X                  X                      √                     X
Database tables                           √                  X                      X                     √
14
The Data Cube Model
Counts by URL        12:00   12:01   12:02
www.mysite.com         3       5       …
www.yoursite.com       5       4       …
www.hersite.com        8      12       …
Roll-up of mysite.com by time from 12:00 to 12:01: 8
Roll-up of sites at time 12:00: 16
Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions
Aggregation function used for:
• Updates
• Roll-ups
• Merging cubes
• Degrading cubes
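The roll-ups in the table above can be reproduced with a toy cube. This is a minimal sketch, assuming a single aggregate (a count) with sum as the aggregation function and two dimensions (URL and time). The `Cube` class and its method names are hypothetical.

```python
from collections import defaultdict


class Cube:
    """Toy cube: one aggregate (a count) indexed by the dimensions (url, time)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def update(self, url, time, count):
        # Aggregation function: sum.
        self.cells[(url, time)] += count

    def rollup(self, url=None, times=None):
        """Sum all cells matching the given URL and/or set of times."""
        return sum(
            v for (u, t), v in self.cells.items()
            if (url is None or u == url) and (times is None or t in times)
        )

    def merge(self, other):
        # Merging reuses the same aggregation function as updates.
        for (u, t), v in other.cells.items():
            self.update(u, t, v)


cube = Cube()
for url, t, n in [
    ("www.mysite.com", "12:00", 3), ("www.mysite.com", "12:01", 5),
    ("www.yoursite.com", "12:00", 5), ("www.yoursite.com", "12:01", 4),
    ("www.hersite.com", "12:00", 8), ("www.hersite.com", "12:01", 12),
]:
    cube.update(url, t, n)

mysite_total = cube.rollup(url="www.mysite.com", times={"12:00", "12:01"})
noon_total = cube.rollup(times={"12:00"})
```

With the slide's data, `mysite_total` is 8 and `noon_total` is 16, matching the two roll-ups shown.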
15
                                    High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)         X                  X                      √                     X
Database tables                           √                  X                      X                     √
Data Cube                                 √                  √                      √                     X
16
A Vision for Wide-Area Analytics
[Figure: at each of two sites, dataflow operators feed a local cube followed by further dataflow operators; across a network bottleneck, dataflow operators feed a merged cube followed by more dataflow operators]
Dataflow adapted to bandwidth
17
Adaptivity
[Figure: dataflow operators feed a local cube, followed by further dataflow operators in front of a network bottleneck]
18
Adaptivity
[Figure: dataflow operators feed a local cube; further dataflow operators produce a summarized cube in front of a network bottleneck, with feedback control]
• Key ingredients:
  – Cube summarization as mechanism
  – User-defined policies
  – Feedback control
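To illustrate how the three ingredients might fit together, here is a hedged sketch of a stepwise control loop. It is not JetStream's algorithm. The degradation levels, the `cost_at` model (each level halves the data volume), and the function name `control_step` are all assumptions made for the example.

```python
def control_step(level, cost_at, available_bps, max_level):
    """One control step: pick the next degradation level.

    level         -- current degradation level (0 means full-quality data)
    cost_at       -- policy-supplied function giving the bandwidth cost of a level
    available_bps -- measured available bandwidth (the feedback signal)
    """
    if cost_at(level) > available_bps and level < max_level:
        return level + 1   # over budget: summarize more aggressively
    if level > 0 and cost_at(level - 1) <= available_bps:
        return level - 1   # room to spare: restore some data quality
    return level


def cost_at(level):
    # Hypothetical policy: each level halves a 100-unit full-quality stream.
    return 100.0 / 2 ** level


level = 0
history = []
# Available bandwidth is scarce at first (12 units), then improves (60 units).
for available in [12] * 6 + [60] * 5:
    level = control_step(level, cost_at, available, max_level=6)
    history.append(level)
```

Stepping one level at a time trades reaction speed for stability: the level settles once the current cost fits and the next-better level would not.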
19
Backup Slides
20
Conclusions
• The hard problems in wide-area analysis:
  – Reasoning about bandwidth/data quality tradeoffs
  – Optimizing data quality under changing conditions
  – Jointly optimizing bandwidth and other resources
• We are building a system.
  – We call it JetStream. Stay tuned….
23
Bandwidth Costs do not Decline Smoothly
[TeleGeography's Global Bandwidth Research Service]
24
2012 Bandwidth Price Shifts
[Figure: bandwidth price shifts; recoverable labels: "Frankfurt-London", "20%", "20%"]
[TeleGeography's Global Bandwidth Research Service]
25
Diurnal Load Makes Overprovisioning Expensive
• Leased lines waste capacity during off-peak
• Public internet gets congested during peak
29
Benefit: Iteration
Can iteratively pose different queries
[Figure: multiple streams of requests, each feeding "Local Aggregation and Storage"; labeled "A revised query"]
30
Benefit: adaptation
Can adapt data volume collected to available bandwidth
[Figure: multiple streams of requests, each feeding "Local Aggregation and Storage"; labeled "Limited Bandwidth"]
31
Benefit: adaptation
Can adapt data volume collected to available bandwidth
[Figure: multiple streams of requests, each feeding "Local Aggregation and Storage"; labeled "Ample Bandwidth"]
32
A dataflow model for wide-area analytics
Operator: Defines data transformation on tuples. Can do input or output.
Cube: Structured storage of data
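A minimal sketch of the two building blocks, assuming tuples of the form (url, count). The operator and cube names are hypothetical and do not reflect the system's actual API: one operator does input, one transforms tuples, and a cube stores the result in structured form.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

UrlCount = Tuple[str, int]  # hypothetical tuple schema: (url, count)


def source(lines: Iterable[str]) -> Iterator[UrlCount]:
    """Input operator: turn raw log lines into tuples."""
    for line in lines:
        yield line.strip(), 1


def only_html(tuples: Iterable[UrlCount]) -> Iterator[UrlCount]:
    """Transformation operator: keep tuples whose URL ends in .html."""
    return (t for t in tuples if t[0].endswith(".html"))


class CountCube:
    """Structured storage: counts indexed by URL."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def insert(self, tuples: Iterable[UrlCount]) -> None:
        for url, n in tuples:
            self.counts[url] += n


# Chain the operators and ingest their output into a cube.
cube = CountCube()
cube.insert(only_html(source(["/a.html", "/b.png", "/a.html\n"])))
```

Writing operators as generators keeps each stage independent, so stages can be rearranged or replaced without touching the cube.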
33
[Figure: at each of two sites, a source feeds processing and a cube; processed data crosses a network bottleneck]
Generated data ingested into local cubes
34
Processed Data
Processing