Composite Subset Measures
Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran
University of Wisconsin - Madison
Raghu Ramakrishnan
Yahoo! Research and University of Wisconsin - Madison
09.12.2006
2
Motivation
Consider this query: "For each year and each country, compute the ratio of the average personal income between the richest city and the poorest city. Then find the number of countries where this ratio decreases continuously between 1990 and 2000."
It is
  Hard to write in SQL
  Hard to optimize and understand as an SQL query
This kind of query is increasingly common:
  Multi-step aggregation
  Must scale to very large datasets, often distributed
3
Contributions
A new framework for expressing such compositional aggregate queries
  Key contribution is how we look at the computation, in terms of aggregating over related regions in "cube space"
An efficient evaluation framework based on sorted scans that takes multiple aggregation steps into account
Experimental results
4
Background
Computing "measures"
  Measures summarize some characteristic of data subsets (e.g., SUM, std dev, beta-value of a portfolio)
  Approaches: Group by, data cubes, Hancock, Sawzall
Cube space
  Partition feature space using attribute values; domain hierarchies organize this space into nested collections of regions
  Regions: (2006, Korea), (2006/09, Seoul)
  Region sets: (Year, Country), (Month, City)
5
Composite Subset Measures
The measure of a cube region is computed by:
  Aggregating data in a region directly (e.g., sales volumes for each day), or
  Summarizing the measures for related regions, e.g.:
    The maximum of daily volumes within a year
    The ratio of average personal incomes between the richest and poorest cities in a country
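The two ways of computing a measure can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `sales` records and both function names are hypothetical, not part of the original framework.

```python
from collections import defaultdict

# Hypothetical (day, amount) sales records, made up for illustration.
sales = [
    ("2005-01-03", 10.0),
    ("2005-01-03", 5.0),
    ("2005-07-19", 40.0),
    ("2006-02-11", 7.0),
    ("2006-02-12", 3.0),
]

def daily_volume(records):
    """Basic measure: aggregate the raw records that fall in each day region."""
    volume = defaultdict(float)
    for day, amount in records:
        volume[day] += amount
    return dict(volume)

def yearly_max_of_daily(daily):
    """Composite measure: summarize the day measures inside each year region."""
    best = {}
    for day, vol in daily.items():
        year = day[:4]  # parent region of the day in the time hierarchy
        best[year] = max(best.get(year, vol), vol)
    return best

daily = daily_volume(sales)
yearly = yearly_max_of_daily(daily)
print(yearly)
```

The second function never touches the raw records; it only reads the measures of related (child) regions, which is what makes the measure composite.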
6
What is “Related” in Cube Space
Focus on relationships that
  are commonly used
  can be efficiently evaluated
Self
Parent/Child, e.g., Year/Day
Child/Parent, e.g., Day/Year
Sibling, e.g., Today/Tomorrow
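As a rough illustration, the four relationships can be written as functions over a time hierarchy of days and years. The function names are hypothetical, and a day is represented by a `datetime.date` purely for convenience.

```python
from datetime import date, timedelta

def self_region(day):
    """Self: a region is related to itself."""
    return day

def parent_of(day):
    """Child/Parent (Day/Year): the year region that contains a day."""
    return day.year

def children_of(year):
    """Parent/Child (Year/Day): all day regions contained in a year."""
    d = date(year, 1, 1)
    while d.year == year:
        yield d
        d += timedelta(days=1)

def next_sibling(day):
    """Sibling (Today/Tomorrow): the neighbouring region at the same level."""
    return day + timedelta(days=1)

today = date(2006, 9, 12)
print(parent_of(today), next_sibling(today), sum(1 for _ in children_of(2006)))
```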
7
Examples (Network Analysis)
Data involved:
  Stream of data records for IP packet information
  Time (t), Source (U), Destination (D), Size (s)
Queries:
  For every minute, the number of outgoing packets from each given source IP
  For every hour, the maximum per-minute number of outgoing packets from a given source IP
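A minimal sketch of the two queries, assuming packet records are tuples `(t, U, D, s)` with `t` in seconds. The sample packets and function names are hypothetical.

```python
from collections import Counter

# Hypothetical packet records: (t in seconds, source U, destination D, size s).
packets = [
    (5, "10.0.0.1", "10.0.0.9", 100),
    (42, "10.0.0.1", "10.0.0.8", 60),
    (61, "10.0.0.1", "10.0.0.9", 40),
    (75, "10.0.0.2", "10.0.0.9", 500),
    (3700, "10.0.0.1", "10.0.0.7", 80),
]

def per_minute_counts(records):
    """Query 1: number of outgoing packets per (minute, source IP)."""
    counts = Counter()
    for t, src, _dst, _size in records:
        counts[(t // 60, src)] += 1
    return counts

def hourly_max_of_minute_counts(minute_counts):
    """Query 2: per (hour, source IP), the maximum of the per-minute counts."""
    best = {}
    for (minute, src), n in minute_counts.items():
        key = (minute // 60, src)
        best[key] = max(best.get(key, 0), n)
    return best

minute_counts = per_minute_counts(packets)
hourly_max = hourly_max_of_minute_counts(minute_counts)
print(hourly_max)
```

The second query is computed from the result of the first rather than from the raw stream, which is the multi-step pattern the deck is concerned with.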
8
Expression Algebra
Each measure entity is defined as a collection of region/value pairs
  Regions should belong to the same region set
Operators:
  Fact table: $D$
  Aggregation: $g_{G,\,agg}(T)$
  Selection: $\sigma_{cond}(T)$
  Match join: $S \bowtie_{cond,\,agg} T$
  Combine join: $S \,\overline{\bowtie}_{fc}\, (T_1, T_2, \ldots, T_n)$
9
Example: Aggregation
For every hour and every unique IP, compute the number of outgoing packets
$S_C = g_{(t:hour,\,U:IP),\,count(*)}(D)$
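The aggregation example can be mimicked with a small Python function. This is a sketch under assumptions: a measure is modelled as a mapping from region to value, the fact table `D` holds hypothetical `(t, U)` pairs with `t` in seconds, and `aggregate_count` is an illustrative name.

```python
from collections import Counter

def aggregate_count(fact_table, region_of):
    """count(*) aggregation: count the fact records falling in each region."""
    return Counter(region_of(record) for record in fact_table)

# Hypothetical fact table D: (t in seconds, source IP U).
D = [(10, "a"), (20, "a"), (3700, "a"), (3800, "b")]

# Region set (t:hour, U:IP): map each record to its (hour, IP) region.
S_C = aggregate_count(D, lambda record: (record[0] // 3600, record[1]))
print(dict(S_C))
```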
10
Example: Selection
For every hour, compute the sum of outgoing packets from those source IPs with at least five packets in that hour (high traffic count)
$S_S = g_{(t:hour),\,count(*)}(\sigma_{M \geq 5}(S_C))$
[Figure: axes are time and Source.]
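Selection followed by aggregation can be sketched as below. The input measure `S_C` is hypothetical, `select` and `aggregate` are illustrative names, and the aggregate is passed in as a parameter so that either a count or a sum over the selected regions can be computed.

```python
from collections import defaultdict

def select(measure, cond):
    """Selection: keep only the regions whose measure value satisfies cond."""
    return {region: m for region, m in measure.items() if cond(m)}

def aggregate(measure, region_of, agg):
    """Aggregation over a measure table: group regions, then apply agg."""
    groups = defaultdict(list)
    for region, m in measure.items():
        groups[region_of(region)].append(m)
    return {region: agg(values) for region, values in groups.items()}

# Hypothetical per-(hour, IP) packet counts.
S_C = {(0, "a"): 7, (0, "b"): 2, (0, "c"): 5, (1, "a"): 9}

high = select(S_C, lambda m: m >= 5)                      # at least five packets
S_S = aggregate(high, lambda region: region[0], len)      # count(*) per hour
S_S_sum = aggregate(high, lambda region: region[0], sum)  # packet total per hour
print(S_S, S_S_sum)
```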
11
Example: Match
For each six-hour time window, compute the average of the high traffic count
$S_{avg} = S_{S_1} \bowtie_{(S_{S_1}.t \in [S_{S_2}.t,\ S_{S_2}.t + 6]),\ avg(S_{S_2}.M)} S_{S_2}$
[Figure: the two measure tables $S_{S_1}$ and $S_{S_2}$.]
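A match join can be sketched as a nested loop: for each region of the left measure, collect the right-hand regions that satisfy the condition and aggregate their values. The names are hypothetical. The direction and inclusiveness of the six-hour window, and the choice to skip regions with no match, are assumptions of this sketch.

```python
def match_join(S, T, cond, agg):
    """For each region s of S, aggregate the T values of regions matching cond."""
    result = {}
    for s in S:
        matched = [value for t, value in T.items() if cond(s, t)]
        if matched:  # assumption: regions without any match are skipped
            result[s] = agg(matched)
    return result

def average(values):
    return sum(values) / len(values)

# Hypothetical high traffic count per hour.
S_S = {0: 2, 1: 4, 2: 6, 10: 8}

# Assumed window: hours t2 with t2 <= t1 <= t2 + 6.
S_avg = match_join(S_S, S_S, lambda t1, t2: t2 <= t1 <= t2 + 6, average)
print(S_avg)
```

Both operands are the same measure here, which is how a sibling relationship (neighbouring hours) shows up in the algebra.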
12
Example: Combine
For each hour, compute the ratio between the six-hour average and the high traffic count
$S_{ratio} = S_{avg} \,\overline{\bowtie}_{S_{avg}.M / S_S.M}\, (S_S)$
[Figure: the two measure tables $S_S$ and $S_{avg}$.]
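A combine join pairs up the values that several measures hold for the same region and applies a function to them. In this sketch `combine_join` is a hypothetical name, the input values are made up, and regions missing from any input are dropped by assumption.

```python
def combine_join(fc, S, *Ts):
    """Apply fc to the values that S and every T hold for the same region."""
    result = {}
    for region, value in S.items():
        if all(region in T for T in Ts):  # assumption: drop incomplete regions
            result[region] = fc(value, *(T[region] for T in Ts))
    return result

# Hypothetical per-hour measures.
S_avg = {0: 2.0, 1: 3.0, 2: 4.0}
S_S = {0: 2, 1: 4, 2: 8}

S_ratio = combine_join(lambda avg, high: avg / high, S_avg, S_S)
print(S_ratio)
```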
13
Aggregation Workflows
A diagrammatic way to express multiple composite subset measure expressions
  Semantically equivalent to the algebra
Rectangles: Region sets
Ellipses: Measures associated with the region sets
Arcs: Computational dependencies among measures
14
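Example
[Figure: aggregation workflow. Region set (t:hour, U:IP) holds the measure Count = count(*). Region set (t:hour) holds the measure Savg = Avg(Count), with match condition Count.t = Sbase.t. Legend: region set, measure name, aggregation formula, selection condition, match condition.]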
15
Example (cont.)
[Figure: aggregation workflow. Region set (t:hour, U:IP) holds the measures MAXS = max(s), MINS = min(s), and Ratio = MAXS/MINS.]
16
Multi-step Execution Plan
Evaluation based on the topological order of the aggregation workflow
  Materialize non-dependent measures
  Then evaluate dependent measures following the arcs of the aggregation workflow
  May need to perform joins
Problem
  Intermediate measures: extra I/O
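The evaluation order of a multi-step plan can be obtained with a topological sort of the workflow's dependency arcs. The sketch below uses Python's standard `graphlib` module. The measure names and the `workflow` dictionary are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each measure maps to the measures it depends on.
workflow = {
    "Count": set(),                     # computed directly from the fact table
    "HighTraffic": {"Count"},
    "Avg6h": {"HighTraffic"},
    "Ratio": {"HighTraffic", "Avg6h"},
}

# Measures without dependencies come first, dependent measures follow the arcs.
order = list(TopologicalSorter(workflow).static_order())

materialized = []
for measure in order:
    # Every input must already be materialized before a measure is evaluated.
    assert all(dep in materialized for dep in workflow[measure])
    materialized.append(measure)

print(order)
```

Each step in this plan reads fully materialized inputs, which is where the extra I/O for intermediate measures comes from.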
17
Simple Scan Execution[*]
Build one hash table for each measure
"Insert" data into the hash tables of low-level measures
Propagate the measures upwards after the scan is over
  Distributive or algebraic aggregation function
Problem
  Each hash table keeps all the entries
  Bottleneck: memory capacity
[*] T. Johnson and D. Chatziantoniou, "Extending complex ad-hoc OLAP," in CIKM, 1999, pp. 170-179.
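A simple scan can be pictured as follows: one pass inserts records into the hash table of a low-level measure, and higher-level measures are derived from that table once the scan is over. This is a sketch of the idea, not the cited implementation. The records and names are hypothetical, and the example relies on count being distributive.

```python
from collections import defaultdict

def simple_scan(records):
    """One pass over unordered (t seconds, source IP) records."""
    count_hour_ip = defaultdict(int)  # hash table of the low-level measure
    for t, src in records:
        count_hour_ip[(t // 3600, src)] += 1

    # After the scan: propagate upwards. Counts are distributive, so the
    # per-hour count is the sum of the per-(hour, IP) counts.
    count_hour = defaultdict(int)
    for (hour, _src), n in count_hour_ip.items():
        count_hour[hour] += n
    return dict(count_hour_ip), dict(count_hour)

records = [(3700, "a"), (10, "a"), (20, "b"), (30, "a")]
low, high = simple_scan(records)
print(low, high)
```

Note that `count_hour_ip` holds every entry until the scan ends, which is the memory bottleneck named on the slide.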
18
Sort/Scan Execution
Simple scan requires large memory
  For each hash table, we need to keep all the entries during the scan
When the data is ordered
  Some hash entries can be flushed out before the scan is finished
  The memory footprint can be reduced
  One-pass scan becomes feasible
  CPU cost is reduced
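The effect of ordering can be sketched with a generator that flushes the hash table whenever the scan moves on to a new hour. This assumes the input is sorted by time. The function name and records are hypothetical.

```python
def sort_scan_counts(sorted_records):
    """Yield ((hour, src), count) from records sorted by time.

    Entries of a finished hour are flushed as soon as the scan reaches the
    next hour, so the table never holds more than one hour of entries.
    """
    current_hour, table = None, {}
    for t, src in sorted_records:
        hour = t // 3600
        if hour != current_hour:
            yield from sorted(table.items())  # flush the finished hour
            table, current_hour = {}, hour
        table[(hour, src)] = table.get((hour, src), 0) + 1
    yield from sorted(table.items())          # flush the last hour

records = sorted([(3700, "a"), (10, "a"), (20, "b"), (30, "a")])
stream = list(sort_scan_counts(records))
print(stream)
```

The flushed entries form an output stream that is itself ordered by hour, so a dependent measure can consume it in the same pass.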
19
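Evaluation
[Figure: raw data sorted by day, spanning month 1 and month 2, is scanned into hash tables. Region sets: (t:Day), (t:Day, U:IP), (t:Month, U:IP). Measures: COUNT0 = count(*), COUNT2 = count(*), COUNT3 = count(*).]
Output stream for each hash table is still ordered!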
20
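Evaluation
[Figure: raw data sorted by month, spanning month 1 and month 2, is scanned into hash tables. Region sets: (t:Day), (t:Day, U:IP), (t:Month, U:IP). Measures: COUNT0 = count(*), COUNT2 = count(*), COUNT3 = count(*).]
All the output streams are ordered by month!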
21
Evaluation
[Figure: data sorted by (t:month, U:IP), spanning month 1 and month 2, is scanned into the hash table for region set (t:Month, U:IP) with measure COUNT3 = count(*).]
By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint
22
Order and Slack
Order
  How the records are ordered in the stream
  E.g., <t:day, U:IP>
Slack
  The gap between the output stream of the measure and the scan progress of the raw data
  E.g., <t:day:[-3,+3]>
We have developed a mechanism to
  Calculate the order/slack
  Take advantage of the order/slack information during evaluation
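One way to picture slack is a sibling-window measure: the value for a day that depends on days up to `ahead` later can only be emitted once the scan has moved past that window, so the output stream lags the scan by a bounded gap. The sketch below is an illustration under that reading, not the mechanism developed by the authors. All names and the sample data are hypothetical.

```python
from collections import deque

def windowed_sums(daily, back=3, ahead=3):
    """daily: (day, value) pairs sorted by day, one pair per day present.

    Yields (day, sum of values over days in [day - back, day + ahead]) as soon
    as the scan has passed day + ahead.
    """
    window = deque()   # entries some pending output may still need
    pending = deque()  # days whose windows are not complete yet

    def emit(day):
        return day, sum(v for d, v in window if day - back <= d <= day + ahead)

    for day, value in daily:
        # Pending days whose window ends before the current day are complete.
        while pending and pending[0] + ahead < day:
            yield emit(pending.popleft())
        window.append((day, value))
        pending.append(day)
        # Drop entries that no pending day can reach any more.
        while window and window[0][0] < pending[0] - back:
            window.popleft()
    while pending:
        yield emit(pending.popleft())

result = list(windowed_sums([(1, 1), (2, 1), (3, 1), (10, 5)], back=1, ahead=1))
print(result)
```

Because the gap is bounded, only the entries inside it need to stay in memory, and the emitted stream keeps the order of the input.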
23
Evaluation Network
[Figure: evaluation network. A scan of the sorted data feeds the hash tables of measures M1, M2, M3, and M4. Stream annotations: order key <t:Day, U:IP>, slack t:[-1,+1]; order key <t:Hour, U:IP>, slack t:[-1,+1]; order key <t:Hour, U:IP>, slack <>.]
24
Optimization
How to find a good sort order?
  Enumerate all possible orders
  For each order, estimate the memory usage
  Use the sort order with minimal usage
Evaluation with multiple passes
  Which measures to compute during each pass?
  Which order to use in each pass?
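The search over sort orders can be sketched as a brute-force enumeration. The cost model here is a deliberately simple assumption, not the estimator from the original work: a measure's hash table is taken to need one entry per combination of its grouping attributes that are not covered by a leading prefix of the sort order. The attribute cardinalities and measures are hypothetical.

```python
from itertools import permutations
from math import prod

# Hypothetical attribute cardinalities and grouping attributes per measure.
cardinality = {"t": 24, "U": 1000, "D": 500}
measures = [("t", "U"), ("t",), ("t", "D")]

def entries_needed(order, group):
    """Assumed cost model: entries a hash table must hold at the same time."""
    remaining = set(group)
    for attr in order:
        if attr not in remaining:
            break  # sorting below this attribute no longer helps flushing
        remaining.remove(attr)
    return prod(cardinality[a] for a in remaining)

def estimated_memory(order):
    return sum(entries_needed(order, group) for group in measures)

best = min(permutations(cardinality), key=estimated_memory)
print(best, estimated_memory(best))
```

Enumeration is factorial in the number of attributes, which is tolerable only because the number of sort attributes is usually small.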
25
Experiments
Synthetic data set, 64 million records
Scenario 1
  The measures of a region are computed by combining the aggregated measures for different kinds of child region sets
Scenario 2
  The measures of a region are computed by aggregating the measures of multiple chained siblings
26
Experimental Results (cont.)
[Figure: execution time in seconds (0 to 9000) versus the number of dependent child measures (2 to 6), for DB and SortScan.]
27
Experimental Results
[Figure: execution time in seconds (0 to 3000) versus the size of the sibling chain (2 to 7), for DB and SortScan.]
28
Conclusions
Composite measures as building blocks for complicated analysis processes
Algebra provides the semantic foundation
Aggregation workflow offers an intuitive interface
Sort/Scan execution plan evaluates multiple dependent measures in the same run and hence improves evaluation performance