Composite Subset Measures Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran University of Wisconsin - Madison Raghu Ramakrishnan Yahoo! Research and University of Wisconsin – Madison 09.12.2006


Composite Subset Measures

Lei Chen, Paul Barford, Bee-Chung Chen, Vinod Yegneswaran

University of Wisconsin - Madison

Raghu Ramakrishnan

Yahoo! Research and University of Wisconsin – Madison

09.12.2006


Motivation

Consider this query: “For each year and each country, compute the ratio of the average personal incomes between the richest city and the poorest city. Then find the number of countries where this ratio decreased continuously between 1990 and 2000.”

This query is:
- Hard to write in SQL
- Hard to optimize and understand as a SQL query

Queries of this kind are increasingly common:
- Multi-step aggregation
- Must scale to very large datasets, often distributed


Contributions

- A new framework for expressing such compositional aggregate queries
  - The key contribution is how we look at the computation: as aggregation over related regions in “cube space”
- An efficient evaluation framework based on sorted scans that takes multiple aggregation steps into account
- Experimental results


Background

Computing “measures”
- Measures summarize some characteristic of data subsets (e.g., SUM, standard deviation, beta value of a portfolio)
- Approaches: GROUP BY, data cubes, Hancock, Sawzall

Cube space
- Partition the feature space using attribute values; domain hierarchies organize this space into nested collections of regions
- Regions: (2006, Korea), (2006/09, Seoul)
- Region sets: (Year, Country), (Month, City)
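As a rough illustration of regions and region sets, the sketch below maps a record to the region that contains it in a chosen region set, by generalizing each attribute value to the requested level of its domain hierarchy. The hierarchy table, the function names, and the sample record are assumptions made for this example only.

```python
# Hypothetical domain hierarchy for locations: city -> country.
CITY_TO_COUNTRY = {"Seoul": "Korea", "Busan": "Korea", "Madison": "USA"}

def generalize_time(day: str, level: str) -> str:
    """Generalize a 'YYYY/MM/DD' value to the Day, Month, or Year level."""
    year, month, _ = day.split("/")
    return {"Day": day, "Month": f"{year}/{month}", "Year": year}[level]

def generalize_location(city: str, level: str) -> str:
    """Generalize a city to the City or Country level."""
    return {"City": city, "Country": CITY_TO_COUNTRY[city]}[level]

def region_of(record: dict, region_set: tuple) -> tuple:
    """Return the region of `region_set` (time level, location level) that contains the record."""
    time_level, location_level = region_set
    return (generalize_time(record["day"], time_level),
            generalize_location(record["city"], location_level))

record = {"day": "2006/09/12", "city": "Seoul"}
print(region_of(record, ("Year", "Country")))  # ('2006', 'Korea')
print(region_of(record, ("Month", "City")))    # ('2006/09', 'Seoul')
```

Every record falls into exactly one region of each region set, which is what lets a measure be defined per region.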


Composite Subset Measures

The measure of a cube region is computed by:
- Aggregating the data in the region directly (e.g., sales volume for each day), or
- Summarizing the measures of related regions, e.g.:
  - The maximum of daily volumes within a year
  - The ratio of average personal incomes between the richest and poorest cities in a country


What is “Related” in Cube Space

Focus on relationships that are commonly used and can be efficiently evaluated:
- Self
- Parent/Child, e.g., Year/Day
- Child/Parent, e.g., Day/Year
- Sibling, e.g., Today/Tomorrow
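The relationships listed above can be pictured on the time dimension with a few helper functions. This is only a sketch: the function names are hypothetical, and a sibling is modeled here as a day shifted by a fixed offset.

```python
from datetime import date, timedelta

def self_region(day: date) -> date:
    """Self: the region itself."""
    return day

def parent_year(day: date) -> int:
    """Child/Parent: the year region that contains a day region."""
    return day.year

def child_days(year: int) -> list:
    """Parent/Child: all day regions nested inside a year region."""
    first = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - first).days
    return [first + timedelta(days=i) for i in range(n_days)]

def sibling(day: date, offset: int = 1) -> date:
    """Sibling: a region of the same region set, shifted by `offset` days."""
    return day + timedelta(days=offset)

today = date(2006, 9, 12)
print(parent_year(today))      # 2006
print(len(child_days(2006)))   # 365
print(sibling(today))          # 2006-09-13
```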


Examples (Network Analysis)

Data involved:
- A stream of records describing IP packets
- Fields: Time (t), Source (U), Destination (D), Size (s)

Queries:
- For every minute, the number of outgoing packets from each source IP
- For every hour, the maximum per-minute number of outgoing packets from a given source IP
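The two queries can be sketched in a few lines, with the second one computed from the result of the first rather than from the raw packets. The packet tuples, the use of seconds as the time unit, and the variable names are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical packet records: (time in seconds, source IP, destination IP, size in bytes).
packets = [
    (5, "10.0.0.1", "10.0.0.9", 40),
    (20, "10.0.0.1", "10.0.0.9", 60),
    (70, "10.0.0.1", "10.0.0.8", 40),
    (3700, "10.0.0.1", "10.0.0.9", 40),
    (30, "10.0.0.2", "10.0.0.9", 40),
]

# Query 1: for every minute and source IP, count the outgoing packets.
per_minute = Counter((t // 60, src) for t, src, _dst, _size in packets)

# Query 2: for every hour and source IP, take the maximum of the per-minute counts.
per_hour_max = defaultdict(int)
for (minute, src), count in per_minute.items():
    hour = minute // 60
    per_hour_max[(hour, src)] = max(per_hour_max[(hour, src)], count)

print(per_minute[(0, "10.0.0.1")])    # 2
print(per_hour_max[(0, "10.0.0.1")])  # 2
print(per_hour_max[(1, "10.0.0.1")])  # 1
```

The hourly measure is a summary of child regions (minutes within the hour), not a direct aggregate of the packets.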


Expression Algebra

Each measure entity is defined as a collection of region/value pairs
- All regions of an entity should belong to the same region set

Operators:
- Fact table: $D$
- Aggregation: $g_{G,\,agg}(T)$
- Selection: $\sigma_{cond}(T)$
- Match join: $S \bowtie_{cond,\,agg} T$
- Combine join: $S \,\overline{\bowtie}_{f_c}\, (T_1, T_2, \ldots, T_n)$


Example: Aggregation

For every hour and every unique IP, compute the number of outgoing packets

$S_C = g_{(t:\text{hour},\ U:\text{IP}),\ count(*)}(D)$


Example: Selection

For every hour, compute the sum of outgoing packets from those source IPs with at least five packets in that hour (high traffic count)

$S_S = g_{(t:\text{hour}),\ count(*)}\left(\sigma_{M \ge 5}(S_C)\right)$

[Figure: regions plotted over the axes time and Source]


Example: Match

For each six-hour time window, compute the average of the high traffic count

$S_{avg} = S_{S_1} \bowtie_{(S_{S_1}.t \,\in\, [S_{S_2}.t,\ S_{S_2}.t + 6]),\ avg(S_{S_2}.M)} S_{S_2}$

[Figure: the two join inputs, labeled $S_{S_1}$ and $S_{S_2}$]


Example: Combine

For each hour, compute the ratio between the six-hour average and the high traffic count

$S_{ratio} = S_{avg} \,\overline{\bowtie}_{S_{avg}.M \,/\, S_S.M}\, S_S$

[Figure: the two join inputs, labeled $S_S$ and $S_{avg}$]
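To make the four operators concrete, here is a minimal in-memory sketch that chains the aggregation, selection, match, and combine examples. It is not the authors' implementation: measures are modeled as plain dicts from region to value, regions without a match are simply dropped, and the threshold (at least 5), the per-hour count(*) over the selected entries, and the window condition (S1.t in [S2.t, S2.t + 6]) follow one reading of the examples above and should be treated as assumptions. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate(rows, region_of, agg):
    """Aggregation: group rows by region and aggregate each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[region_of(row)].append(row)
    return {region: agg(group) for region, group in groups.items()}

def select(measure, cond):
    """Selection: keep the region/value pairs that satisfy cond."""
    return {r: m for r, m in measure.items() if cond(r, m)}

def match_join(s, t, cond, agg):
    """Match join: for each region of s, aggregate the values of the matching regions of t."""
    out = {}
    for rs in s:
        matched = [m for rt, m in t.items() if cond(rs, rt)]
        if matched:  # assumption: regions without any match are dropped
            out[rs] = agg(matched)
    return out

def combine_join(s, ts, fc):
    """Combine join: combine the values that s and every table in ts hold for the same region."""
    return {r: fc(m, *(t[r] for t in ts)) for r, m in s.items() if all(r in t for t in ts)}

def packets(hour, ip, n):
    """Hypothetical fact rows: (time in seconds, source IP, destination IP, size)."""
    return [(hour * 3600 + i, ip, "10.0.0.9", 40) for i in range(n)]

D = packets(0, "A", 5) + packets(0, "B", 6) + packets(0, "C", 1) + packets(1, "A", 5) + packets(8, "A", 5)

# Aggregation: packets per (hour, source IP).
S_C = aggregate(D, lambda p: (p[0] // 3600, p[1]), len)
# Selection, then aggregation: per hour, count(*) over sources with at least 5 packets.
S_S = aggregate(select(S_C, lambda r, m: m >= 5).items(), lambda item: (item[0][0],), len)
# Match join: average over the entries whose hour t2 satisfies t2 <= t <= t2 + 6.
S_avg = match_join(S_S, S_S, lambda r1, r2: r2[0] <= r1[0] <= r2[0] + 6, mean)
# Combine join: ratio of the windowed average to the hourly value.
S_ratio = combine_join(S_avg, [S_S], lambda avg, s: avg / s)

print(S_S)      # {(0,): 2, (1,): 1, (8,): 1}
print(S_ratio)  # {(0,): 1.0, (1,): 1.5, (8,): 1.0}
```

Each intermediate result is itself a collection of region/value pairs over a single region set, which is what allows the operators to compose.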


Aggregation Workflows

A diagrammatic way to express multiple composite subset measure expressions
- Semantically equivalent to the algebra

Notation:
- Rectangles: region sets
- Ellipses: measures associated with the region sets
- Arcs: computational dependencies among measures


Example

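[Figure: example aggregation workflow. The region set (U:IP, t:hour) carries the measure Count = count(*); the region set (t:hour) carries the measure Savg = Avg(Count) with the match condition Count.t = Sbase.t. Callouts label the region set, measure name, aggregation formula, selection condition, and match condition.]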


Example (cont.)

[Figure: aggregation workflow on the region set (U:IP, t:hour) with the measures MAXS = max(s), MINS = min(s), and Ratio = MAXS/MINS.]


Multi-step Execution Plan

- Evaluation follows the topological order of the aggregation workflow
  - Materialize the non-dependent measures
  - Then evaluate the dependent measures, following the arcs of the aggregation workflow
  - May need to perform joins
- Problem
  - Intermediate measures: extra I/O
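The multi-step plan can be sketched as a topological traversal of the workflow in which every measure is materialized before its dependents are evaluated. The workflow below (Count, Max, Min, Ratio) and all function names are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical fact rows: (hour, source IP).
facts = [("h0", "A"), ("h0", "A"), ("h0", "B"), ("h1", "A")]

def count_per_hour_ip(facts, _inputs):
    out = {}
    for hour, ip in facts:
        out[(hour, ip)] = out.get((hour, ip), 0) + 1
    return out

def max_per_hour(_facts, inputs):
    out = {}
    for (hour, _ip), c in inputs["Count"].items():
        out[hour] = max(out.get(hour, c), c)
    return out

def min_per_hour(_facts, inputs):
    out = {}
    for (hour, _ip), c in inputs["Count"].items():
        out[hour] = min(out.get(hour, c), c)
    return out

def ratio(_facts, inputs):
    return {h: inputs["Max"][h] / inputs["Min"][h] for h in inputs["Max"]}

# Each measure: (measures it depends on, function computing it).
workflow = {
    "Count": ((), count_per_hour_ip),
    "Max": (("Count",), max_per_hour),
    "Min": (("Count",), min_per_hour),
    "Ratio": (("Max", "Min"), ratio),
}

def evaluate(workflow, facts):
    """Evaluate measures in topological order, materializing every intermediate result."""
    graph = {name: deps for name, (deps, _fn) in workflow.items()}
    materialized = {}
    for name in TopologicalSorter(graph).static_order():
        _deps, compute = workflow[name]
        materialized[name] = compute(facts, materialized)
    return materialized

result = evaluate(workflow, facts)
print(result["Ratio"])  # {'h0': 2.0, 'h1': 1.0}
```

In this sketch `materialized` lives in memory; in a database setting each intermediate measure would be written out and read back, which is the extra I/O the slide points to.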


Simple Scan Execution [*]

- Build one hash table for each measure
- “Insert” data into the hash tables of low-level measures
- Propagate the measures upwards after the scan is over
  - Distributive or algebraic aggregation functions
- Problem
  - Each hash table keeps all of its entries
  - Bottleneck: memory capacity

[*] T. Johnson and D. Chatziantoniou, Extending complex ad-hoc OLAP, in CIKM, 1999, 170-179.
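A minimal sketch of the simple scan idea follows, assuming count(*) measures (which are distributive, so parent counts are sums of child counts). The record layout, the 30-day "month" hierarchy, and the table names are made up for this example and are not taken from the cited work.

```python
from collections import Counter

# Hypothetical raw records: (day of year, source IP), in arbitrary order.
records = [(40, "A"), (1, "A"), (1, "B"), (2, "A"), (1, "A"), (35, "B")]

def month_of(day):
    """Toy hierarchy: treat every 30 days as one month."""
    return (day - 1) // 30 + 1

# One hash table per measure.
count_day_ip = Counter()     # low-level measure: count(*) per (day, IP)
count_day = Counter()        # higher-level measure: count(*) per day
count_month_ip = Counter()   # higher-level measure: count(*) per (month, IP)

# Single scan: raw data is inserted only into the low-level table.
for day, ip in records:
    count_day_ip[(day, ip)] += 1

# After the scan, propagate upwards by summing child counts.
for (day, ip), c in count_day_ip.items():
    count_day[day] += c
    count_month_ip[(month_of(day), ip)] += c

print(count_day[1])              # 3
print(count_month_ip[(2, "A")])  # 1
print(len(count_day_ip))         # 5 entries held until the scan ends
```

No entry of `count_day_ip` can be released before the scan finishes, because unsorted input may still contribute to any region.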


Sort/Scan Execution

- Simple scan requires large memory
  - For each hash table, all entries must be kept during the scan
- When the data is ordered
  - Some hash entries can be flushed before the scan is finished
  - The memory footprint can be reduced
  - One-pass scan becomes feasible
  - CPU cost is reduced
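The flushing idea can be sketched for a single count(*) measure: when the input is sorted on the grouping key, any entry whose key is smaller than the current record's key can no longer change and can be emitted. The function name and sample records are hypothetical, and the sketch also reports the peak number of live entries so the saving is visible.

```python
def sort_scan_count(sorted_records, key):
    """Count records per key over a stream sorted by `key`, flushing finished entries.

    Returns the flushed (key, count) pairs in output order and the peak table size.
    """
    table, output, peak = {}, [], 0
    for rec in sorted_records:
        k = key(rec)
        # The stream is sorted by `key`, so every entry with a smaller key is complete.
        for done in [e for e in table if e < k]:
            output.append((done, table.pop(done)))
        table[k] = table.get(k, 0) + 1
        peak = max(peak, len(table))
    output.extend(sorted(table.items()))  # flush what is left at the end of the scan
    return output, peak

# Hypothetical records: (month, source IP).
records = [(1, "A"), (1, "B"), (1, "A"), (2, "A"), (2, "B")]
output, peak = sort_scan_count(sorted(records), key=lambda r: r)

print(output)  # [((1, 'A'), 2), ((1, 'B'), 1), ((2, 'A'), 1), ((2, 'B'), 1)]
print(peak)    # 1
```

A simple scan over the same data would hold all four entries until the end; here at most one is live, and the flushed output is itself ordered by the sort key.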


Evaluation

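[Figure: raw data sorted by day and scanned across month 1 and month 2, feeding hash tables for the region sets (t:Day), (t:Day, U:IP), and (t:Month, U:IP); the measures shown are COUNT0, COUNT2, and COUNT3, each computed as count(*).]

The output stream of each hash table is still ordered!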


Evaluation

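[Figure: raw data sorted by month and scanned across month 1 and month 2, feeding hash tables for the region sets (t:Day), (t:Day, U:IP), and (t:Month, U:IP); the measures shown are COUNT0, COUNT2, and COUNT3, each computed as count(*).]

All the output streams are ordered by month!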


Evaluation

[Figure: raw data sorted by (t:month, U:IP) and scanned across month 1 and month 2, feeding the hash table for the region set (t:Month, U:IP) with the measure COUNT3 = count(*).]

By carefully choosing the sort order of the raw data, we can greatly reduce the memory footprint


Order and Slack

- Order
  - How the records are ordered in the stream
  - E.g., <t:day, U:IP>
- Slack
  - The gap between the output stream of the measure and the scan progress of the raw data
  - E.g., <t:day:[-3,+3]>
- We have developed a mechanism to
  - Calculate the order/slack
  - Take advantage of the order/slack information during evaluation
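One way to picture slack is a sibling-window measure over a day-ordered stream: the value for day d depends on days d-3 through d+3, so it can only be finalized once the scan has moved past day d+3, and inputs older than every pending window can be flushed. The sketch below is an illustrative reading of that idea, not the mechanism from the talk; the function name, the windowed sum, and the data are assumptions.

```python
def windowed_sums(daily_counts, slack=3):
    """daily_counts: iterable of (day, count) sorted by day.

    Yields (day, sum of counts over [day - slack, day + slack]) in day order.
    """
    window = {}   # live input entries: day -> count
    pending = []  # days whose window is not yet complete
    for day, count in daily_counts:
        window[day] = count
        pending.append(day)
        # A pending day d is complete once the scan has moved past d + slack.
        while pending and pending[0] + slack < day:
            d = pending.pop(0)
            yield d, sum(c for x, c in window.items() if d - slack <= x <= d + slack)
        # Inputs older than every pending window are no longer needed.
        oldest_needed = (pending[0] if pending else day) - slack
        for x in [x for x in window if x < oldest_needed]:
            del window[x]
    # End of scan: the remaining windows cannot grow any further.
    for d in pending:
        yield d, sum(c for x, c in window.items() if d - slack <= x <= d + slack)

stream = [(day, 1) for day in range(1, 11)]
result = list(windowed_sums(stream))
print(result[:4])  # [(1, 4), (2, 5), (3, 6), (4, 7)]
```

The output stays ordered by day but trails the scan by up to `slack` days, and the table never needs more than a window's worth of entries.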


Evaluation Network

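[Figure: evaluation network. A scan of the sorted data feeds hash tables for the measures M1, M2, M3, and M4. Annotated streams: order key <t:Day, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack t:[-1,+1]; order key <t:Hour, U:IP> with slack <>.]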


Optimization

- How to find a good sort order?
  - Enumerate all possible orders
  - For each order, estimate the memory usage
  - Use the sort orders with minimal usage
- Evaluation with multiple passes
  - Which measures to compute during each pass?
  - Which order to use in each pass?
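The enumerate-and-estimate loop can be sketched as follows. The memory estimate used here is a stand-in chosen for illustration, not the estimator from the talk: it sorts a small sample in each candidate order and counts how many hash entries would be open at once, treating an entry as live from the first to the last record of its region. The sample, the attribute names, and the two measures are hypothetical.

```python
from itertools import permutations

# Hypothetical sample of raw records and the group-by attributes of each measure.
sample = [
    {"month": m, "day": d, "ip": ip}
    for m in (1, 2) for d in (1, 2, 3) for ip in ("A", "B")
]
measures = [("month", "ip"), ("month", "day")]

def peak_live_entries(records, order, measures):
    """Peak number of hash entries open at once when scanning in the given sort order."""
    ordered = sorted(records, key=lambda r: tuple(r[a] for a in order))
    events = []
    for group_by in measures:
        first, last = {}, {}
        for pos, rec in enumerate(ordered):
            region = tuple(rec[a] for a in group_by)
            first.setdefault(region, pos)
            last[region] = pos
        # An entry opens at its region's first record and closes after its last one.
        events += [(first[r], 1) for r in first] + [(last[r] + 1, -1) for r in last]
    live = peak = 0
    for _pos, delta in sorted(events):  # at equal positions, closes sort before opens
        live += delta
        peak = max(peak, live)
    return peak

costs = {order: peak_live_entries(sample, order, measures)
         for order in permutations(("month", "day", "ip"))}
best = min(costs, key=costs.get)

print(best, costs[best])  # ('month', 'day', 'ip') 3
```

Enumerating all permutations is only practical for a handful of attributes; the point of the sketch is the shape of the search, not its scalability.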


Experiments

- 64 million records
- Synthetic data set
- Scenario 1
  - The measures of a region are computed by combining the aggregated measures of different kinds of child region sets
- Scenario 2
  - The measures of a region are computed by aggregating the measures of multiple chained siblings


Experimental Results (cont.)

[Figure: execution time (seconds) versus the number of dependent child measures (2 to 6), comparing DB and SortScan.]


Experimental Results

[Figure: execution time (seconds) versus the size of the sibling chain (2 to 7), comparing DB and SortScan.]


Conclusions

- Composite measures serve as building blocks for complicated analysis processes
- The algebra provides the semantic foundation
- The aggregation workflow offers an intuitive interface
- The sort/scan execution plan evaluates multiple dependent measures in the same run and hence improves evaluation performance