WILD PROJECT REVIEW WILD PROJECT REVIEW Efficient Allocation Algorithms For OLAP Over Imprecise Data Doug Burdick University of Wisconsin – Madison Raghu Ramakrishnan Yahoo! Research Prasad Deshpande IBM India Research Lab, SIRC Shivakumar Vaithyanathan IBM Almaden Research T.S. Jayram IBM Almaden Research Center

Efficient Allocation Algorithms For OLAP Over Imprecise Data

Page 1: Efficient Allocation Algorithms For OLAP Over Imprecise Data

WILD PROJECT REVIEW

Efficient Allocation Algorithms For OLAP Over Imprecise Data

Doug Burdick

University of Wisconsin – Madison

Raghu Ramakrishnan

Yahoo! Research

Prasad Deshpande

IBM India Research Lab, SIRC

Shivakumar Vaithyanathan

IBM Almaden Research Center

T.S. Jayram

IBM Almaden Research Center

Page 2: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Multidimensional Data

[Figure: two dimension hierarchies. LOCATION has levels State (MA, NY, TX, CA), Region (East, West) and ALL. AUTOMOBILE has levels Model (Civic, Camry, F150, Sierra), Category (Sedan, Truck) and ALL. Precise facts p1 to p4 are plotted in the State × Model grid.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200

Imprecise Data

[Figure: the same grid with fact p5 added.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

[BDJ+05] Burdick et al. OLAP Over Uncertain and Imprecise Data. In VLDB 2005.

Page 3: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Sources of Imprecision: Dimensions extracted from free text

Assume an extractor is given for Auto dimension values.

FactID  Text
p1      Brakes on F150…
p2      Rotors on the Sierra are…
p3      The F150 has…
p4      The Sierra is…
p5      cust’s Sierra is … but their F150 has …

Page 5: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Sources of Imprecision: Dimensions extracted from free text

Assume an extractor is given for Auto dimension values.

FactID  Text                                      Auto
p1      Brakes on F150…                           F150
p2      Rotors on the Sierra are…                 Sierra
p3      The F150 has…                             F150
p4      The Sierra is…                            Sierra
p5      cust’s Sierra is … but their F150 has …   {Sierra,F150}

More details for dimensions extracted from text are in [BDJ+06] Burdick et al. OLAP Over Uncertain and Imprecise Data. To appear in VLDB Journal.

Page 6: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Sources of Imprecision: Dimensions extracted from free text

Assume an extractor is given for Auto dimension values.

FactID  Text
p1      Brakes on F150…
p2      Rotors on the Sierra are…
p3      The F150 has…
p4      The Sierra is…
p5      cust’s Truck has…

Page 8: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Sources of Imprecision: Dimensions extracted from free text

Assume an extractor is given for Auto dimension values.

FactID  Text                         Auto
p1      Brakes on F150…              F150
p2      Rotors on the Sierra are…    Sierra
p3      The F150 has…                F150
p4      The Sierra is…               Sierra
p5      cust’s Truck has…            Truck

Page 9: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Sources of Imprecision: Data Integration

• Fact table constructed by integrating multiple data sources
• Different sources record the same dimension attribute at different granularities

Call Center

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200

Mailing List

FactID  Auto    Loc  Repair
p5      Truck   NY   100

[Figure: AUTOMOBILE hierarchy with levels Model (Civic, Camry, F150, Sierra), Category (Sedan, Truck) and ALL.]

Integrated fact table

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Page 10: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Imprecision In Real Data

• Obtained a real-world dataset from an auto manufacturer
• Fact table entries come from several source relations
• Integrated fact table contained 798,570 facts

Real data has many imprecise facts

Page 11: Efficient Allocation Algorithms For OLAP Over Imprecise Data

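Querying Imprecise Facts

[Figure: facts p1 to p4 plotted in the State (MA, NY) × Model (F150, Sierra) grid, under Region East and Category Truck, together with imprecise fact p5.]

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Query: Auto = F150, Loc = MA, SUM(Repair) = ???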

Page 12: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Solution: Allocation

• Intuitively: replace each imprecise fact r with a set of precise facts, one for each possible completion of r
• Each completion is assigned an allocation weight
• The resulting fact table is referred to as the Extended Database (EDB)

Queries operate over this Extended Database
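The replacement step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CATEGORY_TO_MODELS` mapping, the `Fact` class and the `build_edb` helper are hypothetical names, only the Auto dimension is treated as imprecise, and weights are split uniformly across completions (other allocation policies would assign different weights).

```python
from dataclasses import dataclass

# Hypothetical hierarchy for illustration: Category -> Models.
CATEGORY_TO_MODELS = {"Truck": ["F150", "Sierra"], "Sedan": ["Civic", "Camry"]}


@dataclass(frozen=True)
class Fact:
    fact_id: str
    auto: str
    loc: str
    repair: float


def completions(fact):
    """Precise Auto values that a fact could stand for (itself if already precise)."""
    return CATEGORY_TO_MODELS.get(fact.auto, [fact.auto])


def build_edb(facts):
    """Expand facts into EDB rows (fact_id, auto, loc, repair, weight).

    Assumption: uniform weights over the completions of each fact.
    """
    edb = []
    for fact in facts:
        models = completions(fact)
        weight = 1.0 / len(models)
        for model in models:
            edb.append((fact.fact_id, model, fact.loc, fact.repair, weight))
    return edb


FACTS = [
    Fact("p1", "F150", "NY", 100),
    Fact("p2", "Sierra", "NY", 500),
    Fact("p3", "F150", "MA", 100),
    Fact("p4", "Sierra", "MA", 200),
    Fact("p5", "Truck", "MA", 100),  # imprecise in the Auto dimension
]
EDB = build_edb(FACTS)
```

Precise facts keep weight 1.0, while p5 becomes two rows whose weights sum to 1.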

Page 13: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Handle Imprecision With Allocation

[Figure: the State × Model grid with precise facts p1 to p4; imprecise fact p5 is replaced by one copy in each cell it may complete to.]

Original fact table:

FactID  Auto    Loc  Repair
p1      F150    NY   100
p2      Sierra  NY   500
p3      F150    MA   100
p4      Sierra  MA   200
p5      Truck   MA   100

Extended Database:

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5

Page 15: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Querying The Extended Database

[Figure: the State × Model grid with precise facts p1 to p4 and the two allocated copies of p5.]

Query: Auto = F150, Loc = MA, SUM(Repair) = 150

ID  FactID  Auto    Loc  Repair  Weight
1   p1      F150    NY   100     1.0
2   p2      Sierra  NY   500     1.0
3   p3      F150    MA   100     1.0
4   p4      Sierra  MA   200     1.0
5   p5      F150    MA   100     0.5
6   p5      Sierra  MA   100     0.5

The procedure for assigning allocation weights is referred to as an allocation policy.
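The answer of 150 is consistent with summing Repair × Weight over the matching EDB rows: 100 × 1.0 from p3 plus 100 × 0.5 from the F150 copy of p5. A minimal sketch of that reading follows; the function name `weighted_sum` is hypothetical, and the weighted-sum semantics is an assumption inferred from the slide's numbers.

```python
# EDB rows copied from the slide: (fact_id, auto, loc, repair, weight).
EDB = [
    ("p1", "F150", "NY", 100, 1.0),
    ("p2", "Sierra", "NY", 500, 1.0),
    ("p3", "F150", "MA", 100, 1.0),
    ("p4", "Sierra", "MA", 200, 1.0),
    ("p5", "F150", "MA", 100, 0.5),
    ("p5", "Sierra", "MA", 100, 0.5),
]


def weighted_sum(edb, auto, loc):
    """SUM(Repair) over one cell, scaling each row by its allocation weight."""
    return sum(
        repair * weight
        for _, row_auto, row_loc, repair, weight in edb
        if row_auto == auto and row_loc == loc
    )


f150_ma = weighted_sum(EDB, "F150", "MA")      # 100 * 1.0 + 100 * 0.5
sierra_ma = weighted_sum(EDB, "Sierra", "MA")  # 200 * 1.0 + 100 * 0.5
```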

Page 16: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Contributions

• Propose a generalized template for the allocation policies presented in [BDJ+05]
• Present an operational framework for allocation
    • Allocation graph formalism
    • Used to derive the Independent, Block and Transitive algorithms
• Propose the Extended Database Maintenance Algorithm
    • Updates the EDB to reflect changes to the given fact table
• Experimental evaluation

Page 17: Efficient Allocation Algorithms For OLAP Over Imprecise Data

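Allocation Policy Template

[Figure: imprecise fact r overlaps cells c1 and c2 in the State × Model grid.]

$p_{c,r} = \frac{Q(c)}{Qsum(r)}$, where $Qsum(r) = \sum_{c' \in region(r)} Q(c')$

For the fact r in the figure:

$p_{c1,r} = \frac{Q(c1)}{Q(c1) + Q(c2)}$

$p_{c2,r} = \frac{Q(c2)}{Q(c1) + Q(c2)}$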
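As a worked instance of the template, the sketch below takes Q(c) to be the number of precise facts in cell c. That choice is an assumption made for illustration; the template itself leaves Q open. The helper name `allocation_weights` and the uniform fallback for an empty region are likewise hypothetical.

```python
from collections import Counter


def allocation_weights(region, q):
    """Compute p_{c,r} = Q(c) / Qsum(r) for every cell c in region(r)."""
    qsum = sum(q[c] for c in region)
    if qsum == 0:
        # Assumption: with no mass in any cell, fall back to uniform weights.
        return {c: 1.0 / len(region) for c in region}
    return {c: q[c] / qsum for c in region}


# Q(c) as the count of precise facts per cell (p1..p4 from the running example).
Q = Counter([("NY", "F150"), ("NY", "Sierra"), ("MA", "F150"), ("MA", "Sierra")])

# The region of p5 = <MA, Truck> covers the two Truck cells in MA.
weights_p5 = allocation_weights([("MA", "F150"), ("MA", "Sierra")], Q)

# A skewed case: Q(c1) = 2 and Q(c2) = 1 give weights 2/3 and 1/3.
weights_skewed = allocation_weights(["c1", "c2"], {"c1": 2, "c2": 1})
```

With one precise fact in each MA cell, p5 is split 0.5 / 0.5, matching the weights in the Extended Database shown earlier.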

Page 18: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Interactions between overlapping facts

[Figure: precise facts p1, p2, p4 and p5 plotted in the State × Model grid, with overlapping imprecise facts p6 and p7.]

• Allocation weights for imprecise fact p6 depend on the allocation weights for fact p7 (and vice versa)
• Would like the assigned weights to capture these interactions
• Idea: repeatedly allocate p6 and p7 until the allocation weights converge

Page 19: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Iterative Allocation Policies

1) Initialize Q^0(c) for each cell c (using precise facts)

2) For each iteration t, until all Q^t(c) have converged:
     For each cell c
       For each imprecise fact r overlapping c, add r's share of c to Q^t(c):

       $Q^{t}(c) = Q^{0}(c) + \sum_{r \,:\, c \in region(r)} \frac{Q^{t-1}(c)}{Qsum^{t-1}(r)}$

       where $Qsum^{t}(r) = \sum_{c' \in region(r)} Q^{t}(c')$

3) For each imprecise fact r
     For each cell c in region(r)

       $p_{c,r} = \frac{Q^{t}(c)}{Qsum^{t}(r)}$
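The three steps can be sketched as follows. This is a hedged reading, not the authors' implementation: it assumes a count-style update in which each imprecise fact contributes one unit of mass per iteration, split in proportion to the previous iteration's Q values, and it restarts each iteration from the precise-fact values Q^0. The names `iterative_allocation`, `eps` and `max_iters`, and the facts and cells in the example, are hypothetical.

```python
def iterative_allocation(q0, regions, eps=1e-6, max_iters=100):
    """Iterate Q until convergence, then return weights keyed by (fact, cell).

    q0:      dict cell -> Q^0(c), computed from precise facts
    regions: dict imprecise fact id -> list of cells in region(r)
    """
    q_prev = dict(q0)
    for _ in range(max_iters):
        # Qsum(r) over the previous iteration's values.
        qsum = {r: sum(q_prev[c] for c in cells) for r, cells in regions.items()}
        q_next = dict(q0)  # assumption: each iteration restarts from Q^0
        for r, cells in regions.items():
            if qsum[r] == 0:
                continue  # no mass to apportion for this fact
            for c in cells:
                q_next[c] += q_prev[c] / qsum[r]
        converged = all(abs(q_next[c] - q_prev[c]) <= eps for c in q0)
        q_prev = q_next
        if converged:
            break

    weights = {}
    for r, cells in regions.items():
        total = sum(q_prev[c] for c in cells)
        for c in cells:
            # Assumption: uniform fallback when the region carries no mass.
            weights[(r, c)] = q_prev[c] / total if total > 0 else 1.0 / len(cells)
    return weights


# Hypothetical example: two overlapping imprecise facts sharing cell B.
A, B, C = ("MA", "F150"), ("MA", "Sierra"), ("NY", "Sierra")
W = iterative_allocation({A: 1.0, B: 1.0, C: 1.0}, {"p6": [A, B], "p7": [B, C]})
```

In this example the shared cell B attracts mass from both facts, so each fact ends up weighting B more heavily than its other cell, which is the interaction a single-pass allocation would miss.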

Page 20: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Benefits of Iterative Allocation

• Imprecise facts can be allocated in any order, and the same allocation weights are obtained
    • Leverage this idea to obtain scalable allocation algorithms
• Leads to an Expectation Maximization (EM) framework for allocation
    • Final allocation weights have pleasing mathematical properties
    • See [BDJ+05] for details

Page 21: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Allocation Graph

[Figure: bipartite allocation graph. One side holds the imprecise facts (here <MA,Truck>); the other holds the precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150) and Cell(MA,Sierra). Beside it, the State × Model grid shows precise facts p1 to p4, imprecise facts p5 and p6, and cells c1 and c2.]
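One way to picture the graph is as a mapping from each imprecise fact to the precise cells it overlaps. The sketch below assumes exactly that edge rule (an edge joins an imprecise fact to every cell in its region). The `PARENT` table, the placement of TX in West, the second fact `p6 = <East, Sierra>` and the function names are hypothetical, chosen only to make the example self-contained.

```python
# Hypothetical child -> parent links for both dimensions; "ALL" sits above everything.
PARENT = {
    "F150": "Truck", "Sierra": "Truck", "Civic": "Sedan", "Camry": "Sedan",
    "MA": "East", "NY": "East", "CA": "West", "TX": "West",
}


def covers(value, leaf):
    """True if hierarchy node `value` equals `leaf` or is one of its ancestors."""
    node = leaf
    while node is not None:
        if node == value:
            return True
        node = PARENT.get(node)  # None once we pass the top explicit level
    return value == "ALL"


def build_allocation_graph(imprecise, cells):
    """Map each imprecise fact id to the list of precise cells in its region."""
    return {
        fact_id: [c for c in cells if covers(loc, c[0]) and covers(auto, c[1])]
        for fact_id, (loc, auto) in imprecise.items()
    }


CELLS = [("NY", "F150"), ("NY", "Sierra"), ("MA", "F150"), ("MA", "Sierra")]
IMPRECISE = {
    "p5": ("MA", "Truck"),     # from the slide
    "p6": ("East", "Sierra"),  # hypothetical second imprecise fact
}
GRAPH = build_allocation_graph(IMPRECISE, CELLS)
```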

Page 22: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Processing With Allocation Graph

[Figure: the allocation graph between imprecise fact <MA,Truck> and the precise cells Cell(NY,F150), Cell(NY,Sierra), Cell(MA,F150) and Cell(MA,Sierra), next to the State × Model grid in which p5 is allocated to cells c1 and c2. Numeric labels in the figure: 1, 2, 3, 2/3 and 1/3.]

Initialize Q^0(c) for each cell c

$Qsum^{t}(r) = \sum_{c' \in region(r)} Q^{t}(c')$

$p_{c,r} = \frac{Q^{t}(c)}{Qsum^{t}(r)}$

Page 23: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Efficient Allocation Algorithms

• Independent Algorithm
    • Requires multiple sorts of precise cells for each iteration
    • Optimizations based on re-using each sort as much as possible
• Block Algorithm
    • Reduces the number of required sorts for precise cells to 1
    • Optimizations based on increasing buffer utilization

Page 24: Efficient Allocation Algorithms For OLAP Over Imprecise Data

[Figure: allocation graph for a larger example.]

Imprecise facts, grouped by level:

S1: <State,Category>    p6 <MA,Sedan>, p7 <MA,Truck>
S2: <State,ALL>         p8 <CA,ALL>
S3: <Region,Category>   p9 <East,Truck>, p10 <West,Sedan>
S4: <ALL,Model>         p11 <ALL,Civic>, p12 <ALL,Sierra>
S5: <Region,Model>      p13 <West,Civic>, p14 <West,Sierra>

Precise cells:

<MA,Civic>   p1
<MA,Sierra>  p2
<NY,F150>    p3
<CA,Civic>   p4
<CA,Sierra>  p5

Page 25: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Iteration aware allocation

• Optimizations for Independent and Block reduce the work for a single iteration
• Problem: each iteration of allocation is still expensive
    • Involves multiple scans of the entire fact table
    • Not feasible for real data warehouses!
• Can we do better?

Page 26: Efficient Allocation Algorithms For OLAP Over Imprecise Data

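Required Data For Allocating A Fact

[Figure: the allocation graph of the larger example, with imprecise facts p6 to p14 on one side and precise cells c1 to c5 on the other.]

Imprecise facts: p6 <MA,Sedan>, p7 <MA,Truck>, p8 <CA,ALL>, p9 <East,Truck>, p10 <West,Sedan>, p11 <ALL,Civic>, p12 <ALL,Sierra>, p13 <West,Civic>, p14 <West,Sierra>

Precise cells: c1 <MA,Civic>, c2 <MA,Sierra>, c3 <NY,F150>, c4 <CA,Civic>, c5 <CA,Sierra>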

Page 27: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Required Data For Allocating A Fact

[Figure: the allocation graph drawn as two separate groups of nodes.]

Group 1: imprecise facts p6 <MA,Sedan>, p8 <CA,ALL>, p10 <West,Sedan>, p11 <ALL,Civic>, p13 <West,Civic>, p14 <West,Sierra>; cells c1 <MA,Civic>, c4 <CA,Civic>, c5 <CA,Sierra>

Group 2: imprecise facts p7 <MA,Truck>, p9 <East,Truck>, p12 <ALL,Sierra>; cells c2 <MA,Sierra>, c3 <NY,F150>

Connected components in the allocation graph can be processed independently
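Finding the components is a standard graph problem. The sketch below uses union-find over a fact-to-cells mapping; it is a swapped-in textbook technique for illustration, not necessarily how the authors identify components on disk-resident data. The function name and the small example graph (facts r1 to r3, cells a to d) are hypothetical.

```python
def connected_components(graph):
    """Split an allocation graph into components.

    graph:   dict imprecise fact id -> list of cells it overlaps
    returns: list of (set of fact ids, set of cells), one pair per component
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Tag nodes so a fact id can never collide with a cell id.
    for fact, cells in graph.items():
        find(("fact", fact))
        for cell in cells:
            union(("fact", fact), ("cell", cell))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)

    components = []
    for nodes in groups.values():
        facts = {name for kind, name in nodes if kind == "fact"}
        cells = {name for kind, name in nodes if kind == "cell"}
        components.append((facts, cells))
    return components


# Hypothetical graph: r1 and r2 share cell b; r3 stands alone.
COMPONENTS = connected_components({"r1": ["a", "b"], "r2": ["b", "c"], "r3": ["d"]})
```

Because no edge crosses between components, the iterative allocation for the facts in one component never reads Q values from another, which is what allows each component to be processed on its own.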

Page 28: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Transitive Algorithm

• Transitive Algorithm has two steps:
    1) Connected component identification step
    2) Process each connected component
        • Read the component into memory
        • Perform all iterations of allocation for the facts in the component
• If each component fits into memory, then the I/O required by Transitive is independent of the number of iterations!
    • Components larger than the buffer are processed using the Block algorithm
    • In real datasets, all components were memory resident
• Use concepts from the Transitive Algorithm to develop the EDB Maintenance Algorithm

Page 29: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Experimental Setup

• Algorithms evaluated on several datasets
    • Real-world dataset: 798K facts, 4 dimensions
    • Used several synthetic datasets
        • Vary the level of imprecision in the data: percentage of imprecise facts, severity of imprecision
        • Scalability (up to 5 million tuples)
• Important parameter: ratio of input table size to available memory
    • Memory limited to a restricted buffer pool

Page 30: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Experiment 1a: Memory Resident

[Plot, Real Dataset: Time (sec), axis 0 to 300, against Iterations (until converged), ticks 1, 3, 5, 7, for the Independent, Block and Transitive algorithms.]

Page 31: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Experiment: Memory Resident (2)

[Plot, Synthetic Dataset (more imprecision): Time (sec), axis 0 to 500, against Iterations (until converged), ticks 0, 5, 10, for the Independent, Block and Transitive algorithms.]

Page 32: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Experiment: Algorithm Scalability

ε = 0.1 (3 iterations)

[Plot: Time (sec), axis 0 to 1400, against Buffer Size (600KB, 1MB, 6MB, 12MB), for the Independent, Block and Transitive algorithms.]

Page 33: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Experiment 1b: Algorithm Scalability

ε = 0.005 (10 iterations)

[Plot: Time (sec), axis 0 to 7000, against Buffer Size (600KB, 1MB, 6MB, 12MB), for the Independent, Block and Transitive algorithms.]

Page 34: Efficient Allocation Algorithms For OLAP Over Imprecise Data

Conclusions

• Imprecision is a compelling real-world problem
    • Propose allocation as a solution
• Allocation graph formalism
    • Basis for 3 scalable allocation algorithms: Independent, Block, Transitive
• Transitive algorithm is quite intriguing
    • Performance is stable as the number of iterations increases
    • The connected components it identifies can be used in the proposed EDB maintenance algorithm