53
ETL Workflows: From Formal Specification to Optimization Timos Sellis National Technical University of Athens (joint work with Alkis Simitsis, IBM Almaden Research Center, Panos Vassiliadis, Univ. of Ioannina and Dimitris Skoutas, NTUA)

Formal Specification and Optimization of ETL Workflows

  • Upload
    doanh

  • View
    231

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Formal Specification and Optimization of ETL Workflows

ETL Workflows: From Formal Specification to Optimization

Timos SellisNational Technical University of Athens

(joint work with Alkis Simitsis, IBM Almaden Research Center, Panos Vassiliadis, Univ. of Ioannina and Dimitris Skoutas, NTUA)

Page 2: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 2

Data Warehouse Environment

Page 3: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 3

Extract-Transform-Load (ETL)

Sources

Extract Transform & Clean

DW

Load

DSA

Page 4: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 4

Motivation ETL and Data Cleaning tools cost

30% of effort and expenses in the budget of the DW 55% of the total costs of DW runtime 80% of the development time in a DW project

ETL market: a multi-million market IBM paid $1.1 billion dollars for Ascential

ETL tools in the market software packages in-house development

No standard, no common model most vendors implement a core set of operators and provide GUI to

create a data flow

Page 5: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 5

Problems The key factors underlying the main problems of

ETL processes are: vastness of the data volumes quality problems, since data is not always clean and has

to be cleansed performance, since the whole process has to take place

within a specific time window evolution of the sources and the data warehouse can

eventually lead, even to daily maintenance operations

Page 6: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 7

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Quality Metrics (will not touch this)

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 7: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 8

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Quality Metrics (will not touch this)

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 8: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 9

Conceptual Model Design goals

we need a simple model, sufficient for the early stages of the data warehouse design

we need convenient means of communication among different groups of people involved in the DW project (e.g., dba’s and business managers)

we need to be able to model what our sources “talk” about

Semantic goals we need richer semantics to

describe sources reason about them

Page 9: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 10

Conceptual Model

S1.PARTSUPPS2.PARTSUPP DW.PARTSUPP

American toEuropean Date$ € Date = SysDate()

SK

f

SUM(S2.Cost)

SUM(S2.Qty)

S2.Date

S2.PKey

S2.SuppKey

f NN

f

SK

PK

Cost

Qty

Date

PKey

SuppKey

Cost

Qty

PKey

SuppKey

Cost

Date

Qty

PKey

SuppKey

Annual PartSupp’s

RecentPartSupp’s

{XOR}

Due to acccuracy and small size

(< update window)

Necessary providers:S1 and S2

{Duration<4h}

U

γ

PS1

Qty

PKey

SuppKey

PS2

Cost

PKey

SuppKey

+

PS1.Pkey+=PS2.PKeyPS1.SuppKey+=PS2.SuppKey

PS1.Dept+=PS2.Dept

Dept

Dept

Dept

Page 10: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 11

Conceptual Model Key idea

an ontology-based approach to facilitate the conceptual design An ontology

is a “formal, explicit specification of a shared conceptualization” describes the knowledge in a domain in terms of classes, properties,

and relationships between them machine processable formal semantics reasoning mechanisms

Method construct an appropriate application vocabulary annotate the data sources generate the application ontology apply reasoning techniques to select relevant sources and identify

required transformations

Page 11: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 12

Paris/Rome/Athens

Conceptual Model

Above 200 Euros

prices in Dollars

From 500 to 1500 Euros

software/hardware

only software products

Paris/Rome/Athens

Rome/Athens

software

hardware

Page 12: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 13

Conceptual Model

Application vocabularyVC = {product, store}

VPproduct = {pid, pName, quantity, price, type, storage}

VPstore = {sid, sName, city, street}

VFpid = {source_pid, dw_pid}

VFsid = {source_sid, dw_sid}

VFprice = {dollars, euros}

VTtype = {software, hardware}

VTcity = {paris, rome, athens}

Datastore mappings

Datastore annotation

Page 13: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 14

Conceptual Model The class hierarchy Definition for class

DS1_Products

Page 14: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 15

Conceptual Model Reasoning on the mappings s(location, city, street)c(street, number, street)

Page 15: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 16

Conceptual Model Reasoning on the definitions

f(pid, Source_Pid, DW_Pid)σ(price, From_500_To_1500)nn(quantity)σ(type, {software})

Page 16: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 17

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Quality Metrics (will not touch this)

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 17: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 18

Logical Model

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

DIFF2

DS.PSNEW2.PKEY,DS.PSOLD2.PKEY

DS.PSNEW2

DS.PSOLD2

DW.PARTS Aggregate1

PKEY, DAYMIN(COST)

Aggregate2

PKEY, MONTHAVG(COST)

V2

V1

TIME

DW.PARTSUPP.DATE,DAY

FTP2S2.PARTS

S1.PARTS FTP1

DS.PSNEW1

DIFF1

DS.PSOLD1

DS.PSNEW1.PKEY,DS.PSOLD1.PKEY

SourcesDW

DSA

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

Page 18: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 19

Logical Model Main question:

What information should we put inside a metadata repository to be able to answer questions like: what is the architecture of my DW back stage? which attributes/tables are involved in the population of

an attribute? what part of the scenario is affected if we delete an

attribute?

Page 19: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 20

Architecture Graph

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate U

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

DIFF2

DS.PSNEW2.PKEY,DS.PSOLD2.PKEY

DS.PSNEW2

DS.PSOLD2

DW.PARTS Aggregate1

PKEY, DAYMIN(COST)

Aggregate2

PKEY, MONTHAVG(COST)

V2

V1

TIME

DW.PARTSUPP.DATE,DAY

FTP2S2.PARTS

S1.PARTS FTP1

DS.PSNEW1

DIFF1

DS.PSOLD1

DS.PSNEW1.PKEY,DS.PSOLD1.PKEY

SourcesDW

DSA

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

DS.PS2

Log

rejected

Log

rejected

Page 20: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 21

Architecture GraphExample

Add_Attr2 SK2DS.PS2 TMP_STOR.PARTSUPP

PKEY

QTY

COST

PKEY

QTY

COST

PKEY

QTY

COST

DATE

PKEY

QTY

COST

DATE

PKEY

QTY

COST

DATE

PKEY

QTY

COST

DATEDATE DATE

SOURCE SOURCE SOURCE SOURCE

SKEYAddConst2

in out

1

LOOKUP2

PKEY

SOURCE

SKEY

PKEY

SOURCE

LPKEY

LSOURCE

LSKEY

IN IN INOUT OUT OUT

OUT

PAR PAR

SUPPKEY SUPPKEY SUPPKEY SUPPKEY SUPPKEYSUPPKEY

2

Page 21: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 22

Architecture Graph

SK2

PKEY

QTY

COST

DATE

PKEY

QTY

COST

DATE

SOURCE SOURCE

SKEY

PKEY

SOURCE

LPKEY

LSOURCE

LSKEY

IN OUT

PAR

SUPPKEY SUPPKEY

input schema

output schema

functionality schema

projected-out schema

generated schema

Add_Attr2DS.PS2 TMP_STOR.PARTSUPP

PKEY

QTY

COST

PKEY

QTY

COST

PKEY

QTY

COST

DATE

PKEY

QTY

COST

DATEDATE DATE

SOURCE SOURCE

AddConst2

in out

1

LOOKUP2

PKEY

SOURCE

SKEY

IN INOUT OUT

OUT

PAR

SUPPKEY SUPPKEY SUPPKEY SUPPKEY

Example

2

Page 22: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 23

Semantics We provide graph modeling techniques for several

kinds of activities: update (INS, UPD, DEL) activities aggregates rules employing negation and aliases functions

Page 23: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 24

Logical Model Question revisited

What information should we put inside a metadata repository to be able to answer questions like: what is the architecture of my DW back stage?

it is described as the Architecture Graph which attributes/tables are involved in the population of

an attribute? what part of the scenario is affected if we delete an

attribute? follow the appropriate path in the Architecture Graph

Page 24: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 25

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Metrics

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 25: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 26

Conceptual to Logical Similarities

concepts and attributes recordsets and attributes part-of and provider relationships

DW.PARTS

PKey

Cost

Qty

Date

DW.PARTS

Qty

Cost

PKey

Date

... ...

Page 26: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 27

Conceptual to Logical Discrepancies

PKey PKeySK

S1 DW

R01: sk.a_in1(pkey <-S1(pkey).

R02: sk.a_in2(pkey,source,skey)<-lookUp(l_pkey,source,l_skey),pkey=l_pkey, skey=l_skey, source=1.

R03: sk.a_out(pkey, skey)<-add_sk1.a_in1(pkey),add_sk1.a_in2(pkey,source,skey).

R04: S2(skey)<-sk.a_out(pkey,skey).

R01: sk.a_in1(pkey <-S1(pkey).

R02: sk.a_in2(pkey,source,skey)<-lookUp(l_pkey,source,l_skey),pkey=l_pkey, skey=l_skey, source=1.

R03: sk.a_out(pkey, skey)<-add_sk1.a_in1(pkey),add_sk1.a_in2(pkey,source,skey).

R04: S2(skey)<-sk.a_out(pkey,skey).

Operational Semantics of SK :

SK

PKeyPKey

S1

PKey

S2

SKey

LookUp

l_PKey

l_SKey

Source

SKey

PKey

SKey

Source

1

=

=

IN2

Program

SK_in1

R03

SK_in2

SK_out

R01 R04

headhead

head

R02

head

Transformationsvs.

Activities

Page 27: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 28

Conceptual to Logical Correctness

our methodology is a semi-automatic procedure the designer should examine, complement or change the outcome of

this methodology Optimization

the constraints in the determination of the execution order require only that the placement of activities should be semantically correct

the logical workflow produced by this methodology is not the only possible logical workflow

the final choice of an ETL workflow depends on other parameters, e.g., the quality of the designquality of the design of an ETL workflow the execution costexecution cost of an ETL workflow

Page 28: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 29

SK

S2.PARTSUPP DW.PARTSUPP

γ

f1

S2.Date

f2Cost

Qty

Date

PKey

SuppKey

Cost

Date

Qty

PKey

SuppKey

PK

Conceptual to Logical Execution order…

which is the proper execution order?

Page 29: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 30

Conceptual to Logical

γ DW.PARTSUPPf2SKS2.PART

SUPP f1 PK

Execution order…

order equivalence?

SK,f1,f2 or SK,f2,f1 or ... ?

Page 30: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 31

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Metrics

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 31: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 32

Optimization of ETL Workflows Common settlement

ad-hoc optimization based on the experience of the designer execute ETL workflow as it is; hopefully, the optimizer of the DBMS

would improve the performance An ETL workflow is NOT a big query

ETL is very procedural in nature, each ETL operator is a low-level operator in procedural languages like PL/SQL

Traditional query optimization techniques are not enough existence of functions

where it is allowed to push an activity before/after a function? existence of black-box activities

unknown semantics can not interfere in their interior

naming conflicts

Page 32: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 33

Optimization of ETL Workflows How can we improve an ETL workflow in terms of execution

time? We model the ETL processes optimization problem as a state

search problem we consider each ETL workflow as a state we construct the search space the optimal state is chosen according to our cost model’s criteria, in

order to minimize the execution time of an ETL workflow

Page 33: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 34

Optimization of ETL Workflows Transition from one state to the other

SWA: interchange two activities of the workflow FAC: replace homologous tasks in parallel flows with an equivalent

task over a flow to which these parallel flows converge DIS: divide tasks of a joint flow to clones applied to parallel flows

that converge towards the joint flow MER / SPL: merge / split group of activities

a2

SWA(a1,a2) SWA(a2,a1)

a2 a1

a1

FAC(ab,a1,a2) DIS(ab,a)

aab

ab

a1

a2

a,a1,a2: homologous activities

a2a1

SPL(a1+2,a1,a2)

a1+2

MER(a1+2,a1,a2)

Page 34: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 35

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

DIFF2

DS.PSNEW2.PKEY,DS.PSOLD2.PKEY

DS.PSNEW2

DS.PSOLD2

DW.PARTS Aggregate1

PKEY, DAYMIN(COST)

Aggregate2

PKEY, MONTHAVG(COST)

V2

V1

TIME

DW.PARTSUPP.DATE,DAY

FTP2S2.PARTS

S1.PARTS FTP1

DS.PSNEW1

DIFF1

DS.PSOLD1

DS.PSNEW1.PKEY,DS.PSOLD1.PKEY

SourcesDW

DSA

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

Page 35: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 36

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

Page 36: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 37

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

Page 37: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 38

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

Page 38: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 39

Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s

in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

DS.PS1

DS.PS2 AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

ULog

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

$2€ A2EDate

Adate

Edate

$

Adate

Adate

in out in out

AddAttr SK2

pkey

source

skey

source

pkey pkey

source

in out in out

skey

Page 39: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 40

Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s

in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

DS.PS1

DS.PS2 AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

ULog

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

A2EDate $2€

Edate

Adate

Adate

$

Adate

in out in out

AddAttr SK2

pkey

source

skey

source

pkey pkey

source

in out in out

skey

??

Page 40: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 41

Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s

in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

DS.PS1

DS.PS2 AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

ULog

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

AddAttr SK2

pkey

source

skey

source

pkey pkey

source

in out in out

skey

x

A2EDate $2€

$

Edate

Εdate

$

Adate

$

Εdate

in out in out

Page 41: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 42

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

SK2

U

SK1

Page 42: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 43

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

SK1,2U

Page 43: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 44

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

UPKU

Page 44: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 45

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

PK2

U

PK1

Page 45: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 46

Optimization of ETL Workflows

AddAttr2

SOURCE

SK2

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SOURCE

$2€

COST DATE

DS.PS1 SK1

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SOURCECOST DATE=SYSDATE

AddDate

U

DS.PS2

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

Log

rejected

γ

QTY,COST

PK1

PKEY,DATE

Log

rejected

DW.PARTS

Pre-Processing Phase I: SWA’s in local groups

Phase II: FAC’s

homologousPhase III: DIS’s

Phase IV: SWA’s in local

groups

Post-Processing

PK2

PKEY,DATE

Log

rejected

Page 46: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 47

Optimization of ETL Workflows Algorithms

exhaustive (1) heuristic (2) greedy (3)

in phases I & IV perform only those transitions that drive to a better state

Results algorithms (2) and (3) improve the performance of ETL workflows

over 70% (avg) during a satisfactory for DW’s period of time (in a time range of sec..10min)

Page 47: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 48

Optimization of ETL Workflows Physical optimization

several physical operators implement the same semantic operation e.g., a logical join can be performed in more than one way:

nested loops, sort-merge join, hash-join Solution

employ different alternatives for the physical execution of each logical-level activity

take into consideration the effects of possible system failures to the workflow operation the introduction of sorter activities in the physical design

Algorithms exhaustive search of the search space state-space pruning based on experimental observations and the benefits of

sorter introduction given by the formula:

Page 48: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 49

Outline Conceptual Model Logical Model

Architecture Graph Operational Semantics Metrics

Conceptual to Logical Optimization of ETL Workflows Research Challenges

Page 49: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 50

Research Challenges A unified way to represent ETL processes

an algebra or a declarative language a benchmark for ETL processes

useful for evaluating optimization techniques, implementation algorithms for specific ETL activities, and so on(a first attempt was presented in QDB’07, in conj. w/ VLDB’07)

Extension of the ETL mechanisms for non-traditional data XML/HTML, spatial, biomedical data, …

Apply the techniques described to similar environments e.g., Active DWs

meet the high demand of applications for up-to-date information are refreshed on-line and achieve a higher consistency between the

stored information and the latest data updates

Page 50: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 51

Real time ETL

Real-time Stream of S1 updates DS

Relation R

Source Relation

S

Join module

Load shedder

(Active) ETL activities for regular DW load

...

...

DSA

DW

Active ETL workflow for approximate, real-time reporting

Off-line synchronization

IntroductoryStage

GrowthStage

MaturityStage Decline Stage

TotalMarketSales

Time

10090

8070

6050

4030

4050

Real-time DW refreshment

Offline update of reports

DW refreshmentData to

be loaded

Page 51: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 52

Research Challenges Take into consideration

privacy issues, physical constraints Evolution of ETL processes

changes may occur in the sources, DW, business requirements, … (a first attempt was presented in DaWaK’07)

Can it scale? think of new models for the case of large distributed environments

with many sources, e.g. P2P can the techniques scale? can they adapt to the different semantics, like approximate and

incomplete answers? can we make the techniques “goal”-driven rather than strict: e.g. I want

to have 100% over this week’s data, 80% over last week’s, etc? how to integrate static and dynamic cases (peers come and leave, others

stay there for a long period)?

Page 52: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 53

Conclusions Conceptual Model Logical Model

Architecture Graph Operational Semantics Metrics

Conceptual to Logical Optimization of ETL Workflows Research Challenges

DOLAP ’02,’06 – NLDB ’07 – J. IJSWIS ’07

CAiSE ’02 – DMDW ’02CAiSE ’03 - J. Inf.Sys. ’05ER ’05 – DaWaK ’05

DOLAP ’05 – J. DSS ’05ICDE ’05 – J. TKDE ’05 – DOLAP

’07

Real ETL – MeshJoin, ICDE ’07

Page 53: Formal Specification and Optimization of ETL Workflows

Timos Sellis - ADBIS 2007 54

Thank you!