Advances and Challenges for Scalable Provenance in Stream Processing Systems

Archan Misra, Marion Blount, Anastasios Kementsietsidis, Daby Sow, Min Wang

IBM T.J. Watson Research Center
© 2008 IBM Corporation


IPAW 2008

The Healthcare Crisis

Increased Demand
– 75 million aging Baby Boomers
– Rising consumption of healthcare services
– Strains existing healthcare resources

Rising Costs
– 133 million Americans have a chronic disease (e.g., diabetes)
– $1.6 trillion cost to treat chronic disease (2005)
– “Frequent Flyers” are 40% of the cost

Limited Resources
– Existing nurse shortage
– Existing bed shortage
– Doctor shortage predicted by 2015

[Figure: timeline with markers 1979, 1991, and Today.]


The Century Project: Moving From Reactive to Proactive Healthcare…

[Figure: Century architecture diagram. Labeled components:
– Ubiquitous Sensor Network: USN Hubs with ECG, Weight, BP, and Glucose sensors, connected over the Internet to the USN Gateway
– Stream-based Distributed Interoperable Health care Infrastructure (CENTURY): Event Preprocessor, Analysis Framework (analytics labeled QRS, BP, RR, PT, FA, WT, GL, SPE, AR, CHF, GLA, BPA, EP, WTA), Provenance Service, Event Management Service, Service Data Management, Base Solutions (System S, DB2, WAS)
– Solution Delivery Services: Subscription Service, Event Store Query Service, Provenance Query Service, Group Management Service
– Solutions (Applications): Application Servers with GUIs, using Subscribe and Notify
– Platform Service IHE Adapter, Interoperability Container, EHR System, Patient Medical Record]

1. We are not just addressing a social problem… There are some interesting research challenges here!

2. We are not just proposing a set of open problems… The first version of Century has already been deployed!


An Example of Data Provenance Use

The Setup:
-- Dr. Lee prescribes medication to patient John Doe for his heart condition.
-- Dr. Lee also prescribes a program to monitor the effect of the drug on Mr. Doe.
-- A few days later, Dr. Lee receives an alert from the Century prescription program.

Urgent Alert
Patient: Doe, John
Condition: Abnormal Reaction
Recommendation: Century has detected an issue in your patient’s medical condition, which is deteriorating. A known side effect has emerged. Century recommends that you decrease the patient’s dosage of the prescribed medication to 10 mg twice a day.

Dr. Lee’s reaction: “That’s unusual! Before agreeing to this change, I need to understand on what basis the system has made this recommendation.”

Ultimately, the physician is responsible for medical decisions/actions. In order for medical professionals to accept the upcoming technology, we must provide them with the information they need to make these decisions responsibly.

Data Provenance provides the foundation for this understanding.


The Underlying Technology

Sequence dependency: O23(t) ← I8(i, i-256)

Time dependency: O33(t) ← I15(t, t-1)

Hybrid Time Value dependency: O45(t) ← I96(t){(spo2<89)}, I97(t,t-1){(systolic>130)}, I67(t,t-1){(weightDelta>5%)}

[Figure: graph of processing elements (SPE, QRS, FA, RR, SP, BP, AR, AP, PT, SPA, BPA, EP, WB, WT, WTA) connected through numbered input ports (I2 to I97) and output ports (O1 to O95), producing Arrhythmia, Angina Pectoris, and Well-Being alerts.]
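The three dependency kinds named on this slide (sequence, time, and hybrid time-value) can be illustrated with a minimal Python sketch. The stream layout, the function names, and the choice of inclusive window bounds are assumptions made for illustration; they are not taken from the Century implementation.

```python
from typing import Callable, List, Tuple

# One stream element as (timestamp, value). Hypothetical layout for illustration.
Element = Tuple[float, float]


def sequence_window(stream: List[Element], index: int, count: int) -> List[Element]:
    """Sequence dependency, e.g. I8(i, i-256): the element at position
    `index` plus up to `count` elements before it (bounds assumed inclusive)."""
    start = max(0, index - count)
    return stream[start:index + 1]


def time_window(stream: List[Element], t: float, lookback: float) -> List[Element]:
    """Time dependency, e.g. I15(t, t-1): elements whose timestamp lies
    in [t - lookback, t] (bounds assumed inclusive)."""
    return [e for e in stream if t - lookback <= e[0] <= t]


def hybrid_window(stream: List[Element], t: float, lookback: float,
                  predicate: Callable[[float], bool]) -> List[Element]:
    """Hybrid time-value dependency, e.g. I97(t, t-1){(systolic>130)}:
    apply the time window first, then keep values satisfying the predicate."""
    return [e for e in time_window(stream, t, lookback) if predicate(e[1])]
```

For a systolic stream sampled twice per second, `hybrid_window(stream, t, 1.0, lambda v: v > 130)` would return only the readings from the last second that exceed 130.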

Stream Persistence Challenge

Provenance (and regulatory requirements) require that data streams are persisted. Can we sustain such a high insertion throughput in today’s DBs?


Data Provenance Granularity Challenge

Unlike traditional data provenance settings, data provenance is no longer limited to a tuple-based granularity. Indeed, granularity is very much dynamic and depends on the particular analytics.

How can we handle varying provenance granularities?


Persisting ECG data streams [DEBS07]

[Figure: plot with axes labeled Data Size and Time.]

This is a general trend!

This is a general problem:
-- Not specific to provenance!
-- Not specific to an app domain!


The CMIR Framework

[Figure: processing elements PE1 to PE6 with input ports I11 to I61 and output ports O11 to O61; virtual PEs PEV1 and PEV2 enclose groups of them.]

O21 ← I11(t-15, t)

O41 ← I31{(i, i-17, order=1), (systolic>130, order=2)}

The main idea of the framework is the introduction of virtual PEs to reduce the storage load imposed by provenance support.


Implementing such a framework has its own challenges!
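One way to see how virtual PEs could reduce storage is sketched below. It assumes that streams internal to a virtual PE need not be persisted because they can be recomputed at query time, so only streams crossing a virtual-PE boundary are stored. The graph topology, the grouping, and the function name `streams_to_persist` are hypothetical and do not reproduce the figure on this slide.

```python
from typing import Dict, List, Set, Tuple

# A stream is modeled as an edge (producer PE, consumer PE). Hypothetical model.
Edge = Tuple[str, str]


def streams_to_persist(edges: List[Edge], groups: Dict[str, str]) -> Set[Edge]:
    """Return the streams that cross a virtual-PE boundary.

    `groups` maps a PE name to the virtual PE that contains it; a PE that
    is not listed is treated as its own group. Streams whose producer and
    consumer share a group are internal and, under the assumption stated
    above, are not persisted.
    """
    def owner(pe: str) -> str:
        return groups.get(pe, pe)

    return {(src, dst) for src, dst in edges if owner(src) != owner(dst)}


# Hypothetical topology: two chains merging into PE5, then PE6.
example_edges = [("PE1", "PE2"), ("PE2", "PE5"), ("PE3", "PE4"),
                 ("PE4", "PE5"), ("PE5", "PE6")]
example_groups = {"PE1": "PEV1", "PE2": "PEV1", "PE3": "PEV2", "PE4": "PEV2"}
persisted = streams_to_persist(example_edges, example_groups)
```

In this example two of the five streams fall inside a virtual PE, so only three would be written to storage.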


Composing Provenance Rules

Since a provenance rule is associated with each individual PE, the provenance rule for a virtual PE must be automatically generated. This assumes that provenance rules are composable.

[Figure: two chained PEs, QRS and EP, with ports I2, O13, I10, and O25; O13 feeds I10.]

Example 1 (two time dependencies):
R1: O13(t) ← I2(t, t-10)
R2: O25(t) ← I10(t, t-32)
RC (R1 composed with R2): O25(t) ← I2(t, t-42)

Example 2 (a time dependency followed by a value-then-sequence dependency):
R1: O13(t) ← I2(t, t-10)
R2: O25 ← I10{(systolic>130, order=1), (i, i-5, order=2)}
RC: ????

Composing the two formulas requires that we can determine statically how far back in time, at runtime, we have to go to retrieve five values with a systolic pressure above 130.

We need to develop a rule language whose primitives are composable.
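The composable case on this slide, two pure time-window rules, can be sketched in Python as follows. The `TimeRule` class, the `compose` function, and the `connections` mapping are hypothetical names chosen for illustration; the sketch covers only time dependencies, which compose by adding their lookbacks, and says nothing about the value-based case that the slide leaves open.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TimeRule:
    """A time dependency of the form output(t) <- input(t, t - lookback)."""
    output: str
    input: str
    lookback: int


def compose(upstream: TimeRule, downstream: TimeRule,
            connections: Dict[str, str]) -> TimeRule:
    """Compose two time rules of chained PEs into one rule for the virtual PE.

    `connections` maps an input port to the output port feeding it.
    The composed lookback is the sum of the two lookbacks.
    """
    if connections.get(downstream.input) != upstream.output:
        raise ValueError("downstream input is not fed by upstream output")
    return TimeRule(output=downstream.output,
                    input=upstream.input,
                    lookback=upstream.lookback + downstream.lookback)


r1 = TimeRule(output="O13", input="I2", lookback=10)
r2 = TimeRule(output="O25", input="I10", lookback=32)
rc = compose(r1, r2, connections={"I10": "O13"})
```

Here `rc` corresponds to the composed rule O25(t) ← I2(t, t-42) from the first example.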


Inverting Provenance Rules

While backward provenance is the most common, in a number of settings, forward provenance turns out to be equally useful.

[Figure: QRS PE with input I10 and output O25.]

Example 1 (time dependency):
R1: O25(t) ← I10(t, t-32)
R1C: I10(t) → O25(t, t+32)

Example 2 (value dependency):
R2: O25(t) ← I10(systolic>130)
R2C: I10(t) → ????
A candidate inverse:
R2C: I10(t) → Ø, if systolic ≤ 130 at t; O25(t, -), otherwise
Not supported by our current model.

Example 3 (time dependency followed by a value dependency):
R3: O25 ← I10{(t, t-10, order=1), (systolic>130, order=2)}
R3C: I10(t) → O25(t, t+10)
This is approximate inversion but it might suffice in practice…
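A small Python sketch of the time-window inversion follows. It assumes integer timestamps and inclusive bounds, and all function names are hypothetical. `forward_window` mirrors the R1C style of inversion, and `forward_window_filtered` shows what checking the value predicate would add for a hybrid rule such as R3; the inverse shown on the slide omits that check, which is why it is only approximate.

```python
from typing import Callable, Optional, Tuple


def backward_window(t_out: int, lookback: int) -> Tuple[int, int]:
    """Backward provenance: input timestamps an output at t_out depends on."""
    return (t_out - lookback, t_out)


def forward_window(t_in: int, lookback: int) -> Tuple[int, int]:
    """Forward provenance: output timestamps an input at t_in can affect."""
    return (t_in, t_in + lookback)


def forward_window_filtered(t_in: int, value: float, lookback: int,
                            predicate: Callable[[float], bool]
                            ) -> Optional[Tuple[int, int]]:
    """Forward provenance that also checks the value predicate.
    Returns None when the input element cannot contribute to any output."""
    return forward_window(t_in, lookback) if predicate(value) else None


def is_consistent(t_in: int, lookback: int) -> bool:
    """Check that every output in the forward window has t_in inside
    its backward window, i.e. the two rules agree with each other."""
    first, last = forward_window(t_in, lookback)
    for t_out in range(first, last + 1):
        low, high = backward_window(t_out, lookback)
        if not (low <= t_in <= high):
            return False
    return True
```

Dropping the predicate check returns a window even for readings that never contributed, so the approximate inverse may report a superset of the outputs actually affected.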


Persisting Processing Element (PE) State

The processing performed by a PE, and hence its output, does not depend only on its input(s)! PEs can be, and often are, stateful.

Strategies for persisting state:

As a custom byte stream

± Each PE developer decides what/how to persist.

– No sharing of code/design across PE developers.

As a predefined data structure

+ Code/structure sharing across PE developers

– It must be generic enough to accommodate diverse needs across PEs.

As a (full-fledged) database

+ Well understood technology

+ Each PE developer decides what to persist

+ Generic enough to accommodate diverse needs across PEs.

– Might be overkill when the state information is simple.
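The second strategy, persisting state as a predefined data structure, might look like the following Python sketch. The `RunningMeanPE` class and its JSON snapshot format are hypothetical and chosen only to show a stateful PE whose output cannot be reproduced from its latest input alone.

```python
import json


class RunningMeanPE:
    """Hypothetical stateful PE: each output is the mean of all values
    seen so far, so it depends on internal state as well as the input."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0.0

    def process(self, value: float) -> float:
        """Consume one input value and emit the running mean."""
        self.count += 1
        self.total += value
        return self.total / self.count

    def snapshot(self) -> str:
        """Serialize the state in a shared, predefined structure (here a
        flat JSON object) instead of a PE-specific byte stream."""
        return json.dumps({"count": self.count, "total": self.total})

    @classmethod
    def restore(cls, blob: str) -> "RunningMeanPE":
        """Rebuild a PE from a snapshot so that processing can be replayed."""
        state = json.loads(blob)
        pe = cls()
        pe.count = state["count"]
        pe.total = state["total"]
        return pe
```

A PE restored from a snapshot produces the same output for the next input as the original, which is what replaying computation at query time relies on.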


Facing Varying Granularities

The granularity at which a consumer PE ingests streaming data might differ from the granularity at which a producer PE generates them.

[Figure: a raw signal feeds two producer/consumer pairs. Producer 1 (SPE, output O13) feeds Consumer 1 (QRS, input I10, output O25); Producer 2 (SPE, output O14) feeds Consumer 2 (QRS, input I11, output O26).]

R1: O25(t) ← I10(i, i-10)

R2: O26(t) ← I11(i, i-7)

The QRS analytics produces an alert based on the last 10 seconds of ECG signal.

This problem is NOT Century-specific! See the Xstream system [ICDE08].


Dealing With Varying Granularities

[Figure: a raw signal feeds two producer/consumer pairs. Producer 1 (SPE, output O13) feeds Consumer 1 (QRS, input I10, output O25); Producer 2 (SPE, output O14) feeds Consumer 2 (QRS, input I11, output O26).]

R1: O25(t) ← I10(i, i-10)

R2: O26(t) ← I11(i, i-3)

R3: I10(t) ← O13(t)

R4: I11(t) ← O14(t, t-4)

The same rule language can be used to smooth out the differences in granularities between consumer and producer PEs.
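To make the bridging rules concrete, the Python sketch below chains a consumer rule of the form O26 ← I11(i, i-3) with a bridging rule of the form I11(t) ← O14(t, t-4) to find the producer elements behind one consumer output. The data layout, the function name, and the inclusive bounds are assumptions for illustration only.

```python
from typing import List, Tuple

# A producer element as (timestamp, value). Hypothetical layout.
Element = Tuple[int, float]


def trace_output(i: int, input_times: List[int],
                 producer: List[Element]) -> List[Element]:
    """Return the producer (O14) elements behind the consumer output
    computed at input position `i`.

    Step 1 (consumer rule, sequence dependency): the output depends on
    the consumer inputs at positions i-3 .. i.
    Step 2 (bridging rule, time dependency): a consumer input stamped t
    depends on producer elements stamped in [t-4, t].
    """
    first = max(0, i - 3)
    contributing: List[Element] = []
    for t in input_times[first:i + 1]:
        for element in producer:
            if t - 4 <= element[0] <= t and element not in contributing:
                contributing.append(element)
    return contributing


# Producer emits once per time unit; the consumer ingests every 4 units.
producer_stream = [(ts, float(ts)) for ts in range(0, 21)]
consumer_input_times = [4, 8, 12, 16, 20]
sources = trace_output(4, consumer_input_times, producer_stream)
```

In this example the last consumer output traces back to the producer elements stamped 4 through 20, even though the consumer itself only saw four coarser-grained inputs.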


Related Work

[Figure: related work positioned along two axes: Data and Processing Rates, and Type of Provenance Reconstruction (Process vs. Data).]

– EU Provenance (PASOA, PreServ): captures workflow bindings.
– Medical Stream Provenance (Century): arbitrary time-varying lineage; dynamic stream bindings.
– Scientific SOA Workflows (KARMA): workflow and application binding notifications; stateless application-defined data provenance.
– Database View Inversion (Trio, Widom): specific transforms and extensions to SQL; data dependencies specified statically for relations.
– Stream Provenance (Simhan): stores temporal history of stream connectors as a per-stream stack; system-level automatic metadata collection at stream level.
– File Systems (PASS, LFS): captures system calls and modifications to file records; annotation per file.


Conclusions

Data provenance is a key component of the Century project.

The first version of Century has already been deployed and fully supports data provenance queries in an environment with high processing and data rates.

As part of our experience with this first version, we have identified a set of research challenges we must address to move Century forward:

– Stream Persistence Challenge: Not all provenance data can be persisted. Viable solutions must provide us with a way to replay some of the computation and effectively re-compute, at query time, the necessary data. To achieve this goal:

• The state of processing elements must also be persisted.
• The provenance model must support composition of provenance rules.
• The provenance model must support inversion of provenance rules.

– Data Provenance Granularity Challenge: The provenance model must account for the mismatch between the consumers and producers of data.

We are currently working towards addressing these (and other) issues.
