IBM T.J. Watson Research Center
© 2008 IBM Corporation
Advances and Challenges for Scalable Provenance in Stream Processing Systems
Archan Misra, Marion Blount, Anastasios Kementsietsidis, Daby Sow, Min Wang
IPAW 2008
The Healthcare Crisis
Increased Demand
-- 75 million aging Baby Boomers
-- Rising consumption of healthcare services
-- Strains existing healthcare resources
Rising Costs
-- 133 million Americans have a chronic disease (e.g., diabetes)
-- $1.6 trillion cost to treat chronic disease (2005)
-- "Frequent flyers" account for 40% of the cost
Limited Resources
-- Existing nurse shortage
-- Existing bed shortage
-- Doctor shortage predicted by 2015
The Century Project: Moving From Reactive to Proactive Healthcare…
[Figure: Century architecture. Components shown: Ubiquitous Sensor Network (USN) hubs collecting ECG, weight, BP, and glucose readings; USN Gateway; Internet; Stream-based Distributed Interoperable Health care Infrastructure (CENTURY) with Event Preprocessor, Analysis Framework (PEs: QRS, BP, RR, PT, FA, WT, GL, SPE, AR, CHF, GLA, BPA, EP, WTA), Provenance Service, Event Management Service, Subscription Service, Event Store Query Service, Provenance Query Service, Group Management Service, Service Data Management, and Platform Service IHE Adapter; Base Solutions (System S, DB2, WAS); Solutions (Applications) with Solution Delivery Services and application servers with GUIs (Subscribe/Notify); Interoperability Container; EHR System (Patient Medical Record).]
1. We are not just addressing a social problem… There are some interesting research challenges here!
2. We are not just proposing a set of open problems… The first version of Century has already been deployed!
An Example of Data Provenance Use
The Setup:
-- Dr. Lee prescribes medication to patient John Doe for his heart condition.
-- Dr. Lee also prescribes a program to monitor the effect of the drug on Mr. Doe.
-- A few days later, Dr. Lee receives an alert from the Century prescription program.
Urgent Alert
Patient: Doe, John
Condition: Abnormal Reaction
Recommendation: Century has detected an issue in your patient's medical condition, which is deteriorating. A known side effect has emerged. Century recommends that you decrease the patient's dosage of the prescribed medication to 10 mg twice a day.
Dr. Lee's reaction: "That's unusual! Before agreeing to this change, I need to understand on what basis the system has made this recommendation."
Ultimately, the physician is responsible for medical decisions and actions. For medical professionals to accept the upcoming technology, we must provide them with the information they need to make these decisions responsibly.
Data provenance provides the foundation for this understanding.
The Underlying Technology
Sequence dependency: O23(t) ← I8(i, i-256)
Time dependency: O33(t) ← I15(t, t-1)
Hybrid time-value dependency: O45(t) ← I96(t){(spo2<89)}, I97(t, t-1){(systolic>130)}, I67(t, t-1){(weightDelta>5%)}
[Figure: PE dataflow graph. PEs: SPE, QRS, FA, RR, SP, BP, AR, AP, PT, SPA, BPA, EP, WB, WT, WTA. Alerts: Angina Pectoris, Arrhythmia, Well-Being. Output ports: O1, O11, O13, O23, O24, O25, O30, O33, O40, O42, O45, O51, O70, O71, O95. Input ports: I2, I8, I9, I10, I15, I21, I41, I45, I49, I50, I52, I56, I67, I79, I80, I83, I89, I96, I97.]
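The three dependency types on this slide can be read as three ways of selecting input elements from a persisted stream. The following is a minimal sketch under stated assumptions: each stored element carries a sequence index `i`, a timestamp `t`, and named values; both ends of every window are treated as inclusive; and the `Element` type and helper names are hypothetical, not part of Century.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Element:
    i: int                     # sequence number within the stream
    t: float                   # timestamp (assumed to be in seconds)
    values: Dict[str, float]   # named readings, e.g. {"systolic": 135}


def sequence_window(stream: List[Element], i: int, back: int) -> List[Element]:
    """Sequence dependency, e.g. I8(i, i-256): elements numbered i-back .. i."""
    return [e for e in stream if i - back <= e.i <= i]


def time_window(stream: List[Element], t: float, back: float) -> List[Element]:
    """Time dependency, e.g. I15(t, t-1): elements stamped in [t-back, t]."""
    return [e for e in stream if t - back <= e.t <= t]


def hybrid_window(stream: List[Element], t: float, back: float,
                  predicate: Callable[[Element], bool]) -> List[Element]:
    """Hybrid time-value dependency, e.g. I97(t, t-1){(systolic>130)}."""
    return [e for e in time_window(stream, t, back) if predicate(e)]


# Toy blood-pressure stream (made-up data): one reading every 0.5 s.
bp = [Element(i=k, t=0.5 * k, values={"systolic": 120 + 5 * k}) for k in range(6)]

by_seq = sequence_window(bp, i=5, back=2)
by_time = time_window(bp, t=2.5, back=1.0)
hybrid = hybrid_window(bp, 2.5, 1.0, lambda e: e.values["systolic"] > 135)
```

The linear scans keep the sketch short; a real store would presumably index streams by sequence number and timestamp.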
Stream Persistence Challenge
Provenance (and regulatory requirements) require that data streams are persisted. Can we sustain such a high insertion throughput in today’s DBs?
Data Provenance Granularity Challenge
Unlike traditional data provenance settings, data provenance is no longer limited to a tuple-based granularity. Indeed, granularity is very much dynamic and depends on the particular analytics.
How can we handle varying provenance granularities?
Persisting ECG data streams [DEBS07]
[Figure: plot with axes Data Size and Time. Annotation: "This is a general trend!"]
This is a general problem:
-- Not specific to provenance!
-- Not specific to an app domain!
The CMIR Framework
[Figure: PE graph with processing elements PE1 to PE6 and PEV1, PEV2. Output ports: O11, O21, O31, O41, O51, O61. Input ports: I11, I21, I31, I41, I51, I61.]
O21 ← I11(t-15, t)
O41 ← I31{(i, i-17, order=1), (systolic>130, order=2)}
The main idea of the framework is the introduction of virtual PEs
to reduce the storage load imposed by provenance support.
Implementing such a framework has its own challenges!
Composing Provenance Rules
Since a provenance rule is associated with each individual PE, the provenance rule for a virtual PE must be automatically generated. This assumes that provenance rules are composable.
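[Figure: two chained PEs (QRS and EP) with ports I2, O13, I10, O25; O13 feeds I10.]
Composable case:
R1: O13(t) ← I2(t, t-10)
R2: O25(t) ← I10(t, t-32)
RC = R1 composed with R2: O25(t) ← I2(t, t-42)
Non-composable case:
R1: O13(t) ← I2(t, t-10)
R2: O25 ← I10{(systolic>130, order=1), (i, i-5, order=2)}
RC = ????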
Composing the two formulas requires that we can determine statically how far back in time, at runtime, we have to go to retrieve five values with a systolic pressure above 130.
We need to develop a rule language whose primitives are composable.
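The composable case on this slide (R1 and R2 yielding RC) amounts to adding the two look-back intervals. A minimal sketch, assuming both rules are pure time windows and that the upstream output feeds the downstream input directly; the `TimeRule` type and `compose` function are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRule:
    output: str   # output port, e.g. "O13"
    input: str    # input port, e.g. "I2"
    back: float   # output at time t depends on input in [t - back, t]


def compose(upstream: TimeRule, downstream: TimeRule) -> TimeRule:
    """Collapse two chained time-window rules into one rule for a virtual PE.

    Assumes upstream.output is wired to downstream.input. The oldest input
    that can reach the final output lies upstream.back + downstream.back
    before it, so the look-back intervals add.
    """
    return TimeRule(downstream.output, upstream.input,
                    upstream.back + downstream.back)


r1 = TimeRule(output="O13", input="I2", back=10)    # R1: O13(t) <- I2(t, t-10)
r2 = TimeRule(output="O25", input="I10", back=32)   # R2: O25(t) <- I10(t, t-32)
rc = compose(r1, r2)                                # RC: O25(t) <- I2(t, t-42)
```

No comparably simple function exists for the value-based rule: its look-back depends on how many readings exceed the threshold at runtime, which is exactly the obstacle the slide describes.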
Inverting Provenance Rules
While backward provenance is the most common, in a number of settings, forward provenance turns out to be equally useful.
[Figure: PE QRS with input port I10 and output port O25.]
Time-based rule:
R1: O25(t) ← I10(t, t-32)
R1C: I10(t) → O25(t, t+32)
Value-based rule:
R2: O25(t) ← I10(systolic>130)
R2C: I10(t) → ????
R2C: I10(t) → Ø, if systolic<130 at t; O25(t, -), otherwise
Not supported by our current model
Hybrid rule:
R3: O25 ← I10{(t, t-10, order=1), (systolic>130, order=2)}
R3C: I10(t) → O25(t, t+10)
This is approximate inversion but it might suffice in practice…
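For the pure time-window case (R1 and R1C), inversion mirrors the window: an input at time t can influence outputs up to the same interval later. A minimal sketch, assuming inclusive window ends; `BackwardRule`, `ForwardRule`, `invert`, and `affected_outputs` are hypothetical names, not Century's model:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackwardRule:
    output: str   # e.g. "O25"
    input: str    # e.g. "I10"
    back: float   # output at t depends on input in [t - back, t]


@dataclass(frozen=True)
class ForwardRule:
    input: str
    output: str
    ahead: float  # input at t can influence outputs in [t, t + ahead]


def invert(rule: BackwardRule) -> ForwardRule:
    """Turn a backward time-window rule into its forward counterpart."""
    return ForwardRule(rule.input, rule.output, rule.back)


def affected_outputs(rule: ForwardRule, t: float,
                     output_times: List[float]) -> List[float]:
    """Output timestamps that an input element at time t may have influenced."""
    return [u for u in output_times if t <= u <= t + rule.ahead]


r1 = BackwardRule(output="O25", input="I10", back=32)   # R1
r1c = invert(r1)                                        # R1C
hits = affected_outputs(r1c, 100, [90, 100, 120, 132, 140])
```

Applying the same inversion to R3 while ignoring its value predicate gives R3C; the result is presumably a superset of the outputs that truly depend on the input, which is one way to read "approximate inversion".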
Persisting Processing Element (PE) State
The processing performed by a PE, and hence its output, does not depend
only on its input(s)! PEs can be, and often are, stateful.
Strategies for persisting state:
As a custom byte stream
± Each PE developer decides what/how to persist.
– No sharing of code/design across PE developers.
As a predefined data structure
+ Code/structure sharing across PE developers
– It must be generic enough to accommodate diverse needs across PEs.
As a (full-fledged) database
+ Well understood technology
+ Each PE developer decides what to persist
+ Generic enough to accommodate diverse needs across PEs.
– Might be overkill when the state information is simple.
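The first strategy (a custom byte stream) can be illustrated with a toy stateful PE. This is a sketch only: `MovingAveragePE`, its JSON encoding, and the `snapshot`/`restore` methods are assumptions made for illustration, not Century's interface.

```python
import json
from typing import List


class MovingAveragePE:
    """Hypothetical stateful PE: emits the mean of the last `window` inputs."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.buffer: List[float] = []

    def process(self, value: float) -> float:
        # The output depends on earlier inputs held in the buffer (the state).
        self.buffer.append(value)
        self.buffer = self.buffer[-self.window:]
        return sum(self.buffer) / len(self.buffer)

    def snapshot(self) -> bytes:
        # Custom byte stream: this PE alone decides what and how to persist.
        state = {"window": self.window, "buffer": self.buffer}
        return json.dumps(state).encode("utf-8")

    @classmethod
    def restore(cls, blob: bytes) -> "MovingAveragePE":
        state = json.loads(blob.decode("utf-8"))
        pe = cls(state["window"])
        pe.buffer = state["buffer"]
        return pe


pe = MovingAveragePE(window=3)
for reading in (120.0, 130.0, 140.0):
    pe.process(reading)

blob = pe.snapshot()                     # persisted alongside the stream
replayed = MovingAveragePE.restore(blob)
same = pe.process(150.0) == replayed.process(150.0)
```

Restoring the snapshot lets a replayed PE reproduce the original output for the next input, which is the property that recomputation at query time would rely on. The trade-off listed on the slide is visible here: the encoding is private to this PE, so nothing is shared with other PE developers.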
Facing Varying Granularities
The granularity at which a consumer PE ingests streaming data might differ from the granularity at which a producer PE generates them.
[Figure: a raw signal (ports I2, I3) feeds Producers 1 and 2 (SPE, outputs O13 and O14), which feed Consumers 1 and 2 (QRS, inputs I10 and I11, outputs O25 and O26).]
R1: O25(t) ← I10(i, i-10)
R2: O26(t) ← I11(i, i-7)
The QRS analytics produces an alert based on the last 10 seconds of ECG signal.
This problem is NOT Century-specific! See the Xstream system [ICDE08].
Dealing With Varying Granularities
[Figure: a raw signal (ports I2, I3) feeds Producers 1 and 2 (SPE, outputs O13 and O14), which feed Consumers 1 and 2 (QRS, inputs I10 and I11, outputs O25 and O26).]
The same rule language can be used to smooth out the differences in granularities between consumer and producer PEs.
R1: O25(t) ← I10(i, i-10)
R3: I10(t) ← O13(t)
R2: O26(t) ← I11(i, i-3)
R4: I11(t) ← O14(t, t-4)
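Rules R2 and R4 can be chained to trace a consumer output back to the producer elements behind it. A minimal sketch, assuming timestamps in seconds, inclusive window ends, and a consumer that receives one coarse element for every five producer elements; the function name and the toy timestamps are hypothetical:

```python
from typing import List


def o26_sources(i: int, i11_times: List[float],
                o14_times: List[float]) -> List[float]:
    """Timestamps of the O14 elements behind the i-th O26 output."""
    # R2: O26 element i depends on I11 elements i-3 .. i (sequence window).
    consumed = i11_times[max(0, i - 3): i + 1]
    # R4: each I11 element at time t is assembled from O14 elements
    # stamped in [t-4, t] (time window bridging the two granularities).
    sources = {u for t in consumed for u in o14_times if t - 4 <= u <= t}
    return sorted(sources)


# Producer 2 emits one element per second; Consumer 2 ingests one per 5 s.
o14_times = [float(s) for s in range(0, 41)]
i11_times = [5.0 * k for k in range(1, 9)]   # 5, 10, ..., 40

result = o26_sources(4, i11_times, o14_times)
```

In this toy setup the output with index 4 consumes the I11 elements stamped 10, 15, 20, and 25, so the trace covers the producer elements stamped 6 through 25.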
Related Work
[Figure: related systems arranged along two axes: Data and Processing Rates, and Type of Provenance Reconstruction (Process versus Data).]
-- EU Provenance (PASOA, PreServ): Captures workflow bindings.
-- Medical Stream Provenance (Century): Arbitrary time-varying lineage. Dynamic stream bindings.
-- Scientific SOA Workflows (KARMA): Workflow and application binding notifications. Stateless application-defined data provenance.
-- Database View Inversion (Trio, Widom): Specific transforms and extensions to SQL. Data dependencies specified statically for relations.
-- Stream Provenance (Simhan): Store temporal history of stream connectors as a per-stream stack. System-level automatic metadata collection at stream level.
-- File Systems (PASS, LFS): Captures system calls and modifications to file records. Annotation per file.
Conclusions
Data provenance is a key component of the Century project.
The first version of Century has already been deployed and fully supports data provenance queries in an environment with high processing and data rates.
As part of our experience with this first version, we have identified a set of research challenges we must address to move Century forward:
– Stream Persistence Challenge: Not all provenance data can be persisted. Viable solutions must provide us with a way to replay some of the computation and effectively re-compute, at query time, the necessary data. To achieve this goal:
• The state of processing elements must also be persisted.
• The provenance model must support composition of provenance rules.
• The provenance model must support inversion of provenance rules.
– Data Provenance Granularity Challenge: The provenance model must account for the mismatch between the consumers and producers of data.
We are currently working towards addressing these (and other) issues.