1© Gerhard Weikum
Outline
Part A: WF Specification and Verification
Part B: WF System Architecture and Configuration
What Is It All About?
WF Specification Techniques
Statecharts
CTL and Model Checking
Summary and Open Research Issues
WF Execution Infrastructure
• Failure Handling
• Stochastic Modeling
• WF System Configuration
• Summary and Open Research Issues
WFMS Architecture for E-Services
...
WF server type 1
App server type 1
Clients
Comm server
WF server type 2
App server type n
...
Interoperability between WF Systems
WFMS 1 WFMS 2
WF Mediator
<?xml version="1.0"?>
<Activity WFtype="5" Name="RiskAssessmentAct">
  <BusinessData>
    <CreditRequest>
      <CustId>101</CustId>
    </CreditRequest>
  </BusinessData>
  <WorkflowCtrl>
    <Variable Name="Currency" Value="USD"/>
  </WorkflowCtrl>
</Activity>
• wrap WFMS using XML-based interface (e.g., WSDL/WSFL or ???)
• route activity and sub-WF invocations through WF mediator
• same protocol for activities and sub-WFs
• additional functions for sub-WF monitoring
Some “AI-complete” Problems
Grand challenge: service discovery and matchmaking
State of the art: standardized syntax & protocols (à la UDDI) with queries on yellow pages ("business registry")
Needed: semantics & interoperability with automatic reasoning about process/activity interface, behavior, and outcome
Standardized ontologies are a step forward, but still far from the final goal; we need: Semantic Web + Intelligent Search
Outline
Part A: WF Specification and Verification
Part B: WF System Architecture and Configuration
What Is It All About?
WF Specification Techniques
Statecharts
CTL and Model Checking
Summary and Open Research Issues
WF Execution Infrastructure
Failure Handling
• Stochastic Modeling
• WF System Configuration
• Summary and Open Research Issues
Important System Issues
• Differentiated quality of service and performance guarantees (e.g., class-specific response time and workflow turnaround time); see, e.g., Mentor-lite, http://www-dbs.cs.uni-sb.de/~mlite/
• World-wide failure masking for exactly-once behavior with easy app development; see, e.g., the Phoenix project, http://research.microsoft.com/db/phoenix/
• Scalability, Reliability, Availability, Manageability, ...
Your server command (process id #20) has been terminated. Re-run your command (severity 13) in /export/home/WWW/your-reliable-eshop.biz/mb_1300_db.mb1
Place your order
The Need for Failure Masking
Please review and place your order
Long-Lived & Distributed Execution
[Figure: travel-planning statechart with states SelectConf, CheckConfFee, CheckTrExpenses, CheckCost, Go, and No. Transition labels: / Budget:=1000; Trials:=1; [Found] / Cost:=0; [!Found]; [Fok & Eok] / Cost := ConfFee + TrExpenses]
Atomic (transactional) write of persistent state & context guarantees forward recovery
Digression: Two-Phase Commit Protocol (2PC) for Distributed Atomic Transactions
[Figure: message sequence between Coordinator, Agent 1, and Agent 2]
1. Coordinator: write "begin"; send "prepare" to Agent 1 and Agent 2
2. Each agent: force log entries & write "prepared"; send "yes"
3. Coordinator: write "commit"; send "commit" to Agent 1 and Agent 2
4. Each agent: write "commit"; send "ack"
5. Coordinator: write "end"
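The message sequence of 2PC can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a real implementation: the names `Agent` and `two_phase_commit` are hypothetical, logs are plain lists instead of forced writes to stable storage, and message loss, timeouts, and restarts are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical participant; its log is a list standing in for a stable log."""
    name: str
    vote_yes: bool = True
    log: list = field(default_factory=list)

    def prepare(self) -> bool:
        if self.vote_yes:
            # Force log entries and write "prepared" before voting yes.
            self.log.append("prepared")
        return self.vote_yes

    def decide(self, decision: str) -> str:
        # Write "commit" or "abort", then acknowledge.
        self.log.append(decision)
        return "ack"


def two_phase_commit(agents, coordinator_log):
    coordinator_log.append("begin")
    # Phase 1: send "prepare" to every agent and collect the votes.
    votes = [agent.prepare() for agent in agents]
    decision = "commit" if all(votes) else "abort"
    # The decision is logged before it is sent to any agent.
    coordinator_log.append(decision)
    # Phase 2: send the decision and wait for all acknowledgements.
    acks = [agent.decide(decision) for agent in agents]
    if all(ack == "ack" for ack in acks):
        coordinator_log.append("end")
    return decision
```

With two agents that both vote yes, the coordinator log ends up as begin, commit, end; a single "no" vote turns the decision into abort for everyone.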
Statechart for 2PC Protocol
[Figure: statechart with three orthogonal components.
Coordinator: states initial, collecting, C-pending, A-pending, committed, aborted, forgotten. Transition labels: / prepare1; prepare2; yes1 & yes2 / commit1; commit2; sorry1 | sorry2 / abort1; abort2; ack1 & ack2; T|F / commit1; commit2; T|F / abort1; abort2.
Agent i (i = 1, 2): states initial_i, prepared_i, committed_i, aborted_i. Transition labels: prepare_i / yes_i; prepare_i / sorry_i; commit_i / ack_i; abort_i / ack_i; T_i|F_i.]
Long-lived & Distributed Execution
[Figure: travel-planning statechart with states SelectConf, CheckConfFee, CheckTrExpenses, CheckCost, Go, and No. Transition labels: / Budget:=1000; Trials:=1; [Found] / Cost:=0; [!Found]; [Fok & Eok] / Cost := ConfFee + TrExpenses]
Queued transactions & 2PC guarantee consistency of distributed WFMS & exactly-once execution
From ACID To Recovery Guarantees
Problem: A client that does not receive a return code from a transactional server cannot easily find out the transaction outcome and may be tempted to re-initiate the (non-idempotent) transaction, thus producing unacceptable effects.
Approach: In addition to atomicity, the transactional server needs to guarantee the exactly-once execution of the transaction, where execution includes the server's reply message.
(almost) perfect failure masking
Stateless Applications Based on Queues
Stateless application (running on client, app server, or data server):
• user sends input message
• app program sends request message to data server
• data server executes transaction and sends reply message to app
• app program sends output message to user
There are no conversations with the user within a transaction, and subsequent transactions are independent.
Solution: Queued Transactions
• message recovery by queue manager with persistent, recoverable message queues
• exactly-once execution by enclosing message dequeue and enqueue into transaction
Illustration of 2-Tier Queued Transaction
[Figure: User exchanges input and output with the Application Process (Client); the client enqueues the request and dequeues the reply; the Database Server dequeues the request and enqueues the reply within one server transaction]
Illustration of 3-Tier Queued Transaction
[Figure: User exchanges input and output with the Client; the client enqueues the request and dequeues the reply; the Application Server dequeues the request and enqueues the reply; Application Server and Database Server take part in one distributed server transaction]
Correctness of Queued Transaction Protocol
Theorem: With the queued transaction protocol for stateless applications, the following guarantees hold:
1. Once the user-input transaction is committed, a request is executed by the server exactly once.
2. Once the user-input transaction is committed, the user output is delivered at least once.
3. If user output is testable, the user output is delivered exactly once, provided the user-input transaction has been committed.
Inherent (small window of) uncertainty:
• (last) user input may get lost
• (last) user output may be sent more than once
can be eliminated with testable output (using special hardware)
Client During Normal Operation
user-input processing by client:
  begin transaction;
    enqueue (request);
  commit transaction;
user-output processing by client:
  wait until reply queue is not empty;
  begin transaction;
    dequeue (reply);
    while user has not acknowledged the reply or sent the next request do
      present reply to user;
    end /*while*/;
  commit transaction;
Server During Normal Operation
request-reply processing by data server:
  begin transaction;
    dequeue (request);
    perform data operations and generate reply;
    enqueue (reply);
  commit transaction;
Client and Server Restart
Client restart:
  check reply queue;
  if not empty then
    process reply like during normal operation;
  end /*if*/;
Server restart:
  check request queue;
  if not empty then
    initiate processing of requests like during normal operation;
  end /*if*/;
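The server side of the queued transaction protocol can be sketched with Python's standard `sqlite3` module. This is only an illustration under simplifying assumptions: both queues and the application data live in one local database (so no distributed commit is needed), the table names and the `account` example are hypothetical, and a crash is simulated by a rollback.

```python
import sqlite3


def open_store():
    # isolation_level=None lets us issue BEGIN/COMMIT explicitly.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.executescript("""
        CREATE TABLE request_queue (id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE reply_queue   (id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE account       (balance INTEGER);
        INSERT INTO account VALUES (100);
    """)
    return db


def client_enqueue(db, body):
    """User-input processing: enqueue the request in its own transaction."""
    db.execute("BEGIN")
    db.execute("INSERT INTO request_queue (body) VALUES (?)", (body,))
    db.execute("COMMIT")


def server_step(db, crash_before_commit=False):
    """Dequeue one request, apply it, enqueue the reply: all in one transaction."""
    db.execute("BEGIN")
    row = db.execute(
        "SELECT id, body FROM request_queue ORDER BY id LIMIT 1").fetchone()
    if row is None:
        db.execute("COMMIT")
        return False
    req_id, body = row
    db.execute("DELETE FROM request_queue WHERE id = ?", (req_id,))
    db.execute("UPDATE account SET balance = balance - ?", (int(body),))
    db.execute("INSERT INTO reply_queue (body) VALUES (?)", ("debited " + body,))
    if crash_before_commit:
        # Simulated failure: dequeue, update, and enqueue all disappear together.
        db.execute("ROLLBACK")
        return False
    db.execute("COMMIT")
    return True
```

After a simulated crash the request is still in the request queue and the balance is unchanged, so the restart logic simply calls `server_step` again and the request takes effect exactly once.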
Pseudo-Conversational Transactions for Stateful Applications
• Queue-based message recovery for entire conversations
• Conversational "logical unit of work" broken down into chain of stateless transactions with (small) application state maintained in the queue (analogously to Cookies, but more general and much more reliable)
• Dequeue of reply and enqueue of next request combined into one transaction for exactly-once execution guarantee
• good for apps such as travel reservation, electronic shopping, etc.
Illustration of Pseudo-Conversational Transactions
[Figure: timelines for User, Application Process (Client), and Database Server]
Correctness of Pseudo-Conversational Transaction Protocol
Theorem: With the queue-based message recovery for conversational multi-step transaction chains, the following guarantees hold:
1. Once the initial user-input transaction that starts the entire conversation is committed, the entire transaction chain is executed by the server exactly once.
2. Once the initial user-input transaction is committed, each user-output message throughout the conversation is delivered at least once.
3. If user output is testable, each user-output message is delivered exactly once, provided the initial user-input transaction has been committed.
Queue-based Message Recovery for Exactly-Once Workflow Execution
At end of activity execute transaction that combines:
• writing the activity's modifications of workflow state and context to persistent store
• writing the state modifications that result from the firing of outgoing transitions to persistent store
• writing the context modifications that result from the actions of firing transitions to persistent store
• notifying the follow-up activities by enqueueing messages
Newly invoked activity executes transaction that combines:
• dequeueing of notification message
• writing the workflow state and context to persistent store
Use of Queued Transactions in Travel Planning Workflow
[Figure: travel-planning statechart with states SelectConference, CheckConfFee, SelectTutorials, ComputeFee, CheckTravelCost, CheckFlight, CheckAirfare, CheckHotel, CheckCost, Go, and No. Transition labels: / Budget:=1000; Trials:=1; [ConfFound] / Cost:=0; [!ConfFound]; / Cost = ConfFee + TravelCost; [Cost ≤ Budget]; [Cost > Budget & Trials < 3] / Trials++; [Cost > Budget & Trials ≥ 3]]
Queued transactions & 2PC guarantee consistency of distributed WFMS & exactly-once execution
Compensation of Invoked Applications
[Figure: travel-planning statechart with states SelectConf, CheckConfFee, CheckTrExpenses, CheckCost, Go, and No, extended by compensating steps CancelTravel and CancelConf. Transition labels: / Budget:=1000; Trials:=1; [Found] / Cost:=0; [!Found]; [Fok & Eok] / Cost := ConfFee + TrExpenses]
Provide compensating steps & invoke steps (mostly) automatically
Meaningful Compensation Spheres (1)
[Figure: travel-planning statechart with states SelectConf, CheckConfFee, CheckTrExpenses, CheckCost, Go, and No, with an arbitrarily drawn compensation sphere marked "?". Transition labels: / Budget:=1000; Trials:=1; [Found] / Cost:=0; [!Found]; [Fok & Eok] / Cost := ConfFee + TrExpenses]
Arbitrary compensation spheres may leave workflow in non-resumable configuration!
Meaningful Compensation Spheres (2)
[Figure: travel-planning statechart with states SelectConf, CheckConfFee, CheckTrExpenses, CheckCost, Go, and No, with a sphere around a single state marked "!". Transition labels: / Budget:=1000; Trials:=1; [Found] / Cost:=0; [!Found]; [Fok & Eok] / Cost := ConfFee + TrExpenses]
Restrict atomicity spheres to a single state and its enclosed activities & apps
The Need for Multi-Tier Application Recovery
[Figure: multi-tier service with Client, Expedia App Web Server, Expedia App Server, Sabre App Server, Amadeus App Server, and Data Servers]
Realistic example: Expedia or Travelocity style multi-tier service
Need for Integrated & Application-transparent Data, Message, and Process Recovery for largely autonomous components
[Figure: users and other clients, business portal server, Web app server, and data server]
Efficient Solution: Recovery Contracts
For each process:
• log all non-deterministic events (non-forced)
Upon interaction between sender and receiver:
• sender promises recoverable state and message (e.g., via replay) and resends message if necessary
• receiver promises duplicate elimination and recoverable state when releasing sender promise
+ low run-time overhead: one forced log write per multi-tier request/reply
+ fast restart (⇒ high availability): rebuild process state & message table and replay
+ independent recovery of autonomous components
Prototype implementation for IE6 / Apache / PHP / MySQL, plus COM+-based implementation work in the Phoenix project at MSR
Committed Interaction Contract (CIC)
• Sender Obligation S1: persistent state as of message time or later
• Sender Obligation S2: persistent message
  • S2a: resend message periodically until released by receiver
  • S2b: resend message upon explicit request until released
• Sender Obligation S3: unique messages
• Receiver Obligation R1: duplicate message elimination
• Receiver Obligation R2: persistent state
  • R2a: persistent state as of message time or later before releasing sender from S2a (stable interaction)
  • R2b: persistent state & message before releasing sender from S2b (installed interaction)
Immediately Committed Interaction (ICIC): Receiver releases sender from S2a, S2b immediately (similar to optimized 2PC) – crucial for autonomous recovery
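The interplay of sender obligations S2/S3 and receiver obligation R1 can be sketched as follows. The class names are hypothetical and all state is kept in memory; in a real system the outbox and the table of seen message ids would have to be persistent, which is exactly what S1, S2, and R2 demand.

```python
class Receiver:
    def __init__(self):
        self.seen = set()      # stands in for a persistent message table
        self.applied = []      # effects of messages that were processed

    def deliver(self, msg_id, payload):
        # R1: duplicate elimination based on the unique message id.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.applied.append(payload)
        # The returned id plays the role of the release notification.
        return msg_id


class Sender:
    def __init__(self):
        self.next_id = 0
        self.outbox = {}       # S2: messages are kept until released

    def send(self, payload):
        self.next_id += 1      # S3: every message gets a unique id
        self.outbox[self.next_id] = payload
        return self.next_id

    def resend_pending(self, receiver, lose_release=False):
        # S2a: resend every message that has not been released yet.
        for msg_id, payload in list(self.outbox.items()):
            released = receiver.deliver(msg_id, payload)
            if not lose_release:
                del self.outbox[released]
```

If the release notification is lost, the sender resends the message, and the receiver filters the duplicate, so the payload is applied only once.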
Statechart for CIC
[Figure: statechart with a sender component and a receiver component.
Sender states: running; interaction recoverable: S1, S2 promised; message sent; interaction (known to be) stable: (S2a released); interaction (known to be) installed: (S2b released).
Receiver states: running; message received; interaction stable: R2a promised; interaction installed: R2b promised.
Transition labels: / make state and message persistent; / message transfer; message transfer; / log message arrival; / stability notification; stability notification; [interaction stable] / make state persistent; / install notification; commit notification; [true]]
Statechart for ICIC
[Figure: statechart with a sender component and a receiver component.
Sender states: running; interaction recoverable: S1, S2 promised; message sent; interaction (known to be) installed: S2 released.
Receiver states: running; message received; interaction installed: R2 promised.
Transition labels: / make state and message persistent; / message transfer; message transfer; / make state persistent; / stability and install notification; stability and install notification; [true]]
External Interaction Contract (XIC) and Transactional Interaction Contract (TIC)
XIC:
• input from user: receiver promises ICIC, sender doesn't
• output to user: sender promises ICIC, receiver doesn't
Consequence: crash may lead to lost input or duplicated output (for small but inherently unavoidable window of vulnerability)
TIC:
Receiver of transactional request promises:
• atomic state transition
• faithful reply message
• persistent reply message
Sender of transactional request promises:
• persistent state and commit request message
• unique messages
Special Case: Client-Server Application Recovery
[Figure: two message-flow diagrams between User, Application Process (Client), 2nd App Process, and Database Server. During normal operation: input, request, reply, output. During client restart after a crash: input, request, and reply are replayed.]
General Considerations for Client-Server Stateful Application Recovery
• Message logging for message recovery and deterministic program replay (of piecewise deterministic program)
• Installation points for process recovery and reduced program replay
• Server processes concurrent threads on behalf of many clients
• Server "commits its state" upon sending a reply to a client
• Forced logging should be minimized
• Server should be able to perform independent recovery
Server Reply Logging Method
• Client and server each
  • maintain a message lookup table (MLT) and
  • write message log entries to a stable log
• Client performs lazy, non-forced logging, periodically creates an installation point, and force-logs user-input messages
• Server forces its log buffer before sending a reply message
• Server recovery rebuilds message lookup table and replays incomplete requests to produce reply (may need logging of read/write interleaving among threads)
• Client recovery rebuilds MLT, reloads app from last installation point and replays application, intercepting message events and obtaining the contents of messages from local MLT or the server
• Client sends stability notifications to facilitate server log truncation
Data Structures for Server Reply Logging
[Figure: client and server, each with a stable log file and a message lookup table.
Server log entries (MSN, Type): 10 request, 20 request, 30 reply, 40 reply, ..., 70 request, 80 reply; force log upon reply.
Client log entries (MSN, Type): 15 input, 20 request, 40 reply, 45 output, ..., installation point, 65 input, 70 request; lazy logging.]
Replaying Incomplete Requests with Server Reply Logging
[Figure: client log with entries (MSN, Type) 15 input, 20 request; server log with entries 10 request, 20 request, 30 reply; the server replays the incomplete request over the logged read/write interleaving ... R(x) W(x) R(x) R(y) W(y) R(y) ...]
Log Truncation with Server Reply Logging
[Figure: client c with client log (MSN, Type: 15 input, 20 request, 40 reply, 45 output, ...) and client message lookup table; server with server log (entries for client c and for other clients, including 70 request and 80) and a RedoMSN for client c. Client c sends request 70 together with a stability notification, which lets the server truncate its log.]
Efficient Multi-tier Application Recovery and Failure Masking
[Figure: multi-tier service with Client, Expedia App Web Server, Expedia App Server, Sabre App Server, Amadeus App Server, and Data Servers]
Altogether 16 messages (8 requests + 8 replies) per user request
10 forced log writes:
• 1 user request at client
• 4 replies at data servers (transactional ICs)
• 3 replies at external app servers (ICICs)
• 2 app server replies at Web server (ICICs)
• no forced logging between Web server and app server in same "recovery ensemble" (CIC)
as opposed to 32 forced log writes with 2PC for every sender-receiver pair
Additional System Guarantees for Workflows
Exactly-once execution guarantees to preserve the guaranteed semantic properties in a failure-prone, distributed system environment
High availability through server and data replication
Scalable performance
Guaranteed performance, e.g.: response time < 5 seconds with probability 0.95 for 1000 concurrently active workflows
auto-tuning and zero-admin
Outline
Part A: WF Specification and Verification
Part B: WF System Architecture and Configuration
What Is It All About?
WF Specification Techniques
Statecharts
CTL and Model Checking
Summary and Open Research Issues
WF Execution Infrastructure
Failure Handling
Stochastic Modeling
• WF System Configuration
• Summary and Open Research Issues
Internal Server Error. Our system administrator has been notified. Please try later again.
Check Availability(Look-Up Will Take 8-25 Seconds)
The Need for Performance and QoS Guarantees
From Best Effort To Performance & QoS Guarantees
"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate"
(Report of the US President's Technology Advisory Committee)
• Very slow servers are like unavailable servers
• Tuning for peak load requires predictability of the workload × config → performance function
• Self-tuning requires mathematical models
• Stochastic guarantees for huge #clients: P [response time ≤ 5 s] > 0.95
WFMS Architecture for E-Services
...
WF server type 1
App server type 1
Clients
Comm server
WF server type 2
App server type n
...
Digression: Markov Chains
A discrete-time finite-state Markov chain is a pair $(\Sigma, p)$ with a state set $\Sigma = \{s_1, \ldots, s_n\}$ and a transition probability function $p: \Sigma \times \Sigma \to [0,1]$ with the property $\sum_j p_{ij} = 1$ for all $i$, where $p_{ij} := p(s_i, s_j)$.
A Markov chain is called ergodic (stationary) if for each state $s_j$ the limit
$$p_j := \lim_{t \to \infty} p_{ij}^{(t)}$$
exists and is independent of $s_i$, with
$$p_{ij}^{(t)} := \sum_k p_{ik}^{(t-1)} \, p_{kj} \ \text{ for } t > 1 \quad\text{and}\quad p_{ij}^{(t)} := p_{ij} \ \text{ for } t = 1.$$
For an ergodic finite-state Markov chain, the stationary state probabilities $p_j$ can be computed by solving the linear equation system:
$$p_j = \sum_i p_i \, p_{ij} \ \text{ for all } j, \qquad \sum_j p_j = 1$$
Markov Chain Example
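[Figure: state-transition diagram with states 0: sunny, 1: cloudy, 2: rainy and transition probabilities 0→0: 0.8, 0→1: 0.2, 1→0: 0.5, 1→2: 0.5, 2→0: 0.4, 2→1: 0.3, 2→2: 0.3]
p0 = 0.8 p0 + 0.5 p1 + 0.4 p2
p1 = 0.2 p0 + 0.3 p2
p2 = 0.5 p1 + 0.3 p2
p0 + p1 + p2 = 1
⇒ p0 = 55/79 ≈ 0.696, p1 = 14/79 ≈ 0.177, p2 = 10/79 ≈ 0.127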
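The balance equations of the weather example can also be solved numerically. The sketch below uses plain power iteration on the transition matrix read off the equations above; the function name `stationary` and the iteration count are arbitrary choices for illustration, and for larger chains a linear solver would be the more usual tool.

```python
def stationary(P, iterations=1000):
    """Approximate the stationary distribution of an ergodic chain
    by repeatedly applying p <- p * P, starting from the uniform vector."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p


# Row i holds the transition probabilities out of state i
# (0: sunny, 1: cloudy, 2: rainy).
P = [
    [0.8, 0.2, 0.0],
    [0.5, 0.0, 0.5],
    [0.4, 0.3, 0.3],
]

p = stationary(P)
print([round(x, 3) for x in p])
```

The resulting vector sums to 1 and satisfies all three balance equations up to floating-point error.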
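Digression: Continuous Time Markov Chains
A finite-state continuous-time Markov chain (CTMC) is a pair $(\Sigma, q)$ with a state set $\Sigma = \{s_1, \ldots, s_n\}$ and transition rates $q: \Sigma \times \Sigma \to \mathbb{R}_0^+$ with
$$q_{ij} := \lim_{\Delta t \to 0} \frac{P[X(t + \Delta t) = j \mid X(t) = i]}{\Delta t}$$
For an ergodic CTMC the stationary state probabilities $p_j$ can be computed by solving the system of linear flow balance equations:
$$p_j \sum_{k \neq j} q_{jk} = \sum_{i \neq j} p_i \, q_{ij} \ \text{ for all } j \quad\text{and}\quad \sum_j p_j = 1$$
A CTMC can be "factorized" into a discrete-time Markov chain with transition probabilities
$$p_{ij} = \frac{q_{ij}}{\sum_k q_{ik}}$$
and exponentially distributed state residence times
$$P[\text{time in state } i \le t] = 1 - e^{-t / H_i} \quad\text{with}\quad H_i = \frac{1}{\sum_k q_{ik}} = E[\text{time in state } i]$$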
CTMC Example 1: Stationary Availability
Only transient, repairable failures; availability = P[system is operational at random time point]
Single server:
[Figure: states 0: down and 1: up, with rate 1/MTTF from 1 to 0 and rate 1/MTTR from 0 to 1]
p0 / MTTR = p1 / MTTF
p1 / MTTF = p0 / MTTR
p0 + p1 = 1
⇒ availability of server:
$$p_1 = \frac{MTTF}{MTTF + MTTR}$$
Mirrored server pair:
[Figure: states 2: both up, 1: 1 up, 1 down, and 0: both down, with rate 2/MTTF from 2 to 1, rate 1/MTTF from 1 to 0, rate 1/MTTR from 1 to 2, and rate 1/MTTR from 0 to 1]
p1 / MTTR = 2 p2 / MTTF
2 p2 / MTTF + p0 / MTTR = p1 / MTTR + p1 / MTTF
p1 / MTTF = p0 / MTTR
p0 + p1 + p2 = 1
⇒ availability of server pair:
$$1 - p_0 = 1 - \frac{2\,MTTR^2}{MTTF^2 + 2\,MTTF \cdot MTTR + 2\,MTTR^2}$$
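As a small numeric companion to the two availability models, the following sketch solves the flow balance equations in closed form. The function names are hypothetical, and the mirrored-pair model uses exactly the rates from the balance equations above (failure rate 2/MTTF with both servers up, repair rate 1/MTTR in both degraded states).

```python
def single_availability(mttf, mttr):
    """Stationary probability of the 'up' state of a single server."""
    return mttf / (mttf + mttr)


def mirrored_pair_availability(mttf, mttr):
    """Stationary probability that at least one of two mirrored servers is up."""
    r = mttr / mttf
    # From p1/MTTR = 2 p2/MTTF and p1/MTTF = p0/MTTR:
    #   p1 = 2 r p2,  p0 = 2 r^2 p2,  and p0 + p1 + p2 = 1.
    p2 = 1.0 / (1.0 + 2.0 * r + 2.0 * r * r)
    p1 = 2.0 * r * p2
    return p1 + p2


# Example with hypothetical values: one failure per day, 10 minutes repair time.
print(single_availability(1440.0, 10.0))
print(mirrored_pair_availability(1440.0, 10.0))
```

Both arguments must use the same time unit; only the ratio MTTR/MTTF matters for the result.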
CTMC Example 2: Reliability
Some repairable, some non-repairable failures
reliability = P[lifetime of system ≥ t] or E[lifetime]
Mirrored disk pair:
[Figure: states 2: both up, 1: 1 up, 1 down, and 0: both down (absorbing), with rate 2/MTTF from 2 to 1, rate 1/MTTR from 1 to 2, and rate 1/MTTF from 1 to 0]
E20 = E[time until absorbing state is reached from initial state]
$E_{ij}$ := E[time between entering i and entering j], with
$$E_{ij} = H_i + \sum_{k \neq j} p_{ik} \, E_{kj}$$
H2 = MTTF / 2
H1 = MTTF · MTTR / (MTTF + MTTR)
E21 = H2
E20 = H2 + E10
E10 = H1 + MTTF / (MTTF + MTTR) · E20
E12 = H1
$$\Rightarrow E_{20} = \frac{MTTF^2 + 3\,MTTF \cdot MTTR}{2\,MTTR} \approx \frac{MTTF^2}{2\,MTTR}$$
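The first-passage equations for the mirrored disk pair are small enough to solve by substitution. The sketch below does this numerically; the function name is hypothetical, and the example values are made up for illustration.

```python
def mean_time_to_data_loss(mttf, mttr):
    """Expected time until both disks of a mirrored pair are down (E20)."""
    h2 = mttf / 2.0                    # mean residence time with both disks up
    h1 = mttf * mttr / (mttf + mttr)   # mean residence time with one disk up
    p12 = mttf / (mttf + mttr)         # probability that repair beats the second failure
    # E20 = h2 + E10 and E10 = h1 + p12 * E20  =>  E20 = (h1 + h2) / (1 - p12)
    return (h1 + h2) / (1.0 - p12)


# Hypothetical example: MTTF = 100 time units, MTTR = 1 time unit.
print(mean_time_to_data_loss(100.0, 1.0))
```

The value agrees with the closed form (MTTF² + 3·MTTF·MTTR) / (2·MTTR), which is dominated by MTTF² / (2·MTTR) when repair is much faster than failure.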
Digression: Basics of Queuing Systems (1)
[Figure: customers (requests) arrive over time, wait in a queue, are served at a service station according to a scheduling policy (e.g., FCFS), and depart. Annotations: probability distribution of interarrival time (e.g., M = exponential distribution); probability distribution of service time (e.g., M = exponential distribution); response time (sojourn time) = waiting time + service time; e.g., of type M/M/1/∞/FCFS]
Digression: Basics of Queuing Systems (2)
Classification of queueing systems: A/B/m/K/Z with
  A: distribution of interarrival times (type of arrival process)
  B: distribution of service times
  m: number of service stations with shared queue
  K: capacity of the queue (often assumed to be ∞)
  Z: service scheduling policy (e.g., FCFS, priority-based, etc.)
Measures of interest:
  λ – arrival rate (1/mean of interarrival time distr.)
  X – throughput (departure rate): served requests per time unit
  W – (mean) waiting time in queue
  R – (mean) response time
  S – (mean) service time (with higher moments S², S³, ...)
  ρ – utilization (probability of server being busy)
  N – (mean) queue length, including request in service
Digression: Basics of Queuing Systems (3)
Operational Laws (queuing theory theorems):
1. Utilization law: ρ = X * S
2. Forced flow law: X = λ for ρ < 1
3. Little's law: N = X * R and N − ρ = X * W
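Digression: M/M/1 Queuing Systems
N(t): number of requests in queue (or in service)
[Figure: birth-death chain over states 0, 1, 2, ... with arrival rate λ from n to n+1 and service rate μ from n+1 to n]
λ: arrival rate
μ: service rate
Flow balance equations:
$$\lambda p_0 = \mu p_1 \quad\text{and}\quad (\lambda + \mu)\, p_n = \lambda p_{n-1} + \mu p_{n+1} \ \text{ for } n \ge 1$$
For $\rho := \lambda / \mu < 1$:
$$p_n = (1 - \rho)\, \rho^n \ \text{ for } n \ge 0$$
$$E[N] = \sum_{n \ge 0} n \, p_n = \frac{\rho}{1 - \rho}$$
$$E[R] = \frac{E[N]}{\lambda} = \frac{E[S]}{1 - \rho}$$
Response time distribution:
$$F_R(t) = P[R \le t] = 1 - e^{-t / E[R]}$$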
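The M/M/1 results translate directly into a few lines of code. This sketch computes utilization, mean number of requests, and mean response time, plus the probability of meeting a response time bound; the function names and the example rates are illustrative assumptions.

```python
import math


def mm1(arrival_rate, service_rate):
    """Return (utilization, mean number in system, mean response time)."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    mean_n = rho / (1.0 - rho)
    mean_r = 1.0 / (service_rate - arrival_rate)   # equals E[S] / (1 - rho)
    return rho, mean_n, mean_r


def response_time_cdf(t, mean_r):
    """P[R <= t] for the exponentially distributed M/M/1 response time."""
    return 1.0 - math.exp(-t / mean_r)


# Hypothetical example: 0.5 requests/s arriving, 1 request/s service capacity.
rho, mean_n, mean_r = mm1(0.5, 1.0)
print(rho, mean_n, mean_r, response_time_cdf(5.0, mean_r))
```

The output is consistent with Little's law, since the mean number in the system equals the arrival rate times the mean response time.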
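Digression: M/G/1 Queuing Systems
N(t) at request departure times forms embedded Markov chain
$$E[W] = \frac{\rho \, E[S] \, (1 + C_S^2)}{2\,(1 - \rho)} \quad\text{with}\quad C_S^2 = \frac{Var[S]}{E[S]^2} = \frac{E[S^2] - E[S]^2}{E[S]^2}$$
$$E[R] = E[W] + E[S]$$
$$E[W^2] = 2\,E[W]^2 + \frac{\lambda \, E[S^3]}{3\,(1 - \rho)}$$
$$E[R^2] = E[W^2] + \frac{E[S^2]}{1 - \rho}$$
$$W^*(\theta) = \frac{(1 - \rho)\,\theta}{\theta - \lambda + \lambda \, S^*(\theta)}$$
with Laplace-Stieltjes transform of random variable X:
$$X^*(\theta) = E[e^{-\theta X}] = \int_0^\infty e^{-\theta x} f_X(x)\, dx$$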
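The mean waiting time of an M/G/1 queue depends only on the arrival rate and the first two moments of the service time. The sketch below evaluates the Pollaczek-Khinchine mean-value formula; the function name and the two example service time distributions are illustrative assumptions.

```python
def mg1_mean_waiting_time(arrival_rate, s1, s2):
    """Mean waiting time in an M/G/1 queue.

    s1 is the first moment E[S], s2 the second moment E[S^2] of the service time.
    """
    rho = arrival_rate * s1
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    cs2 = (s2 - s1 * s1) / (s1 * s1)   # squared coefficient of variation
    return rho * s1 * (1.0 + cs2) / (2.0 * (1.0 - rho))


# Exponential service with mean 1 (E[S^2] = 2) versus constant service time 1.
print(mg1_mean_waiting_time(0.5, 1.0, 2.0))
print(mg1_mean_waiting_time(0.5, 1.0, 1.0))
```

With exponential service times the result coincides with the M/M/1 waiting time, while a constant service time halves it, which shows how strongly service time variability drives queueing delay.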
Outline
Part A: WF Specification and Verification
Part B: WF System Architecture and Configuration
What Is It All About?
WF Specification Techniques
Statecharts
CTL and Model Checking
Summary and Open Research Issues
WF Execution Infrastructure
Failure Handling
Stochastic Modeling
WF System Configuration
• Summary and Open Research Issues
Stochastic Model of Workflow System
[Figure: clients send service requests via an ORB to replicated WF servers and App servers.
System config: server types (replicated).
Workload: workflow types, activity types (example branching probabilities 0.5, 0.4, 0.1).
Performance model: Markov model for the load per server; M/G/1 queues for max. throughput and E[waiting time].
Availability model: # replicas, failure rates, restart rates; yields E[downtime], P[degradation].
Performability model combines the performance model and the availability model.]
Stochastic Modelling of Control Flow
[Figure: workflow spec. as statechart and resulting CTMC.
Statechart: states S1 to S5 with transition labels /st!(Act1); [C1] /st!(Act2); [Not(C1)] /st!(Act3); [Act2_DONE] /st!(Act3); [Act3_DONE] /st!(Act4); [C4] /st!(Act5); [not(C4)] /st!(Act3).
CTMC: states 1 to 5 and absorbing state A, with transition probabilities P[C1=true], P[C1=false], P[C4=true], P[C4=false], and 1.]
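Modelling of Loop Iterations
• Assumption: # iterations uniformly distributed over {m..n}
• Expansion of loop states
• Modified transition probabilities
[Figure: the CTMC fragment with the loop between states 3 and 4 (transition probabilities P[C4=true], P[C4=false], P[C1=false], 1) is expanded into a chain of states (3,1), (4,1), (3,2), (4,2), ..., (3,m), (4,m), ..., (3,n), (4,n); transitions within the chain have probability 1, and the loop-exit and loop-continue transitions use p := 1/(n-m+1) and 1-p]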
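Stochastic Load Model: Some Detail
Continuous-time Markov chain (CTMC)
[Figure: CTMC with initial state s0, states si, sj, sk, and absorbing state sA; transitions from si to sj and sk with probabilities pij and pik; residence time Hi in si]
transition probabilities $p_{ij}$
mean state residence times $H_i$
state departure rates $v_i = 1/H_i$
transition rates $q_{ij} = v_i \, p_{ij}$
Mean turnaround time $f_{0A}$ (expected first-passage time for $s_A$) derived by solving:
$$-v_i \, f_{iA} + \sum_{j \neq A,\, j \neq i} q_{ij} \, f_{jA} = -1$$
Expected generated load L derived from Markov reward model:
$$E[L(t)] = \sum_{j \neq A} E_{0j}(t) \sum_{k \neq j} q_{jk} \, L_{jk}$$
with
$$E_{ij}(t) = \frac{1}{v} \sum_{z=0}^{\infty} e^{-vt} \frac{(vt)^z}{z!} \sum_{n=0}^{z-1} p_{ij}^{(n)}$$
and probabilities
$$p_{ij}^{(z)} = \sum_{k \neq A} p_{ik}^{(z-1)} \, p_{kj}$$
for uniformized CTMC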
(Stationary) Availability Model
System states: $X = (X_1, X_2, \ldots, X_k)$ with $X_i \le Y_i$
[Figure: CTMC over the system states (X1, X2, X3) with each Xi in {0, 1, 2}, from (2,2,2) down to (0,0,0). Between neighboring states (..., Xi, ...) and (..., Xi-1, ...) the failure transition has rate λi Xi and the repair transition has rate μi (Yi – (Xi-1))]
(strong) availability = $\sum_{\text{good } X} \pi(X)$ with stationary state prob. $\pi(X)$
Availability Example
• 3 server types:
  – communication server: one failure per month, λ1 = (43200 min)^-1
  – workflow engine: one failure per week, λ2 = (10080 min)^-1
  – application server: one failure per day, λ3 = (1440 min)^-1
• repair rate for each server type: μ1 = μ2 = μ3 = (10 min)^-1
• expected unavailability depending on configuration (Y1, Y2, Y3):

System configuration    Expected downtime per year
(1,1,1)                 71 hours
(2,1,1)                 65 hours
(1,2,1)                 62 hours
(2,2,1)                 60 hours
(1,1,2)                 11 hours
(2,1,2)                 4.8 hours
(1,2,2)                 86 minutes
(2,2,2)                 26 minutes
(2,3,2)                 25 minutes
(2,2,3)                 21 seconds
(2,3,3)                 12 seconds
(3,3,3)                 10 seconds
Non-exponentially Distributed Time-to-Failure and Downtime
Important to capture realistic behavior (e.g., planned maintenance):
Approximate more general distributions of state-residence time by E1,n distribution:
[Figure: state sj with incoming rate qij and outgoing rate qjk is replaced by stages sj0, sj1, sj2, ..., sjn, each with mean residence time H; after sj0 the chain branches with probabilities q and 1-q]
Special case q=0: n exponential stages with mean H behave like Erlang-n distributed state with mean nH
Workflow System Configuration Tool
[Figure: tool components Mapping, Modeling, Calibration, Evaluation, Recommendation, and Monitoring, connected to the Workflow Repository, the Operational Workflow System, the system config., and the Admin]
Hypothetical config → max. throughput, avg. waiting time, expected downtime
Workflow System Configuration Tool
[Figure: tool components Mapping, Modeling, Calibration, Evaluation, Recommendation, and Monitoring, connected to the Workflow Repository, the Operational Workflow System, the system config., and the Admin]
Goals: min(throughput), max(waiting time), max(downtime) + constraints → min-cost config.
Workflow System Configuration Tool
[Figure: tool components Mapping, Modeling, Calibration, Evaluation, Recommendation, and Monitoring, connected to the Workflow Repository, the Operational Workflow System, and the system config.]
Goals: min(throughput), max(waiting time), max(downtime) + constraints → automatic reconfiguration
Prediction Accuracy of Goliat
Benchmark: E-Commerce Order Processing Workflow
[Figure: statechart EC_SC with states EC_INIT_S, NewOrder_S, CreditCardCheck_S, Shipment_S, Notify_INIT_S, Notify_S, Notify_EXIT_S, Delivery_INIT_S, FindStore_S, CheckStore_S, Delivery_EXIT_S, CreditCardCharge_S, Payment_S, and EC_EXIT_S; transitions are guarded by conditions such as [PayByCreditCard and NewOrder_DONE], [PayByBill and NewOrder_DONE], [CreditCardOK and CreditCardCheck_DONE], [CreditCardNotOK and CreditCardCheck_DONE], [ItemsLeft and FindStore_DONE], [ItemAvailable and CheckStore_DONE], and [AllItemsProcessed], and start activities via st!(...)]
Results on Mentor-lite configuration:
[Figure: wait time [msec] (0 to 35) versus arrival rate [min^-1] (0 to 5), comparing Experiment and Goliat]
Multi-class Workloads with Diff QoS
[Figure: online brokerage e-service. Customer types (premium customer, guest (potential future customer), ...) use "channels" (what-if portfolio analyses, stock price info service, ...); requests pass through middleware with class-specific request queues to the backend servers. Open question: class priorities ???]
HEART: Help for Ensuring Acceptable Response Time
Input:
• class-specific arrival rates, service time moments
• class-specific goals, e.g.:
  E[RT(class 1)] ≤ 5 s
  E[RT(class 2)] ≤ 2 s
  Var[RT(class 2)] ≤ 4 s²
  P[RT(class 3) ≤ 5 s] ≥ 0.95
  ...
Output:
class-specific priorities for messaging middleware (MQ Series) for satisfying all goals
Autonomic Computing
Vision: all computer systems must be self-managed, self-organizing, and self-healing
Motivation:
• ambient intelligence (sensors in every room, your body etc.)
• reducing complexity and improving manageability of very large systems
Eight laws:
• know thy self
• configure thy self
• optimize thy self
• heal thy self
• protect thy self
• grow thy self
• know thy neighbor
• help thy users
Role model: biological systems (really ???)
My interpretation: need component design for predictability: self-inspection, self-analysis, self-tuning
Outline
Part A: WF Specification and Verification
Part B: WF System Architecture and Configuration
What Is It All About?
WF Specification Techniques
Statecharts
CTL and Model Checking
Summary and Open Research Issues
WF Execution Infrastructure
Failure Handling
Stochastic Modeling
WF System Configuration
Summary and Open Research Issues
Mentor-lite Prototype
[Figure: Workflow Engine consisting of Statechart Interpreter, CommMgr, and LogMgr, with Workflow Repository, Worklist DB, and Workflow Log; App Wrappers; Worklist Mgt and History Mgt; specification, verification, configuration workbench; Event-Process Chains etc. mapped to statecharts; XML-based connection to other WF engines (SAP etc.)]
Summary and Open Research Issues
Dependable, self-organizing ("autonomic") systems require
• comprehensive data/message/process recovery with failure masking
• and (dynamic) configuration and tuning procedures based on tractable mathematical models
Interesting research topics for graduate students:
• rigorous verification of efficient data/message/process recovery algorithms
• comprehensive & efficient implementation of recovery contracts in WFMS / Web service environment
• guarantees about response time percentiles for multi-class workloads
• dynamic reconfiguration based on transient performability predictions for given time horizon
• comprehensive configuration tool for commercial WFMS / Web service suite
The Future
Workflow technology is successful
"Success is a lousy teacher" (Bill Gates)
Commercial world is driven by time to market
"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate"
(Report of the US President's Technology Advisory Committee)
Need further courageous steps towards
• Provably correct behavior
• Predictably good performance
• High reliability and availability
• Guaranteed quality of results
Recommended Literature
• F. Leymann, D. Roller: Production Workflow – Concepts and Techniques, Prentice Hall, 2000
• W. van der Aalst, K. van Hee: Workflow Management – Models, Methods, and Systems, MIT Press, 2002
• A. Dogac, L. Kalinichenko, T. Özsu, A. Sheth (Eds.): Workflow Management Systems and Interoperability, Springer, 1998
• D. Harel, M. Politi: Modeling Reactive Systems with Statecharts – The Statemate Approach, McGraw Hill, 1998
• E.M. Clarke, O. Grumberg, D. Peled: Model Checking, MIT Press, 2000
• G. Weikum: Towards Guaranteed Quality and Dependability of Information Services, German Database Conf. (BTW), 1999
• G. Weikum, G. Vossen: Transactional Information Systems – Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001
• R. Barga, D. Lomet, G. Weikum: Recovery Guarantees for General Multi-Tier Applications, IEEE CS Data Engineering Conf., 2002
• H.C. Tijms: Stochastic Models – An Algorithmic Approach, Wiley & Sons, 1994
• G. Haring, C. Lindemann, M. Reiser (Eds.): Performance Evaluation – Origins and Directions, Springer, 2000
• M. Gillmann, G. Weikum, W. Wonner: Workflow Management with Service Quality Guarantees, ACM SIGMOD Conf., 2002
• G. Weikum (Editor): Special Issue on Infrastructure for Advanced E-Services, IEEE CS Data Engineering Bulletin, March 2001