© Gerhard Weikum1 Outline Part A: WF Specification and Verification Part B: WF System Architecture and Configuration What Is It All About? WF Specification

1© Gerhard Weikum

Outline

Part A: WF Specification and Verification

Part B: WF System Architecture and Configuration

What Is It All About?

WF Specification Techniques

Statecharts

CTL and Model Checking

Summary and Open Research Issues

WF Execution Infrastructure

• Failure Handling

• Stochastic Modeling

• WF System Configuration

• Summary and Open Research Issues

2© Gerhard Weikum

WFMS Architecture for E-Services


...

WF server type 1

App server type 1

Clients

Comm server

WF server type 2

App server type n

...

3© Gerhard Weikum

Interoperability between WF Systems

WFMS 1 WFMS 2


WF Mediator

<?xml version="1.0"?>
<Activity WFtype="5" Name="RiskAssessmentAct">
  <BusinessData>
    <CreditRequest>
      <CustId>101</CustId>
    </CreditRequest>
  </BusinessData>
  <WorkflowCtrl>
    <Variable Name="Currency" Value="USD"/>
  </WorkflowCtrl>
</Activity>

• wrap WFMS using XML-based interface (e.g., WSDL/WSFL or ???)
• route activity and sub-WF invocations through WF mediator
• same protocol for activities and sub-WFs
• additional functions for sub-WF monitoring
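A minimal sketch of how a mediator might read such an activity invocation message, using only Python's standard `xml.etree.ElementTree` module. The function name `parse_activity` and the shape of the returned dictionary are assumptions made for illustration; the slides do not prescribe a parsing API.

```python
import xml.etree.ElementTree as ET

# The activity invocation message from the slide (quotes normalized).
MESSAGE = """<?xml version="1.0"?>
<Activity WFtype="5" Name="RiskAssessmentAct">
  <BusinessData>
    <CreditRequest>
      <CustId>101</CustId>
    </CreditRequest>
  </BusinessData>
  <WorkflowCtrl>
    <Variable Name="Currency" Value="USD"/>
  </WorkflowCtrl>
</Activity>"""


def parse_activity(xml_text):
    """Extract activity name, WF type, business data, and control variables.

    Hypothetical helper: the dictionary layout is an illustrative choice.
    """
    root = ET.fromstring(xml_text)
    variables = {
        var.get("Name"): var.get("Value")
        for var in root.findall("./WorkflowCtrl/Variable")
    }
    return {
        "name": root.get("Name"),
        "wf_type": root.get("WFtype"),
        "cust_id": root.findtext("./BusinessData/CreditRequest/CustId"),
        "variables": variables,
    }


activity = parse_activity(MESSAGE)
```

Because activities and sub-WFs share one protocol, the same parsing step could serve both kinds of invocation.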

4© Gerhard Weikum

Some “AI-complete” Problems

Grand challenge: service discovery and matchmaking

State of the art: standardized syntax & protocols (à la UDDI) with queries on yellow pages ("business registry")

Needed: semantics & interoperability with automatic reasoning about process/activity interface, behavior, and outcome

standardized ontologies are a step forward, but still far from final goal; we need: Semantic Web + Intelligent Search

5© Gerhard Weikum

Outline





Statecharts




Failure Handling

• Stochastic Modeling



6© Gerhard Weikum

Important System Issues

see, e.g., Mentor-lite, http://www-dbs.cs.uni-sb.de/~mlite/

• Differentiated quality of service and performance guarantees (e.g., class-specific response time and workflow turnaround time)

• World-wide failure masking for exactly-once behavior with easy app development

see, e.g., Phoenix project, http://research.microsoft.com/db/phoenix/

• Scalability, Reliability, Availability, Manageability,...

7© Gerhard Weikum

Your server command (process id #20) has been terminated. Re-run your command (severity 13) in /export/home/WWW/your-reliable-eshop.biz/mb_1300_db.mb1

Place your order

The Need for Failure Masking

Please review and place your order

8© Gerhard Weikum

Long-Lived & Distributed Execution

SelectConf

CheckCost

Go

No

/ Budget:=1000; Trials:=1

[Fok & Eok]/ Cost := ConfFee + TrExpenses

[!Found]

[Found]/ Cost:=0

CheckConfFee

CheckTrExpenses

Atomic (transactional) write of persistent state & context guarantees forward recovery

9© Gerhard Weikum

Digression: Two-Phase Commit Protocol (2PC) for Distributed Atomic Transactions

Message flow between the coordinator and agents 1 and 2:

1. Coordinator: write "begin"; send "prepare" to agent 1 and agent 2
2. Agents 1 and 2: force log entries & write "prepared"; send "yes" to coordinator
3. Coordinator: write "commit"; send "commit" to agent 1 and agent 2
4. Agents 1 and 2: write "commit"; send "ack" to coordinator
5. Coordinator: write "end"
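The message flow of 2PC can be sketched as a small, failure-free simulation. This is an illustration under simplifying assumptions, not a production protocol: messages are plain function steps, logs are in-memory lists, and timeouts and crashes are left out. The function name `two_phase_commit` and the log layout are hypothetical; the "sorry" vote that leads to an abort is taken from the statechart for 2PC.

```python
def two_phase_commit(votes):
    """Simulate one failure-free 2PC round.

    votes maps an agent name to True ("yes") or False ("sorry").
    Returns the decision and the log entries written by each participant.
    """
    log = {"coordinator": ["begin"]}
    log.update({agent: [] for agent in votes})

    # Phase 1: the coordinator sends "prepare"; an agent that votes "yes"
    # first forces its log entries and writes "prepared".
    for agent, vote in votes.items():
        if vote:
            log[agent].append("prepared")

    # Phase 2: commit only if every agent voted "yes", otherwise abort.
    decision = "commit" if all(votes.values()) else "abort"
    log["coordinator"].append(decision)
    for agent in votes:
        log[agent].append(decision)  # agent logs the outcome, then sends "ack"

    # All acks received: the coordinator may forget the transaction.
    log["coordinator"].append("end")
    return decision, log


commit_decision, commit_log = two_phase_commit({"agent1": True, "agent2": True})
abort_decision, abort_log = two_phase_commit({"agent1": True, "agent2": False})
```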

10© Gerhard Weikum

Statechart for 2PC Protocol

[Figure: statechart with three orthogonal components: coordinator, agent 1, agent 2]

Coordinator states: initial, collecting, C-pending, A-pending, committed, aborted, forgotten
Coordinator transition labels:
  / prepare1; prepare2
  yes1 & yes2 / commit1; commit2
  sorry1 | sorry2 / abort1; abort2
  ack1 & ack2 (out of C-pending and out of A-pending)
  T|F / commit1; commit2
  T|F / abort1; abort2
  T|F

Agent i (i = 1, 2) states: initial_i, prepared_i, committed_i, aborted_i
Agent i transition labels:
  prepare_i / yes_i
  prepare_i / sorry_i
  commit_i / ack_i
  abort_i / ack_i
  T_i|F_i

11© Gerhard Weikum

Long-lived & Distributed Execution

SelectConf

CheckCost

Go

No


[Fok & Eok]/ Cost := ConfFee + TrExpenses

[Found]/ Cost:=0

CheckConfFee

CheckTrExpenses

[!Found]

Queued transactions & 2PC guarantee consistency of distributed WFMS & exactly-once execution

12© Gerhard Weikum

From ACID To Recovery Guarantees

Problem: A client that does not receive a return code from a transactional server cannot easily find out the transaction outcome and may be tempted to re-initiate the (non-idempotent) transaction, thus producing unacceptable effects.

Approach: In addition to atomicity, the transactional server needs to guarantee the exactly-once execution of the transaction, where execution includes the server's reply message.

(almost) perfect failure masking

13© Gerhard Weikum

Stateless Applications Based on Queues

Stateless application (running on client, app server, or data server):
• user sends input message
• app program sends request message to data server
• data server executes transaction and sends reply message to app
• app program sends output message to user
There are no conversations with the user within a transaction, and subsequent transactions are independent.

Solution: Queued Transactions
• message recovery by queue manager with persistent, recoverable message queues
• exactly-once execution by enclosing message dequeue and enqueue into transaction

14© Gerhard Weikum

Illustration of 2-Tier Queued Transaction

User

ApplicationProcess(Client)

DatabaseServer ...

input output

enqueuerequest

dequeuereply

...

dequeuerequest

enqueuereply

server transaction

15© Gerhard Weikum

Illustration of 3-Tier Queued Transaction

User

Client

DatabaseServer ...

input output

enqueuerequest

dequeuereply

...

distributed server transaction

ApplicationServer ...

dequeuerequest

enqueuereply

16© Gerhard Weikum

Correctness of Queued Transaction Protocol

Theorem: With the queued transaction protocol for stateless applications, the following guarantees hold:
1. Once the user-input transaction is committed, a request is executed by the server exactly once.
2. Once the user-input transaction is committed, the user output is delivered at least once.
3. If user output is testable, the user output is delivered exactly once, provided the user-input transaction has been committed.

Inherent (small window of) uncertainty:
• (last) user input may get lost
• (last) user output may be sent more than once (can be eliminated with testable output, using special hardware)

17© Gerhard Weikum

Client During Normal Operation

user-input processing by client:
  begin transaction;
    enqueue (request);
  commit transaction;

user-output processing by client:
  wait until reply queue is not empty;
  begin transaction;
    dequeue (reply);
    while user has not acknowledged the reply or sent the next request do
      present reply to user;
    end /*while*/;
  commit transaction;

18© Gerhard Weikum

Server During Normal Operation

request-reply processing by data server:
  begin transaction;
    dequeue (request);
    perform data operations and generate reply;
    enqueue (reply);
  commit transaction;

19© Gerhard Weikum

Client and Server Restart

Client restart:
  check reply queue;
  if not empty then
    process reply like during normal operation;
  end /*if*/;

Server restart:
  check request queue;
  if not empty then
    initiate processing of requests like during normal operation
  end /*if*/;
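The server side of the queued transaction protocol can be sketched with Python's standard `sqlite3` module, where two tables stand in for the persistent request and reply queues. This is an illustration under assumptions: the table names, the helper names (`open_queues`, `client_enqueue`, `server_step`), the upper-casing "data operation", and the `crash_before_commit` switch are all hypothetical, and an in-memory database replaces a real recoverable queue manager.

```python
import sqlite3


def open_queues():
    """Create the two queues as tables in one database (here: in memory)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE request_q (id INTEGER PRIMARY KEY, body TEXT)")
    db.execute("CREATE TABLE reply_q (id INTEGER PRIMARY KEY, body TEXT)")
    db.commit()
    return db


def client_enqueue(db, body):
    """User-input processing: enqueue the request in its own transaction."""
    with db:  # commits on success, rolls back on an exception
        db.execute("INSERT INTO request_q (body) VALUES (?)", (body,))


def server_step(db, crash_before_commit=False):
    """Dequeue, process, and enqueue the reply in ONE transaction.

    Returns True if a request was processed and committed.
    """
    try:
        with db:
            row = db.execute(
                "SELECT id, body FROM request_q ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return False  # nothing to do (also the restart check)
            req_id, body = row
            db.execute("DELETE FROM request_q WHERE id = ?", (req_id,))
            reply = body.upper()  # stands in for the real data operations
            db.execute(
                "INSERT INTO reply_q (id, body) VALUES (?, ?)", (req_id, reply)
            )
            if crash_before_commit:
                raise RuntimeError("simulated crash before commit")
    except RuntimeError:
        return False  # rollback: the request is back in the queue, no reply
    return True
```

Usage: after `client_enqueue(db, "book flight")`, a call `server_step(db, crash_before_commit=True)` leaves the request in the queue and produces no reply; a subsequent `server_step(db)` (the restart case) processes it exactly once.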

20© Gerhard Weikum

Pseudo-Conversational Transactions for Stateful Applications

• Queue-based message recovery for entire conversations
• Conversational "logical unit of work" broken down into chain of stateless transactions with (small) application state maintained in the queue (analogously to cookies, but more general and much more reliable)
• Dequeue of reply and enqueue of next request combined into one transaction for exactly-once execution guarantee
• Good for apps such as travel reservation, electronic shopping, etc.

21© Gerhard Weikum

Illustration of Pseudo-Conversational Transactions

User

ApplicationProcess(Client)

DatabaseServer ...

...

...

...

22© Gerhard Weikum

Correctness of Pseudo-Conversational Transaction Protocol

Theorem: With the queue-based message recovery for conversational multi-step transaction chains, the following guarantees hold:
1. Once the initial user-input transaction that starts the entire conversation is committed, the entire transaction chain is executed by the server exactly once.
2. Once the initial user-input transaction is committed, each user-output message throughout the conversation is delivered at least once.
3. If user output is testable, each user-output message is delivered exactly once, provided the initial user-input transaction has been committed.

23© Gerhard Weikum

Queue-based Message Recovery for Exactly-Once Workflow Execution

At end of activity, execute transaction that combines:
• writing the activity's modifications of workflow state and context to persistent store
• writing the state modifications that result from the firing of outgoing transitions to persistent store
• writing the context modifications that result from the actions of firing transitions to persistent store
• notifying the follow-up activities by enqueueing messages

Newly invoked activity executes transaction that combines:
• dequeueing of notification message
• writing the workflow state and context to persistent store

24© Gerhard Weikum

Use of Queued Transactions in Travel Planning Workflow

SelectConference

CheckFlight

CheckHotel

CheckCost

Go

No

/ Budget:=1000; Trials:=1;

/ Cost := ConfFee + TravelCost

[Cost ≤ Budget]

[Cost > Budget & Trials < 3] / Trials++

[Cost > Budget & Trials ≥ 3]

[!ConfFound]

[ConfFound]/ Cost:=0

SelectTutorials

ComputeFee

CheckAirfare

CheckHotel

CheckTravelCost

CheckConfFee

Queued transactions & 2PC guarantee consistency of distributed WFMS & exactly-once execution

25© Gerhard Weikum

Compensation of Invoked Applications

SelectConf

CheckCost

Go

No


[Fok & Eok] / Cost := ConfFee + TrExpenses

[!Found]

[Found]/ Cost:=0

CheckConfFee

CheckTrExpenses

Provide compensating steps (CancelTravel, CancelConf) & invoke steps (mostly) automatically

26© Gerhard Weikum

Meaningful Compensation Spheres (1)

SelectConf

Go

No



[!Found]

[Found]/ Cost:=0

CheckConfFee

CheckTrExpenses

Arbitrary compensation spheres may leave workflow in non-resumable configuration!

CheckCost

?

27© Gerhard Weikum

Meaningful Compensation Spheres (2)

SelectConf

Go

No



[!Found]

[Found]/ Cost:=0

CheckConfFee

CheckTrExpenses

Restrict atomicity spheres to a single state and its enclosed activities & apps

CheckCost

!

28© Gerhard Weikum

The Need for Multi-Tier Application Recovery

Realistic example: Expedia or Travelocity style multi-tier service

[Figure: multi-tier architecture with clients, Expedia Web server, Expedia app servers, Sabre app servers, Amadeus app servers, and data servers]

29© Gerhard Weikum

Need for Integrated & Application-transparent Data, Message, and Process Recovery

?Data server

Users

Web app server

Business portal server

Other clients

for largely autonomous components

30© Gerhard Weikum

Efficient Solution: Recovery Contracts

For each process:
• log all non-deterministic events (non-forced)

Upon interaction between sender and receiver:
• sender promises recoverable state and message (e.g., via replay) and resends message if necessary
• receiver promises duplicate elimination and recoverable state when releasing sender promise

+ low run-time overhead: one forced log write per multi-tier request/reply
+ fast restart (→ high availability): rebuild process state & message table and replay
+ independent recovery of autonomous components

Prototype implementation for IE6 / Apache / PHP / MySQL, plus COM+-based implementation work in the Phoenix project at MSR

31© Gerhard Weikum

Committed Interaction Contract (CIC)

• Sender Obligation S1: persistent state as of message time or later
• Sender Obligation S2: persistent message
  • S2a: resend message periodically until released by receiver
  • S2b: resend message upon explicit request until released
• Sender Obligation S3: unique messages
• Receiver Obligation R1: duplicate message elimination
• Receiver Obligation R2: persistent state
  • R2a: persistent state as of message time or later before releasing sender from S2a (stable interaction)
  • R2b: persistent state & message before releasing sender from S2b (installed interaction)

Immediately Committed Interaction (ICIC): Receiver releases sender from S2a, S2b immediately (similar to optimized 2PC); crucial for autonomous recovery
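The interplay of sender obligations S2/S3 and receiver obligation R1 can be sketched as follows. This is a toy model under stated assumptions, not the Phoenix implementation: the class names, the `lose_release` switch that simulates a lost release notification, and the in-memory dictionaries standing in for persistent state are all hypothetical.

```python
class Receiver:
    """Eliminates duplicates (R1) and keeps state that stands in for R2."""

    def __init__(self):
        self.seen = set()   # ids of messages already applied
        self.state = []     # stands in for the receiver's persistent state

    def receive(self, msg_id, payload):
        if msg_id not in self.seen:      # R1: apply each message only once
            self.seen.add(msg_id)
            self.state.append(payload)
        return msg_id                    # release notification for the sender


class Sender:
    """Keeps every message until released (S2) under a unique id (S3)."""

    def __init__(self):
        self.next_id = 0
        self.unreleased = {}  # stands in for the persistent message store

    def send(self, payload):
        self.next_id += 1
        self.unreleased[self.next_id] = payload
        return self.next_id

    def resend_all(self, receiver, lose_release=False):
        """S2a: resend everything that has not been released yet."""
        for msg_id, payload in list(self.unreleased.items()):
            released = receiver.receive(msg_id, payload)
            if not lose_release:         # the release may get lost in transit
                del self.unreleased[released]


sender, receiver = Sender(), Receiver()
sender.send("reserve seat 12A")
sender.resend_all(receiver, lose_release=True)   # delivered, release lost
sender.resend_all(receiver)                      # duplicate delivery
```

After both rounds the receiver has applied the message once and the sender holds no unreleased messages, which is the exactly-once effect the contract aims for.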

32© Gerhard Weikum

Statechart for CIC

[Figure: statechart with two orthogonal components, sender and receiver; several transitions carry the guard [true]]

Sender states: running; interaction recoverable (S1, S2 promised); message sent; interaction (known to be) stable (S2a released); interaction (known to be) installed (S2b released)
Sender transition labels: / make state and message persistent; / message transfer; stability notification; commit notification

Receiver states: running; message received; interaction stable (R2a promised); interaction installed (R2b promised)
Receiver transition labels: message transfer; / log message arrival; / stability notification; [interaction stable] / make state persistent; / install notification

33© Gerhard Weikum

Statechart for ICIC

[Figure: statechart with two orthogonal components, sender and receiver]

Sender states: running; interaction recoverable (S1, S2 promised); message sent; interaction (known to be) installed (S2 released)
Sender transition labels: / make state and message persistent; / message transfer; stability and install notification

Receiver states: running; message received; interaction installed (R2 promised)
Receiver transition labels: message transfer; [true] / make state persistent; / stability and install notification

34© Gerhard Weikum

External Interaction Contract (XIC) and Transactional Interaction Contract (TIC)

XIC:
• input from user: receiver promises ICIC, sender doesn't
• output to user: sender promises ICIC, receiver doesn't
Consequence: a crash may lead to lost input or duplicated output (for a small but inherently unavoidable window of vulnerability)

TIC:
Receiver of transactional request promises:
• atomic state transition
• faithful reply message
• persistent reply message
Sender of transactional request promises:
• persistent state and commit request message
• unique messages

35© Gerhard Weikum

Special Case: Client-Server Application Recovery

[Figure, during normal operation: message flow among user, application process (client), 2nd app process, and database server, with input, output, request, and reply messages]

[Figure, during client restart: after a crash the client replays input, request, and reply messages; the interaction with the database server is marked "?"]

36© Gerhard Weikum

General Considerations for Client-Server Stateful Application Recovery

• Message logging for message recovery and deterministic program replay (of piecewise deterministic program)
• Installation points for process recovery and reduced program replay
• Server processes concurrent threads on behalf of many clients
• Server "commits its state" upon sending a reply to a client
• Forced logging should be minimized
• Server should be able to perform independent recovery

37© Gerhard Weikum

Server Reply Logging Method

• Client and server each
  • maintain a message lookup table (MLT) and
  • write message log entries to a stable log
• Client performs lazy, non-forced logging, periodically creates an installation point, and force-logs user-input messages
• Server forces its log buffer before sending a reply message
• Server recovery rebuilds the message lookup table and replays incomplete requests to produce replies (may need logging of read/write interleaving among threads)
• Client recovery rebuilds the MLT, reloads the app from the last installation point, and replays the application, intercepting message events and obtaining the contents of messages from the local MLT or the server
• Client sends stability notifications to facilitate server log truncation

38© Gerhard Weikum

Data Structures for Server Reply Logging

[Figure: client and server each with a stable log file and a message lookup table]

Server (force log upon reply):
  MSN  Type
  ...
  10   request
  20   request
  30   reply
  40   reply
  ...
  70   request
  80   reply

Client (lazy logging):
  MSN  Type
  15   input
  20   request
  40   reply
  45   output
  ...  installation point
  65   input
  70   request

39© Gerhard Weikum

Replaying Incomplete Requests with Server Reply Logging

client

server

MSN Type ...10 request20 request30 reply

...MSN Type15 input20 request

... 15 20

10 20 30

... R(x)W(x)R(x)R(y)W(y)R(y) ...

40© Gerhard Weikum

Log Truncation with Server Reply Logging

client c

server

... ...MSN Type15 input20 request40 reply45 output...

15 20

20 40 70

request 70 +stability notification

RedoMSN for client c

70 request

80

40

20 4020 4020 4020 40

server log

other clients

15

20 40

45

70 80

client log

client message lookup table

41© Gerhard Weikum

Efficient Multi-tier Application Recovery and Failure Masking

[Figure: multi-tier architecture with clients, Expedia app servers, Sabre app servers, Amadeus app servers, and data servers]

10 forced log writes:
• 1 user request at client
• 4 replies at data servers (transactional ICs)
• 3 replies at external app servers (ICICs)
• 2 app server replies at Web server (ICICs)
• no forced logging between Web server and app server in same "recovery ensemble" (CIC)

as opposed to 32 forced log writes with 2PC for every sender-receiver pair

altogether 16 messages (8 requests + 8 replies) per user request

42© Gerhard Weikum

Additional System Guarantees for Workflows

Exactly-once execution guarantees to preserve the guaranteed semantic properties in a failure-prone, distributed system environment

High availability through server and data replication

Scalable performance

Guaranteed performance, e.g.: response time < 5 seconds with probability 0.95 for 1000 concurrently active workflows
→ auto-tuning and zero-admin

43© Gerhard Weikum

Outline





Statecharts




Failure Handling

Stochastic Modeling



44© Gerhard Weikum

Internal Server Error. Our system administrator has been notified. Please try later again.

Check Availability(Look-Up Will Take 8-25 Seconds)

The Need for Performance and QoS Guarantees

45© Gerhard Weikum

From Best Effort To Performance & QoS Guarantees

”Our ability to analyze and predict the performance of the enormously complex software systems ...are painfully inadequate”

(Report of the US President’s Technology Advisory Committee)

• Very slow servers are like unavailable servers
• Tuning for peak load requires predictability of the (workload, config) → performance function
• Self-tuning requires mathematical models
• Stochastic guarantees for huge #clients: P[response time ≤ 5 s] > 0.95

46© Gerhard Weikum

WFMS Architecture for E-Services


...

WF server type 1

App server type 1

Clients

Comm server

WF server type 2

App server type n

...

47© Gerhard Weikum

Digression: Markov Chains

A discrete-time finite-state Markov chain is a pair $(\Sigma, p)$ with a state set $\Sigma = \{s_1, \ldots, s_n\}$ and a transition probability function $p: \Sigma \times \Sigma \to [0,1]$ with the property $\sum_j p_{ij} = 1$ for all $i$, where $p_{ij} := p(s_i, s_j)$.

A Markov chain is called ergodic (stationary) if for each state $s_j$ the limit

$$p_j := \lim_{t \to \infty} p_{ij}^{(t)}$$

exists and is independent of $s_i$, with

$$p_{ij}^{(t)} := \sum_k p_{ik}^{(t-1)} \, p_{kj} \ \text{ for } t > 1 \quad \text{and} \quad p_{ij}^{(t)} := p_{ij} \ \text{ for } t = 1.$$

For an ergodic finite-state Markov chain, the stationary state probabilities $p_j$ can be computed by solving the linear equation system:

$$p_j = \sum_i p_i \, p_{ij} \ \text{ for all } j, \qquad \sum_j p_j = 1$$

48© Gerhard Weikum

Markov Chain Example

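States: 0: sunny, 1: cloudy, 2: rainy

Transition probabilities:
  sunny → sunny 0.8, sunny → cloudy 0.2
  cloudy → sunny 0.5, cloudy → rainy 0.5
  rainy → sunny 0.4, rainy → cloudy 0.3, rainy → rainy 0.3

$$p_0 = 0.8\,p_0 + 0.5\,p_1 + 0.4\,p_2$$
$$p_1 = 0.2\,p_0 + 0.3\,p_2$$
$$p_2 = 0.5\,p_1 + 0.3\,p_2$$
$$p_0 + p_1 + p_2 = 1$$

Solving these equations gives $p_0 = 55/79 \approx 0.696$, $p_1 = 14/79 \approx 0.177$, $p_2 = 10/79 \approx 0.127$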
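The balance equations of the weather example can be solved exactly with rational arithmetic. The sketch below is an illustration only: the helper names `solve` and `stationary` are hypothetical, and a small Gauss-Jordan elimination replaces a numerical library so that the block stays self-contained. It drops one (redundant) balance equation and adds the normalization condition.

```python
from fractions import Fraction


def solve(a, b):
    """Gauss-Jordan elimination for a square system a x = b over Fractions."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def stationary(P):
    """Stationary probabilities of an ergodic chain with transition matrix P."""
    n = len(P)
    # p_j = sum_i p_i P[i][j] for j = 0..n-2, plus sum_j p_j = 1.
    a = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n - 1)]
    a.append([1] * n)
    b = [0] * (n - 1) + [1]
    return solve(a, b)


# Rows: sunny, cloudy, rainy (transition probabilities from the example).
P = [
    [Fraction(8, 10), Fraction(2, 10), Fraction(0)],
    [Fraction(5, 10), Fraction(0), Fraction(5, 10)],
    [Fraction(4, 10), Fraction(3, 10), Fraction(3, 10)],
]
p = stationary(P)  # [55/79, 14/79, 10/79], i.e. about 0.696, 0.177, 0.127
```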

49© Gerhard Weikum

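Digression: Continuous Time Markov Chains

A finite-state continuous-time Markov chain (CTMC) is a pair $(\Sigma, q)$ with a state set $\Sigma = \{s_1, \ldots, s_n\}$ and transition rates $q: \Sigma \times \Sigma \to \mathbb{R}_0^+$ with

$$q_{ij} := \lim_{\Delta t \to 0} \frac{P[X(t + \Delta t) = j \mid X(t) = i]}{\Delta t}$$

For an ergodic CTMC the stationary state probabilities $p_j$ can be computed by solving the system of linear flow balance equations:

$$p_j \sum_{k \ne j} q_{jk} = \sum_{i \ne j} p_i \, q_{ij} \ \text{ for all } j \qquad \text{and} \qquad \sum_j p_j = 1$$

A CTMC can be "factorized" into a discrete-time Markov chain with transition probabilities

$$p_{ij} = \frac{q_{ij}}{\sum_k q_{ik}}$$

and exponentially distributed state residence times

$$P[\text{time in state } i \le t] = 1 - e^{-t / H_i} \quad \text{with} \quad H_i = \frac{1}{\sum_k q_{ik}} = E[\text{time in state } i]$$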

50© Gerhard Weikum

CTMC Example 1: Stationary Availability

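Single server:

States: 0: down, 1: up; failure rate 1/MTTF (1 → 0), repair rate 1/MTTR (0 → 1)

$$p_0 / MTTR = p_1 / MTTF$$
$$p_1 / MTTF = p_0 / MTTR$$
$$p_0 + p_1 = 1$$

availability of server:

$$p_1 = \frac{MTTF}{MTTF + MTTR}$$

Mirrored server pair:

States: 2: both up, 1: 1 up, 1 down, 0: both down; failure rates 2/MTTF (2 → 1) and 1/MTTF (1 → 0), repair rate 1/MTTR (1 → 2 and 0 → 1)

$$p_1 / MTTR = 2\,p_2 / MTTF$$
$$2\,p_2 / MTTF + p_0 / MTTR = p_1 / MTTR + p_1 / MTTF$$
$$p_1 / MTTF = p_0 / MTTR$$
$$p_0 + p_1 + p_2 = 1$$

availability of server pair:

$$p_1 + p_2 = 1 - p_0 = 1 - \frac{2\,MTTR^2}{MTTF^2 + 2\,MTTF \cdot MTTR + 2\,MTTR^2}$$

only transient, repairable failures
availability = P[system is operational at random time point]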
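The two availability models can be evaluated directly from their balance equations. The sketch below is illustrative: the function names and the sample values MTTF = 1000 and MTTR = 10 (in any common time unit) are assumptions, not figures from the slides.

```python
from fractions import Fraction


def single_availability(mttf, mttr):
    """p1 for the two-state chain (0: down, 1: up)."""
    return Fraction(mttf, mttf + mttr)


def pair_availability(mttf, mttr):
    """p1 + p2 for the mirrored pair (2: both up, 1: one up, 0: both down)."""
    # Unnormalized stationary weights, derived from the balance equations.
    w2 = Fraction(1)
    w1 = 2 * w2 * Fraction(mttr, mttf)   # from p1 / MTTR = 2 p2 / MTTF
    w0 = w1 * Fraction(mttr, mttf)       # from p1 / MTTF = p0 / MTTR
    total = w0 + w1 + w2
    return (w1 + w2) / total


# Hypothetical sample values: MTTF = 1000, MTTR = 10.
a_single = single_availability(1000, 10)   # 100/101
a_pair = pair_availability(1000, 10)       # 5100/5101
```

With these sample values, mirroring cuts the unavailability from 1/101 to 1/5101.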

51© Gerhard Weikum

CTMC Example 2: Reliability

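Mirrored disk pair:

States: 2: both up, 1: 1 up, 1 down, 0: both down (absorbing); failure rates 2/MTTF (2 → 1) and 1/MTTF (1 → 0), repair rate 1/MTTR (1 → 2)

some repairable, some non-repairable failures

reliability = P[lifetime of system ≥ t] or E[lifetime]

$E_{ij}$ := E[time between entering i and entering j], with

$$E_{ij} = H_i + \sum_{k \ne j} p_{ik} \, E_{kj}$$

$E_{20}$ = E[time until absorbing state is reached from initial state]

$$H_2 = MTTF / 2, \qquad H_1 = \frac{MTTF \cdot MTTR}{MTTF + MTTR}$$
$$E_{21} = H_2$$
$$E_{20} = H_2 + E_{10}$$
$$E_{10} = H_1 + \frac{MTTF}{MTTF + MTTR} \, E_{20}$$
$$E_{12} = H_1$$

$$E_{20} = \frac{MTTF^2 + 3\,MTTF \cdot MTTR}{2\,MTTR} \approx \frac{MTTF^2}{2\,MTTR}$$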
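The expected lifetime of the mirrored disk pair follows from the two first-passage equations for E20 and E10. A minimal sketch, assuming hypothetical sample values MTTF = 1000 and MTTR = 10 and a hypothetical function name:

```python
from fractions import Fraction


def mean_time_to_failure_of_pair(mttf, mttr):
    """E20: expected time from 'both up' until 'both down' is reached."""
    h2 = Fraction(mttf, 2)                    # mean residence time in state 2
    h1 = Fraction(mttf * mttr, mttf + mttr)   # mean residence time in state 1
    p12 = Fraction(mttf, mttf + mttr)         # probability of repair before 2nd failure
    # E20 = H2 + E10 and E10 = H1 + p12 * E20  =>  E20 = (H2 + H1) / (1 - p12)
    return (h2 + h1) / (1 - p12)


e20 = mean_time_to_failure_of_pair(1000, 10)   # 51500
```

The result agrees with the closed form (MTTF² + 3·MTTF·MTTR) / (2·MTTR) = (1000000 + 30000) / 20.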

52© Gerhard Weikum

Digression: Basics of Queuing Systems (1)

[Figure: timeline of a request from arrival to departure; response time (sojourn time) = waiting time + service time]

Components of a queueing system, e.g., of type M/M/1/∞/FCFS:
• customers (requests)
• prob. distr. of interarrival time (e.g.: M = exp. distr.)
• queue
• scheduling policy (e.g.: FCFS)
• service station
• prob. distr. of service time (e.g.: M = exp. distr.)

53© Gerhard Weikum

Digression: Basics of Queuing Systems (2)

Classification of queueing systems: A/B/m/K/Z with
  A: distribution of interarrival times (type of arrival process)
  B: distribution of service times
  m: number of service stations with shared queue
  K: capacity of the queue (often assumed to be ∞)
  Z: service scheduling policy (e.g., FCFS, priority-based, etc.)

Measures of interest:
  λ: arrival rate (1/mean of interarrival time distr.)
  X: throughput (departure rate): served requests per time unit
  W: (mean) waiting time in queue
  R: (mean) response time
  S: (mean) service time (with higher moments S², S³, ...)
  ρ: utilization (probability of server being busy)
  N: (mean) queue length, including request in service

54© Gerhard Weikum

Digression: Basics of Queuing Systems (3)

Operational Laws (queuing theory theorems):

1. Utilization law: $\rho = X \cdot S$

2. Forced flow law: $X = \lambda$ for $\rho < 1$

3. Little's law: $N = X \cdot R$ and $N - \rho = X \cdot W$

55© Gerhard Weikum

Digression: M/M/1 Queuing Systems

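N(t): number of requests in queue (or in service)

[Figure: birth-death chain over states 0, 1, 2, ... with arrival rate λ and service rate μ]

λ: arrival rate
μ: service rate

flow balance equations:

$$\lambda \, p_0 = \mu \, p_1 \quad \text{and} \quad (\lambda + \mu) \, p_n = \lambda \, p_{n-1} + \mu \, p_{n+1} \ \text{ for } n \ge 1$$

for $\rho := \lambda / \mu < 1$:

$$p_n = (1 - \rho) \, \rho^n \ \text{ for } n \ge 0$$
$$E[N] = \sum_{n=0}^{\infty} n \, p_n = \frac{\rho}{1 - \rho}$$
$$E[R] = \frac{E[N]}{\lambda} = \frac{E[S]}{1 - \rho}$$

response time distribution:

$$F_R(t) = P[R \le t] = 1 - e^{-t / E[R]}$$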
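The M/M/1 formulas translate into a few lines of code. The sketch below is illustrative; the function names and the sample rates (8 arrivals and 10 service completions per time unit) are assumptions.

```python
import math


def mm1(arrival_rate, service_rate):
    """Return (utilization, E[N], E[R]) for a stable M/M/1 queue."""
    rho = arrival_rate / service_rate
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be < 1")
    mean_n = rho / (1 - rho)                      # E[N] = rho / (1 - rho)
    mean_r = 1 / (service_rate - arrival_rate)    # E[R] = E[S] / (1 - rho)
    return rho, mean_n, mean_r


def response_time_cdf(t, mean_r):
    """P[R <= t] for the exponentially distributed M/M/1 response time."""
    return 1 - math.exp(-t / mean_r)


def response_time_percentile(q, mean_r):
    """Smallest t with P[R <= t] >= q, obtained by inverting the CDF."""
    return -mean_r * math.log(1 - q)


# Hypothetical sample rates: 8 arrivals and 10 service completions per time unit.
rho, mean_n, mean_r = mm1(8, 10)
t95 = response_time_percentile(0.95, mean_r)
```

A percentile such as `t95` is the kind of quantity a stochastic guarantee of the form P[response time ≤ t] > 0.95 refers to.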

56© Gerhard Weikum

Digression: M/G/1 Queuing Systems

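N(t) at request departure times forms embedded Markov chain

$$E[W] = \frac{\rho \, E[S] \, (1 + C_S^2)}{2 \, (1 - \rho)} \quad \text{with} \quad C_S^2 = \frac{Var[S]}{E[S]^2} = \frac{E[S^2] - E[S]^2}{E[S]^2}$$

$$E[R] = E[W] + E[S]$$

$$E[W^2] = 2 \, E[W]^2 + \frac{\lambda \, E[S^3]}{3 \, (1 - \rho)}$$

$$E[R^2] = E[W^2] + \frac{E[S^2]}{1 - \rho}$$

$$W^*(\theta) = \frac{\theta \, (1 - \rho)}{\theta - \lambda \, [1 - S^*(\theta)]}$$

with Laplace-Stieltjes transform of random variable X:

$$X^*(\theta) = E[e^{-\theta X}] = \int_0^{\infty} e^{-\theta x} f_X(x) \, dx$$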
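The mean-value formulas for M/G/1 need only the first two moments of the service time. A minimal sketch with a hypothetical function name and sample values; with exponential service times it reproduces the M/M/1 result, and with deterministic service times the waiting time is halved.

```python
def mg1_mean_times(arrival_rate, mean_s, second_moment_s):
    """Return (E[W], E[R]) for a stable M/G/1 queue (Pollaczek-Khinchine)."""
    rho = arrival_rate * mean_s
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be < 1")
    # Squared coefficient of variation of the service time.
    scv = (second_moment_s - mean_s ** 2) / mean_s ** 2
    mean_w = rho * mean_s * (1 + scv) / (2 * (1 - rho))
    return mean_w, mean_w + mean_s


# Hypothetical sample values: arrival rate 8, mean service time 0.1.
# Exponential service: E[S^2] = 2 E[S]^2 = 0.02.
w_exp, r_exp = mg1_mean_times(8, 0.1, 0.02)
# Deterministic service: E[S^2] = E[S]^2 = 0.01.
w_det, r_det = mg1_mean_times(8, 0.1, 0.01)
```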

57© Gerhard Weikum

Outline





Statecharts




Failure Handling

Stochastic Modeling

WF System Configuration


58© Gerhard Weikum

Stochastic Model of Workflow System

[Figure: clients, ORB, replicated WF servers, and replicated app servers; workflow-level Markov model with branching probabilities 0.5, 0.4, 0.1 generating service requests]

• System config: server types (replicated)
• Workload: workflow types, activity types
• Performance model: Markov model → load per server; M/G/1 queues → max. throughput, E[waiting time]
• Availability model: # replicas, failure rates, restart rates → E[downtime], P[degradation]
• Performability model

59© Gerhard Weikum

Stochastic Modelling of Control Flow

workflow spec. as statechart → resulting CTMC

S1

S3

S2

S4

S5

2

5

4

3

1

1

1

[C1]

/st!(Act2)

P[C1=true]

[C4]/st!(Act5)

[Act3_DONE]/st!(Act4)

[not(C4)]/st!

(Act3)

[Act2_DONE]/st!(Act3)

[Not(C1)]/st!

(Act3)

/st!(Act1)

A1

P[C4=true]

P[C4=false]

P[C1=false]

60© Gerhard Weikum

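Modelling of Loop Iterations

• Assumption: # iterations uniformly distributed over {m..n}
• Expansion of loop states: loop states 3 and 4 are replaced by copies (3,1), (3,2), ..., (3,m), ..., (3,n) and (4,1), (4,2), ..., (4,m), ..., (4,n)
• Modified transition probabilities: with p := 1/(n-m+1), transitions in the expanded chain are labeled 1, p, and 1-p

[Figure: CTMC before and after loop expansion, with transition probabilities P[C1=false], P[C4=true], P[C4=false]]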

61© Gerhard Weikum

Stochastic Load Model: Some Detail

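[Figure: CTMC with initial state s0, states si, sj, sk, and absorbing state sA; transitions out of si labeled pij and pik; residence time Hi]

Continuous-time Markov chain (CTMC) with
• transition probabilities $p_{ij}$
• mean state residence times $H_i$
• state departure rates $v_i = 1 / H_i$
• transition rates $q_{ij} = v_i \, p_{ij}$

Mean turnaround time $f_{0A}$ (expected first-passage time for $s_A$) derived by solving:

$$-v_i \, f_{iA} + \sum_{j \ne i, \, j \ne A} q_{ij} \, f_{jA} = -1$$

Expected generated load L derived from Markov reward model:

$$E[L(t)] = \sum_{j \ne A} E_{0j}(t) \sum_{k \ne j} q_{jk} \, L_{jk}$$

with

$$E_{ij}(t) = \frac{1}{v} \sum_{z=0}^{\infty} e^{-vt} \, \frac{(vt)^z}{z!} \sum_{n=0}^{z-1} p_{ij}^{(n)}$$

and probabilities

$$p_{ij}^{(z)} = \sum_{k \ne A} p_{ik}^{(z-1)} \, p_{kj}$$

for uniformized CTMC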
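The mean turnaround time can be computed by solving the first-passage equations. The sketch below uses the equivalent form f_i = H_i + Σ_j p_ij f_j over the transient states (obtained by dividing the rate equation by v_i). The helper names and the two-state sample workflow are hypothetical; the small Gauss-Jordan solver is included only to keep the block self-contained.

```python
from fractions import Fraction


def solve(a, b):
    """Gauss-Jordan elimination for a square system a x = b over Fractions."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def mean_turnaround(H, P):
    """Expected first-passage times into the absorbing state s_A.

    H[i]: mean residence time of transient state i.
    P[i][j]: transition probability between transient states i and j;
             the missing probability mass of a row leads to s_A.
    """
    n = len(H)
    # (I - P) f = H
    a = [[(1 if i == j else 0) - Fraction(P[i][j]) for j in range(n)]
         for i in range(n)]
    return solve(a, H)


# Hypothetical workflow: s0 always moves to s1; s1 loops back to s0 with
# probability 1/4 and terminates (s_A) with probability 3/4.
H = [2, 3]
P = [[0, 1], [Fraction(1, 4), 0]]
f = mean_turnaround(H, P)   # f[0] is the mean turnaround time f_0A
```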

62© Gerhard Weikum

(Stationary) Availability Model

[Figure: CTMC state-transition diagram over the 27 states (X1, X2, X3) with Xi in {0, 1, 2}, from (2,2,2) to (0,0,0)]

System states: $X = (X_1, X_2, \ldots, X_k)$ with $X_i \le Y_i$

Transitions between $(\ldots, X_i, \ldots)$ and $(\ldots, X_i - 1, \ldots)$: failure rate $\lambda_i X_i$, repair rate $\mu_i (Y_i - (X_i - 1))$

(strong) availability $= \sum_{X \text{ good}} \pi(X)$ with stationary state prob. $\pi(X)$

63© Gerhard Weikum

Availability Example

• 3 server types:
  - communication server: one failure per month, $\lambda_1 = \frac{1}{43200} \, \text{min}^{-1}$
  - workflow engine: one failure per week, $\lambda_2 = \frac{1}{10080} \, \text{min}^{-1}$
  - application server: one failure per day, $\lambda_3 = \frac{1}{1440} \, \text{min}^{-1}$
• repair rate for each server type: $\mu_1 = \mu_2 = \mu_3 = \frac{1}{10} \, \text{min}^{-1}$
• expected unavailability depending on configuration (Y1, Y2, Y3):

System configuration    Expected downtime per year
(1,1,1)                 71 hours
(2,1,1)                 65 hours
(1,2,1)                 62 hours
(2,2,1)                 60 hours
(1,1,2)                 11 hours
(2,1,2)                 4.8 hours
(1,2,2)                 86 minutes
(2,2,2)                 26 minutes
(2,3,2)                 25 minutes
(2,2,3)                 21 seconds
(2,3,3)                 12 seconds
(3,3,3)                 10 seconds

64© Gerhard Weikum

Non-exponentially Distributed Time-to-Failure and Downtime

Approximate more general distributions of state-residence time by E1,n distribution:

[Figure: state sj, entered with rate qij and left with rate qjk, expanded into stages sj0, sj1, sj2, ..., sjn, each with mean residence time H; branching probabilities q and 1-q]

Important to capture realistic behavior (e.g., planned maintenance)

Special case q=0: n exponential stages with mean H behave like Erlang-n distributed state with mean nH

65© Gerhard Weikum

Workflow System Configuration Tool

[Figure: tool components Mapping, Modeling, Calibration, Evaluation, Recommendation, and Monitoring, connected to the Workflow Repository, the Operational Workflow System, the system config., and the Admin]

Hypothetical config → max. throughput, avg. waiting time, expected downtime

66© Gerhard Weikum


Admin


Evaluation

Recommendation

MonitoringMapping

Min-cost config.

Goals: min(throughput), max(waiting time), max(downtime) + constraints

67© Gerhard Weikum



Evaluation

Recommendation

MonitoringMapping

Automaticreconfiguration

Goals: min(throughput), max(waiting time), max(downtime) + constraints

68© Gerhard Weikum

Goliat: Goal-driven Auto-configuration Tool (for Mentor-lite)

69© Gerhard Weikum

Prediction Accuracy of Goliat

Shipment_S

CreditCardCheck_S NewOrder_S [PayByCreditCard and NewOrder_DONE] /st!(CreditCardCheck) [PayByBill and

NewOrder_DONE] [CreditCardOK and CreditCardCheck_DONE]

[CreditCardNotOK and CreditCardCheck_DONE]

[in(Notify_EXIT_S) and in(Delivery_EXIT_S) and

PayByCreditCard] /st!(CreditCardCharge)

CreditCardCharge_S

EC_EXIT_S

[CreditCardCharge_DONE] Payment_S

[Payment_DONE]

[in(Notify_EXIT_S) and in(Delivery_EXIT_S) and PayByBill] /st!(Payment)

/st!(NewOrder)

EC_SC

EC_INIT_S

Notify_S Notify_EXIT_S

[Notify_DONE] /st!(Notify)

Notify_INIT_S

FindStore_S CheckStore_S [ItemsLeft and

FindStore_DONE] /fs!(ItemAvailable) st!(CheckStore)

[ItemAvailable and CheckStore_DONE]

[AllItemsProcessed]

/st!(FindStore)

Delivery_EXIT_S

Delivery_INIT_S

Benchmark: E-Commerce Order Processing Workflow

Results on Mentor-lite configuration:

[Plot: wait time [msec] (axis 0 to 35) versus arrival rate [min⁻¹] (axis 0 to 5); curves: Experiment, Goliat]

70© Gerhard Weikum

Multi-class Workloads with Diff QoS

[Figure: online brokerage e-service. Customer types: premium customer, guest (potential future customer), ... "Channels": what-if portfolio analyses, stock price info service, ... Requests pass through middleware with class-specific request queues to backend servers.]

class priorities ???

71© Gerhard Weikum

HEART: Help for Ensuring Acceptable Response Time

Input:
• class-specific arrival rates, service time moments
• class-specific goals, e.g.:
  E[RT(class 1)] ≤ 5 s
  E[RT(class 2)] ≤ 2 s
  Var[RT(class 2)] ≤ 4 s²
  P[RT(class 3) ≤ 5 s] ≥ 0.95
  ...

Output: class-specific priorities for messaging middleware (MQ Series) for satisfying all goals

72© Gerhard Weikum

Autonomic Computing

My interpretation: need component design for predictability: self-inspection, self-analysis, self-tuning

Vision: all computer systems must be self-managed, self-organizing, and self-healing

Motivation:
• ambient intelligence (sensors in every room, your body etc.)
• reducing complexity and improving manageability of very large systems

Eight laws:
• know thy self
• configure thy self
• optimize thy self
• heal thy self
• protect thy self
• grow thy self
• know thy neighbor
• help thy users

Role model: biological systems (really ???)

73© Gerhard Weikum

Outline





Statecharts




Failure Handling

Stochastic Modeling

WF System Configuration


74© Gerhard Weikum

Workflow Engine

Workflow Engine

Statechart Interpreter

CommMgr LogMgr

WorkflowRepository

WorklistDB

WorkflowLog

AppWrapper

AppWrapper

Worklist Mgt

History Mgt

Specification, verification, configuration workbench

Mentor-lite Prototype

Event-Process Chains etc.

Statecharts

Other WF engines (SAP etc.)

XML

75© Gerhard Weikum

Summary and Open Research Issues

Dependable, self-organizing ("autonomic") systems require
• comprehensive data/message/process recovery with failure masking
• and (dynamic) configuration and tuning procedures based on tractable mathematical models

Interesting research topics for graduate students:
• rigorous verification of efficient data/message/process recovery algorithms
• comprehensive & efficient implementation of recovery contracts in WFMS / Web service environment
• guarantees about response time percentiles for multi-class workloads
• dynamic reconfiguration based on transient performability predictions for given time horizon
• comprehensive configuration tool for commercial WFMS / Web service suite

76© Gerhard Weikum

The Future

Workflow technology is successful

"Success is a lousy teacher" (Bill Gates)

Need further courageous steps towards
• Provably correct behavior
• Predictably good performance
• High reliability and availability
• Guaranteed quality of results

"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate"

(Report of the US President's Technology Advisory Committee)

Commercial world is driven by time to market

77© Gerhard Weikum

Recommended Literature

• F. Leymann, D. Roller: Production Workflow: Concepts and Techniques, Prentice Hall, 2000
• W. van der Aalst, K. van Hee: Workflow Management: Models, Methods, and Systems, MIT Press, 2002
• A. Dogac, L. Kalinichenko, T. Özsu, A. Sheth (Eds.): Workflow Management Systems and Interoperability, Springer, 1998
• D. Harel, M. Politi: Modeling Reactive Systems with Statecharts: The Statemate Approach, McGraw Hill, 1998
• E.M. Clarke, O. Grumberg, D. Peled: Model Checking, MIT Press, 2000
• G. Weikum: Towards Guaranteed Quality and Dependability of Information Services, German Database Conf. (BTW), 1999
• G. Weikum, G. Vossen: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery, Morgan Kaufmann, 2001
• R. Barga, D. Lomet, G. Weikum: Recovery Guarantees for General Multi-Tier Applications, IEEE CS Data Engineering Conf., 2002
• H.C. Tijms: Stochastic Models: An Algorithmic Approach, Wiley & Sons, 1994
• G. Haring, C. Lindemann, M. Reiser (Eds.): Performance Evaluation: Origins and Directions, Springer, 2000
• M. Gillmann, G. Weikum, W. Wonner: Workflow Management with Service Quality Guarantees, ACM SIGMOD Conf., 2002
• G. Weikum (Editor): Special Issue on Infrastructure for Advanced E-Services, IEEE CS Data Engineering Bulletin, March 2001
