View
521
Download
6
Category
Preview:
DESCRIPTION
Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start providing sample logs.
Citation preview
Chapter 5Process Discovery: An Introduction
prof.dr.ir. Wil van der Aalstwww.processmining.org
Overview
PAGE 1
Part I: Preliminaries
Chapter 2 Process Modeling and Analysis
Chapter 3Data Mining
Part II: From Event Logs to Process Models
Chapter 4 Getting the Data
Chapter 5 Process Discovery: An Introduction
Chapter 6 Advanced Process Discovery Techniques
Part III: Beyond Process Discovery
Chapter 7 Conformance Checking
Chapter 8 Mining Additional Perspectives
Chapter 9 Operational Support
Part IV: Putting Process Mining to Work
Chapter 10 Tool Support
Chapter 11 Analyzing “Lasagna Processes”
Chapter 12 Analyzing “Spaghetti Processes”
Part V: Reflection
Chapter 13Cartography and Navigation
Chapter 14Epilogue
Chapter 1 Introduction
Process discovery
PAGE 2
software system
(process)model
eventlogs
modelsanalyzes
discovery
records events, e.g., messages,
transactions, etc.
specifies configures implements
analyzes
supports/controls
enhancement
conformance
“world”
people machines
organizationscomponents
businessprocesses
Process discovery = Play-In
PAGE 3
event log process model
Play-In
event logprocess model
Play-Out
event log process model
Replay
• extended model showing times, frequencies, etc.
• diagnostics• predictions• recommendations
Example
PAGE 4
a
b
c
de
p2
end
p4
p3p1
start
Event log contains all possible traces of model and vice versa.
Another example
PAGE 5
a
b
c
df
p2
end
p4
p3p1
start
e
p5
Generalization: event log contains only subset of all possible traces of model.
Notation is less relevant (e.g. BPMN)
PAGE 6
a
b
c
de
p2
end
p4
p3p1
start
a
start end
b
c
e
d
Another BPMN example
PAGE 7
a
b
c
df
p2
end
p4
p3p1
start
e
p5
a
start end
b
c
e
d
f
Challenge
• In general, there is a trade-off between the following four quality criteria:
1.Fitness: the discovered model should allow for the behavior seen in the event log.
2.Precision (avoid underfitting): the discovered model should not allow for behavior completely unrelated to what was seen in the event log.
3.Generalization (avoid overfitting): the discovered model should generalize the example behavior seen in the event log.
4.Simplicity: the discovered model should be as simple as possible.
PAGE 8
Process Discovery: example of algorithm
PAGE 9
α
PAGE 10
>,→,||,# relations
• Direct succession: x>y iff for some case x is directly followed by y.
• Causality: x→y iff x>y and not y>x.
• Parallel: x||y iff x>y and y>x
• Choice: x#y iff not x>y and not y>x.
a>ba>ca>eb>cb>dc>bc>de>d
a→ba→ca→eb→dc→de→d
b||cc||b
abcdacbdaed
b#ee#bc#ea#d…
PAGE 11
Basic Idea Used by α Algorithm (1)
a b
(a) sequence pattern: a→b
PAGE 12
Basic Idea Used by α Algorithm (2)
a
b
c
(b) XOR-split pattern:a→b, a→c, and b#c
a
b
c
(c) XOR-join pattern:a→c, b→c, and a#b
a
b
c
(b) XOR-split pattern:a→b, a→c, and b#c
PAGE 13
Basic Idea Used by α Algorithm (3)
a
b
c
(d) AND-split pattern:a→b, a→c, and b||c
a
b
c
(e) AND-join pattern:a→c, b→c, and a||b
a
b
c
(d) AND-split pattern:a→b, a→c, and b||c
Example Revisited
PAGE 14
a
b
c
de
p2
end
p4
p3p1
start
Result produced by α algorithm
a>ba>ca>eb>cb>dc>bc>de>d
a→ba→ca→eb→dc→de→d
b||cc||b
b#ee#bc#ea#d…
Footprint of L1
PAGE 15
a
b
c
de
p2
end
p4
p3p1
start
Footprint of L2
PAGE 16
a
b
c
df
p2
end
p4
p3p1
start
e
p5
Simple patterns
PAGE 17
a b
(a) sequence pattern: a→b
a
b
c
(b) XOR-split pattern:a→b, a→c, and b#c
a
b
c
(c) XOR-join pattern:a→c, b→c, and a#b
a
b
c
(d) AND-split pattern:a→b, a→c, and b||c
a
b
c
(e) AND-join pattern:a→c, b→c, and a||b
Algorithm
PAGE 18
Let L be an event log over T. α(L) is defined as follows. 1. TL = { t ∈ T | ∃σ ∈ L t ∈ σ}, 2. TI = { t ∈ T | ∃σ ∈ L t = first(σ) }, 3. TO = { t ∈ T | ∃σ ∈ L t = last(σ) }, 4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧
∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 }, 5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XL
A ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) }, 6. PL = { p(A,B) | (A,B) ∈ YL } ∪{iL,oL}, 7. FL = { (a,p(A,B)) | (A,B) ∈ YL ∧ a ∈ A } ∪ { (p(A,B),b) | (A,B) ∈
YL ∧ b ∈ B } ∪{ (iL,t) | t ∈ TI} ∪{ (t,oL) | t ∈ TO}, and 8. α(L) = (PL,TL,FL).
Key idea: find places
PAGE 19
a1
...
a2
am
b1
b2
bn
p(A,B) ...
A={a1,a2, … am} B={b1,b2, … bn}
4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 },
5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XLA ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) },
Places as footprints
PAGE 20
a1
...
a2
am
b1
b2
bn
p(A,B) ...
A={a1,a2, … am} B={b1,b2, … bn}
PAGE 21
a
b
c
de
p2
end
p4
p3p1
start
Another event log L3
PAGE 22
Model for L3
PAGE 23
a b
c
d
e
p({a,f},{b})iL
p({b},{c})
p({b},{d})
p({c},{e})
p({d},{e})
g
oLp({e},{f,g})
f
Another event log L4
PAGE 24
b
c
p({a,b},{c}) oL
a
iL e
d
p({c},{d,e})
Event log L5
PAGE 25
PAGE 26
Discovered model
PAGE 27
b
p({a},{e})
oLaiL
c
e
f
d
p({e},{f})
p({b},{c,f})p({a,d},{b})
p({c},{d})
Limitation of α algorithm(implicit places)
PAGE 28
g
a
c
d
e
fb
p1
p2
Green places are implicit!
Limitation of α algorithm(loops of length 1)
PAGE 29
a c
b
a c
b
Limitation of α algorithm(loops of length 2)
PAGE 30
a b d
c
a
b
d
c
Limitation of α algorithm(non-local dependencies)
PAGE 31
b
c
a
e
dp1
p2
Green places are not discovered!
Difficult constructs for α algorithm
PAGE 32
a
b
c
Taking the transactional life-cycle into account
PAGE 33
start
assigned
assign
running
complete
a
start
running
complete
b
suspended
suspend
resume
start
assigned
assign
running
complete
c
Rediscovering process models
PAGE 34
The rediscovery problem: Is the discovered model N’ equivalent to the original model N?
discoveredprocessmodel
original process model event
log
simulate discover
N=N’ ?N N’
Equivalence: trace equivalence, bisimilarity, and branching bisimilarity
PAGE 35
s1
birth
s2
s3 s4
hell
curse
curse
curse
birth
curse
hell
s5
s6
s7
birth
curse
s8
s9
s11
curse
curses10
TS1 TS2 TS3
heaven
heaven heaven hell
heaven
hell
?
heaven
Three trace equivalent transition systems: TS1 and TS2are not bisimilar, but TS2 and TS3 are bisimilar
Branching bisimilarity defined for YAWL
PAGE 36
s1
TS1 TS2
start
check
end
reject accept
c1 c2
s2
acceptreject
s3 s4
s5
s6
check
s7
acceptreject
s8
τ τ
start
end
reject accept
c3
check check
?
TS1 and TS2 are not branching bisimilar (although trace equivalent).
Challenge: finding the right representational bias
PAGE 37
a a
start endp
There is no WF-net with unique visible labels that exhibits this behavior.
Another example
PAGE 38
a b
start endp1
c
τ
p1
(a)
a b
start endp1
c
p1
(b)
a
a b
start endp1
c
p1
(c)
There is no WF-net with unique visible labels that exhibits this behavior.
Challenge: noise and incompleteness
• To discover a suitable process model it is assumed that the event log contains a representative sample of behavior.
• Two related phenomena:−Noise: the event log contains rare and
infrequent behavior not representative for the typical behavior of the process.
− Incompleteness: the event log contains too few events to be able to discover some of the underlying control-flow structures.
PAGE 39
More on incompleteness
PAGE 40
See also chapter 3 (cross-validation, precision, recall, etc.)
PAGE 41
Challenge: Balancing Between Underfitting and Overfitting
Challenge: four competing quality criteria
PAGE 42
process discovery
fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
Flower model
PAGE 43
g
ac
d
ef
b
start end
h
PAGE 44
What is the best model?
A D
C
EB
A D
C
EB
ACDACEBCEBCD
990850
PAGE 45
What is the best model?
A D
C
EB
A D
C
EB
ACDACEBCEBCD
99888578
PAGE 46
What is the best model?
A D
C
EB
A D
C
EB
ACDACEBCEBCD
992853
Example: one log four models
PAGE 47
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hfend
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N3 : fitness = +, precision = -, generalization = +, simplicity = +
N2 : fitness = -, precision = +, generalization = -, simplicity = +
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
… (all 21 variants seen in the log)
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
process discovery
fitness
precisiongeneralization
simplicity
“able to replay event log” “Occam’s razor”
“not overfitting the log” “not underfitting the log”
Model N1
PAGE 48
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
bexamine
thoroughly
cexamine casually
dcheck ticket
decide
pay compensation
reject request
reinitiate request
e
g
h
f
end
N1 : fitness = +, precision = +, generalization = +, simplicity = +
Model N2
PAGE 49
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N2 : fitness = -, precision = +, generalization = -, simplicity = +
Model N3
PAGE 50
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
bexamine thoroughly
cexamine casually
d checkticket
decide
pay compensation
reject request
reinitiate requeste
g
hfend
N3 : fitness = +, precision = -, generalization = +, simplicity = +
Model N4
PAGE 51
acdeh
abdeg
adceh
abdeh
acdeg
adceg
adbeh
acdefdbeh
adbeg
acdefbdeh
acdefbdeg
acdefdbeg
adcefcdeh
adcefdbeh
adcefbdeg
acdefbdefdbeg
adcefdbeg
adcefbdefbdeg
adcefdbefbdeh
adbefbdefdbeg
adcefdbefcdefdbeg
455
191
177
144
111
82
56
47
38
33
14
11
9
8
5
3
2
2
1
1
1
# trace
1391
astart register
request
cexamine casually
dcheckticket
decide reject request
e hend
N4 : fitness = +, precision = +, generalization = -, simplicity = -
aregister request
dexamine casually
ccheckticket
decide reject request
e h
a cexamine casually
dcheckticket
decide
e g
a dexamine casually
ccheckticket
decide
e g
register request
register request
pay compensation
pay compensation
aregister request
b dcheckticket
decide reject request
e h
aregister request
d bcheckticket
decide reject request
e h
a b dcheckticket
decide
e gregister request
pay compensation
examine thoroughly
examine thoroughly
examine thoroughly
… (all 21 variants seen in the log)
Why is process mining such a difficult problem?
• There are no negative examples (i.e., a log shows what has happened but does not show what could not happen).
• Due to concurrency, loops, and choices the search space has a complex structure and the log typically contains only a fraction of all possible behaviors.
• There is no clear relation between the size of a model and its behavior (i.e., a smaller model may generate more or less behavior although classical analysis and evaluation methods typically assume some monotonicity property).
PAGE 52
Creating a 2-D slice of a 3-D reality
PAGE 53
Creating a 2-D slice of a 3-D reality: the process is viewed from a specific angle, the process is scoped using a frame, and the resolution determines the granularity of the resulting model
Recommended