
Realizing a Process Cube Allowing for the Comparison of Event Data



Eindhoven University of Technology
Department of Mathematics and Computer Science
Architecture of Information Systems Research Group

Realizing a Process Cube Allowingfor the Comparison of Event Data

Master Thesis

Tatiana Mamaliga

Supervisors:
prof. dr. ir. W.M.P. van der Aalst
MSc J.C.A.M. Buijs
dr. G.H.L. Fletcher

Final version

Eindhoven, August 2013


Contents

1 Introduction
  1.1 Context
  1.2 Challenges - Then & Now
  1.3 Assignment Description
  1.4 Approach
  1.5 Thesis Structure

2 Preliminaries
  2.1 Business Intelligence
  2.2 Process Mining
    2.2.1 Concepts and Definitions
    2.2.2 ProM Framework
  2.3 OLAP
    2.3.1 Concepts and Definitions
    2.3.2 The Many Flavors of OLAP

3 Process Cube
  3.1 Process Cube Concept
  3.2 Process Cube by Example
    3.2.1 From XES Data to Process Cube Structure
    3.2.2 Applying OLAP Operations to the Process Cube
    3.2.3 Materialization of Process Cells
  3.3 Requirements
  3.4 Comparison to Other Hypercube Structures

4 OLAP Open Source Choice
  4.1 Existing OLAP Open Source Tools
  4.2 Advantages & Disadvantages
  4.3 Palo - Motivation of Choice

5 Implementation
  5.1 Architectural Model
  5.2 Event Storage
  5.3 Load/Unload of the Database
  5.4 Basic Operations on the Database Subsets
    5.4.1 Dice & Slice
    5.4.2 Pivoting
    5.4.3 Drill-down & Roll-up
  5.5 Integration with ProM
  5.6 Result Visualization

6 Case Study and Benchmarking
  6.1 Evaluation of Functionality
    6.1.1 Synthetic Benchmark
    6.1.2 Real-life Log Data Example
  6.2 Performance Analysis
  6.3 Discussion

7 Conclusions & Future Work
  7.1 Summary of Contributions
  7.2 Limitations
    7.2.1 Conceptual Level
    7.2.2 Implementation Level
  7.3 Further Research


Abstract

Continuous efforts to improve processes require a deep understanding of a process's inner workings. In this context, the process mining discipline aims at discovering process behavior from historical records, i.e., event logs. Process mining results can be used for the analysis of process dynamics. However, mining realistic event logs is difficult due to complex interdependencies within a process. Therefore, to gain more in-depth knowledge about a certain process, it can be split into subprocesses, which can then be separately analysed and compared. Typical tools for process mining, e.g., ProM, are designed to handle a single event log at a time, which does not particularly facilitate the comparison of multiple processes. To tackle this issue, Van der Aalst proposed in [4] to organize the event log in a cubic data structure, called a process cube, with a selection of the event attributes forming the dimensions of the cube.

Although multidimensional data structures are already employed in various business intelligence tools, the data used there has a static character. This is in stark contrast to process mining, since event data characterizes a dynamic process that evolves over time. The aim of this thesis is to develop a framework that supports the construction of the process cube and permits multidimensional filtering on it, in order to separate subcubes for further processing. We start from the OLAP foundation and reformulate its operations for event logs. Moreover, the semantics of a traditional OLAP aggregate is changed: numerical aggregates are substituted by sublog data. With these adjustments, a tool is developed and integrated as a plugin in ProM to support the aforementioned operations on event logs. The user can unload sublogs from the process cube, pass them as parameters to other plugins in ProM and visualize different results simultaneously.

During the development of the tool, we had to deal with a shortcoming of multidimensional database technologies when storing event logs, i.e., the sparsity of the resulting process cube. Sparsity in multidimensional data structures occurs when a large number of cells in a cube are empty, i.e., there are missing data values at the intersection of dimensions. Taking a single attribute of an event log as a dimension in the process cube results in a very sparse multidimensional data structure. As a result, the computational time required to unload a sublog for processing increases dramatically. This shortcoming was addressed by designing a hybrid database structure that combines a high-speed in-memory multidimensional database with a sparsity-immune relational database. Within this solution, only a subset of the event attributes actually contributes to the construction of the process cube, whereas the rest are stored in the relational database and used only for event log reconstruction. The hybrid database solution proved to provide the flexibility needed for real-life logs, while keeping response times acceptable for efficient user interaction. The applicability of the tool was demonstrated using two event log examples: a synthetic event log and a real-life event log from the CoSeLoG project. The thesis concludes with a detailed loading and unloading performance analysis of the developed hybrid structure, for different database configurations.

Keywords: event log, relational database, in-memory database, OLAP, process mining, visualization, performance analysis


Chapter 1

Introduction

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.

Bertrand Russell, British author, mathematician and philosopher (1872-1970)

This thesis completes my graduation project for the Computer Science and Engineering master at Eindhoven University of Technology (TU/e). The project was conducted in the Architecture of Information Systems (AIS) group. The AIS group has a distinct research reputation and is specialized in process modeling and analysis, process mining and Process-Aware Information Systems (PAIS).

The process mining field, detailed further in this chapter, provides valuable analysis techniques and tools, but also faces a series of challenges; the main issues are large data streams and rapid changes over time. This project creates a proof-of-concept prototype, which takes the so-called process cube concept as a starting point for possible solutions to the above-mentioned challenges. The outcome is further used for the visual comparison of event data.

This chapter describes the assignment within its scientific context. Section 1.1 provides the research background. Section 1.2 enumerates the most important advances in process mining and identifies the current issues in the field. Section 1.3 specifies the problem and the project objectives. Section 1.4 continues with a short summary of the problem solution. Finally, Section 1.5 provides an overview of the remaining chapters of the thesis.

1.1 Context

Technology has become an integral part of any organization. For example, current systems and installations are heavily controlled and monitored remotely by integrated internet technologies [23]. Moreover, employing automated solutions in any line of business has become a trend. As a result, Enterprise Systems software, offering a seamless integration of all the information flowing through a company [22], is used in any modern organization.

Enterprise Information Systems (EIS) keep businesses running, improve service times and thus attract more clients. Still, as in every complex system, there are multiple points where things can go wrong: system errors, fraud, security issues and inefficient distribution of tasks are just a few to mention. To cope with these issues, EIS had to extend their function-oriented enterprise applications with Business Intelligence (BI) techniques. That is, BI applications have been installed to support management in measuring a company's performance and deriving appropriate decisions [39]. Among the most important functions of BI are online analytical processing (OLAP), data mining, business performance management and predictive analytics.

Being aware of the existing problems in an organization and applying standardized solutions to solve them is usually not enough. Consider a doctor who always prescribes painkillers independent of the patient's complaints. Of course, these kinds of pills will temporarily relieve the pain, but they will not treat the real disease. A good doctor should run tests, identify the root causes of the health problem and only then give an adequate treatment. This is what the process mining field tries to accomplish. It goes beyond analyzing merely individual data records and focuses instead on the underlying process which glues event data together. A deep understanding of the inside of a process can point to notorious deviations, persistent bottlenecks and unnecessary rework.

All in all, technology has a major impact on organizations and has proved to be an enabler for business process improvement. Therefore, by means of business intelligence, and process mining in particular, new opportunities are constantly exploited to keep pace with challenges such as change.

1.2 Challenges - Then & Now

In the context of today's rapidly changing environment, organizations are looking for new solutions to keep their businesses running efficiently. Slogans such as "Driving the Change" (Renault), "Changes for the Better" (Mitsubishi Semiconductor), "Empowering Change" (Credit Suisse First Boston) and "New Thinking. New Possibilities." (Hyundai) are used more and more often. Furthermore, different areas of business research are trying to keep up with change, and process mining is not an exception.

In 2011, the Process Mining Manifesto [7] was released to describe the state of the art in process mining on the one hand, and its current challenges on the other. A year later, the project proposal "Mining Process Cubes from Event Data (PROCUBE)" [4] suggested the so-called process cube as a solution direction for some of these challenges. In the context of currently employed process mining solutions, and using the Process Mining Manifesto as a reference, the PROCUBE project proposal presents several challenges that process mining is currently facing:

From "small" event data to "big" event data.
Due to increased storage capacity and advanced technologies, the vast amount of available event data has become difficult to control and analyse. Most of the traditional process mining techniques operate on event logs whose size does not exceed several thousand cases and a couple of hundred thousand events (for example, the BPI Challenge [2] files). However, nowadays corporations work on a different scale of event logs. Giants like Royal Dutch Shell, Walmart and IBM would rather consider millions of events (a day or even a second), and this number will continue to grow. Ways to ensure that event data growth will not affect the importance of process mining techniques are constantly sought.

From homogeneous to heterogeneous processes.
With the increasing complexity of an event log, chances are that the variability in its corresponding process increases as well. For example, events in an event log can present different levels of abstraction, yet many mining techniques assume that all events in an event log are logged at the same level of abstraction. In that sense, the diverse event log characteristics have to be properly considered.

From one to many processes.
Many companies have their agencies spread across the globe. Let's take SAP AG as an example: its research and development units alone are located on four continents, and it has regional offices all around the world. That is, SAP units are executing basically the same set of processes. Still, this does not exclude possible variations; for instance, there might be various influences due to the characteristics of a certain SAP distribution region (Germany, India, Brazil, Israel, Canada, China, and others). Traditional process mining is oriented towards stand-alone business processes. However, it is of great importance to be able to compare business processes of different organizations (or units of an organization). For example, efficient and less efficient paths in different processes can be identified; inefficient paths can be substituted and efficient paths can be applied to the rest of the processes to improve performance.


From steady-state to transient behavior.
Change has a major impact not only on the size of event logs and on the necessity of dealing with many processes together, but also on the state of a business process. For example, companies should be able to quickly adjust to different business requirements; as a result, their corresponding processes undergo various modifications. Current process mining techniques assume business processes to be in a steady state [5]. However, it is important to understand the changing nature of a process and to react appropriately. The notion of concept drift was introduced in process mining [33] to capture this second-order dynamics. Its target is to discover and analyze the dynamics of a process by detecting and adapting to change patterns in the ongoing work.

From offline to online.
As previously mentioned, systems produce an overwhelming amount of information. The idea of storing it as historical event data for later analysis, as is currently done, may not seem as appealing any more. Instead, the emphasis should be more on the present and the future of an event. That is, an event should be analysed on the fly and predictions on the contingency of its occurrence should be made based on existing historical data. As such, online analysis of event data is yet another process mining challenge.

Each of the issues discussed above is extremely challenging. Analysing large-scale event logs is difficult with the current process mining techniques. Solutions to mitigate some of the issues that appear when dealing with large-scale event logs are proposed in [14], e.g., event log simplification and dealing with less-structured processes. A framework for time-based operational support is described in [8]. In [16], an approach is offered to compare collections of process models corresponding to different Dutch municipalities. Nevertheless, there is still the need for more elaborate solutions and a unified way of approaching them.

1.3 Assignment Description

Stand-alone process analysis is the common way of analysing processes in today's process mining approaches. However, inspecting a process as a single entity impedes observing differences and similarities with other processes. Let's take a simple example from the airline industry. There is a constant discussion about which of the low-cost airlines, Ryanair or Wizzair, offers better services. There are both advantages and disadvantages to traveling with either of the two. Generally, Ryanair is considered more punctual than Wizzair [1]. To determine why Ryanair is more on time with its flights than Wizzair, we compare their processes. We notice that while at Wizzair the luggage is checked only once, Ryanair is very strict with the luggage procedure and checks it twice before boarding. As a result, passengers and crew are not busy with "fitting" luggage that does not fit, and the aisle of the aircraft is kept free for new passengers coming on board. By minimizing the turnaround time, the airline's punctuality improves. The luggage-checking procedure may not be the only factor that improves the punctuality of Ryanair, but it is clear from the comparison of the two airline processes that it contributes to reducing flight delays. In conclusion, the comparison of the two processes helped in answering a specific question and identifying parts of these processes that can be further improved.

When it comes to the comparison of large processes, it is difficult to inspect processes entirely at a glance. Splitting and merging different parts of a process can offer more insightful details. Let's consider the following scenario. In a car manufacturing process, there is a final polishing inspection step: several resources check whether there is a scratch on a car that needs to be polished. During the last two weeks, it was noticed that one polishing crew worked slower than the others. To identify the cause of this issue, the car manufacturing process is analysed. First, the process is split by department type and the polishing department is selected. Then, only the process corresponding to the resources of this specific crew is isolated. The following aspects are inspected: the car type, the engine type and the color type. When filtering by car type and engine type, there seem to be no patterns indicating a potential delay. However, when inspecting the subprocesses corresponding to different car colors, a pattern emerges: the average working time for polishing a red car is much higher than for polishing cars of a different color. Since red cars take, in general, more time to be polished than other cars, this indicates that there is a problem in the painting department. The red-colored cars are not painted properly and therefore need constant polishing. While at the beginning it seemed like the crew was responsible for the delays, in fact the crew members were just polishing more red-colored cars. Since red-colored cars required more polishing due to a painting issue, the crew worked slower compared to the other crews. Without filtering the initial process, it would have been difficult to identify such detailed problems.

[1] http://www.flightontime.info/scheduled/scheduled.html

Taking into consideration the discussion above, the goal of this master project can be defined as follows:

GOAL: Create a proof-of-concept tool to allow comparison of multiple processes.

In other words, the aim is to support integrated analysis of multiple processes, while examining different views of a process. Together with the main goal, there are some other targets: filtering processes while preserving the initial dataset, merging different parts of a process, and visualizing process mining results simultaneously, placed next to each other to facilitate comparison. In the following, we present the approach we propose to reach the enumerated objectives.

1.4 Approach

Figure 1.1: The process cube. Concept proposed in the PROCUBE project.

To accomplish the goal, we base our approach on the process cube concept, introduced in [4] and shown in Figure 1.1. A process cube is a structure composed of process cells. Each process cell (or collection of cells) can be used to generate an event log and derive process mining results [4]. Note that traditional process mining algorithms are always applied to a specific event log, without systematically considering the multidimensional nature of event data.

In this project, the process cube is materialized as an online analytical processing (OLAP) hypercube structure. Besides the built-in multidimensional structure, one can benefit from the functionality of the OLAP operations and, hopefully, from the good performance of OLAP implementations. Transactional databases are designed to store and clean data, but are not tailored towards analysis. OLAP, on the other hand, is chosen here to harbor complex event data for further process analysis, in view of its analysis-optimized databases and its specialized "drilling" operations. Organizing event data in OLAP multidimensional structures makes it easy to retrieve event data and to pick a side from which to look at it. There are also many ways to divide event data; e.g., one can always drill down and up in the multidimensional structure and inspect event data at different granularity levels. Finally, the retrieved event data can be used to obtain different process-related characteristics, e.g., process models, that can be further analysed and compared.

There are, however, some challenges with respect to this approach, mainly due to the fact that OLAP is designed to handle enterprise data, not event data:

• Only the aggregation of large collections of numerical data is supported by the OLAP tools.

• Process-related aspects are entirely missing in the OLAP framework.

• Overlapping of cell (event) classes is not possible in OLAP cubes.

Figure 1.2: Master Project Scope.

Nevertheless, adjustments can be made to OLAP tools to accommodate the process cube requirements. The approach considers several steps, shown also in Figure 1.2. First, event logs are introduced among the OLAP data sources; hence, it becomes possible to load XES event logs into the OLAP database. Second, the process cube is created to support the materialization of an event log. Moreover, the process cube is designed to allow the visualization of cells with overlapping event data. Finally, different process mining results can be produced for any section of the cube and further exported as images.

The materialization of the process cube as an OLAP cube allows us to define our objective even more precisely: the goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes is possible.

1.5 Thesis Structure

To describe the approach, the master thesis is structured as follows:

Present a literature study on employed concepts and technologies (Chapter 2)
Concepts from the process mining and business intelligence fields will be introduced. Then, a discussion on the implemented OLAP and database technologies will follow.

Elaborate on process cube functionality (Chapter 3)
The process cube notion will be clearly defined, together with its structure. The requirements needed to attain the envisioned process cube functionality will be listed.

Explain Palo software choice (Chapter 4)
Based on the requirements from Chapter 3, a collection of technological solutions that could support the process cube structure is generated. After analyzing the pros and cons of each solution, the choice to use the Palo OLAP server is described and motivated.


Recall the most relevant implementation steps (Chapter 5)
After presenting the architecture of the project, the implementation steps are described. The main functionality consists of: loading/unloading an XES file into/from the in-memory database, enabling the adjusted OLAP operations on event logs and visualizing process mining results.

Report on the testing process and on the system test results (Chapter 6)
The functionality of the software is tested and its performance is evaluated for different event logs and process cubes.

Conclude with general remarks on the project (Chapter 7)
The thesis concludes with a series of comments and observations on both the implemented solution and further research possibilities.


Chapter 2

Preliminaries

2.1 Business Intelligence

Business Intelligence (BI) incorporates all technologies and methods that aim at providing actionable information that can be used to support decision making. An alternative definition states that BI systems combine data gathering, data storage, and knowledge management with analytical tools to present complex internal and competitive information to planners and decision makers [41]. All in all, BI represents a mixture of multiple disciplines (e.g., data warehousing, data mining, OLAP, process mining, etc.), as shown in Figure 2.1, all with the same main goal of turning raw data into useful and reliable information for further business improvements. Even though they are herein presented as totally separate disciplines, there are various attempts to interconnect some of them to obtain more powerful analysis results. For example, data mining is integrated with OLAP techniques [31, 45], and data warehousing and OLAP technologies are more and more used in conjunction [13, 18]. From the above-mentioned BI disciplines, process mining and OLAP are detailed in Section 2.2 and Section 2.3, as being particularly relevant for this project.

Figure 2.1: BI - a confluence of multiple disciplines.


2.2 Process Mining

2.2.1 Concepts and Definitions

The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today's systems [3]. The content and the level of detail of a process description depend on the goal of the conducted process mining project and the employed process mining techniques. The set of real executions is fixed and is given by the event data from an existing event log.

There are basically three types of process mining projects [3]. The goal of the first, the data-driven process mining project, is to conclude with a process description that should be as detailed as possible, without necessarily having a specific question in mind. This can be accomplished in two ways: by a superficial analysis covering multiple process perspectives, or by an in-depth analysis of a limited number of aspects. The second, the question-driven process mining project, aims at obtaining a process description from which an answer to a concrete question can be derived. A possible question is: "How does the decision to increase the duration of handling an invoice influence the process?" The third type, the goal-driven process mining project, consists of looking for weaker parts in the resulting process description that can be considered for improving a specific aspect, e.g., better response times.

Figure 2.2: Process mining: discovery, conformance, enhancement.

Establishing the type of process mining project to conduct is followed by choosing the relevant process mining techniques to apply to the event log. Process mining comes in three flavors: discovery, conformance and enhancement. Figure 2.2 [1] shows these three main process mining categories. Discovery techniques take the event log as input and return the real process as output. Conformance checking techniques check whether reality, as recorded in the log, conforms to the model and vice versa [7]. Enhancement techniques produce an extended process model which gives additional insights into the process, e.g., existing bottlenecks.

Regardless of the process mining technique, an event log is always given as input, as also shown in Figure 2.2. The content of an event log can vary greatly from process to process. Nevertheless, there is a fixed skeleton expected to be found in any event log. Figure 2.3, from [3], presents the structure of an event log. Generally, the event data in an event log correspond to a process. A process is composed of cases, or completed process instances. In turn, a case consists of events. Events should be ordered within a case; preserving the order is important as it influences the control flow of the process. An event corresponds to an activity, e.g., register request, pay compensation. A trace represents a sequence of activities. Both events and cases are characterized by attributes, e.g., activity, time, resource, costs.

[1] http://www.processmining.org/research/start

Figure 2.3: Structure of event logs.

The data source used for process mining is an event log. Event data from different information systems are stored in event logs. Since event logs can be recorded for purposes other than process mining (e.g., for debugging errors), there is no unique format used at creation. Handling various event log formats for process analysis is time consuming. Therefore, event logs need to be standardized by converting raw event data to a single event log format. One such format is MXML, which emerged in 2003. Recently, the popularity of the XES event log standard has grown. In the following, we present an overview of the XES event log structure, with the details relevant for this master thesis. A more in-depth discussion of the XES format can be found in [15] and more up-to-date information on XES can be found at http://www.xes-standard.org/.

Figure 2.4, taken from [29], shows the XES meta-model. Except for traces and events, with their corresponding attributes, the log object contains a series of other elements. The global attributes for traces and events are usually used to quickly find the existing attributes in the XES log. The purpose of event classifiers is to assign each event to a pre-defined category; events within the same category can be compared with the ones from another category. XES logs are also characterized by extensions. Extensions are used to resolve ambiguity in the log by introducing a set of commonly understood attributes and attaching semantics to them. Attributes have assigned values, each corresponding to a specific type of data. Based on the type of data, attributes can be classified into five categories: String attributes, Date attributes, Int attributes, Float attributes, and Boolean attributes. These attribute types correspond to the standard XML types: xs:string, xs:dateTime, xs:long, xs:double and xs:boolean.

Figure 2.4: The XES Meta-model.
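To make the XES structure described above tangible, the following minimal sketch parses a tiny, hand-made XES fragment with Python's standard xml.etree library. The fragment and all attribute values are illustrative only; a real XES log would additionally declare extensions, global attributes and classifiers:

    import xml.etree.ElementTree as ET

    # A minimal, hypothetical XES fragment: one trace with two events.
    XES_SNIPPET = """
    <log xes.version="1.0">
      <trace>
        <string key="concept:name" value="case-1"/>
        <event>
          <string key="concept:name" value="register request"/>
          <string key="org:resource" value="John"/>
          <date key="time:timestamp" value="2012-12-02T11:05:00.000+01:00"/>
        </event>
        <event>
          <string key="concept:name" value="pay compensation"/>
          <string key="org:resource" value="Mary"/>
          <date key="time:timestamp" value="2012-12-02T14:30:00.000+01:00"/>
        </event>
      </trace>
    </log>
    """

    log = ET.fromstring(XES_SNIPPET)
    for trace in log.findall("trace"):
        case_id = trace.find("string[@key='concept:name']").get("value")
        for event in trace.findall("event"):
            # Every child of <event> is a typed attribute (string, date, int, ...).
            attrs = {a.get("key"): a.get("value") for a in event}
            print(case_id, attrs["concept:name"], attrs["time:timestamp"])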

To understand the separation between required and flexible event log aspects, a formalizationof the above-highlighted concepts is given. The process mining book [3] is used as reference.

Definition 1 (Event, attribute [3]). Let $\mathcal{E}$ be the event universe, i.e., the set of all possible event identifiers. Events may be characterized by various attributes, e.g., an event may have a timestamp, correspond to an activity, be executed by a particular person, have associated costs, etc. Let $AN$ be a set of attribute names. For any event $e \in \mathcal{E}$ and name $n \in AN$: $\#_n(e)$ is the value of attribute $n$ for event $e$. If event $e$ does not have an attribute named $n$, then $\#_n(e) = \bot$ (null value).

Notation 1. For a given set $A$, $A^*$ is the set of all finite sequences over $A$.


Definition 2 (Case, trace, event log [3]). Let $\mathcal{C}$ be the case universe, i.e., the set of all possible case identifiers. Cases, like events, have attributes. For any case $c \in \mathcal{C}$ and name $n \in AN$: $\#_n(c)$ is the value of attribute $n$ for case $c$ ($\#_n(c) = \bot$ if case $c$ has no attribute named $n$). Each case has a special mandatory attribute trace: $\#_{trace}(c) \in \mathcal{E}^*$.[2] $\hat{c} = \#_{trace}(c)$ is a shorthand for referring to the trace of a case.

A trace is a finite sequence of events $\sigma \in \mathcal{E}^*$ such that each event appears only once, i.e., for $1 \leq i < j \leq |\sigma|$: $\sigma(i) \neq \sigma(j)$.

For any sequence $\delta = \langle a_1, a_2, \ldots, a_n \rangle$ over $A$, $\delta_{set}(\delta) = \{a_1, a_2, \ldots, a_n\}$ converts a sequence into a set, e.g., $\delta_{set}(\langle d, a, a, a, a, a, a, d \rangle) = \{a, d\}$. $a$ is an element of $\delta$, denoted as $a \in \delta$, if and only if $a \in \delta_{set}(\delta)$.

An event log is a set of cases $L \subseteq \mathcal{C}$ such that each event appears at most once in the entire log, i.e., for any $c_1, c_2 \in L$ such that $c_1 \neq c_2$: $\delta_{set}(\hat{c}_1) \cap \delta_{set}(\hat{c}_2) = \emptyset$.

[2] In the remainder, we assume $\#_{trace}(c) \neq \langle\rangle$, i.e., traces in a log contain at least one event.
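As an informal illustration of Definitions 1 and 2 (a sketch with made-up identifiers, not the thesis implementation), events, attribute functions and the event-log condition can be encoded in Python as follows:

    # Events are identifiers; attributes are partial functions (dictionaries).
    event_attrs = {
        "e1": {"activity": "register request", "resource": "John"},
        "e2": {"activity": "pay compensation", "resource": "Mary"},
    }

    def attr(n, e):
        """#n(e): the value of attribute n for event e, or None (the null value)."""
        return event_attrs.get(e, {}).get(n)

    # A case maps to its trace, a finite sequence of distinct events.
    case_trace = {"c1": ["e1", "e2"]}

    def is_event_log(cases):
        """Event-log condition: each event appears at most once in the entire log."""
        seen = set()
        for trace in cases.values():
            for e in trace:
                if e in seen:
                    return False
                seen.add(e)
        return True

    assert attr("resource", "e1") == "John"
    assert attr("costs", "e1") is None  # a missing attribute yields the null value
    assert is_event_log(case_trace)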

2.2.2 ProM Framework

A large number of algorithms have been produced as a result of process mining research. Ranging from algorithms that provide just a helicopter view of the process (Dotted Chart) to ones that give an in-depth analysis (LTL Checker), many of them are implemented in the ProM Framework in the form of plugins.

Figure 2.5: ProM Framework Overview.

Figure 2.5, based on [24], shows an overview of the ProM Framework. It includes the main types of ProM plugins and the relations between them. Before applying any mining technique, an event log can be filtered using a Log filter. Further, the filtered event log can be mined using a Mining plugin and then stored as a Frame result. The Visualization engine ensures that frame results can be visualized. A (filtered) event log, but also different models, e.g., Petri nets, LTL formulas, can be loaded into ProM using an Import plugin. Both the Conversion plugin and the Analysis plugin use mining results as input. While the first is specialized in converting the result to a different format, the second is focused on the analysis of the result.

Figure 2.6: Examples of process mining plugins: Log Dialog and Dotted Chart (helicopter view), Fuzzy Miner (discovery), Social Networks based on Working Together (organizational perspective).

The ProM framework includes five types of process mining plugins, as shown in Figure 2.5:

• Mining plugins - mine models from event logs.

• Analysis plugins - implement property analysis on a mining result.

• Import plugins - allow the import of objects, e.g., Petri nets, LTL formulas, etc.

• Export plugins - allow export of objects to various formats, e.g., EPC, Petri net, DOT, etc.

• Conversion plugins - make conversions between different data formats, e.g., from EPC to Petri net.

Figure 2.6 presents some examples of plugins in ProM: the Log Dialog, the Dotted Chart, the Fuzzy Miner [30] and the Working Together Social Network [9]. There are, however, more than 400 plugins available in ProM 6.2, covering a wide spectrum. Plugin objectives vary from providing process information at a glance, e.g., Log Dialog and Dotted Chart, to providing automated process discovery, e.g., Heuristics Miner [53] and Fuzzy Miner, and offering detailed analysis for the verification of process models, e.g., Woflan analysis, for performance aspects, e.g., Performance Analysis with Petri net, and for the organizational perspective, e.g., Social Network miner.

2.3 OLAP

2.3.1 Concepts and Definitions

On-Line Analytical Processing (OLAP) is a method to support decision making in situations where raw data on measures such as sales or profit needs to be analysed at different levels of statistical aggregation [42]. Introduced in 1993 by Codd [20] as a more generic name for "multidimensional data analysis", OLAP embraces the multidimensionality paradigm as a means to provide fast access to data when analysing it from different views.

Figure 2.7: Traditional OLAP cube. At the intersection of the three dimensions (region, time and sales information), an aggregate (e.g., profit margin %) can be derived. Both the time and region dimensions contain a hierarchy (e.g., 2012Jan, 2012Feb, 2012Mar are months of 2012).

In comparison with its On-Line Transactional Processing (OLTP) counterpart, OLAP is optimized for analysing data, rather than for storing data originating from multiple sources while avoiding redundancy. Therefore, OLAP is mostly based on historical data, e.g., data that can be aggregated, and not on instantaneous data, which is quite challenging to analyse, sort, group or compare "on the fly".

Multidimensional data analysis is possible due to a multidimensional fact-based structure, called an OLAP cube. An OLAP cube is a specialized data structure that stores data in a way optimized for analysis.

Figure 2.7 presents the traditional OLAP cube structure. Designed to support enterprise data analysis, an OLAP cube is usually built around a business fact. A fact describes an occurrence of a business operation (e.g., a sale), which can be quantified by one or more measures of interest (e.g., the total amount of the sale, the sales cost, the profit margin %). Generally, the measure of interest is a real number. A business operation can be characterized by multiple dimensions of analysis (e.g., time, region, etc.). Let $DA_i$, $1 \leq i \leq n$, be the sets of elements of the dimensions of analysis. Then, the measure of interest $MI$ can be defined as a function $MI : \prod_{i=1}^{n} DA_i \rightarrow \mathbb{R}$. For example, if region, time and sales are the dimensions of analysis, as in Figure 2.7, then $MI(Germany, 2012Mar, ProfitMargin\%) = 11$.
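Concretely, $MI$ can be pictured as a partial mapping from dimension coordinates to real numbers. A minimal sketch (with invented figures) follows; coordinates absent from the mapping correspond to empty cells, the sparsity issue discussed later in this section:

    # The measure of interest MI as a partial function from coordinates to reals.
    MI = {
        ("Germany", "2012Mar", "ProfitMargin%"): 11.0,
        ("Netherlands", "2012Mar", "ProfitMargin%"): 9.5,
    }

    print(MI[("Germany", "2012Mar", "ProfitMargin%")])  # -> 11.0
    print(MI.get(("Belgium", "2012Jan", "SalesCost")))  # -> None: an empty cell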

Moreover, elements of a dimension of analysis can be organized in a hierarchy; e.g., the Europe region is herein represented by countries like Netherlands, Germany and Belgium. A natural hierarchical organization can be observed among time elements. Consider the tree structure in Figure 2.8. The root of the tree is the year 2012. This element has three children: 2012Jan, 2012Feb and 2012Mar, corresponding to months. Finally, each month element has days of the week as children. Let $H_i$ be the set of hierarchy elements, i.e., $H_i = \{2012, 2012Jan, 2012Feb, 2012Mar, 2012JanMon, 2012JanThu, \ldots\}$. The children function, $children : H_i \rightarrow \mathcal{P}(H_i)$, returns the children elements of its argument; for example, $children(2012) = \{2012Jan, 2012Feb, 2012Mar\}$. The allLeaves function, $allLeaves : H_i \rightarrow \mathcal{P}(H_i)$, returns all leaf elements of the subtree with the function argument as root node; for example, $allLeaves(2012) = \{2012JanMon, 2012JanThu, 2012FebWed, 2012MarTue, 2012MarFri\}$. Note that a hierarchy is a tree, i.e., an undirected graph in which any two nodes are connected by a simple path, with the following property: for any node $h \in H_i$ and any two children $h_1, h_2 \in children(h)$: $allLeaves(h_1) \cap allLeaves(h_2) = \emptyset$.

Figure 2.8: Example of hierarchy tree structure on the time dimension.

Dimensions of analysis, hierarchies and measures of interest can be used to construct an OLAP cube, like the one in Figure 2.7. Dimensions of an OLAP cube are defined by $CD = D_1 \times D_2 \times \ldots \times D_n$. For any $1 \leq i \leq n$, $D_i \subseteq H_i$ is the set of dimension elements. Hierarchies are defined by $CH = H_1 \times H_2 \times \ldots \times H_n$. For example, the time dimension contains elements from the hierarchy shown in Figure 2.8; if $D_1$ is the cube dimension corresponding to time, then a possible content of $D_1$ is $\{2012Jan, 2012Feb, 2012Mar\}$. It is not necessary for a dimension to contain all the hierarchy elements. Together with dimensions, hierarchies are elements of an OLAP cube structure $CS = (CD, CH)$. Measures of interest are functions specific to the dimensions of analysis. For the dimensions of the cube, the aggregate function $CA : \prod_{i=1}^{n} H_i \rightarrow \mathbb{R}$ is used as an equivalent of a measure of interest. The only difference is that aggregates can be computed from multiple measure of interest results or from other aggregates. For example, the aggregate sales cost for the entire month 2012Jan is the sum of the measure of interest results corresponding to 2012JanMon and 2012JanThu.
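To illustrate children, allLeaves and the aggregate function on the time hierarchy of Figure 2.8, consider the following sketch; the hierarchy mirrors the figure, while the sales-cost figures are invented for the example:

    # The time hierarchy of Figure 2.8 as a parent -> children mapping.
    children = {
        "2012": ["2012Jan", "2012Feb", "2012Mar"],
        "2012Jan": ["2012JanMon", "2012JanThu"],
        "2012Feb": ["2012FebWed"],
        "2012Mar": ["2012MarTue", "2012MarFri"],
    }

    def all_leaves(h):
        """allLeaves(h): the leaf elements of the subtree rooted at h."""
        kids = children.get(h, [])
        if not kids:
            return {h}
        return set().union(*(all_leaves(k) for k in kids))

    # Hypothetical sales costs, known only at the leaf level.
    sales_cost = {"2012JanMon": 120.0, "2012JanThu": 80.0}

    def aggregate(h):
        """CA restricted to the time dimension: sum the measure over all leaves."""
        return sum(sales_cost.get(leaf, 0.0) for leaf in all_leaves(h))

    print(all_leaves("2012Jan"))  # {'2012JanMon', '2012JanThu'}
    print(aggregate("2012Jan"))   # 200.0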

To make the reasoning in terms of OLAP more precise and to strengthen the understanding of various cube-related concepts, we provide a formalization of the core OLAP notions.

An OLAP cube presents a multidimensional view on data from different sides (dimensions). Each dimension consists of a number of dimension attributes or values, which can also be called dimension elements or members. Members in a dimension can be organized into a hierarchy and correspond, as such, to a hierarchical level. These concepts are formalized in Definition 3.

Definition 3 (OLAP cube). Let
$D_i$, $1 \leq i \leq n$, be a set of dimension elements, where $n$ is the number of dimensions,
$H_i$, $1 \leq i \leq n$, be a set of hierarchy elements,
$CD = D_1 \times D_2 \times \ldots \times D_n$ be the cube dimensions,
$CH = H_1 \times H_2 \times \ldots \times H_n$ be the cube hierarchies,
$children : H_i \rightarrow \mathcal{P}(H_i)$ be the function returning the children of $h \in H_i$,
$allLeaves : H_i \rightarrow \mathcal{P}(H_i)$ be the function returning all leaves of $h \in H_i$,
such that for all $h \in H_i$ and $h_1, h_2 \in children(h)$: $allLeaves(h_1) \cap allLeaves(h_2) = \emptyset$,
$CS = (CD, CH)$ be the cube structure,
$CA : CH \rightarrow \mathbb{R}$ be the cube aggregate function.

An OLAP cube is defined as $OC = (CS, CA)$.

Given the multidimensional structure of an OLAP cube, the risk exists of having it populated with sparse data. Sparsity appears when, at the intersection of dimensions, there is often no corresponding measure of interest and thus an empty cell. Such behavior occurs in multidimensional cubes with a large number of sparse dimensions. A dimension is considered sparse when it has a large number of members that in most cases appear only once in the original data source, so that data values are missing for the majority of member combinations. On the contrary, in a dense dimension a data value exists for almost every dimension member.


So far, we focused on the multidimensional structure of the OLAP cube. However, learning how to employ it is particularly interesting, as it gives a feeling for OLAP's usefulness and applicability. Therefore, we further discuss one of the main features of OLAP: the OLAP operations. In [18], Chaudhuri and Dayal enumerate among the typical OLAP operations: slice and dice for selection and projection, drill-up (or roll-up) and drill-down for data grouping and ungrouping, and pivoting (or rotation) for re-orienting the multidimensional view of data. There are also other OLAP operations, e.g., ranking and drill-across [44]. However, the operations mentioned in [18] are considered sufficient for a meaningful exploration of the data.

The dice operation returns a subcube by selecting a subset of members on certain dimensions.

Definition 4 (Dice operation). Let $OC = (CS, CA)$ and $D'_i \subseteq D_i$ for all $1 \leq i \leq n$. The dice operation is $dice_{CD'}(OC) = OC'$, where
$OC' = (CS', CA')$,
$CS' = (CD', CH')$,
$CH' = H'_1 \times H'_2 \times \ldots \times H'_n$,
$H'_i = \{h \in H_i \mid \exists v \in D'_i : allLeaves(v) \cap allLeaves(h) \neq \emptyset\}$,
$children' : H'_i \rightarrow \mathcal{P}(H'_i)$, $children'(h) = children(h) \cap H'_i$,
$allLeaves' : H'_i \rightarrow \mathcal{P}(H'_i)$, $allLeaves'(h) = allLeaves(h) \cap H'_i$,
such that for $h \in H'_i$ and $h_1, h_2 \in children'(h)$: $allLeaves'(h_1) \cap allLeaves'(h_2) = \emptyset$,
$CA' : CH' \rightarrow \mathbb{R}$, $CA'(h_1, \ldots, h_n) = CA(h_1, \ldots, h_n)$ for $(h_1, \ldots, h_n) \in CH'$.

The slice operation is a special case of the dice operation. It produces a subcube by selecting a single member for one of its dimensions.

Definition 5 (Slice operation). Let $OC = (CS, CA)$. The slice operation is $slice_{k,v}(OC) = OC'$, where $1 \leq k \leq n$, $v \in D_k$, and $OC' = dice_{CD'}(OC)$ with $CD' = D_1 \times \ldots \times D_{k-1} \times \{v\} \times D_{k+1} \times \ldots \times D_n$.

Note that an OLAP cell can be defined as an OLAP subcube obtained by slicing each of the OLAP cube dimensions. Let $OC = (CS, CA)$. The OLAP cell is $slice_{1,v_1}(slice_{2,v_2}(\ldots(slice_{n,v_n}(OC))\ldots)) = OC'$.
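As a minimal sketch of Definitions 4 and 5, dimensions can be represented as member sets and diced by intersection; the member names follow Figure 2.7, and the code is illustrative rather than the thesis implementation:

    from itertools import product

    # Cube dimensions in the spirit of Figure 2.7 (illustrative members).
    dimensions = {
        "region": {"Netherlands", "Germany", "Belgium"},
        "time": {"2012Jan", "2012Feb", "2012Mar"},
        "sales": {"ProfitMargin%", "SalesCost"},
    }
    aggregate = {("Germany", "2012Mar", "ProfitMargin%"): 11.0}

    def dice(dims, selection):
        """Keep only the selected members per dimension (Definition 4)."""
        return {d: dims[d] & selection.get(d, dims[d]) for d in dims}

    def slice_(dims, k, v):
        """Slice = dice with a single member on dimension k (Definition 5)."""
        return dice(dims, {k: {v}})

    # Slicing every dimension down to one member yields an OLAP cell.
    cell = slice_(slice_(slice_(dimensions, "region", "Germany"),
                         "time", "2012Mar"), "sales", "ProfitMargin%")
    (coord,) = product(*(sorted(cell[d]) for d in ("region", "time", "sales")))
    print(coord, aggregate.get(coord))  # ('Germany', '2012Mar', 'ProfitMargin%') 11.0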

Through slice and dice operations, various OLAP subcubes are isolated. To make them useful for analysis purposes, the data from the cube should be visualized. Although the cube is a multidimensional structure, only two dimensions can be visualized at a time.

The pivoting (or rotation) operation changes the visualization perspective of the OLAP cube by swapping two dimensions $D_i$ and $D_j$.

Definition 6 (Pivoting operation). Let $OC = (CS, CA)$ with $CD = D_1 \times \ldots \times D_i \times \ldots \times D_j \times \ldots \times D_n$ and $CH = H_1 \times \ldots \times H_i \times \ldots \times H_j \times \ldots \times H_n$. The pivoting operation is $pivot_{i,j}(OC) = OC'$, where $1 \leq i, j \leq n$,
$OC' = (CS', CA')$,
$CS' = (CD', CH')$,
$CD' = D_1 \times \ldots \times D_j \times \ldots \times D_i \times \ldots \times D_n$,
$CH' = H_1 \times \ldots \times H_j \times \ldots \times H_i \times \ldots \times H_n$,
$children'(h) = children(h)$, $allLeaves'(h) = allLeaves(h)$,
$CA' : CH' \rightarrow \mathbb{R}$, $CA'(h_1, \ldots, h_j, \ldots, h_i, \ldots, h_n) = CA(h_1, \ldots, h_i, \ldots, h_j, \ldots, h_n)$ for $(h_1, \ldots, h_j, \ldots, h_i, \ldots, h_n) \in CH'$.

The roll-up operation consolidates some of the elements of a dimension into one element, which corresponds to a hierarchically superior level.

Definition 7 (Roll-up operation). Let $OC = (CS, CA)$ and $v \in H_k$, where $1 \leq k \leq n$. The roll-up operation is $rollup_{k,v}(OC) = OC'$, where $OC' = (CS', CA)$ with $CS' = (CD', CH)$ and $CD' = D_1 \times \ldots \times D_{k-1} \times ((D_k \setminus children(v)) \cup \{v\}) \times D_{k+1} \times \ldots \times D_n$.


The drill-down operation refines a member of a dimension into a set of members, corresponding to a hierarchically inferior level.

Definition 8 (Drill-down operation). Let $OC = (CS, CA)$ and $v \in D_k$, where $1 \leq k \leq n$. The drill-down operation is $drilldown_{k,v}(OC) = OC'$, where $OC' = (CS', CA)$ with $CS' = (CD', CH)$ and $CD' = D_1 \times \ldots \times D_{k-1} \times ((D_k \setminus \{v\}) \cup children(v)) \times D_{k+1} \times \ldots \times D_n$.
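Both operations rewrite a single dimension's member set against the hierarchy. A small sketch, reusing an illustrative fragment of the time hierarchy from earlier:

    # Parent -> children mapping (illustrative fragment of the time hierarchy).
    children = {
        "2012": ["2012Jan", "2012Feb", "2012Mar"],
        "2012Jan": ["2012JanMon", "2012JanThu"],
    }

    def roll_up(dim, v):
        """Replace the children of v by v itself (Definition 7)."""
        return (dim - set(children.get(v, []))) | {v}

    def drill_down(dim, v):
        """Replace v by its children (Definition 8)."""
        return (dim - {v}) | set(children.get(v, []))

    time_dim = {"2012Jan", "2012Feb", "2012Mar"}
    finer = drill_down(time_dim, "2012Jan")
    print(finer)  # {'2012JanMon', '2012JanThu', '2012Feb', '2012Mar'}
    print(roll_up(finer, "2012Jan") == time_dim)  # True: roll-up undoes it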

2.3.2 The Many Flavors of OLAP

Before the OLAP principle was introduced, relational databases were the most widely used technology for enterprise databases. Relational databases are stable and trustworthy and can be used for storing, updating and retrieving data. However, they provide limited functionality to support user views of data. Most notably lacking was the ability to consolidate, view, and analyze data according to multiple dimensions, in ways that make sense to one or more specific enterprise analysts at any given point in time [20]. Consequently, OLAP facilities were designed to compensate for the limitations of conventional relational databases.

The OLAP server functionality had to be implemented on top of an existing database technology. Relational databases were considered to be amongst the most reliable and popular types of databases [21]. Naturally, one of the proposed solutions was to add OLAP characteristics on top of a relational model; this is how the ROLAP (Relational OLAP) category came into existence. The OLAP layer provides a multidimensional view, calculation of derived data, and slice, dice and drill-down intelligence, while the relational database gives acceptable performance by employing a star-schema or snowflake data model [21, 43].

Being the most appropriate database type for OLTP, due to its design, the relational database is not as good an option for OLAP [20, 25]. Even though it presents close to real-time data loading and has advantages in terms of capacity, ROLAP exhibits slow query performance and is not always efficient when aggregating large amounts of data.

Instead, a multidimensional database approach was deemed to be more suitable [11, 54]. Known under the name of MOLAP (Multi-dimensional OLAP), this type of OLAP is created to achieve the highest possible query performance. Still, MOLAP has its own deficiencies: it works best for cubes with a limited number of sparse dimensions, and sparse data within large cubes often causes performance problems.

Hence, the advantages of ROLAP are the disadvantages of MOLAP and vice versa. Therefore, the HOLAP (Hybrid OLAP) version was introduced as a combination of the two, to compensate for the deficiencies of each technology [46]. HOLAP is one of the OLAP types going mainstream among the next-generation OLAP solutions. Additional technologies, such as in-memory OLAP, are considered for speed-oriented systems. Nonetheless, depending on the data characteristics (e.g., summarized, detailed), one or a combination of these technologies can be considered. Even though multi-hybrid models (e.g., MOLAP and real-time in-memory for analysis and HOLAP for drill-through) are designed to incorporate most of the OLAP benefits, there is still no generic OLAP architecture or standard procedure that guarantees optimal performance independent of the requirements.

With the growth of available memory capacity, and because memory prices decrease over time, the feasibility of storing large databases in memory increases. As a consequence, disk-based databases are more and more often replaced with in-memory database technology. While conventional disk-based database systems (DRDB) store data on disk, main memory database systems (MMDB) [26] store and access data directly in main physical memory. Therefore, the response times and transaction throughputs of an MMDB are considerably better than those of a disk-based database system. Obviously, a DRDB still has advantages in terms of capacity; there are very large databases that simply cannot fit in memory, e.g., a database containing NASA space data (with images). However, it is difficult for a DRDB to compete with the speed of an MMDB. That is, a database of reasonable size stored in memory outperforms a database stored on disk.


Chapter 3

Process Cube

In Section 1.3, the goal of this master project was described as creating a proof-of-concept tool to allow the comparison of multiple processes. In Section 1.4, the process cube was introduced as a means to satisfy this goal. Both process mining and OLAP aspects were described in Chapter 2. Being the central component of the system, the process cube links the process mining framework to the existing OLAP technology. By storing event logs in OLAP multidimensional structures, event data can be used to obtain and compare process mining results. In this chapter, the concept of the process cube is explained in detail, together with an example that shows its functionality and a comparison with other hypercube structures. Before proceeding with the process cube materialization in Chapter 4, a set of requirements is established and enumerated at the end of the chapter.

3.1 Process Cube Concept

In Section 2.2.1, the definitions of an event with attributes (Definition 1) and of a case with attributes (Definition 2) were given. Section 2.3.1 includes the definition of an OLAP cube (Definition 3) with its corresponding operations (Definitions 4, 5, 6, 7, 8). In this section, the process cube and process cell notions are introduced by adding event log aspects to the OLAP cube definition. For a further elaboration and formalization of the process cube concept, see the paper [6], which was published towards the end of this project.

Figure 3.1: Process Cube Concept.

Figure 3.1, taken from [4], shows relevant process cube characteristics and is therefore representative for the definitions of the different process cube concepts given below (e.g., process cube, process cell). A detailed discussion of the elements of Figure 3.1 is presented in [6].


A process cube is a multidimensional structure built from event log data in a way that facilitates further meaningful process mining analysis. A process cube is composed of a set of process cells [4], and the main difference between a process cube and an OLAP cube lies in its cell characteristics. In contrast to the OLAP cube, there is no real measure of interest quantifying a business operation. While OLAP structures are designed for the analysis of business operations, the process cube aims at analyzing processes. Therefore, each dimension of analysis is composed of event attributes, and the content of a cell in the process cube changes from real numbers to events. While in OLAP the dimensions of analysis are used to populate the cube, in the case of process cubes the events of an event log are used to create the dimensions of analysis. Hence, instead of the $MI$ function, the event members function is defined as $EM : \mathcal{E} \rightarrow DA_1 \times \ldots \times DA_n$. Note that, to differentiate between two events with the same attributes, the event id is added as a dimension of analysis. Consequently, for each event there is a unique combination of dimension of analysis members.

Definition 9. (Process cube)
Let Di, 1 ≤ i ≤ n, be sets of dimension elements, where n is the number of dimensions,
Hi, 1 ≤ i ≤ n, be sets of hierarchy elements,
CD = D1 × D2 × . . . × Dn be the cube dimensions,
CH = H1 × H2 × . . . × Hn be the cube hierarchies,
children : Hi → P(Hi), where children(h) is the function returning the children of h ∈ Hi,
allLeaves : Hi → P(Hi), where allLeaves(h) is the function returning all leaves of h ∈ Hi, such that for h ∈ Hi and h1, h2 ∈ children(h): allLeaves(h1) ∩ allLeaves(h2) = ∅,
CS = (CD, CH) be the process cube structure,
CE : CH → P(E) be the cell event function, with CE(h1, h2, . . . , hn) = {e ∈ E | (d1, d2, . . . , dn) = CC(e), di ∈ allLeaves(hi), 1 ≤ i ≤ n} for (h1, h2, . . . , hn) ∈ CH.

A process cube is defined as PC = (CS, CE).

Note that a process cell can be defined as a subcube obtained by slicing each of the process cube dimensions. Let PC = (CS, CE). Then a process cell is slice1,v1(slice2,v2(. . . (slicen−1,vn−1(slicen,vn(PC))) . . .)) = PC′. Each cell in the process cube corresponds to a set of events [4], returned by the cell event function CE.
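To make Definition 9 concrete, the following minimal Java sketch models hierarchy elements with children and allLeaves, and implements the cell event function CE over a coordinate map CC. All class and member names are illustrative assumptions for this thesis text; they are not the classes of the actual implementation described later.

import java.util.*;

// Minimal sketch of the structures from Definition 9 (illustrative names).
class HierarchyNode {
    final String name;
    final List<HierarchyNode> children = new ArrayList<>();

    HierarchyNode(String name) { this.name = name; }

    // allLeaves(h): the leaf descendants of h (h itself if it has no children).
    Set<String> allLeaves() {
        if (children.isEmpty()) return Set.of(name);
        Set<String> leaves = new HashSet<>();
        for (HierarchyNode c : children) leaves.addAll(c.allLeaves());
        return leaves;
    }
}

class ProcessCube {
    // CC(e): maps each event id to its tuple of dimension members (d1, ..., dn).
    final Map<String, List<String>> cellCoordinates = new HashMap<>();

    // CE(h1, ..., hn): the events whose coordinates fall under the selected
    // hierarchy elements, i.e., di ∈ allLeaves(hi) for every dimension i.
    Set<String> cellEvents(List<HierarchyNode> selection) {
        Set<String> events = new HashSet<>();
        for (Map.Entry<String, List<String>> entry : cellCoordinates.entrySet()) {
            List<String> coords = entry.getValue();
            boolean match = true;
            for (int i = 0; i < selection.size() && match; i++)
                match = selection.get(i).allLeaves().contains(coords.get(i));
            if (match) events.add(entry.getKey());
        }
        return events;
    }
}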

The process cube, as defined above, is a structure that does not allow overlapping of events in its cells. To allow the comparison of different processes using the process cube, a table of visualization is created. The table of visualization is used to visualize only two dimensions at a time. Multiple slice and dice operations can be performed by selecting different elements of the two dimensions. Each slice, dice, roll-up or drill-down is considered to be a filtering operation. Hence, a new filter is created with each OLAP operation. Filters are added as rows/columns in the table of visualization. Note that, unlike the cells of the process cube, the cells of the table of visualization may contain overlapping events. That is because there is no restriction against selecting the same dimension members for two filtering operations.

Given a process cube PC, a process model MPC is the result of a process discovery algorithm, such as the Alpha Miner, the Heuristic Miner or other related algorithms, applied to PC. However, there are various process mining algorithms whose results are not necessarily process models. Instead, they can offer other insightful process-related information. For example, the Dotted Chart Analysis provides metrics (e.g., the average interval between events) related to events and their distribution over time. Likewise, process cubes are not limited to process models. Nevertheless, we refer to process mining results simply as models.

So far, we described the process cube as a hypercube structure with a finite number of dimensions. In [4], a special process cube is presented, with three dimensions: case type (ct), event class (ec) and time window (tw).

Figure 3.2, taken from [4], contains a table corresponding to a fragment of an event log. Let the event data from the event log be used to construct a process cube PC. Then, the ct, ec and tw dimensions are established as follows. The case type dimension is based on the properties of a case. For example, the case type dimension can be represented by the type of the customer, in which case the members of ct are gold and silver, i.e., D1 = {gold, silver}, H1 = D1.


Figure 3.2: Event log excerpt.

The event class dimension is based on the properties of an event. For example, ec can be represented by the resource and include, as such, the following members: D2 = {John}, H2 = D2. The time window dimension is based on timestamps. A time window can refer to years, months, days of the week, quarters or any other relevant period of time. Due to its natural hierarchical structure, the tw dimension can be organized as a hierarchy, e.g., 2012 → 2012Dec → 2012DecSun. We consider D3 = {2012DecSun} and H3 = {2012, 2012Dec, 2012DecSun}.

Let D1 = {gold, silver}, D2 = {John} and D3 = {2012DecSun},
H1 = {gold, silver}, H2 = {John} and H3 = {2012, 2012Dec, 2012DecSun},
CD = D1 × D2 × D3 be the cube dimensions,
CH = H1 × H2 × H3 be the cube hierarchies.
For h1, h2 ∈ H3 with h1 = 2012 and h2 = 2012Dec: children(h1) = {2012Dec} and children(h2) = {2012DecSun},
allLeaves(h1) = {2012DecSun} and allLeaves(h2) = {2012DecSun}.
Let CS = (CD, CH) be the process cube structure.
For h1 ∈ H1, h1 = gold: allLeaves(h1) = {gold}; for h2 ∈ H2, h2 = John: allLeaves(h2) = {John}; for h3 ∈ H3, h3 = 2012: allLeaves(h3) = {2012DecSun}.
Then CE(h1, h2, h3) = {35654423}, since CC(35654423) = (gold, John, 2012DecSun).

For the rest of the elements of CH, CE is defined in the same way. The process cube is defined as PC = (CS, CE).
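Using the illustrative sketch from the beginning of this section, this example can be instantiated as follows (again purely illustrative; event id and member names are those of the example above):

import java.util.List;

public class ProcessCubeExample {
    public static void main(String[] args) {
        // Time hierarchy 2012 -> 2012Dec -> 2012DecSun.
        HierarchyNode y2012 = new HierarchyNode("2012");
        HierarchyNode dec = new HierarchyNode("2012Dec");
        HierarchyNode sun = new HierarchyNode("2012DecSun");
        y2012.children.add(dec);
        dec.children.add(sun);

        ProcessCube pc = new ProcessCube();
        // CC(35654423) = (gold, John, 2012DecSun).
        pc.cellCoordinates.put("35654423", List.of("gold", "John", "2012DecSun"));

        HierarchyNode gold = new HierarchyNode("gold");
        HierarchyNode john = new HierarchyNode("John");
        // allLeaves(2012) = {2012DecSun}, so the event matches on all dimensions:
        // CE(gold, John, 2012) = {35654423}.
        System.out.println(pc.cellEvents(List.of(gold, john, y2012)));
    }
}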

Figure 3.3: A process model discovered from an extended version of the event log in Figure 3.2 using the Alpha Miner algorithm.

Each process cell l can be used to discover a process model Ml. However, a process model can also be discovered from a group of cells Q (yielding MQ), or from the entire process cube PC (yielding MPC). Figure 3.3 shows a process model discovered from all the event data in the process cube PC: MPC is the process model discovered with the Alpha Miner algorithm from the set of events returned by CE. This is possible when considering the process cube as corresponding to a single cell in the table of visualization.


3.2 Process Cube by Example

In the previous section, the process cube was introduced together with a formalization of its relevant concepts. In this section, we continue by describing its functionality by means of an example.

Figure 3.4: Functionality in three steps: 1. From XES data to process cube structure. 2. Applying OLAP operations to the process cube. 3. Materialization of process cells.

We propose a three-step approach, as depicted in Figure 3.4. In the first step, the event data for this example is presented in a XES-like format. The event data is then used to construct a process cube prototype. While building the process cube, its various characteristics are clearly specified by referring to definitions from Section 3.1. The aim of the second step is to show ways of exploring the process cube. In that sense, a range of OLAP operations (e.g., slice, dice, roll-up, drill-down, pivoting) is applied to it. As such, the process cube is prepared for the last step: the process cube analysis. More precisely, in the third step, it is described how parts of the process cube are materialized into event logs and then used to obtain process models. These models can then be compared to discover similarities and dissimilarities between their underlying processes.

3.2.1 From XES Data to Process Cube Structure

Table 3.1 contains the event data used in this example to illustrate the process cube functionality. This data is needed to build the process cube structure. In practice, explicit case ids and/or event ids may be missing. By Definition 1 and Definition 2, both events and cases are represented by unique identifiers. Therefore, when these identifiers do not exist in the original data source, they can be generated automatically when extracting the data.

The definition of the process cube (Definition 9) describes the process cube as an n-dimensional structure. Thus, establishing the dimensions is an important step in the creation of a process cube. There is no unique way of deciding on a process cube's dimensions. One possibility is to select each case attribute and event attribute as a dimension. When applied to our example, this choice leads to a process cube with 5 dimensions. Should the case id and the event id also be considered, the final structure is a 7-dimensional process cube. By considering each distinct attribute value as a dimension member, the resulting process cube has 4 × 2 × 2 × 43 × 43 × 14 × 2 = 828,352 process cells. It is easy to notice that the case id, event id and timestamp are sparse dimensions, causing the entire process cube to be sparse. Sparsity was discussed in Section 2.3.1.

Another possibility is to limit the number of dimensions to three, as suggested in [4]. Based on the case properties, the case type dimension can contain members created from both the parts and the sum leges attributes. The parts attribute specifies for what building parts a building permit can be requested, e.g., Bouw, Milieu. The sum leges attribute gives the total cost of a building permit application, e.g., 138.55, 179.8. At this point, it is important to establish a representative dimension member, as it can influence further analysis.


case id | parts  | sum leges | event id | timestamp           | activity     | resource
1       | Bouw   | 138.55    | 1        | 2012-02-21T11:52:13 | 01 HOOFD 010 | 560464
        |        |           | 2        | 2012-02-21T11:56:31 | 01 HOOFD 020 | 560464
        |        |           | 3        | 2012-02-21T12:15:07 | 01 HOOFD 040 | 560925
        |        |           | 4        | 2012-02-21T12:19:22 | 01 HOOFD 050 | 560464
        |        |           | 5        | 2012-02-21T12:50:18 | 01 HOOFD 055 | 560464
        |        |           | 6        | 2012-02-21T14:09:49 | 01 HOOFD 060 | 560925
2       | Bouw   | 138.55    | 7        | 2012-03-08T12:03:11 | 01 HOOFD 010 | 560464
        |        |           | 8        | 2012-03-08T12:07:53 | 01 HOOFD 020 | 560464
        |        |           | 9        | 2012-03-08T12:31:15 | 01 HOOFD 040 | 560925
        |        |           | 10       | 2012-03-08T13:22:08 | 01 HOOFD 060 | 560925
        |        |           | 11       | 2012-03-08T13:35:47 | 01 HOOFD 065 | 560925
        |        |           | 12       | 2012-03-08T14:53:34 | 01 HOOFD 120 | 560925
        |        |           | 13       | 2012-03-08T15:20:55 | 01 HOOFD 260 | 560464
        |        |           | 14       | 2012-03-08T15:36:19 | 09 AH I 010  | 560925
        |        |           | 15       | 2012-03-08T15:56:41 | 01 HOOFD 430 | 560925
3       | Milieu | 179.8     | 16       | 2012-03-12T09:03:52 | 01 HOOFD 010 | 560464
        |        |           | 17       | 2012-03-12T09:08:21 | 01 HOOFD 020 | 560464
        |        |           | 18       | 2012-03-12T09:17:39 | 01 HOOFD 040 | 560925
        |        |           | 19       | 2012-03-12T09:42:48 | 01 HOOFD 050 | 560925
        |        |           | 20       | 2012-03-12T10:15:07 | 06 VD 010    | 560925
        |        |           | 21       | 2012-03-12T10:24:56 | 01 HOOFD 120 | 560925
        |        |           | 22       | 2012-03-12T10:49:01 | 01 HOOFD 180 | 560925
        |        |           | 23       | 2012-03-12T11:18:19 | 01 HOOFD 260 | 560925
4       | Bouw   | 138.55    | 24       | 2012-03-15T13:11:06 | 01 HOOFD 010 | 560464
        |        |           | 25       | 2012-03-15T13:15:27 | 01 HOOFD 020 | 560464
        |        |           | 26       | 2012-03-15T13:37:42 | 01 HOOFD 040 | 560925
        |        |           | 27       | 2012-03-15T14:02:18 | 01 HOOFD 050 | 560925
        |        |           | 28       | 2012-03-15T14:19:32 | 01 HOOFD 065 | 560925
        |        |           | 29       | 2012-03-15T15:06:11 | 01 HOOFD 120 | 560464
        |        |           | 30       | 2012-03-15T15:46:37 | 01 HOOFD 180 | 560464
        |        |           | 31       | 2012-03-15T16:10:44 | 01 HOOFD 260 | 560464
        |        |           | 32       | 2012-03-15T16:42:01 | 01 HOOFD 380 | 560464
        |        |           | 33       | 2012-03-15T16:53:26 | 01 HOOFD 430 | 560925

Table 3.1: Event Log Example

This can be achieved, for instance, by employing data mining techniques. For this example, we describe a simple two-step approach. First, cases are grouped in clusters, based on their properties. It is obvious that cases 1, 2 and 4 belong to one cluster, as they all have the same case properties, and case 3 belongs to another cluster. Secondly, a classification technique (a decision tree learning algorithm) is applied to the clustering results. In this example, we expect to identify, after classification, a representative number, e.g., 150, for the sum leges attribute that differentiates between the two clusters. Consequently, the following two case type dimension members can be considered representative: (parts = Bouw, sum leges < 150) and (parts = Milieu, sum leges >= 150). The difficulty of this approach is that it requires data mining knowledge to store the event data in the process cube.

There is also a middle-ground approach. For instance, the number of dimensions can still be kept small, but not necessarily limited to three. Moreover, one dimension can contain a single property instead of a combination of properties. In this case, the attributes that do not end up as dimensions can still be stored in a cell. For this example, we consider 4 dimensions: parts, resource, activity and timestamp. The parts dimension has two elements, D1 = {Bouw, Milieu}. The resource dimension also has two elements, D2 = {560464, 560925}. The activity dimension consists of 14 elements, e.g., 01 HOOFD 010, 09 AH I 010 and others. While the first three dimensions have a relatively small number of members, the timestamp dimension consists of 43 different members. To reduce this number, only the year, the month and the day of the week are considered for the timestamp dimension, and the rest is stored in the cell. Consequently, the size of the timestamp dimension is reduced to three members: 2012FebTue, 2012MarMon and 2012MarThu. As a result, the process cube PC consists of 2 × 14 × 3 × 2 = 168 process cells.

To show the content of a process cell of the process cube PC, we use the CE function on a set of selected hierarchy elements. For h1 ∈ H1, h1 = Bouw, allLeaves(h1) = {Bouw}; h2 ∈ H2, h2 = 560925, allLeaves(h2) = {560925}; h3 ∈ H3, h3 = 01 HOOFD 040, allLeaves(h3) = {01 HOOFD 040}; h4 ∈ H4, h4 = 2012MarThu, allLeaves(h4) = {2012MarThu}, the CE function returns CE(h1, h2, h3, h4) = {9, 26}. Both

CC(9) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu) and

CC(26) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu)

return the same tuple of dimension members. Event data that is not yet stored as dimension values can still be stored in the process cell containing events 9 and 26, as shown in Table 3.2.

case id | sum leges | event id | timestamp
2       | 138.55    | 9        | 2012-03-08T12:31:15
4       | 138.55    | 26       | 2012-03-15T13:37:42

Table 3.2: Event data corresponding to the process cell defined by CE(h1, h2, h3, h4) = {9, 26}.

3.2.2 Applying OLAP Operations to the Process Cube

In Section 2.3.1, the following OLAP operations were described: slice, dice, pivoting, roll-up and drill-down. In this section, we show, by means of an example, how these operations can be applied to a process cube.

Figure 3.5: Process cube by example (dimensions D1 (parts), D2 (resource), D3 (activity) and D4 (timestamp), with the timestamp hierarchy H4 on the left). In orange, 2012FebTue and 2012MarThu are selected on the timestamp dimension and used for dicing the process cube. In green, the subcube that results from slicing the diced subcube on the 560464 member of the resource dimension. In red, the subcube that results from slicing the diced subcube on the 560925 member of the resource dimension.

Figure 3.5 illustrates the 4-dimensional process cube PC, constructed in the previous step. To represent the 4D structure in a 2D plane, the members of the timestamp hierarchy are displayed on the left. The root element of the hierarchy is the year 2012, followed by the month elements 2012Feb and 2012Mar, with the days of the week as the leaf nodes: 2012FebTue, 2012MarMon and 2012MarThu. To each leaf member of the timestamp dimension corresponds a 3D subcube like the one on the right.

For the process cube PC, we first perform a dice by selecting the 2012FebTue and 2012MarThu members on the timestamp dimension. Let PC = (CS, CE), D′i = Di for all 1 ≤ i ≤ 3, and D′4 = {2012FebTue, 2012MarThu}. The dice operation is diceCD′(PC) = PC′, where

PC′ = (CS′, CE′),
CS′ = (CD′, CH′),
CD′ = D′1 × D′2 × D′3 × D′4,
CH′ = H1 × H2 × H3 × H′4,
allLeaves(2012) = {2012FebTue, 2012MarMon, 2012MarThu},
allLeaves(2012FebTue) = {2012FebTue},
so allLeaves(2012) ∩ allLeaves(2012FebTue) = {2012FebTue}, . . .
H′4 = {2012, 2012Feb, 2012Mar, 2012FebTue, 2012MarThu},
for h ∈ H4, h = 2012Mar: children(h) = {2012MarMon, 2012MarThu} and children′(h) = children(h) ∩ H′4 = {2012MarThu}, . . .
for h ∈ H4, h = 2012Mar: allLeaves(h) = {2012MarMon, 2012MarThu} and allLeaves′(h) = allLeaves(h) ∩ H′4 = {2012MarThu}, . . .
CE′(h1, . . . , h4) = CE(h1, . . . , h4), for (h1, . . . , h4) ∈ CH′.

Further, two slice operations are performed on the diced subcube PC′, by selecting first the 560464 and then the 560925 member of the resource dimension. The resulting subcubes, PC560464 and PC560925, are still 4D structures, although they have only one member on the resource dimension. The corresponding 3D subcubes, with the timestamp dimension left out for presentation reasons, are depicted in Figure 3.5. The PC560464 subcube is shown in green and the PC560925 subcube in red.

The slice operation where the 560464 resource is selected is slice2,560464(PC′) = PC560464, with PC560464 = diceCD560464(PC′) and CD560464 = D′1 × {560464} × D′3 × D′4. The slice operation where the 560925 resource is selected is slice2,560925(PC′) = PC560925, with PC560925 = diceCD560925(PC′) and CD560925 = D′1 × {560925} × D′3 × D′4.
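To illustrate these two operations in code, the sketch below implements dice as a per-dimension member selection and slice as the special case that fixes one dimension to a single member. Dimensions are simplified to plain member sets here; this is an assumption for illustration, not the Palo representation.

import java.util.*;

// Sketch: dice keeps, per dimension, only the selected members; slice fixes
// one dimension to a single member (dimensions as plain member sets).
public class CubeSelection {
    static List<Set<String>> dice(List<Set<String>> dims, List<Set<String>> selection) {
        List<Set<String>> result = new ArrayList<>();
        for (int i = 0; i < dims.size(); i++) {
            Set<String> kept = new HashSet<>(dims.get(i));
            kept.retainAll(selection.get(i));       // D'_i = D_i ∩ selection_i
            result.add(kept);
        }
        return result;
    }

    static List<Set<String>> slice(List<Set<String>> dims, int dim, String member) {
        List<Set<String>> selection = new ArrayList<>(dims);
        selection.set(dim, Set.of(member));         // fix one dimension to one member
        return dice(dims, selection);
    }

    public static void main(String[] args) {
        List<Set<String>> dims = List.of(
            Set.of("Bouw", "Milieu"),               // D'_1 (parts)
            Set.of("560464", "560925"),             // D'_2 (resource)
            Set.of("01 HOOFD 010", "01 HOOFD 040"), // D'_3 (activity, abbreviated)
            Set.of("2012FebTue", "2012MarThu"));    // D'_4 (timestamp, after the dice)
        // slice_{2,560464}(PC'): the resource dimension collapses to {560464}.
        System.out.println(slice(dims, 1, "560464"));
    }
}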

While slice and dice operations are used to select parts of a process cube, the pivoting, roll-up and drill-down operations help in visualizing the selections. As mentioned in Section 2.3.1, only two of the process cube dimensions can be visualized at a time. For example, in Figure 3.5, the parts and resource dimensions can be easily visualized. This part of the cube indicates which resources are responsible for handling cases for Bouw and which for Milieu. It is also possible to visualize the activity dimension, but not all its elements can be clearly distinguished.

By the pivoting (or rotation) operation, the visualization perspective of the process cube can be changed. For example, by placing the activity dimension on the x-axis instead of the parts dimension, and the parts dimension on the y-axis instead of the activity dimension, the cube is rotated and a new side of it can be visualized. Such a change makes it easy to distinguish the activities corresponding to the Bouw and Milieu parts, together with their corresponding cells.

The pivoting operation is pivot1,3(PC′) = PC′p, where

PC′p = (CS′p, CE′p),
CS′p = (CD′p, CH′p),
CD′p = D′3 × D′2 × D′1 × D′4,
CH′p = H′3 × H′2 × H′1 × H′4,
children′(h) = children(h),
allLeaves′(h) = allLeaves(h),
CE′p(h3, h2, h1, h4) = CE(h1, h2, h3, h4).
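Since pivoting only permutes the order in which dimensions are addressed, leaving cell contents untouched, it can be sketched as a coordinate permutation (again an illustrative sketch, not the Palo implementation):

import java.util.*;

// Sketch: pivot_{1,3} swaps dimensions 1 and 3, matching
// CE'_p(h3, h2, h1, h4) = CE(h1, h2, h3, h4).
public class Pivot {
    static List<String> pivot(List<String> coords, int a, int b) {
        List<String> rotated = new ArrayList<>(coords);
        rotated.set(a, coords.get(b));   // dimension b takes position a
        rotated.set(b, coords.get(a));   // dimension a takes position b
        return rotated;
    }

    public static void main(String[] args) {
        // (parts, resource, activity, timestamp) -> (activity, resource, parts, timestamp)
        System.out.println(pivot(List.of("Bouw", "560925", "01 HOOFD 040", "2012MarThu"), 0, 2));
    }
}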

The roll-up and drill-down operations have an impact when applied to a dimension with a hierarchical structure. Through a roll-up operation, members of a hierarchically inferior level are replaced with a member of a hierarchically superior level. For this example, we consider the timestamp dimension with its elements 2012FebTue, 2012MarMon and 2012MarThu. A roll-up operation on the children of 2012Mar replaces the current timestamp elements with 2012FebTue and 2012Mar.

The roll-up operation is then rollup4,2012Mar(PC′) = PC′r, where PC′r = (CS′r, CE) with CS′r = (CD′r, CH) and CD′r = D′1 × D′2 × D′3 × ((D′4 \ children(2012Mar)) ∪ {2012Mar}).

While the roll-up operation folds elements from an inferior hierarchical level into elements of a superior one, the drill-down operation expands members from hierarchically superior levels. We consider again the timestamp dimension. For the previous PC′r subcube, a drill-down operation on the 2012Mar element replaces the current dimension elements with 2012FebTue, 2012MarMon and 2012MarThu.

The drill-down operation is then drilldown4,2012Mar(PC′r) = PC′d, where PC′d = (CS′d, CE) with CS′d = (CD′d, CH) and CD′d = D′1 × D′2 × D′3 × ((D′4 \ {2012Mar}) ∪ children(2012Mar)).
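The two set expressions above translate directly into code. The following sketch operates on one hierarchical dimension and reuses the illustrative HierarchyNode class from Section 3.1 (an assumption for illustration):

import java.util.*;

// Sketch of roll-up and drill-down on one hierarchical dimension, mirroring
// CD'_r and CD'_d above.
public class HierarchyOps {
    static Set<String> rollUp(Set<String> members, HierarchyNode parent) {
        Set<String> result = new HashSet<>(members);
        for (HierarchyNode c : parent.children) result.remove(c.name); // D'_4 \ children(2012Mar)
        result.add(parent.name);                                       // ... ∪ {2012Mar}
        return result;
    }

    static Set<String> drillDown(Set<String> members, HierarchyNode parent) {
        Set<String> result = new HashSet<>(members);
        result.remove(parent.name);                                    // D'_4 \ {2012Mar}
        for (HierarchyNode c : parent.children) result.add(c.name);    // ... ∪ children(2012Mar)
        return result;
    }
}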

3.2.3 Materialization of Process Cells

In the previous step, the applicability of the OLAP operations was shown by means of an example. The main emphasis was on the changes that occur at the dimension level. Naturally, the question arises as to what happens at the cell level. The last step of our approach answers this question. Our explanation relies on Figure 3.6, presented in more detail in [6].

Figure 3.6: Partitioning of the process cube. The split operation is realized by drill-down. The functionality of the merge operation is given by roll-up.

The left part of Figure 3.6 shows the process cube created from an extended version of the event log in Figure 3.2. The top part of the process cube depicts a simplified event log corresponding to the process cube. The step of extracting an event log based on the event data from the process cube, or from parts of it (process cells or groups of cells), is known as the materialization step. The resulting event logs are then given as input to different process mining algorithms. The outcome is a set of process models which can be visualized. Back to our example, the event log shown at the top of the process cube is used to obtain the process model shown at the bottom, by applying the Alpha Miner algorithm to it.

The right part of Figure 3.6 shows the result of splitting the process cube from the left on its case type and event class dimensions. In the figure, two types of splitting can be identified. Vertical splits separate entire cases. For example, by splitting on the case type dimension, cases 1, 4, 5, 6 are separated from cases 2, 3, 7, 8. The results of a horizontal split are no longer whole cases, but rather parts of cases corresponding to subsets of activities. For example, by splitting on the event class dimension, activities A, C are representative for the cell given by CE(silver customer, sales, 2012) and activities C, D, E, F, G are representative for the cell given by CE(silver customer, delivery, 2012). Note that activity C is present in both cells, i.e., activity C can be executed in both the sales and the delivery departments. This is possible because the activity attribute is not a dimension in the process cube, and therefore the same activity can be present in multiple cells.

(a) The resulting process model after slicing on the 560464 resource. (b) The resulting process model after slicing on the 560925 resource.

Figure 3.7: Process mining results for PC560464 and PC560925.

When related to the OLAP operations, the split operation is realized by the drill-down operation and the merge operation is realized by the roll-up operation.

In the second step, based on a process cube example, several OLAP operations were presented. After “playing” with the process cube, one is interested in materializing the selected parts of the process cube and obtaining meaningful process mining results. The PC560464 and PC560925 subcubes are among the subcubes obtained in the second step. Figure 3.7a presents the resulting process model MPC560464 for the process cube PC560464. Similarly, Figure 3.7b presents the resulting process model MPC560925 for the process cube PC560925. Now the two process models can be compared to find differences and similarities. An immediate similarity is that both processes contain the same activities 01 HOOFD 050 and 01 HOOFD 120. There are many differences, related both to the activities and to the control flow. One could start by noticing that one process starts with activity 01 HOOFD 010, while the other starts with activity 01 HOOFD 040.

3.3 Requirements

Now that we have established the desired functionality of a process cube, the next step is to find technologies and methods to turn the process cube concept into a real application. There is no fixed recipe that guarantees the achievement of this goal. Multiple tools are available that can accommodate the desired process cube functionality, and there is certainly more than one way to approach the problem. Nevertheless, there is a list of requirements that should be met, independent of the chosen technology and implementation.


As our goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes becomes possible, and based on the process cube functionality presented in this chapter, the following requirements are derived:

1. The system shall include an OLAP Server with support for traditional OLAP operations.

2. External tools shall be open to adjustments. They shall offer the possibility to add new functionality and change the existing one.

3. The application shall be programmed in Java to enable integration with ProM.

4. External tools shall provide means to enable their employment in a Java-written system.

The first requirement is quite straightforward, considering the goal of this project. The OLAP Server organizes data in multidimensional structures, which facilitates the inspection of the stored data from different perspectives. In that sense, the OLAP Server can also be used to examine the different views of a process. Employing traditional OLAP operations on the OLAP multidimensional structures provides quick and facile filtering. By means of this functionality, the integrated analysis of multiple processes can be supported.

Since the OLAP Server is an indispensable component of the system, it has to be either created from scratch or taken from an external tool. Creating an OLAP Server from scratch undoubtedly implies a vast amount of work. Under the circumstances, employing an already existing OLAP Server to save time seems a plausible idea. Moreover, parts of an OLAP Client application can also be reused to save time. However, in this case, the second requirement has to be considered. The existing OLAP tools cannot handle event logs and do not support process mining analysis. Therefore, an external OLAP tool shall allow adding this functionality and changing the existing one, should this be the case. This is possible only if the external tool is open source.

The ProM Framework was introduced in Section 2.5 as a platform hosting multiple plugins that implement different process mining algorithms. Clearly, it is wise to reuse the existing process mining techniques, as they provide sufficient methodology to perform process analysis. However, to facilitate easy integration with ProM, Java is the preferred programming language.

The fourth requirement is a consequence of the third requirement. External parties must possess interfacing capabilities with the system. Since the main application has to be written in Java, external tools should be either Java-based or provide a Java Application Programming Interface (API) to allow their employment in the system.

3.4 Comparison to Other Hypercube Structures

Before starting with the process cube implementation, a literature study is performed to identify the cubes with the closest functionality and requirements to the process cube. The reason for doing this is threefold. First, one can find similarities with other hypercube structures, in which case some of their functionality can be reused. Secondly, identifying limitations of the current multidimensional structures helps in clarifying what is still to be done. Finally, previous work on similar OLAP cubes can suggest where one could expect difficulties.

Data loaded in traditional OLAP cubes comes from different sources, e.g., multiple data warehouses. Due to the considerable growth of stored data, simple ways of data representation are sought to conveniently keep data outside local databases. OLAP cubes have also been adjusted to handle data in different formats. For example, OLAP cubes can be specified on XML data [34]. Still, OLAP cubes cannot support data in the XES format, typical for event logs, because of the specific characteristics of event data.

OLAP cubes are designed to work with numerical measures, and various ways of computing numerical aggregates have been explored, from the traditional sum, count and average to sorting-based algorithms [10] and agglomerative hierarchical clustering [40]. In [45], several measures are proposed to summarize process behavior in a multidimensional process model. Among those, instance-based measures (e.g., the average throughput time of all process instances), event-based measures (e.g., the average execution time of process events) and flow-based measures (e.g., the average waiting time between process events) are the most relevant.

In recent years, non-numerical data has also been considered in an OLAP setting. OLAP cubes have been extended to graphs [52], sequences [37, 38] and text data [36]. Creating a Text Cube became possible by employing information retrieval techniques and selecting term frequency and inverted index measures.

In [45], the Event Cube is presented. Unlike other OLAP cubes, this multidimensional structure is constructed for the inspection of different perspectives of a business process, which in fact coincides with the purpose of the process cube. To accomplish this, event information is summarized by means of different measures of interest. For instance, the control-flow measure is used to directly apply the Multidimensional Heuristics Miner process discovery algorithm. The difficulty with this approach is that traditional process mining techniques have to be extended with multidimensional capabilities, in the same way as was done for the Flexible Heuristics Miner: the Multidimensional Heuristics Miner was introduced as a generalization of the Flexible Heuristics Miner to handle multidimensional process information. Of course, extending existing process mining techniques requires a lot of effort. Therefore, we propose a conceptually clearer and more generic approach. That is, instead of adjusting all process mining techniques to multidimensionality, the OLAP multidimensional structure can be adjusted to allow employing existing process mining techniques, without the need to change them.

All in all, the process cube is unique in that it allows the storage of event data in its multidimensional structure, which is further used for process analysis purposes by employing existing process mining techniques. This approach creates a bridge between process mining and OLAP, as methods from both fields are applied interchangeably. The advantage is that quick discovery and analysis of business processes and of their corresponding sub-processes is facilitated in an integrated way. Moreover, no changes to the applied traditional process mining techniques are needed.


Chapter 4

OLAP Open Source Choice

Based on the conceptual aspects previously introduced, in the following chapters we continue with describing the prototype solution. Before going into detail with respect to the implementation, in this chapter we motivate our technology choice.

The process cube formalization from Chapter 3 indicates the need for process mining and OLAP support. For process mining, the selected framework is ProM, introduced in Section 2.2.1, as it is the leading open source tool for process mining. Other commercial process mining systems exist, e.g., Futura Reflect, Fluxicon, Comprehend, ARIS Process Performance Manager [12], but ProM contains many plugins that allow effective process mining discovery and analysis. A subset of these plugins is chosen for this project. In addition to the OLAP database, we also use a classical relational database to store event data, which is only used for event log reconstruction. There is a vast array of possibilities when it comes to available relational database systems, e.g., Oracle Database, Microsoft SQL Server, MySQL, IBM DB2, SAP Sybase, just to name a few. As there are no special benefits of using one relational database over another, in our project we choose MySQL, as it is one of the most widely used database systems in the world.

For OLAP, on the other hand, it is difficult to make an immediate decision with respect to the tool selection. There are multiple technologies available, which vary in terms of the used database type, e.g., classical relational, multidimensional, hybrid; the storage location, e.g., in-memory or on-disk; the storage method, e.g., column-based or row-based databases; the way data relationships are kept, e.g., matrix or non-matrix (polynomial) databases; and so on. Therefore, in this chapter, the different OLAP tools and their characteristics are further detailed, together with the corresponding advantages and disadvantages. Finally, a single OLAP system is selected for our application.

4.1 Existing OLAP Open Source Tools

For a potential OLAP tool to be used in this project, supporting conventional OLAP functionality is not sufficient. Several requirements were listed in Section 3.3. Of those, two are particularly important when choosing an external OLAP tool. The tool has to be open source, to allow changes in its functionality, and should provide support for further Java development, to enable the integration of ProM (which is written in Java) and OLAP capabilities on a single platform. OLAP tools can be split into OLAP servers and OLAP clients. OLAP clients are the user interfaces to the OLAP servers.

Even though open source OLAP servers and clients are not as powerful as commercial solutions [49], they encourage community-based development by being free to use and modify. In our case, when integrating process mining solutions into OLAP technology, we expect to encounter differences with existing functionality. Therefore, in this project, an open source tool which allows adding new solutions is preferred over a more “powerful”, but non-extensible, commercial tool.

To provide an overview of the existing OLAP open source tools, we refer to the following sources [1, 27, 28, 48, 49, 50]. Of those, [1, 49, 50] contain the work of Thomsen and Pedersen, and include a periodic survey of open source tools for business intelligence. The first survey [49], published in 2005, refers to three OLAP servers, Bee, Lemur and Mondrian, and two OLAP clients, Bee and JPivot, which were the only ones implemented at the time. In the survey from 2011 [1], only two OLAP servers are presented, Mondrian and Palo. That is because the Bee and Lemur servers were discontinued and a new OLAP server, Palo, was created. In [28], we find the same Mondrian and Palo OLAP servers mentioned. By 2011, there were already several OLAP clients available, e.g., JPalo, JPivot, JRubik, FreeAnalysis, JMagallanes OLAP & Reports. There are also several integrated BI suites. Both [27] and [50] refer to the Jaspersoft BI Suite, Pentaho and SpagoBI. All these BI suites use the Mondrian OLAP engine and the JPivot OLAP client graphical interface. Recently, the Palo BI Suite was released, which works with the Palo multidimensional OLAP server and the Palo for Excel client.

As every OLAP client uses a specific OLAP server, selecting an OLAP server automatically narrows the client choice. In the following, we offer a summary of the two previously introduced OLAP servers, Mondrian and Palo. These servers are quite different from each other, mainly because they use different types of databases to store the data. The first one, Mondrian, stores data in relational databases and is therefore called a ROLAP server; the other, Palo, stores data in multidimensional databases and is therefore considered a MOLAP server.

4.2 Advantages & Disadvantages

The storage engine used, ROLAP or MOLAP, has a considerable influence on the characteristics of an OLAP server, e.g., implementation design and methods, query mechanisms, performance. Therefore, we start this section with a discussion of ROLAP and MOLAP engines. Then, we emphasize the advantages and disadvantages of the Mondrian and Palo OLAP servers by comparing and contrasting their characteristics, e.g., performance, scalability, flexibility.

The major advantage of ROLAP is that relational database technology is well standardized, e.g., SQL2, and is readily available off-the-shelf [17]. The disadvantage is that the query language is not powerful and flexible enough to support true OLAP capabilities [51]. The multidimensional model and its operations have to be mapped onto relations and SQL queries [19].

The main advantage of MOLAP is that its model closely matches the multidimensional model, allowing for powerful and flexible queries in terms of OLAP processing [17]. In general, the main disadvantage of MOLAP is that no real standard for MOLAP exists. Moreover, in particular situations, different problems can occur, e.g., scalability issues when it comes to very large databases and sparsity issues for sparse data.

In [21], Colliat deems multidimensional databases to be several orders of magnitude faster than relational databases in terms of both data retrieval and calculation. MOLAP servers have faster access times than ROLAP servers because data is partitioned and stored in dimensions, which allows retrieving the data corresponding to any combination of dimension members with a single I/O. In a ROLAP server, on the other hand, due to intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of sequential processing and column aggregation), performance bottlenecks are common [18].

Generally, MOLAP provides more space-efficient storage, as data is kept in dimensions and a dimension may correspond to multiple data values. However, this does not hold for sparse data, where data values are missing for the majority of member combinations.

ROLAP systems work better with non-aggregate data, and aggregate data management comes at a high cost. MOLAP, on the other hand, works better with aggregate data. This is to be expected, considering the table-based structure of a relational database versus the structure of a multidimensional database, which is organized in dimensions and has a built-in hierarchy.

An advantage of ROLAP is that it is immune to sparse data, i.e., sparsity influences neither its performance nor its storage efficiency. On the other hand, sparsity is a limitation for MOLAP servers, which can hinder some of their benefits considerably. For example, a sparse MOLAP does not provide space-efficient storage and runs into considerable performance issues. Therefore, MOLAP servers typically include provisions for handling sparse arrays. For example, the sparsity problem is known to be solved in the commercial Essbase multidimensional database management system by adjusting the structure of the MOLAP server to handle sparse and dense dimensions separately.

Now that the advantages and disadvantages of the employed OLAP engines have been presented, in the following we discuss the advantages and disadvantages of the Mondrian and Palo OLAP server tools. Before continuing, we remark that both Mondrian and Palo satisfy the requirement of being compatible with a Java-written system. Mondrian is implemented in Java and offers cross-platform capabilities. As for Palo, the initial Palo MOLAP engine was programmed in C++. However, today various interfaces in VBA, PHP, C++, Java and .NET allow Palo OLAP to be extended.

Performance
Performance is a characteristic where Palo generally outruns Mondrian. First, the Palo MOLAP engine offers faster query response times [19] than the ROLAP engine of Mondrian. Secondly, the in-memory nature of the Palo server improves the speed even further, as in-memory databases are naturally faster than disk-based databases. Nevertheless, even if not as fast as the Palo MOLAP server, the Mondrian ROLAP server is also known to provide acceptable performance [50].

Scalability
The in-memory characteristic is both an advantage (faster data retrieval) and a disadvantage of Palo. A database which is memory-based automatically becomes memory-limited. Undoubtedly, memory capacity grows very quickly, but so does the volume of available data. There are advances made to compensate for the memory need. For example, 3-D stacked in-memories such as the Micron hybrid memory cube are available.¹ Nevertheless, at the moment, scalability is considered an advantage of Mondrian and a disadvantage of Palo.

Flexibility
Both Mondrian and Palo provide different types of flexibility. Being a ROLAP server, Mondrian is more flexible regarding cube redefinition and provides better support for frequent updates [43]. On the other hand, the in-memory database of Palo does not require indexes, recalculation or pre-aggregations. As analysis is possible to a detailed level without any preprocessing [28], Palo is more flexible in that sense.

¹ http://www.edn.com/design/integrated-circuit-design/4402995/More-than-Moore-memory-grows-up

4.3 Palo - Motivation of Choice

Considering all the features of Mondrian and Palo presented in Section 4.2, it can be noticed that, in general, the advantages of one technology are the disadvantages of the other. Moreover, both Mondrian and Palo satisfy the requirements from Section 3.3, e.g., open source, Java-compatible, with OLAP capabilities. Consequently, either of the two OLAP servers could be used in this master project. We choose the Palo in-memory multidimensional OLAP server, and in the following we motivate our choice.

First, we adopt the Palo technology because we want to explore new and innovative technologies. Mondrian stores data in relational databases. Relational databases are simple and powerful solutions, but they have already been used for decades. Palo stores data in a multidimensional in-memory database. Both multidimensional OLAP and in-memory technologies are relatively new compared to relational databases. Being still in their infancy, they provide various research challenges which are interesting to explore.

Secondly, we believe that the Palo technologies have a real future perspective. With decreasing memory prices and the growth of available memory capacity, there are real chances that in-memory databases will be used more often. Moreover, there are promising performance results recorded for MOLAP engines. While different techniques are employed to speed up relational query processing (e.g., index structures, data partitioning), there is not much that can be done to further improve ROLAP performance. On the other hand, we see Palo as a technology with the potential to develop performance-wise.

All in all, we choose Palo because it uses new technology and has real chances to grow in the future. Since the JPalo client is the only client that uses the Palo MOLAP server, JPalo is the OLAP client choice for this project.


Chapter 5

Implementation

In the previous chapter we discussed the storage technologies to be used and we motivated the use of Palo. In this chapter, we describe our implementation using Palo, ProM and MySQL capabilities. We start by describing the system components and the way they are interconnected. Then, we focus on three main aspects:

• Storing the event data in the process cube.

• Preparing the process cube for analysis purposes, e.g., by filtering on dimensions.

• Comparing process cells by visualizing the corresponding process mining results.

5.1 Architectural Model

Figure 5.1: The PROCUBE System. It contains components, external parties and the corresponding communications between both internal and external elements of the system.


As explained in Section 3.3, our implementation is integrated in ProM, i.e., our application runs as a ProM plugin. The implemented plugin is called the PROCUBE plugin. Together with Palo and MySQL, the PROCUBE plugin forms the PROCUBE system. In this section we describe the architecture of the PROCUBE system. The main components of the PROCUBE system, together with the external parties and the way they communicate with each other, are shown in Figure 5.1. The system interacts with three external tools: ProM, MySQL and Palo. ProM is the host framework of the system, since the PROCUBE application runs as a plugin in ProM. The relational database of MySQL is used to store data from the event logs that is not relevant for multidimensional processing. Palo is employed for its OLAP capabilities. It is composed of two main parts, the Palo Server and the Palo Client. While no changes are made to the Palo Server in this project, the Palo Client is adjusted to allow operations on event data. The Palo Server comes with an in-memory multidimensional database, for storage purposes, and an OLAP cube, built on top of the database, suited to support OLAP functionality.

The flow of the event data in the system starts with the loading of an event log. This function is performed by the Load component. Its role is to pre-process the incoming event data from an event log and to load it into the MySQL and Palo databases in such a way that it is properly stored and ready for further use. Also at loading time, the Palo cube is created from the event data residing in the Palo database.

Immediately after loading, the process cube can be used to recreate the initially loaded event log. However, there is no benefit from having merely this functionality. As such, the system also contains a Filtering component. Its purpose is to perform various filtering operations on the process cube such that the different perspectives of the cube can be inspected. Note that filtering is used to extract parts of the process cube and not to modify its structure. Filtering is based on the traditional OLAP operations: slice, dice, roll-up and drill-down. Besides filtering, pivoting is another useful OLAP operation that is employed. It allows rotating the cube to visualize it from a different angle.

Once created, the filtered parts of the process cube are used to unload the corresponding event data, from which an event log is then materialized. The Unload component is responsible for taking the required data from both the relational and the in-memory database and creating an event log out of it. The resulting event log is given as input to a ProM plugin. The output is a process mining result that can be visualized. Not all existing ProM plugins are considered; a representative list of ProM plugins is selected for this purpose.

Finally, a GUI component was created specifically to show different process mining results simultaneously. The advantage of such a component is that it facilitates the comparison of multiple process mining results by placing them next to each other.

5.2 Event Storage

The simplest and most intuitive way to store event data in a process cube is by selecting all the attributes in the event log as dimensions. To guarantee that an event is unique in terms of its dimensions, an event id is assigned to each event. The same holds for cases: a case id is assigned to each case. Both the event id and the case id are considered as dimensions. Even though this approach is the easiest one, in many cases it can create considerable problems with respect to both storage space and performance. This is because such a way of storing event data leads to extreme sparsity in the process cube.

There are two possible ways to cope with the sparsity problem. The first solution is to reduce the number of dimensions. By reducing the number of dimensions, only a subset of the entire set of attributes is selected to form the dimensions. Consequently, the problem appears of where and how to store the rest of the event and case attributes. Moreover, events are no longer uniquely identified by dimensions, which implies having more than one event corresponding to a cell. An immediate solution is to save the rest of the event data in the process cell. The difficulty with this approach is that the Palo Server, like other OLAP servers, allows only a limited number of characters per cell. In the case of Palo, the limit is 255. Moreover, today's OLAP servers work with numerical values rather than with text. This limitation forces us to look for a new solution.

Figure 5.2: Event storage. Numbers represent cell ids and indicate the existence of a cell with a corresponding set of events.

The solution we applied consists of giving a unique identifier to each cell and saving the rest of the event data corresponding to the cell in a relational database. Figure 5.2 illustrates the approach. On the left-hand side, a cube consisting of three dimensions (task, timestamp and last phase) is shown. The numbers in cells, e.g., 6, 7, 10, 11, represent cell ids. On the right-hand side, there is a table with case and event properties. This table is actually saved in the relational database. A row of the table stores the data corresponding to an event. The cell id is a column in the table, and it indicates which event corresponds to which cell. For example, for the cell with id 11, three events, namely 27, 28 and 29, can be identified in the table. For each of these events, the properties that are not among the dimensions of the process cube are stored in the relational database.
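As a rough illustration of this scheme, the following JDBC snippet creates such a table and inserts one row. The table and column names, the connection URL, the credentials and the attribute values are assumptions for illustration only; only the mapping of event 27 to the cell with id 11 comes from the example of Figure 5.2, and the actual PROCUBE schema may differ.

import java.sql.*;

// Illustrative sketch of the relational half of the storage scheme: event
// properties that are not process cube dimensions are stored per event,
// linked to the cube through a cell id column.
public class EventStore {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/procube", "user", "password");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS event_properties ("
                     + "event_id INT PRIMARY KEY, "
                     + "case_id INT, "
                     + "cell_id INT, "    // links the row to a process cube cell
                     + "ts DATETIME)");   // full timestamp; only its hierarchy
                                          // levels live in the in-memory cube
            // Event 27 corresponds to the cell with id 11 (other values made up).
            st.executeUpdate("INSERT INTO event_properties VALUES "
                     + "(27, 5, 11, '2012-03-08 12:31:15')");
        }
    }
}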

The solution presented above does not fully guarantee that sparsity is sufficiently limited. For instance, if the dimensions stored in the in-memory multidimensional database are all sparse, i.e., contain a large number of members that hardly repeat in the log, then the sparsity problem is still present. Examples of sparse dimensions are the event id, because there is one member for each new event, and the timestamp, since almost every event can have a unique timestamp. Therefore, the second solution consists of reducing the number of elements per dimension.

The Palo Server, like other multidimensional OLAP servers, offers a very useful feature called hierarchies. That is, the members of a dimension can be organized hierarchically. An event log can contain different types of attributes: binary, numerical, time, categorical, etc. For time attributes, there is a natural built-in hierarchy that can be employed directly, e.g., year → month → day of week. For example, the timestamp 2012-02-21T11:52:13 belongs to the year 2012, the month 2012Feb and the day of week 2012FebTue. Hierarchies can be used to reduce the number of members per dimension. For the time example, only the year, month and day of week can be stored in-memory, while the actual timestamp can be saved in the relational database. For the rest of the attributes, it is also possible to construct hierarchies, but this is not as straightforward as for time attributes. That is, to obtain a meaningful hierarchy for a set of categorical attribute values, applying clustering and classification techniques would be useful.
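A small sketch of how such hierarchy members can be derived from a timestamp is given below. The member naming follows the examples in this thesis (2012, 2012Feb, 2012FebTue); this is an illustration, not necessarily how the PROCUBE plugin implements it.

import java.time.LocalDateTime;
import java.time.format.TextStyle;
import java.util.Locale;

// Sketch: deriving the year -> month -> day-of-week hierarchy members
// of the process cube from a full XES timestamp.
public class TimeHierarchy {
    static String[] members(LocalDateTime ts) {
        String year = String.valueOf(ts.getYear());
        String month = year + ts.getMonth().getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
        String day = month + ts.getDayOfWeek().getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
        return new String[] { year, month, day };   // e.g., 2012, 2012Feb, 2012FebTue
    }

    public static void main(String[] args) {
        // 2012-02-21T11:52:13 belongs to 2012 -> 2012Feb -> 2012FebTue.
        for (String m : members(LocalDateTime.parse("2012-02-21T11:52:13")))
            System.out.println(m);
    }
}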

The time hierarchy is implemented in our project for any dimension which contains elements of date type. For the rest of the attributes, no hierarchy is established, since this is not easy to solve in a generic way. As a consequence, even though solutions to limit sparsity were applied, the sparsity problem can still occur, should the user select some sparse non-time dimensions to be stored in the multidimensional database.


5.3 Load/Unload of the Database

In Section 2.2.1, the XES meta-model was presented. Of all the elements of the XES structure, attributes are the most relevant when employing a multidimensional structure for analysis. Case attributes and event attributes are used to create the dimensions of a hypercube together with their corresponding members. Therefore, they have to be loaded into the Palo in-memory database such that they can be easily accessed for the process cube creation. As discussed in the previous section, due to sparsity issues, the user is asked to decide upon a smaller set of attributes to be used as dimensions in the process cube. The rest of the attributes are stored in the relational database (RDB), as explained in Section 5.2.

Besides traces, events and their corresponding attributes, the log also keeps information regarding the classifiers, the extensions and the global attributes. Even though unnecessary for OLAP operations, these elements are indispensable for the event log reconstruction. Therefore, they are stored separately in RDB tables and used later for unloading purposes.

The loading of an event log into the databases consists of two steps. First, a special tree structure is created from the event data to facilitate the construction of the process cube. Secondly, the created structure is used for building the process cube and storing parts of the event data in the RDB in an easy-to-access manner. We use pseudocode to present both steps.

Algorithm Parsing(log)
1.  ▷ log gives the event log from the file
2.  Create a log id that uniquely identifies the log
3.  Create tables in the RDB, with the attributes of the log, the classifiers, the extensions and the globals
4.  ▷ rootNode is the root node of a tree structure
5.  ▷ eventCoordinates is a list of attribute values for all events in the log
6.  Determine the number of traces in the log (nt)
7.  for i ← 1 to nt
8.      do traces[i] ← log.getTraces();
9.         rootNode.addNodes(traces[i].getAttributes());
10.        Determine the number of events in traces[i] (ne)
11.        for j ← 1 to ne
12.            do eventCoordinates ← NULL;
13.               events[j] ← traces[i].getEvents();
14.               rootNode.addNodes(events[j].getAttributes());
15.               eventCoordinates.setEvent(log id, traces[i].getAttributes(),
16.                   events[j].getAttributes());
17.               j ← j + 1;
18.     i ← i + 1;
19. return rootNode, eventCoordinates

In the first step, the classifiers, the extensions and the global attributes are extracted from the XES log structure and loaded into RDB tables. In that sense, a log id is assigned to the log and is used to distinguish the classifiers, extensions and global attributes of this log from those of other already existing or yet to be created logs. Traces and events with their attributes are added to a tree structure with the rootNode as the root element of the tree. The rootNode contains all the links of the tree. Nodes are added to the tree structure in the following way: the first hierarchical level of the tree presents the properties of cases and events, the next level contains the values of the properties. Other hierarchical levels are also possible. In this project, we implemented hierarchies for time attributes. As such, in the case of time attributes, years, months and days of the week form the levels of the tree.

Besides the rootNode, a set of event coordinates is determined for each event, on lines 15-16 of the Parsing algorithm. Event coordinates give all the information needed to place an event back in an event log. Since an event is part of a trace and a trace belongs to a log, trace and log information is also included in the event coordinates. Consequently, event coordinates are composed of the log id, the trace id with the corresponding trace attributes, and the event id with the event attributes.
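For readers more comfortable with code, the tree construction of the Parsing step can be approximated with OpenXES (the XES library used by ProM) roughly as follows. This is a simplified sketch under stated assumptions, not the PROCUBE implementation: TreeNode is a hypothetical helper class, the file name is illustrative, and error handling is omitted.

import java.io.FileInputStream;
import org.deckfour.xes.in.XesXmlParser;
import org.deckfour.xes.model.*;

// Simplified sketch of the Parsing step: walk all traces and events and
// register attribute keys (first tree level) and values (second level).
public class ParseSketch {
    public static void main(String[] args) throws Exception {
        XLog log = new XesXmlParser().parse(new FileInputStream("log.xes")).get(0);
        TreeNode root = new TreeNode("root");
        for (XTrace trace : log) {
            for (XAttribute a : trace.getAttributes().values())
                root.child(a.getKey()).child(a.toString());   // property -> value
            for (XEvent event : trace)
                for (XAttribute a : event.getAttributes().values())
                    root.child(a.getKey()).child(a.toString());
        }
    }
}

// Hypothetical helper, not an OpenXES or PROCUBE class.
class TreeNode {
    final String name;
    final java.util.Map<String, TreeNode> children = new java.util.LinkedHashMap<>();
    TreeNode(String name) { this.name = name; }
    TreeNode child(String key) { return children.computeIfAbsent(key, TreeNode::new); }
}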

Algorithm Loading(rootNode, eventCoordinates)
1.  ▷ Create the process cube PC
2.  Determine the number of dimensions nd in the rootNode
3.  Allow the user to select a subset Md of all available dimensions
4.  for each i ∈ Md
5.      do Di ← rootNode.getChildren(i).getLeafs();
6.         if rootNode.getChildren(i) is a time attribute
7.             then Hi ← createHierarchy(rootNode.getAttribute(i));
8.  Create PC with the dimensions Di, i ∈ Md, with unique cell values
9.  Determine the total number of events in the log (nte)
10. for i ← 1 to nte
11.     do k ← 0;
12.        columnValues ← NULL;
13.        for j ← 1 to nd
14.            do if j ∈ Md
15.                   then k ← k + 1;
16.                        mk ← eventCoordinates.getEvent(i).getAttribute(j);
17.                   else columnValues.addAttribute(eventCoordinates.getEvent(i).getAttribute(j));
18.        columnValues.addAttribute(getCell(m1, . . . , mk));
19.        RDB.addRow(columnValues);

Once the rootNode and the eventCoordinates are created, they can be used to build the process cube PC. All the trace and event attributes accessible from the rootNode are potential dimensions of the process cube. Due to sparsity issues, the user is allowed to select a subset of these to be the actual dimensions of the cube. Of course, selecting all the dimensions is also possible. For each of the chosen dimensions, its corresponding member elements and hierarchy are added in lines 5 to 7 of the Loading algorithm. After populating the dimensions with elements, the process cube PC is created based on these dimensions. At this point, the process cube PC has dimensions and elements, but does not have any values in its cells. The eventCoordinates provide both the coordinates of a cell and the set of its corresponding events. In Section 5.2, it was explained that event data cannot be stored directly in a cell, due to cell limitations. Instead, each cell is given a cell id, and the rest of the event data which is not yet saved in the PC can be stored in RDB tables, with the cell id as a column. As such, the members of the PC dimensions are identified in eventCoordinates, line 16, and are used as parameters for the getCell(m1, . . . , mk) function which identifies a cell, line 18. The members that are not among the PC dimension members are added to the RDB together with the cell id, line 19.

Algorithm Unloading(PC)
 1. ▷ log is the event log to be created after unloading
 2. ▷ trace is a trace of the event log
 3. ▷ event is an event of the event log
 4. log ← NULL;
 5. Add all the classifiers, extensions and globals to the log, from the RDB tables
 6. ▷ eventList is a list with the corresponding coordinates of all the events
 7. ▷ attributeList is a list with all the attributes corresponding to an event
 8. Create the eventList from both PC dimensions and RDB columns
 9. Determine the number of events in the eventList (ne)
10. for i ← 1 to ne
11.   do attributeList ← eventList.getEvent(i).getAttributes();
12.      trace ← NULL;
13.      event ← NULL;
14.      Determine the number of attributes in eventList (na)
15.      for j ← 1 to na
16.        do attribute ← attributeList.getAttribute(j);
17.           if attribute is a log attribute
18.             then logAttributes.add(attribute);
19.             else if attribute is a trace attribute
20.               then traceAttributes.add(attribute);
21.               else eventAttributes.add(attribute);
22.      event.addAttributes(eventAttributes);
23.      if logAttributes are in log
24.        then if there is a trace with the traceAttributes in log
25.               ▷ k is the position of the trace in log
26.               then log.getTrace(k).add(event);
27.               else trace.addAttributes(traceAttributes);
28.                    trace.add(event);
29.                    log.add(trace);
30.        else trace.addAttributes(traceAttributes);
31.             trace.add(event);
32.             log.addAttributes(logAttributes);
33.             log.add(trace);
34. return log;

Figure 5.1, presented earlier, shows the basic flow of event data in the system. From the event log, event data is loaded into both the Palo and the MySQL databases, and can be retrieved from them at unloading and used to recreate the initial event log. Even though such functionality does not yet add any value by itself, it can be used to test the correctness of loading and unloading event data into and from the relational and OLAP structures. In what follows, we describe the unloading procedure to complete the scenario.

For the Unloading algorithm presented in this thesis, we consider the complete list of events from the initially loaded event log. Nevertheless, this list can be filtered so that only a subset of all events is considered at unloading. In that case, nothing changes in the pseudocode, except that in line 8 the eventList is created differently, based on the filtering results.

First, the initially NULL log is populated with classifiers, extensions and global attributes from the RDB tables. Then, event data from both the RDB and the Palo OLAP cube is extracted and used to create an eventList structure. The eventList structure is similar to the eventCoordinates structure created in the Parsing algorithm, in the sense that the eventList contains enough information to place events back in event logs. For instance, the event id gives the order of the event in the log. Note that information like the log id, the case id and the event id is discarded when constructing the event log, as it was created at loading and was not initially part of the log.

The eventList contains a list of three types of attributes: log attributes, trace attributes and event attributes. The event attributes, for instance, are used to create an event, as in line 22. The trace attributes are used to create a trace. However, since a trace may correspond to multiple events, we check, in line 24, whether a trace with the same attributes already exists. Then, the created event is added either to the already existing trace or to the newly created trace. A similar test is required when adding the log attributes to the log, to avoid repeating data in the new event log.
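As an illustration of lines 22-28, the following sketch uses the OpenXES API to create an event and attach it to an existing or newly created trace. Comparing attribute maps for equality is a simplification of the test in line 24, not necessarily the actual PROCUBE check.

    import org.deckfour.xes.factory.XFactory;
    import org.deckfour.xes.factory.XFactoryRegistry;
    import org.deckfour.xes.model.XAttributeMap;
    import org.deckfour.xes.model.XEvent;
    import org.deckfour.xes.model.XLog;
    import org.deckfour.xes.model.XTrace;

    class LogRebuilder {
        // Lines 22-28: create the event, then append it to an existing trace
        // with the same attributes, or to a newly created trace otherwise.
        void addEvent(XLog log, XAttributeMap traceAttributes, XAttributeMap eventAttributes) {
            XFactory factory = XFactoryRegistry.instance().currentDefault();
            XEvent event = factory.createEvent(eventAttributes); // line 22
            findOrCreateTrace(log, traceAttributes, factory).add(event);
        }

        XTrace findOrCreateTrace(XLog log, XAttributeMap traceAttributes, XFactory factory) {
            for (XTrace trace : log) { // line 24: does such a trace already exist?
                if (trace.getAttributes().equals(traceAttributes)) {
                    return trace;
                }
            }
            XTrace trace = factory.createTrace(traceAttributes); // lines 27-29
            log.add(trace);
            return trace;
        }
    }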

5.4 Basic Operations on the Database Subsets

Once the event data is loaded into the databases, the question arises what the system can do with it. First, the system benefits from the multidimensional structure of the OLAP cube: inspecting different dimensions of the cube is possible. Moreover, the system supports a set of basic OLAP operations, e.g., slice, dice, drill-down, roll-up and pivoting. Filters can be created that slice or dice the cube in various ways. Default filters exist for the drill-down and roll-up operations and can be applied on request to specific chosen dimensions. Each filter is stored for further use, unless explicitly deleted. Not only can the event data in the cube be filtered, it can also be visualized from different perspectives. This functionality is offered by the pivoting operation.

Figure 5.3: Dice operation. (a) Dice filtering: five elements are selected on the EVENT conceptEXT name dimension. (b) Dice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PCdiced has only 14 events.

5.4.1 Dice & Slice

A dice operation is realized when multiple members are selected for one or more dimensions. Given a process cube PC, the result of a dice is a subcube PCdiced for which only a subset of members is selected on particular dimensions, and which is identical to the initial cube for the remaining dimensions.

Figure 5.3a shows a dice filter applied on the EVENT conceptEXT name dimension. With dice, multiple elements of a dimension can be selected. In Figure 5.3a, five task names are selected and the remaining elements of the EVENT conceptEXT name dimension are discarded. The result of the dice operation is shown in Figure 5.3b. Of the 33 events present in the event log corresponding to the process cube PC, only 14 are considered for PCdiced. The number of cases remains the same.

A dice operation can involve more than one dimension. For example, together with the filter on the EVENT conceptEXT name dimension, a subset of timestamps can be selected on the EVENT TIME timeEXT timestamp dimension. (In the dimension name, the TIME tag is used to recognize a dimension corresponding to a time attribute. Other examples of such dimensions are EVENT TIME dueDate, EVENT TIME plannedDate and EVENT TIME createdDate.) A dice operation allows the selection of any element of the time hierarchy. For example, one can select the years 2012 and 2013 out of a set of years containing 2010, 2011, 2012 and 2013. The month level can also be used for dicing. For instance, selecting the month 2012Feb within 2012 is also a dice, since it selects the following set of elements: 2012FebMon, 2012FebWed and 2012FebThu.
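Conceptually, a dice filter is a selection of members per dimension. The following sketch illustrates this idea; the class is hypothetical and not the Palo filter mechanism, and hierarchy members such as 2012Feb are assumed to be expanded to their leaf members before the check.

    import java.util.Map;
    import java.util.Set;

    // A dice filter: for each filtered dimension, the set of selected members.
    // An event passes only if, on every filtered dimension, its member is
    // among the selected ones.
    class DiceFilter {
        private final Map<String, Set<String>> selectedMembers;

        DiceFilter(Map<String, Set<String>> selectedMembers) {
            this.selectedMembers = selectedMembers;
        }

        boolean accepts(Map<String, String> eventMembers) {
            for (Map.Entry<String, Set<String>> dim : selectedMembers.entrySet()) {
                String member = eventMembers.get(dim.getKey());
                if (member == null || !dim.getValue().contains(member)) {
                    return false; // member not selected on this dimension
                }
            }
            return true;
        }
    }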

For dimensions with numerical members, a dice filter can be created by selecting a certain range. For example, for the SUMLeges dimension, all the events with SUMLeges between 100.5 and 500.2 can be selected.

Figure 5.4: Slice operation. (a) Slice filtering: only a single event name, 01 HOOFD 060, is selected on the EVENT conceptEXT name dimension. (b) Slice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PCsliced has only 2 events.

The slice operation is a particular type of dice. That is, a slice is performed when only a single member of one dimension is selected and all other members of that dimension are filtered out. Given a process cube PC, the result of a slice is a subcube PCsliced with the same dimensions as the cube PC, except for one, for which just a single member is selected out of the initial set of dimension members.

Figure 5.4a shows a slice filter applied on the EVENT conceptEXT name dimension. From all the elements of this dimension, only 01 HOOFD 060 is selected. After creation, the slice filter is saved and, on request, applied to the event data of the process cube. That is, only events with the event name 01 HOOFD 060 are considered for the new PCsliced cube. Figure 5.4b depicts the slice result on the process cube. In the top window, a Log Dialog shows information on the initial event log. Note that the entire event log contains 4 cases and 33 events. The bottom window illustrates a Log Dialog with information on the event log created after slicing. The new event log contains only 2 cases and 2 events. Consequently, there are only 2 events with the name 01 HOOFD 060 and they belong to 2 different cases.

For a dimension with time attributes, a slice can be performed by selecting a leaf member, situated at the day-of-week hierarchical level. For example, for a timestamp dimension containing 2012 at the year level, 2012Feb at the month level and 2012FebTue at the day-of-week level, a slice can be executed by selecting the 2012FebTue element. Note that such a slice filters out all the events except those that occurred on a Tuesday in February 2012, not those on all Tuesdays of 2012 or on all Tuesdays in general.

5.4.2 Pivoting

The subcubes obtained after slice and dice operations can be visualized. In this project, the traditional 2D visualization is used for the process cube. As such, only two dimensions of the process cube can be visualized simultaneously, through the table of visualization. A table of visualization is built from two dimensions of the process cube together with the corresponding filters created by the user. Even though they are based on the elements of two process cube dimensions, the dimensions of visualization are usually not identical to the former. The main difference is that their elements can be both results of filtering and elements of different hierarchical levels. Hence, two neighboring visualization cells can contain overlapping data, while this is never the case for two neighboring cells of the process cube.

The restriction of visualizing only two dimensions at a time has no influence on which two dimensions can be selected. That is, any combination is possible and either of the two dimensions can be substituted with another PC dimension at any time. By swapping one dimension for another, the visualization perspective of the PC cube changes. This operation is known as pivoting, or rotation.

Figure 5.5: The result of the pivoting operation. Rotation is obtained by replacing the concept names dimension with the timestamp dimension, while SUMLeges is replaced by the concept names dimension.

Figure 5.5 shows the effect of the pivoting operation on the visualization table. In the visualization table at the top of the image, SUMLeges and the event names are the two dimensions of visualization. In the second table of visualization, the same process cube is visualized through the event names and the timestamp dimensions. Also, while the event names dimension was initially on the x axis, in the second table it is moved to the y axis.

5.4.3 Drill-down & Roll-up

The drill-down operation is realized by unfolding a member situated at a hierarchically superior position into the set of members at the next lower hierarchical level.

Figure 5.6 shows a table of visualization with one dimension corresponding to the timestamp and another corresponding to the event name. Elements of the timestamp dimension can be selected from a hierarchy. For example, the 2012 member is selected and a drill-down operation is performed on it. As months follow years in the time hierarchy, all the months corresponding to year 2012 are shown. Based on the definition of drill-down from Section 2.3.1, the children of 2012 are added to the timestamp dimension of the table of visualization and the 2012 element is removed. In our project, we also keep the 2012 element, because it is useful to compare process mining results corresponding to elements on different hierarchical levels, e.g., the process of 2012 with the process of 2012Mar.
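The deviation from the textbook definition, keeping the parent member next to its children, can be sketched as follows; this is a hypothetical helper in which members are simplified to strings.

    import java.util.ArrayList;
    import java.util.List;

    // Drill-down as used in this project: the children of the unfolded member
    // are inserted into the shown members, but the parent (e.g. 2012) is kept,
    // so that results on different hierarchical levels can be compared.
    class DrillDown {
        List<String> drillDown(List<String> shownMembers, String parent,
                               List<String> childrenOfParent) {
            List<String> result = new ArrayList<>(shownMembers);
            int pos = result.indexOf(parent);
            if (pos < 0) {
                return result; // parent not shown: nothing to unfold
            }
            // The textbook operation would also do result.remove(pos);
            // here the parent member stays in place.
            result.addAll(pos + 1, childrenOfParent);
            return result;
        }
    }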


Figure 5.6: Drill-down operation on the timestamp dimension. Year 2012 is drilled down to its months.

The roll-up operation is realized by folding certain members of a dimension into one member that is hierarchically superior.

Figure 5.7: Roll-up operation on the timestamp dimension. The months corresponding to year 2012 are folded back.

Figure 5.7 shows a table of visualization with the same timestamp and event name dimensions. Based on the definition of roll-up from Section 2.3.1, the children of 2012 are removed from the timestamp dimension of the table of visualization and the 2012 element is added. In our project, there is no need to add the 2012 element, as it is still present from the drill-down operation.

5.5 Integration with ProM

After filtering and selecting a particular side of the process cube for visualization, the Unloading algorithm presented in Section 5.3 is applied to materialize event logs for the different visualization cells. The resulting event logs are given as input to a ProM plugin. Each ProM plugin has a plugin context object that it requires to run in the ProM framework. Some plugins are impossible to use outside ProM, for example, due to the absence of a specific predefined plugin context. Therefore, to allow more flexibility, our application is adjusted to run inside ProM.

Hundreds of ProM plugins could potentially be used. However, we select only a predefined list of plugins to run in our application. The reason for this is twofold. First, not all of the existing plugins are relevant for the purpose of the PROCUBE tool. One of the objectives is to provide the user a means to visually compare multiple subprocesses. Visual comparison of several subprocesses becomes difficult when each process has a different visual representation. In that sense, plugins that provide immediate visualization results are quite handy. If the user has to make changes to get a specific result, repeating them for each visualization window can become troublesome. For example, the user can miss a step, and then the results that are compared are not the intended ones. Also, any change in one window implies changes in all windows. Naturally, manual changes take time, while automatic changes are impossible, due to the different event data per cell. Another problem is that the graphical space is limited. Running multiple plugins that provide in-depth analysis, e.g., the LTL Checker, in parallel is not very practical, due to space restrictions, while repeating the changes for each individual process is very time consuming. In conclusion, we aim at quick, superficial analysis with immediate results on multiple sublogs rather than time-consuming, in-depth analysis on a single log or very few logs.

Another type of ProM plugins are those created to filter event logs. Since filtering is already implemented in the PROCUBE tool, part of the functionality of these plugins is redundant.

The second reason is that providing a generic way of calling all ProM plugins is difficult to realize. Each plugin has its own specific input and output parameters and also its own methods. A solution for calling all plugins in a generic way would be to create a wrapper that uniformly integrates all ProM plugins. For this project, we focus mainly on plugins that return a JComponent, which can be used directly to display the result. The Alpha Miner, for instance, returns a Petri net object. In that case, the visualization component for the Petri net has to be created first, and only then can the visualization result be shown.
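The wrapper idea could be sketched as a small interface that normalizes every supported plugin to a Swing component; the interface is hypothetical and not part of the ProM API.

    import javax.swing.JComponent;
    import org.deckfour.xes.model.XLog;

    // Hypothetical wrapper interface: every supported plugin is adapted so
    // that it takes a materialized (sub)log and yields a JComponent that can
    // be placed directly into a visualization cell.
    interface VisualPlugin {
        String name();
        JComponent run(XLog sublog);
    }

    // A plugin that natively returns a JComponent is wrapped trivially; a
    // plugin that returns a model object, like the Alpha Miner's Petri net,
    // would first build the model's visualization component inside run().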

Nr.  ProM Plugin
1.   Log Dialog
2.   Dotted Chart
3.   Fuzzy Miner
4.   Heuristics Miner
5.   Working-Together Social Network
6.   Handover-of-Work Social Network
7.   Similar-Task Social Network
8.   Reassignment Social Network
9.   Subcontracting Social Network

Table 5.1: The list of ProM plugins used in the PROCUBE tool.

Moreover, some plugins require going through a sequence of wizard screens to get to the final result. Even when creating a predefined set of parameters to avoid the wizard screens, a new set of parameters is required for each individual plugin. Furthermore, for our project, it is not possible to set the parameters only once, beforehand, and use them for all the visualization cells. That is because the parameters of the initial event log usually do not correspond to those of the sublogs resulting from filtering, as the corresponding event data is different. Hence, for such plugins, following the wizard sequence for each sublog individually is a must. Again, in this case, plugins with immediate results are preferred over those preceded by a sequence of wizard screens.

Derived from all the considerations mentioned above, Table 5.1 provides the list of plugins currently used in our project. The Log Dialog and the Dotted Chart give a panoramic view of the sublog processes. The Heuristics Miner and the Fuzzy Miner are used to discover process models from sublogs. The Social Network plugins provide details on the resource perspective of the sublogs. There is no doubt that plugins such as the Basic Performance analysis and the Conformance Checker would add considerable value to the process analysis and would allow for more extensive use case analysis. Therefore, we suggest adding such plugins as potential further work.

5.6 Result Visualization

The main visualization challenge of the project is to display multiple process mining results at the same time, in an integrated way. The size of the physical screen is the main limiting factor when it comes to displaying multiple windows. Therefore, we apply several solutions to cope with this issue. First of all, we create a new frame, detachable from the main frame, and use it to hold all process mining results. Thus, should two screens be available, the table of visualization can be placed on one screen, while the plugin results are displayed on the second. On this new frame, windows are organized next to each other, in an easy-to-identify way. Even though such a frame layout is already sufficient for visualizing the plugin results, we decided to make some changes, as it lacked the desired flexibility. Hence, replacing the windows with dockable ones, so that they can be moved around, is one of the most important visualization features supported in the project. A large part of the dockable functionality is taken from DockingFrames 1.1.2 (http://dock.javaforge.com/) and adjusted to the project needs.

In the following, we explain the framework of the windows, with details on the layout of the windows frame. Then, we list the frame functionality. Finally, we show the result visualization obtained using the PROCUBE plugin.

Figure 5.8: Dockables functionality. Panels are wrapped into dockables. Dockables are put onto stations which lie on the main frame. As such, dockables can be moved to different stations.

Figure 5.8, taken from [47], shows the framework based on which dockable windows are created. Dockables are not stand-alone windows: they require the support of a main window (the Main-Frame), which is most of the time a JFrame. As long as this frame is visible, so are the rest of the components on it. Non-dockable panels are directly connected to the main frame. Consequently, the main frame can consist of several panels, with different data displayed on them. To support floating panels, however, an additional layer is added between the panels and the main frame. The components of this layer are the so-called Stations. Among their purposes is to allow the user to drag & drop panels and to minimize or maximize windows. A central controller wires all the objects of the framework together. It manages the way elements look and their position in the frame, and it monitors all changes occurring within windows. Further, each panel is wrapped into a dockable. Dockables are the final components and they are the ones that actually offer the floating behaviour.

To display dockables in a certain layout, a Grid component is used. The matrix of the grid gives an organized way of displaying windows on the screen. For our project, the matrix of the grid component corresponds to the matrix of the table of visualization. That is, the plugin results for different cells are shown in the same order as the cells are displayed in the visualization table.
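A minimal sketch of this layout against the DockingFrames common API, assuming the visualization-table matrix gives the (row, column) position of each result panel; identifiers and titles are illustrative, and the actual PROCUBE wiring may differ.

    import javax.swing.JComponent;
    import javax.swing.JFrame;
    import bibliothek.gui.dock.common.CControl;
    import bibliothek.gui.dock.common.CGrid;
    import bibliothek.gui.dock.common.DefaultSingleCDockable;

    class ResultGrid {
        // Dock the plugin-result panels on a grid mirroring the matrix of the
        // table of visualization: the result of cell (row, col) goes to grid
        // position (col, row).
        void showResults(JFrame mainFrame, JComponent[][] results) {
            CControl control = new CControl(mainFrame); // the central controller
            mainFrame.add(control.getContentArea());
            CGrid grid = new CGrid(control);
            for (int row = 0; row < results.length; row++) {
                for (int col = 0; col < results[row].length; col++) {
                    DefaultSingleCDockable dockable = new DefaultSingleCDockable(
                            "cell-" + row + "-" + col,
                            "Cell (" + row + ", " + col + ")",
                            results[row][col]);
                    dockable.setCloseable(true); // unnecessary windows can be closed
                    grid.add(col, row, 1, 1, dockable);
                }
            }
            control.getContentArea().deploy(grid);
        }
    }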

In view of the above approach, the following visualization capabilities are supported:

• Default layout with all the dockables normalized. Normalized dockables are placed on the main visualization frame, in the way cells are displayed in the visualization table.


• Dockables can be maximized. A maximized dockable takes all the space it can, most of the time by covering other dockables.

• Dockables can be minimized. Minimized dockables are not visible right away. They can be restored to a normal state by pressing the minimization button again.

• Dockables can be extended. Once extended, dockables have their own window, independent of the main visualization frame. This functionality is very useful as it allows, for example, moving windows with plugin results to different screens.

• With the drag & drop operation, dockables can be placed on any part of the screen. For example, by dragging one dockable onto the place of another, the two are swapped.

• When multiple plugin results are available for the same visualization cell, each result window becomes a new tab in a tabbed pane. That makes it easy to quickly identify plugin results corresponding to the same visualization cell.

• Unnecessary windows can always be closed.

Figure 5.9: Visualization of plugin results in the PROCUBE tool. Each plugin result is displayed in a dockable window and can be part of a tabbed pane.

Figure 5.9 shows several windows with plugin results. Two Log Dialogs, a Fuzzy Miner, two Heuristics Miners and a Social Network form the visualization results. Multiple tabs can be distinguished, since multiple plugin results exist for the same visualization cell. All the windows are dockable. After undocking a window, the remaining windows are automatically rearranged on the screen.


Chapter 6

Case Study and Benchmarking

In the previous chapter, the implementation of the process cube was described as a combination of external technology (Palo, MySQL, ProM) and newly introduced process-cube-related features. We now continue with an evaluation of the functionality on different event logs and an assessment of the PROCUBE system performance. The results presented in this chapter are based on the event data of an artificial digital photo copier event log and of a Dutch municipality event log.

6.1 Evaluation of Functionality

In this section we choose both a synthetic and a real-life event log to ascertain the capabilities of the PROCUBE system. The functionality that is evaluated comprises loading an event log into relational and in-memory databases, executing OLAP operations on the process cube, unloading an event log from the databases, generating ProM results based on the event log and visualizing the ProM results.

6.1.1 Synthetic Benchmark

The synthetic event log we use in this section is taken from the collection of synthetic event logs found at http://data.3tu.nl/repository/collection:event_logs_synthetic. It is an artificial event log for a simple digital copier, also used as a running example in [33]. The copier is specialized in copying, scanning and printing documents. As such, users can request copy/scan or print services. The standard procedure followed by the copier is image creation, image processing for quality enhancement and then, depending on the request, either printing the image or just sending it to the user. The generation of the image for a print request differs from that for a copy/scan request.

The digital photo copier event log contains 100 process instances, 76 event classes and 40995 events. Traces can be separated, based on their Class attribute, into Print and Copy/Scan. For each event, the name of the activity is given, the lifecycle transition, attesting whether an activity is started or completed, and a timestamp of the recorded activity.

In the following, based on the digital photo copier process described in [33], we select a few scenarios and use them to present the capabilities of the PROCUBE tool.

In Figure 9 from [33], two subprocesses, ‘Interpret’ and ‘Fusing’, are isolated. For our first scenario, the target is to load the entire digital photo copier event log into the databases and filter it in such a way that, after unloading and applying the Fuzzy Miner plugin, the ‘Interpret’ subprocess from Figure 9 in [33] is obtained. At loading, the TRACE Class and the EVENT conceptEXT name attributes are selected as dimensions of the process cube. After loading, we perform a dice operation on the EVENT conceptEXT name dimension of the process cube, by selecting the following subset of elements: Interpretation, Post Script, Unformatted Text and Page Control Language.


Figure 6.1: The ‘Interpret’ subprocess, obtained by dicing the process cube on the task name.

Further, an event log is materialized from the filtered event data and used as a parameter for the Fuzzy Miner plugin. The result is shown in Figure 6.1. The correspondence between our result and the one in [33] is easily noticed.

Figure 6.2: The ‘Interpret’ subprocess with its corresponding branches. The visualization results allow for easy comparison of subprocesses.

For further testing, we consider a second scenario, where the same ‘Interpret’ process is taken, but now the subprocesses of each of the three branches of the ‘Interpret’ process are isolated, by filtering on the task name. Figure 6.2 shows the main visualization frame with four windows. The first window, on top, gives the same ‘Interpret’ process model. The three windows at the bottom illustrate the subprocesses of the three branches of the process. Such visualization results are powerful for larger processes. First of all, multiple filtering results of the same process can be visualized at the same time. After filtering, the initial process is not discarded; it can be reused again and again for filtering purposes. Presenting processes next to each other highlights similarities and differences between them.

Figure 6.3: Zooming in on the first part of the copier process model and on the first parts of its corresponding ‘Print’ and ‘Copy/Scan’ subprocesses.

In the last scenario, the entire copier process model is discovered using the Heuristics Miner plugin. First, two slice operations are performed on the TRACE Class dimension. Their results are used to discover the ‘Print’ and the ‘Copy/Scan’ subprocesses. The resulting process models are quite large, which makes it difficult to visualize them entirely. Therefore, we zoom in on the first part of the processes. By placing all the models in parallel, the paths for the ‘Print’ and ‘Copy/Scan’ subprocesses can be distinguished in the copier process model. One branch of the process starts with the ‘Copy/Scan’, ‘Collect Copy/Scan’ and ‘Place Doc’ activities, corresponding to the ‘Copy/Scan’ subprocess; the other branch starts with the ‘Remote Print’, ‘Read Print’ and ‘Rasterization’ tasks, corresponding to the ‘Print’ subprocess. The same behavior is shown for this part of the process in Figure 7 from [33]. By zooming in on the rest of the subprocesses, their entire behavior can be observed and their control-flows can be compared.

6.1.2 Real-life Log Data Example

For the real-life example, we select one of the event logs of a Dutch municipality, known under the name of WABO1. The WABO1 event log consists of 691 process instances, 254 event classes and 22130 events. The data captures process events from October 2010 till November 2012, with an overall duration of 758 days.

At the case level, the following attributes are available:

• parts attribute: specifies for which building parts the permit is requested: “Bouw” (355 cases), “Sloop” (52 cases), “Kap” (32 cases), etc.

• SUMleges attribute: gives the total cost of a building permit application, e.g., 192.78, 284.55, 1992.06.

• last phase attribute: denotes the outcome of a permit request application. Usually a case finalizes with “Vergunning verleend” (permit granted, in 344 cases) or “Vergunning geweigerd” (permit declined, in 2 cases). However, a number of cases end with “Procedure afgebroken” (procedure aborted, in 74 cases).

• caseStatus attribute: indicates whether a case is still open (“O”) or already closed (“G”). For a closed case, no further objections are possible. However, for an open case, objections can still be expected.

Event attributes give information related to the lifecycle of an event, the resource that executes a task or is responsible for it, and different time characteristics, e.g., the time when a task was created or the time when an event was recorded. The lifecycle of an event comprises only a single transition: complete. That is, all the work items in the event log are completed. There are 19 resources that execute tasks. The majority of the tasks are performed by resource number 560872 (30.764 %).

Figure 6.4: Dotted charts for a process of a Dutch municipality using absolute time. The influx of new cases is rather constant over time (top chart). The influx of new cases is decreasing over time (bottom left chart). For the bottom right chart, no pattern is identified.

Figure 6.4 shows three dotted charts for three of the subprocesses of a Dutch municipality using absolute time. These subprocesses are obtained by slicing the process cube on the TRACE last phase dimension. In all three cases, absolute, real times are used. Moreover, cases are sorted by the time of the first event. The top chart corresponds to the building permit request applications that finalized with granting a permit. For this subprocess, the initial events form an almost straight line. Consequently, there is a close to constant arrival rate of new cases. The bottom left chart corresponds to canceled applications. The dotted chart shows that the influx of incoming new cases that are eventually canceled is decreasing over time. The last chart, in the bottom right part of the image, corresponds to declined cases. Due to the small number of declined applications, it is difficult to identify a pattern in the arrival of such cases.

Figure 6.5 shows three dotted charts for the same three subprocesses using relative time, i.e., all cases start at time zero, with emphasis on the duration of a case. Typically, both approved and canceled cases are handled in 1-2 months, although a large portion of them is finished already after 10-20 days. Nevertheless, there are cases that take up to 1.5 years to complete. For instance, the duration of handling the declined cases is quite large: for one of the cases it takes one year before it is finally rejected. Such behavior is also present for approved and canceled cases, however very sporadically, like exceptions. Since the event data comes from a real-life log, we do not exclude the possibility of recording errors for such cases.


Figure 6.5: Dotted charts for a process of a Dutch municipality using relative time. The duration of handling a building permit request that is eventually approved is typically about 1-2 months. The same is valid for canceled applications. Requests that are declined take a longer time to be handled.

Figure 6.6: Representation of the Working-Together Social Network for resources working on Aanhoudingsgrond van toepassing (AH) type of activities and on Waw-aanvraag buiten behandeling (AWB) type of activities.

Mining social networks is yet another ProM feature supported in the PROCUBE plugin. The social network miners, presented in [9], can be directly applied to the event logs of the subprocesses of a process cube. In this section, we present an example of a Working-Together Social Network for resources in the WABO1 process, working on Aanhoudingsgrond van toepassing (AH) type of activities and on Waw-aanvraag buiten behandeling (AWB) type of activities. In both networks, a cluster of resources working together and several isolated resources can be distinguished. Except for a few isolated resources, i.e., 560589, 560999 and 560950, the AH network contains the same elements as the AWB one. This is not the case when it comes to resource interactions in the working-together clusters. Even though it contains almost the same resources, the corresponding chain of interaction changes. That is, compared to the AWB network, in the AH one only 560912 still works directly with 2670601 and only 3273854 still works directly with 560925. A rather large percentage of the resources involved in the entire process, i.e., 19 resources, is also present in the networks: 84% in the first network and 68% in the second. This indicates that the majority of the resources may not be specialized in a particular type of activity, but rather execute different types of activities depending on the case. Other network graphs and plugins can be used to fully support this statement. Consequently, placing social networks next to each other offers a parallel view of people's interaction within an organization in various situations, e.g., when handling different tasks.

6.2 Performance Analysis

In this section, the performance of the PROCUBE system with respect to the loading and unloading operations is analysed. Clearly, the loading time affects the productivity of the system only once, when the event log data is loaded into the databases, whereas the unloading operation can be performed multiple times, i.e., whenever a process mining technique is applied to the events in the cube (possibly a subcube). The time required by these operations has to be small enough to guarantee adequate user interaction with the tool. In what follows, the PROCUBE tool is subjected to several tests.

Test 1. For the first test, subsets of the WABO1 event log are loaded and unloaded from the database. These subsets contain 160, 338, 687, 1368, 2732, 5505, 11061, and 22130 events. The latter sublog is actually the entire WABO1 event log. The loading and unloading speed is assessed for each sublog in 4 distinct configurations of the in-memory database: 2D, with the dimensions TRACE parts and EVENT timestamp; 3D, which adds EVENT orgEXT resources to the 2D dimensions; 4D, which adds EVENT created to the 3D dimensions; and 5D, which adds TRACE termName to the 4D dimensions. This test illustrates the dependency of the loading and unloading time for a typical selection of dimensions.

Test 2. The second test illustrates the effects of sparse dimensions on the loading and unloading performance. This test is performed on two 2D configurations and follows the methodology from Test 1. The dimensions of these two cubes are summarized in Table 6.1.

Cube           Dimension                Nr. of members
Low sparsity   TRACE termName           12
               EVENT orgEXT resources   20
High sparsity  EVENT taskDescription    73
               EVENT conceptEXT name    692

Table 6.1: Summary of dimensions for the 2D cubes in Test 2.

Test 3. For the last test, the WABO1 event log is split into several non-overlapping sublogs and the total unloading time of these sublogs is compared to the unloading time of the entire WABO1 event log. This test illustrates that the filtering operations and the extraction of sublogs do not incur any additional penalty on the unloading time.

Figure 6.7: Loading times for Test 1. Both axes are logarithmic: loading time in seconds against the number of events, with one curve per configuration (2D, 3D, 4D, 5D).

Test 1

Let us begin by showing the loading times for this test setup in Figure 6.7. Although both scales on the figure axes are logarithmic, it is easy to see that the loading time increases linearly with the number of events in the log. Moreover, the loading time is practically independent of the number of cube dimensions. The latter remark suggests that the loading times per dimension into the relational database and into the in-memory database are about the same, i.e., if one of the dimensions is moved from the relational database to the cube, the loading time does not change. Moreover, loading implies just one constant set of operations per event and is therefore independent of the number of dimensions in the created cube. Of course, the amount of memory used for the cube increases with the number of dimensions.

Figure 6.8: Unloading times for Test 1. Unloading time in seconds against the number of events, with one curve per configuration (2D, 3D, 4D, 5D).

The situation during unloading is completely different, however. The unloading time for the same databases is shown in Figure 6.8. The time spent on unloading the event log from the database increases considerably for larger numbers of cube dimensions. Of course, the unloading time heavily depends on the number of cube cells that have no events corresponding to them. These empty cells do not affect the loading time, but consume memory. The opposite is true during unloading, when each cell has to be verified. Hence, time is spent on empty cells, while these cells do not contribute any information to the resulting log. Generally, the sparsity of a cube increases with the number of dimensions, and so does the number of empty cells. For this particular case study, unloading an event log with 11061 events takes 27 s for a 2D cube and 688 s for a 5D cube, which illustrates a super-linear increase in the unloading time. A similar tendency can be observed with respect to the number of events in the log: the sparsity of the cube appears to increase at a super-linear rate with the number of events as well. These observations can be intuitively explained by two facts. First, all the dependencies in the hyper-cubic structures are multiplicative rather than additive; hence, the sparsity is expected to rise exponentially. Secondly, event logs contain attributes which characterize the events very precisely, e.g., the timestamp or the name of a resource. Obviously, finding two events happening at exactly the same time is, to say the least, very difficult, and hardly any resource is engaged in all activities. Hence, due to this precision of event logs, sparsity is unavoidable when a process cube is constructed and, unfortunately, the unloading time rises exponentially with the number of dimensions and events in typical situations.

Test 2

As mentioned previously, for this test we compare the loading and unloading times of cube configurations with different levels of sparsity.

Figure 6.9: Loading times for Test 2. Loading time in seconds against the number of events, for the non-sparse and the sparse cube.

It can be seen in Figure 6.9 that the loading time does not vary much between the two cubes. The sparser cube appears to take only slightly longer to load. This behavior is expected and was explained with the results of Test 1. The examples from Test 1 show that the unloading time heavily depends on the number of in-memory dimensions and the number of events. However, the unloading time also depends on the sparsity of the cube. The unloading times for the two cube configurations with the same number of events and dimensions but different sparsity are shown in Figure 6.10. Observe that the difference between the unloading times of the higher and lower sparsity cubes for the entire WABO1 event log is more than tenfold.

Figure 6.10: Unloading times for Test 2. Unloading time in seconds against the number of events, for the non-sparse and the sparse cube.

One might expect a larger difference, as the ratio between the numbers of cells in the cubes is actually about 191, i.e., 73 × 629 cells of the sparse cube divided by 12 × 20 cells of the non-sparse cube, where 73, 629, 12 and 20 are the numbers of elements of the dimensions of the cubes. Although all the cells have to be visited while unloading the event log, the hybrid nature of the database prevents a huge increase in the required time. The processing time required for empty cells is considerably lower than for cells with events, i.e., if an empty cell is detected, no query is issued to the relational database and the algorithm jumps to the next cell. Hence, with a 191-fold increase in the number of cells, the overall computational load increases only tenfold.

Test 3

For the purpose of this test, the WABO1 event log with 22130 events was loaded with the following two dimensions: EVENT timestamp and TRACE caseStatus. Furthermore, the drill-down operation is applied along the timestamp dimension.

Cell name        All EVENTS   NO VALUE   2010   2011   2012   SUM
Unload time (s)  61.9         0.001      4.4    32.5   26.3   63.2

Table 6.2: Summary of the unloading times for Test 3.

In Table 6.2 we provide the unloading time for each cell in the visualization table. The column SUM is the sum of all columns except All EVENTS. Observe that the time to unload the entire WABO1 event log from the database is only marginally lower than the cumulative time required for its separate components. This result shows that the filtering operation does not incur any performance penalty on the developed database structure. Applying the same operation on event data stored purely in a relational database would require complex queries and, as such, would slow down the process. Hence, fast filtering along the process cube dimensions is demonstrated here, and it represents a benefit of multidimensional database technologies.

6.3 Discussion

There are three main observations derived from the experimental results.


Observation 1. The loading time of an event log is practically independent of the number of dimensions of analysis. This fact is illustrated in Figure 6.7 and is a result of the loading algorithm. The event log is loaded into the database event by event, and for each event a constant number of operations is performed. Hence, the loading time depends only on the number of events.

Observation 2. The sparsity of the process cube heavily impacts the unloading performance. For the selected cell in the table of visualization, all combinations of the members of the dimensions of analysis which correspond to this cell are computed during the unload. For each combination, it is verified whether the associated process cube cell contains any events. Hence, a fixed amount of time is spent on checking whether a cell is empty, i.e., the cell id is retrieved from the multidimensional database; if the cell id is NULL, the cell is empty and no further actions are performed for this cell. If the cell contains events, additional time is spent on unloading the event data from the relational database. Obviously, checking empty cells negatively impacts the unloading time. This is illustrated by the results of the second test, where with 191 times more cells to verify and the same number of events to unload compared to a normally sparse cube, the unloading time is 10 times larger.
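The check described in Observation 2 can be sketched as follows; the CubeAccess interface is a stand-in for the Palo API, not its actual interface.

    import java.util.List;
    import java.util.function.LongConsumer;

    // Stand-in for the multidimensional database: maps a coordinate (one
    // member per dimension of analysis) to a cell id, or null for an empty
    // cell.
    interface CubeAccess {
        Long getCellId(List<String> coordinate);
    }

    // The unload loop of Observation 2: a fixed-cost check per coordinate;
    // only non-empty cells trigger a query to the relational database.
    class CellUnloader {
        void unloadCells(CubeAccess cube, List<List<String>> coordinates,
                         LongConsumer unloadFromRdb) {
            for (List<String> coordinate : coordinates) {
                Long cellId = cube.getCellId(coordinate);
                if (cellId == null) {
                    continue; // empty cell: no RDB query, jump to the next cell
                }
                unloadFromRdb.accept(cellId); // fetch this cell's events from the RDB
            }
        }
    }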

Observation 3. Manually splitting and analysing sparse dimensions, e.g., dimensions with several hundred members, would be very time consuming and would probably overload the user. Realistically, only dimensions with at most 20 members are fit to be included in the process cube structure. Selecting such dimensions ensures low sparsity of the resulting process cube and results in good responsiveness of the developed tool. Test 1 was based on a typical selection of analysis dimensions and, therefore, its results characterize the operation speed of the tool in case of regular sparsity. Moreover, it was observed that the developed tool, including the processing step, e.g., the Log Dialog, delivers its result within 10 s for event logs smaller than 2000 events and process cubes with about 3 to 4 normally sparse dimensions of analysis. This performance is respectable and makes the tool applicable to different processes. Moreover, the main focus of the tool is to compare selected parts of the event log; thus, in typical situations only small sections of the process cube will be unloaded for comparison. Test 3 shows that the unloading time reduces when only a part of the cube is unloaded, which means that for 2000 events and 4 analysis dimensions, the average time of an operation will be far lower than 10 s. Furthermore, even if the entire cube is split into subcubes and all these subcubes are unloaded simultaneously, no performance penalty will occur, i.e., all subcubes will be processed within 10 s.


Chapter 7

Conclusions & Future Work

7.1 Summary of Contributions

This master thesis builds on the ideas presented in the PROCUBE project proposal [4]. The proposal suggests organizing event data from logs in process cubes in such a way that discovery, analysis and comparison of multiple processes are possible. The main goal of this master project was to build a framework supporting process cube exploration. The goal was achieved by following a series of steps, which the thesis describes in detail.

We started by identifying the problem context. The role of business intelligence, and of process mining in particular, in the functionality and performance of enterprise information systems was investigated. Further, the reader was introduced to the business intelligence area, with emphasis on process mining and OLAP technologies. As concepts from both process mining and OLAP were repeatedly employed throughout the thesis, a formalization was given for all the adherent notions. The formalization of OLAP and of the process-cube-related notions is one of the contributions of this thesis. Further elaboration and formalization of the process cube concept can be found in [6].

The next step in the project was to describe the central element of the project, the process cube. Process cubes realize the link between the process mining framework and existing OLAP technology. While process mining focuses on process analysis, OLAP technology is used for its built-in hypercube structures allowing for operations like slice, dice, roll-up, drill-down and pivoting. As such, process cubes are defined by introducing the event-related aspects into the formalization of OLAP cubes. Along with the process cube formalization, an example was presented to illustrate the process cube capabilities. This stage of the project was an important one, as it helped in establishing and clarifying the process cube functionality before its actual implementation.

Since databases, OLAP and process mining tools already exist, we decided to reuse current technologies to save time. Choosing a framework for process mining was easy, as ProM is clearly the leading open source framework and expertise is readily available at TU/e. Selecting a suitable OLAP technology was not as straightforward, though, because the applied methods and principles vary quite a lot from OLAP tool to tool. Finally, we selected the Palo in-memory multidimensional OLAP database. In-memory tools are known for their increased speed. Moreover, unlike relational databases, multidimensional databases already have the built-in multidimensional structure that is natural for OLAP cubes and, therefore, facilitates OLAP analysis. Being relatively new, this technology is still undergoing a lot of changes and improvements. Nevertheless, it is deemed to have a bright future, especially because of its current and envisioned performance benefits.

The main contribution of the thesis is the creation of a basic prototype supporting the notion of a process cube in a process mining context, with the following functionality: XES event logs are introduced as data sources for OLAP applications; the OLAP process cube is created from event data; the cube can be visualized from different perspectives; and one can “play” with the cube before starting the analysis, by applying different OLAP operations. One of the challenges we encountered after finishing the application was that the MOLAP performance worsened with increasing sparsity of the loaded data. We were aware of the sparsity problem from the very beginning; however, we did not expect such poor performance results. One potential explanation is that we used an open source version of Palo from 2011, which might not include the latest performance improvements found in the commercial tool. Moreover, sparsity is still an open issue for many multidimensional tools. Only Essbase is known to provide a solution to this problem at the moment, but it is not open source. We hope that Palo will also release a new version with the sparsity problem solved. In the meantime, we offer an interim solution to improve the performance for sparse data.

The solution we provided for dealing with sparsity was to replace the in-memory database with a hybrid structure that stores part of the event data in memory and the other part in a relational database. The advantage of such a strategy is that it reduces the number of dimensions in the cube and thus makes it less sparse. The limitation is that only a part of the event data can be used for filtering purposes. Furthermore, we reduced the number of elements per dimension by implementing the hierarchy feature for time data. By allowing time data to be stored in a hierarchical structure, the sparsity of some very sparse dimensions, like the timestamp, is reduced considerably.

Finally, we tested the PROCUBE system to determine its capabilities. The information stored in event logs is inherently multidimensional and, as such, efficient application of process mining tools requires multidimensional filtering of the event database. Multidimensional, and in particular in-memory, database technology is developed for exactly that purpose. However, the performed tests show that event logs generally result in sparse multidimensional database structures, which incurs severe performance penalties when unloading parts of the event log for further processing. The proposed hybridization of the database structure, i.e., keeping only strictly necessary dimensions in memory and the rest in a relational database, makes an efficient trade-off between the flexibility of the complete process cube and the responsiveness of the user interaction. Nevertheless, a complete understanding of the sparsity concept is required for efficient use of the developed tool, as only a limited number of dimensions, e.g., up to 4D for the WABO1 event log, can be used for on-line analysis.

7.2 Limitations

In this section we describe two types of limitations of the in-memory multidimensional OLAP process cube approach. First, limitations at the conceptual level are presented, followed by implementation limitations.

7.2.1 Conceptual Level

Cell Number Explosion Problem
The cell number explosion problem, also known as sparsity, is common for multidimensional structures, where it is not possible to store data in a compact way, resulting in a large number of missing values at the intersections of dimensions. As such, a process cube exceeding a certain number of dimensions, with a large number of elements per dimension and with a lot of missing cell values, leads to sparsity problems and high execution times for analysis.

Visualization Limitations
In the following, we present two types of limitations related to the visualization of process mining results. The first is related to the difficulty of visualizing hypercube structures, while the second is related to the difficulty of visualizing multiple cell results.

Generally, the visualization of hypercube structures is not an easy task. On one hand, multidimensionality is not the natural way in which people visualize. On the other hand, there are hardly any tools that provide multidimensional visualizations of more than three dimensions. In our case, we visualize only two dimensions of the process cube at a time. This is a simple, yet powerful visualization that allows efficient visual comparison of cell results. The only caveat is that the growth of the number of compared cells can become an issue. Fitting multiple results on a single screen can impair the visualization of the results, thus impeding the comparison between cells. This issue becomes even worse in the case of large results. In the process mining area, the curse of dimensionality problem is well known: large and complex models are usually unreadable. Visual comparison of such models is not supported in this project, and it remains a research problem in the area.

7.2.2 Implementation Level

Filtering on a Subset of Attributes
The hybrid approach adopted in this project, storing event data in both in-memory and relational databases, resulted in considerable performance gains. However, it lacks flexibility with respect to the log filtering possibilities and to changing the dimensions of the cube. That is, the user is allowed to select a subset of attributes to be considered as dimensions in the process cube, while the rest of the attributes and other log information are stored in the relational database. Selecting only a subset of attributes limits the log filtering possibilities. Moreover, changing one dimension of the cube implies creating a new process cube, by selecting all the dimensions again.

Limited Set of Supported Plugins
The PROCUBE plugin uses only a limited set of ProM plugins to obtain process mining results. There are two reasons for this limitation. First, not all existing ProM plugins are suitable for visual comparison of multiple subprocesses. The PROCUBE tool is designed to work with plugins that provide quick, direct process mining results. Secondly, there are plugins that cannot be used without going through a sequence of wizards, which is problematic in the PROCUBE setting, as this procedure would have to be repeated for each process cell individually.

Performance Issues for Sparse Dimensions
Our methods are oriented towards reducing the number of sparse dimensions and the sparsity within dimensions. Still, if the user selects all the attributes for creating cube dimensions and there are sparse dimensions among them, the unloading of event data becomes very slow.

7.3 Further Research

The process cube notion offers a wide range of new research questions and challenges. We will not enumerate them in this section. Instead, we give some points of reference for improving and extending the current approach.

Data Mining for the Construction of Hierarchies
Hierarchies are one of the most powerful elements of OLAP structures. In our tool, the hierarchy feature is supported only for dimensions with time values. However, meaningful hierarchical structures can also be constructed for other types of dimensions. Machine learning techniques, e.g., hierarchical clustering, can be applied to obtain clusters of dimension elements that can be used to create a hierarchy. Moreover, data mining techniques can be used to combine elements of multiple dimensions into a single dimension. That can be accomplished by a meaningful partitioning of the elements; algorithms for partitioning, for instance, large categorical data exist [35].

Reuse of Precomputed Models
Knowledge of the discovered processes can be reused by storing this precomputed information, instead of only creating models on-the-fly. Since producing large models on-the-fly takes time, performance can be improved by saving parts of the created models, or aggregates of entire models, for further reuse.

Further Visualization Improvement
The visualization proposed in this thesis is based on the simple, traditional 2D visualization. Undoubtedly, more advanced visualization techniques can be found, with the advantage of being more representative for analysis and more user-friendly. An example is the icicle plot construction [32], which can be used to enhance the hierarchical representation of dimensions and facilitate the comparison between two subprocesses.


Bibliography

[1] A Survey of Open Source Tools for Business Intelligence. In David Taniar and Li Chen, editors, Integrations of Data Warehousing, Data Mining and Database Technologies, pages 237–257. Information Science Reference, 2011.

[2] Business Processing Intelligence Challenge (BPIC). In 8th International Workshop on Business Process Intelligence, 2012.

[3] W. M. P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011.

[4] W. M. P. van der Aalst. Mining Process Cubes from Event Data (PROCUBE), project proposal (under review). 2012.

[5] W. M. P. van der Aalst. Process Mining: Making Knowledge Discovery Process Centric. SIGKDD Explorations Newsletter, 13(2):45–49, 2012.

[6] W. M. P. van der Aalst. Process Cubes: Slicing, Dicing, Rolling Up and Drilling Down Event Data for Process Mining. In M. Song, M. Wynn, and J. Liu, editors, Asia Pacific Conference on Business Process Management (AP-BPM 2013), Lecture Notes in Business Information Processing, 2013.

[7] W. M. P. van der Aalst, A. Adriansyah, A. K. A. de Medeiros, F. Arcieri, T. Baier, T. Blickle, R. P. Jagadeesh Chandra Bose, P. van den Brand, R. Brandtjen, J. C. A. M. Buijs, A. Burattin, J. Carmona, M. Castellanos, J. Claes, J. Cook, N. Costantini, F. Curbera, E. Damiani, M. de Leoni, P. Delias, B. F. van Dongen, M. Dumas, S. Dustdar, D. Fahland, D. R. Ferreira, W. Gaaloul, F. van Geffen, S. Goel, C. W. Günther, A. Guzzo, P. Harmon, A. H. M. ter Hofstede, J. Hoogland, J. Espen Ingvaldsen, K. Kato, R. Kuhn, A. Kumar, M. La Rosa, F. Maggi, D. Malerba, R. S. Mans, A. Manuel, M. McCreesh, P. Mello, J. Mendling, M. Montali, H. Motahari Nezhad, M. zur Muehlen, J. Munoz-Gama, L. Pontieri, J. Ribeiro, A. Rozinat, H. Seguel Pérez, R. Seguel Pérez, M. Sepúlveda, J. Sinur, P. Soffer, M. S. Song, A. Sperduti, G. Stilo, C. Stoel, K. Swenson, M. Talamo, W. Tan, C. Turner, J. Vanthienen, G. Varvaressos, H. M. W. Verbeek, M. Verdonk, R. Vigo, J. Wang, B. Weber, M. Weidlich, A. J. M. M. Weijters, L. Wen, M. Westergaard, and M. T. Wynn. Process Mining Manifesto. In BPM 2011 Workshops, Part I.

[8] W. M. P. van der Aalst, M. Pesic, and M. Song. Beyond Process Mining: From the Past to Present and Future. In Proceedings of the 22nd International Conference on Advanced Information Systems Engineering, CAiSE’10, pages 38–52, 2010.

[9] W. M. P. van der Aalst, H. A. Reijers, and M. Song. Discovering Social Networks from Event Logs. Computer Supported Cooperative Work, 14(6):549–593, 2006.

[10] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the Computation of Multidimensional Aggregates. 1996.


[11] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling Multidimensional Databases. In Proceedings of the Thirteenth International Conference on Data Engineering, ICDE ’97, pages 232–243, 1997.

[12] I.-M. Ailenei. Process Mining Tools: A Comparative Analysis. Master’s thesis, Eindhoven University of Technology, 2011.

[13] A. Berson and S. J. Smith. Data Warehousing, Data Mining, and OLAP. 1997.

[14] R. P. Jagadeesh Chandra Bose. Process Mining in the Large: Preprocessing, Discovery, and Diagnostics. PhD thesis, Eindhoven University of Technology, 2012.

[15] J. C. A. M. Buijs. Mapping Data Sources to XES in a Generic Way. Master’s thesis, Eindhoven University of Technology, 2010.

[16] J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst. Towards Cross-Organizational Process Mining in Collections of Process Models and Their Executions. In Business Process Management Workshops (2), pages 2–13, 2011.

[17] J. W. Buzydlowski, I.-Y. Song, and L. Hassell. A Framework for Object-Oriented On-Line Analytic Processing. In Proceedings of the 1st ACM International Workshop on Data Warehousing and OLAP, DOLAP ’98, pages 10–15, 1998.

[18] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1):65–74, 1997.

[19] S. Chaudhuri, U. Dayal, and V. Narasayya. An Overview of Business Intelligence Technology. Communications of the ACM, 54(8):88–98, August 2011.

[20] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate, 1993. White paper.

[21] G. Colliat. OLAP, Relational, and Multidimensional Database Systems. SIGMOD Record, 25(3):64–69, 1996.

[22] T. H. Davenport. Putting the Enterprise into the Enterprise System. Harvard Business Review, 76(4):121–131, 1998.

[23] K. Dhinesh Kumar, H. Roth, and L. Karunamoorthy. Critical Success Factors for the Implementation of Integrated Automation Solutions with PC Based Control. In Proceedings of the 10th Mediterranean Conference on Control and Automation, 2002.

[24] B. F. van Dongen, A. K. A. de Medeiros, H. M. W. Verbeek, A. J. M. M. Weijters, and W. M. P. van der Aalst. The ProM Framework: A New Era in Process Mining Tool Support. In Proceedings of the 26th International Conference on Applications and Theory of Petri Nets, ICATPN’05, pages 444–454, 2005.

[25] R. Finkelstein. MDD: Database Reaches the Next Dimension. In Database Programming and Design, pages 27–38, 1995.

[26] H. Garcia-Molina and K. Salem. Main Memory Database Systems: An Overview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516, 1992.

[27] M. Golfarelli. Open Source BI Platforms: A Functional and Architectural Comparison. In Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, DaWaK ’09, 2009.

[28] O. Grabova, J. Darmont, J.-H. Chauchat, and I. Zolotaryova. Business Intelligence for Small and Middle-Sized Enterprises. SIGMOD Record, 39(2), 2010.


[29] C. W. Günther. XES Standard Definition. Fluxicon Process Laboratories, pages 13–14, 2009.

[30] C. W. Günther and W. M. P. van der Aalst. Fuzzy Mining – Adaptive Process Simplification Based on Multi-Perspective Metrics. BPM, pages 328–343, 2007.

[31] J. Han. OLAP Mining: An Integration of OLAP with Data Mining. In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), pages 1–9, 1997.

[32] D. Holten and J. J. van Wijk. Visual Comparison of Hierarchically Organized Data. In Proceedings of the 10th Joint Eurographics / IEEE - VGTC Conference on Visualization, EuroVis’08, 2008.

[33] R. P. Jagadeesh Chandra Bose, W. M. P. van der Aalst, I. Zliobaite, and M. Pechenizkiy. Handling Concept Drift in Process Mining. In Proceedings of the 23rd International Conference on Advanced Information Systems Engineering, CAiSE’11, pages 391–405, 2011.

[34] M. R. Jensen, T. H. Møller, and T. B. Pedersen. Specifying OLAP Cubes on XML Data. Journal of Intelligent Information Systems, 17(2-3):255–280, 2001.

[35] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Journal of the Royal Statistical Society, 29(2):119–127, 1980.

[36] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, 2008.

[37] M. Liu, E. A. Rundensteiner, K. Greenfield, C. Gupta, S. Wang, I. Ari, and A. Mehta. E-Cube: Multidimensional Event Sequence Processing Using Concept and Pattern Hierarchies. In International Conference on Data Engineering, pages 1097–1100, 2010.

[38] E. Lo, B. Kao, W.-S. Ho, S. D. Lee, C. K. Chui, and D. W. Cheung. OLAP on Sequence Data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, 2008.

[39] F. Melchert, R. Winter, and M. Klesse. Aligning Process Automation and Business Intelligence to Support Corporate Performance Management. In AMCIS’04, pages 507–507, 2004.

[40] R. B. Messaoud, O. Boussaid, and S. Rabaseda. A New OLAP Aggregation Based on the AHC Technique. In Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, DOLAP ’04, 2004.

[41] S. Negash. Business Intelligence. Communications of the Association for Information Systems, 13(1):177–195, 2004.

[42] T. Niemi, J. Nummenmaa, and P. Thanisch. Constructing OLAP Cubes Based on Queries. In Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP, DOLAP ’01, 2001.

[43] T. B. Pedersen and C. S. Jensen. Multidimensional Database Technology. Computer, 34(12):40–46, December 2001.

[44] D. Riazati, J. A. Thom, and X. Zhang. Drill Across and Visualization of Cubes with Non-conformed Dimensions. In Nineteenth Australasian Database Conference, volume 75, pages 85–93, 2008.

[45] J. Ribeiro. Multidimensional Process Discovery. Beta Dissertation Series D165, 2013.

[46] C. Salka. Ending the MOLAP/ROLAP Debate: Usage Based Aggregation and Flexible HOLAP (Abstract). In Proceedings of the Fourteenth International Conference on Data Engineering, February 23-27, 1998, Orlando, Florida, USA, page 180, 1998.


[47] B. Sigg. DockingFrames 1.1.1 – Common, pages 7–8, 2012.

[48] Stratebi. Open Source B.I. comparative. 2010.

[49] C. Thomsen and T. B. Pedersen. A Survey of Open Source Tools for Business Intelligence. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery, DaWaK’05, 2005.

[50] C. Thomsen and T. B. Pedersen. A Survey of Open Source Tools for Business Intelligence. International Journal of Data Warehousing and Mining, 5(3):56–75, 2009.

[51] E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. Wiley, 2002.

[52] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, 2008.

[53] A. J. M. M. Weijters and A. K. A. de Medeiros. Process Mining with the HeuristicsMiner Algorithm. 2006.

[54] K. Withee. Microsoft Business Intelligence for Dummies. Wiley Publishing, 2010.
