Process mining has emerged as a way to analyze systems and their actual use based on the event logs they produce. Unlike other business intelligence domains, process mining focuses on concurrent processes rather than on static or mainly sequential structures. Process mining is applicable to a wide range of systems, from pure information systems (e.g., ERP systems) to systems where hardware plays a more prominent role (e.g., embedded systems). The only requirement is that the system produces event logs, thereby recording (parts of) its actual behavior. Current Process Mining Algorithms (PMAs) face two major challenges: 1) real-life event logs contain large amounts of data about previous executions of a process, and 2) because they attempt to derive accurate information from these logs, PMAs are computationally expensive. Moreover, process mining experts often combine multiple PMAs to gain insight into a system's real behavior from different perspectives, i.e., they execute process mining workflows. Such workflows are currently executed manually and sequentially, or are hard-coded. In the past decade, emerging concepts such as grid computing, service-oriented architectures, and cloud computing have provided solutions to the increasing demand for data storage and computing power. These technologies enable worldwide distributed resources, e.g., software and infrastructure, to cooperate towards a specific user-defined goal. Such distributed environments can, on the one hand, offer a solution to the complexity challenges of the process mining domain and, on the other hand, make it possible to expose PMAs as services that can be combined and orchestrated via workflow engines. This PhD thesis proposes a framework for the execution of process mining workflows in a distributed environment.
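To make the input of such algorithms concrete: a minimal sketch of the kind of analysis PMAs perform on event logs is extracting the directly-follows relation, a basic building block of many process discovery techniques. The toy log and activity names below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# A toy event log: each trace is the ordered list of activities
# recorded for one process instance (case). The activity names are
# illustrative, not taken from any real system.
event_log = [
    ["register", "check", "decide", "notify"],
    ["register", "check", "check", "decide", "notify"],
    ["register", "decide", "notify"],
]

def directly_follows(log):
    """Count how often activity b directly follows activity a."""
    counts = defaultdict(int)
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return dict(counts)

relations = directly_follows(event_log)
# "register" is directly followed by "check" in two of the three traces.
```

Real PMAs go far beyond such counts, but this illustrates why log size matters: the work grows with the number and length of the recorded traces.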
The distribution of the PMAs is done at two levels: 1) process mining algorithms are parallelized, thus considerably reducing their computation time; and 2) a framework for the automated execution of process mining workflows is proposed. For the first level, we focus on one particular advanced process mining algorithm, the Genetic Mining Algorithm (GMA), and propose two distribution strategies, the Distributed Genetic Mining Algorithm (DGMA) and the Distributed Sample-based Genetic Mining Algorithm (DSGMA), which significantly improve the GMA's time efficiency. DGMA distributes the GMA computation over different computational resources using a coarse-grained approach. The second strategy, DSGMA, further reduces the computation time through data distribution, exploiting the redundancy in event logs. For both algorithms, we derive guidelines for their parameter configuration based on empirical evaluations, and we validate these guidelines on several real-life event logs. All the algorithms described in this thesis have been implemented as plug-ins in the ProM framework, an open-source tool available at www.processmining.org. For the second level, we provide a formal description of a grid architecture suitable for process mining experiments in terms of a Colored Petri Net (CPN). The CPN can be seen as a reference model for grids and clarifies the basic concepts at a conceptual level. Moreover, the CPN allows for various kinds of analysis, ranging from verification to performance analysis. The level of detail present in the CPN model allows us to implement, almost straightforwardly, a real grid architecture based on the model. Note that even though our reference model was inspired by challenges from the process mining domain, it can be used for other computationally demanding domains as well. Based on the CPN reference model, we implemented a prototype called YAGA (Yet Another Grid Architecture), a service-based framework for supporting process mining experiments.
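The coarse-grained approach mentioned above is commonly realized as an island model: subpopulations evolve independently on separate resources and occasionally exchange their best individuals. The sketch below illustrates that general pattern only; the bit-string individuals, toy fitness function, and parameter values are stand-ins, not the actual DGMA, which evolves candidate process models and evaluates them against the event log.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    # Toy fitness: count of ones in a bit string (a placeholder for
    # measuring how well a mined model replays the event log).
    return sum(individual)

def evolve_island(population, generations=20):
    """Evolve one subpopulation independently (minimal GA steps)."""
    rng = random.Random(0)
    for _ in range(generations):
        # Tournament selection followed by a single bit-flip mutation.
        parent = max(rng.sample(population, 3), key=fitness)
        child = parent[:]
        child[rng.randrange(len(child))] ^= 1
        # Elitist replacement: child replaces the worst only if better.
        worst = min(range(len(population)), key=lambda j: fitness(population[j]))
        if fitness(child) > fitness(population[worst]):
            population[worst] = child
    return population

def migrate(islands):
    # Ring migration: each island's best replaces the next island's worst.
    bests = [max(pop, key=fitness) for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = bests[i - 1][:]

rng = random.Random(42)
islands = [[[rng.randint(0, 1) for _ in range(16)] for _ in range(10)]
           for _ in range(4)]
initial_best = max(fitness(ind) for pop in islands for ind in pop)

with ThreadPoolExecutor() as ex:  # islands evolve concurrently
    for epoch in range(5):
        islands = list(ex.map(evolve_island, islands))
        migrate(islands)

final_best = max(fitness(ind) for pop in islands for ind in pop)
```

Because islands only communicate during migration, the bulk of the work runs with no synchronization, which is what makes the coarse-grained scheme attractive for distribution over grid resources.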
YAGA is a simple but extensible grid architecture that combines a powerful workflow engine, YAWL, with the ProM framework through a Java-based grid middleware. By combining the CPN reference model with YAGA, we provide a powerful grid framework for process mining experiments. The model allows for easy experimentation and extensive debugging. It also offers an easy and rapid way to choose optimal parameters for real-life workflows. The resulting estimates can help users plan their experiments and/or re-configure their workflows. Moreover, performing model simulations on-the-fly can give realistic resource-load predictions, which can be used to improve the scheduling process.
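The kind of prediction such simulations enable can be sketched with a minimal scheduling model: given estimated job durations and a number of grid nodes, compute when each job would finish under greedy earliest-free-node assignment. The durations and node count below are hypothetical; the actual framework derives such estimates from the CPN model rather than from this simplified heuristic.

```python
import heapq

def simulate_schedule(job_durations, num_nodes):
    """Return the makespan when each job is assigned to the earliest-free node."""
    # Each heap entry is the time at which a node becomes free.
    nodes = [0.0] * num_nodes
    heapq.heapify(nodes)
    finish_times = []
    for duration in job_durations:
        free_at = heapq.heappop(nodes)   # earliest available node
        done = free_at + duration
        finish_times.append(done)
        heapq.heappush(nodes, done)
    return max(finish_times)

# Eight process mining jobs on three nodes (illustrative numbers only).
makespan = simulate_schedule([4, 2, 7, 3, 1, 5, 6, 2], num_nodes=3)
```

Even this simple model shows the value of load prediction: the total work is 30 time units, so three nodes cannot finish before time 10, and the greedy schedule comes close to that bound.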
| Qualification | Doctor of Philosophy |
| Award date | 29 Mar 2011 |
| Place of Publication | Eindhoven |
| Publication status | Published - 2011 |