Enabling efficient process mining on large data sets: realizing an in-database process mining operator

Remco Dijkman (Corresponding author), Juntao Gao, Alifah Syamsiyah, Boudewijn van Dongen, Paul Grefen, Arthur ter Hofstede

Research output: Contribution to journal > Article > Academic > peer-review

Abstract

Process mining can be used to analyze business processes based on logs of their execution. These execution logs are often obtained by querying a database and storing the results in a file. The mining itself is then done on the file, such that the data processing power of the database cannot be used after the log is extracted. Enabling process mining directly on a database therefore provides additional flexibility and efficiency. To help facilitate this, this paper formally defines a database operator that extracts the ‘directly follows’ relation—one of the relations that is at the heart of process mining—from an operational database. It defines the operator using the well-known relational algebra and formally proves equivalence properties of the operator that are useful for query optimization. Subsequently, it presents time-complexity properties of the operator. Finally, it presents an implementation of the operator as part of the H2 DBMS and demonstrates that this implementation extracts the ‘directly follows’ relation from a database with an arbitrary database structure within a fraction of a second; several orders of magnitude faster than is currently possible.
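
As an illustration of the 'directly follows' relation mentioned in the abstract, the following is a minimal sketch and not the paper's in-database operator. It assumes the common process-mining reading of the relation: activity a is directly followed by activity b if, within some case, b is the next event after a. The function name `directly_follows`, the `(case_id, activity, timestamp)` tuple layout, and the sample log are all hypothetical.

```python
from collections import defaultdict


def directly_follows(events):
    """Return the set of (a, b) pairs where b directly follows a in some case.

    `events` is an iterable of (case_id, activity, timestamp) tuples.
    This tuple layout is an assumption made for the example.
    """
    # Group events by case so that each case forms one trace.
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))

    pairs = set()
    for trace in by_case.values():
        # Order each trace by timestamp (stable for equal timestamps).
        trace.sort(key=lambda event: event[0])
        # Every pair of consecutive events contributes one relation entry.
        for (_, a), (_, b) in zip(trace, trace[1:]):
            pairs.add((a, b))
    return pairs


# Hypothetical two-case event log.
log = [
    ("c1", "register", 1),
    ("c1", "check", 2),
    ("c1", "pay", 3),
    ("c2", "register", 1),
    ("c2", "pay", 2),
]
result = directly_follows(log)
```

This sketch works on an already extracted, in-memory log, which is the file-based style of processing the abstract contrasts with computing the relation inside the DBMS.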

Original language: English
Journal: Distributed and Parallel Databases
DOI: 10.1007/s10619-019-07270-1
Publication status: E-pub ahead of print - 19 May 2019

Fingerprint

Mathematical operators
Data base
Process mining
Operator
Algebra
Industry
Equivalence
Business process
Query optimization
Database management systems

Keywords

  • Database management system
  • Formal methods
  • Process mining
  • Relational algebra

Cite this

@article{c3fa4133a5b340bf89618e0c46255abc,
title = "Enabling efficient process mining on large data sets: realizing an in-database process mining operator",
abstract = "Process mining can be used to analyze business processes based on logs of their execution. These execution logs are often obtained by querying a database and storing the results in a file. The mining itself is then done on the file, such that the data processing power of the database cannot be used after the log is extracted. Enabling process mining directly on a database therefore provides additional flexibility and efficiency. To help facilitate this, this paper formally defines a database operator that extracts the ‘directly follows’ relation—one of the relations that is at the heart of process mining—from an operational database. It defines the operator using the well-known relational algebra and formally proves equivalence properties of the operator that are useful for query optimization. Subsequently, it presents time-complexity properties of the operator. Finally, it presents an implementation of the operator as part of the H2 DBMS and demonstrates that this implementation extracts the ‘directly follows’ relation from a database with an arbitrary database structure within a fraction of a second; several orders of magnitude faster than is currently possible.",
keywords = "Database management system, Formal methods, Process mining, Relational algebra",
author = "Dijkman, Remco and Gao, Juntao and Syamsiyah, Alifah and {van Dongen}, Boudewijn and Grefen, Paul and {ter Hofstede}, Arthur",
year = "2019",
month = "5",
day = "19",
doi = "10.1007/s10619-019-07270-1",
language = "English",
journal = "Distributed and Parallel Databases",
issn = "0926-8782",
publisher = "Springer",

}

TY - JOUR

T1 - Enabling efficient process mining on large data sets

T2 - realizing an in-database process mining operator

AU - Dijkman, Remco

AU - Gao, Juntao

AU - Syamsiyah, Alifah

AU - van Dongen, Boudewijn

AU - Grefen, Paul

AU - ter Hofstede, Arthur

PY - 2019/5/19

Y1 - 2019/5/19

AB - Process mining can be used to analyze business processes based on logs of their execution. These execution logs are often obtained by querying a database and storing the results in a file. The mining itself is then done on the file, such that the data processing power of the database cannot be used after the log is extracted. Enabling process mining directly on a database therefore provides additional flexibility and efficiency. To help facilitate this, this paper formally defines a database operator that extracts the ‘directly follows’ relation—one of the relations that is at the heart of process mining—from an operational database. It defines the operator using the well-known relational algebra and formally proves equivalence properties of the operator that are useful for query optimization. Subsequently, it presents time-complexity properties of the operator. Finally, it presents an implementation of the operator as part of the H2 DBMS and demonstrates that this implementation extracts the ‘directly follows’ relation from a database with an arbitrary database structure within a fraction of a second; several orders of magnitude faster than is currently possible.

KW - Database management system

KW - Formal methods

KW - Process mining

KW - Relational algebra

UR - http://www.scopus.com/inward/record.url?scp=85065715620&partnerID=8YFLogxK

U2 - 10.1007/s10619-019-07270-1

DO - 10.1007/s10619-019-07270-1

M3 - Article

AN - SCOPUS:85065715620

JO - Distributed and Parallel Databases

JF - Distributed and Parallel Databases

SN - 0926-8782

ER -