Abstract
In order to (semi)automate data cleaning and preprocessing, we need a clear and measurable definition of data quality. Data readiness levels have been proposed to fit this need, but they require a more detailed and measurable definition than is given in prior works. We present a practical framework focused on machine learning that encapsulates data cleaning and (pre)processing procedures. In our framework, datasets are classified within bands, and each band introduces more fine-grained terminology and processing steps. Scores are assigned to each step, resulting in a data quality score. This allows teams of people, as well as automated processes, to track and reason about the cleaning process, and communicate the current status and deficiencies in a more structured, well-documented manner.
Original language | English |
---|---|
Title of host publication | Machine Learning and Knowledge Discovery in Databases - International Workshops of ECML PKDD 2019, Proceedings |
Editors | Peggy Cellier, Kurt Driessens |
Publisher | Springer |
Pages | 3-16 |
Number of pages | 14 |
ISBN (Print) | 9783030438227 |
Publication status | Published - 2020 |
Event | 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019), Wurzburg, Germany. Duration: 16 Sept 2019 → 20 Sept 2019. Conference number: 19. http://ecmlpkdd2019.org/ |
Publication series
Name | Communications in Computer and Information Science |
---|---|
Volume | 1167 CCIS |
ISSN (Print) | 1865-0929 |
ISSN (Electronic) | 1865-0937 |
Conference
Conference | 2019 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2019) |
---|---|
Abbreviated title | ECML PKDD 2019 |
Country/Territory | Germany |
City | Wurzburg |
Period | 16/09/19 → 20/09/19 |
Keywords
- Automated data science
- Data cleaning
- Data quality
- Data readiness levels
- Preprocessing