Abstract
Data transformation is an important step in machine learning pipelines that can substantially improve model performance. For instance, min-max normalization is often used to make all variables lie in the same range, while log-transformation is used to map data scattered across several orders of magnitude to a logarithmic space. Such transformations can be beneficial when the machine learning approach measures distances in a metric space, as clustering-based approaches do. The two transformations can be combined to reveal hidden patterns in log-normally distributed data, which commonly occur in biological and medical datasets. In this work we introduce a novel evolutionary approach designed to automatically determine the optimal log-transformation and selection of variables. Our approach is built around an interpretable AI system (created by pyFUME), so that every transformation is followed by an inverse transformation that maps the values back into the original universe of discourse and preserves the interpretability of the results. We test our approach on two synthetic datasets, designed to reproduce a condition in which some variables are normally distributed, some are log-normally distributed, and some are just noise. Our results show that our approach yields better-performing models than conventional methods, and that the resulting model is also more interpretable, making this approach particularly useful for studying biomedical datasets.
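The combination of log-transformation, min-max normalization, and inverse mapping mentioned in the abstract can be sketched as follows. This is a minimal illustration in NumPy, not the authors' implementation and not part of the pyFUME API: the function names `log_minmax_forward` and `log_minmax_inverse`, the base-10 logarithm, and the example values are all assumptions made for the sake of the example.

```python
import numpy as np


def log_minmax_forward(x):
    """Log-transform strictly positive values, then min-max scale them to [0, 1].

    Hypothetical helper: returns the scaled values together with the
    parameters needed to undo the transformation.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("log-transformation requires strictly positive values")
    logged = np.log10(x)  # base 10 is an assumption; any base works
    lo, hi = logged.min(), logged.max()
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant variable")
    scaled = (logged - lo) / (hi - lo)
    return scaled, (lo, hi)


def log_minmax_inverse(scaled, params):
    """Map scaled values back into the original universe of discourse."""
    lo, hi = params
    logged = np.asarray(scaled, dtype=float) * (hi - lo) + lo
    return 10.0 ** logged


# Illustrative values spanning several orders of magnitude.
values = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
scaled, params = log_minmax_forward(values)
restored = log_minmax_inverse(scaled, params)
```

After the forward step the five values are evenly spaced in [0, 1], and the inverse step recovers the original values up to floating-point error, which is what allows results to be reported in the original units.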
Original language | English |
---|---|
Title of host publication | 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) |
Publisher | Institute of Electrical and Electronics Engineers |
Number of pages | 8 |
ISBN (Electronic) | 978-1-6654-8462-6 |
Publication status | Published - 26 Aug 2022 |
Event | 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2022 - Ottawa, Canada. Duration: 15 Aug 2022 → 17 Aug 2022 |
Conference
Conference | 2022 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2022 |
---|---|
Country/Territory | Canada |
City | Ottawa |
Period | 15/08/22 → 17/08/22 |
Bibliographical note
Funding Information: The work was performed under the HPC-EUROPA3 project (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the authors gratefully acknowledge the support of the Department of Environmental Sciences, Informatics and Statistics (DAIS) of the Ca’ Foscari University of Venice, and the computer resources and technical support provided by CINECA. This work was also partially supported by DAIS - Ca’ Foscari University of Venice within the IRIDE program.
Keywords
- data normalization
- data transformation
- fuzzy logic
- fuzzy model
- genetic algorithm
- interpretable AI
- log-transformation
- machine learning