Abstract
Dealing with partially known or missing data is a common problem in machine learning. This work addresses the problem of querying the true value of partially known data in order to improve the quality of the learned model. The study falls within active learning, since we assume that the precise value of some partial data can be queried to reduce uncertainty in the learning process, but it handles any kind of partial data, not only entirely missing values. We propose a querying strategy based on racing algorithms, in which several models compete. The idea is to identify the query that will most help to quickly decide the winning model of the competition. After discussing and formalizing the general ideas of our approach, we study the particular case of decision trees with interval-valued features and set-valued labels. The experimental results indicate that the proposed approach significantly outperforms simpler baseline strategies in the case of partially specified features, while it achieves similar performance in the case of partially specified labels.
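To make the racing idea concrete, the following is a minimal sketch, not the method of the paper. It assumes a toy setting in which the competing models are one-feature threshold classifiers (a stand-in for decision trees), each instance has a single interval-valued feature and a precise label, and the query score is a simple proxy: the number of still-competitive models whose loss on an instance is uncertain. All function names and the scoring rule are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]
Instance = Tuple[Interval, int]  # (interval-valued feature, precise 0/1 label)


def loss_bounds(threshold: float, x: Interval, y: int) -> Tuple[int, int]:
    """Lower/upper 0/1 loss of the rule 'predict 1 if x > threshold' on interval x.

    The rule is monotone in x, so checking both endpoints covers every
    value the true feature could take inside the interval.
    """
    lo, hi = x
    losses = {int(int(lo > threshold) != y), int(int(hi > threshold) != y)}
    return min(losses), max(losses)


def risk_bounds(threshold: float, data: List[Instance]) -> Tuple[int, int]:
    """Interval-valued empirical risk: sums of the per-instance loss bounds."""
    lows, highs = zip(*(loss_bounds(threshold, x, y) for x, y in data))
    return sum(lows), sum(highs)


def undominated(thresholds: List[float], data: List[Instance]) -> List[float]:
    """Keep models whose best-case risk does not exceed the smallest worst-case risk."""
    bounds = {t: risk_bounds(t, data) for t in thresholds}
    best_upper = min(upper for _, upper in bounds.values())
    return [t for t in thresholds if bounds[t][0] <= best_upper]


def pick_query(thresholds: List[float], data: List[Instance]) -> Optional[int]:
    """Index of the instance whose loss is uncertain for the most surviving models.

    Assumed proxy for 'the query that helps most to decide the race';
    returns None when no query can change any surviving model's risk bounds.
    """
    alive = undominated(thresholds, data)
    scores = [sum(loss_bounds(t, x, y) == (0, 1) for t in alive) for x, y in data]
    best = max(range(len(data)), key=scores.__getitem__)
    return best if scores[best] > 0 else None


# Toy data: four instances with interval-valued features.
data = [((0.0, 1.0), 0), ((2.0, 4.0), 1), ((5.0, 6.0), 1), ((1.5, 3.5), 0)]
thresholds = [1.2, 3.0, 10.0]
print(undominated(thresholds, data), pick_query(thresholds, data))
```

In this toy run the threshold 10.0 is eliminated because its best-case risk is worse than the worst-case risk of threshold 1.2, and the selected query is an instance whose interval straddles a surviving threshold. Once the queried interval is replaced by its precise value, the risk bounds tighten and the race can be re-evaluated.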
Original language | English |
---|---|
Pages (from-to) | 9285-9305 |
Number of pages | 21 |
Journal | Soft Computing |
Volume | 25 |
Issue number | 14 |
Publication status | Published - Jul 2021 |
Externally published | Yes |
Keywords
- Active learning
- Data querying
- Decision trees
- Partial data
- Racing algorithms