Abstract
While social media has taken a fixed place in our daily life, its steadily growing prominence also exacerbates the problem of hostile content and hate-speech. These destructive phenomena call for automatic hate-speech detection, which, however, faces two major challenges: i) the dynamic nature of online content, which causes significant data-drift over time, and ii) a high class-skew, as hate-speech represents a relatively small fraction of the overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain-experts to annotate meticulously selected batches of data instances. However, little prior work exists on batch mode active learning with high class-skew, and in particular for the problem of hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework falls into the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven skew-specialized cluster representation, with a higher potential to “cherry pick” minority classes. In extensive experiments on multiple datasets with highly imbalanced class ratios, we demonstrate substantial improvements in terms of G-Means and F1 measure over several baseline approaches.
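The screening approach named in the abstract can be illustrated with a minimal sketch. It is not the authors' partition-based framework: the function names (`uncertainty`, `screen_then_select`), the parameters `screen_size` and `batch_size`, and the use of greedy farthest-point selection in place of the skew-specialized cluster representation are all assumptions made for illustration only.

```python
import math


def uncertainty(p_pos):
    """Binary uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_pos - 1.0)


def screen_then_select(features, probs, screen_size, batch_size):
    """Return indices of items to send to human annotators.

    Step 1 (screening): keep the `screen_size` most uncertain items.
    Step 2 (representativeness): greedy farthest-point selection inside
    the screened pool. This is a simple stand-in, NOT the cluster
    representation proposed in the paper.
    """
    # Sort item indices from most to least uncertain (stable for ties).
    order = sorted(
        range(len(probs)), key=lambda i: uncertainty(probs[i]), reverse=True
    )
    pool = order[:screen_size]
    if not pool or batch_size <= 0:
        return []

    batch = [pool[0]]  # seed with the single most uncertain item
    while len(batch) < min(batch_size, len(pool)):
        # Add the pooled item that is farthest from everything chosen so far.
        best = max(
            (i for i in pool if i not in batch),
            key=lambda i: min(
                math.dist(features[i], features[j]) for j in batch
            ),
        )
        batch.append(best)
    return batch


# Hypothetical toy data: 2-D feature vectors and predicted P(hate-speech).
features = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1), (9.0, 9.0)]
probs = [0.50, 0.52, 0.45, 0.90, 0.99]
batch = screen_then_select(features, probs, screen_size=3, batch_size=2)
```

On this toy input, ranking by uncertainty alone would pick items 0 and 1, which are near-duplicates in feature space; the second step swaps item 1 for item 2, which lies in a different region of the screened pool.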
Original language | English |
---|---|
Title of host publication | Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track - European Conference, ECML PKDD 2020, Proceedings |
Editors | Yuxiao Dong, Dunja Mladenic, Craig Saunders |
Publisher | Springer |
Pages | 68-84 |
Number of pages | 17 |
ISBN (Print) | 9783030676698 |
Publication status | Published - 2021 |
Event | 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020) - Virtual, Online, Ghent, Belgium Duration: 14 Sep 2020 → 18 Sep 2020 https://ecmlpkdd2020.net/ |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12461 LNAI |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020) |
---|---|
Abbreviated title | ECML PKDD 2020 |
Country/Territory | Belgium |
City | Ghent |
Period | 14/09/20 → 18/09/20 |
Bibliographical note
Funding Information: This work was supported by grants from the Indonesia Endowment Fund for Education (LPDP) and the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001, (partly) financed by the Dutch Research Council.
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
Keywords
- Batch-mode active learning
- Hate-speech recognition
- Imbalance data