While social media has taken a fixed place in our daily lives, its steadily growing prominence also exacerbates the problem of hostile content and hate-speech. These destructive phenomena call for automatic hate-speech detection, which faces two major challenges: i) the dynamic nature of online content, which causes significant data-drift over time, and ii) a high class-skew, as hate-speech represents a relatively small fraction of the overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain-experts to annotate carefully selected batches of data instances. However, little prior work exists on batch mode active learning under high class-skew, in particular for the problem of hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework follows the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven skew-specialized cluster representation, with a higher potential to “cherry pick” minority classes. In extensive experiments on multiple datasets with highly imbalanced class ratios, we demonstrate substantial improvements in terms of G-Means and F1 measure over several baseline approaches.
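The abstract reports results in G-Means and F1 measure rather than plain accuracy. The record itself does not define these metrics, so the sketch below assumes their standard binary definitions: G-Means as the geometric mean of recall and specificity, and F1 as the harmonic mean of precision and recall. The function name `binary_metrics` and the toy label vectors are hypothetical and are not taken from the paper's experiments.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, f1, g_mean) for binary labels.

    `positive` marks the minority class (here: hate-speech).
    Standard textbook definitions are assumed, not the paper's code.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return accuracy, f1, g_mean


# Hypothetical toy data: 95 benign posts (0) and 5 hateful posts (1).
y_true = [0] * 95 + [1] * 5

# Baseline that never flags anything.
always_benign = [0] * 100

# Detector with 5 false alarms that finds 4 of the 5 hateful posts.
detector = [0] * 90 + [1] * 5 + [1] * 4 + [0]

for name, y_pred in [("always_benign", always_benign), ("detector", detector)]:
    acc, f1, g = binary_metrics(y_true, y_pred)
    print(f"{name}: accuracy={acc:.2f} f1={f1:.2f} g_mean={g:.2f}")
```

On this toy data the baseline that never flags anything reaches 0.95 accuracy while its F1 and G-Means are both 0. The detector has slightly lower accuracy (0.94) but an F1 of about 0.57 and a G-Means of about 0.87. This illustrates why skew-aware metrics are the more informative choice when the minority class is the one of interest.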
|Title of host publication|Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track - European Conference, ECML PKDD 2020, Proceedings|
|Editors|Yuxiao Dong, Dunja Mladenic, Craig Saunders|
|Number of pages|17|
|Publication status|Published - 2021|
|Event|2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020) - Virtual, Online, Ghent, Belgium (14 Sep 2020 → 18 Sep 2020)|
|Series name|Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Conference|2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020)|
|Abbreviated title|ECML PKDD 2020|
|Period|14/09/20 → 18/09/20|
Bibliographical note
Funding Information:
This work was supported by grants from the Indonesia Endowment Fund for Education (LPDP) and the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001, (partly) financed by the Dutch Research Council.
© 2021, Springer Nature Switzerland AG.
- Batch-mode active learning
- Hate-speech recognition
- Imbalance data