Abstract
While social media has become a fixture of daily life, its steadily growing prominence also exacerbates the problem of hostile content and hate-speech. These destructive phenomena call for automatic hate-speech detection, which faces two major challenges: i) the dynamic nature of online content, which causes significant data-drift over time, and ii) a high class-skew, as hate-speech represents a relatively small fraction of overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by asking human domain experts to annotate carefully selected batches of data instances. However, little prior work exists on batch mode active learning under high class-skew, particularly for hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework follows the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven, skew-specialized cluster representation with a higher potential to “cherry pick” minority classes. In extensive experiments on multiple datasets with highly imbalanced class ratios, we demonstrate substantial improvements in G-Means and F1 measure over several baseline approaches.
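The abstract reports results in G-Means and F1 rather than accuracy. The sketch below illustrates why that choice matters under class-skew. It assumes the common binary definition of G-Means (the geometric mean of sensitivity and specificity), which may differ from the exact variant used in the paper. The function names and the confusion-matrix counts are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def accuracy(tp, fn, tn, fp):
    """Fraction of all items classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (binary case, assumed)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


def f1_score(tp, fn, fp):
    """Harmonic mean of precision and recall on the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical data: 1000 items, of which 50 (5%) are hate-speech.
# Classifier A labels everything as non-hate.
a = dict(tp=0, fn=50, tn=950, fp=0)
# Classifier B finds 40 of the 50 hate-speech items, with 95 false alarms.
b = dict(tp=40, fn=10, tn=855, fp=95)

for name, c in (("A", a), ("B", b)):
    print(
        name,
        f"accuracy={accuracy(**c):.3f}",
        f"g_mean={g_mean(**c):.3f}",
        f"f1={f1_score(c['tp'], c['fn'], c['fp']):.3f}",
    )
```

In this made-up example, classifier A has the higher accuracy even though it never detects the minority class, while its G-Means and F1 are both zero. Classifier B scores lower on accuracy but clearly higher on the two skew-aware measures.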
| Original language | English |
|---|---|
| Title of host publication | Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track |
| Subtitle of host publication | European Conference, ECML PKDD 2020, Ghent, Belgium, September 14–18, 2020, Proceedings |
| Editors | Yuxiao Dong, Georgiana Ifrim, Dunja Mladenic, Craig Saunders, Sofie Van Hoecke |
| Place of Publication | Cham |
| Publisher | Springer |
| Pages | 68-84 |
| Number of pages | 17 |
| Volume | V |
| ISBN (Electronic) | 978-3-030-67670-4 |
| ISBN (Print) | 978-3-030-67669-8 |
| Publication status | Published - 25 Feb 2021 |
| Event | 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020), Virtual, Online, Ghent, Belgium. Duration: 14 Sept 2020 → 18 Sept 2020. https://ecmlpkdd2020.net/ |
Publication series
| Name | Lecture Notes in Computer Science (LNCS) |
|---|---|
| Volume | 12461 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |

| Name | Lecture Notes in Artificial Intelligence (LNAI) |
|---|---|
| Volume | 12461 |
| ISSN (Print) | 2945-9133 |
| ISSN (Electronic) | 2945-9141 |
Conference
| Conference | 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020) |
|---|---|
| Abbreviated title | ECML PKDD 2020 |
| Country/Territory | Belgium |
| City | Ghent |
| Period | 14/09/20 → 18/09/20 |
| Internet address | https://ecmlpkdd2020.net/ |
Bibliographical note
Funding Information: This work was supported by grants from Indonesia Endowment Fund for Education (LPDP) and Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001 (partly) financed by the Dutch Research Council.
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
Keywords
- Batch-mode active learning
- Hate-speech recognition
- Imbalanced data
Title
PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data