PS3: Partition-Based Skew-Specialized Sampling for Batch Mode Active Learning in Imbalanced Text Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

1 Citation (Scopus)

Abstract

While social media has taken a fixed place in our daily lives, its steadily growing prominence also exacerbates the problem of hostile content and hate speech. These destructive phenomena call for automatic hate-speech detection, which, however, faces two major challenges: i) the dynamic nature of online content, which causes significant data drift over time, and ii) high class skew, as hate speech represents a relatively small fraction of overall online content. The first challenge naturally calls for a batch mode active learning solution, which updates the detection system by querying human domain experts to annotate carefully selected batches of data instances. However, little prior work exists on batch mode active learning under high class skew, particularly for hate-speech detection. In this work, we propose a novel partition-based batch mode active learning framework to address this problem. Our framework follows the so-called screening approach, which pre-selects a subset of the most uncertain data items and then selects a representative set from this uncertainty space. To tackle the class-skew problem, we use a data-driven, skew-specialized cluster representation with a higher potential to “cherry pick” minority classes. In extensive experiments on multiple datasets with highly imbalanced class ratios, we demonstrate substantial improvements in G-Means and F1 measure over several baseline approaches.
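
The screening approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the margin-based uncertainty score, the small NumPy k-means, and the "closest to centroid" representative are all assumptions made for illustration, and the skew-specialized cluster representation of PS3 is not reproduced here.

```python
import numpy as np


def screen_most_uncertain(proba, pool_size):
    """Return indices of the pool_size items whose predicted positive-class
    probability is closest to 0.5 (an assumed uncertainty score)."""
    margin = np.abs(np.asarray(proba, dtype=float) - 0.5)
    return np.argsort(margin, kind="stable")[:pool_size]


def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means in NumPy; stands in for the partitioning step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Recompute assignments so labels match the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


def select_batch(X, proba, pool_size, batch_size, seed=0):
    """Screen the most uncertain items, partition them into batch_size
    clusters, and query one representative per non-empty cluster."""
    pool = screen_most_uncertain(proba, pool_size)
    X_pool = X[pool]
    labels, centers = kmeans(X_pool, batch_size, seed=seed)
    batch = []
    for j in range(batch_size):
        members = np.flatnonzero(labels == j)
        if len(members) == 0:
            continue  # the batch may end up smaller than batch_size
        d = np.linalg.norm(X_pool[members] - centers[j], axis=1)
        # Generic representative: the member nearest to the centroid.
        batch.append(int(pool[members[d.argmin()]]))
    return batch
```

Calling `select_batch(X, proba, pool_size=40, batch_size=5)` on a feature matrix `X` and a vector `proba` of predicted probabilities returns up to five distinct indices into `X`, all drawn from the 40 most uncertain items. A skew-aware variant would replace the nearest-to-centroid rule with a representative chosen to favor likely minority-class items.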

Original language: English
Title of host publication: Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track - European Conference, ECML PKDD 2020, Proceedings
Editors: Yuxiao Dong, Dunja Mladenic, Craig Saunders
Publisher: Springer
Pages: 68-84
Number of pages: 17
ISBN (Print): 9783030676698
Publication status: Published - 2021
Event: 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020) - Virtual, Online, Ghent, Belgium
Duration: 14 Sep 2020 → 18 Sep 2020
https://ecmlpkdd2020.net/

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12461 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020)
Abbreviated title: ECML PKDD 2020
Country/Territory: Belgium
City: Ghent
Period: 14/09/20 → 18/09/20
Bibliographical note

Funding Information:
This work was supported by grants from the Indonesia Endowment Fund for Education (LPDP) and the Ministry of Research, Technology and Higher Education of the Republic of Indonesia (BUDI-LN Scholarship). The authors would also like to thank the research programme Commit2Data, specifically the RATE-Analytics project NWO628 003 001, (partly) financed by the Dutch Research Council.

Publisher Copyright:
© 2021, Springer Nature Switzerland AG.

Keywords

  • Batch-mode active learning
  • Hate-speech recognition
  • Imbalanced data
