Contrastive learning of general-purpose audio representations

Aaqib Saeed, David Grangier, Neil Zeghidour

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

141 Citations (Scopus)

Abstract

We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio. Our approach is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings. We build on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio. We pre-train embeddings on the large-scale Audioset database and transfer these representations to 9 diverse classification tasks, including speech, music, animal sounds, and acoustic scenes. We show that despite its simplicity, our method significantly outperforms previous self-supervised systems. We furthermore conduct ablation studies to identify key design choices and release a library to pre-train and fine-tune COLA models.
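The contrastive objective described in the abstract can be illustrated with a minimal sketch. It is not the released COLA library and not necessarily the paper's exact formulation: the cosine similarity, the `temperature` value, the function names, and the random toy embeddings are all assumptions made for illustration. Each anchor embedding is scored against every candidate in the batch, and the loss is the cross-entropy of picking the candidate that comes from the same recording.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, temperature=0.1):
    """Batch-wise contrastive loss.

    anchors[i] and positives[i] are embeddings of two segments taken from
    recording i; positives[j] with j != i act as negatives for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of selecting the matching segment.
        total += log_norm - logits[i]
    return total / len(anchors)


# Toy data standing in for encoder outputs (hypothetical, not real audio).
random.seed(0)
anchors = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# "Same recording": a slightly perturbed copy of each anchor.
aligned = [[x + 0.05 * random.gauss(0, 1) for x in row] for row in anchors]
# "Different recording": the same vectors paired with the wrong anchors.
shuffled = aligned[1:] + aligned[:1]

loss_aligned = contrastive_loss(anchors, aligned)
loss_shuffled = contrastive_loss(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is low when each anchor is paired with a segment from its own recording and high when the pairing is wrong, which is the behaviour the pre-training objective rewards.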

Original language: English
Title of host publication: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 3875-3879
Number of pages: 5
ISBN (Electronic): 978-1-7281-7605-5
Publication status: Published - 13 May 2021
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: 6 Jun 2021 to 11 Jun 2021
https://2021.ieeeicassp.org/

Conference

Conference: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Abbreviated title: ICASSP 2021
Country/Territory: Canada
City: Virtual, Toronto
Period: 6/06/21 to 11/06/21

Keywords

  • Audio
  • Self-supervised learning
  • Sound
