Keyword spotting using time-domain features in a temporal convolutional network

Emad A. Ibrahim, Jos Huisken, Hamed Fatemi, Jose Pineda de Gyvez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

With the increasing demand for voice recognition services, more attention is being paid to simpler algorithms that are capable of running locally on a hardware device. This paper demonstrates simpler speech features derived in the time domain for Keyword Spotting (KWS). The features are constrained-lag autocorrelations computed on overlapped speech frames to form a 2D map. We refer to this as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC), which are computed in the frequency domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS. This is done by employing an open-source dataset from Google Brain containing ~106,000 files of one-second recorded words such as 'Backward', 'Forward', and 'Stop'. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency domain. Our experimental results show that classification of the whole dataset (25 classes) based on MFCC and MFSTS is in very good agreement. We compare the performance of the TCN-based classifier with other related work in the literature. The classification is performed using a small memory footprint (~90 KB) and low compute power (~5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task. The case study demonstrates how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving as high as 98% accuracy. The computational simplicity of MFSTS makes it attractive for low-power KWS applications, paving the way for resource-aware solutions.
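The abstract describes MFSTS as constrained-lag autocorrelations computed on overlapped speech frames and stacked into a 2D map. A minimal sketch of that idea is shown below; the frame length, hop size, and maximum lag are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def mfsts(signal, frame_len=400, hop=160, max_lag=40):
    """Sketch of an MFSTS-style feature map: for each overlapped frame,
    compute autocorrelations up to a constrained maximum lag, normalized
    by the frame energy (lag-0 autocorrelation), and stack them into a
    2D (frames x lags) array. Parameter values are assumptions."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    out = np.zeros((n_frames, max_lag))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy = np.dot(frame, frame) + 1e-12  # lag-0 term, for normalization
        for lag in range(max_lag):
            # similarity between the frame and a lag-shifted copy of itself
            out[i, lag] = np.dot(frame[: frame_len - lag], frame[lag:]) / energy
    return out

# Example: one second of audio at an assumed 16 kHz sampling rate
x = np.random.randn(16000)
feat = mfsts(x)
print(feat.shape)  # (98, 40)
```

Because only dot products over short frames are needed, the features stay entirely in the time domain, which is the compute-simplicity argument the abstract makes against frequency-domain MFCC.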

Original language: English
Title of host publication: Proceedings - Euromicro Conference on Digital System Design, DSD 2019
Editors: Nikos Konofaos, Paris Kitsos
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 313-319
Number of pages: 7
ISBN (Electronic): 978-1-7281-2862-7
DOIs
Publication status: Published - Aug 2019
Event: 22nd Euromicro Conference on Digital System Design, DSD 2019 - Kallithea, Chalkidiki, Greece
Duration: 28 Aug 2019 - 30 Aug 2019
Conference number: 22
http://dsd-seaa2019.csd.auth.gr/

Conference

Conference: 22nd Euromicro Conference on Digital System Design, DSD 2019
Abbreviated title: DSD 2019
Country/Territory: Greece
City: Kallithea, Chalkidiki
Period: 28/08/19 - 30/08/19
Internet address: http://dsd-seaa2019.csd.auth.gr/

Funding

This research has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No 737487. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation program.

Keywords

  • Autocorrelation
  • MFCC
  • Speech Recognition
  • Keyword Spotting (KWS)
  • Temporal Convolutional Network (TCN)

