Keyword spotting using time-domain features in a temporal convolutional network

Emad A. Ibrahim, Jos Huisken, Hamed Fatemi, Jose Pineda de Gyvez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

With the increasing demand for voice recognition services, more attention is being paid to simpler algorithms that can run locally on a hardware device. This paper demonstrates simpler speech features derived in the time domain for Keyword Spotting (KWS). The features are constrained-lag autocorrelations computed on overlapping speech frames to form a 2D map. We refer to this as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC), which are computed in the frequency domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS. This is done using an open-source dataset from Google Brain containing ~106,000 files of one-second recorded words such as 'Backward', 'Forward', and 'Stop'. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency domain. Our experimental results show that the classifications of the whole dataset (25 classes) based on MFCC and MFSTS are in very good agreement. We compare the performance of the TCN-based classifier with related work in the literature. The classification is performed using a small memory footprint (~90 KB) and low compute power (~5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task. The case study demonstrates how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving accuracy as high as 98%. The computational simplicity of MFSTS makes it attractive for low-power KWS applications, paving the way for resource-aware solutions.
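
The abstract describes the features only as constrained-lag autocorrelations over overlapped frames arranged into a 2D map. As a rough illustration of that general idea, and not the authors' MFSTS definition, the sketch below computes energy-normalized autocorrelations for a small set of lags on each frame. The function name, the frame length, hop size, number of lags, and the normalization are all assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np


def framed_lag_autocorr(signal, frame_len=400, hop=160, num_lags=13):
    """Return a (num_frames, num_lags) map of normalized autocorrelations.

    All parameter defaults are illustrative assumptions, not values
    taken from the paper. Requires num_lags <= frame_len.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Overlapping frames: consecutive frames start `hop` samples apart.
    num_frames = 1 + (signal.size - frame_len) // hop
    features = np.zeros((num_frames, num_lags))

    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energy = np.dot(frame, frame)
        if energy == 0.0:
            continue  # silent frame: leave the row at zero
        for k in range(num_lags):
            # Correlate the frame with a copy of itself shifted by k samples,
            # then normalize by the frame energy (lag 0 becomes exactly 1).
            features[i, k] = np.dot(frame[: frame_len - k], frame[k:]) / energy

    return features
```

Each row of the result is one frame and each column one lag, so the output can be fed to a classifier as a 2D map in the same way as an MFCC matrix. Everything here stays in the time domain: the only operations are multiplications and additions on raw samples.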

Original language: English
Title of host publication: Proceedings - Euromicro Conference on Digital System Design, DSD 2019
Editors: Nikos Konofaos, Paris Kitsos
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 313-319
Number of pages: 7
ISBN (Electronic): 978-1-7281-2862-7
DOIs
Publication status: Published - Aug 2019
Event: 22nd Euromicro Conference on Digital System Design, DSD 2019 - Kallithea, Chalkidiki, Greece
Duration: 28 Aug 2019 to 30 Aug 2019
Conference number: 22
http://dsd-seaa2019.csd.auth.gr/

Conference

Conference: 22nd Euromicro Conference on Digital System Design, DSD 2019
Abbreviated title: DSD 2019
Country/Territory: Greece
City: Kallithea, Chalkidiki
Period: 28/08/19 to 30/08/19
Internet address

Keywords

  • Autocorrelation
  • MFCC
  • Speech Recognition
  • Keyword Spotting (KWS)
  • Temporal Convolutional Network (TCN)
