Keyword spotting using time-domain features in a temporal convolutional network

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review


Abstract

With the increasing demand for voice recognition services, more attention is being paid to simpler algorithms that are capable of running locally on a hardware device. This paper demonstrates simpler speech features derived in the time domain for Keyword Spotting (KWS). The features are constrained-lag autocorrelations computed on overlapping speech frames to form a 2D map. We refer to this as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC), which are computed in the frequency domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS. This is done by employing an open-source dataset from Google Brain containing ~106,000 files of one-second recorded words such as 'Backward', 'Forward', 'Stop', etc. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency domain. Our experimental results show that classification of the whole dataset (25 classes) based on MFCC and on MFSTS is in very good agreement. We compare the performance of the TCN-based classifier with other related work in the literature. The classification is performed using a small memory footprint (~90 KB) and low compute power (~5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task. The case study demonstrates how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving accuracy as high as 98%. The compute simplicity of MFSTS makes it attractive for low-power KWS applications, paving the way for resource-aware solutions.
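For illustration only, the sketch below shows the general idea of features of this kind: constrained-lag autocorrelations computed per overlapping frame and stacked into a 2D (frames × lags) map. The frame length, hop size, and maximum lag are illustrative assumptions, not the exact MFSTS parameters used in the paper.

```python
import numpy as np

def constrained_lag_autocorr_map(signal, frame_len=400, hop=160, max_lag=40):
    """Sketch: constrained-lag autocorrelations on overlapping frames.

    Parameters here (frame_len, hop, max_lag) are hypothetical choices,
    not the configuration reported in the paper.
    """
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Energy-normalised autocorrelation restricted to lags 1..max_lag
        energy = np.dot(frame, frame) + 1e-12
        acf = [np.dot(frame[:-lag], frame[lag:]) / energy
               for lag in range(1, max_lag + 1)]
        rows.append(acf)
    # Rows index frames over time, columns index lags: a 2D time-similarity map
    return np.array(rows, dtype=np.float32)

if __name__ == "__main__":
    # One second of 16 kHz audio (random placeholder signal) yields a
    # (num_frames, max_lag) feature map that a classifier could consume.
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000).astype(np.float32)
    print(constrained_lag_autocorr_map(audio).shape)
```

Such a map stays entirely in the time domain, which is what makes the approach attractive compared with FFT-based MFCC pipelines.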

Original language: English
Title of host publication: Proceedings - Euromicro Conference on Digital System Design, DSD 2019
Editors: Nikos Konofaos, Paris Kitsos
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 313-319
Number of pages: 7
ISBN (Electronic): 978-1-7281-2862-7
DOIs: https://doi.org/10.1109/DSD.2019.00053
Publication status: Published - Aug 2019
Event: 22nd Euromicro Conference on Digital System Design, DSD 2019 - Kallithea, Chalkidiki, Greece
Duration: 28 Aug 2019 - 30 Aug 2019

Conference

Conference: 22nd Euromicro Conference on Digital System Design, DSD 2019
Country: Greece
City: Kallithea, Chalkidiki
Period: 28/08/19 - 30/08/19

Keywords

  • Autocorrelation
  • MFCC
  • Speech Recognition
  • Keyword Spotting (KWS)
  • Temporal Convolutional Network (TCN)

Cite this

Ibrahim, E. A., Huisken, J., Fatemi, H., & Pineda de Gyvez, J. (2019). Keyword spotting using time-domain features in a temporal convolutional network. In N. Konofaos, & P. Kitsos (Eds.), Proceedings - Euromicro Conference on Digital System Design, DSD 2019 (pp. 313-319). [8875204] Piscataway: Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/DSD.2019.00053