APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 1

Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector

Ryota Nishimura, Tokushima University, Japan, nishimura@is.tokushima-u.ac.jp; Takaaki Uno, Toyohashi University of Technology, Japan; Taiki Yamamoto, Tokushima University, Japan; Kengo Ohta, National Institute of Technology, Japan; Norihide Kitaoka, Tokushima University, Japan
 
Suggested Citation
Ryota Nishimura, Takaaki Uno, Taiki Yamamoto, Kengo Ohta and Norihide Kitaoka (2024), "Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e14. http://dx.doi.org/10.1561/116.20240014

Publication Date: 22 Aug 2024
© 2024 R. Nishimura, T. Uno, T. Yamamoto, K. Ohta and N. Kitaoka
 
Subjects
Speech and spoken language processing,  Speech/audio/image/video compression,  Adaptive signal processing,  Statistical signal processing,  Signal processing for communications
 
Keywords
Wake word, CTC, end-to-end modeling, phoneme sequence detector
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Related Work 
Wake Word Detection Method 
Experiment 
Discussion 
Conclusion 
References 

Abstract

Most wake word (WW) detection systems used in smartphones and smart speakers detect only specific, predefined WWs such as “Hey, Siri” or “OK, Google”. Building such a system requires collecting a large speech corpus containing many examples of the selected WWs to train the model. Detecting a different WW requires collecting a new speech corpus and retraining the model.

In this study, we propose a system that can detect any chosen WW without additional model training or a corpus of WW utterances, allowing users to select and use their preferred WW. The system consists of a phoneme predictor (PP) and a phoneme sequence detector (PSD). The PP takes acoustic features of the input speech and outputs phoneme probability distributions; its acoustic models are trained using the Connectionist Temporal Classification (CTC) loss criterion. The PSD takes the output of the PP as input and estimates the probability that the WW has been spoken. In our evaluation experiments on six-phoneme WW detection, the proposed method achieved 90% WW detection accuracy.
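
As a rough illustration of the data flow from the PP to the PSD, the following Python sketch takes per-frame phoneme probability distributions of the kind a CTC-trained model produces and checks whether a chosen phoneme sequence occurs in them. It is not the authors' method: the abstract describes the PSD as outputting a probability, whereas this sketch substitutes greedy CTC decoding followed by an exact sequence match. The function names, the blank index of 0, and the integer phoneme IDs are all assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def greedy_ctc_decode(posteriors: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Pick the most likely symbol per frame, merge repeats, drop blanks."""
    decoded: List[int] = []
    prev = None
    for frame in posteriors:
        # Index of the highest-probability symbol in this frame.
        best = max(range(len(frame)), key=lambda i: frame[i])
        # Emit only when the symbol changes and is not the blank.
        if best != prev and best != blank:
            decoded.append(best)
        prev = best
    return decoded


def contains_wake_word(decoded: Sequence[int],
                       wake_word: Sequence[int]) -> bool:
    """Return True if wake_word occurs as a contiguous run in decoded."""
    n = len(wake_word)
    if n == 0:
        return False
    target = list(wake_word)
    return any(list(decoded[i:i + n]) == target
               for i in range(len(decoded) - n + 1))
```

Because the wake word is given only as a phoneme ID sequence, changing it means passing a different list to `contains_wake_word`, with no retraining of the predictor. A hard match like this is brittle to a single misrecognized phoneme, which is one reason a detector that outputs a probability can be preferable.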

DOI:10.1561/116.20240014