now publishers - Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector

APSIPA Transactions on Signal and Information Processing > Vol 13 > Issue 1

Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector

Ryota Nishimura, Tokushima University, Japan, nishimura@is.tokushima-u.ac.jp; Takaaki Uno, Toyohashi University of Technology, Japan; Taiki Yamamoto, Tokushima University, Japan; Kengo Ohta, National Institute of Technology, Japan; Norihide Kitaoka, Tokushima University, Japan

Suggested Citation

Ryota Nishimura, Takaaki Uno, Taiki Yamamoto, Kengo Ohta and Norihide Kitaoka (2024), "Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector", APSIPA Transactions on Signal and Information Processing: Vol. 13: No. 1, e14. http://dx.doi.org/10.1561/116.20240014

Publication Date: 22 Aug 2024

Subjects

Speech and spoken language processing, Speech/audio/image/video compression, Adaptive signal processing, Statistical signal processing, Signal processing for communications

Keywords

Wake word, CTC, end-to-end modeling, phoneme sequence detector

Open Access

This is published under the terms of CC BY-NC.

In this article:

Abstract

Most wake word (WW) detection systems used in smartphones and smart speakers only detect specific, predefined WWs such as “Hey, Siri” or “OK, Google”. To build such a system, a large speech corpus consisting of many examples of the selected WWs must be collected to train the model. If we want the device to detect a different WW, collection of a new speech corpus and re-training of the model are required.

In this study, we propose a system that can detect any chosen WW without additional model training or a corpus of WW utterances, allowing users to select and use their preferred WW. Our system consists of a phoneme predictor (PP) and a phoneme sequence detector (PSD). The PP predicts phoneme sequences from acoustic features of the input speech and outputs phoneme probability distributions. The acoustic models in the PP are trained using the Connectionist Temporal Classification (CTC) loss criterion. The PSD takes the output of the PP as input and predicts the probability that the WW has been spoken. In our evaluation experiments, we performed detection of six-phoneme WWs. Our results showed that the proposed method achieved 90% WW detection accuracy.
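
To make the two-stage idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the PP output is available as a list of per-frame probability vectors with the CTC blank at index 0 (an assumption). It replaces the trained PSD, which outputs a probability, with a much simpler rule-based stand-in that only checks whether the target phoneme sequence appears in the greedy CTC decoding. The function names and the phoneme indexing are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol in each probability vector


def ctc_greedy_decode(posteriors: Sequence[Sequence[float]],
                      blank: int = BLANK) -> List[int]:
    """Greedy CTC decoding: take the argmax label of each frame,
    merge consecutive repeats, then drop blanks."""
    decoded: List[int] = []
    prev = None
    for frame in posteriors:
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


def contains_sequence(decoded: Sequence[int], target: Sequence[int]) -> bool:
    """Return True if `target` occurs as a contiguous run inside `decoded`."""
    n = len(target)
    if n == 0:
        return False
    return any(list(decoded[i:i + n]) == list(target)
               for i in range(len(decoded) - n + 1))


def detect_wake_word(posteriors: Sequence[Sequence[float]],
                     wake_word: Sequence[int]) -> bool:
    """Rule-based stand-in for the PSD (hypothetical): a hard yes/no decision
    instead of the probability that the trained PSD in the paper produces."""
    return contains_sequence(ctc_greedy_decode(posteriors), wake_word)
```

Because the wake word is passed in as a plain phoneme index sequence, changing it requires no retraining in this sketch, which mirrors the motivation stated above. A hard match like this is brittle to phoneme recognition errors, which is one reason a learned detector operating on the full probability distributions is attractive.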

DOI:10.1561/116.20240014

Introduction
Related Work
Wake Word Detection Method
Experiment
Discussion
Conclusion
References
