Combining augmented statistical noise suppression and framewise speech/non-speech classification for robust voice activity detection

Yasunari Obuchi, Tokyo University of Technology, Japan, obuchiysnr@stf.teu.ac.jp

Suggested Citation

Yasunari Obuchi (2017), "Combining augmented statistical noise suppression and framewise speech/non-speech classification for robust voice activity detection", APSIPA Transactions on Signal and Information Processing: Vol. 6: No. 1, e7. http://dx.doi.org/10.1017/ATSIP.2017.8

Publication Date: 14 Jul 2017

Keywords

Speech, Voice activity detection, Noise suppression, Convolutional neural network, CENSREC-1-C

Open Access

This is published under the terms of the Creative Commons Attribution licence.

Abstract

This paper proposes a new voice activity detection (VAD) algorithm based on statistical noise suppression and framewise speech/non-speech classification. Many noise-robust VAD algorithms have been developed, and the most successful ones are related to statistical noise suppression in some way. Accordingly, we formulate our VAD algorithm as a combination of noise suppression and subsequent framewise classification. The noise suppression part is improved by introducing the idea that any unreliable frequency component should be removed and the decision made from the remaining signal. This augmentation can be realized using a few additional parameters embedded in the gain-estimation process. The framewise classification part can be either model-less or model-based. A model-less classifier has the advantage that it can be applied to any situation, even if no training data are available. In contrast, a model-based classifier (e.g., a neural network-based classifier) requires training data but tends to be more accurate. The proposed algorithm is evaluated using the CENSREC-1-C public framework, and its accuracy is confirmed to be superior to that of many existing algorithms.
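
The two-stage structure described in the abstract (noise suppression that discards unreliable frequency components, followed by a model-less framewise decision) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `framewise_vad`, all parameter names and default values, the noise estimate taken from leading frames, and the spectral-subtraction-style gain are assumptions chosen for illustration, standing in for the statistical gain estimator the abstract refers to.

```python
from typing import List, Sequence


def framewise_vad(
    power_frames: Sequence[Sequence[float]],
    noise_frames: int = 5,            # assumption: leading frames are noise only
    reliability_snr: float = 2.0,     # assumption: bins below this SNR are dropped
    decision_threshold: float = 1.0,  # assumption: in units of mean noise power
) -> List[bool]:
    """Label each frame as speech (True) or non-speech (False).

    power_frames holds one power spectrum per frame (frames x frequency bins).
    """
    n_bins = len(power_frames[0])
    lead = power_frames[:noise_frames]
    # Noise power per bin, averaged over the leading frames.
    noise = [sum(frame[k] for frame in lead) / len(lead) for k in range(n_bins)]
    noise_level = max(sum(noise) / n_bins, 1e-12)

    decisions = []
    for frame in power_frames:
        retained = 0.0
        for k in range(n_bins):
            snr = frame[k] / max(noise[k], 1e-12)
            if snr < reliability_snr:
                continue  # unreliable component: removed entirely (gain = 0)
            gain = 1.0 - 1.0 / snr  # spectral-subtraction-style power gain
            retained += gain * frame[k]
        # Model-less decision: compare the remaining power with the noise level.
        score = retained / (n_bins * noise_level)
        decisions.append(score > decision_threshold)
    return decisions


# Five noise-only frames, one frame with strong energy in two bins, one noise frame.
frames = [[1.0] * 4 for _ in range(5)] + [[1.0, 50.0, 40.0, 1.0], [1.0] * 4]
labels = framewise_vad(frames)
```

The hard SNR cut-off here only mimics the idea of removing unreliable components; a model-based variant would replace the final threshold with a trained classifier operating on the suppressed spectrum.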

DOI:10.1017/ATSIP.2017.8

I. INTRODUCTION
II. AUGMENTED STATISTICAL NOISE SUPPRESSION
III. CLASSIFICATION WITHOUT MODEL TRAINING
IV. CLASSIFICATION WITH UNSUPERVISED AND SUPERVISED MODEL TRAINING
V. EXPERIMENTAL RESULTS
VI. CONCLUSIONS
