
A multi-branch ResNet with discriminative features for detection of replay speech signals

Xingliang Cheng, Tsinghua University, China; Mingxing Xu, Tsinghua University, China; Thomas Fang Zheng, Tsinghua University, China, fzheng@tsinghua.edu.cn
 
Suggested Citation
Xingliang Cheng, Mingxing Xu and Thomas Fang Zheng (2020), "A multi-branch ResNet with discriminative features for detection of replay speech signals", APSIPA Transactions on Signal and Information Processing: Vol. 9: No. 1, e28. http://dx.doi.org/10.1017/ATSIP.2020.26

Publication Date: 29 Dec 2020
© 2020 Xingliang Cheng, Mingxing Xu and Thomas Fang Zheng
 
Keywords
Anti-spoofing, Presentation attack detection, Replay attack, Speaker verification
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. RELATED WORK 
III. CQT-BASED MODIFIED GROUP DELAY FEATURE 
IV. MULTI-BRANCH RESIDUAL NEURAL NETWORK 
V. EXPERIMENTAL SETUP 
VI. EXPERIMENTAL RESULTS 
VII. DISCUSSION 
VIII. CONCLUSION 

Abstract

The security of automatic speaker verification (ASV) systems is attracting increasing attention. Replay attacks, one of the most common spoofing methods, are easy to mount but difficult to detect. Many researchers therefore design features that capture the distortion introduced by replay attempts. Constant-Q cepstral coefficients (CQCC), which are based on the magnitude of the constant-Q transform (CQT), are among the most prominent features in replay detection. However, CQCC ignore phase information, which may also be distorted during replay. In this work, we propose a CQT-based modified group delay feature (CQTMGD) that captures the phase information of the CQT. Furthermore, we propose a multi-branch residual convolutional network, ResNeWt, to distinguish replay attacks from bona fide attempts. We evaluated our proposal on the ASVspoof 2019 physical access dataset. Results show that CQTMGD outperformed the traditional MGD feature, and that fusion with other magnitude-based and phase-based features yielded a further improvement. Our best fusion system achieved a min-tDCF of 0.0096 and an EER of 0.39% on the evaluation set, outperforming all other state-of-the-art methods in the ASVspoof 2019 physical access challenge.
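For readers unfamiliar with the traditional MGD feature that the abstract uses as a baseline, the following is a minimal sketch of the textbook FFT-based modified group delay of a single frame. It is not the CQT-based variant proposed in the article, whose details are not given in the abstract. The function name `modified_group_delay`, the cepstral-smoothing length `lifter`, and the default values of `alpha` and `gamma` are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def modified_group_delay(frame, alpha=0.4, gamma=0.9, lifter=8, n_fft=None):
    """Textbook FFT-based MGD of one frame (NOT the article's CQTMGD).

    alpha, gamma and lifter are illustrative defaults (assumptions).
    Requires 2 * lifter - 1 <= n_fft.
    """
    x = np.asarray(frame, dtype=float)
    n_fft = n_fft or len(x)
    n = np.arange(len(x))

    X = np.fft.rfft(x, n_fft)       # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)   # spectrum of n * x[n]

    # Cepstrally smoothed magnitude spectrum S: keep only the
    # low-quefrency part of the real cepstrum (symmetric window).
    log_mag = np.log(np.abs(X) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    window = np.zeros(n_fft)
    window[:lifter] = 1.0
    if lifter > 1:
        window[-(lifter - 1):] = 1.0
    S = np.exp(np.fft.rfft(cepstrum * window).real)

    # Group delay with the smoothed spectrum in the denominator,
    # followed by sign-preserving power compression.
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma))
    return np.sign(tau) * np.abs(tau) ** alpha
```

As a sanity check, a unit impulse delayed by five samples has a flat magnitude spectrum and a constant group delay of five samples, so with `alpha=1.0` and `gamma=1.0` the function returns approximately 5 in every frequency bin.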

DOI:10.1017/ATSIP.2020.26