APSIPA Transactions on Signal and Information Processing > Vol 11 > Issue 1

End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning

Ryo Imaizumi, Tokyo Metropolitan University, Japan, sayaka@tmu.ac.jp; Ryo Masumura, NTT Media Intelligence Laboratories, NTT Corporation, Japan; Sayaka Shiota, Tokyo Metropolitan University, Japan; Hitoshi Kiya, Tokyo Metropolitan University, Japan
 
Suggested Citation
Ryo Imaizumi, Ryo Masumura, Sayaka Shiota and Hitoshi Kiya (2022), "End-to-end Japanese Multi-dialect Speech Recognition and Dialect Identification with Multi-task Learning", APSIPA Transactions on Signal and Information Processing: Vol. 11: No. 1, e4. http://dx.doi.org/10.1561/116.00000045

Publication Date: 29 Mar 2022
© 2022 R. Imaizumi, R. Masumura, S. Shiota and H. Kiya
 
Keywords
Japanese multi-dialect automatic speech recognition, Japanese dialect identification, multi-task learning, transformer-based encoder-decoder, end-to-end model
 

Open Access

This is published under the terms of CC BY-NC.

In this article:
Introduction 
Challenges with Multi-dialect Japanese 
Transformer-based Network Architecture 
Multi-task Learning of Japanese DID and MD-ASR 
Experiments 
Conclusion 
References 

Abstract

End-to-end systems have demonstrated state-of-the-art performance on many tasks related to automatic speech recognition (ASR) and dialect identification (DID). In this paper, we propose multi-task learning of Japanese DID and multi-dialect ASR (MD-ASR) systems with end-to-end models. Because Japanese dialects vary in both linguistic and acoustic aspects, Japanese DID requires considering linguistic and acoustic features simultaneously. One way to realize Japanese DID with these features is to use ASR transcriptions when performing DID. However, transcribing Japanese multi-dialect speech into text is regarded as a challenging ASR task because there are large gaps in linguistic and acoustic features between each dialect and standard Japanese. One solution is dialect-aware ASR modeling, in which DID is performed together with ASR. Therefore, we propose a multi-task learning framework for Japanese DID and ASR that represents the dependency between the two tasks. We explore three systems within the proposed framework, which differ in the order in which DID and ASR are performed. In the experiments, Japanese multi-dialect ASR and DID tests were conducted on our home-made Japanese multi-dialect database and a standard Japanese database. The proposed transformer-based systems outperformed the conventional single-task systems on both the DID and ASR tests.
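
As a rough illustration of what multi-task learning of DID and ASR can mean in practice, the sketch below combines a token-level ASR loss and an utterance-level DID loss into one training objective. It is only a minimal sketch under stated assumptions: the abstract does not give the loss formulation, so the weighted sum, the function names (`cross_entropy`, `multitask_loss`), and the `did_weight` parameter are all hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def multitask_loss(asr_token_probs, asr_targets, did_probs, did_target,
                   did_weight=0.5):
    """Weighted sum of an ASR loss and a DID loss (hypothetical formulation).

    asr_token_probs: one probability vector per output token.
    asr_targets:     index of the reference token at each position.
    did_probs:       probability vector over dialect classes.
    did_target:      index of the reference dialect.
    did_weight:      assumed interpolation weight in [0, 1].
    """
    # ASR loss: cross-entropy averaged over the output tokens.
    asr_loss = sum(
        cross_entropy(p, t) for p, t in zip(asr_token_probs, asr_targets)
    ) / len(asr_targets)
    # DID loss: a single cross-entropy term for the whole utterance.
    did_loss = cross_entropy(did_probs, did_target)
    # Interpolate the two task losses into one training objective.
    return (1.0 - did_weight) * asr_loss + did_weight * did_loss


# Toy usage: two output tokens over a two-symbol vocabulary, two dialects.
loss = multitask_loss(
    asr_token_probs=[[0.5, 0.5], [0.5, 0.5]],
    asr_targets=[0, 1],
    did_probs=[0.5, 0.5],
    did_target=1,
)
```

Setting `did_weight` to 0 in this sketch reduces the objective to ASR-only training, which corresponds to the single-task setting that the abstract uses as a point of comparison.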

DOI:10.1561/116.00000045