APSIPA Transactions on Signal and Information Processing, Vol. 10, Issue 1

Audio-to-score singing transcription based on a CRNN-HSMM hybrid model

Ryo Nishikimi, Kyoto University, Japan, nishikimi@sap.ist.i.kyoto-u.ac.jp; Eita Nakamura, Kyoto University, Japan; Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Japan; Kazuyoshi Yoshii, Kyoto University, Japan and Japan Science and Technology Agency, Japan
 
Suggested Citation
Ryo Nishikimi, Eita Nakamura, Masataka Goto and Kazuyoshi Yoshii (2021), "Audio-to-score singing transcription based on a CRNN-HSMM hybrid model", APSIPA Transactions on Signal and Information Processing: Vol. 10: No. 1, e7. http://dx.doi.org/10.1017/ATSIP.2021.4

Publication Date: 20 Apr 2021
© 2021 Ryo Nishikimi, Eita Nakamura, Masataka Goto and Kazuyoshi Yoshii
 
Keywords
Automatic singing transcription; Convolutional recurrent neural network; Hidden semi-Markov model
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. BACKGROUNDS 
III. PROPOSED METHOD 
IV. EVALUATION 
V. CONCLUSION

Abstract

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because a singing voice varies considerably in pitch and timing, a naive cascading approach that estimates an F0 contour and then quantizes it using estimated tatum times inevitably produces many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of two parts: a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys, and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model (HSMM: hidden semi-Markov model) enables us to estimate the most likely musical notes from a music signal with the Viterbi algorithm, while leveraging both grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and that integrating the musical language model with the acoustic model had a positive effect on AST performance.
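
To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding for a generic explicit-duration (semi-Markov) model. It is not the authors' implementation: the function name `hsmm_viterbi`, the single duration distribution shared by all states, and the toy probabilities are assumptions made for illustration, and the per-frame log-likelihoods `log_emit` are a hand-made stand-in for what an acoustic model such as a CRNN might output. Conditioning on musical keys and tatum-level onset positions, as described in the abstract, is omitted.

```python
import numpy as np

def hsmm_viterbi(log_emit, log_init, log_trans, log_dur):
    """Most likely segmentation under an explicit-duration semi-Markov model.

    log_emit : (T, K) per-frame log-likelihood of each state (e.g. pitch)
    log_init : (K,)   log-probability of the first segment's state
    log_trans: (K, K) log-probability of moving from state j to state k
    log_dur  : (D,)   log-probability of a segment lasting 1..D frames
    Returns a list of (start, end, state) segments and the best log score.
    """
    log_emit, log_init = np.asarray(log_emit), np.asarray(log_init)
    log_trans, log_dur = np.asarray(log_trans), np.asarray(log_dur)
    T, K = log_emit.shape
    D = len(log_dur)
    # cum[t] holds the summed emissions of frames 0..t-1 for every state.
    cum = np.vstack([np.zeros((1, K)), np.cumsum(log_emit, axis=0)])
    # delta[t, k]: best score of frames 0..t-1 whose last segment has state k.
    delta = np.full((T + 1, K), -np.inf)
    back_dur = np.zeros((T + 1, K), dtype=int)
    back_prev = np.full((T + 1, K), -1, dtype=int)
    for t in range(1, T + 1):
        for d in range(1, min(D, t) + 1):
            s = t - d  # segment covers frames s..t-1
            seg = cum[t] - cum[s] + log_dur[d - 1]
            if s == 0:
                enter, prev = log_init, np.full(K, -1)
            else:
                cand = delta[s][:, None] + log_trans  # (previous, current)
                prev = np.argmax(cand, axis=0)
                enter = cand[prev, np.arange(K)]
            score = enter + seg
            better = score > delta[t]
            delta[t][better] = score[better]
            back_dur[t][better] = d
            back_prev[t][better] = prev[better]
    # Trace the best path back from the final frame.
    segments, t, k = [], T, int(np.argmax(delta[T]))
    while t > 0:
        d = int(back_dur[t, k])
        segments.append((t - d, t, k))
        t, k = t - d, int(back_prev[t, k])
    return segments[::-1], float(np.max(delta[T]))

# Toy input (assumed values): two states, six frames. The first three
# frames favour state 0 and the last three favour state 1.
hi, lo = np.log(0.9), np.log(0.1)
log_emit = np.array([[hi, lo]] * 3 + [[lo, hi]] * 3)
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.5, 0.5], [0.5, 0.5]])
log_dur = np.log([0.25] * 4)  # durations of 1 to 4 frames, equally likely

notes, score = hsmm_viterbi(log_emit, log_init, log_trans, log_dur)
print(notes)  # [(0, 3, 0), (3, 6, 1)]
```

In this toy case the decoder returns two three-frame segments, because splitting the signal into more segments pays an extra duration and transition penalty without improving the emission score. That trade-off is the sense in which a language model over notes can regularize frame-level acoustic evidence.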

DOI:10.1017/ATSIP.2021.4