This paper describes an end-to-end automatic singing transcription (AST) method that translates a music audio signal containing a vocal part into a symbolic musical score of sung notes. A common approach to sequence-to-sequence learning for this problem is to use connectionist temporal classification (CTC), where a target score is represented as a sequence of notes with discrete pitches and note values. However, if the note value of a note is estimated incorrectly, the score times of all subsequent notes are shifted and the metrical structure of the estimated score collapses. To solve this problem, we propose a refined score representation based on the metrical positions of note onsets. To decode a musical score from the output of a deep neural network (DNN), we use a hidden semi-Markov model (HSMM) that incorporates prior knowledge about musical scores and the temporal fluctuation of human performance. We show that the proposed method achieves state-of-the-art performance and confirm the efficacy of the refined score representation and the decoding method.
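The error-propagation argument above can be made concrete with a minimal sketch (not the authors' code): when onsets are implicit in a sequence of note values, a single misestimated value shifts every later onset, whereas encoding each onset as a metrical position within the bar keeps such errors local. The token layout, the example melody, and the 4/4 bar assumption are all illustrative.

```python
from fractions import Fraction

# Example melody: (pitch, onset in beats, note value in beats), one 4/4 bar.
notes = [("C4", Fraction(0), Fraction(1)),
         ("D4", Fraction(1), Fraction(1, 2)),
         ("E4", Fraction(3, 2), Fraction(1, 2)),
         ("G4", Fraction(2), Fraction(2))]

def note_value_tokens(notes):
    """Conventional representation: (pitch, note value) per note.
    Onsets are implicit (cumulative sum of note values), so one
    wrong note value shifts every subsequent onset."""
    return [(p, v) for p, _, v in notes]

def metrical_position_tokens(notes, beats_per_bar=4):
    """Refined representation: (pitch, onset position within the bar).
    Onsets are explicit, so an error in one token does not propagate
    to the following notes."""
    return [(p, o % beats_per_bar) for p, o, _ in notes]

def onsets_from_values(tokens):
    """Recover onset times implied by a note-value token sequence."""
    t, out = Fraction(0), []
    for _, v in tokens:
        out.append(t)
        t += v
    return out

if __name__ == "__main__":
    nv = note_value_tokens(notes)
    mp = metrical_position_tokens(notes)

    # Simulate a single recognition error on the second note:
    # its note value is misread as 1 beat instead of 1/2.
    nv_err = nv.copy()
    nv_err[1] = ("D4", Fraction(1))

    print(onsets_from_values(nv))      # [0, 1, 3/2, 2]  correct onsets
    print(onsets_from_values(nv_err))  # [0, 1, 2, 5/2]  notes 3-4 shifted
    print(mp)                          # explicit onsets survive the error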