
Unsupervised Pitch-Timbre-Variation Disentanglement of Monophonic Music Signals Based on Random Perturbation and Re-entry Training

Keitaro Tanaka, Waseda University, Japan, phys.keitaro1227@ruri.waseda.jp; Kazuyoshi Yoshii, Kyoto University, Japan; Simon Dixon, Queen Mary University of London, UK; Shigeo Morishima, Waseda University, Japan
 
Suggested Citation
Keitaro Tanaka, Kazuyoshi Yoshii, Simon Dixon and Shigeo Morishima (2025), "Unsupervised Pitch-Timbre-Variation Disentanglement of Monophonic Music Signals Based on Random Perturbation and Re-entry Training", APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 1, e4. http://dx.doi.org/10.1561/116.20240072

Publication Date: 19 Feb 2025
© 2025 K. Tanaka, K. Yoshii, S. Dixon, and S. Morishima
 
Subjects
Deep learning,  Variational inference,  Audio signal processing
 
Keywords
Disentangled representation, pitch and timbre modeling, variational autoencoder
 


Open Access

This article is published under the terms of CC BY-NC.


In this article:
Introduction 
Related Work 
Three-Factor Disentanglement 
Two-Factor Disentanglement 
Re-entry Training 
Evaluation 
Discussion 
Conclusion 
Acknowledgements 
References 

Abstract

This paper presents an unsupervised method for disentangling monophonic music signals into three factors: global timbral, local pitch, and local variation features. While existing methods achieve this for short isolated notes using random perturbation, they fail for sounds with pitch transitions or singing voices, causing the three characteristics to leak into mismatched latent features. To address this, we introduce a new framework called re-entry training, which applies the three-factor disentanglement network twice in series with shared weights. Re-entry training refines the characteristics extracted by the encoders and increases data variety, effectively performing implicit data augmentation. This serial model can be reinterpreted as a unified large variational autoencoder, offering an alternative probabilistic formulation for unsupervised training. Our experiments demonstrate that re-entry training yields a more focused extraction of sound characteristics, thereby enhancing the three-factor disentanglement for various monophonic music signals.
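The core idea of re-entry training, applying the same disentanglement network twice in series so that the second pass re-encodes the first pass's reconstruction with shared weights, can be illustrated with a minimal sketch. The toy linear "encoders" and "decoder" below are hypothetical stand-ins for the paper's actual networks (the real method uses variational encoders and random perturbation, omitted here); only the serial weight-sharing structure is depicted.

```python
import numpy as np

rng = np.random.default_rng(0)

D, Z = 8, 3  # toy signal and latent dimensionalities (illustrative only)

# One set of weights; the SAME weights serve both passes.
W_timbre = rng.standard_normal((Z, D))
W_pitch = rng.standard_normal((Z, D))
W_var = rng.standard_normal((Z, D))
W_dec = rng.standard_normal((D, 3 * Z))

def encode(x):
    # Split the signal into (global timbre, local pitch, local variation).
    return W_timbre @ x, W_pitch @ x, W_var @ x

def decode(timbre, pitch, var):
    # Reconstruct a signal from the three latent factors.
    return W_dec @ np.concatenate([timbre, pitch, var])

def re_entry(x):
    # First pass: disentangle the input signal.
    z1 = encode(x)
    x_hat = decode(*z1)
    # Second pass ("re-entry"): feed the reconstruction back through
    # the same encoders, refining the extracted characteristics.
    z2 = encode(x_hat)
    return decode(*z2), z1, z2

x = rng.standard_normal(D)
y, z1, z2 = re_entry(x)
```

In training, losses would tie the two passes' latents and reconstructions together; here the sketch only shows that both passes run through one shared parameter set, which is what lets the serial model be reinterpreted as a single larger variational autoencoder.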

DOI: 10.1561/116.20240072