This paper presents an unsupervised method for disentangling monophonic music signals into three factors: global timbre, local pitch, and local variation features. While existing methods achieve this for short isolated notes by using random perturbation, they fail on sounds with pitch transitions and on singing voices, where the three characteristics leak into mismatched latent features. To address this, we introduce a new framework called re-entry training, which applies the three-factor disentanglement network twice in series with shared weights. Re-entry training refines the characteristics extracted by the encoders and increases data variety, effectively performing implicit data augmentation. The serial model can also be reinterpreted as a single large variational autoencoder, offering an alternative probabilistic formulation for unsupervised training. Our experiments demonstrate that re-entry training yields a more focused extraction of sound characteristics, thereby improving three-factor disentanglement across diverse monophonic music signals.
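The re-entry idea, applying the same disentanglement network twice in series with shared weights, can be illustrated with a minimal sketch. This is not the paper's architecture: the encoders here are untrained random linear maps, and all names and dimensions are hypothetical; it only shows the data flow of a second pass through identical weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not taken from the paper).
T, D = 8, 4               # time frames, feature dim per frame
Z_T, Z_P, Z_V = 3, 2, 2   # timbre (global), pitch (local), variation (local)

# One shared set of (random, untrained) linear encoder/decoder weights.
W_timbre = rng.standard_normal((D, Z_T))
W_pitch = rng.standard_normal((D, Z_P))
W_var = rng.standard_normal((D, Z_V))
W_dec = rng.standard_normal((Z_T + Z_P + Z_V, D))

def encode(x):
    """Split x (T x D) into one global timbre code and per-frame pitch/variation codes."""
    z_timbre = x.mean(axis=0) @ W_timbre   # single vector for the whole signal
    z_pitch = x @ W_pitch                  # one vector per frame
    z_var = x @ W_var
    return z_timbre, z_pitch, z_var

def decode(z_timbre, z_pitch, z_var):
    """Tile the global code over time and map all three codes back to frames."""
    tiled = np.broadcast_to(z_timbre, (z_pitch.shape[0], z_timbre.shape[0]))
    return np.concatenate([tiled, z_pitch, z_var], axis=1) @ W_dec

x = rng.standard_normal((T, D))

# Re-entry: feed the first reconstruction back through the SAME networks.
x1 = decode(*encode(x))    # first pass
x2 = decode(*encode(x1))   # second pass, identical (shared) weights
```

Because both passes reuse the same parameters, gradients from the second pass also update the first-pass encoders, which is what lets re-entry training refine the extracted characteristics rather than adding capacity.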