APSIPA Transactions on Signal and Information Processing > Vol 4 > Issue 1

A sampling-based speaker clustering using utterance-oriented Dirichlet process mixture model and its evaluation on large-scale data

Naohiro Tawara, Waseda University, Japan, tawara@pcl.cs.waseda.ac.jp; Tetsuji Ogawa, Waseda University, Japan; Shinji Watanabe, Mitsubishi Electric Research Laboratories, USA; Atsushi Nakamura, Nagoya City University, Japan; Tetsunori Kobayashi, Waseda University, Japan
 
Suggested Citation
Naohiro Tawara, Tetsuji Ogawa, Shinji Watanabe, Atsushi Nakamura and Tetsunori Kobayashi (2015), "A sampling-based speaker clustering using utterance-oriented Dirichlet process mixture model and its evaluation on large-scale data", APSIPA Transactions on Signal and Information Processing: Vol. 4: No. 1, e16. http://dx.doi.org/10.1017/ATSIP.2015.19

Publication Date: 28 Oct 2015
© 2015 Naohiro Tawara, Tetsuji Ogawa, Shinji Watanabe, Atsushi Nakamura and Tetsunori Kobayashi
 
Keywords
Sampling approach, Non-parametric Bayesian model, Gibbs sampling, Utterance-oriented Dirichlet process mixture model, Speaker clustering
 

Open Access

This is published under the terms of the Creative Commons Attribution licence.

In this article:
I. INTRODUCTION 
II. UTTERANCE-ORIENTED MIXTURE MODEL FOR FINITE SPEAKERS 
III. UTTERANCE-ORIENTED MIXTURE MODEL FOR INFINITE SPEAKERS 
IV. SPEAKER CLUSTERING EXPERIMENTS 
V. DISCUSSION 
VI. CONCLUSION AND FUTURE WORK 

Abstract

An infinite mixture model is applied to model-based speaker clustering, with sampling-based optimization, so that the number of speakers can be estimated from the data. To this end, a non-parametric Bayesian framework is implemented with Markov chain Monte Carlo (MCMC) sampling and incorporated into the utterance-oriented speaker model. The proposed model is called the utterance-oriented Dirichlet process mixture model (UO-DPMM). This paper demonstrates that UO-DPMM can be applied successfully to large-scale data and that it outperforms conventional hierarchical agglomerative clustering, especially when the number of utterances is large.

DOI:10.1017/ATSIP.2015.19