Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training
Ekaterina Gonina
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2011-128
December 12, 2011
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.pdf
Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this thesis we present a Python-based GMM training specialization framework that abstracts low-level GPU code and instead automatically selects the best parallel implementation of the training algorithm based on the diarization problem size and the processor features and maps the computation onto the parallel platform. Our specialization framework can automatically map the GMM training algorithm onto any CUDA-programmable NVIDIA GPU as well as multi-core CPUs. We then present a full speaker diarization system captured in about 50 lines of Python that uses our specialization framework and achieves 37-166× faster than real-time performance without significant loss in accuracy. We also investigate a trade-off between diarization accuracy and performance and show that for a 3% gain in Diarization Error Rate (DER), the application performance decreases by 5×. Using our framework allows the scientist to focus on developing the application algorithm while achieving significant additional performance improvements by automatically utilizing parallel hardware.
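To make the "faster than real-time" figures concrete, the following is a minimal sketch that converts a speedup factor into wall-clock processing time. The function name `processing_seconds` and the 100-hour corpus size are assumptions chosen for illustration; only the 37× and 166× factors come from the abstract, and the sketch assumes a speedup of N× means one hour of audio is processed in 1/N hours.

```python
def processing_seconds(audio_hours, speedup):
    """Wall-clock seconds needed to process `audio_hours` of audio
    when the system runs at `speedup` times real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_hours * 3600.0 / speedup


# 1x is the roughly real-time baseline; 37x and 166x are the reported range.
# The 100-hour corpus is a hypothetical size, not a figure from the report.
for speedup in (1, 37, 166):
    secs = processing_seconds(100, speedup)
    print(f"{speedup:>4}x real time: {secs / 3600:.2f} hours")
```

Under these assumptions, a 100-hour corpus that takes about 100 hours at real-time speed would take roughly 2.7 hours at 37× and roughly 0.6 hours at 166×.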
Advisors: Kurt Keutzer
BibTeX citation:
@mastersthesis{Gonina:EECS-2011-128,
    Author = {Gonina, Ekaterina},
    Title = {Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training},
    School = {EECS Department, University of California, Berkeley},
    Year = {2011},
    Month = {Dec},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.html},
    Number = {UCB/EECS-2011-128},
    Abstract = {Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this thesis we present a Python-based GMM training specialization framework that abstracts low-level GPU code and instead automatically selects the best parallel implementation of the training algorithm based on the diarization problem size and the processor features and maps the computation onto the parallel platform. Our specialization framework can automatically map the GMM training algorithm onto any CUDA-programmable NVIDIA GPU as well as multi-core CPUs. We then present a full speaker diarization system captured in about 50 lines of Python that uses our specialization framework and achieves 37-166× faster than real-time performance without significant loss in accuracy. We also investigate a trade-off between diarization accuracy and performance and show that for a 3% gain in Diarization Error Rate (DER), the application performance decreases by 5×. Using our framework allows the scientist to focus on developing the application algorithm while achieving significant additional performance improvements by automatically utilizing parallel hardware.}
}
EndNote citation:
%0 Thesis
%A Gonina, Ekaterina
%T Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training
%I EECS Department, University of California, Berkeley
%D 2011
%8 December 12
%@ UCB/EECS-2011-128
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.html
%F Gonina:EECS-2011-128