Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training

Ekaterina Gonina

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2011-128

December 12, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.pdf

Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, which limits current approaches to roughly real-time performance. Meanwhile, current datasets have grown to hundreds of hours of audio, making more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), GMM training can be re-implemented to run faster than real time by exploiting the parallelism in the training computation. However, developing and maintaining complex low-level GPU code is difficult and requires a deep understanding of the parallel processor's hardware architecture. Furthermore, such low-level implementations are neither readily reusable in other applications nor portable to other platforms, limiting programmer productivity. In this thesis we present a Python-based GMM training specialization framework that abstracts away low-level GPU code: it automatically selects the best parallel implementation of the training algorithm, based on the diarization problem size and the processor's features, and maps the computation onto the parallel platform. Our specialization framework can automatically map the GMM training algorithm onto any CUDA-programmable NVIDIA GPU as well as onto multicore CPUs. We then present a full speaker diarization system, captured in about 50 lines of Python, that uses our specialization framework and runs 37-166× faster than real time without significant loss in accuracy. We also investigate the trade-off between diarization accuracy and performance, showing that a 3% improvement in Diarization Error Rate (DER) costs a 5× decrease in application performance. Using our framework allows the scientist to focus on developing the application algorithm while achieving significant additional performance improvements by automatically utilizing parallel hardware.
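
For illustration, here is a minimal sketch of the agglomerative GMM clustering loop the abstract describes. It uses scikit-learn's GaussianMixture as a stand-in for the specialized GPU-accelerated trainer; the function names, the five-component default, and the modified-BIC merge criterion shown here are illustrative assumptions, not the thesis code.

# A minimal sketch (illustrative, not the thesis code) of agglomerative
# GMM speaker clustering, with scikit-learn's GaussianMixture standing
# in for the specialized GPU-accelerated trainer.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=5):
    # EM training of a diagonal-covariance GMM on one cluster's
    # feature frames (rows = frames, columns = e.g. MFCCs).
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", reg_covar=1e-3)
    gmm.fit(frames)
    return gmm

def merge_score(frames_a, frames_b, gmm_a, gmm_b):
    # Modified-BIC merge score: the merged GMM gets as many components
    # as the two candidates combined, so the parameter-count penalty
    # cancels and the score reduces to a log-likelihood difference.
    merged = np.vstack([frames_a, frames_b])
    gmm_ab = train_gmm(merged, gmm_a.n_components + gmm_b.n_components)
    ll_merged = gmm_ab.score(merged) * len(merged)  # total log-likelihood
    ll_split = (gmm_a.score(frames_a) * len(frames_a)
                + gmm_b.score(frames_b) * len(frames_b))
    return ll_merged - ll_split

def diarize(clusters):
    # Greedy agglomeration: repeatedly merge the best-scoring pair of
    # clusters until no merge has a positive score.
    gmms = [train_gmm(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = merge_score(clusters[i], clusters[j], gmms[i], gmms[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:  # no merge improves the criterion: stop
            break
        i, j = pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        gmms[i] = train_gmm(clusters[i],
                            gmms[i].n_components + gmms[j].n_components)
        del clusters[j], gmms[j]
    return clusters  # one group of frames per hypothesized speaker

Starting from clusters initialized by, say, a uniform segmentation of the feature frames, diarize() merges greedily and returns one cluster per hypothesized speaker; in the thesis, the dominant cost of this loop, the repeated GMM training, is what the specialization framework offloads to the GPU.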

Advisor: Kurt Keutzer


BibTeX citation:

@mastersthesis{Gonina:EECS-2011-128,
    Author= {Gonina, Ekaterina},
    Title= {Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training},
    School= {EECS Department, University of California, Berkeley},
    Year= {2011},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.html},
    Number= {UCB/EECS-2011-128},
    Abstract= {Most current speaker diarization systems use agglomerative clustering of Gaussian Mixture Models (GMMs) to determine “who spoke when” in an audio recording. While state-of-the-art in accuracy, this method is computationally costly, mostly due to the GMM training, and thus limits the performance of current approaches to be roughly real-time. Increased sizes of current datasets require processing of hundreds of hours of data and thus make more efficient processing methods highly desirable. With the emergence of highly parallel multicore and manycore processors, such as graphics processing units (GPUs), one can re-implement GMM training to achieve faster than real-time performance by taking advantage of parallelism in the training computation. However, developing and maintaining the complex low-level GPU code is difficult and requires a deep understanding of the hardware architecture of the parallel processor. Furthermore, such low-level implementations are not readily reusable in other applications and not portable to other platforms, limiting programmer productivity. In this thesis we present a Python-based GMM training specialization framework that abstracts low-level GPU code and instead automatically selects the best parallel implementation of the training algorithm based on the diarization problem size and the processor features and maps the computation onto the parallel platform. Our specialization framework can automatically map the GMM training algorithm onto any CUDA-programmable NVIDIA GPU as well as multi-core CPUs. We then present a full speaker diarization system captured in about 50 lines of Python that uses our specialization framework and achieves 37-166× faster than real-time performance without significant loss in accuracy. We also investigate a trade-off between diarization accuracy and performance and show that for a 3% gain in Diarization Error Rate (DER), the application performance decreases by 5×. Using our framework allows the scientist to focus on developing the application algorithm while achieving significant additional performance improvements by automatically utilizing parallel hardware.},
}

EndNote citation:

%0 Thesis
%A Gonina, Ekaterina 
%T Fast Speaker Diarization Using a Specialization Framework for Gaussian Mixture Model Training
%I EECS Department, University of California, Berkeley
%D 2011
%8 December 12
%@ UCB/EECS-2011-128
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-128.html
%F Gonina:EECS-2011-128