Feature Design for Robust Speech Recognition: Nurture and Nature

Shuo-Yiin Chang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2016-62

May 12, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-62.pdf

As has been extensively shown, acoustic features for speech recognition can be nurtured from training data using deep neural networks (DNNs) with multiple hidden layers. Although a large body of research has shown that these learned features are superior to standard front-ends, this superiority is usually demonstrated when the data used to learn the features is very similar to the data used to test recognition performance. However, realistic environments cover many unanticipated types of novel inputs, including noise, channel distortion, reverberation, accented speech, speaking rate variation, and overlapped speech. A quantitative analysis using bootstrap sampling shows that these trained features are easily specialized to the training data and are corrupted in mismatched scenarios. Gabor filtered spectrograms, on the other hand, are generated from spectro-temporal filters designed to model natural human auditory processing, which can be instrumental in improving generalization to unanticipated deviations from what was seen in training. In this thesis, I use Gabor filtering either as feature processing or as a convolutional kernel in neural networks: the former uses filter outputs as DNN inputs, while the latter uses filter coefficients and structures to initialize a convolutional neural network (CNN). Experiments show that the proposed features perform better than the other noise-robust features that I tried on several noisy corpora. In addition, I demonstrate that including Gabor filters with lower or higher temporal modulations can be used to correlate better with human perception of slow or rapid speech. Finally, I report on an analysis of human cortical signals that demonstrates the relative robustness of these signals to the mixed signal phenomenon, in contrast to a DNN-based ASR system. Based on a number of example tasks in the thesis, I conclude that designed features provide greater robustness than relying on a DNN or CNN alone.
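To make the idea of a Gabor filtered spectrogram concrete, the following is a minimal sketch, not the thesis implementation. It builds the real part of a 2D Gabor filter (a Gaussian envelope multiplied by a cosine carrier over time and frequency) and slides it over a toy spectrogram. The function names, the Gaussian envelope, the kernel sizes, and the mean removal are all assumptions made for illustration; the thesis may use a different envelope, complex filters, and a full filter bank.

```python
import math


def gabor_kernel(omega_t, omega_f, half_t=8, half_f=2):
    """Real 2D Gabor kernel indexed as kernel[channel][frame].

    omega_t, omega_f: temporal and spectral modulation frequencies in
    radians per frame and per channel (hypothetical parameterisation).
    """
    sigma_t, sigma_f = half_t / 2.0, half_f / 2.0
    kernel = []
    for f in range(-half_f, half_f + 1):
        row = []
        for t in range(-half_t, half_t + 1):
            envelope = math.exp(-0.5 * ((t / sigma_t) ** 2 + (f / sigma_f) ** 2))
            carrier = math.cos(omega_t * t + omega_f * f)
            row.append(envelope * carrier)
        kernel.append(row)
    # Assumption: remove the mean so a constant spectrogram gives zero output.
    mean = sum(map(sum, kernel)) / (len(kernel) * len(kernel[0]))
    return [[v - mean for v in row] for row in kernel]


def filter_spectrogram(spec, kernel):
    """'Valid' 2D correlation of spec[channel][frame] with the kernel."""
    kf, kt = len(kernel), len(kernel[0])
    nf, nt = len(spec), len(spec[0])
    out = []
    for f0 in range(nf - kf + 1):
        row = []
        for t0 in range(nt - kt + 1):
            acc = 0.0
            for i in range(kf):
                for j in range(kt):
                    acc += kernel[i][j] * spec[f0 + i][t0 + j]
            row.append(acc)
        out.append(row)
    return out


def peak_response(spec, kernel):
    """Largest absolute filter output over the whole spectrogram."""
    return max(abs(v) for row in filter_spectrogram(spec, kernel) for v in row)


# Toy "spectrogram": 8 channels, 40 frames, one temporal modulation rate.
rate = math.pi / 4
toy_spec = [[math.cos(rate * t) for t in range(40)] for _ in range(8)]

matched = peak_response(toy_spec, gabor_kernel(rate, 0.0))
mismatched = peak_response(toy_spec, gabor_kernel(math.pi / 2, 0.0))
print(f"matched: {matched:.2f}  mismatched: {mismatched:.2f}")
```

In this toy setup the filter tuned to the input's temporal modulation rate responds much more strongly than one tuned to a different rate. In the first configuration described in the abstract, the outputs of such filters serve as DNN inputs; in the second, the filter coefficients initialize the kernels of a CNN.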

Advisor: Nelson Morgan

BibTeX citation:

@phdthesis{Chang:EECS-2016-62,
    Author= {Chang, Shuo-Yiin},
    Title= {Feature Design for Robust Speech Recognition: Nurture and Nature},
    School= {EECS Department, University of California, Berkeley},
    Year= {2016},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-62.html},
    Number= {UCB/EECS-2016-62},
    Abstract= {As has been extensively shown, acoustic features for speech recognition can be nurtured from training data using deep neural networks (DNNs) with multiple hidden layers. Although a large body of research has shown that these learned features are superior to standard front-ends, this superiority is usually demonstrated when the data used to learn the features is very similar to the data used to test recognition performance. However, realistic environments cover many unanticipated types of novel inputs, including noise, channel distortion, reverberation, accented speech, speaking rate variation, and overlapped speech. A quantitative analysis using bootstrap sampling shows that these trained features are easily specialized to the training data and are corrupted in mismatched scenarios. Gabor filtered spectrograms, on the other hand, are generated from spectro-temporal filters designed to model natural human auditory processing, which can be instrumental in improving generalization to unanticipated deviations from what was seen in training. In this thesis, I use Gabor filtering either as feature processing or as a convolutional kernel in neural networks: the former uses filter outputs as DNN inputs, while the latter uses filter coefficients and structures to initialize a convolutional neural network (CNN). Experiments show that the proposed features perform better than the other noise-robust features that I tried on several noisy corpora. In addition, I demonstrate that including Gabor filters with lower or higher temporal modulations can be used to correlate better with human perception of slow or rapid speech. Finally, I report on an analysis of human cortical signals that demonstrates the relative robustness of these signals to the mixed signal phenomenon, in contrast to a DNN-based ASR system. Based on a number of example tasks in the thesis, I conclude that designed features provide greater robustness than relying on a DNN or CNN alone.},
}

EndNote citation:

%0 Thesis
%A Chang, Shuo-Yiin
%T Feature Design for Robust Speech Recognition: Nurture and Nature
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 12
%@ UCB/EECS-2016-62
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-62.html
%F Chang:EECS-2016-62