Extremely Lightweight Vocoders for On-device Speech Synthesis

Tianren Gao

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-69
May 13, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.pdf

As edge-device applications increasingly interact with users through speech, efficient automatic speech synthesis is becoming more important. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into raw audio waveforms. Most existing vocoders are auto-regressive and therefore difficult to parallelize, since each generated sample is conditioned on previous samples. Flow-based feed-forward models such as WaveGlow are an alternative to these auto-regressive models. However, while WaveGlow can be easily parallelized, the model is too computationally expensive for real-time speech synthesis on the edge. This work presents SqueezeWave, an extremely lightweight vocoder that can generate audio of similar quality to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).

Advisors: Kurt Keutzer and Joseph Gonzalez


BibTeX citation:

@mastersthesis{Gao:EECS-2021-69,
    Author = {Gao, Tianren},
    Title = {Extremely Lightweight Vocoders for On-device Speech Synthesis},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.html},
    Number = {UCB/EECS-2021-69},
    Abstract = {As edge-device applications increasingly interact with users through speech, efficient automatic speech synthesis is becoming more important. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into raw audio waveforms. Most existing vocoders are auto-regressive and therefore difficult to parallelize, since each generated sample is conditioned on previous samples. Flow-based feed-forward models such as WaveGlow are an alternative to these auto-regressive models. However, while WaveGlow can be easily parallelized, the model is too computationally expensive for real-time speech synthesis on the edge. This work presents SqueezeWave, an extremely lightweight vocoder that can generate audio of similar quality to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).}
}

EndNote citation:

%0 Thesis
%A Gao, Tianren
%T Extremely Lightweight Vocoders for On-device Speech Synthesis
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 13
%@ UCB/EECS-2021-69
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.html
%F Gao:EECS-2021-69