Extremely Lightweight Vocoders for On-device Speech Synthesis

Tianren Gao

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-69

May 13, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.pdf

As edge-device applications increasingly interact with users through speech, efficient automatic speech synthesis is becoming more important. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into raw audio waveforms. Most existing vocoders are difficult to parallelize because each generated sample is conditioned on previous samples. Flow-based feed-forward models such as WaveGlow are an alternative to these auto-regressive models. However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This work presents SqueezeWave, an extremely lightweight vocoder that can generate audio of similar quality to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).

Advisors: Kurt Keutzer and Joseph Gonzalez

BibTeX citation:

@mastersthesis{Gao:EECS-2021-69,
    Author= {Gao, Tianren},
    Title= {Extremely Lightweight Vocoders for On-device Speech Synthesis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.html},
    Number= {UCB/EECS-2021-69},
    Abstract= {As edge-device applications increasingly interact with users through speech, efficient automatic speech synthesis is becoming more important. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into raw audio waveforms. Most existing vocoders are difficult to parallelize because each generated sample is conditioned on previous samples. Flow-based feed-forward models such as WaveGlow are an alternative to these auto-regressive models. However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This work presents SqueezeWave, an extremely lightweight vocoder that can generate audio of similar quality to WaveGlow with 61x to 214x fewer multiply-accumulate operations (MACs).},
}

EndNote citation:

%0 Thesis
%A Gao, Tianren 
%T Extremely Lightweight Vocoders for On-device Speech Synthesis
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 13
%@ UCB/EECS-2021-69
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-69.html
%F Gao:EECS-2021-69