Real-Time, Streamable Differentiable DSP Vocoder for Articulatory Synthesis

Drake Lin

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-202

December 1, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-202.pdf

Articulatory synthesis, the process of generating speech from physical movements of human articulators, offers unique advantages due to its physically grounded and compact input features. However, recent advances in the field have prioritized audio quality over streaming latency. In this paper, we propose a real-time streaming differentiable digital signal processing (DDSP) articulatory vocoder that can synthesize speech from electromagnetic articulography (EMA), fundamental frequency (F0), and loudness data. Our best model achieves a transcription word error rate (WER) of 8.9%, which is 4.0% lower than a state-of-the-art baseline. The same model can also generate 5 milliseconds of speech in less than 2 milliseconds on CPU in a streaming fashion, opening the door for downstream real-time low-latency audio applications.
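To make the DDSP idea concrete, the sketch below shows a toy harmonic synthesizer in the DDSP family: frame-rate F0 and loudness controls are upsampled to audio rate and drive a bank of sinusoids whose phase is obtained by integrating the instantaneous frequency. This is only an illustration of the general technique, not the report's actual model (which also conditions on EMA features and runs a learned network); the function name and parameter defaults here are hypothetical.

```python
import math

def ddsp_harmonic_sketch(f0_hz, loudness, sr=16000, hop=80, n_harmonics=20):
    """Toy DDSP-style harmonic synthesizer (illustrative only).

    f0_hz, loudness: frame-rate control sequences of equal length.
    Returns hop * len(f0_hz) audio samples at rate sr.
    """
    # Upsample frame-rate controls to audio rate (zero-order hold).
    f0 = [f for f in f0_hz for _ in range(hop)]
    amp = [a for a in loudness for _ in range(hop)]

    audio = []
    phase = 0.0
    for n in range(len(f0)):
        # Integrate instantaneous frequency so F0 changes stay click-free.
        phase += 2.0 * math.pi * f0[n] / sr
        sample = 0.0
        for k in range(1, n_harmonics + 1):
            # Drop harmonics above Nyquist to avoid aliasing.
            if k * f0[n] < sr / 2:
                sample += math.sin(k * phase)
        audio.append(amp[n] * sample / n_harmonics)
    return audio
```

In a streaming setting, the appeal of such a DSP core is that each new control frame can be rendered into `hop` samples immediately, with only the oscillator phase carried over as state between calls, which is what makes millisecond-scale CPU synthesis plausible.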


BibTeX citation:

@mastersthesis{Lin:EECS-2024-202,
    Author= {Lin, Drake},
    Title= {Real-Time, Streamable Differentiable DSP Vocoder for Articulatory Synthesis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-202.html},
    Number= {UCB/EECS-2024-202},
    Abstract= {Articulatory synthesis, the process of generating speech from physical movements of human articulators, offers unique advantages due to its physically grounded and compact input features. However, recent advancements in the field have prioritized audio quality without a focus on streaming latency. In this paper, we propose a real-time streaming differentiable digital signal processing (DDSP) articulatory vocoder that can synthesize speech from electromagnetic articulography (EMA), fundamental frequency (F0), and loudness data. Our best model achieves a transcription word error rate (WER) of 8.9%, which is 4.0% lower than a state-of-the-art baseline. The same model can also generate 5 milliseconds of speech in less than 2 milliseconds on CPU in a streaming fashion, opening the door for downstream real-time low-latency audio applications.},
}

EndNote citation:

%0 Thesis
%A Lin, Drake 
%T Real-Time, Streamable Differentiable DSP Vocoder for Articulatory Synthesis
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 1
%@ UCB/EECS-2024-202
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-202.html
%F Lin:EECS-2024-202