Motion Diffusion From Speech

Kushal Khangaonkar and Sanjay Subramanian and Daniel Klein and Trevor Darrell

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-113

May 16, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-113.pdf

This project explores methods for motion synthesis from speech. Given a recorded speech sample, we aim to generate joint angles for body and hand motion that is realistic and corresponds to the input speech. We propose a diffusion-based method that uses Prosody Embeddings as conditioning for a transformer-encoder diffusion model. Our work emphasizes the importance of classifier-free guidance during generation as a key factor in improving the accuracy and realism of generated motion. We also find that a velocity loss term is crucial for learning motion patterns. Our results show that Prosody Embeddings have potential as conditioning for realistic motion synthesis, but additional conditioning may be necessary to generate motion with a semantic connection to the input speech.
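
The abstract names two techniques, classifier-free guidance and a velocity loss, without giving formulas. The sketch below shows their commonly used generic forms in plain Python, as an illustration only and not as the report's implementation. The function names, the flat-list pose representation, and the use of a mean squared error are all assumptions made for this example.

```python
from typing import List, Sequence

Frame = Sequence[float]  # assumed pose format: a flat vector of joint angles


def classifier_free_guidance(cond: Sequence[float],
                             uncond: Sequence[float],
                             scale: float) -> List[float]:
    """Blend conditional and unconditional model outputs.

    scale = 0 returns the unconditional output, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the conditioning signal.
    """
    if len(cond) != len(uncond):
        raise ValueError("outputs must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def velocities(motion: Sequence[Frame]) -> List[List[float]]:
    """Frame-to-frame differences of a motion sequence."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(motion, motion[1:])]


def velocity_loss(pred: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Mean squared error between predicted and target velocities."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    pred_vel, target_vel = velocities(pred), velocities(target)
    squared = [(p - t) ** 2
               for pf, tf in zip(pred_vel, target_vel)
               for p, t in zip(pf, tf)]
    return sum(squared) / len(squared)
```

Because the velocity term compares frame-to-frame differences rather than the poses themselves, it is unaffected by a constant pose offset. In practice it would be combined with a position or noise-prediction loss rather than used alone.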

Advisor: Daniel Klein

BibTeX citation:

@mastersthesis{Khangaonkar:EECS-2024-113,
    Author= {Khangaonkar, Kushal and Subramanian, Sanjay and Klein, Daniel and Darrell, Trevor},
    Title= {Motion Diffusion From Speech},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-113.html},
    Number= {UCB/EECS-2024-113},
    Abstract= {This project explores methods for motion synthesis from speech. Given a recorded speech sample we aim to generate joint angles for body and hand motion that is realistic and corresponds to the input speech. We propose a diffusion-based method that uses Prosody Embeddings as conditioning for a transformer encoder diffusion model. Our work emphasizes the importance of classifier-free guidance during generation as a key factor in improving accuracy and realism of generated motion. We also find that using a velocity loss term is a crucial aspect of learning motion patterns. Our results show that there is potential for Prosody Embeddings as conditioning for realistic motion synthesis, but additional conditioning may be necessary to generate motion with semantic connection to the input speech.},
}

EndNote citation:

%0 Thesis
%A Khangaonkar, Kushal 
%A Subramanian, Sanjay 
%A Klein, Daniel 
%A Darrell, Trevor 
%T Motion Diffusion From Speech
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 16
%@ UCB/EECS-2024-113
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-113.html
%F Khangaonkar:EECS-2024-113