A Foundational Framework for Joint Speech and 4D Avatar Generation from Syllabic Tokens
Rishi Jain
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-117
May 16, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-117.pdf
This thesis addresses the challenge of generating synchronized, expressive facial animations from syllabic speech representations in an identity-independent manner. Traditional approaches to speech-driven facial animation often rely on existing ground-truth audio or remain constrained to specific identities. We propose a novel framework that leverages conditional flow matching in a learned latent space to model the inherently ambiguous, one-to-many relationship between a low-bitrate syllabic speech codec and facial movements.
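The report itself does not reproduce implementation details here; as a minimal sketch of the conditional flow matching objective described above, the following Python code (assuming PyTorch, with hypothetical names such as VelocityField and assumed tensor shapes for motion latents and syllabic-token embeddings) shows how a velocity field conditioned on syllabic tokens can be trained by regressing the straight-line velocity between noise and data latents.

# Minimal conditional flow matching training step (illustrative sketch, not the thesis code).
# Assumed shapes: motion latents z1 of shape (B, T, D); syllabic-token
# embeddings cond of shape (B, T, C). All module names are hypothetical.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts the flow velocity for a noisy motion latent, conditioned
    on syllabic-token embeddings and the flow time t."""
    def __init__(self, latent_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, t, cond):
        # Broadcast the scalar flow time across the sequence dimension.
        t_feat = t.view(-1, 1, 1).expand(-1, z_t.shape[1], 1)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

def flow_matching_loss(model, z1, cond):
    """One conditional flow matching step: interpolate between Gaussian
    noise z0 and a data latent z1, and regress the constant velocity z1 - z0."""
    z0 = torch.randn_like(z1)                      # noise endpoint
    t = torch.rand(z1.shape[0], device=z1.device)  # per-sample flow time in [0, 1]
    z_t = (1 - t.view(-1, 1, 1)) * z0 + t.view(-1, 1, 1) * z1
    target_velocity = z1 - z0
    pred_velocity = model(z_t, t, cond)
    return nn.functional.mse_loss(pred_velocity, target_velocity)

Because the velocity field sees only syllabic-token embeddings rather than raw audio, the same objective naturally accommodates the one-to-many mapping from compact speech codes to facial motion.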
Our approach begins with an exploratory study that identifies 3D morphable model parameters as effective encodings for expressive facial motion. Building on this finding, we develop a system that uses a variational autoencoder (VAE) combined with conditional flow matching to generate anatomically plausible facial animations from compact, identity-agnostic syllabic representations. By disentangling identity features from dynamic motion, our method enables one model to serve a broad user base, supporting applications in privacy-preserving communication, customizable digital personas, and accessibility.
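As an illustration of the latent-space design described above, the sketch below shows one plausible arrangement, under assumed dimensions and hypothetical names (MotionVAE, expr_dim), in which a VAE compresses per-frame 3D morphable model expression parameters into a compact motion latent while identity (shape) parameters stay outside the generative path and are only re-attached when rendering a specific avatar.

# Sketch of a VAE over per-frame 3D morphable model (3DMM) expression
# parameters; dimensions and names are assumptions, not the thesis implementation.
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    def __init__(self, expr_dim: int = 53, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(expr_dim, hidden), nn.SiLU(), nn.Linear(hidden, 2 * latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(), nn.Linear(hidden, expr_dim)
        )

    def encode(self, expr):
        mu, logvar = self.encoder(expr).chunk(2, dim=-1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, expr):
        mu, logvar = self.encode(expr)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        # Standard VAE objective: reconstruction plus KL divergence to N(0, I).
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        recon_loss = nn.functional.mse_loss(recon, expr)
        return recon, recon_loss + 1e-3 * kl

In such a setup, the conditional flow matching model from the earlier sketch would operate on the VAE's motion latents, so the same generative model can drive any identity whose shape parameters are supplied at decode time.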
Experimental results demonstrate significant improvements in lip synchronization accuracy and motion naturalness compared to direct parameter prediction approaches. Our model successfully captures the correlation between audio prosodic features and facial movements while maintaining consistent performance across both seen and unseen speakers. The stochastic nature of our approach enables diverse yet plausible animations from identical inputs, avoiding the uncanny repetitiveness often associated with deterministic methods. This work represents a significant step toward scalable, identity-independent audio-visual generation with applications in virtual communication, entertainment, and accessibility.
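The diversity attributed to the stochastic approach comes from sampling different noise seeds and integrating the learned velocity field; a minimal sampling sketch, assuming the hypothetical VelocityField module from the earlier example and simple Euler integration, is shown below.

# Sketch of stochastic sampling: integrate dz/dt = v(z, t, cond) from t=0 (noise)
# to t=1 (data). Different noise seeds yield different but plausible animations
# for the same syllabic input. Names and shapes are assumptions.
import torch

@torch.no_grad()
def sample_motion_latent(model, cond, latent_dim: int = 32, steps: int = 32):
    """Euler integration of the learned flow, conditioned on syllabic tokens."""
    B, T, _ = cond.shape
    z = torch.randn(B, T, latent_dim, device=cond.device)  # random seed per sample
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B,), i * dt, device=cond.device)
        z = z + dt * model(z, t, cond)
    return z  # decode with the motion VAE to recover 3DMM expression parameters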
Advisor: Gopala Krishna Anumanchipalli
BibTeX citation:
@mastersthesis{Jain:EECS-2025-117,
    Author = {Jain, Rishi},
    Title = {A Foundational Framework for Joint Speech and 4D Avatar Generation from Syllabic Tokens},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-117.html},
    Number = {UCB/EECS-2025-117},
    Abstract = {This thesis addresses the challenge of generating synchronized, expressive facial animations from syllabic speech representations in an identity-independent manner. Traditional approaches to speech-driven facial animation often rely on existing ground-truth audio or remain constrained to specific identities. We propose a novel framework that leverages conditional flow matching in a learned latent space to model the inherently ambiguous, one-to-many relationship between a low-bitrate syllabic speech codec and facial movements. Our approach begins with an exploratory study that identifies 3D morphable model parameters as effective encodings for expressive facial motion. Building on this finding, we develop a system that uses a variational autoencoder (VAE) combined with conditional flow matching to generate anatomically plausible facial animations from compact, identity-agnostic syllabic representations. By disentangling identity features from dynamic motion, our method enables one model to serve a broad user base, supporting applications in privacy-preserving communication, customizable digital personas, and accessibility. Experimental results demonstrate significant improvements in lip synchronization accuracy and motion naturalness compared to direct parameter prediction approaches. Our model successfully captures the correlation between audio prosodic features and facial movements while maintaining consistent performance across both seen and unseen speakers. The stochastic nature of our approach enables diverse yet plausible animations from identical inputs, avoiding the uncanny repetitiveness often associated with deterministic methods. This work represents a significant step toward scalable, identity-independent audio-visual generation with applications in virtual communication, entertainment, and accessibility.},
}
EndNote citation:
%0 Thesis
%A Jain, Rishi
%T A Foundational Framework for Joint Speech and 4D Avatar Generation from Syllabic Tokens
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-117
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-117.html
%F Jain:EECS-2025-117