Accented Non-Autoregressive Text to Speech via Articulatory Features

Xavier Yin

EECS Department, University of California, Berkeley

Technical Report No. UCB/

December 1, 2025

This work presents FastAccent, a non-autoregressive text-to-speech (TTS) system designed to synthesize English speech in multiple accents from a shared phonemic representation. Traditional phoneme sets such as ARPABET or IPA are accent-specific and thus unsuitable for multilingual or multi-accent synthesis. To address this, FastAccent utilizes Unilex, a diaphonemic lexicon that maps orthographic input to accent-agnostic phonemes. The system predicts articulatory features, disentangled from voice quality, as an intermediate representation, enabling accent conditioning without explicit speaker labels. FastAccent is trained using unsupervised alignment, removing reliance on forced aligners like the Montreal Forced Aligner.

Experiments demonstrate that FastAccent can synthesize both American and British English from the same phoneme input with high fidelity, achieving strong correlations with ground-truth articulatory features. However, prosodic features such as pitch remain more difficult to model. The results validate articulatory features as an effective target for accent-conditioned synthesis and highlight the limitations of current prosody modeling approaches. The system lays the groundwork for scalable, accent-aware TTS models, and future work includes expanding to more accents, improving prosody control, optimizing the architecture, and enabling real-time inference.

Advisor: Gopala Krishna Anumanchipalli

BibTeX citation:

@mastersthesis{Yin:31749,
    Author= {Yin, Xavier},
    Title= {Accented Non-Autoregressive Text to Speech via Articulatory Features},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {This work presents FastAccent, a non-autoregressive text-to-speech (TTS) system designed to synthesize English speech in multiple accents from a shared phonemic representation. Traditional phoneme sets such as ARPABET or IPA are accent-specific and thus unsuitable for multilingual or multi-accent synthesis. To address this, FastAccent utilizes Unilex, a diaphonemic lexicon that maps orthographic input to accent-agnostic phonemes. The system predicts articulatory features, disentangled from voice quality, as an intermediate representation, enabling accent conditioning without explicit speaker labels. FastAccent is trained using unsupervised alignment, removing reliance on forced aligners like the Montreal Forced Aligner.

Experiments demonstrate that FastAccent can synthesize both American and British English from the same phoneme input with high fidelity, achieving strong correlations with ground-truth articulatory features. However, prosodic features such as pitch remain more difficult to model. The results validate articulatory features as an effective target for accent-conditioned synthesis and highlight the limitations of current prosody modeling approaches. The system lays the groundwork for scalable, accent-aware TTS models, and future work includes expanding to more accents, improving prosody control, optimizing the architecture, and enabling real-time inference.},
}

EndNote citation:

%0 Thesis
%A Yin, Xavier 
%T Accented Non-Autoregressive Text to Speech via Articulatory Features
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 1
%@ UCB/
%F Yin:31749