David Babazadeh

EECS Department, University of California, Berkeley

Technical Report No. UCB/

May 1, 2025

Differentiable Digital Signal Processing (DDSP) has been used in generative neural network architectures to create lightweight, real-time instrument-to-instrument timbre transfer. These systems are pre-trained on the output instrument and reconstruct the input audio in the style of that instrument using extracted structural components, namely monophonic pitch (fundamental frequency) and loudness. In this paper we expand this idea in two ways: (1) audio-conditioning on the output instrument instead of pre-training, and (2) additionally extracting periodicity and electromagnetic articulography (EMA) features from the input audio to capture vocal nuances. With periodicity and EMA, we are able, unsurprisingly, to reduce the validation loss of the model; more surprisingly, the EMA features extracted during training predict, with moderate reliability, where we would place our articulators to mimic instruments.
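As a rough illustration (not taken from the report itself), the sketch below shows how the conditioning features named in the abstract (fundamental frequency, a periodicity proxy, and A-weighted loudness) might be extracted frame-by-frame using librosa; the function name and parameter choices are hypothetical, and the EMA features, which would come from a separate acoustics-to-articulation model, are omitted.

    # Hypothetical sketch of per-frame feature extraction for DDSP conditioning;
    # not the author's implementation.
    import librosa
    import numpy as np

    def extract_conditioning(path, sr=16000, hop_length=256, n_fft=1024):
        # Load the input (voice) recording as mono audio.
        y, _ = librosa.load(path, sr=sr, mono=True)

        # Monophonic pitch plus a periodicity proxy (voiced probability) via pYIN.
        f0, _, periodicity = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
            sr=sr, hop_length=hop_length)

        # A-weighted loudness in dB, averaged over frequency for each frame.
        S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
        loudness = np.mean(
            librosa.amplitude_to_db(S, ref=1.0) + librosa.A_weighting(freqs)[:, None],
            axis=0)

        return f0, periodicity, loudness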

Advisor: Claire Tomlin


BibTeX citation:

@mastersthesis{Babazadeh:31739,
    Author= {Babazadeh, David},
    Title= {Articulatory Voice-to-Instrument Timbre Transfer in Real-Time with Audio-Conditioned DDSP},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {Differentiable Digital Signal Processing (DDSP) has been used in generative neural network architectures to create lightweight, real-time instrument-to-instrument timbre transfer. These systems are pre-trained on the output instrument and reconstruct the input audio in the style of that instrument using extracted structural components, namely monophonic pitch (fundamental frequency) and loudness. In this paper we expand this idea in two ways: (1) audio-conditioning on the output instrument instead of pre-training, and (2) additionally extracting periodicity and electromagnetic articulography (EMA) features from the input audio to capture vocal nuances. With periodicity and EMA, we are able, unsurprisingly, to reduce the validation loss of the model; more surprisingly, the EMA features extracted during training predict, with moderate reliability, where we would place our articulators to mimic instruments.},
}

EndNote citation:

%0 Thesis
%A Babazadeh, David 
%T Articulatory Voice-to-Instrument Timbre Transfer in Real-Time with Audio-Conditioned DDSP
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 1
%@ UCB/
%F Babazadeh:31739