Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis

Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling
National Engineering Research Center of Speech and Language Information Processing
University of Science and Technology of China

Abstract

This paper proposes IDEA-TTS, an Incremental Disentanglement-based Environment-Aware zero-shot text-to-speech (TTS) method that synthesizes speech for unseen speakers while preserving the acoustic characteristics of a given environment reference speech. IDEA-TTS adopts VITS as the TTS backbone. To disentangle the environment, speaker, and text factors effectively, we propose an incremental disentanglement process in which an environment estimator first decomposes the environmental spectrogram into an environment mask and an enhanced spectrogram. The environment mask is then processed by an environment encoder to extract environment embeddings, while the enhanced spectrogram supports the subsequent disentanglement of the speaker and text factors, conditioned on speaker embeddings extracted from the environmental speech by a pretrained environment-robust speaker encoder. Finally, both the speaker and environment embeddings condition the decoder for environment-aware speech generation. Experimental results demonstrate that IDEA-TTS achieves superior performance in the environment-aware TTS task, excelling in speech quality, speaker similarity, and environmental similarity. IDEA-TTS is also capable of acoustic environment conversion, where it achieves state-of-the-art performance.
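The data flow of the incremental disentanglement process described in the abstract can be illustrated with a minimal sketch. Everything below is a toy stand-in and an assumption, not the authors' implementation: the function names (`estimate_environment`, `mean_over_frames`, `disentangle`) are hypothetical, the noise-floor mask replaces a trained environment estimator, the relation "enhanced = mask * environmental spectrogram" is assumed, and frame averaging replaces both the environment encoder and the pretrained speaker encoder.

```python
from typing import Dict, List, Tuple

# A spectrogram is indexed as [frame][frequency bin] with non-negative magnitudes.
Spectrogram = List[List[float]]


def estimate_environment(env_spec: Spectrogram) -> Tuple[Spectrogram, Spectrogram]:
    """Step 1 (toy stand-in for the environment estimator).

    Decomposes the environmental spectrogram into an environment mask and an
    enhanced spectrogram. ASSUMPTION: enhanced = mask * env_spec, with the mask
    derived from a per-bin noise floor instead of a trained network.
    """
    n_bins = len(env_spec[0])
    # Toy noise floor: the minimum magnitude observed in each frequency bin.
    floor = [min(frame[k] for frame in env_spec) for k in range(n_bins)]
    mask = [
        [max(0.0, 1.0 - floor[k] / v) if v > 0 else 0.0 for k, v in enumerate(frame)]
        for frame in env_spec
    ]
    enhanced = [
        [m * v for m, v in zip(mask_row, frame)]
        for mask_row, frame in zip(mask, env_spec)
    ]
    return mask, enhanced


def mean_over_frames(spec: Spectrogram) -> List[float]:
    """Averages each frequency bin over time to get a fixed-size vector."""
    n_frames = len(spec)
    return [sum(frame[k] for frame in spec) / n_frames for k in range(len(spec[0]))]


def disentangle(env_spec: Spectrogram) -> Dict[str, object]:
    """Runs the steps in the order the abstract gives them."""
    # Step 1: split the input into an environment mask and an enhanced spectrogram.
    mask, enhanced = estimate_environment(env_spec)
    # Step 2: the environment embedding comes from the mask (toy encoder: mean).
    env_embedding = mean_over_frames(mask)
    # Step 3: the speaker embedding comes from the environmental speech itself
    # (toy stand-in for the pretrained environment-robust speaker encoder).
    spk_embedding = mean_over_frames(env_spec)
    # Step 4: a real decoder would be conditioned on both embeddings; here the
    # conditions are simply returned alongside the enhanced spectrogram.
    return {
        "mask": mask,
        "enhanced": enhanced,
        "env_embedding": env_embedding,
        "spk_embedding": spk_embedding,
    }
```

The point of the sketch is the ordering: the environment factor is separated first, and only the enhanced spectrogram is passed on to the speaker and text disentanglement, which is what makes the process incremental.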


I. Text-to-Speech


I.1 Environment-Robust TTS


Text | Speaker Reference | Speaker Reference (clean) | YourTTS (Clean Ref.) | YourTTS | IDEA-TTS (w/o ID) | IDEA-TTS
There is, according to legend, a boiling pot of gold at one end.
Yesterday, he had a chilling warning for the game in Scotland.
Falling from four frosted half globes set in the scrollwork of the ceiling.
Harris is what you would call a well-made man of about number one size, and looks hard and bony.

I.2 Environment-Aware TTS


Text | Speaker Reference | Speaker Reference (clean) | Environment Reference | IDEA-TTS (w/o ID) | IDEA-TTS
Craig is a major concern for us.
Then was the summer of their discontent.
Strange creatures that rarely put nose out of doors, or set foot to ground.
I think, yes, that's about the right distance.
Lottery money was intended to be used for good causes.
Grey-coloured woods covered a large part of the surface.
Anyway, the job will be part-time.
We'd made a great deal of way during the night and were now lying becalmed about half a mile to the south-east of the low eastern coast.

II. Acoustic Environment Conversion


II.1 Env-to-Clean


Source | Reference | DiffRENT (W-R2-C) | IDEA-TTS (w/o ID) | IDEA-TTS | Target

II.2 Clean-to-Env


Source | Reference | DiffRENT (W-R2-C) | IDEA-TTS (w/o ID) | IDEA-TTS | Target

II.3 Env-to-Env


Source | Reference | DiffRENT (W-R2-C) | IDEA-TTS (w/o ID) | IDEA-TTS | Target