Could Microsoft VALL-E and VALL-E X Be Revolutionizing Text-to-Speech Synthesis?


Microsoft’s VALL-E takes a language modeling approach to text-to-speech synthesis (TTS). It is a neural codec language model: instead of regressing a continuous signal as traditional systems do, it treats TTS as a conditional language modeling task over discrete codes produced by a neural audio codec model. Framing the problem this way gives VALL-E in-context learning capabilities.

VALL-E synthesizes high-quality personalized speech from minimal input: a 3-second recording of an unseen speaker is enough. Microsoft reports that it outperforms state-of-the-art zero-shot TTS systems in speech naturalness and speaker similarity. VALL-E also goes beyond vocal replication; it preserves the speaker’s emotion and the acoustic environment of the provided prompt during synthesis.

Ethics Statement

As with any powerful technology, VALL-E/X comes with ethical considerations. Because these models can maintain speaker identity, they could be applied in education, entertainment, journalism, accessibility features, and more. The same capability makes ethical use crucial. The models’ performance, speaker similarity, and naturalness depend on several factors, including the length and quality of the speech prompt and the amount of background noise. To ensure responsible usage, Microsoft emphasizes the importance of user agreement and includes safeguards against potential misuse. Users can report any abusive, illegal, or rights-infringing activities through the Report Abuse Portal.

VALL-E: Redefining TTS

VALL-E Model Overview

VALL-E departs from the conventional phoneme → mel-spectrogram → waveform pipeline. Instead, it follows a phoneme → discrete code → waveform pipeline, generating discrete audio codec codes conditioned on phoneme and acoustic code prompts. This approach enables diverse applications, including zero-shot TTS, speech editing, and content creation in conjunction with other generative AI models like GPT.
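To make the data flow concrete, here is a minimal Python sketch of the phoneme → discrete code → waveform idea. It is not Microsoft's implementation: every function name, the one-letter "phonemes", the codebook size, and the random stand-in for the trained model are assumptions made for illustration only.

```python
import random
from typing import List

CODEBOOK_SIZE = 1024  # assumed size, chosen only for this illustration


def text_to_phonemes(text: str) -> List[str]:
    """Toy stand-in for a phonemizer: one 'phoneme' per letter."""
    return [ch for ch in text.lower() if ch.isalpha()]


def codec_encode(waveform: List[float]) -> List[int]:
    """Toy stand-in for a codec encoder: map samples in [-1, 1] to code indices."""
    codes = []
    for sample in waveform:
        index = int((sample + 1.0) / 2.0 * CODEBOOK_SIZE)
        codes.append(max(0, min(CODEBOOK_SIZE - 1, index)))
    return codes


def codec_decode(codes: List[int]) -> List[float]:
    """Toy stand-in for a codec decoder: map code indices back to samples."""
    return [(code + 0.5) / CODEBOOK_SIZE * 2.0 - 1.0 for code in codes]


def codec_language_model(phonemes: List[str], acoustic_prompt: List[int],
                         frames_per_phoneme: int = 2, seed: int = 0) -> List[int]:
    """Placeholder for the trained model.

    A real model would condition on both prompts. This stand-in ignores
    them and draws random codes, so only the interfaces are meaningful.
    """
    rng = random.Random(seed)
    return [rng.randrange(CODEBOOK_SIZE)
            for _ in range(len(phonemes) * frames_per_phoneme)]


def synthesize(text: str, prompt_waveform: List[float], seed: int = 0) -> List[float]:
    phonemes = text_to_phonemes(text)               # text -> phonemes
    prompt_codes = codec_encode(prompt_waveform)    # speaker prompt -> discrete codes
    new_codes = codec_language_model(phonemes, prompt_codes, seed=seed)
    return codec_decode(new_codes)                  # discrete codes -> waveform


waveform = synthesize("hello", [0.0, 0.5, -0.5])
```

The point of the sketch is the shape of the interfaces: the model never sees or emits a mel-spectrogram, only phonemes and integer codes.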

VALL-E Applications

Zero-shot TTS for LibriSpeech and VCTK Dataset

  • Samples for the LibriSpeech dataset
  • Samples for the VCTK Dataset

Synthesis of Diversity

VALL-E introduces diversity into synthesized speech samples by employing sampling-based discrete token generation methods. With different random seeds, it can create varied personalized speech samples from a pair of text and speaker prompts.
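The effect of the random seed can be illustrated with a small Python sketch of sampling-based token generation. The `toy_logits` function is a hypothetical stand-in for a model's output, and the temperature parameter is a common sampling control assumed here rather than a detail taken from the VALL-E description.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], rng: random.Random,
                 temperature: float = 1.0) -> int:
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def toy_logits(previous_token: int, vocab_size: int) -> List[float]:
    """Hypothetical model output: favors tokens close to the previous one."""
    return [-abs(index - previous_token) / 4.0 for index in range(vocab_size)]


def generate(seed: int, length: int = 12, vocab_size: int = 16,
             temperature: float = 1.0) -> List[int]:
    """Generate a token sequence; the seed fixes every random draw."""
    rng = random.Random(seed)
    tokens = [vocab_size // 2]  # arbitrary start token
    for _ in range(length):
        logits = toy_logits(tokens[-1], vocab_size)
        tokens.append(sample_token(logits, rng, temperature))
    return tokens[1:]


first = generate(seed=0)
second = generate(seed=1)
```

Calling `generate` twice with the same seed returns the same sequence, while different seeds generally produce different sequences from the same inputs. That is the mechanism behind getting varied speech samples from a single pair of text and speaker prompts.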

Acoustic Environment Maintenance

Maintaining the acoustic environment of the speaker prompt, VALL-E can synthesize personalized speech using audio and transcriptions sampled from the Fisher dataset.

Speaker Emotion Maintenance

Preserving speaker emotion in the prompt, VALL-E synthesizes personalized speech using audio prompts sampled from the Emotional Voices Database.

VALL-E X: Cross-Lingual Marvel

VALL-E extends its capabilities with VALL-E X, a cross-lingual neural codec language model. Through multi-lingual conditional codec language training, VALL-E X predicts acoustic token sequences for target language speech using both source language speech and target language text prompts.

VALL-E X Model Overview

VALL-E X can synthesize personalized speech in another language for a monolingual speaker. It takes the phoneme sequences derived from the source and target text, together with the source acoustic tokens derived from an audio codec model, as prompts. From these it produces acoustic tokens in the target language, which are then decompressed to the target speech waveform. With robust in-context learning, VALL-E X performs well on zero-shot cross-lingual speech generation tasks, including cross-lingual text-to-speech synthesis and speech-to-speech translation.
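The three inputs described above can be pictured as one conditioning sequence. The Python sketch below only shows how such a sequence might be assembled. The class name, the `<text>` and `<audio>` markers, the ordering, and the example phonemes and token values are all hypothetical and are not taken from Microsoft's implementation.

```python
from dataclasses import dataclass
from typing import List

TEXT_MARK = "<text>"    # hypothetical marker, not from the source
AUDIO_MARK = "<audio>"  # hypothetical marker, not from the source


@dataclass
class CrossLingualPrompt:
    source_phonemes: List[str]         # from the source-language transcript
    target_phonemes: List[str]         # from the target-language text
    source_acoustic_tokens: List[int]  # from the codec applied to source speech


def build_conditioning(prompt: CrossLingualPrompt) -> List[str]:
    """Concatenate both phoneme sequences, then the source acoustic tokens."""
    sequence = [TEXT_MARK]
    sequence += prompt.source_phonemes
    sequence += prompt.target_phonemes
    sequence.append(AUDIO_MARK)
    sequence += [f"a{code}" for code in prompt.source_acoustic_tokens]
    return sequence


# Illustrative values only: a Chinese source prompt and an English target text.
prompt = CrossLingualPrompt(
    source_phonemes=["n", "i", "h", "ao"],
    target_phonemes=["HH", "AH", "L", "OW"],
    source_acoustic_tokens=[17, 512, 3],
)
conditioning = build_conditioning(prompt)
```

A trained model would continue this sequence with acoustic tokens for the target-language speech, which a codec decoder would then turn into a waveform.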

VALL-E X Applications

Zero-shot Cross-Lingual Text to Speech

  1. English TTS with Chinese prompts (Samples from LibriSpeech and EMIME/AISHELL-3 datasets)
  2. Chinese TTS with English prompts (Samples from EMIME/AISHELL-3 test and LibriSpeech)

Zero-shot Speech-to-Speech Translation

  1. Chinese to English Translation on EMIME dataset
  2. English to Chinese Translation on EMIME dataset
  3. Chinese to English Translation using AISHELL-3 test
  4. English to Chinese Translation using LibriSpeech dev-clean

Foreign Accent Control

  1. English to Chinese on EMIME dataset
  2. Chinese to English on EMIME dataset

People Behind VALL-E

Let’s acknowledge the visionaries steering the innovation at Microsoft:

  • Long Zhou, Senior Researcher
  • Shujie Liu, Principal Research Manager
  • Yanqing Liu, Principal Software Engineer
  • Huaming Wang, Principal Software Engineer
  • Jinyu Li, Partner Applied Science Manager
  • Lei He, Principal Applied Science Manager
  • Sheng Zhao, Partner Group Engineer Manager
  • Furu Wei, Partner Research Manager

In conclusion, VALL-E and VALL-E X represent a significant step forward in text-to-speech synthesis, extending zero-shot personalized speech across languages. As we explore these innovations, it’s essential to embrace responsible use and ethical considerations, ensuring that these technologies benefit society.