Bonjour, accent-free speech-to-speech translation!

Hear your own voice in any language.

Apr 13, 2023

A great way to spoil many sci-fi flicks is to notice that random aliens throughout the universe happen to speak English. I mean, obviously the Vulcans would at least have the logical wherewithal to speak Esperanto.

Douglas Adams worked around this little problem by inventing the Babel fish, a small yellow fellow that fits in your ear, auto-translating everything everyone says into your very own native tongue. Fresh croissants, here we come! 🥐

While I’m all for talking fish, it’s nice to have an alternative that’s both non-fictional and (I presume) less slimy in your ear.

— Tyler & Team

Paper: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

[pdf on arxiv]

Summary by Adrian Wilkins-Caruana

If you want to say “One coffee please” in French but you don’t know how to, a machine translator can tell you that you should say “Un café s'il vous plaît.” You can try to pronounce that phrase or, if you’re too shy, a text-to-speech program can say it for you in a generic voice. But now, a team from Microsoft has created a language model called VALL-E X, which lets you speak a phrase in your native language and then, amazingly, speaks it for you in French, in your own voice and with a French accent.

To speak a foreign language, VALL-E X needs four inputs: a voice recording of what you want to say in your native language, written representations of both the native and the foreign phrase, and a language ID. The written representations use a phonemic alphabet, which has a character for each speech sound. The phonemes are generated using existing rule-based grapheme-to-phoneme (G2P) converters.

[Figure: VALL-E X’s inputs and outputs]

VALL-E X is based on its predecessor VALL-E, a text-to-speech synthesizer that can generate audio that sounds like your voice. All VALL-E needs are the sounds you want to say (represented with a phonemic alphabet) and a short sample of what your voice sounds like (an audio recording). So you can simply record yourself talking about anything, and then VALL-E can say the word “mountain” in your voice using the word’s phonemes (sounds), which are /ˈmaʊntɪn/. VALL-E X adds the ability to translate: If you say “mountain” in English, then it can use the phonemes in both “mountain” (/ˈmaʊntɪn/) and “montagne” (/mɔ̃taɲ/) to say “montagne” in your voice with a French accent.
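To make the four inputs a bit more concrete, here is a minimal Python sketch of how they might be bundled together for the “mountain” example. Everything in it is hypothetical: the `TranslationRequest` class, the `text_prompt` helper, the made-up audio codes, and the `<fr>` tag are names invented for illustration, and the way the language ID is spliced in is an assumption rather than the paper’s actual mechanism.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranslationRequest:
    """Hypothetical bundle of the four inputs (not a real VALL-E X class)."""
    source_audio_codes: List[int]  # stand-in for the encoded voice recording
    source_phonemes: List[str]     # what you said, e.g. English "mountain"
    target_phonemes: List[str]     # what should be said, e.g. French "montagne"
    target_language: str           # language ID, e.g. "fr"


def text_prompt(request: TranslationRequest) -> List[str]:
    """Join source and target phonemes into one sequence.

    The "<fr>"-style tag is an assumption made for illustration; it is one
    simple way to show the language ID, not necessarily how the paper does it.
    """
    tag = f"<{request.target_language}>"
    return request.source_phonemes + [tag] + request.target_phonemes


request = TranslationRequest(
    source_audio_codes=[17, 4, 42],  # made-up numbers, not real codec output
    source_phonemes=["m", "aʊ", "n", "t", "ɪ", "n"],
    target_phonemes=["m", "ɔ̃", "t", "a", "ɲ"],
    target_language="fr",
)
print(text_prompt(request))
```

The point is only that the model sees both phrases as sounds, plus a hint about which language it should end up speaking; the recording supplies the voice.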

VALL-E and VALL-E X are both neural codec language models: neural networks that work with sound “codes” (numbers that represent sound, like how tokens are “codes” for text). Both models use the pre-trained EnCodec model, which converts between audio waveforms and codes. So, VALL-E X is essentially a sound code generator, like how GPT-3 is a token generator.
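If “numbers that represent sound” feels abstract, the toy below may help. It is not EnCodec and does not use its API: EnCodec is a learned neural codec, whereas this sketch quantizes individual sample values with two tiny hand-written codebooks (`COARSE` and `FINE`, both invented here). It only illustrates the round trip from waveform to integer codes and back, with the second codebook refining what the first one left over.

```python
# Toy illustration only: a hand-written quantizer, not the EnCodec model.

def nearest(codebook, value):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def encode(samples, codebooks):
    """Turn each sample into one integer code per codebook.

    Each codebook quantizes whatever error (residual) the previous one left.
    """
    codes = []
    for x in samples:
        residual = x
        frame = []
        for codebook in codebooks:
            index = nearest(codebook, residual)
            frame.append(index)
            residual -= codebook[index]
        codes.append(frame)
    return codes


def decode(codes, codebooks):
    """Rebuild approximate samples by summing the chosen codebook entries."""
    return [
        sum(codebook[index] for codebook, index in zip(codebooks, frame))
        for frame in codes
    ]


COARSE = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical coarse codebook
FINE = [-0.2, -0.1, 0.0, 0.1, 0.2]    # hypothetical refinement codebook

samples = [0.62, -0.38, -0.07]        # a very short made-up "waveform"
codes = encode(samples, [COARSE, FINE])
restored = decode(codes, [COARSE, FINE])

print(codes)     # [[3, 3], [1, 3], [2, 1]]
print(restored)  # close to the original samples, but not identical
```

A model like VALL-E X never has to produce the waveform directly. It only has to produce the integers, and the codec turns them back into sound.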

The training process for VALL-E X is similar to the one used for transformer-based language models and machine translation models; it involves a dataset of cross-lingual audio recordings and their corresponding transcriptions. VALL-E X is better at text-to-speech and speech-to-speech tasks than existing baselines and, thanks to the language ID input, the translated speech has a native accent, which other models can’t do.
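The language-model analogy can be sketched in a few lines. Assuming training works by next-code prediction (the same idea as next-token prediction for text), each position in a recording’s code sequence becomes one training example: the context is the phonemes plus the codes so far, and the target is the next code. The `next_code_examples` helper, the `<audio>` separator, and the code values are all invented for illustration and are not taken from the paper.

```python
def next_code_examples(phonemes, codes):
    """Build (context, target) pairs for next-code prediction.

    Hypothetical helper: the "<audio>" separator marks where the phonemes
    end and the sound codes begin.
    """
    examples = []
    for t in range(len(codes)):
        context = list(phonemes) + ["<audio>"] + list(codes[:t])
        examples.append((context, codes[t]))
    return examples


# Phonemes for "montagne" paired with three made-up sound codes.
pairs = next_code_examples(["m", "ɔ̃", "t", "a", "ɲ"], [17, 4, 42])

for context, target in pairs:
    print(context, "->", target)
```

Swap the sound codes for word tokens and this is the familiar setup for training a text model, which is why the GPT-3 comparison holds up.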

Learn and Burn
