A great way to spoil many sci-fi flicks is to notice that random aliens throughout the universe happen to speak English. I mean, obviously the Vulcans would at least have the logical wherewithal to speak Esperanto.
Douglas Adams worked around this little problem by inventing the Babel fish, a small yellow fellow that fits in your ear, auto-translating everything everyone says into your very own native tongue. Fresh croissants, here we come! 🥐
While I’m all for talking fish, it’s nice to have an alternative that’s both non-fictional and (I presume) that feels less slimy when experienced by your ear.
- Tyler & Team
Paper: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Summary by Adrian Wilkins-Caruana
If you want to say “One coffee please” in French but you don’t know how to, a machine translator can tell you that you should say “Un café s'il vous plaît.” You can try to pronounce that phrase or, if you’re too shy, a text-to-speech program can say it for you in a generic voice. But now, a team from Microsoft has created a language model called VALL-E X, which lets you speak a phrase in your native language and, amazingly, it will then speak for you in French in your own voice and with a French accent.
To speak a foreign language, VALL-E X needs 4 inputs: a voice recording of what you want to say in your native language, written representations of the native and foreign phrases, and a language ID. The representations use a phonemic alphabet, which has various characters to represent speech sounds. The phonemes are generated using existing rule-based grapheme-to-phoneme (G2P) converters. Here’s a diagram of VALL-E X’s inputs and outputs:
VALL-E X is based on its predecessor VALL-E, a text-to-speech synthesizer that can generate audio that sounds like your voice. All VALL-E needs are the sounds you want to say (represented with a phonemic alphabet) and a short sample of what your voice sounds like (an audio recording). So you can simply record yourself talking about anything, and then VALL-E can say the word “mountain” in your voice using the word’s phonemes (sounds), which are /ˈmaʊntɪn/. VALL-E X adds the ability to translate: If you say “mountain” in English, then it can use the phonemes in both “mountain” (/ˈmaʊntɪn/) and “montagne” (/mɔ̃taɲ/) to say “montagne” in your voice with a French accent.
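To make the four inputs concrete, here is a minimal Python sketch of how they might be bundled for the “mountain” to “montagne” example. The class name, field names, the file name `my_voice.wav`, and the `"fr"` language code are all hypothetical and chosen for illustration; the summary does not describe VALL-E X’s actual interface.

```python
from dataclasses import dataclass, fields


@dataclass
class CrossLingualRequest:
    """Hypothetical container for the four inputs the summary lists."""
    source_audio_path: str  # recording of you saying the phrase in your language
    source_phonemes: str    # phonemic spelling of the native phrase
    target_phonemes: str    # phonemic spelling of the foreign phrase
    language_id: str        # language tag (assumed format, e.g. "fr")


def missing_inputs(request):
    """Return the names of any inputs that were left empty."""
    return [f.name for f in fields(request) if not getattr(request, f.name)]


request = CrossLingualRequest(
    source_audio_path="my_voice.wav",  # made-up file name
    source_phonemes="ˈmaʊntɪn",        # "mountain"
    target_phonemes="mɔ̃taɲ",           # "montagne"
    language_id="fr",
)

print(missing_inputs(request))
```

The point of the sketch is only that all four pieces travel together: the recording carries your voice, the two phoneme strings carry what was said and what should be said, and the language ID carries the accent.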
VALL-E and VALL-E X are both neural codec models, which are neural networks that use sound “codes”: numbers that represent sound, like how tokens are “codes” for text. Both models utilize the pre-trained EnCodec model, which converts between audio waveforms and codes. So, VALL-E X is essentially a sound code generator, like how GPT-3 is a token generator.
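The idea of turning sound into codes and back can be illustrated with a toy example. The sketch below is not EnCodec, which is a learned neural network; it swaps in a hand-written four-entry codebook and a nearest-match lookup, purely to show what “converting between waveforms and codes” means. The codebook values and function names are made up.

```python
import math

# Toy codebook: each integer code stands for one tiny "frame" of two samples.
# A real codec learns its codebooks; these four entries are invented.
CODEBOOK = [
    (0.0, 0.0),
    (0.5, 0.5),
    (-0.5, -0.5),
    (0.5, -0.5),
]


def encode(waveform, frame_size=2):
    """Map each frame of samples to the index of its nearest codebook entry."""
    codes = []
    for start in range(0, len(waveform) - frame_size + 1, frame_size):
        frame = waveform[start:start + frame_size]
        distances = [math.dist(frame, entry) for entry in CODEBOOK]
        codes.append(distances.index(min(distances)))
    return codes


def decode(codes):
    """Rebuild an approximate waveform by looking each code up in the codebook."""
    waveform = []
    for code in codes:
        waveform.extend(CODEBOOK[code])
    return waveform


waveform = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.45, -0.55]
codes = encode(waveform)
reconstructed = decode(codes)

print(codes)          # [0, 1, 2, 3]
print(reconstructed)  # [0.0, 0.0, 0.5, 0.5, -0.5, -0.5, 0.5, -0.5]
```

The reconstruction is only approximate, which is the trade-off of any codec: a short list of integers is much easier for a language model to generate than raw audio samples.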
The training process for VALL-E X is similar to the one used for transformer-based language models and machine translation models; it involves a dataset of cross-lingual audio and their corresponding transcriptions. VALL-E X is better at text-to-speech and speech-to-speech tasks than existing baselines and, thanks to the use of a language ID input parameter, the translated speech uses a native accent, which other models can’t do.