A great way to spoil many sci-fi flicks is to notice that random aliens throughout the universe happen to speak English. I mean, obviously the Vulcans would at least have the logical wherewithal to speak Esperanto.
Douglas Adams worked around this little problem by inventing the Babel fish, a small yellow fellow that fits in your ear, auto-translating everything everyone says into your very own native tongue. Fresh croissants, here we come!
While I'm all for talking fish, it's nice to have an alternative that's both non-fictional and (I presume) less slimy in your ear.
- Tyler & Team
Paper: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Summary by Adrian Wilkins-Caruana
If you want to say "One coffee please" in French but you don't know how to, a machine translator can tell you that you should say "Un café s'il vous plaît." You can try to pronounce that phrase or, if you're too shy, a text-to-speech program can say it for you in a generic voice. But now, a team from Microsoft has created a language model called VALL-E X, which lets you speak a phrase in your native language and, amazingly, it will then speak for you in French in your own voice and with a French accent.
To speak a foreign language, VALL-E X needs four inputs: a voice recording of what you want to say in your native language, written representations of the native and foreign phrases, and a language ID. The written representations use a phonemic alphabet, which has various characters to represent speech sounds. The phonemes are generated using existing rule-based grapheme-to-phoneme (G2P) converters. Here's a diagram of VALL-E X's inputs and outputs:
VALL-E X is based on its predecessor VALL-E, a text-to-speech synthesizer that can generate audio that sounds like your voice. All VALL-E needs is the sounds you want to say (represented with a phonemic alphabet) and a short sample of what your voice sounds like (an audio recording). So you can simply record yourself talking about anything, and then VALL-E can say the word "mountain" in your voice using the word's phonemes (sounds), which are /ˈmaʊntɪn/. VALL-E X adds the ability to translate: If you say "mountain" in English, then it can use the phonemes in both "mountain" (/ˈmaʊntɪn/) and "montagne" (/mɔ̃taɲ/) to say "montagne" in your voice with a French accent.
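To make those inputs concrete, here's a minimal Python sketch of how the four pieces might be bundled together. Everything in it is an assumption made for illustration: the `CrossLingualRequest` class, the `build_text_prompt` helper, the `<fr>` tag format, the file name, and the hand-made split of each word into phoneme symbols are all hypothetical and aren't taken from the paper or from any released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrossLingualRequest:
    """Hypothetical container for the four inputs described in the summary."""
    source_audio_path: str      # recording of the speaker in their native language
    source_phonemes: List[str]  # phonemes of the native-language phrase
    target_phonemes: List[str]  # phonemes of the translated phrase
    target_language_id: str     # e.g. "fr"; signals which accent to aim for


def build_text_prompt(req: CrossLingualRequest) -> List[str]:
    """Join source and target phonemes into one sequence, separated by a language tag.

    The "<lang>" tag is an illustrative assumption, not the paper's encoding.
    """
    return req.source_phonemes + [f"<{req.target_language_id}>"] + req.target_phonemes


request = CrossLingualRequest(
    source_audio_path="my_voice.wav",  # placeholder file name
    source_phonemes=["ˈ", "m", "aʊ", "n", "t", "ɪ", "n"],  # "mountain"
    target_phonemes=["m", "ɔ̃", "t", "a", "ɲ"],             # "montagne"
    target_language_id="fr",
)
prompt = build_text_prompt(request)
print(prompt)
```

The point of the sketch is only that the model sees both phoneme sequences plus a language marker alongside the recording; how the real system encodes the language ID may well differ.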
VALL-E and VALL-E X are both neural codec models, which are neural networks that use sound "codes": numbers that represent sound, like how tokens are "codes" for text. Both models use the pre-trained EnCodec model, which converts between audio waveforms and codes. So, VALL-E X is essentially a sound code generator, like how GPT-3 is a token generator.
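If the idea of turning sound into codes feels abstract, the toy Python sketch below shows the round trip from waveform to integers and back. It is a deliberately simple stand-in and not how EnCodec works: EnCodec is a learned neural codec, while this example just rounds each sample to one of 256 evenly spaced levels. The function names and the `NUM_CODES` value are assumptions chosen for illustration.

```python
import math
from typing import List

NUM_CODES = 256  # size of the toy code vocabulary (an arbitrary choice)


def encode(waveform: List[float]) -> List[int]:
    """Map samples in [-1.0, 1.0] to integer codes in [0, NUM_CODES - 1]."""
    codes = []
    for sample in waveform:
        clipped = max(-1.0, min(1.0, sample))  # guard against out-of-range samples
        codes.append(round((clipped + 1.0) / 2.0 * (NUM_CODES - 1)))
    return codes


def decode(codes: List[int]) -> List[float]:
    """Map integer codes back to approximate samples in [-1.0, 1.0]."""
    return [code / (NUM_CODES - 1) * 2.0 - 1.0 for code in codes]


# A short 440 Hz sine tone sampled at 8 kHz stands in for a voice recording.
waveform = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]
codes = encode(waveform)
restored = decode(codes)
max_error = max(abs(a - b) for a, b in zip(waveform, restored))
print(codes)
print(max_error)
```

The decoded samples are close to the originals but not identical, which is the trade-off any codec makes. A model like VALL-E X works on the integer side of this round trip, predicting codes rather than raw audio.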
The training process for VALL-E X is similar to the one used for transformer-based language models and machine translation models; it involves a dataset of cross-lingual audio recordings and their corresponding transcriptions. VALL-E X is better at text-to-speech and speech-to-speech tasks than existing baselines and, thanks to the use of a language ID input parameter, the translated speech uses a native accent, which other models can't do.
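The summary doesn't spell out the training objective, but language models are typically trained to predict the next token using a cross-entropy loss. Assuming VALL-E X follows the same pattern with sound codes in place of text tokens, the core calculation could look like the Python sketch below. The `next_code_loss` function, the four-code vocabulary, and the probabilities are all made up for illustration.

```python
import math
from typing import List


def next_code_loss(predicted_probs: List[List[float]], target_codes: List[int]) -> float:
    """Average negative log-likelihood of the correct next code at each step."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_codes):
        total += -math.log(probs[target])
    return total / len(target_codes)


# Toy vocabulary of 4 codes; each row is a made-up model prediction for one step.
predicted_probs = [
    [0.7, 0.1, 0.1, 0.1],      # confident, and correct (target is code 0)
    [0.25, 0.25, 0.25, 0.25],  # unsure (target is code 2)
    [0.1, 0.1, 0.1, 0.7],      # confident, and correct (target is code 3)
]
target_codes = [0, 2, 3]
loss = next_code_loss(predicted_probs, target_codes)
print(loss)
```

The loss shrinks as the model puts more probability on the code that actually comes next, which is the same signal that trains a text model to predict the next token.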