Microsoft’s new language model, Vall-E, can reportedly imitate any voice from just a three-second sample recording.
The recently released AI tool was trained on 60,000 hours of English speech data and can replicate a speaker’s mood and tone, researchers say in a paper posted to Cornell’s arXiv.
The results held even when the model generated recordings of words the original speakers never actually spoke.
“Vall-E emerges with in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text-to-speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find Vall-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.”
The Vall-E samples shared on GitHub sound strikingly similar to the speaker prompts, although they vary in quality.
In one synthesized sentence drawn from the Emotional Voices Database, Vall-E says, in a sleepy tone: “We must reduce the number of plastic bags.”
However, research into text-to-speech AI comes with a caveat.
“Since Vall-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers noted on the page. “The experiments were conducted under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.”
Currently, Vall-E, which Microsoft calls a “neural codec language model,” is not available to the public.