March 29, 2024

Microsoft’s new language model, Vall-E, can reportedly imitate any voice using just a three-second sample recording of that person speaking.

The recently unveiled AI tool was trained on 60,000 hours of English speech data. It can replicate a speaker’s mood and tone, researchers say in a paper posted to arXiv, Cornell University’s preprint server.

That held true even when the model generated recordings of words the original speakers never actually said.

“Vall-E emerges with in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a three-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text-to-speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find Vall-E could preserve the speaker’s emotion and the acoustic environment of the acoustic prompt in synthesis.”
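The abstract describes a workflow in which a few seconds of audio from an unseen speaker serve as a prompt to a language model that operates on discrete audio-codec tokens. The following Python sketch is a rough, purely illustrative rendering of that pipeline. It is not Microsoft’s code (Vall-E is not public): the codec, phoneme front end, and acoustic language model below are toy stand-ins, and the sample rate, frame rate, and codebook size are assumed values chosen only so the example runs.

```python
# Illustrative sketch of a prompt-based, zero-shot text-to-speech pipeline.
# Every component here is a hypothetical stand-in, not Vall-E itself.
import numpy as np

CODEBOOK_SIZE = 1024       # size of the discrete audio-token vocabulary (assumed)
FRAMES_PER_SECOND = 75     # codec frame rate (assumed)
SAMPLE_RATE = 16000        # audio sample rate in Hz (assumed)

def toy_codec_encode(waveform):
    """Stand-in for a neural audio codec encoder: waveform -> discrete token ids."""
    n_frames = max(1, int(len(waveform) / SAMPLE_RATE * FRAMES_PER_SECOND))
    rng = np.random.default_rng(abs(int(waveform.sum() * 1e6)) % (2**32))
    return rng.integers(0, CODEBOOK_SIZE, size=n_frames)

def toy_text_to_phoneme_ids(text):
    """Stand-in for a grapheme-to-phoneme front end: text -> integer ids."""
    return np.array([ord(c) % 64 for c in text.lower() if c.isalpha()])

class ToyAcousticLM:
    """Stand-in for the autoregressive model over codec tokens."""
    def generate(self, phoneme_ids, prompt_tokens, n_frames):
        # A real system would do next-token prediction conditioned on the
        # text and the acoustic prompt; here we just return placeholder tokens.
        rng = np.random.default_rng(int(prompt_tokens.sum() + phoneme_ids.sum()))
        return rng.integers(0, CODEBOOK_SIZE, size=n_frames)

def toy_codec_decode(tokens):
    """Stand-in for the codec decoder: token ids -> waveform samples."""
    samples_per_frame = SAMPLE_RATE // FRAMES_PER_SECOND
    return np.repeat(tokens / CODEBOOK_SIZE - 0.5, samples_per_frame)

# 1. Three seconds of enrollment audio from the target speaker (silence here).
enrollment_wav = np.zeros(3 * SAMPLE_RATE)
acoustic_prompt = toy_codec_encode(enrollment_wav)

# 2. The text to be spoken in that speaker's voice.
phonemes = toy_text_to_phoneme_ids("We have to reduce the number of plastic bags.")

# 3. Generate codec tokens conditioned on text plus acoustic prompt, then decode.
lm = ToyAcousticLM()
codec_tokens = lm.generate(phonemes, acoustic_prompt, n_frames=3 * FRAMES_PER_SECOND)
synthesized_wav = toy_codec_decode(codec_tokens)
print(f"synthesized {len(synthesized_wav) / SAMPLE_RATE:.2f} seconds of audio")
```

The point the sketch tries to convey is that the enrollment recording is treated as just another stretch of tokens in the prompt. That, per the paper, is how an unseen speaker’s voice, emotion and acoustic environment can carry over into the synthesized output.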

Booth signage for Microsoft Corporation is displayed at CES 2023 at the Las Vegas Convention Center on January 6, 2023 in Las Vegas, Nevada.
(Photo by David Becker/Getty Images)

Vall-E samples shared on GitHub sound remarkably close to the speaker prompts they are based on, although they vary in quality.

In one synthesized sentence based on the Emotional Voices Database, Vall-E sleepily says: “We have to reduce the number of plastic bags.”

Microsoft’s new language model, Vall-E, can reportedly imitate any voice using just a three-second sample recording.
(iStock)

However, research into text-to-speech AI comes with a caveat.

“Because Vall-E could synthesize speech that maintains speaker identity, it may carry potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker,” the researchers said on the project page. “We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. When the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.”

Corporate signage is seen at the Microsoft India Development Center of Microsoft Corporation in Noida, India, Friday, November 11, 2022.
(Photographer: Prakash Singh/Bloomberg via Getty Images)

Currently, Vall-E, which Microsoft calls a “neural codec language model,” is not available to the public.


