Like phonetician Henry Higgins in Shaw’s play Pygmalion, Marius Kotescu and Georgy Tinchev recently showed how their students struggled with pronunciation difficulties.
The two data scientists working at Amazon in Europe are teaching the company’s digital assistant, Alexa. Their mission: Help Alexa master Irish-accented English with the help of artificial intelligence and recordings from native speakers.
During the demo, Alexa talked about a night to remember. “The party last night was crazy,” Alexa said briskly, using the Irish word for “fun.” “We bought ice cream on the way home and we’re happy.”
Mr. Tincheff shook his head. Alexa dropped the “r” in “party,” making the word sound flat, like pah-tee. Too British, he concluded.
These technologists are part of a team at Amazon working on a challenging area of data science called speech disentanglement. It’s a thorny problem that’s gaining new relevance amid a wave of AI development, and researchers believe speech and technical difficulties could help AI-powered devices, robots, and speech synthesizers be more conversational—that is, Enables a variety of regional dialogues. accent.
Solving phonics requires more than mastering vocabulary and grammar. The speaker’s pitch, timbre, and accent often give words subtle meaning and emotional weight. Linguists call this characteristic of language “prosody,” which is difficult for machines to grasp.
Only in the last few years, thanks to advances in artificial intelligence, computer chips and other hardware, have researchers made strides in solving the problem of speech disentanglement, turning computer-generated speech into something more pleasing to the ear.
Such work could eventually merge with an explosion of “generative artificial intelligence,” the technique that enables chatbots to generate their own responses, the researchers say. Chatbots like ChatGPT and Bard may one day act exactly on the user’s voice commands and respond verbally. Meanwhile, voice assistants such as Alexa and Apple’s Siri will become more conversational, analysts say, potentially reigniting consumer interest in a seemingly stagnant tech sector.
Getting voice assistants like Alexa, Siri and Google Assistant to speak multiple languages is an expensive and lengthy process. Tech companies hire voice actors to record hundreds of hours of speech, which helps create synthetic voices for digital assistants. Advanced artificial intelligence systems known as “text-to-speech models” – because they convert text into natural-sounding synthetic speech – just started to streamline this process.
The technology “is now capable of creating human voices and synthetic audio from text input in different languages, accents and dialects,” said Marion Labour, senior strategist at Deutsche Bank Research.
Amazon has been under pressure to catch up to rivals such as Microsoft and Google in the artificial intelligence race. In April, Amazon CEO Andy Jassy tell a wall street analyst The company plans to make Alexa “more proactive and conversational” with the help of sophisticated generative artificial intelligence Rohit Prasad, Amazon’s chief scientist for Alexa told CNBC In May, he saw voice assistants as voice-enabled “personal artificial intelligence out of the box”
After nine months of training to understand Irish accents and speak them, Irish Alexa made its commercial debut in November.
“Accents are not the same as languages,” Prasad said in an interview. AI technologies must learn to separate accents from other parts of speech, such as tone and frequency, before they can replicate the properties of local dialects—for example, “a”s may be flatter and “t’s” pronounced more forcefully.
These systems have to figure out these patterns, “so you can synthesize a whole new accent,” he said. “It’s hard.”
The more difficult task remains getting the technology to learn a new accent from a different speech model. That’s what Mr Cotescu’s team tried when they built Alexa in Ireland. They relied heavily on existing speech models for mostly British English accents (with a much smaller range of American, Canadian, and Australian accents) to train it to speak Irish English.
The team tackles a wide range of language challenges in Irish English. For example, the Irish tend to drop the “h” in “th” and pronounce those letters as a hard “t” or “d,” making “bath” sound like “bat,” or even “bad.” Irish English also has a gill, which means the “r” is pronounced too much. This means the “r” in “party” will be more pronounced than what you’d hear from a Londoner. Alexa must learn and master these speech characteristics.
Kotescu, who is Romanian and lead researcher on the Alexa team in Ireland, said Irish English was “difficult”.
The speech models that underpin Alexa’s language skills have become increasingly advanced in recent years. In 2020, Amazon researchers will teach Alexa speak fluent spanish From English-speaking models.
Mr. Cotescu and team see accents as the next frontier for Alexa’s voice capabilities. Their Irish Alexa relies more on artificial intelligence than actors to model its voice. So Irish Alexa was trained on a relatively small corpus—about 24 hours of recordings of voice actors reciting 2,000 sentences in Irish-accented English.
At first, when Amazon researchers fed Irish recordings into a still-learning Irish Alexa, something strange happened.
Letters and syllables occasionally disappear from the response. The “S” sometimes sticks together. A word or two, sometimes crucial words, are inexplicably slurred and incomprehensible. In at least one instance, Alexa’s female voice dropped a few octaves to sound more masculine. To make matters worse, the male voice sounds distinctly British, a silliness that might surprise some Irish families.
“They’re big black boxes,” Mr. Tinchev, a Bulgarian and Amazon’s chief scientist on the project, said of the speech models. “You have to do a lot of experimentation to tune them.”
That’s what techies did to fix Alexa’s “party” gaffe. They untangle the speech word by word and phoneme (the smallest audible fragment of a word) to pinpoint and fine-tune Alexa’s blunders. They then fed more recorded speech data to Irish Alexa’s speech model to correct pronunciation errors.
Result: The “r” in “party” returns. But then the “p” disappears.
So data scientists go through the same process again. They eventually focused on the phoneme containing the missing “p”. They then fine-tuned the model further so that the “p” sound returned and the “r” didn’t disappear. Alexa has finally learned to talk like a Dubliner.
Two Irish linguists – Elaine Vaughan, who teaches at the University of Limerick, and Kate Tallon, a doctoral student who works in the Phonetics and Speech Laboratory at Trinity College Dublin – — has since given high praise to Irish Alexa’s accent. The way Irish Alexa emphasizes “r’s” and softens “ts” stands out, they say, and Amazon handles the accent correctly overall.
“It sounded very real to me,” Ms. Tarron said.
Amazon researchers said they were pleased with the mostly positive feedback. Their speech model eliminated the Irish accent so quickly it left them hoping to replicate it elsewhere.
“We also plan to extend our approach to accents of languages other than English,” they wrote in a report. January Research Papers About the Irish Alexa Project.