We’ve met the face behind the U.K.’s male version of Apple’s Siri, but what about the iconic voice of the U.S.’s female Siri? She remains a mystery, but the Verge recently explored how such a voice is created for a text-to-speech product.
It starts with an actor like 37-year-old September Day:
For six to seven hours a day, for eight days, Day read passages from Alice in Wonderland, bits of news off the AP wire, and sometimes random sentences, sitting as still in her chair as possible. She read hundreds of numbers, in different cadences. “One! One. One? Two! Two. Two?”
“It was like the Ironman of VO,” says Day. “I had not experienced anything like that. I am the queen of the 30–60 second TV spot. That’s my safe place.” She had to take a break after the fourth day, because she had gone hoarse. But then Day soldiered on, and became the voice of many a breezy beach read.
We don’t know exactly who created the female Siri, but the Verge reported that the voice is believed to have come out of the company Nuance.
J. Brant Ward, the senior director of advanced speech design and development with the company, and David Vazquez, senior design lead, won’t answer any questions regarding Siri, but they did give the Verge a glimpse into the process of recording words and phrases that can be stitched together to answer requests:
“Just say you want to know where the nearest florist is,” Ward says. “Well, there are 27 million businesses in this country alone. You’re not going to be able to record every single one of them.”
“It’s about finding short cuts,” says Vazquez, a trim, bearded man who exudes a laid-back joviality. He rifles through a packet of stapled together papers that contains a script. It doesn’t look like a script in the Hamlet sense of the word, but rather, an Excel-type grid containing weird sentences.
Scratching the collar of my neck, where humans once had gills.
Most of the sentences are chosen, says Vazquez, because they are “phonetically rich”: that is, they contain lots of different combinations of phonemes. Phonemes are the acoustic building blocks of language, e.g., the “K” sound in “cat”.
“The sentences are sort of like tongue twisters,” says Vazquez. Later, a linguist on his team objects to his use of this expression, and calls them “non sequiturs.”
“The point is, the more data we have, the more lifelike it’s going to be,” says Ward. The sentences, while devoid of contextual meaning, are packed with data.
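To make the idea of a “phonetically rich” script concrete, here is a minimal sketch of how sentences might be picked so that a short script covers many different phoneme combinations. It is not Nuance’s actual method: the tiny `LEXICON`, the `diphones` helper, and the greedy `pick_script` routine are all hypothetical, invented for illustration, and a real system would rely on a full pronunciation dictionary and far more candidate sentences.

```python
# Toy pronunciation lexicon (hypothetical; real systems use a full dictionary).
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "kit": ["K", "IH", "T"],
    "sits": ["S", "IH", "T", "S"],
    "task": ["T", "AE", "S", "K"],
}


def diphones(sentence):
    """Return the set of adjacent phoneme pairs found in a sentence."""
    phones = [p for word in sentence.lower().split() for p in LEXICON[word]]
    return set(zip(phones, phones[1:]))


def pick_script(candidates, budget):
    """Greedily pick up to `budget` sentences, each adding the most new pairs."""
    covered, script = set(), []
    pool = list(candidates)
    while pool and len(script) < budget:
        # Choose the sentence that contributes the most not-yet-covered pairs.
        best = max(pool, key=lambda s: len(diphones(s) - covered))
        gain = diphones(best) - covered
        if not gain:
            break  # nothing left adds new phoneme combinations
        script.append(best)
        covered |= gain
        pool.remove(best)
    return script, covered


if __name__ == "__main__":
    script, covered = pick_script(
        ["the cat sat", "the cat", "kit sits task"], budget=3
    )
    print(script, len(covered))
```

In this toy run, “the cat” is never selected, because every phoneme pair it contains is already supplied by “the cat sat.” That is the sense in which an odd-sounding sentence can be “packed with data.”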
After an actor has recorded enough speech — a process that could take months — the Nuance system, and others like it, searches recorded sounds, combining them into answers to your burning questions or requests.
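The “search and combine” step can be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of the Nuance system: the `Unit` record, the pitch-difference `join_cost`, and the `select_units` search are hypothetical stand-ins. The sketch assumes each recorded snippet is tagged with a phoneme and an average pitch, and that a smooth-sounding result is one where neighboring snippets have similar pitch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded snippet (hypothetical fields, for illustration only)."""
    phoneme: str
    pitch_hz: float
    clip_id: str


def join_cost(a, b):
    """Assumed smoothness measure: pitch jump between neighboring snippets."""
    return abs(a.pitch_hz - b.pitch_hz)


def select_units(target, inventory):
    """Pick one snippet per target phoneme, minimizing the total join cost.

    Uses dynamic programming: for each candidate snippet, keep only the
    cheapest path that ends in it. Assumes `target` is non-empty and that
    every phoneme in it has at least one snippet in `inventory`.
    """
    best = {u: (0.0, [u]) for u in inventory[target[0]]}
    for phoneme in target[1:]:
        nxt = {}
        for u in inventory[phoneme]:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best.values()),
                key=lambda t: t[0],
            )
            nxt[u] = (cost, path + [u])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


if __name__ == "__main__":
    inventory = {
        "K": [Unit("K", 120, "k1"), Unit("K", 200, "k2")],
        "AE": [Unit("AE", 190, "ae1"), Unit("AE", 125, "ae2")],
        "T": [Unit("T", 130, "t1")],
    }
    cost, path = select_units(["K", "AE", "T"], inventory)
    print(cost, [u.clip_id for u in path])
```

The more snippets there are for each sound, the better the chance of finding a sequence that joins smoothly, which is one way to read Ward’s remark that more data makes the voice more lifelike.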
The challenge is to make the user forget they’re talking to a computer.
“My kids interact with Siri like she’s a sentient being,” Ward told the Verge. “They ask her to find stuff for them. They don’t know the difference.”
Be sure to check out the Verge’s full article for even more details about creating a voice like Siri and where the text-to-speech industry is going from here.