Microsoft is teaming up with a research institute affiliated with the University of California at Berkeley to turn the tone, pacing and pitch of spoken words into information that will improve human interaction with applications.
They are hoping to develop software that can recognize the meaning of intonation and the rhythms of speech to make talking to computers seem more natural, says Elizabeth Shriberg, principal scientist with the Conversational Systems Lab (CSL) at Microsoft. She is also an external fellow at the International Computer Science Institute (ICSI).
The software would have to capture this information, which humans understand naturally, and communicate it to applications accessed through a voice interface, Shriberg says.
This verbal communication that falls outside the strict meaning of the words used is called prosody, and can be used to clarify ambiguous speech, says Andreas Stolcke, also a principal scientist with Microsoft's CSL and an ICSI external fellow. So if two people are trying to set up a meeting and one says, "Today," the inflection could clarify whether today is a suggestion (why not meet today?) or an assertion (we must meet today), he says.
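As a rough illustration of the "Today" example, one simple prosodic cue is whether pitch rises or falls over the course of a word. The sketch below fits a straight line to a pitch track and reports its slope. It is not the researchers' method; the function name `pitch_slope` and the sample pitch values are invented for illustration, and a real system would combine many cues rather than rely on one.

```python
def pitch_slope(times, pitches):
    """Least-squares slope of pitch (Hz) over time (seconds), in Hz per second."""
    n = len(times)
    if n < 2 or n != len(pitches):
        raise ValueError("need at least two matching time/pitch samples")
    mean_t = sum(times) / n
    mean_p = sum(pitches) / n
    # Covariance of time and pitch divided by variance of time.
    num = sum((t - mean_t) * (p - mean_p) for t, p in zip(times, pitches))
    den = sum((t - mean_t) ** 2 for t in times)
    if den == 0:
        raise ValueError("time samples must not all be identical")
    return num / den


# Invented pitch tracks for the single word "Today" (values are illustrative only).
sample_times = [0.0, 0.1, 0.2, 0.3]
rising = pitch_slope(sample_times, [180, 190, 205, 220])   # pitch climbs
falling = pitch_slope(sample_times, [220, 205, 190, 180])  # pitch drops

print("rising contour slope:", rising)
print("falling contour slope:", falling)
```

A rising contour often accompanies a question or suggestion and a falling one a statement, but that mapping is a tendency rather than a rule, so the slope is better treated as one input feature than as a decision on its own.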
Prosody can also indicate whether a speaker is angry, rushed or confused, and whether a pause marks the end of a sentence or is being used for emphasis, Shriberg says.
Using prosody as a source of information for applications has never succeeded, says Roberto Pieraccini, the director of ICSI. But it could improve interactions between people and machines, he says.
While Microsoft won't let the researchers talk about what specific purpose it has in mind for the work, Shriberg says that the more information an application gets about the state of the user, the more effectively it can interact with the user. So an application's visual display could be modified if the user's speech indicates the person is confused, for example.
In gaming, speech that indicates a player is frustrated might lead the game to prompt the player with a help display, says Stolcke.
The technology they are working on might also be used to help subdivide data. Given a database of spoken words, it might be able to sort the recordings by the mode of speech the speaker used, Shriberg says. For example, was it cryptic or flowery?
The distinctions that can be made with prosody are inexact and don't lend themselves to rules, she says. They are better sorted using statistics and trends, by comparing speech against stored models of what indicates a particular characteristic.
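One way to picture this kind of statistical comparison against stored models is sketched below. Each hypothetical speaker state is stored as a mean and standard deviation for a few prosodic features, and a new utterance is assigned to whichever model makes it most likely. The feature names, the states and every number here are assumptions made up for illustration; the article does not describe the models Microsoft and ICSI actually use.

```python
import math

# Hypothetical stored models: (mean, standard deviation) per feature.
# All values are invented for illustration, not measured data.
MODELS = {
    "calm": {
        "pitch_hz": (180.0, 20.0),
        "syllables_per_sec": (4.0, 0.8),
        "pause_sec": (0.4, 0.2),
    },
    "rushed": {
        "pitch_hz": (210.0, 25.0),
        "syllables_per_sec": (6.5, 1.0),
        "pause_sec": (0.15, 0.1),
    },
    "confused": {
        "pitch_hz": (195.0, 30.0),
        "syllables_per_sec": (3.0, 0.9),
        "pause_sec": (0.9, 0.3),
    },
}


def log_likelihood(features, model):
    """Log-likelihood of the features under independent Gaussians."""
    total = 0.0
    for name, (mean, std) in model.items():
        x = features[name]
        total += -0.5 * math.log(2 * math.pi * std ** 2)
        total -= (x - mean) ** 2 / (2 * std ** 2)
    return total


def classify(features):
    """Return the best-matching state and the score for every state."""
    scores = {label: log_likelihood(features, m) for label, m in MODELS.items()}
    best = max(scores, key=scores.get)
    return best, scores


# A fast, high-pitched utterance with short pauses (invented measurements).
utterance = {"pitch_hz": 215.0, "syllables_per_sec": 6.2, "pause_sec": 0.1}
label, scores = classify(utterance)
print(label, scores)
```

With these invented numbers the utterance scores highest against the "rushed" model. Because the output is a set of scores rather than a yes-or-no rule, an application could also act only when one state clearly outscores the others.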
Microsoft's Conversational Systems Lab works on new ways to use speech, natural language text and gestures to interact with computers and devices including phones and gaming consoles.