Many interactive voice response (IVR) applications let users listen to computers and respond by pressing the buttons on touch-tone phones. However, callers often get lost traversing long, time-consuming sequences of menus. It's also difficult for callers to juggle listening with searching for the right buttons to press on the small keypads of their cell phones. What's needed are IVR user interfaces that let users listen and speak to computers.
VoiceXML 2.0 is a markup language for building speech interfaces - the voice equivalent of HTML. A voice browser is like a Web browser - it interprets VoiceXML 2.0 scripts to present spoken information to users and accept spoken requests from them.
The World Wide Web Consortium last week made VoiceXML 2.0 a full recommendation, which is commonly understood to be a Web standard. The standard adds a speech-recognition grammar format - for words and phrases that users can speak in response to prompts - that was not included in the previous version.
Because telephones, including many cell phones, don't have the computing power to host a voice browser, the voice browser resides on the network in a speech server. The speech server may be located in a corporate data center or off-site at a hosting provider.
Users dial a speech server, which downloads VoiceXML 2.0 scripts, grammars and audio files from an application server.
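To make the download step concrete, here is a minimal Python sketch that lists the resources a voice browser would need to fetch for one script. The script itself, the file names (`departments.grxml`, `welcome.wav`) and the `apps.example.com` application server are assumptions made up for illustration, and `xml.etree.ElementTree` merely stands in for a real voice browser's parser.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical script; a real one would be served by the application server.
SCRIPT_URL = "http://apps.example.com/ajax/welcome.vxml"
SCRIPT = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="route">
    <field name="department">
      <prompt><audio src="audio/welcome.wav"/></prompt>
      <grammar src="grammars/departments.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>"""


def referenced_resources(script_text, script_url):
    """Return absolute URLs of the grammars and audio files a script refers to."""
    root = ET.fromstring(script_text)
    urls = []
    for tag in ("grammar", "audio"):
        for element in root.iter(f"{{{VXML_NS}}}{tag}"):
            src = element.get("src")
            if src:  # inline grammars and synthesized prompts have no src
                urls.append(urljoin(script_url, src))
    return urls


if __name__ == "__main__":
    for url in referenced_resources(SCRIPT, SCRIPT_URL):
        print(url)
```

Relative references are resolved against the script's own URL, much as a Web browser resolves image links on an HTML page.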
The voice browser interprets the VoiceXML 2.0 script by presenting users with a voice message, such as:
System: "Welcome to Ajax. Do you want to speak with sales, accounting or repairs?"
The voice message could be prerecorded voice or text that is routed through a text-to-speech synthesizer.
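The choice between a recording and synthesized speech can be sketched in Python as follows. VoiceXML 2.0 lets text placed inside an `<audio>` element serve as a fallback when the recording cannot be played; the `welcome.wav` file name, the `plan_prompt` function and the set of available recordings are hypothetical stand-ins for what a speech server would actually track.

```python
import xml.etree.ElementTree as ET

NS = {"v": "http://www.w3.org/2001/vxml"}

# Hypothetical prompt: play a recording if possible, otherwise synthesize the text.
PROMPT = """<prompt xmlns="http://www.w3.org/2001/vxml">
  <audio src="welcome.wav">Welcome to Ajax. Do you want to speak with sales, accounting or repairs?</audio>
</prompt>"""


def plan_prompt(prompt_xml, available_recordings):
    """Decide how to render a prompt: ("recording", file) or ("tts", text)."""
    prompt = ET.fromstring(prompt_xml)
    audio = prompt.find("v:audio", NS)
    if audio is not None:
        if audio.get("src") in available_recordings:
            return ("recording", audio.get("src"))
        # Recording missing: fall back to the text enclosed by <audio>.
        return ("tts", (audio.text or "").strip())
    # No <audio> element: the prompt text itself goes to the synthesizer.
    return ("tts", (prompt.text or "").strip())


if __name__ == "__main__":
    print(plan_prompt(PROMPT, {"welcome.wav"}))
    print(plan_prompt(PROMPT, set()))
```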
The voice browser invokes an automatic speech recognizer (ASR), which uses the grammar to recognize the words users speak:
User: "Repairs."
The ASR recognizes the user's spoken response. In this case the grammar consists of only three words: "sales," "accounting" and "repairs." Because it has so few candidates to choose among, this type of grammar-driven ASR performs more accurately than dictation ASRs, which attempt to recognize most of the words in English or whatever language a user is speaking.
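Why a small grammar helps can be illustrated with a toy Python sketch. A real ASR scores audio against acoustic models; here plain string similarity from the standard `difflib` module stands in for that scoring, and the `recognize` function and its 0.6 cutoff are assumptions chosen for the example.

```python
import difflib
import string

# The three-word grammar from the example dialog.
GRAMMAR = ["sales", "accounting", "repairs"]


def recognize(hypothesis, grammar=GRAMMAR, cutoff=0.6):
    """Map a noisy text hypothesis onto the grammar, or return None for no match."""
    cleaned = hypothesis.lower().strip(string.punctuation + string.whitespace)
    matches = difflib.get_close_matches(cleaned, grammar, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for heard in ("Repairs.", "repars", "billing"):
        print(heard, "->", recognize(heard))
```

With only three candidates, a slightly garbled input still snaps to the intended word, while anything outside the grammar is rejected rather than guessed at.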
Sometimes, users might respond by pressing keys instead, which sends dual-tone multi-frequency (DTMF) signals. DTMF is useful in noisy environments or when the user wants to reply confidentially.
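As a rough Python illustration, each key on a standard DTMF keypad is signaled by one row tone combined with one column tone. The frequency tables below are the standard DTMF values; the `MENU` mapping of keys to departments is a hypothetical assignment, since the example dialog does not say which key stands for which choice.

```python
# Standard DTMF keypad: each key is one row tone plus one column tone (hertz).
ROW_HZ = (697, 770, 852, 941)
COLUMN_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")

# Hypothetical menu: which key stands for which spoken choice.
MENU = {"1": "sales", "2": "accounting", "3": "repairs"}


def dtmf_frequencies(key):
    """Return the (row, column) tone pair in hertz for a keypad key."""
    for row_index, row in enumerate(KEYPAD):
        column_index = row.find(key)
        if len(key) == 1 and column_index != -1:
            return ROW_HZ[row_index], COLUMN_HZ[column_index]
    raise ValueError(f"not a DTMF key: {key!r}")


def route_keypress(key):
    """Translate a key press into the same result a spoken answer would give."""
    return MENU.get(key)  # None plays the role of a no-match


if __name__ == "__main__":
    print(dtmf_frequencies("3"), route_keypress("3"))
```

Routing a key press to the same value as the spoken word lets the rest of the dialog stay unaware of which input mode the caller used.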