VoiceXML lets you talk to computers
By James Larson, Network World, 03/22/2004
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
Many interactive voice response (IVR) applications let users listen to computers and respond by pressing the buttons
on touchtone phones. However, callers often get lost traversing long, time-consuming sequences of menus. It's also difficult
for callers to juggle listening with searching for the right buttons to press on the small keypads of their cell phones.
What's needed are IVR user interfaces that let users both listen and speak to computers.
VoiceXML 2.0 is a markup language for building speech interfaces - the voice equivalent of HTML. A voice browser is like a
Web browser - it interprets VoiceXML 2.0 scripts to present spoken information to users and accept spoken requests from them.
The World Wide Web Consortium last week made VoiceXML 2.0 a full recommendation, which is commonly understood to be a Web
standard. The standard adds a speech-recognition grammar format - for words and phrases that users can speak in response to
prompts - that was not included in the previous version.
Call components
Because telephones, including many cell phones, don't have the computation capability to host a voice browser, the voice browser
resides on the network in a speech server. The speech server may be located in a corporate data center or off-site at a hosting
provider.
Users dial a speech server, which downloads VoiceXML 2.0 scripts, grammars and audio files from an application server.
The voice browser interprets the VoiceXML 2.0 script by presenting users with a voice message, such as:
System: "Welcome to Ajax. Do you want to speak with sales, accounting or repairs?"
The voice message could be prerecorded voice or text that is routed through a text-to-speech synthesizer.
The voice browser invokes an automatic speech recognizer (ASR), which uses the grammar to recognize the words users speak:
User: "Repairs."
The ASR recognizes the user's spoken response. In this case the grammar consists of only three words: "sales," "accounting"
and "repairs." This type of grammar-driven ASR performs more accurately than dictation ASRs, which attempt to recognize most
of the words in English or whatever language a user is speaking.
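
The exchange above could be driven by a very small VoiceXML 2.0 document. The sketch below shows one way an application server might generate that document in Python; it is an illustration, not the approach of any particular product. The function name build_menu and the target files (sales.vxml and so on) are assumptions made for this example. In a VoiceXML menu, the text of each choice element supplies a phrase for the menu's grammar, so no separate grammar file is needed for a three-word vocabulary.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"


def build_menu(prompt_text, choices):
    """Return a VoiceXML 2.0 menu document as a string.

    choices maps each spoken phrase to the URI of the dialog to run next.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    menu = ET.SubElement(vxml, "menu", {"id": "main"})
    ET.SubElement(menu, "prompt").text = prompt_text
    for phrase, target in choices.items():
        # Each <choice> contributes one phrase to the menu's grammar.
        choice = ET.SubElement(menu, "choice", {"next": target})
        choice.text = phrase
    # Re-prompt when the caller says nothing or says something
    # that is not in the grammar.
    for event in ("noinput", "nomatch"):
        handler = ET.SubElement(menu, event)
        ET.SubElement(handler, "reprompt")
    return ET.tostring(vxml, encoding="unicode")


script = build_menu(
    "Welcome to Ajax. Do you want to speak with sales, accounting or repairs?",
    {
        # Hypothetical target documents, one per department.
        "sales": "sales.vxml",
        "accounting": "accounting.vxml",
        "repairs": "repairs.vxml",
    },
)
print(script)
```

The voice browser, not this code, plays the prompt and runs the recognizer; the application server only has to hand back the markup.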
Sometimes, users might respond by pressing keys on the phone, which sends dual-tone multi-frequency (DTMF) signals. DTMF is
useful in noisy environments or when the user wants to reply confidentially.
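
In VoiceXML 2.0 a menu choice can carry a dtmf attribute alongside its spoken phrase, so either kind of response selects the same option. The Python sketch below imitates that behavior outside a voice browser to show the idea. The digit assignments and the resolve_input helper are assumptions made for this example; a real platform performs this matching itself.

```python
# Hypothetical mapping: each menu choice has a spoken phrase and a keypad digit.
CHOICES = {
    "sales": "1",
    "accounting": "2",
    "repairs": "3",
}


def resolve_input(user_input):
    """Map a recognized word or a DTMF digit to a menu choice, else None."""
    token = user_input.strip().lower()
    if token in CHOICES:
        # Spoken response that is in the grammar.
        return token
    for phrase, digit in CHOICES.items():
        # Keypad response: the digit selects the same choice.
        if token == digit:
            return phrase
    # Neither a known phrase nor a known digit: the caller is re-prompted.
    return None


for response in ("Repairs", "3", "billing"):
    print(response, "->", resolve_input(response))
```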
The voice browser continues processing the VoiceXML 2.0 script, perhaps performing additional conversational turns, invoking
an application-specific function or accessing information in a database.
With VoiceXML 2.0, developers can create speech-enabled applications by specifying high-level menus and forms rather than
procedural program code. This frees up more time for developers to test the application's usability and refine its design.
Giving voice to new apps
Developers use VoiceXML 2.0 to provide a telephone-user interface for many types of applications and information, including
time-sensitive data, business data and personal information. These applications let users access enterprise data wherever
they are and whenever they need it by simply dialing in from any phone, identifying themselves and asking for the desired
information. Customers also can use these systems to access data such as order status and catalog, delivery and account information.