by James Larson, special to Network World

VoiceXML lets you talk to computers

Mar 22, 2004
Programming Languages

Many interactive voice response (IVR) applications let users listen to computers and respond by pressing buttons on touch-tone phones. However, callers often get lost traversing long, time-consuming sequences of menus. It’s also difficult for callers to juggle listening with searching for the right buttons on the small keypads of their cell phones. What’s needed are IVR user interfaces that let users listen and speak to computers.

VoiceXML 2.0 is a markup language for building speech interfaces – the voice equivalent of HTML. A voice browser is like a Web browser – it interprets VoiceXML 2.0 scripts to present spoken information to users and accept spoken requests from them.

The World Wide Web Consortium last week made VoiceXML 2.0 a full recommendation, which is commonly understood to be a Web standard. The standard adds a speech-recognition grammar format – for words and phrases that users can speak in response to prompts – that was not included in previous versions.

Call components

Because telephones, including many cell phones, don’t have the computation capability to host a voice browser, the voice browser resides on the network in a speech server. The speech server may be located in a corporate data center or off-site at a hosting provider.

Users dial a speech server, which downloads VoiceXML 2.0 scripts, grammars and audio files from an application server.

The voice browser interprets the VoiceXML 2.0 script by presenting users with a voice message, such as:

System: “Welcome to Ajax. Do you want to speak with sales, accounting or repairs?”

The voice message could be prerecorded voice or text that is routed through a text-to-speech synthesizer.

The voice browser invokes an automatic speech recognizer (ASR), which uses the grammar to recognize the words users speak:

User: “Repairs.”

The ASR recognizes the user’s spoken response. In this case the grammar consists of only three words: “sales,” “accounting” and “repairs.” Because it has to choose among so few candidates, this type of grammar-driven ASR performs more accurately than dictation ASRs, which attempt to recognize most of the words in English or whatever language a user is speaking.
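As a rough illustration of the exchange above, the sketch below embeds a hypothetical VoiceXML 2.0 script for the Ajax greeting in a short Python program. The element names (form, field, prompt, grammar, filled) come from the VoiceXML 2.0 and speech-grammar specifications, but the identifiers `route_call`, `department` and `dept` are invented for this example. The `recognize` function is only a toy stand-in for a real ASR: it matches text rather than audio, simply to show how a three-word grammar limits what the recognizer will accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML 2.0 script for the Ajax greeting.
# The ids "route_call", "department" and "dept" are assumptions for illustration.
VXML = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="route_call">
    <field name="department">
      <prompt>
        Welcome to Ajax. Do you want to speak with sales, accounting or repairs?
      </prompt>
      <grammar type="application/srgs+xml" version="1.0" root="dept" mode="voice">
        <rule id="dept">
          <one-of>
            <item>sales</item>
            <item>accounting</item>
            <item>repairs</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>Connecting you to <value expr="department"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
"""

# Every element inherits the default VoiceXML namespace declared on <vxml>.
NS = "{http://www.w3.org/2001/vxml}"


def grammar_phrases(vxml_text):
    """Return the phrases listed as <item> elements in the script's grammar."""
    root = ET.fromstring(vxml_text)
    return [item.text.strip().lower() for item in root.iter(NS + "item")]


def recognize(utterance, phrases):
    """Toy stand-in for a grammar-driven ASR: accept only in-grammar phrases."""
    cleaned = utterance.strip().strip(".!?").lower()
    return cleaned if cleaned in phrases else None


phrases = grammar_phrases(VXML)
print(phrases)                         # the three words the caller may say
print(recognize("Repairs.", phrases))  # in the grammar, so it is accepted
print(recognize("Billing", phrases))   # not in the grammar, so it is rejected
```

A real voice browser would hand the grammar to the speech recognizer and respond to an out-of-grammar reply with its no-match handling; the text comparison here only mirrors that accept-or-reject behavior.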

Sometimes, users might respond with dual-tone multifrequency (DTMF) signals, the tones a phone generates when its keys are pressed. DTMF is useful in noisy environments or when the user wants to reply confidentially.
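For readers unfamiliar with touch-tone signaling, the short Python sketch below shows why it is called dual-tone: each key press sends one low-frequency row tone and one high-frequency column tone at the same time. The frequencies are the standard ones for a 12-key keypad; the function name `dtmf_tones` is simply a label chosen for this example.

```python
# Standard DTMF frequencies for a 12-key telephone keypad, in hertz.
ROW_HZ = (697, 770, 852, 941)    # one low tone per keypad row
COL_HZ = (1209, 1336, 1477)      # one high tone per keypad column
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_tones(key):
    """Return the (row, column) tone pair sent when a caller presses a key."""
    if len(key) != 1:
        raise ValueError("expected a single keypad character")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key)
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"{key!r} is not on a 12-key keypad")


print(dtmf_tones("3"))  # (697, 1477)
print(dtmf_tones("0"))  # (941, 1336)
```

The speech server decodes these tone pairs back into digits, so a key press arrives at the voice browser as unambiguous input even when background noise would confuse a speech recognizer.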

The voice browser continues processing the VoiceXML 2.0 script, perhaps performing additional conversational turns, invoking an application-specific function or accessing information in a database.

With VoiceXML 2.0, developers can create speech-enabled applications by specifying high-level menus and forms rather than procedural program code. This frees up more time for developers to test the application’s usability and refine its design.
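To give a sense of what "high-level menus" means in practice, here is a sketch of how the Ajax greeting might be written as a VoiceXML 2.0 menu, wrapped in a few lines of Python that read the routing out of the markup. The menu, choice, noinput and nomatch elements are part of VoiceXML 2.0; the dialog names (`#sales` and so on), the key assignments and the `routing_table` helper are assumptions made for this example, and the target dialogs themselves are omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical VoiceXML 2.0 menu. The targets (#sales, #accounting, #repairs)
# would be dialogs defined elsewhere in the same script; they are omitted here.
MENU = """<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <menu id="main">
    <prompt>
      Welcome to Ajax. Do you want to speak with sales, accounting or repairs?
    </prompt>
    <choice dtmf="1" next="#sales">sales</choice>
    <choice dtmf="2" next="#accounting">accounting</choice>
    <choice dtmf="3" next="#repairs">repairs</choice>
    <noinput>Sorry, I did not hear you. <reprompt/></noinput>
    <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
  </menu>
</vxml>
"""

NS = "{http://www.w3.org/2001/vxml}"


def routing_table(vxml_text):
    """Map each spoken phrase and each key press to the dialog it selects."""
    root = ET.fromstring(vxml_text)
    table = {}
    for choice in root.iter(NS + "choice"):
        target = choice.get("next")
        table[choice.text.strip().lower()] = target
        if choice.get("dtmf"):
            table[choice.get("dtmf")] = target
    return table


table = routing_table(MENU)
print(table["repairs"])  # selected by saying "repairs"
print(table["3"])        # selected by pressing 3
```

The script declares the choices and the voice browser supplies the loop: it plays the prompt, listens, reprompts on silence or an unrecognized reply, and moves to the chosen dialog, with no control flow written by the developer.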

Giving voice to new apps

Developers use VoiceXML 2.0 to provide a telephone-user interface for many types of applications and information, including time-sensitive data, business data and personal information. These applications let users access enterprise data wherever they are and whenever they need it by simply dialing in from any phone, identifying themselves and asking for the desired information. Customers also can use these systems to access data such as order status and catalog, delivery and account information.

There are many more phones than PCs in the world, so many more users can access the World Wide Web using telephones. Now people can talk and listen to computers at any time from any place.

Larson is manager of Advanced Human Input/Output for Intel. He can be reached at