Speech technology is evolving to the point where an exchange of information between a person and a computer is becoming more like a real conversation. Many factors are responsible for this, ranging from an exponential increase in computing power to a general advancement of basic speech technology and user interface design.
Speech-based applications deployed to date have been based on code created by a few speech software vendors. VoiceXML will likely change this landscape by virtue of its promised vendor independence in creating speech applications.
VoiceXML is the emerging standard for speech-enabled applications. It defines how a dialog is constructed and executed between a caller and a computer running speech recognition and/or text-to-speech software.
VoiceXML incorporates the flexibility to create speech-enabled Web-based content or to build telephony-based speech recognition call center applications.
Specifically, VoiceXML outlines a common language to follow when programming a speech application. Many of its rules are expressed as tags, much like those used in HTML. Tags denote actions for creating a dialog between a human voice and a speech recognition system.
How it works
One example of a VoiceXML tag is one that queues audio output. The main components of a VoiceXML-based service include tags, forms and rules that define the content, and a speech browser for interpreting and presenting audio content.
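As a rough illustration of a tag that queues audio output, the sketch below holds a small VoiceXML page as a string and uses Python's standard XML parser to list the audio it would queue. The element names (vxml, form, block, prompt, audio) come from the VoiceXML specification; the audio file name, the wording and the helper queued_audio are assumptions made up for this example, not part of any vendor's product.

```python
import xml.etree.ElementTree as ET

# A minimal VoiceXML page held as a string. The element names come from the
# VoiceXML specification; "welcome.wav" and the wording are hypothetical.
VXML_PAGE = """<?xml version="1.0"?>
<vxml version="2.0">
  <form id="welcome">
    <block>
      <prompt>
        <audio src="welcome.wav">Welcome to the flight information line.</audio>
      </prompt>
    </block>
  </form>
</vxml>
"""


def queued_audio(page):
    """Return (audio source, fallback text) pairs in document order.

    The text inside an audio element is what a browser would speak with
    text-to-speech if the recording could not be played.
    """
    root = ET.fromstring(page)
    return [
        (audio.get("src", ""), (audio.text or "").strip())
        for audio in root.iter("audio")
    ]


if __name__ == "__main__":
    for source, fallback in queued_audio(VXML_PAGE):
        print(f"queue {source!r} (fallback text: {fallback!r})")
```

A real speech browser would play the recording to the caller rather than print its name; the point here is only that the page, like HTML, is markup that another program reads and acts on.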
Vocabularies and grammars are the key components that define the input to a speech-enabled page. The vocabulary consists of the words to be recognized by the speech recognition engine. For example, a vocabulary for a flight information system might consist of city names and travel-related words such as "leaving" and "fly." Grammars provide the structure to identify meaningful phrases. A vocabulary and grammar are combined within a speech-enabled application to limit what the recognizer has to listen for, which keeps recognition reasonably efficient for both the caller and the speech recognition processor.
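The division of labor between vocabulary and grammar can be sketched in a few lines of Python. This is a toy under stated assumptions: it works on already-transcribed text rather than audio, the city list and carrier phrases are invented for the flight information example, and a regular expression stands in for a real grammar format. The names VOCABULARY, GRAMMAR and recognize are hypothetical.

```python
import re

# Hypothetical vocabulary for a flight information system:
# city names plus travel-related words.
CITIES = {"boston", "chicago", "denver"}
TRAVEL_WORDS = {"leaving", "fly", "from", "to", "i", "want", "am"}
VOCABULARY = CITIES | TRAVEL_WORDS

# Hypothetical grammar: an optional carrier phrase followed by
# "from <city> to <city>". A regular expression stands in for a real grammar.
_CITY = "|".join(sorted(CITIES))
GRAMMAR = re.compile(
    rf"^(?:i want to fly|i am leaving|fly|leaving)?\s*"
    rf"from (?P<origin>{_CITY}) to (?P<destination>{_CITY})$"
)


def recognize(utterance):
    """Return the origin and destination, or None if the utterance is rejected."""
    words = utterance.lower().split()
    # Vocabulary check: every word must be one the recognizer knows.
    if any(word not in VOCABULARY for word in words):
        return None
    # Grammar check: the known words must also form a meaningful phrase.
    match = GRAMMAR.match(" ".join(words))
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(recognize("I want to fly from Boston to Denver"))
    print(recognize("Boston from to fly"))
    print(recognize("fly from Boston to Paris"))
```

The first utterance is accepted with Boston as origin and Denver as destination. The second uses only known words but is rejected by the grammar, and the third is rejected because "Paris" is outside the vocabulary.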
Designing a speech application includes presenting data for delivery over the phone, constructing a call flow and enabling prompts and grammars. VoiceXML provides a common set of rules as a flexible foundation, but it's up to the designer to create the appropriate flow and personality for a speech system.
Just as HTML content is interpreted by a browser and presented visually over the Web, so must VoiceXML be understood or interpreted for presentation over the telephone by a speech, or voice, browser. The speech browser serves as a gateway between a call and an Internet connection. It interprets VoiceXML code and manages dialog between callers and VoiceXML content located at a Web site.
Speech browser software also maintains the calls, presents voice prompts that correspond to URLs and downloads pages for audio interaction.
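To make the browser's role concrete, here is a deliberately simplified sketch of the interpreting step. It assumes the page has already been downloaded, replaces the telephone and the recognizer with a scripted list of caller answers, and simply visits form items in document order, which is far cruder than the form-handling rules in the VoiceXML specification. The page content and the function run_form are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical VoiceXML page, as if already fetched from a Web site.
VXML_PAGE = """<vxml version="2.0">
  <form id="flight">
    <block><prompt>Welcome to flight information.</prompt></block>
    <field name="city">
      <prompt>Which city are you leaving from?</prompt>
    </field>
    <field name="day">
      <prompt>Which day are you traveling?</prompt>
    </field>
  </form>
</vxml>
"""


def run_form(page, caller_answers):
    """Walk the first form in document order.

    Each prompt is "spoken" by appending it to a transcript, and each field
    is filled from the next scripted caller answer (a stand-in for speech
    recognition). Returns the transcript and the filled field values.
    """
    form = ET.fromstring(page).find("form")
    answers = iter(caller_answers)
    transcript, filled = [], {}
    for item in form:
        for prompt in item.iter("prompt"):
            text = " ".join((prompt.text or "").split())
            transcript.append("SYSTEM: " + text)
        if item.tag == "field":
            reply = next(answers)
            transcript.append("CALLER: " + reply)
            filled[item.get("name")] = reply
    return transcript, filled


if __name__ == "__main__":
    lines, values = run_form(VXML_PAGE, ["Boston", "Friday"])
    print("\n".join(lines))
    print(values)
```

A production browser would add everything this sketch leaves out: call handling, fetching pages over HTTP, playing recorded or synthesized audio, and matching what the caller says against the active grammars.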
A VoiceXML-based application using a speech browser provides flexibility, benefiting callers and content providers alike. A caller could use a rotary telephone or the newest wireless model and receive the same service. Content providers have a choice of locating a speech browser at their facilities or outsourcing to an application service provider, carrier or service bureau. As with current visual Web models, trade-offs have to be weighed between ease of implementation, flexibility, cost and other factors.
Today, companies are building businesses on speech-based Web content by providing telephony access and presentation of data in interactive audio formats. These businesses host speech applications to provide scalability, maintenance and support, while letting content providers focus on their core business.
A number of obvious and subtle factors are converging to bring the Web model of VoiceXML to prominence. Many consider the broad industry support of VoiceXML its most apparent strength. Other factors, such as recent improvements in text-to-speech quality, mean information can be presented immediately in audio format without the time and expense of recording a voice. Looking at the evolution of the Web, it's clear that the adoption of a common format for content presentation, HTML, fueled the growth of the Web as we know it today. The VoiceXML standard holds similar promise for speech.
Chambers is vice president of marketing at SpeechWorks. He can be reached at firstname.lastname@example.org.