A key standard for building speech-based telephony applications, VoiceXML 2.0, received a final nod of approval from the World Wide Web Consortium last week.
The standard's official graduation comes just days before Microsoft is expected to formally launch its Speech Server products - which adhere to a competing standards effort - at the SpeechTEK conference this week in San Francisco.
The W3C advanced VoiceXML 2.0, along with the supporting Speech Recognition Grammar Specification (SRGS), to final "recommendation" status, effectively making them Web standards. These are the most mature of a handful of specifications in the W3C's evolving Speech Interface Framework.
The Speech Interface Framework aims to define a set of standards for building applications that let people interact with Web-based services over a telephone. The applications use a variety of input and output methods that range from keypads and spoken commands to music and synthetic speech. Within the framework, VoiceXML controls how a voice application interacts with a user. Developers use SRGS to describe the words and phrases that end users are expected to say in response to spoken prompts.
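To make that division of labor concrete, here is a minimal sketch in Python of a VoiceXML 2.0 dialog with an inline SRGS-style grammar. It is illustrative only: the form name, the field name, the prompt wording and the helper function expected_phrases are assumptions invented for this example, not taken from any vendor's product. The VoiceXML part asks the question, and the grammar lists the answers the application expects to hear.

```python
import xml.etree.ElementTree as ET

# Namespace used by VoiceXML 2.0 documents.
VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical dialog: the <prompt> is what the caller hears, and the
# inline <grammar> lists the phrases the caller is expected to say.
DIALOG = """
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee, tea or milk?</prompt>
      <grammar version="1.0" root="drink" mode="voice" xml:lang="en-US">
        <rule id="drink">
          <one-of>
            <item>coffee</item>
            <item>tea</item>
            <item>milk</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
""".strip()


def expected_phrases(vxml_text):
    """Return the phrases listed in the document's grammar items."""
    root = ET.fromstring(vxml_text)
    return [item.text.strip() for item in root.iter(f"{{{VXML_NS}}}item")]


phrases = expected_phrases(DIALOG)
print(phrases)
```

The script only parses the markup and lists the grammar entries. Actually running the dialog would require a VoiceXML-capable voice browser, which is outside the scope of this sketch.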
Other elements of the framework include Speech Synthesis Markup Language (SSML), which is used for creating spoken prompts; Voice Browser Call Control (CCXML), which provides telephony call-control support for VoiceXML and other dialog systems; and Semantic Interpretation for Speech Recognition, which defines links between grammar rules and application semantics so that an application recognizes that two spoken variations of the same element, such as "Coke" and "Coca-Cola," should be treated as the same response.
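The "Coke" and "Coca-Cola" case can be sketched in a few lines of Python. This is a plain-Python analogy for what semantic interpretation accomplishes, not the W3C specification's own mechanism, which attaches interpretation rules to the grammar itself. The table CANONICAL and the function interpret are hypothetical names chosen for illustration.

```python
# Hypothetical lookup table: each spoken variant maps to one canonical value.
CANONICAL = {
    "coke": "coca-cola",
    "coca-cola": "coca-cola",
    "coca cola": "coca-cola",
}


def interpret(utterance):
    """Map a recognized phrase to its canonical value, or None if unknown."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(utterance.lower().split())
    return CANONICAL.get(key)


print(interpret("Coke") == interpret("Coca-Cola"))
```

Because both variants resolve to the same value, the application logic downstream needs to handle only one response.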
VoiceXML is already broadly adopted. It has become a standard scripting language for making Web content accessible via voice and phone - letting users make selections and provide information by talking instead of touching numbers on a keypad.
"VoiceXML allows users to create a description of a dialog between computer and user that can output text, graphics, synthesized speech, digitized audio - and also provide a means to recognize inputs from all these sources," says Ron Schmelzer, a senior analyst at ZapThink. "What makes VoiceXML cool is that you can specify an interface for application functionality that is not Web-based, but specify it in a way that allows Web developers to control how these voice-based application interfaces work."
Scores of vendors have deployed VoiceXML 2.0-compliant applications, products and services, including HP, IBM, Lucent, Motorola and Nuance.
Meanwhile, Microsoft is making waves with its Speech Server 2004 speech-recognition platform. Bill Gates, Microsoft's chairman and chief software architect, is scheduled to formally launch the Standard and Enterprise editions at the SpeechTEK conference.
With Speech Server, Visual Studio .Net developers can write applications that recognize spoken commands, convert text to speech and generate spoken prompts by adding code based on XML and Speech Application Language Tags (SALT) technologies to existing Web applications.
SALT is the Microsoft-backed alternative to VoiceXML. It's not nearly as far along in the standards process - currently it is under consideration by the W3C. But it has industry support: Members of the SALT Forum include founding companies Cisco, Comverse, Intel, Microsoft and Philips, along with Compaq and Siemens Enterprise Networks.
Speech Server takes calls and communicates with a Web server via XML and SALT, and makes applications offered online available through the phone, says James Mastan, director of marketing for Microsoft's Speech Server group.
Developers can use Visual Studio to build applications, and Speech Server runs just like any other Microsoft server product, Mastan says. "It's not some black box in a call center that you have to program for in some weird language and you can't maintain yourself because you don't know how it works," he says.
Ease of use is at the center of the VoiceXML vs. SALT voice browser specification battle. Microsoft argues that SALT is easier to use because of its Visual Basic tie-in, whereas VoiceXML requires more telephony-type skills, says Bill Meisel, president of speech-technology research company TMA Associates. There's some validity to that position, he says.
"It's much easier for an IT department that has been focused on a Microsoft development environment to use a SALT solution," Meisel says. "On the other hand, telephone application developers classically have used tool kits that are very specialized for telephony. For them, VoiceXML is a much more natural solution."
Steve Chambers, general manager of ScanSoft's network speech division, expects users of the two standards to become divided along familiar lines: .Net converts choosing SALT, and Java shops going with VoiceXML. ScanSoft, whose products include speech-recognition and text-to-speech converters, will support both standards, Chambers says.
Dave Raggett, activity lead for the W3C's voice browser and multimodal working groups, says the two specifications could someday merge. The voice browser working group is focusing on the next major version of VoiceXML, which will incorporate ideas from SALT, among other sources, Raggett says.
Meanwhile, Microsoft's entry will stir the speech-recognition market, but the vendor likely won't become a competitive threat until it releases its second- and third-generation products, says Steve Cramoysan, a principal analyst at Gartner.
IDG News Service correspondent Joris Evers contributed to this story.