Voice eXtensible Markup Language

(VoiceXML)
Contents
1. Introduction
2. Architectural Model
3. Concepts
4. VoiceXML Elements
5. Document Structure and Execution
6. Grammars
7. Resource Fetching
8. Prompts
9. Practical Applications of VoiceXML
10. The Constraints
11. Conclusion
12. References

Introduction
The deployment of VoiceXML applications has grown increasingly popular as enterprises seek to improve customer relationships and trim customer support costs. This market growth reflects not only a larger number of enterprises seeking to capture the benefits of voice, but also a trend toward larger, more ambitious applications and toward self-service portals that offer callers multiple applications.

VoiceXML is an XML-based markup language that:
- Minimizes client/server interactions by specifying multiple interactions per document.
- Shields application authors from low-level and platform-specific details.
- Separates user interaction code (in VoiceXML) from service logic (CGI scripts).
- Promotes service portability across implementation platforms.
- Is a common language for content providers, tool providers, and platform providers.
- Is easy to use for simple interactions, and yet provides language features to support complex dialogs.

Figure 1 The VoiceXML based Bevocal speech server (courtesy: http://cafe.bevocal.com)

While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control. VoiceXML's main goal is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm.

A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server may reply with another VoiceXML document to continue the user's session with other dialogs.

Architectural Model
The architectural model of a VoiceXML based server can be generalized as shown in Figure 2. A document server (e.g. a web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.

The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.

Figure 2 Architectural Model (courtesy: www.voicexml.org)

Concepts
A VoiceXML document (or a set of documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
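The state-machine view above can be illustrated with a minimal sketch of a two-dialog document, assuming VoiceXML 1.0 syntax. The dialog names `greeting` and `farewell` and the prompt wording are made up for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- First dialog in the document: execution starts here by default -->
  <form id="greeting">
    <block>
      <prompt>Welcome to the voice portal.</prompt>
      <!-- A URI with only a fragment names a dialog in the current document -->
      <goto next="#farewell"/>
    </block>
  </form>

  <form id="farewell">
    <block>
      <prompt>Goodbye.</prompt>
      <!-- Explicitly exits the conversation -->
      <exit/>
    </block>
  </form>
</vxml>
```

A transition to another document would use a URI that names it, for instance `next="other.vxml#main"` (a hypothetical file), while a URI without a fragment would start at that document's first dialog.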

Dialogs and Subdialogs


There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of field item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.

A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction and returning to the original form. Local data, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
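As a rough sketch of the two dialog kinds, the document below pairs a menu with a form, assuming VoiceXML 1.0 syntax. The grammar file `city.gram`, the submit URL, and all dialog and field names are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Menu: offers choices and transitions to the chosen dialog -->
  <menu id="main">
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="#booking">reservations</choice>
    <choice next="#info">information</choice>
  </menu>

  <!-- Form: collects a value for each field item variable -->
  <form id="booking">
    <field name="city">
      <prompt>Which city are you travelling to?</prompt>
      <!-- Field-level grammar; the file and its format are assumptions -->
      <grammar src="city.gram" type="application/x-jsgf"/>
    </field>
    <field name="confirm" type="boolean">
      <prompt>You said <value expr="city"/>. Is that correct?</prompt>
    </field>
    <!-- Runs once both fields have been filled -->
    <filled>
      <submit next="http://www.example.com/servlet/book" namelist="city confirm"/>
    </filled>
  </form>

  <form id="info">
    <block>
      <prompt>This portal handles travel reservations.</prompt>
    </block>
  </form>
</vxml>
```

The menu comes first in the document, so it is the dialog the caller hears first. The form then decides where to go next by submitting the collected values to the document server.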

Sessions
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.

Applications
An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars can also be set to remain active for the duration of the application. Figure 3 shows the transition of documents (D) in an application that share a common application root document (root).

Figure 3 Transitioning between documents in an application

Grammars
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.

Events
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism. Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
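The event mechanism described above might look like the following sketch, assuming VoiceXML 1.0 syntax. The form name, field name, and prompt texts are invented for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level handler: inherited by every dialog below, as if by copy -->
  <catch event="error.badfetch">
    <prompt>Sorry, the service is unavailable.</prompt>
    <exit/>
  </catch>

  <form id="pin_entry">
    <field name="pin" type="digits">
      <prompt>Please say or key in your PIN.</prompt>
      <!-- Shorthand catch elements for common events -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
      <help>Your PIN is the number you chose when you registered.</help>
    </field>
    <block>
      <prompt>Thank you.</prompt>
    </block>
  </form>
</vxml>
```

The field-level handlers apply only to this field, while the document-level `catch` applies everywhere in the document unless a lower level overrides it.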

Links

A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A <link> can be used either to throw an event or to go to a destination URI.

VoiceXML Elements
VoiceXML elements follow ordinary XML syntax, since VoiceXML is itself an application of XML. Most of the elements relate to operations on audio/voice or DTMF signals. Some of them are listed below:

Element        Purpose
<vxml>         Top-level element in each VoiceXML document.
<audio>        Play an audio clip within a prompt.
<disconnect>   Disconnect/end a session.
<dtmf>         Specify a touch-tone key grammar.
<grammar>      Specify a speech recognition grammar.
<nomatch>      Catch a no match event.
<record>       Record an audio sample.

For a complete list of VoiceXML elements one may refer to http://www.voicexml.org/specs/VoiceXML-100.pdf.

Document Structure and Execution

A VoiceXML document is primarily composed of top-level elements called dialogs. There are two types of dialogs: forms and menus. A document may also have <meta> elements, <var> and <script> elements, <property> elements, <catch> elements, and <link> elements.

Execution within one document


Document execution begins at the first dialog by default. As each dialog executes, it determines the next dialog. When a dialog doesn't specify a successor dialog, document execution stops.

Attributes of <vxml> include:

Attribute      Description
version        The version of VoiceXML of this document (required). The initial version number is 1.0.
base           The base URI.
lang           The language and locale type for this document.
application    The URI of this document's application root document, if any.

Executing a multi-document application


Normally, each document runs as an isolated application. In cases where we want multiple documents to work together as one application, we select one document to be the application root document, and refer to it in the other documents' <vxml> elements. When this is done, every time the interpreter is told to load a document in this application, it also loads the application root document if it is not already loaded. The application root document remains loaded until the interpreter is told to load a document that belongs to a different application. Thus one of the following two conditions always holds during interpretation:
- The application root document (or a stand-alone document) is loaded and the user is executing in it.
- The application root document and one other document in the application are both loaded and the user is executing in the non-root document.

There are two benefits to multi-document applications. First, the application root document's variables are available for use by the other documents in the application, so that information can be shared and retained. Second, the grammars of the application root document may be set to remain active even when the user is in other application documents, so that the user can always interact with common forms, links, and menus.
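Both benefits can be sketched with a pair of documents, assuming VoiceXML 1.0 syntax. The file names `app-root.vxml`, `welcome.vxml`, and `operator.vxml`, the variable `portal_name`, and the inline JSGF grammar are hypothetical. First, a possible application root document:

```xml
<?xml version="1.0"?>
<!-- app-root.vxml: application root document (file name is illustrative) -->
<vxml version="1.0">
  <!-- Visible to the other documents as application.portal_name -->
  <var name="portal_name" expr="'Voice Portal'"/>

  <!-- This link's grammar stays active in every document of the application -->
  <link next="operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
</vxml>
```

Then a document that names it as its root:

```xml
<?xml version="1.0"?>
<!-- welcome.vxml: ordinary document that refers to its application root -->
<vxml version="1.0" application="app-root.vxml">
  <form id="welcome">
    <block>
      <prompt>Welcome to <value expr="application.portal_name"/>.</prompt>
    </block>
  </form>
</vxml>
```

Under these assumptions, loading `welcome.vxml` also loads `app-root.vxml`, so the shared variable is readable and saying "operator" is recognized from anywhere in the application.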

Grammars
Speech Grammars
The <grammar> element is used to provide a speech grammar that
1. specifies a set of utterances that a user may speak to perform an action or supply information, and
2. provides a corresponding string value (in the case of a field grammar) or set of attribute-value pairs (in the case of a form grammar) to describe the information or action.

The <grammar> element is designed to accommodate any grammar format that meets these two requirements. At this time, VoiceXML does not specify a grammar format nor require support of a particular grammar format. This is similar to the situation with recorded audio formats for VoiceXML, and with media formats in general for HTML. The <grammar> element may be used to specify an inline grammar or an external grammar. An inline grammar is specified by the content of a <grammar> element:
<grammar type="mime-type"> inline speech grammar </grammar>

DTMF Grammars
The <dtmf> element is used to specify a DTMF grammar that
1. defines a set of key presses that a user may use to perform an action or supply information, and
2. defines the corresponding string value that describes that information or action.

The <dtmf> element is designed to accommodate any grammar format that meets these two requirements. VoiceXML does not specify nor require support for any particular grammar format: as with <grammar>, it is expected that standards efforts and market pressures will cause each widely used VoiceXML interpreter context to support a common set of formats.
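A minimal sketch of an inline DTMF grammar follows, assuming VoiceXML 1.0 syntax and a platform that accepts JSGF for key presses. Since VoiceXML mandates no grammar format, the `type` value and the way tag text maps to the field value are assumptions that vary by platform. The form and field names are made up.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="language">
    <field name="choice">
      <prompt>For English press 1. For Hindi press 2.</prompt>
      <!-- Key presses 1 and 2, each tagged with a descriptive string value -->
      <dtmf type="application/x-jsgf"> 1 {english} | 2 {hindi} </dtmf>
      <filled>
        <prompt>You chose <value expr="choice"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The same grammar could instead live in an external file referenced through the `src` attribute, which keeps the dialog markup free of platform-specific grammar text.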

Activation of Grammars

When the interpreter waits for input as a result of visiting a field, the following grammars are active:
1. grammars for that field, including grammars contained in links in that field;
2. grammars for its form, including grammars contained in links in that form;
3. grammars contained in links in its document, and grammars for menus and other forms in its document which are given document scope;
4. grammars contained in links in its application root document, and grammars for menus and forms in its application root document which are given document scope.

In the case that an input matches more than one active grammar, the list above defines the precedence order. If the input matches more than one active grammar with the same precedence, the precedence is determined using document order. Menus behave with regard to grammar activation like their equivalent forms.
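The precedence list above can be made concrete with a small sketch, assuming VoiceXML 1.0 syntax. The grammar file `city.gram`, the inline JSGF grammar, and all dialog names are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Level 3: a document-level link, active in every dialog of this document -->
  <link next="#main_menu">
    <grammar type="application/x-jsgf"> main menu </grammar>
  </link>

  <!-- Level 3: document scope keeps these choices active in other dialogs too -->
  <menu id="main_menu" scope="document">
    <prompt>Say weather or news.</prompt>
    <choice next="#weather">weather</choice>
    <choice next="#news">news</choice>
  </menu>

  <form id="weather">
    <field name="city">
      <prompt>Which city?</prompt>
      <!-- Level 1: the field grammar is matched first -->
      <grammar src="city.gram" type="application/x-jsgf"/>
    </field>
    <block>
      <prompt>Fetching the weather for <value expr="city"/>.</prompt>
    </block>
  </form>

  <form id="news">
    <block>
      <prompt>Here are the headlines.</prompt>
    </block>
  </form>
</vxml>
```

While the interpreter waits in the `city` field, a city name matches the field grammar, whereas "news" or "main menu" matches a document-scoped grammar and moves the caller to the corresponding dialog.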

Built-in Grammars
Some built-in field types can be parameterized. This may be done by explicitly referring to built-in grammars using a special-purpose "builtin:" URI scheme and a URI-style query syntax of the form type?param=value in the src attribute of a <grammar> or <dtmf> element, or in the type attribute of a field, for example:
<grammar src="builtin:grammar/boolean"/>
<dtmf src="builtin:dtmf/boolean?y=7"/>
<field type="digits?minlength=3;maxlength= ">...</field>

Note: Explicitly defined grammars are typically stored in files with a .gram extension. The format used for grammar definition is generally the Java Speech Grammar Format (JSGF) or the W3C Speech Recognition Grammar Format. A summary of JSGF is given in the following table:

Feature               Purpose
word or "word"        words (terminals, tokens) need not be quoted
<rule>                rule names (non-terminals) are enclosed in <>
[x]                   optionally x
(...)                 grouping
x {tag text}          arbitrary "tag" text may be associated with any of the above
x*                    0 or more occurrences of x
x+                    1 or more occurrences of x
x y z ...             a sequence of x then y then z then ...
x | y | z | ...       a set of alternatives of x or y or z or ...
<rule> = x;           a private rule definition
public <rule> = x;    a public rule definition

An example of a grammar definition is given below:


<link event="help">
  <grammar type="application/x-jsgf">
    [please] help [me] [please] | [please] I (need|want) help [please]
  </grammar>
</link>

A complete manual for Java Speech Grammar constructs and application is available at:

http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/

The W3C Speech Recognition Grammar Format specification embodies two equivalent languages:
- XML Form of the W3C Speech Recognition Grammar Format
- Augmented BNF (ABNF) Form of the W3C Speech Recognition Grammar Format

The first form represents a grammar as an XML document, with the logical structure of the grammar captured by the XML elements. This format is ideal for computer-to-computer communication of grammars because widely available XML technology (parsers, XSLT, etc.) can be used to produce and accept the grammar format. In the second form, the logical structure of the grammar is captured by a combination of traditional BNF (Backus-Naur Form) and a regular expression language. This format is familiar to many current speech application developers, is similar to the proprietary grammar formats of most current speech recognizers, and is a more compact representation than XML. However, a special parser is required to accept this format. A few features of the XML form are discussed below.

Basic Grammar Document

Words, or more precisely tokens, are the basic units of a grammar and indicate those things that a user can say. Any token is a legal expansion in a rule definition. If a token contains white space (e.g., "NIT Surat") it should be contained in quotes. Sequences of individual tokens are separated by white space and the sequence is a legal expansion. Tokens can be enclosed in a <token> element that may be used to indicate the language of the contained token. For example:
hello
surat
"NIT Surat"
to be or not to be
<token xml:lang="en">welcome</token> <!-- English -->

A rule reference is a legal expansion and is represented by a <ruleref> element. A rule reference is equivalent to a non-terminal reference in a traditional grammar. The referenced rule is provided by a URI. The referenced rule may be local to the grammar, in which case the URI is of the form "#rulename". The referenced rule may be any public rule of another grammar, in which case a relative URI or absolute URI is used. The <ruleref> element is always an empty element (contains no text or other elements). For example:
<ruleref uri="#city"/>
<ruleref uri="../locations.xml#city"/>
<ruleref uri="http://mrinabh.com/grammars/locations.xml#city"/>

A sequence of legal expansions is itself a legal expansion. The sequence may be enclosed in an <item> element or other elements such as <count> or <rule>. As mentioned previously, tokens in sequence should be separated by white space. Sequential elements other than tokens (the <token>, <ruleref>, <item>, <count> and <one-of> elements) do not require white-space separation. The following are each examples of sequences:
phone home
call the "Surat" office
call<ruleref uri="#location"/>
<item>call <ruleref uri="#location"/></item>
<count num="optional">please</count> call home

The <one-of> element is used to declare a set of alternative expansions. The <one-of> element must contain one or more <item> elements, each of which declares one of the alternatives. In the following example, each alternative is a single token, but any legal expansion can be contained within the item:
<one-of>
  <item> new delhi </item>
  <item> guwahati </item>
  <item> ahmedabad </item>
  <item> surat </item>
</one-of>

The <count> element indicates that the expansion it contains is optional (zero or one occurrence), or may occur zero-or-more or one-or-more times.
this is <count num="optional">not</count> good
this is <count num="0+">very</count> good

The W3C Speech Recognition Grammar Format is a powerful language for developing both simple grammars and natural language grammars for use in VoiceXML applications. The availability of a standard grammar format will increase the interoperability of VoiceXML applications by allowing each grammar to be authored once and reused across many VoiceXML browsers.

Resource Fetching
Fetching of content from a URI occurs in a VoiceXML interpreter context to: (1) fetch VoiceXML documents to interpret, or (2) fetch other document types, such as audio files, objects, grammars, and scripts. All occasions for fetching content in a VoiceXML interpreter context are governed by the following three attributes:

caching: Either safe, to force a query to fetch the most recent copy of the content, or fast, to use the cached copy of the content if it has not expired. If not specified, a value derived from the innermost caching property is used.

fetchtimeout: The interval to wait for the content to be returned before throwing an error.badfetch event. If not specified, a value derived from the innermost fetchtimeout property is used.

fetchhint: Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. In the case of a very large file (implying long download times) or a streaming audio source, stream tells the interpreter context to begin processing the content as it arrives rather than wait for full retrieval. If not specified, a value derived from the innermost relevant fetchhint property is used.
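The three fetching attributes can be sketched on audio fetches as follows, assuming VoiceXML 1.0 syntax. The audio URLs, timeout values, and fallback texts are invented for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-wide defaults, used when an element does not set the attribute -->
  <property name="caching" value="fast"/>
  <property name="fetchtimeout" value="10s"/>

  <form id="intro">
    <block>
      <prompt>
        <!-- Small jingle: may be fetched early and served from cache -->
        <audio src="http://www.example.com/audio/jingle.wav"
               fetchhint="prefetch" caching="fast">Welcome.</audio>
        <!-- Long bulletin: always re-fetched and played as it arrives -->
        <audio src="http://www.example.com/audio/news.wav"
               fetchhint="stream" caching="safe" fetchtimeout="5s">
          The news bulletin is unavailable.
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```

Setting the defaults through `<property>` elements and overriding them only on the long bulletin keeps most fetches fast while making sure the time-sensitive file is never stale.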

Prompts
The prompt element is used for handling pre-recorded audio and synthesized speech. Prompts are queued for playing as soon as they are encountered. To provide a pause, we use the <break> element. Prompts can have audio clips played in between, or in the background, using the <audio> tag. For example:
<prompt>
  <audio src="welcome.wav"><emp>Welcome</emp> to Voice Portal.</audio>
</prompt>

Barge-in
If an implementation platform supports barge-in, the service author can specify whether a user can interrupt, or "barge-in" on, a prompt. This speeds up conversations, but is not always desired. If the user must hear all of a warning, legal notice, or advertisement, barge-in should be disabled. This is done with the bargein attribute. Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If bargein occurs during any prompt in a sequence, all subsequent prompts are not played. If bargein is not specified, then the value of the bargein property is used.
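A short sketch of the bargein attribute follows, assuming VoiceXML 1.0 syntax. The form name, field name, and prompt wording are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="transfer_funds">
    <block>
      <!-- Legal notice: the caller must hear all of it -->
      <prompt bargein="false">Calls may be recorded for quality purposes.</prompt>
    </block>
    <field name="amount" type="currency">
      <!-- Routine question: the caller may answer before it finishes -->
      <prompt bargein="true">How much would you like to transfer?</prompt>
    </field>
    <block>
      <prompt>Transferring <value expr="amount"/>.</prompt>
    </block>
  </form>
</vxml>
```

Both prompts are queued before the interpreter waits for input at the field, and each one is governed by its own bargein value while it plays. A document-wide default could be set with a `bargein` property instead of repeating the attribute.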

Practical Applications of VoiceXML

VoiceXML scripts are extensively used in internet telephony applications. Voice interfaces have become increasingly popular in recent years, owing to interest in making application interfaces more natural. GSM cellular services across the globe are implementing voice-command based menus for various operations, such as enquiries and information kiosks. Traditional Interactive Voice Response (IVR) applications have been deployed in enterprises for decades, but they have faced serious limitations, including poor usability and the inability to go beyond providing access to proprietary information. Telephone based help lines delivering information regarding reservations, advertisements, etc. have been widely implemented, using speech grammars to enhance interaction. VoiceXML goes a step beyond the menu-driven IVR portals, in the direction of free natural language, by allowing a certain degree of freedom and expressiveness through grammars. Voice interaction is also desired in robotics, where a voice-command based interface makes interactions more user friendly and natural. The use of voice applications offers several benefits:
- Deliver web content and services through the telephone
- Leverage existing Internet infrastructure and skill-sets
- Ensure portability across implementation platforms
- Decrease the level of expertise required to create voice applications
- Enable rapid voice application development, similar to HTML for the web
- Provide a "Voice View" for web content

To sum up, we can say that VoiceXML is an emerging industry standard for providing web content and services through the telephone. This includes information, entertainment, games and business services.

The Constraints
The primary constraint faced today by any artificial intelligence application is the vastness and variance of natural human responses. This demands large storage and very complex and long grammar definitions, even for simple implementations. VoiceXML is based on speech grammars, but the human voice varies from person to person. Designing a common set of responses for a multitude of customers is a challenge faced by programmers today. Technically speaking, programming in VoiceXML is not a problem, thanks to its flexible nature. But voice based intelligent systems require large storage and fast processing.

Conclusion
VoiceXML based applications have become popular over the last few years. The ease of programming and implementation, together with added features pertaining to AI, has made VoiceXML very popular, especially for end-user products. Speech interaction based servers provide a more natural way of communication, especially for the general public with minimal or no technical knowledge. In the entertainment industry, it particularly enhances gaming: games may implement voice commands, and scripts with grammars can make conversations natural, simulating human-to-human conversations.

[VoiceXML 2.0, which has been key in the growth of speech applications by providing a standards-based framework, allows businesses to deploy applications today that leverage existing development skills and resources. Because it allows speech deployments to be built over a standard web-application infrastructure, VoiceXML also provides a clear upgrade path as applications grow - unlike closed, proprietary languages. VoiceXML forms the foundation for IBM's voice middleware, including WebSphere Voice Server and WebSphere Voice Application Access. By committing to open standards, we provide a clear path to future upgrades that leverage existing skills, allowing enterprises to extend their infrastructure. This commitment and the W3C's work is driving us toward the next phase of speech interaction and, in the near future, multimodality. -- Igor Jablokov, Program Director, IBM Pervasive Computing, IBM]

References
1. The VoiceXML Forum [http://www.voiceXML.com]
2. Bevocal (Managed Voice Application Solutions) [http://www.bevocal.com]
3. Developments in Speech Synthesis by Mark Tatham, Katherine Morton [ISBN: 047085538X, Provided by John Wiley and Sons through the Google Print Publisher Program]
4. Robust Speech Recognition in Embedded Systems and PC Applications by Jean-Claude Junqua [ISBN: 0792378733, Provided by Springer through the Google Print Publisher Program]
