(VoiceXML)
Contents
1. Introduction
2. Architectural Model
3. Concepts
4. VoiceXML Elements
5. Document Structure and Execution
6. Grammars
7. Resource Fetching
8. Prompts
9. Practical applications of VoiceXML
10. The Constraints
11. Conclusion
12. References
Introduction
The deployment of VoiceXML applications has grown increasingly popular as enterprises seek to improve customer relationships and trim customer support costs. This market growth is a function not only of larger numbers of enterprises seeking to capture the benefits of voice, but also of a trend toward larger, more ambitious applications and toward self-service portals that offer callers multiple applications.
VoiceXML is an XML-based markup language that:
- Minimizes client/server interactions by specifying multiple interactions per document.
- Shields application authors from low-level and platform-specific details.
- Separates user interaction code (in VoiceXML) from service logic (CGI scripts).
- Promotes service portability across implementation platforms.
- Is a common language for content providers, tool providers, and platform providers.
- Is easy to use for simple interactions, and yet provides language features to support complex dialogs.
Figure 1 The VoiceXML based BeVocal speech server (courtesy: http://cafe.bevocal.com)
While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control. VoiceXML's main goal is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm.
A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server may reply with another VoiceXML document to continue the user's session with other dialogs.
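To make the request/reply cycle described above concrete, the following is a minimal sketch of a single VoiceXML document, assuming VoiceXML 1.0 syntax and a platform that accepts inline JSGF grammars. The field name, the word list and the server URL are hypothetical and serve only as illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <!-- One interaction dialog: prompt the caller and collect an answer. -->
    <field name="drink">
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar type="application/x-jsgf"> coffee | tea | milk </grammar>
    </field>
    <block>
      <!-- The collected input is sent as a request to a (hypothetical)
           document server, which may reply with another VoiceXML document. -->
      <submit next="http://www.example.com/drink.cgi"/>
    </block>
  </form>
</vxml>
```

The user interaction lives entirely in the document, while the service logic stays behind the URL named in the submit element.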
Architectural Model
The architectural model of a VoiceXML based server can be generalized as shown in Figure 2. A document server (e.g. a web server) processes requests from a client application, the VoiceXML interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.
Concepts
A VoiceXML document (or a set of documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
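The transition rules above can be sketched as follows, assuming VoiceXML 1.0 syntax. The dialog ids and the file name orders.vxml are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="greeting">
    <block>
      <prompt>Welcome.</prompt>
      <!-- URI with a fragment only: the current document is assumed,
           and the dialog named "main" is entered. -->
      <goto next="#main"/>
    </block>
  </form>
  <menu id="main">
    <prompt>Say orders or goodbye.</prompt>
    <!-- Document plus fragment: load orders.vxml and enter its "confirm" dialog.
         Without the fragment, the first dialog of orders.vxml would be used. -->
    <choice next="orders.vxml#confirm">orders</choice>
    <choice next="#goodbye">goodbye</choice>
  </menu>
  <form id="goodbye">
    <block>
      <prompt>Goodbye.</prompt>
      <!-- Explicitly exits the conversation. -->
      <exit/>
    </block>
  </form>
</vxml>
```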
Sessions
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
Applications
An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document's variables are available to the other documents as application variables, and its grammars can also be set to remain active for the duration of the application. Figure 3 shows the transition of documents (D) in an application that share a common application root document (root).
Grammars
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog's grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog's active grammars, execution transitions to that other dialog, with the user's utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
Events
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism. Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
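A small sketch of event handling at two levels, assuming VoiceXML 1.0 syntax and inline JSGF support. The city names and prompt wording are illustrative only.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-level handler: inherited by the dialogs below "as if by copy". -->
  <catch event="help">
    <prompt>Please say the name of a city.</prompt>
  </catch>
  <form id="city_form">
    <field name="city">
      <prompt>Which city?</prompt>
      <grammar type="application/x-jsgf"> guwahati | surat | ahmedabad </grammar>
      <!-- Syntactic shorthand for a catch of the noinput event. -->
      <noinput>
        <prompt>I did not hear anything.</prompt>
        <reprompt/>
      </noinput>
      <!-- Syntactic shorthand for a catch of the nomatch event. -->
      <nomatch>
        <prompt>I did not understand that.</prompt>
        <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```

The help handler is written once at the top and applies to every field, while the noinput and nomatch handlers are specific to the one field that declares them.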
Links
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link's grammar, control transfers to the link's destination URI. A <link> can be used to throw an event or to go to a destination URI.
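As an illustration, the sketch below places a link at document level so that its grammar is active in every dialog of the document. It assumes VoiceXML 1.0 syntax; the file name operator.vxml and the field are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- In scope for the whole document: saying "operator" in any dialog
       transfers control to the destination URI. -->
  <link next="operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
  <form id="main">
    <field name="account" type="digits">
      <prompt>Please say your account number.</prompt>
    </field>
  </form>
</vxml>
```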
VoiceXML Elements
The VoiceXML elements are very similar to XML components, since VoiceXML is an extension of XML itself. Most of the elements are related to operations on audio/voice or DTMF signals. Some of them are listed below:

Element        Purpose
<vxml>         Top-level element in each VoiceXML document.
<audio>        Play an audio clip within a prompt.
<disconnect>   Disconnect/end a session.
<dtmf>         Specify a touch-tone key grammar.
<grammar>      Specify a speech recognition grammar.
<nomatch>      Catch a no match event.
<record>       Record an audio sample.

one may refer to
The <vxml> element has the following attributes:

Attribute      Description
version        The version of VoiceXML of this document (required). The initial version number is 1.0.
base           The base URI.
lang           The language and locale type for this document.
application    The URI of this document's application root document, if any.
There are two benefits to multi-document applications. First, the application root document's variables are available for use by the other documents in the application, so that information can be shared and retained. Second, the grammars of the application root document may be set to remain active even when the user is in other application documents, so that the user can always interact with common forms, links, and menus.
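Both benefits can be seen in a two-file sketch, assuming VoiceXML 1.0 syntax. The file names app-root.vxml, leaf.vxml and operator.vxml, and the variable bye, are hypothetical. The two files are shown in one listing and separated by comments.

```xml
<!-- File 1: app-root.vxml, the application root document -->
<vxml version="1.0">
  <!-- Visible to other documents of the application as application.bye -->
  <var name="bye" expr="'Goodbye'"/>
  <!-- Remains active while any document of the application is loaded -->
  <link next="operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
</vxml>

<!-- File 2: leaf.vxml, a document that names the root in its application attribute -->
<vxml version="1.0" application="app-root.vxml">
  <form id="say_goodbye">
    <field name="answer" type="boolean">
      <prompt>Shall we say <value expr="application.bye"/>?</prompt>
    </field>
  </form>
</vxml>
```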
Grammars
Speech Grammars
The <grammar> element is used to provide a speech grammar that
1. specifies a set of utterances that a user may speak to perform an action or supply information, and
2. provides a corresponding string value (in the case of a field grammar) or set of attribute-value pairs (in the case of a form grammar) to describe the information or action.
The <grammar> element is designed to accommodate any grammar format that meets these two requirements. At this time, VoiceXML does not specify a grammar format nor require support of a particular grammar format. This is similar to the situation with recorded audio formats for VoiceXML, and with media formats in general for HTML. The <grammar> element may be used to specify an inline grammar or an external grammar. An inline grammar is specified by the content of a <grammar> element:
<grammar type="mime-type"> inline speech grammar </grammar>
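The difference between the two forms can be sketched as below, assuming VoiceXML 1.0 syntax and a platform that supports JSGF. The file name drinks.gram and the word lists are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="trip">
    <field name="city">
      <prompt>Which city?</prompt>
      <!-- Inline grammar: the grammar text is the content of the element. -->
      <grammar type="application/x-jsgf"> new delhi | guwahati | ahmedabad | surat </grammar>
    </field>
    <field name="drink">
      <prompt>What would you like to drink?</prompt>
      <!-- External grammar: the element is empty and points to a grammar file. -->
      <grammar src="drinks.gram" type="application/x-jsgf"/>
    </field>
  </form>
</vxml>
```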
DTMF Grammars
The <dtmf> element is used to specify a DTMF grammar that
1. defines a set of key presses that a user may use to perform an action or supply information, and
2. defines the corresponding string value that describes that information or action.
The <dtmf> element is designed to accommodate any grammar format that meets these two requirements. VoiceXML does not specify nor require support for any particular grammar format: as with <grammar>, it is expected that standards efforts and market pressures will cause each widely used VoiceXML interpreter context to support a common set of formats.
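A minimal sketch of a DTMF grammar in a field follows. Because VoiceXML does not mandate a DTMF grammar format, the inline JSGF used here is an assumption about what the platform accepts. The field name and prompt are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="route_call">
    <field name="department">
      <prompt>For sales press 1. For support press 2.</prompt>
      <!-- Accepts a single key press of 1 or 2. -->
      <dtmf type="application/x-jsgf"> 1 | 2 </dtmf>
    </field>
  </form>
</vxml>
```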
Built-in Grammars
Some built-in field types can be parameterized. This may be done by explicitly referring to built-in grammars using a special-purpose "builtin:" URI scheme and a URI-style query syntax of the form type?param=value in the src attribute of a <grammar> or <dtmf> element, or in the type attribute of a field, for example:
<grammar src="builtin:grammar/boolean"/>
<dtmf src="builtin:dtmf/boolean?y=7"/>
<field type="digits?minlength=3;maxlength=5">...</field>
@ote+ 0ll e)plicitly defined grammars are stored in file with .gram e)tension. The format used for grammar definition is generally 0a$a1 .peech /ra##ar For#at #G"&4$ or the 23C .peech 3ecognition /ra##ar For#at. 0 summary of G"&4 is given in the following table+
Feature                            Purpose
word or "word"                     words (terminals, tokens) need not be quoted
<rule>                             rule names (non-terminals) are enclosed in <>
[x]                                optionally x
(...)                              grouping
x {tag text}                       arbitrary "tag" text may be associated with any of the above
x*                                 0 or more occurrences of x
x+                                 1 or more occurrences of x
x y z ...                          a sequence of x then y then z then ...
x | y | z | ...                    a set of alternatives of x or y or z or ...
<rule> = x;  public <rule> = x;    a private and a public rule definition
<link event="help">
  <grammar type="application/x-jsgf">
    [please] help [me] [please] | [please] I (need|want) help [please]
  </grammar>
</link>
http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/
The W3C Speech Recognition Grammar Format specification embodies two equivalent languages:
- XML Form of the W3C Speech Recognition Grammar Format
- Augmented BNF (ABNF) Form of the W3C Speech Recognition Grammar Format
The first form represents a grammar as an XML document, with the logical structure of the grammar captured by the XML elements. This format is ideal for computer-to-computer communication of grammars because widely available XML technology (parsers, XSLT, etc.) can be used to produce and accept the grammar format. In the second form, the logical structure of the grammar is captured by a combination of traditional BNF (Backus-Naur Form) and a regular expression language. This format is familiar to many current speech application developers, is similar to the proprietary grammar formats of most current speech recognizers, and is a more compact representation than XML. However, a special parser is required to accept this format. A few features of the XML form are discussed below.
A rule reference is a legal expansion and is represented by a <ruleref> element. A rule reference is equivalent to a non-terminal reference in a traditional grammar. The referenced rule is provided by a URI. The referenced rule may be local to the grammar, in which case the URI is of the form "#rulename". The referenced rule may be any public rule of another grammar, in which case a relative URI or absolute URI is used. The <ruleref> element is always an empty element (contains no text or other elements). For example:
<ruleref uri="#city"/>
<ruleref uri="../locations.xml#city"/>
<ruleref uri="http://mrinabh.com/grammars/locations.xml#city"/>
A sequence of legal expansions is itself a legal expansion. The sequence may be surrounded by an <item> element or by other elements such as <count> or <rule>. As mentioned previously, tokens in sequence should be separated by white space. Sequential elements other than tokens (the <token>, <ruleref>, <item>, <count> and <one-of> elements) do not require white-space separation. The following are each examples of sequences:
phone home
call the "Surat" office
call<ruleref uri="#location"/>
<item>call <ruleref uri="#location"/></item>
<count num="optional">please</count> call home
The <one-of> element is used to declare a set of alternative expansions. The <one-of> element must contain one or more <item> elements, each of which declares one of the alternatives. In the following example, each alternative is a single token, but any legal expansion can be contained within an item:
<one-of>
  <item> new delhi </item>
  <item> guwahati </item>
  <item> ahmedabad </item>
  <item> surat </item>
</one-of>
The <count> element indicates that the expansion it contains might be optional (zero or one occurrences), or may occur zero-or-more or one-or-more times.
this is <count num="optional">not</count> good
this is <count num="45">very</count> good
The W3C Speech Recognition Grammar Format is a powerful language for developing both simple grammars and natural language grammars for use in VoiceXML applications. The availability of a standard grammar format will increase the interoperability of VoiceXML applications by allowing each grammar to be authored once and reused across many VoiceXML browsers.
Resource Fetching
Fetching of content from a URI occurs in a VoiceXML interpreter context to (1) fetch VoiceXML documents to interpret, or (2) fetch other document types, such as audio files, objects, grammars, and scripts. All occasions for fetching content in a VoiceXML interpreter context are governed by the following three attributes:

caching
    Either safe, to force a query to fetch the most recent copy of the content, or fast, to use the cached copy of the content if it has not expired. If not specified, a value derived from the innermost caching property is used.
fetchtimeout
    The interval to wait for the content to be returned before throwing an error.badfetch event. If not specified, a value derived from the innermost fetchtimeout property is used.
fetchhint
    Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. In the case of a very large file (implying long download times) or a streaming audio source, stream tells the interpreter context to begin processing the content as it arrives rather than wait for full retrieval of the content. If not specified, a value derived from the innermost relevant fetchhint property is used.
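The three attributes and their property defaults might be combined as in the sketch below, assuming VoiceXML 1.0 syntax. The file names and the time values such as 10s are illustrative assumptions, and the exact duration syntax accepted may vary by platform.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Document-wide defaults, used when an element does not set the attribute itself. -->
  <property name="caching" value="fast"/>
  <property name="fetchtimeout" value="10s"/>
  <form id="news">
    <block>
      <prompt>
        <!-- Always query the server for the latest copy, fetch only when needed,
             and give up after 5 seconds. The text is spoken if the clip is unavailable. -->
        <audio src="headlines.wav" caching="safe" fetchtimeout="5s" fetchhint="safe">
          The headlines are not available right now.
        </audio>
      </prompt>
      <prompt>
        <!-- A long recording: start processing while it is still arriving. -->
        <audio src="broadcast.wav" fetchhint="stream"/>
      </prompt>
      <!-- The next document may be downloaded as soon as this page is loaded. -->
      <goto next="menu.vxml" fetchhint="prefetch"/>
    </block>
  </form>
</vxml>
```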
Prompts
The prompt element is used for handling pre-recorded audio and synthesized speech. Prompts are instantaneously queued for playing. To provide a pause, we use the <break> element. Prompts can have audio clips played in between, or in the background, using the <audio> tag. For example:
<prompt>
  <audio src="welcome.wav"><emp>Welcome</emp> to Voice Portal.</audio>
</prompt>
Barge-in
If an implementation platform supports barge-in, the service author can specify whether a user can interrupt, or "barge in" on, a prompt. This speeds up conversations, but is not always desired. If the user must hear all of a warning, legal notice, or advertisement, barge-in should be disabled. This is done with the bargein attribute. Users can interrupt a prompt whose bargein attribute is true, but must wait for completion of a prompt whose bargein attribute is false. In the case where several prompts are queued, the bargein attribute of each prompt is honored during the period of time in which that prompt is playing. If barge-in occurs during any prompt in a sequence, all subsequent prompts are not played. If bargein is not specified, then the value of the bargein property is used.
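A short sketch of the bargein attribute in use, assuming VoiceXML 1.0 syntax and a platform that supports barge-in. The form, field name and prompt wording are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="transfer_funds">
    <block>
      <!-- Legal notice: the caller must hear all of it, so barge-in is disabled. -->
      <prompt bargein="false">This call may be recorded.</prompt>
    </block>
    <field name="amount" type="currency">
      <!-- Ordinary question: the caller may answer before the prompt finishes. -->
      <prompt bargein="true">How much would you like to transfer?</prompt>
    </field>
  </form>
</vxml>
```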
To sum up, we can say that VoiceXML is an emerging industry standard for providing web content and services through the telephone. This includes information, entertainment, games and business services.
The Constraints
The primary constraint, faced today by any artificial intelligence application, is the vastness and variance of natural human responses. This demands large storage and very complex and long grammar definitions, even for simple implementations. VoiceXML is based on speech grammars, but the human voice varies from person to person. Designing a common set of responses for a multitude of customers is a challenge faced by programmers today. Technically speaking, programming in VoiceXML is not a problem, thanks to its flexible nature. But voice-based intelligent systems require large storage and fast processing.
Conclusion
VoiceXML based applications have become popular over the last few years. The ease of programming and implementation, and the added features pertaining to AI, have made it very popular, especially for end-user products. Speech interaction based servers provide a more natural way of communication, especially for the general public with minimal or no technical knowledge. In the entertainment industry, it especially enhances gaming: games may implement voice commands, and scripts with grammars can make conversations natural, simulating human-to-human conversation.
["VoiceXML 2.0, which has been key in the growth of speech applications by providing a standards-based framework, allows businesses to deploy applications today that leverage existing development skills and resources. Because it allows speech deployments to be built over a standard web-application infrastructure, VoiceXML also provides a clear upgrade path as applications grow - unlike closed proprietary languages. VoiceXML forms the foundation for IBM's voice middleware including WebSphere Voice Server and WebSphere Voice Application Access. By committing to open standards we provide a clear path to future upgrades that leverage existing skills, allowing enterprises to extend their infrastructure. This commitment and the W3C's work is driving us toward the next phase of speech interaction and, in the near future, multimodality." -- Igor Jablokov, Program Director, IBM Pervasive Computing, IBM]
References
1. The VoiceXML Forum [http://www.voiceXML.com]
2. BeVocal (Managed Voice Application Solutions) [http://www.bevocal.com]
3. Developments in Speech Synthesis by Mark Tatham, Katherine Morton [ISBN: 047085538X, Provided by John Wiley and Sons through the Google Print Publisher Program]
4. Robust Speech Recognition in Embedded Systems and PC Applications by Jean-Claude Junqua [ISBN: 0792378733, Provided by Springer through the Google Print Publisher Program]