
Speech recognition, understanding and conversational interfaces

Alexander Rudnicky School of Computer Science


http://www.cs.cmu.edu/~air

Outline
Speech
Types of speech interfaces
Speech systems and their structure
Designing speech interfaces
Some applications
SpeechWear
Communicator

Speech as a signal
The difference between speech and sound
CD quality vs. intelligible quality
high-quality is 44.1 / 48 kHz
desirable speech bandwidth: 0-8 kHz, 16 bits
at 16 bits/sample: 256 kbps (tethered mic)
telephone: 64 kbps (and lower)
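The bit rates above follow from sampling rate times bits per sample. The sketch below reproduces the arithmetic. It assumes a 16 kHz sampling rate for the 0-8 kHz bandwidth (twice the bandwidth, per the Nyquist criterion) and 8 kHz, 8-bit samples for the telephone figure, as in standard digital telephony; neither assumption is stated on the slide, and the function name is illustrative.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kilobits per second (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Assumption: 0-8 kHz bandwidth sampled at 16 kHz, 16 bits per sample.
wideband = pcm_bit_rate_kbps(16_000, 16)

# Assumption: telephone speech at 8 kHz with 8-bit companded samples.
telephone = pcm_bit_rate_kbps(8_000, 8)

# One channel of CD audio: 44.1 kHz, 16 bits per sample.
cd_channel = pcm_bit_rate_kbps(44_100, 16)

print(wideband, telephone, cd_channel)
```

Under these assumptions the wideband and telephone figures come out to the 256 kbps and 64 kbps quoted on the slide.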

Compression:
MPEG: 64 kbps/channel and up (but not speech-optimal)
CELP: 16 kbps to 2.4 kbps (optimized for speech)

Speech for communication


The difference between speech and language
Speech recognition and speech understanding

Computers and speech


Transcription
dictation, information retrieval

Command and control


data entry, device control, navigation

Information access
airline schedules, stock quotes

Problem solving
travel planning, logistics

Speech system architecture


SIGNAL PROCESSING -> DECODING -> UNDERSTANDING -> DISCOURSE -> ACTION

Varieties of speech systems


[Table comparing four system types: Transcription, Command & Control, Information Access, Problem Solving]

A generic speech system


speech (input)
Signal processing
Decoder
Parser
Post parser
Dialog manager, with domain agents
Language Generator
Speech synthesizer
Outputs: speech, display, effector
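The component chain above can be sketched as a sequence of function calls. Every stage below is a stub with made-up behavior (the decoder returns a canned word string, for example); the function names and the frame layout are assumptions for illustration, not the structure of any actual system.

```python
def signal_processing(samples):
    """Stub: reduce the signal to a shorter feature sequence (here, block means)."""
    block = 4
    return [sum(samples[i:i + block]) / len(samples[i:i + block])
            for i in range(0, len(samples), block)]


def decoder(features):
    """Stub: a real decoder searches acoustic and language models for words."""
    return ["open", "window", "three"]


def parser(words):
    """Stub: extract a semantic frame from the word string."""
    return {"act": words[0], "object": words[1], "id": words[2]}


def post_parser(frame, context):
    """Stub: fold context into the frame, here by resolving a spoken number."""
    numbers = context["numbers"]
    return {**frame, "id": numbers.get(frame["id"], frame["id"])}


def dialog_manager(frame):
    """Stub: map the interpreted input to an action for one output channel."""
    return ("effector", f"{frame['act']}_{frame['object']}({frame['id']})")


def run_pipeline(samples):
    features = signal_processing(samples)
    words = decoder(features)
    frame = post_parser(parser(words), {"numbers": {"three": 3}})
    return dialog_manager(frame)


result = run_pipeline([0.0] * 16)
print(result)
```

The point of the sketch is only the data flow: each stage consumes the previous stage's output, and the dialog manager chooses which output component (speech, display, or effector) receives the result.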

Decoding speech
Signal processing

Reduce dimensionality of signal, noise conditioning

Decoder

Transcribe speech to words

Acoustic models

Language models

Corpus-based statistical models
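The way a decoder combines acoustic and language models is commonly written as the standard statistical formulation below. It is not taken from the slides; here $A$ stands for the acoustic observations and $W$ for a candidate word sequence.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} P(A \mid W)\, P(W)
```

In this reading, $P(A \mid W)$ is supplied by the acoustic models and $P(W)$ by the language model; $P(A)$ does not depend on $W$ and drops out of the maximization.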

Creating models for recognition


Speech data -> Transcribe* -> Train -> Acoustic models

Text data -> Train -> Language models
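As a minimal illustration of training a language model from text data, the sketch below estimates bigram probabilities from raw counts. It assumes a bigram model with no smoothing, which is a simplification chosen for brevity; the slides do not say which model type is used, and the two-sentence corpus is borrowed from the frame-based example later in the deck.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(word | previous word) by maximum likelihood from raw counts."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            bigram_counts[prev][cur] += 1

    model = {}
    for prev, counter in bigram_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model


corpus = [
    "I would like to fly to Boston",
    "I would like to go to Boston on Friday",
]
lm = train_bigram_model(corpus)
print(lm["to"])
```

Because there is no smoothing, any word pair absent from the training text gets no probability at all, which is one reason real systems need much larger corpora.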

Understanding speech
Grammar

Ontology design, language acquisition

Parser

Extract semantic content from utterance

Post parser

Introduce context and world knowledge into interpretation

Context

Domain Agents

Grounding, knowledge engineering

Interacting with the user


Task schemas

Task analysis

Context

Dialog manager, with domain agents

Guide interaction through task
Map user inputs and system state into actions
Interact with back-end(s)
Interpret information using domain knowledge

Database

Live data (e.g. Web)

Domain expert

Knowledge engineering

Communicating with the user


Language Generator
Decide what to say to user (and how to phrase it)
Speech synthesizer
Display Generator
Action Generator

Speech recognition and understanding


Sphinx system
speaker-independent
continuous speech
large vocabulary

ATIS system
air travel information retrieval
context management

film clip

Command and control systems


Small vocabularies, fixed syntax
OPEN WINDOW <window_id>
MOVE OBJECT <object_id> to <position>
Applications:
data entry (e.g., zip codes), process control (e.g., electron microscope, darkroom equipment)
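A fixed-syntax command language like the two templates above can be matched against recognized text with a handful of patterns. The sketch below uses regular expressions; the command names, the assumption that identifiers are single words, and the case-folding are illustrative choices, not part of the original system.

```python
import re

# One pattern per command template; named groups capture the variable parts.
COMMAND_PATTERNS = {
    "open_window": re.compile(r"^OPEN WINDOW (?P<window_id>\w+)$"),
    "move_object": re.compile(r"^MOVE OBJECT (?P<object_id>\w+) TO (?P<position>\w+)$"),
}


def parse_command(utterance):
    """Return (command_name, arguments), or None if the input fits no template."""
    text = " ".join(utterance.upper().split())  # fold case, normalize spaces
    for name, pattern in COMMAND_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return name, match.groupdict()
    return None


print(parse_command("open window 3"))
print(parse_command("move object lamp to corner"))
print(parse_command("close everything"))
```

Anything outside the templates is rejected outright, which is what makes small-vocabulary, fixed-syntax systems comparatively easy to recognize reliably.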

Large vocabulary, fixed syntax


Web browsing (?)

SpeechWear
Vehicle inspection task
USMC mechanics, fixed inspection form
Wearable computer (COTS components)
HTML-based task representation

film clip

Information access
Moderate to very large vocabulary
IVR and frame based systems

Commercial systems:
Nuance: http://www.nuance.com/demo/index.html
SpeechWorks: http://www.speechworks.com/demos/demos.htm

lots of others...

IVR and frame-based systems


Interactive voice response (IVR)
interactions specified by a graph (typically a tree)

Frame systems
ergodic graphs
states defined by multi-item forms

Graph-based systems
Welcome to Bank ABC! Please say one of the following: Balance, Hours, Loan, ...

What type of loan are you interested in? Please say one of the following: Mortgage, Car, Personal, ...

. . . .
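The two prompts above form the top of a menu tree. A minimal sketch of such a graph-based dialog follows; the leaf nodes and their bracketed prompts are hypothetical placeholders, since the slides do not show what happens after the second menu, and re-playing the prompt on an unrecognized reply is an assumed policy.

```python
def leaf(label):
    # Hypothetical terminal node; the slides do not show what leaves say.
    return {"prompt": f"[{label} information]", "options": {}}


MENU_TREE = {
    "prompt": "Welcome to Bank ABC! Please say one of the following: "
              "Balance, Hours, Loan",
    "options": {
        "balance": leaf("balance"),
        "hours": leaf("hours"),
        "loan": {
            "prompt": "What type of loan are you interested in? "
                      "Please say one of the following: Mortgage, Car, Personal",
            "options": {
                "mortgage": leaf("mortgage"),
                "car": leaf("car loan"),
                "personal": leaf("personal loan"),
            },
        },
    },
}


def run_ivr(tree, user_replies):
    """Walk the tree and return the prompts played, in order."""
    node = tree
    prompts = [node["prompt"]]
    for reply in user_replies:
        if not node["options"]:
            break  # reached a leaf; nothing further to choose
        # An unrecognized reply keeps the caller at the same node.
        node = node["options"].get(reply.strip().lower(), node)
        prompts.append(node["prompt"])
    return prompts


prompts = run_ivr(MENU_TREE, ["Loan", "Boat", "Car"])
for line in prompts:
    print(line)
```

The user can only move along edges the designer drew, so the system always knows exactly which few words to listen for at each node.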

Frame-based systems
I would like to fly to Boston
I'd like to go to Boston on Friday,
Destination_City: Boston
Departure_Date: ______
Departure_Time: ______
Preferred_Airline: ______
. . .

When would you like to fly?
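The exchange above can be sketched as slot filling over a form: the system fills whatever slots the utterance mentions, then asks about the first slot still empty. The slot names come from the slide; the keyword lists, the "to <city>" rule, and every prompt other than "When would you like to fly?" are assumptions made for illustration.

```python
import re

# Prompt to use when a slot is still empty, in the order slots are asked about.
SLOT_PROMPTS = {
    "Destination_City": "Where would you like to fly?",        # assumed wording
    "Departure_Date": "When would you like to fly?",           # from the slide
    "Departure_Time": "What time would you like to leave?",    # assumed wording
    "Preferred_Airline": "Do you have a preferred airline?",   # assumed wording
}

KNOWN_CITIES = {"boston", "pittsburgh"}  # hypothetical toy vocabulary
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}


def new_frame():
    return {slot: None for slot in SLOT_PROMPTS}


def fill_slots(frame, utterance):
    """Fill every slot the utterance mentions, in whatever order it mentions them."""
    words = re.findall(r"[a-z']+", utterance.lower())
    for prev, word in zip([""] + words, words):
        if prev == "to" and word in KNOWN_CITIES:
            frame["Destination_City"] = word.capitalize()
        if word in WEEKDAYS:
            frame["Departure_Date"] = word.capitalize()
    return frame


def next_prompt(frame):
    """Return the prompt for the first empty slot, or None when the form is full."""
    for slot, prompt in SLOT_PROMPTS.items():
        if frame[slot] is None:
            return prompt
    return None


first = fill_slots(new_frame(), "I would like to fly to Boston")
second = fill_slots(new_frame(), "I'd like to go to Boston on Friday")
print(next_prompt(first))
print(next_prompt(second))
```

Unlike the menu tree, the user may supply several items at once, and the system skips questions whose answers it already has.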

Frame-based systems
Several frames, each a form of slots (slot names are placeholders):

Zxfgdh_dxab: _____
askjs: _____
dhe: _____
aa_hgjs_aa: _____
. .

Transition between frames on keyword or phrase

Some problems
IVR systems work great, but only for well-structured (shallow) tasks
Frame systems are good for tasks that correspond to a single form leading to an action
Neither approach does well with more complex problem-solving activities

Dialog Systems
Problem solving activity; complex task
Order of progression through task depends on user goals (which can change) and system state (a back-end retrieval) and is not predictable.

Track progress and help task along


mixed-initiative dialog

Discourse phenomena
Users expect to converse with the system

Carnegie Mellon Communicator


A dialog system that supports complex problem solving in a travel planning domain
create an itinerary using air schedule, hotel and car information
186 U.S. airports (>140k enplanements/yr)
currently: >500 world airports

Web-based data resources


Live and cached flight information
Airport, airline, etc. information

Value schema/handlers

receptors -> transform -> value

Domain Agent

Compound schema
Value_1 + Value_2 + Value_3 -> transform (e.g. SQL query) -> value

Domain Agent

Schema ordering
Schema i: Destination airport -> Value i
Schema j: Date -> Value j
Schema k: Time -> Value k
Flight Leg: transform (database lookup) -> Value: Available flights
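One way to read the compound schema is that its transform runs only once every component value is present, at which point the values are combined into a database query. The sketch below illustrates that reading with an in-memory SQLite table. The table layout, column names, flight numbers, and dates are all invented for the example and do not describe the Communicator back-end.

```python
import sqlite3


def build_flight_query(values):
    """Compound transform: combine component values into a parameterized query.

    Returns None while any component value is still missing.
    """
    required = ("destination", "date", "time")
    if any(values.get(name) is None for name in required):
        return None
    sql = ("SELECT flight_no FROM flights "
           "WHERE destination = ? AND depart_date = ? AND depart_time >= ? "
           "ORDER BY depart_time")
    return sql, (values["destination"], values["date"], values["time"])


# Made-up data standing in for the flight database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights "
             "(flight_no TEXT, destination TEXT, depart_date TEXT, depart_time TEXT)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", [
    ("XX100", "BOS", "2000-01-07", "08:00"),
    ("XX200", "BOS", "2000-01-07", "14:30"),
    ("XX300", "ORD", "2000-01-07", "09:15"),
])

# Time not yet known: the compound value cannot be computed.
incomplete = build_flight_query(
    {"destination": "BOS", "date": "2000-01-07", "time": None})

# All three component values present: run the lookup.
sql, params = build_flight_query(
    {"destination": "BOS", "date": "2000-01-07", "time": "12:00"})
available = [row[0] for row in conn.execute(sql, params)]
print(incomplete, available)
```

Treating a missing component as "no value yet" gives the dialog manager a natural cue for what to ask next.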

Carnegie Mellon Communicator


CMU Communicator
Call: 268-5144
the information is accurate; you can use it for your own travel planning...

User-aware speech interfaces


Predictable behavior on the system's part
Users communicate at different levels
http://www.speech.cs.cmu.edu/air/papers/Interface Chars.html

User-aware speech interfaces


Content: task-centric utterances
Possibility: What can I do?
Orientation: Where are we?
Navigation: moving through the task space
Control: verbose/terse, listen!
Customization: define this word

Speech interface guidelines


Speech recognition is errorful
System state is often opaque to the user
http://www.speech.cs.cmu.edu/air/papers/SpInGuidelines/SpInGuidelines.html

Interface guidelines
State transparency
Input control
Error recovery
Error detection
Error correction
Log performance
Application integration

Summary
Speech and language communication
Dialog structure
Interface design
