
Speech Recognition

Sources: Judith A. Markowitz, Using Speech Recognition, Prentice Hall, NJ, 1996; Guojun Lu, Multimedia Database Management Systems, Chapter 5, Artech House, 1999

Speech Recognition


Preprocessing

- Digitize and represent waveforms
- Feature extraction (10 ms frames)

Important feature:
- Mel-frequency cepstral coefficients (MFCC), developed based on how humans hear sound
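As a concrete illustration, the sketch below extracts 13 MFCCs at a 10 ms frame rate using the librosa library; the file name, sample rate, and window settings are illustrative assumptions, not values from the slides.

```python
# A minimal MFCC extraction sketch, assuming librosa is available.
# "audio.wav" and the frame parameters are illustrative choices.
import librosa

y, sr = librosa.load("audio.wav", sr=16000)  # digitize/resample to 16 kHz

# 25 ms analysis window with a 10 ms hop -> one 13-dimensional MFCC
# feature vector per 10 ms frame.
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)
print(mfcc.shape)  # (13, number_of_frames)
```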

Recognition: identify what the user has said

Three approaches:
- Template matching
- Acoustic-phonetic recognition (e.g., FastTalk)
- Stochastic processing

Terminology

- Phoneme: the smallest unit of sound that is distinctive (distinguishing one word from another for a given language)
- Example: the words seat, meat, beat, cheat are different words because each initial sound (s, m, b, ch) is a separate phoneme in English.
- There are about 40-50 phonemes in English.
- Example transcription: Abnormal = AE B N AO R M AX L

Terminology (Contd)


- The simplest sound is a pure tone, which has a sine waveform. Pure tones are rare. Most sounds, including speech phonemes, are complex waves: a dominant or primary frequency, called the fundamental frequency, overlaid with secondary frequencies.

- The fundamental frequency of speech is the rate at which the vocal cords flap against each other when producing a voiced phoneme.

Examples of Complex Waves for Phonemes
[Figure: example waveforms of noise vs. a phoneme]

Terminology (Contd)

- Formants: bands of secondary frequencies that distinguish one phoneme from another.
- Multi-frequency sounds like the phonemes of speech can be represented as complex waves.
- Bandwidth of a complex wave: the range of frequencies in the waveform.
- Sounds that produce acyclic waves are often called noise.

Co-articulation

- Co-articulation effects: inter-phoneme influences.
- Neighboring phonemes, the position of a phoneme within a word, and the position of the word in the sentence all influence the way a phoneme is uttered.
- Because of co-articulation effects, a specific utterance or instance of a phoneme is called a phone.

Template Matching

- Each word or phrase is stored as a separate template.
- Idea: select the template that best matches the spoken input (frame-by-frame comparison), provided the dissimilarity is within a predetermined threshold.
- Template matching is performed at the word level.
- Temporal alignment is used to ensure that fast or slow utterances of the same word are not identified as different words.
- Dynamic Time Warping (DTW) is used for temporal alignment; a sketch follows.

Dynamic Time Warping
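The sketch below shows the classic DTW recurrence used for temporal alignment of two feature sequences (e.g., frame-wise MFCCs); the Euclidean frame distance and the NumPy implementation are illustrative assumptions, not details from the slides.

```python
# A minimal Dynamic Time Warping sketch over per-frame feature vectors.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Accumulated alignment cost between sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed warping moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Recognition then picks the stored template with the smallest warped
# distance, accepting it only if the dissimilarity is within a threshold.
```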

Robust Template

- In early systems, one template was built from one example (token).
- To handle variability, many templates of the same word are stored.
- A robust template is created from more than one token of the same word, using mathematical averaging and statistical clustering techniques.

Template Matching

Advantages:
- Performs well with small vocabularies of phonetically distinct words.
- Midsize vocabularies in the range of 1,000-10,000 words are possible if the number of vocabulary choices at any one time is kept minimal.

Disadvantages:
- Must have at least one template for each word in the application vocabulary.
- Not good with large vocabularies containing words that have similar sounds (confusable words, e.g., to and two).
Acoustic-Phonetic Recognition

- Store only representations of phonemes for a language.
- Three steps:

I. Feature extraction
II. Segmentation and labeling:
   - Segmentation determines when one phoneme ends and another begins.
   - Labeling identifies the phonemes.
   - Output: a set of phoneme hypotheses that can be represented by a phoneme lattice, a decision tree, etc. (a toy sketch follows).
III. Word-level recognition: search for words matching the phoneme hypotheses. The word best matching a sequence of hypotheses is identified.
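To make the lattice idea concrete, here is a toy sketch in which each lattice edge carries a phoneme hypothesis with a score; the structure, scores, and greedy selection are invented purely for illustration.

```python
# A toy phoneme lattice: edges[start_node] -> (phoneme, score, end_node).
from collections import defaultdict

lattice = defaultdict(list)
lattice[0] += [("AE", 0.9, 1), ("EH", 0.4, 1)]  # competing hypotheses
lattice[1] += [("B", 0.8, 2)]
lattice[2] += [("N", 0.7, 3), ("M", 0.5, 3)]

def best_path(node=0, final=3):
    """Greedy walk over a linear lattice: keep the top hypothesis per segment."""
    phonemes = []
    while node != final:
        phoneme, _, node = max(lattice[node], key=lambda e: e[1])
        phonemes.append(phoneme)
    return phonemes

print(best_path())  # ['AE', 'B', 'N']
```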

Stochastic Processing

- Use Hidden Markov Models (HMMs) to store the model of each of the items that will be recognized.
  - Items: phonemes or subwords.

[Figure: 3-state HMM of a triphone, obtained from training]

- Each state of the HMM has statistics for a segment of the word.
  - The statistics describe the parameter values and variation found in samples of the word.
- A recognition system may have numerous HMMs, or may combine them into one network of states and transitions.
- Stochastic processing using HMMs is accurate and flexible.
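As a minimal sketch of what "statistics per state" might look like, the code below defines a 3-state left-to-right HMM with Gaussian emission parameters per state; all numbers are placeholder assumptions, not trained values.

```python
# A minimal 3-state HMM container with per-state Gaussian statistics.
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    trans: np.ndarray   # state-transition probabilities, (n_states, n_states)
    means: np.ndarray   # per-state mean feature vector, (n_states, n_features)
    vars_: np.ndarray   # per-state diagonal variances,  (n_states, n_features)

# Left-to-right topology typical of speech HMMs: each state may loop on
# itself (absorbing slow speech) or advance to the next state.
triphone_model = HMM(
    trans=np.array([[0.6, 0.4, 0.0],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]]),
    means=np.zeros((3, 13)),  # e.g., 13 MFCCs per frame (placeholder)
    vars_=np.ones((3, 13)),
)
```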

Subword Units

- Training whole-word models is not practical for large vocabularies, so subword units are used instead.
- The most popular subword unit is the triphone.
- A triphone (phoneme in context, PIC) consists of the current phoneme and its left and right phonemes.
- A triphone is generally represented by a 3-state HMM:
  - The first state represents the left phoneme.
  - The middle state represents the current phoneme.
  - The last state represents the following phoneme.
- The number of triphones for English is much larger than the number of phonemes (with about 50 phonemes, up to 50^3 = 125,000 context-dependent triphones in principle).

- The recognition system compares the input with the stored models.
- Two comparison approaches:
  - The Baum-Welch maximum-likelihood algorithm computes probability scores between the input and the stored models and selects the best match.
  - The Viterbi algorithm looks for the best path (a sketch follows).
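The following is a minimal log-domain Viterbi sketch over a discrete-observation HMM; the parameter names and the discrete-symbol simplification are assumptions for illustration (real recognizers score continuous feature vectors).

```python
# A minimal Viterbi decoder: most likely state path for an observation sequence.
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """obs: observation indices (length T); log_pi: (N,) initial log probs;
    log_A: (N, N) transition log probs; log_B: (N, n_symbols) emission log probs."""
    T, N = len(obs), len(log_pi)
    delta = np.empty((T, N))            # best log score ending in each state
    back = np.zeros((T, N), dtype=int)  # best predecessor per state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Trace back the best path from the highest-scoring final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```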

Evaluation of Speech Recognition Systems

Factors affecting performance:
- Vocabulary size and flexibility
- Required sentence and application structures
- The end users
- Type and amount of noise
- Stress placed upon the person using the application

Basic classes of errors:
- Deletion: dropping a word
- Substitution: replacing a word with another word
- Insertion: adding a word
- Rejection: the input cannot be recognized by the program
- A high threshold causes more rejections; a low threshold causes more substitution or insertion errors.
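These three error types (deletions, substitutions, insertions) are conventionally combined into a word error rate via edit distance; the sketch below shows that standard computation, which the slides do not spell out.

```python
# A sketch of word error rate (WER) from edit distance over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / max(len(ref), 1)

print(wer("take the toll road", "take a toll road to"))  # 1 sub + 1 ins -> 0.5
```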

Variability

- Co-articulation
- Inter-speaker differences
- Intra-speaker inconsistencies
- Robustness of a system: how the system performs under variability

Terminology (Contd)

- Corpus: reference database for training; it includes machine-readable dictionaries, word lists, and published materials from specific professions.
- Homophones: same pronunciation, different spelling (e.g., one and won)
- Active vocabulary: the set of words the application expects to be spoken at any one time.

Grammars

- Grammars (models, scripts) are used to structure words to reduce perplexity, increase speed and accuracy, and enhance vocabulary flexibility.
- Three kinds:
  - Finite-state grammars
  - Probabilistic models
  - Linguistics-based grammars

- The recognizer searches through the active vocabulary to find the best match.
- Branching factor: the number of items in the active vocabulary at a single point in the recognition process.
- Perplexity is often used to refer to the average branching factor.
- A high branching factor means high recognition time.

Example:
- A total vocabulary of 1000 words
- Input: Take the toll road to Milwaukee
- With no grammar, the branching factor at each point is 1000, so the perplexity is 1000.
- With a finite-state grammar:
  - Take the TYPE road to PLACE
  - TYPE = high OR toll OR back OR rocky OR long
  - PLACE = Milwaukee OR Kokomo OR nowhere OR Arby OR Rio
- The branching factor at the first, second, fourth, and fifth positions is one; the branching factor at the third and sixth positions is five, as the sketch below illustrates.
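Here is a small sketch computing the per-position branching factors and their average for the grammar above, using "perplexity" in the slides' informal sense of average branching factor.

```python
# Branching factors for the finite-state grammar example above.
grammar = [
    ["take"], ["the"],
    ["high", "toll", "back", "rocky", "long"],          # TYPE
    ["road"], ["to"],
    ["Milwaukee", "Kokomo", "nowhere", "Arby", "Rio"],  # PLACE
]

branching = [len(choices) for choices in grammar]
print(branching)                        # [1, 1, 5, 1, 1, 5]
print(sum(branching) / len(branching))  # average branching factor: ~2.33
```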

Weakness of Finite-State Grammars

- Users cannot deviate from the patterns.
- Cannot rank the probability of occurrence to improve speed and accuracy.

Statistical Models

- Often used in dictation systems
- Specify what is likely instead of what is allowed.
- Two forms of statistical modeling:
  - N-gram models
  - N-class models

N-gram Model

- Identify the current (unknown) word by assuming that its identity depends on the previous N-1 words and on the acoustic information of the unknown word.

Example: trigram (N = 3), i.e., the two words prior to the unknown word
- "This is my printer [unknown word]"
- The unknown word would be identified using the two prior words "my printer" and the acoustic information of the current word.
- Good for large-vocabulary dictation applications.
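The toy sketch below estimates trigram probabilities from counts and combines them with a per-word acoustic score, as the trigram example describes; the tiny corpus and the acoustic scores are invented for illustration.

```python
# A toy trigram model: estimate P(w | u, v) and weight it by an
# (assumed) acoustic likelihood for each candidate word.
from collections import Counter

corpus = "this is my printer this is my paper this is my printer".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(u, v, w):
    """Maximum-likelihood estimate of P(w | u, v)."""
    return trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0

# Rank candidates for "... my printer [unknown]".
acoustic = {"this": 0.5, "is": 0.2}  # hypothetical acoustic likelihoods
for w in acoustic:
    print(w, p_trigram("my", "printer", w) * acoustic[w])
```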

N-class Model

- Extends the concept of N-gram modeling to syntactic categories.
- Bi-class modeling calculates the probability that two categories will appear in succession.
- Example of bi-class categories:
  - Article: a, an, the
  - Countable noun: table, book, shoe
  - The model stores the probability of article followed by countable-noun.
- Works well with a corpus much smaller than N-gram modeling requires (see the sketch below).
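As a toy illustration, the sketch below maps words to syntactic classes and estimates the article-then-countable-noun probability from class-bigram counts; the word-to-class table and the corpus are invented.

```python
# A toy bi-class model: P(class2 | class1) from class-bigram counts.
from collections import Counter

word_class = {
    "a": "ARTICLE", "an": "ARTICLE", "the": "ARTICLE",
    "table": "NOUN", "book": "NOUN", "shoe": "NOUN",
}

corpus = "the book a table the shoe a book".split()
classes = [word_class[w] for w in corpus]

class_bigrams = Counter(zip(classes, classes[1:]))
class_counts = Counter(classes[:-1])

p = class_bigrams[("ARTICLE", "NOUN")] / class_counts["ARTICLE"]
print(p)  # probability that an article is followed by a countable noun
```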

Linguistics-Based Grammars

- Aim to understand what a user has said, as well as identify the spoken words.
- Context-free grammars are often used.
