Isolated Digit Recognizer using GMMs
ECE5526 FINAL PROJECT
SPRING 2011
JIM BRYAN
Abstract
Provide an in-depth look at how GMMs can be used for word recognition
based on Matlab's Statistics Toolbox.
The isolated digit recognizer is based on a voice activity detector using
energy thresholding and zero crossing detection. Moreover, the recognizer
uses MFCCs as the basis for acoustic speech representation. These are
standard voice processing techniques with which the reader is assumed to be
familiar. The focus of this presentation is on the details of the GMM
implementation in Matlab, with the idea that a good understanding of the
Matlab approach will yield insight into other system implementations such as
Sphinx and HTK.
Word recognition comprises two components: model training and model
testing.
The Statistics Toolbox function gmdistribution.fit is used for training
The Statistics Toolbox function posterior is used for testing
The purpose of this effort is to train and run the recognizer, and to
understand the basic functionality of the gmdistribution.fit and
posterior function calls.
Introduction
Based on
MATLAB Digest - January 2010
Developing an Isolated Word Recognition System in MATLAB
By Daryl Ning
Describe the Matlab GUI-based recognizer application
Provide introductory material on GMMs using a simple 2-mixture
example with 2 models
Discuss in detail the algorithms used to determine the best model
match
Show examples of the Matlab Statistics Toolbox representation of GMMs
Run the simulation
Discuss simulation results and show possible improvements
Summary
Conclusions
Areas for further study
Isolated digit recognizer overview
Uses an 8-mixture GMM per digit to train and recognize an individual
user's voice
Matlab GUI-based digit recognizer uses the following
toolboxes
Signal Processing Toolbox provides filtering and signal
processing functions
Statistics Toolbox is used to implement a GMM Expectation
Maximization algorithm to build the GMMs and to compute the
Mahalanobis distance during recognition
Data Acquisition Toolbox is used to stream the microphone
input to Matlab for continuous recognition
Single digit recognizer implemented using a dictionary of
digits 0-9
Training is done with 30-second captures of repeated
utterances of the given digit, loaded using the wavread function in
Matlab
Overview Continued
Uses the laptop's internal microphone
Sample rate is 8 ksps
Uses 20 msec frames (160 samples per frame) with a
10 msec overlap
Uses a simple voice activity detector based on energy
threshold and zero crossings per second for both
training and the recognizer
Voice activity energy and zero crossing thresholds
are programmable and must be the same for training
and recognition
No model for silence or missed digit, so the
recognizer displays the closest digit
GMM training and recognizer Matlab
function calls
The recognizer computes the posterior probabilities using
the Statistics Toolbox function posterior
Posterior accepts a gmm object/model as its input, along
with an input data set, and returns a negative log-likelihood number
that represents the data set match to the model
The smallest negative log-likelihood corresponds to the best
match
The recognizer computes the negative log-likelihood of the current
word for each model in the dictionary. The model that
has the lowest negative log-likelihood is the recognized digit.
A gmm object is created during training for each
dictionary entry, in this case digits 0-9, using the function
call gmdistribution.fit.
Example using 2 GMMs with 2 mixtures
[Figure: plot of pdf(obj2,[x,y]) over the x-y plane, with the two models labeled gmm1 and gmm2]
Posterior
Posterior extracts the gmdistribution object parameters
necessary to call Wdensity
Wdensity performs the actual log-likelihood
calculation for the GMM, given the data set
Wdensity returns two arrays
log_lh is an array of size length(data) x order(GMM)
mahalaD is an array of size length(data) x order(GMM);
this is the squared form, not the actual Mahalanobis distance
$\mathrm{mahalaD} = (x - \mu)\,\Sigma^{-1}\,(x - \mu)^{T}$
Estep calculates the log-likelihood based on the
log_lh array and returns ll, which is the log-likelihood
of data x given the model
Wdensity function description

Example function call
[log_lh,mahalaD]=wdensity(X, mu, Sigma, p, sharedCov,
CovType)
Where X is the input data
mu is an array of means with (j,:) corresponding to the jth
mean vector
Sigma is an array of arrays with (:,:,j) corresponding to
the jth sigma in the model
p holds the mixture weights
sharedCov indicates the covariance matrices may be
common to all mixtures
CovType may be either diagonal or full
Wdensity log-likelihood calculation
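The quantity computed for data point i and mixture j can be written out as below. This is a reconstruction from the implementation lines in this deck, not a copy of the original slide; the symbol names are chosen here for illustration, with $d$ the feature dimension, $p_j$ the mixture weight, and $\mu_j$, $\Sigma_j$ the mean and covariance of mixture $j$.

```latex
\mathrm{log\_lh}(i,j) = \log p_j \;-\; \tfrac{d}{2}\log(2\pi) \;-\; \tfrac{1}{2}\log\lvert\Sigma_j\rvert \;-\; \tfrac{1}{2}\,(x_i-\mu_j)\,\Sigma_j^{-1}\,(x_i-\mu_j)^{T}
```

For a diagonal covariance, $\log\lvert\Sigma_j\rvert$ reduces to the sum of the logs of the variances.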


Wdensity log-likelihood
implementation details
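[n, d] = size(X);                        % d is the feature dimension
L = sqrt(Sigma(:,:,j)); % a vector (diagonal covariance case)
logDetSigma = sum(log(Sigma(:,:,j)));    % log-determinant, diagonal case
Xcentered = bsxfun(@minus, X, mu(j,:));  % subtract the jth mean from every row
xRinv = bsxfun(@times,Xcentered , (1./ L));

mahalaD(:,j) = sum(xRinv.^2, 2);

% log_prior = log(p), the log mixture weights
log_lh(:,j) = -0.5 * mahalaD(:,j) +...
(-0.5 *logDetSigma + log_prior(j)) - d*log(2*pi)/2;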
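A rough Python sketch of the diagonal-covariance calculation above, written with plain lists so it runs without the toolbox. It is not the toolbox code: the function name wdensity_diag and its argument layout are made up for this illustration, and each covariance is assumed to be passed as a vector of variances.

```python
import math

def wdensity_diag(X, mu, var, p):
    """Per-point, per-mixture log-likelihoods for diagonal covariances.

    X   : list of n data vectors, each of length d
    mu  : list of k mean vectors
    var : list of k variance vectors (the diagonal of each covariance)
    p   : list of k mixture weights
    Returns (log_lh, mahalaD), each an n-by-k list of lists.
    """
    d = len(X[0])
    log_lh, mahalaD = [], []
    for x in X:
        lh_row, md_row = [], []
        for mu_j, var_j, p_j in zip(mu, var, p):
            # Squared distance to the mean, scaled by the per-dimension variance
            md = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu_j, var_j))
            # Log-determinant of a diagonal covariance is the sum of log variances
            log_det = sum(math.log(vi) for vi in var_j)
            lh = (-0.5 * md - 0.5 * log_det + math.log(p_j)
                  - d * math.log(2 * math.pi) / 2)
            lh_row.append(lh)
            md_row.append(md)
        log_lh.append(lh_row)
        mahalaD.append(md_row)
    return log_lh, mahalaD

# One 2-D point sitting exactly on the mean of a single unit-variance mixture
log_lh, mahalaD = wdensity_diag([[0.0, 0.0]], [[0.0, 0.0]], [[1.0, 1.0]], [1.0])
print(log_lh, mahalaD)
```

In this toy case the distance term is zero, so the log-likelihood is just the constant -d*log(2*pi)/2 with d = 2.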
estep
[ll, post, logpdf]=estep(log_lh)
Find the max of each row of the log_lh matrix (maxll)
This identifies the closest mixture for this
data point.
Convert the log_lh values to relative probabilities by using post =
exp(bsxfun(@minus, log_lh, maxll)); there will always be a 1
in the column of the maximum value, therefore each row sum
is always >= 1
Sum across the rows to get the normalizing density
density = sum(post,2);
Normalize the posteriors: post = bsxfun(@rdivide, post, density)
Calculate logpdf = log(density) + maxll;
ll = sum(logpdf)
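The steps above amount to a log-sum-exp over each row. A minimal Python sketch follows, assuming log_lh is given as a list of rows. The function mirrors the step names used here but is not the toolbox source, and the input rows are copied from the P12 ("data not from model") log_lh table in this deck.

```python
import math

def estep(log_lh):
    """Return (ll, post, logpdf) from an n-by-k list of log-likelihood rows."""
    post, logpdf = [], []
    for row in log_lh:
        maxll = max(row)
        # Shift by the row maximum so the largest term becomes exp(0) = 1
        shifted = [math.exp(v - maxll) for v in row]
        density = sum(shifted)
        # Normalized posteriors for this data point
        post.append([s / density for s in shifted])
        # Undo the shift to recover the log of the mixture density
        logpdf.append(math.log(density) + maxll)
    return sum(logpdf), post, logpdf

# log_lh rows from the P12 example (two mixtures, 16 data points)
log_lh_p12 = [
    [-6.2916, -6.2281], [-6.1189, -7.3603], [-12.5238, -2.5414],
    [-7.3336, -24.5710], [-7.0679, -14.3058], [-5.7049, -7.7255],
    [-7.8564, -23.6082], [-6.8128, -4.4655], [-27.4139, -19.2832],
    [-20.1139, -14.0730], [-27.0048, -11.4791], [-17.2614, -8.2714],
    [-33.8912, -15.5351], [-26.0666, -9.9934], [-20.4353, -9.9218],
    [-15.9387, -13.2732],
]

ll, post, logpdf = estep(log_lh_p12)
print(ll)
```

The printed ll should land close to the ll = -147.9445 reported for P12, up to rounding of the tabulated inputs. Subtracting maxll before exponentiating is what keeps exp from underflowing when the log-likelihoods are very negative.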
Estep example showing log_lh inputs for two
Gaussian mixtures and the maximum value
of the log_lh
P11: data from model
log_lh(:,1)  log_lh(:,2)  maxll
-18.6236 -3.0708 -3.0708
-36.2569 -3.0821 -3.0821
-24.1669 -2.2514 -2.2514
-33.8821 -3.2357 -3.2357
-18.4447 -3.2818 -3.2818
-5.8488 -4.2339 -4.2339
-18.4529 -2.5661 -2.5661
-14.7058 -3.5421 -3.5421
-2.7563 -19.3866 -2.7563
-3.0744 -21.2154 -3.0744
-2.4251 -14.8179 -2.4251
-4.1699 -12.7317 -4.1699
-2.5825 -16.8520 -2.5825
-4.4938 -8.5847 -4.4938
-3.7883 -13.7861 -3.7883
-2.8691 -7.2573 -2.8691
Estep example showing post and
density;
density is used to normalize post
P11: data from model
post = exp(bsxfun(@minus, log_lh, maxll)); density = sum(post,2)
post(:,1)  post(:,2)  density
1.0000 0.0000 1.0000
1.0000 0.0000 1.0000
0.0000 1.0000 1.0000
1.0000 0.0000 1.0000
1.0000 0.0001 1.0001
0.0832 1.0000 1.0832
1.0000 0.0109 1.0109
1.0000 0.0008 1.0008
0.0000 1.0000 1.0000
0.0003 1.0000 1.0003
0.0001 1.0000 1.0001
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
Estep example showing post after
normalization
and logpdf
P11: data from model
post = bsxfun(@rdivide, post, density); logpdf = log(density) + maxll;
ll = sum(logpdf) = -53.7464
post(:,1)  post(:,2)  logpdf
1.0000 0.0000 -3.6490
1.0000 0.0000 -4.6937
0.0000 1.0000 -2.3765
1.0000 0.0000 -3.3219
0.9999 0.0001 -3.1317
0.0768 0.9232 -4.4911
0.9892 0.0108 -4.0361
0.9992 0.0008 -3.8076
0.0000 1.0000 -2.7171
0.0003 0.9997 -2.5739
0.0001 0.9999 -2.3359
0.0000 1.0000 -2.6023
0.0000 1.0000 -2.1502
0.0000 1.0000 -5.5963
0.0000 1.0000 -2.2777
0.0000 1.0000 -3.9857
Estep example showing log_lh inputs
for two Gaussian mixtures and the maximum
value of the log_lh
P12: data not from model
log_lh(:,1)  log_lh(:,2)  maxll
-6.2916 -6.2281 -6.2281
-6.1189 -7.3603 -6.1189
-12.5238 -2.5414 -2.5414
-7.3336 -24.5710 -7.3336
-7.0679 -14.3058 -7.0679
-5.7049 -7.7255 -5.7049
-7.8564 -23.6082 -7.8564
-6.8128 -4.4655 -4.4655
-27.4139 -19.2832 -19.2832
-20.1139 -14.0730 -14.0730
-27.0048 -11.4791 -11.4791
-17.2614 -8.2714 -8.2714
-33.8912 -15.5351 -15.5351
-26.0666 -9.9934 -9.9934
-20.4353 -9.9218 -9.9218
-15.9387 -13.2732 -13.2732
Estep example showing post and
density;
density is used to normalize post
P12: data not from model
post = exp(bsxfun(@minus, log_lh, maxll)); density = sum(post,2)
post(:,1)  post(:,2)  density
0.9384 1.0000 1.9384
1.0000 0.2890 1.2890
0.0000 1.0000 1.0000
1.0000 0.0000 1.0000
1.0000 0.0007 1.0007
1.0000 0.1326 1.1326
1.0000 0.0000 1.0000
0.0956 1.0000 1.0956
0.0003 1.0000 1.0003
0.0024 1.0000 1.0024
0.0000 1.0000 1.0000
0.0001 1.0000 1.0001
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
0.0000 1.0000 1.0000
0.0696 1.0000 1.0696
Estep example showing post after
normalization
and logpdf
P12: data not from model
post = bsxfun(@rdivide, post, density); logpdf = log(density) + maxll;
ll = sum(logpdf) = -147.9445
post(:,1)  post(:,2)  logpdf
0.4841 0.5159 -5.5662
0.7758 0.2242 -5.8650
0.0000 1.0000 -2.5414
1.0000 0.0000 -7.3336
0.9993 0.0007 -7.0671
0.8829 0.1171 -5.5804
1.0000 0.0000 -7.8564
0.0873 0.9127 -4.3742
0.0003 0.9997 -19.2829
0.0024 0.9976 -14.0706
0.0000 1.0000 -11.4791
0.0001 0.9999 -8.2713
0.0000 1.0000 -15.5351
0.0000 1.0000 -9.9934
0.0000 1.0000 -9.9218
0.0650 0.9350 -13.2060
Log-likelihood for 2 mixture
example

P = nlogl = -ll

55.3416 109.3820
184.7868 42.8043

The diagonal terms are the cases where the
data came from the model
The off-diagonal terms represent the cases where the
data came from the other model
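The recognition decision then reduces to picking the model with the smallest negative log-likelihood. A small Python sketch using the matrix above follows. The row/column orientation (row = data set, column = model) is an assumption made for illustration, and best_model is a hypothetical helper name.

```python
# Negative log-likelihoods from the 2-model example.
# Assumption: row i holds the scores for data set i, column j for model j.
nlogl = [
    [55.3416, 109.3820],
    [184.7868, 42.8043],
]

def best_model(scores):
    """Index of the model with the smallest negative log-likelihood."""
    return min(range(len(scores)), key=lambda j: scores[j])

recognized = [best_model(row) for row in nlogl]
print(recognized)  # [0, 1]
```

Each data set is matched to its own model because the diagonal entries are the smallest in their rows. In the digit recognizer the same search runs over all ten dictionary entries.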
Gaussian Models in Matlab
[Figure: Matlab views of the model for digit one: the Gaussian mixture distribution structure, the 8x39 array of the 8 Gaussian model means, and the diagonal covariance matrix]
Training the GMMs
Before recording can begin, it is necessary to set up
the laptop's internal microphone
Training involves finding a quiet environment and
recording 30 seconds of utterances for each digit
These are captured using Matlab's wavrecord
y = wavrecord(30*8000,8000);
A supplied utility allows viewing the output of the
voice activity detection algorithm in order to
confirm correct captures of the training data
speechdetect(y);
Trainmodels overview
Generates frames of speech based on 160 samples/frame with an 80
sample overlap
Uses the same energy detect and zero crossing thresholds as the
recognizer
Determines portions of voiced speech based on these thresholds as well
as a minimum of 250 msec duration for each word
A minimum of 100 msec is required between each word
Frames are marked as VA, voice active, and stored in a buffer called ALLdata.
ALLdata is arranged so that the frames are in columns; the dimensions are
160 x numFRAMES
Once all the words are captured, MFCC is called and passed the
ALLdata buffer for Mel cepstral coefficient processing
MFCC returns MFCC vectors of 39 coefficients per frame
gmdistribution.fit is passed the MFCC vectors and runs an EM
algorithm on them to generate an 8-mixture GMM for each
digit
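The framing step can be sketched in a few lines of Python. This is an illustration only: frame_signal is a hypothetical helper, the input is placeholder silence, and frames are stored as rows here, whereas ALLdata stores them as columns.

```python
def frame_signal(samples, frame_len=160, hop=80):
    """Split a sample list into overlapping frames (one frame per list entry)."""
    frames = []
    start = 0
    # Stop once a full frame no longer fits in the remaining samples
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

one_second = [0.0] * 8000          # 1 s of placeholder silence at 8 kHz
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 99 160
```

With an 80-sample hop, one second of audio yields 99 full frames, roughly one frame every 10 msec.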
MFCC credits
Derived from the original function 'mfcc.m' in the Auditory Toolbox
% written by:
%
% Malcolm Slaney
% Interval Research Corporation
% malcolm@interval.com
% http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
%
% Also uses the 'deltacoeff.m' function written by:
%
% Olutope Foluso Omogbenigun
% London Metropolitan University
% http://www.mathworks.com/matlabcentral/fileexchange/19298
MFCC overview
Pre-filter the data using a pre-emphasis filter
preEmphasized = filter([1 -.97], 1, input);
Window the data with a Hamming window
preEmphasized = preEmphasized.*repmat(hamWindow(:),1,frames);
fftMag = abs(fft(preEmphasized,fftSize));
earMag = log10(mfccFilterWeights * fftMag);
ceps = mfccDCTMatrix * earMag;
meanceps = mean(ceps,2);
ceps = ceps - repmat(meanceps,1,frames);
d = (deltacoeff(ceps')).*0.6; %Computes delta-mfcc
d1 = (deltacoeff(d)).*0.4; %as above for delta-delta-mfcc
ceps = [ceps; d'; d1']; %concatenates all together
Returns a vector of 13 cepstral, 13 delta, and 13 delta-delta coefficients per frame
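To make the first step concrete, the pre-emphasis call filter([1 -.97], 1, input) computes y[n] = x[n] - 0.97*x[n-1]. A minimal Python sketch of that difference equation for a single frame follows. The function name is chosen here for illustration, and a zero initial condition is assumed.

```python
def pre_emphasis(samples, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1], with x[-1] taken as 0."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out

# A constant (DC) input is almost entirely removed after the first sample
print(pre_emphasis([1.0, 1.0, 1.0]))
```

The filter attenuates slowly varying content and boosts the higher frequencies before windowing and the FFT.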
Sound Settings for Microphone on
Windows 7 laptop
Voice Activity Detector Overview
Voice activity detection based on energy detection and zero
crossing rate
std_zxings: is the zero crossing threshold, default = .5
std_energy: is the energy detect threshold, default = .5
Energy and zero crossing thresholds are determined from the
first 500 msec of training data, which is used to estimate the background silence
energy and zero crossing rate
The same threshold settings must be used for all digit
recordings
Once a good recording has been made, save it to the hard drive
using:
wavwrite(y,8000,'one.wav');
Repeat for all the digits
Run transcript to train the GMMs
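The two per-frame features behind the detector can be sketched as follows. How std_energy and std_zxings enter the decision is not spelled out in these slides, so the threshold rule below (background mean plus a multiple of the background standard deviation) is purely an assumption for illustration, and all function names are hypothetical.

```python
def frame_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)

def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def mean_std(values):
    """Mean and (population) standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

def threshold(background_values, std_factor=0.5):
    """Assumed rule: background mean plus std_factor standard deviations."""
    m, s = mean_std(background_values)
    return m + std_factor * s

quiet = [0.01, -0.01] * 80          # placeholder low-level alternating frame
loud = [0.5] * 80 + [-0.5] * 80     # placeholder high-energy frame
print(frame_energy(quiet), zero_crossings(quiet))
print(frame_energy(loud), zero_crossings(loud))
```

In a sketch like this the background values would come from the frames in the first 500 msec of the recording, and a frame would be marked voice active when its features exceed the resulting thresholds.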
Author's Ideal Voice Activity Detector
[Figure: voice detect using default thresholds, digit = one; x-axis scaled by 10^4]
[Figure: voice detect using default thresholds 1,1, digit = one; x-axis scaled by 10^4]
[Figure: voice detect using default thresholds 1.5,1.5, digit = one; x-axis scaled by 10^4]
Transcript reads each recording and
calls trainmodels
y = wavread('one.wav');
trainmodels(y,'one');
y = wavread('two.wav');
trainmodels(y,'two');
y = wavread('three.wav');
trainmodels(y,'three');
y = wavread('four.wav');
trainmodels(y,'four');
y = wavread('five.wav');
trainmodels(y,'five');
y = wavread('six.wav');
trainmodels(y,'six');
y = wavread('seven.wav');
trainmodels(y,'seven');
y = wavread('eight.wav');
trainmodels(y,'eight');
y = wavread('nine.wav');
trainmodels(y,'nine');
y = wavread('zero.wav');
trainmodels(y,'zero');
GMM dimensions for typical
utterance
Assume average digit length is 300 msec
Fs = 8000 Hz
1/Fs = 125 µsec
160 samples/Fs = 20 msec
Since frames overlap by 50% and use a Hamming window, 1
frame occurs every 10 msec
Average number of frames per word: 300/10 = 30
MFCC takes in 30x160 samples and produces 30x39
MFCC vectors on average
Average size of the log_lh matrix per word for 8 Gaussian
mixtures = 30x8
Log-likelihood is based on an average 30x8 matrix
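The arithmetic on this slide can be checked with a few lines of Python. The variable names are chosen for illustration, and the 300 msec digit length is the slide's own assumption.

```python
fs_hz = 8000                 # sample rate
frame_len = 160              # samples per frame
hop = frame_len // 2         # 50% overlap -> 80 samples
word_ms = 300                # assumed average digit duration
n_mix = 8                    # mixtures per digit model
n_coeff = 39                 # 13 cepstral + 13 delta + 13 delta-delta

frame_ms = 1000 * frame_len / fs_hz      # 20.0 msec per frame
hop_ms = 1000 * hop / fs_hz              # 10.0 msec between frame starts
frames_per_word = int(word_ms / hop_ms)  # 30 frames

print((frames_per_word, n_coeff))  # (30, 39)  MFCC matrix
print((frames_per_word, n_mix))    # (30, 8)   log_lh matrix
```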
Voice activity detect filter implemented as a
128-tap FIR filter based on a Chebyshev
window with 40 dB sidelobe attenuation
[Figure: voice detector using a 125-750 Hz 128-tap Chebyshev bandpass filter with 40 dB sidelobe suppression, a 20 msec pre one-shot and a 40 msec post one-shot, digit = one; x-axis scaled by 10^4]
[Figure: training vector for digit one after modified VA detection]
Scoring
Difficult to score based on the real-time recognizer.
Recognizer fires on ambient noise
Recognizer is slow as it has to perform the GMM calculation for all
dictionary entries
A recorded test set, counting from 1-9, 0, produced 70%
accuracy; two, seven, and eight were not classified correctly
Had to lower the zero crossing threshold for the test to collect all the
utterances
Accuracy might be limited by insufficient training data
Could have bad models for some of the classes
Hand scoring is difficult because each utterance must be
correctly labeled for the classifier. Seven had a null portion in the middle
The laptop's fan kicked on during training; this caused
ambient noise during training, so the data set was not perfect
[Figure: test set counting 1-9, 0 and repeat, frame based with silence removed; x-axis scaled by 10^4]
Summary
An 8-mixture GMM for speech recognition was demonstrated.
Using only a small training set and a laptop microphone, digit
recognition was demonstrated at a sample rate of only 8000 Hz
Care and feeding of the GMMs is very important for successful
implementation.
Garbage in, garbage out is especially true for speech recognition
Background noise is a very big problem in accurate speech
recognition. Adaptive noise cancellation using a second
microphone for just the background noise should improve accuracy
The voice activity detector is a critical component of the recognizer
Scoring is also a difficult problem as the acoustic data must be
synchronized with the dictionary to provide accurate results
Marking the speech pattern and word isolation is not without
difficulties as pauses between syllables occur during a single
utterance
Conclusion
GMMs are very powerful models for speech recognition.
Scoring the models is difficult. The EM algorithm will
produce different models based on the random seeding of
the starting conditions.
Training on only ~15 repetitions of simple utterances is not sufficient for
good GMM accuracy
The voice activity detector plays a significant part in the
training and testing of the data
A new voice activity detector did not magically produce
100 percent scoring accuracy with a recorded test wav file
Noise cancellation techniques and sophisticated voice
detection algorithms are necessary for good performance
as well as model optimization
Areas for further
investigation
Automate the scoring process
Improve the Voice activity detector in the real
time recognizer
Add a second microphone for adaptive noise
cancellation
Convert GMMs to a combination of GMMs and HMMs
so the dictionary search isn't so computationally
intensive
Modify the number of mixtures of the GMMs with
HMM phonetic implementation
HMMs will allow for continuous digit recognition
