EMI Lab Manual2

Tutorial on singing types
This tutorial explains the basics of singing-type classification, which assigns a singing clip to
one of two classes, head voice or chest voice, according to where the voice resonates. The corpus
was recorded by Bernard Wang.
Contents
• Preprocessing
• Dataset collection
• Performance evaluation
• Dimensionality reduction
• Summary
• Appendix
Preprocessing
Before we start, let's add the necessary toolboxes to the MATLAB search path:
addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/sap
addpath d:/users/jang/matlab/toolbox/machineLearning
All the above toolboxes can be downloaded from the author's toolbox page. Make sure you are
using the latest toolboxes to work with this script.
For compatibility, here we list the platform and MATLAB version that we used to run this script:
fprintf('Platform: %s\n', computer);
fprintf('MATLAB version: %s\n', version);
fprintf('Script starts at %s\n', char(datetime));
scriptStartTime=tic; % Timing for the whole script
Platform: PCWIN64
MATLAB version: 8.5.0.197613 (R2015a)
Script starts at 04-Feb-2017 19:55:14
Dataset collection
First of all, we collect all the sound files. The dataset can be found at this link. We can use
the command "mmDataCollect" to collect all the file information:
auDir='datasetOfSingingTypes';
opt=mmDataCollect('defaultOpt');
opt.extName='wav';
auData=mmDataCollect(auDir, opt, 1);
Collecting 28 files with extension "wav" from "datasetOfSingingTypes"...
We need to perform feature extraction and put all the dataset into a format that is easier for
further processing, including classifier construction and evaluation.
myTic=tic;
if ~exist('ds.mat', 'file')
opt=dsCreateFromMm('defaultOpt');
opt.auFeaFcn=@auFeaMfcc; % Function for feature extraction
opt.auEpdOpt.method='vol';
opt.auEpdSelectionMethod='maxDuration';
ds=dsCreateFromMm(auData, opt);
fprintf('Saving ds.mat...\n'); save ds ds
else
fprintf('Loading ds.mat...\n'); load ds.mat
end
fprintf('time=%g sec\n', toc(myTic));
Loading ds.mat...
time=0.00217325 sec
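The setting opt.auEpdOpt.method='vol' above presumably selects volume-based endpoint detection. As a rough illustration of that idea (not the toolbox's actual implementation), the following sketch cuts a synthetic signal into frames, computes an abs-sum volume per frame, and keeps the frames whose volume exceeds a threshold. The sampling rate, frame size, and threshold ratio are assumptions chosen for this example only.

```matlab
% Minimal sketch of volume-based endpoint detection (all parameters are assumed).
fs = 8000;                                   % sampling rate in Hz (assumed)
t = (0:fs-1)'/fs;                            % one second of time stamps
y = zeros(fs, 1);                            % silence everywhere ...
y(2001:6000) = 0.5*sin(2*pi*220*t(2001:6000));   % ... except a tone from 0.25 s to 0.75 s

frameSize = 400;                             % 50 ms frames without overlap (assumed)
frameNum = floor(length(y)/frameSize);
frameMat = reshape(y(1:frameNum*frameSize), frameSize, frameNum);
volume = sum(abs(frameMat), 1);              % abs-sum volume of each frame

volTh = 0.1*max(volume);                     % threshold at 10% of the peak volume (assumed)
voiced = volume > volTh;                     % frames judged to contain sound
startFrame = find(voiced, 1, 'first');
endFrame = find(voiced, 1, 'last');
fprintf('Sound from frame %d to frame %d (of %d)\n', startFrame, endFrame, frameNum);
```

For this synthetic signal the sketch reports frames 6 to 15 out of 20, which matches the position of the tone.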
Now all the frame-based features are extracted and stored in "ds". Next we can try to plot the
extracted features for each class:
figure; dsFeaVecPlot(ds);
Performance evaluation
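Now we evaluate performance with LOFOCV (leave-one-file-out cross validation), where each file
is a recording of a complete sound event. Each file is held out in turn, a classifier is trained on
the remaining files, and the held-out file is then classified. LOFOCV proceeds as follows: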
opt=perfLoo4audio('defaultOpt');
[ds2, fileRr, frameRr]=perfLoo4audio(ds, opt);
fprintf('Frame-based leave-one-file-out RR=%g%%\n', frameRr*100);
fprintf('File-based leave-one-file-out RR=%g%%\n', fileRr*100);
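1/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/A#3_chest.WAV", time=0.283076 sec
2/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/A2_chest.WAV", time=0.223322 sec
3/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/A3_chest.WAV", time=0.235403 sec
4/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/B2_chest.WAV", time=0.211378 sec
5/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/B3_chest.WAV", time=0.217453 sec
6/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/C#3_chest.WAV", time=0.212429 sec
7/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/C3_chest.WAV", time=0.218728 sec
8/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/D#3_chest.WAV", time=0.506582 sec
9/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/D3_chest.WAV", time=0.2052 sec
10/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/E3_chest.WAV", time=0.237799 sec
11/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/F#3_chest.WAV", time=0.248914 sec
12/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/F3_chest.WAV", time=0.217656 sec
13/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/G#3_chest.WAV", time=0.375588 sec
14/28: Leave-one-file-out CV for "datasetOfSingingTypes/Chest voice/G3_chest.WAV", time=0.227494 sec
15/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/A#3_head.WAV", time=0.23512 sec
16/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/A3_head.WAV", time=0.297975 sec
17/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/B3_head.WAV", time=0.226897 sec
18/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/C#4_head.WAV", time=0.233566 sec
19/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/C4_head.WAV", time=0.228133 sec
20/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/D#4_head.WAV", time=0.226349 sec
21/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/D4_head.WAV", time=0.221052 sec
22/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/E4_head.WAV", time=0.223945 sec
23/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/F#4_head.WAV", time=0.206353 sec
24/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/F4_head.WAV", time=0.231214 sec
25/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/G#3_head.WAV", time=0.251855 sec
26/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/G#4_head.WAV", time=0.221119 sec
27/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/G3_head.WAV", time=0.259713 sec
28/28: Leave-one-file-out CV for "datasetOfSingingTypes/Head Voice/G4_head.WAV", time=0.237161 sec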
Frame-based leave-one-file-out RR=96.9448%
File-based leave-one-file-out RR=100%
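The frame-based rate can be lower than the file-based rate because a file can still be labeled correctly when a few of its frames are misclassified. The sketch below illustrates this with made-up 2-D frames, a nearest-centroid classifier, and a majority vote over the frames of each held-out file. The classifier and the voting rule are assumptions for illustration; they are not necessarily what perfLoo4audio uses.

```matlab
% Sketch of leave-one-file-out CV on synthetic data (nearest-centroid classifier assumed).
input  = [0 1 0 5  1 0 1  5 6 5  6 5 6 0;    % feature 1 of each frame
          0 0 1 5  1 0 0  5 5 6  6 6 5 1];   % feature 2 of each frame
output = [1 1 1 1  1 1 1  2 2 2  2 2 2 2];   % ground-truth class of each frame
fileId = [1 1 1 1  2 2 2  3 3 3  4 4 4 4];   % file that each frame belongs to

fileNum = max(fileId);
framePred = zeros(size(output));
filePred = zeros(1, fileNum);
fileGt = zeros(1, fileNum);
for f = 1:fileNum
    testIdx = (fileId == f);                 % frames of the held-out file
    % Class centroids are computed from the training frames only
    c1 = mean(input(:, ~testIdx & output == 1), 2);
    c2 = mean(input(:, ~testIdx & output == 2), 2);
    x = input(:, testIdx);
    d1 = sum((x - repmat(c1, 1, size(x, 2))).^2, 1);
    d2 = sum((x - repmat(c2, 1, size(x, 2))).^2, 1);
    framePred(testIdx) = 1 + (d2 < d1);      % class 2 when centroid 2 is closer
    filePred(f) = mode(framePred(testIdx));  % majority vote over the file's frames
    fileGt(f) = mode(output(testIdx));
end
frameRr = mean(framePred == output);
fileRr = mean(filePred == fileGt);
fprintf('Frame-based RR=%g%%, file-based RR=%g%%\n', frameRr*100, fileRr*100);
```

In this toy data, files 1 and 4 each contain one outlier frame, so 12 of the 14 frames are classified correctly while all 4 files are.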
We can plot the frame-based confusion matrix:
confMat=confMatGet(ds2.output, ds2.frameClassIdPredicted);
confOpt=confMatPlot('defaultOpt');
confOpt.className=ds.outputName;
figure; confMatPlot(confMat, confOpt);
We can also plot the file-based confusion matrix:
confMat=confMatGet(ds2.fileClassId, ds2.fileClassIdPredicted);
figure; confMatPlot(confMat, confOpt);
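For reference, a confusion matrix simply counts how often each ground-truth class is predicted as each class. The sketch below builds one with base MATLAB from made-up labels, with rows as ground truth and columns as predictions. This orientation is an assumption and may differ from what confMatGet returns.

```matlab
% Sketch: confusion matrix from label vectors (rows = ground truth, columns = prediction).
gt   = [1 1 1 2 2 2 2 1];                    % made-up ground-truth class IDs
pred = [1 2 1 2 2 1 2 1];                    % made-up predicted class IDs
classNum = 2;
confMat = accumarray([gt(:) pred(:)], 1, [classNum classNum]);
rr = trace(confMat)/sum(confMat(:));         % recognition rate = diagonal sum / total
disp(confMat);
fprintf('Recognition rate = %g%%\n', rr*100);
```

Here the matrix is [3 1; 1 3] and the recognition rate is 75%.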
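We can also list the ground truth and the prediction of every sound file in a table (the list type is set to 'all', so all cases are shown, not only the misclassified ones):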
for i=1:length(auData)
auData(i).classPredicted=ds.outputName{ds2.fileClassIdPredicted(i)};
end
opt=mmDataList('defaultOpt');
opt.listType='all';
mmDataList(auData, opt);
List of 28 cases
Index\Field | File | GT ==> Predicted | Hit | url
1 | A#3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/A#3_chest.WAV
2 | A2_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/A2_chest.WAV
3 | A3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/A3_chest.WAV
4 | B2_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/B2_chest.WAV
5 | B3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/B3_chest.WAV
6 | C#3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/C#3_chest.WAV
7 | C3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/C3_chest.WAV
8 | D#3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/D#3_chest.WAV
9 | D3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/D3_chest.WAV
10 | E3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/E3_chest.WAV
11 | F#3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/F#3_chest.WAV
12 | F3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/F3_chest.WAV
13 | G#3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/G#3_chest.WAV
14 | G3_chest.WAV | Chest voice ==> Chest voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Chest voice/G3_chest.WAV
15 | A#3_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/A#3_head.WAV
16 | A3_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/A3_head.WAV
17 | B3_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/B3_head.WAV
18 | C#4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/C#4_head.WAV
19 | C4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/C4_head.WAV
20 | D#4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/D#4_head.WAV
21 | D4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/D4_head.WAV
22 | E4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/E4_head.WAV
23 | F#4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/F#4_head.WAV
24 | F4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/F4_head.WAV
25 | G#3_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/G#3_head.WAV
26 | G#4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/G#4_head.WAV
27 | G3_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/G3_head.WAV
28 | G4_head.WAV | Head Voice ==> Head Voice | true | /jang/books/audioSignalProcessing/appNote/singingType/datasetOfSingingTypes/Head Voice/G4_head.WAV
Dimensionality reduction
To visualize the distribution of the dataset, we project the original dataset into 2-D space.
This can be achieved by LDA (linear discriminant analysis):
ds2d=lda(ds);
ds2d.input=ds2d.input(1:2, :);
figure; dsScatterPlot(ds2d); xlabel('Input 1'); ylabel('Input 2');
title('MFCC projected on the first 2 lda vectors');
The dataset has only two classes, so the scatter plot shows how much the chest-voice and
head-voice frames overlap after projection. Frames in the overlapping region are the ones most
likely to be confused, which is consistent with the frame-based confusion matrix shown earlier.
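As a reminder of what LDA computes in the two-class case, the sketch below derives the Fisher discriminant direction w = Sw\(m2 - m1) from made-up 2-D samples and projects both classes onto it. With only two classes the between-class scatter has rank one, so in this textbook formulation there is a single discriminant direction; the lda function used above may handle the remaining axes differently. All data values are assumptions for illustration.

```matlab
% Sketch: two-class Fisher discriminant direction on synthetic data.
X1 = [1 2 3 2; 2 3 3 2];                     % class 1, one column per sample
X2 = [6 7 8 7; 5 8 7 6];                     % class 2, one column per sample
m1 = mean(X1, 2);
m2 = mean(X2, 2);
D1 = X1 - repmat(m1, 1, size(X1, 2));        % deviations from the class mean
D2 = X2 - repmat(m2, 1, size(X2, 2));
Sw = D1*D1' + D2*D2';                        % within-class scatter matrix
w = Sw \ (m2 - m1);                          % Fisher discriminant direction
p1 = w'*X1;                                  % 1-D projections of class 1
p2 = w'*X2;                                  % 1-D projections of class 2
fprintf('Class 1 range: [%g, %g]; class 2 range: [%g, %g]\n', min(p1), max(p1), min(p2), max(p2));
```

For these samples the two projected ranges do not overlap, which is the kind of separation the scatter plot is meant to reveal.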
It is also possible to apply the LDA projection and obtain the recognition rate at each
dimensionality via leave-one-out cross validation with KNNC (k-nearest-neighbor classifier):
opt=ldaPerfViaKnncLoo('defaultOpt');
opt.mode='exact';
recogRate1=ldaPerfViaKnncLoo(ds, opt);
ds2=ds; ds2.input=inputNormalize(ds2.input); % input normalization
recogRate2=ldaPerfViaKnncLoo(ds2, opt);
[featureNum, dataNum] = size(ds.input);
plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-');
grid on
legend('Raw data', 'Normalized data', 'location', 'southeast');
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)');
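The comparison between raw and normalized data matters because a nearest-neighbor classifier is sensitive to feature scale. The sketch below runs leave-one-out 1-NN on a tiny synthetic dataset before and after z-score normalization. It assumes that inputNormalize performs a z-score normalization, which is not confirmed here, and the data are constructed so that the large-scale feature is uninformative.

```matlab
% Sketch: leave-one-out 1-NN on raw vs. z-score-normalized synthetic data.
input  = [0 0.1 0.2 1.0 1.1 1.2;             % feature 1: small scale, informative
          0 100 200 10  110 210];            % feature 2: large scale, uninformative
output = [1 1 1 2 2 2];
n = size(input, 2);

mu = mean(input, 2);                         % z-score normalization of each feature (row)
sigma = std(input, 0, 2);
inputNorm = (input - repmat(mu, 1, n)) ./ repmat(sigma, 1, n);

data = {input, inputNorm};
rr = zeros(1, 2);
for k = 1:2
    x = data{k};
    pred = zeros(1, n);
    for i = 1:n
        d = sum((x - repmat(x(:, i), 1, n)).^2, 1);   % squared distances to sample i
        d(i) = inf;                                   % leave sample i out
        [~, nearest] = min(d);
        pred(i) = output(nearest);
    end
    rr(k) = mean(pred == output);
end
fprintf('LOO 1-NN RR: raw=%g%%, normalized=%g%%\n', 100*rr);
```

On this deliberately extreme example the raw data give 0% and the normalized data give 100%. Real data are rarely this clear-cut, so both variants are worth plotting as done above.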
We can also perform input selection to reduce dimensionality:
myTic=tic;
z=inputSelectSequential(ds, inf, [], [], 1); figEnlarge;
toc(myTic)
Construct 91 models, each with up to 13 inputs selected from 13 candidates...
Selecting input 1:
Model 1/91: selected={ 1} => Recog. rate = 84.7%
Model 10/91: selected={10} => Recog. rate = 71.4%
Currently selected inputs: 1
Selecting input 2:
Model 14/91: selected={ 1, 2} => Recog. rate = 87.7%
Currently selected inputs: 1, 5
Selecting input 3:
Model 26/91: selected={ 1, 5, 2} => Recog. rate = 92.9%
Currently selected inputs: 1, 5, 8
Selecting input 4:
Model 37/91: selected={ 1, 5, 8, 2} => Recog. rate = 97.3%
Currently selected inputs: 1, 5, 8, 4
Selecting input 5:
Model 47/91: selected={ 1, 5, 8, 4, 2} => Recog. rate = 99.1%
Currently selected inputs: 1, 5, 8, 4, 10
Selecting input 6:
Model 56/91: selected={ 1, 5, 8, 4, 10, 2} => Recog. rate = 99.3%
Currently selected inputs: 1, 5, 8, 4, 10, 11
Selecting input 7:
Model 64/91: selected={ 1, 5, 8, 4, 10, 11, 2} => Recog. rate = 99.4%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7
Selecting input 8:
Model 71/91: selected={ 1, 5, 8, 4, 10, 11, 7, 2} => Recog. rate = 99.5%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6
Selecting input 9:
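Model 77/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 2} => Recog. rate = 99.5%
Model 78/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 3} => Recog. rate = 99.6%
Model 79/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 9} => Recog. rate = 99.7%
Model 80/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12} => Recog. rate = 99.8%
Model 81/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 13} => Recog. rate = 99.6%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6, 12
Selecting input 10:
Model 82/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 2} => Recog. rate = 99.6%
Model 83/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 3} => Recog. rate = 99.7%
Model 84/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9} => Recog. rate = 99.8%
Model 85/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 13} => Recog. rate = 99.7%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6, 12, 9
Selecting input 11:
Model 86/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 2} => Recog. rate = 99.5%
Model 87/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 3} => Recog. rate = 99.6%
Model 88/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13} => Recog. rate = 99.7%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13
Selecting input 12:
Model 89/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13, 2} => Recog. rate = 99.6%
Model 90/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13, 3} => Recog. rate = 99.7%
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13, 3
Selecting input 13:
Model 91/91: selected={ 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13, 3, 2} => Recog. rate = 99.8%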
Currently selected inputs: 1, 5, 8, 4, 10, 11, 7, 6, 12, 9, 13, 3, 2
Overall maximal recognition rate = 99.8%.

Selected 9 inputs (out of 13): 1, 5, 8, 4, 10, 11, 7, 6, 12
Elapsed time is 274.399640 seconds.
Feature selection brings little benefit here: the maximal recognition rate of 99.8% is first
reached with 9 of the 13 inputs, and using all 13 inputs gives the same rate.
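Sequential forward selection adds one input at a time, each time keeping the candidate that gives the best score together with the inputs already chosen. With 13 candidates this evaluates 13 + 12 + ... + 1 = 91 models, which matches the log above. The sketch below shows the procedure on a tiny synthetic dataset with 3 inputs, scoring each model by leave-one-out 1-NN. The scoring function and the data are assumptions for illustration and need not match what inputSelectSequential does internally.

```matlab
% Sketch: sequential forward selection scored by leave-one-out 1-NN (synthetic data).
input  = [0 10  20  5   15  25;              % input 1: large scale, uninformative
          0 0.1 0.2 1.0 1.1 1.2;             % input 2: separates the two classes
          0 1   5   6   2.5 7.5];            % input 3: partly informative
output = [1 1 1 2 2 2];
[featureNum, n] = size(input);

selected = [];                               % inputs chosen so far, in order
remaining = 1:featureNum;                    % candidates not yet chosen
stepRr = zeros(1, featureNum);               % best rate after each step
modelCount = 0;
for step = 1:featureNum
    candRr = zeros(1, length(remaining));
    for c = 1:length(remaining)
        x = input([selected remaining(c)], :);
        pred = zeros(1, n);
        for i = 1:n
            d = sum((x - repmat(x(:, i), 1, n)).^2, 1);
            d(i) = inf;                      % leave sample i out
            [~, nearest] = min(d);
            pred(i) = output(nearest);
        end
        candRr(c) = mean(pred == output);
        modelCount = modelCount + 1;
    end
    [stepRr(step), best] = max(candRr);      % keep the best candidate of this step
    selected = [selected remaining(best)];
    remaining(best) = [];
end
[bestRr, bestNum] = max(stepRr);             % first step that reaches the overall maximum
fprintf('%d models; order = %s; best RR = %g%% with %d input(s)\n', ...
    modelCount, mat2str(selected), bestRr*100, bestNum);
```

Here 6 models are evaluated, the selection order is [2 3 1], and the best rate is already reached with the first selected input.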
After dimensionality reduction, we can try all combinations of classifiers and input
normalization schemes to search for the best performance via 10-fold cross validation:
myTic=tic;
poOpt=perfCv4classifier('defaultOpt');
poOpt.foldNum=10; % 10-fold cross validation
figure; [perfData, bestId]=perfCv4classifier(ds, poOpt, 1);
toc(myTic)
structDispInHtml(perfData, 'Performance of various classifiers via cross validation');
Iteration=200/1000, recog. rate=72.094%
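In k-fold cross validation the samples are split into k disjoint folds, and each fold serves once as the test set while the others are used for training. The sketch below shows a simple round-robin fold assignment. The sample count and the assignment rule are assumptions for illustration; perfCv4classifier may partition the data differently, for example after shuffling.

```matlab
% Sketch: round-robin fold assignment for k-fold cross validation.
n = 28;                                      % number of samples (illustrative)
foldNum = 10;
foldId = mod(0:n-1, foldNum) + 1;            % fold index of each sample
testCount = zeros(1, n);                     % how often each sample is used for testing
for k = 1:foldNum
    testIdx = (foldId == k);
    trainIdx = ~testIdx;
    % A classifier would be trained on trainIdx and evaluated on testIdx here.
    testCount = testCount + testIdx;
end
foldSize = accumarray(foldId(:), 1)';        % number of samples in each fold
disp(foldSize);
```

Every sample is tested exactly once, and the fold sizes differ by at most one. If the folds are drawn over frames rather than files, frames from one recording can land in both the training and the test set, so such rates are not directly comparable with the leave-one-file-out figures.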
Then we can display the confusion matrix corresponding to the best classifier and the best input
normalization scheme:
confMat=confMatGet(ds.output, perfData(bestId).bestComputedClass);
figure; confMatPlot(confMat, confOpt);
Summary
This brief tutorial applies basic pattern recognition techniques to singing-type classification.
There are several directions for further improvement:
• Explore other features (such as magnitude spectrum)
• Verify that endpoint detection has been performed correctly on each recording
• Use other classifiers
Appendix
List of functions and datasets used in this script
• MIR-QBSH dataset
• List of files in this folder
Date and time when finishing this script:
fprintf('Date & time: %s\n', char(datetime));
Date & time: 04-Feb-2017 20:09:14
Overall elapsed time:
toc(scriptStartTime)
Jyh-Shing Roger Jang.
Published with MATLAB® R2015a
