
International Journal of Speech Technology

https://doi.org/10.1007/s10772-017-9488-z

A novel whispered speaker identification system based on extreme learning machine

J. Sangeetha1 · T. Jayasankar2

Received: 1 September 2017 / Accepted: 26 December 2017


© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
Whispered-speech speaker identification is one of the most demanding tasks in automatic speaker recognition applications. Because of the profound differences between neutral and whispered speech in their acoustic characteristics, the performance of conventional speaker identification systems built for neutral speech degrades drastically when they are applied to whispered speech. This work presents a novel speaker identification system for whispered speech based on an innovative learning algorithm, the extreme learning machine (ELM). The features used in the proposed system are instantaneous frequencies represented by probability density models. Parametric and nonparametric probability density estimation with ELM are compared against hybrid parametric and nonparametric probability density estimation with the extreme learning machine (HPNP-ELM) for instantaneous frequency modeling. The experimental results show a significant performance improvement for the proposed whispered-speech speaker identification system.

Keywords  Speaker identification · Whispered speech identification · MFCC · Extreme learning machine

1 Introduction

Nowadays biometrics enables an organization to validate a user by who they are, as opposed to a piece of information (such as a PIN) or a piece of equipment (such as a dedicated telephone). Biometrics is defined as the statistical analysis of biological samples and human characteristics. Access control methods based on human speech are progressively becoming mainstream because authentication by means of PIN codes and passwords is insufficient: they can easily be substituted. There are numerous commonly employed biometric modalities, e.g., faces, irises, fingerprints and speech (Wang et al. 2015; Jain et al. 2004). Speech has the benefits of easy acquisition, non-intrusiveness, and ease of use. Instead of normal speech, this proposed work concentrates on whispered speech, since it is difficult to duplicate and offers elevated protection.

Whispered speech is a specific form of speech that is a natural alternative mode of speech production in verbal communication. It is most frequently used in public situations to avoid being overheard or to keep information private. For example, a speaker may choose to whisper when giving a date of birth, billing address or credit card number while making a travel reservation over the telephone. People whisper in diverse conditions: whenever we want to create a discreet or intimate atmosphere in a conversation, in the library so as not to disturb others, or when someone wants to hide secret information from other people. Whispered speech is also frequently used in criminal activities, particularly in phone conversations where criminals aim to conceal their identity. Apart from the deliberate production of whispered speech, whispering may also occur due to health issues following rhinitis and laryngitis, and for certain people it appears as a chronic disorder of the laryngeal structures (Jovičić and Šarić 2008; Ito et al. 2005).

By its nature and manner of production, whispered speech is considerably different from standard speech. There is a significant difference in the spectral domain between whispered and neutral speech, which is created by a loss in

* J. Sangeetha
sangita.sudhakar@gmail.com

T. Jayasankar
jayasankar27681@gmail.com

1 Department of IT/SOC, SASTRA Deemed University, Thanjavur 613409, India
2 Department of ECE, Anna University, BIT Campus, Tiruchirappalli 620024, India


voiced excitation structure and a shift of the formant positions toward lower frequencies (Zhang and Hansen 2007; Morris and Clements 2002; John 2007; Jovičić 1998; Fan and Hansen 2011). Whispered speech is characterized by the absence of glottal vibration and a noise-like structure. As a result, whispering loses the fundamental frequency of the voice, the intonation contours and the other prosodic features (Huang et al. 2011; Haim et al. 2006). In addition, a whispered segment has substantially lower energy than standard speech (Ito et al. 2005). These characteristics of whispered speech are a considerable setback for speech technology, particularly for speech recognition, speech synthesis and speaker recognition. Thus, whispered speech is a prominent topic in current research (Wang et al. 2007).

2 Related work

In the past decade, numerous speaker recognition systems have been reported in the literature (Li 2001; Mak and Kung 2000; Campbell et al. 2006a, b; Pellom and Hansen 1998; Poignant et al. 2014; Zhao et al. 2014; Sadjadi and Hansen 2014; Xu and Zhao 2012; Gu and Zhao 2010; Wang et al. 2015). A whispered speech segment has several distinctive attributes: it has a lower signal-to-noise ratio (SNR) and a much lower energy profile, and the lower-frequency formants are shifted to higher frequencies (Jain et al. 2004).

The remarkable variation between neutral and whispered speech is shown in Fig. 1. The whispered segment has lower amplitude and lacks periodic segments (Wang et al. 2007). The major differences in the speech production process between neutral and whispered speech are as follows. First, whispered speech lacks harmonic structure and periodic excitation. Second, the positions of the lower-frequency formants are relocated to higher frequencies in whispered speech compared with neutral speech (Jain et al. 2004). Third, the spectral slope of a whispered segment is flatter than that of neutral speech, and its duration is longer. Fourth, the vowel region boundaries in the F1-F2 frequency space also differ between whispered and neutral speech. Finally, whispered speech carries lower energy. For these reasons, conventional neutral-speech speaker identification systems degrade drastically when confronted with whispered speech.

Related efforts on speaker recognition using whispered speech have addressed these setbacks with different methods. Jin et al. (2007) proposed the use of frame-based score classification and feature warping techniques to improve performance. In Gu and Zhao (2010), Chinese whispered-speech speaker identification under channel mismatch conditions was proposed; factor analysis (FA) is used to handle channel variability, and combining FA with an SVM yields a significant improvement in recognition rate while requiring less computation at training and test time, which matters in real-time identification scenarios.

An access control system based on whispered speech (Wang et al. 2015) has been proposed using instantaneous frequencies and an approximated probability product kernel support vector machine (APPKSVM); in that work, a Riemann sum is used to estimate the probability product kernel.

In general, the learning pace of feedforward neural networks is slow, and this has been the most important obstacle to their application over the past decades. Two key factors behind this may be: (1) the gradient-based learning algorithms broadly used to train neural networks,

and (2) all parameters are tuned iteratively by such learning procedures. In contrast to these traditional implementations, this paper uses a new learning algorithm called the extreme learning machine (ELM), which tends to offer good generalization performance at extremely fast learning speed.

A novel ELM-based whispered-speech speaker identification system is proposed in this work. ELM is designed for single-hidden-layer feedforward neural networks (SLFNs): the hidden nodes are chosen randomly and the output weights are determined analytically. The proposed system uses instantaneous frequencies for feature extraction. The input speech is first filtered by signal-independent and signal-dependent filters, and the instantaneous frequencies of the speech signal are calculated by applying the Hilbert transform. The estimated frequencies are then represented as probability density models, which are used as the features of the speaker identification system. In this work, we compare the results of MFCC, parametric and nonparametric probability density models, and hybrid models with ELM. The experimental results reveal the superiority of the proposed whispered-speech speaker identification system.

3 Proposed system overview

The general block diagram of the proposed identification method is shown in Fig. 2. It comprises the following blocks: signal preprocessing, instantaneous frequency feature extraction and the recognition phase. In recent years, different techniques have been proposed for modeling and characterizing speech. Of specific interest here is AM-FM signal modeling. The AM-FM signal model characterizes a speech resonance as

u_i(t) = a_i(t) cos(2π ∫₀ᵗ f_i(τ) dτ)

This model is mostly used to decompose speech into decorrelated band-pass channels, so that each signal is characterized by its instantaneous amplitude and frequency with k resonance components.

To obtain viable speaker identification features, a Gabor filter bank and empirical mode decomposition (EMD) are used to extract the instantaneous frequencies (IF) and develop useful acoustic features for speaker identification from the input speech signal. Gabor filters are preferred since they are compact and smooth in both the time and frequency domains; this characteristic promises accurate estimation of the instantaneous amplitude and instantaneous frequency in the demodulation phase. In addition, EMD is an adaptive, signal-dependent filter that is appropriate for nonlinear and nonstationary signal analysis; it is carried out in the time domain and forms its basis functions adaptively. The main advantage of EMD is that its basis functions are derived directly from the signal itself. The demodulation phase is based on the Hilbert transform. Afterwards, the separated IF is modeled as parametric/nonparametric probability densities, and the output is used to train the ELM to identify the whispering speaker.

4 Instantaneous frequency extraction

The significance of the IF stems from the fact that speech is nonstationary, with spectral features that vary with time. In this section we describe the first two stages, Gabor filtering and EMD, used in extracting the IF of an input speech signal, and then show how the IF is demodulated from the filter outputs using the Hilbert transform.

Fig. 2  Block diagram of the proposed speaker identification system
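The AM-FM resonance model above can be made concrete with a short numerical sketch. This is illustrative only: the amplitude envelope, the modulation rates and the use of a 16 kHz sampling rate are assumptions chosen for the example, not values prescribed by the paper.

```python
import numpy as np

def am_fm_component(a, f, fs):
    """Synthesize u(t) = a(t) * cos(2*pi * integral_0^t f(tau) dtau)."""
    # Approximate the phase integral with a cumulative sum (rectangle rule).
    phase = 2.0 * np.pi * np.cumsum(f) / fs
    return a * np.cos(phase)

fs = 16000                                   # sampling rate (the corpus uses 16 kHz)
t = np.arange(fs) / fs                       # one second of samples
a = 1.0 + 0.5 * np.sin(2 * np.pi * 3 * t)    # slowly varying amplitude a(t)
f = 500 + 100 * np.sin(2 * np.pi * 5 * t)    # instantaneous frequency f(t) around 500 Hz
u = am_fm_component(a, f, fs)                # one AM-FM speech-resonance component
```

A full resonance model would sum several such components, one per band-pass channel.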


4.1 Gabor filtering

The Tamil whispered speech samples were developed in recording sessions with a very high-quality recording environment and apparatus. A small sample of neutral and whispered speech was gathered from a total of 15 subjects. In a quiet room, every speaker read sentences in both normal and whispered articulation styles. For both articulation styles there are 40 sentences, including 28 phonetically balanced sentences and 12 sentences from news articles. The speech data were digitized using a sampling frequency of 16 kHz, with 16 bits per sample.

The speech resonances are extracted using Gabor filters with a 400 Hz constant bandwidth; the impulse response is given by

g(t) = exp(−α²t²) cos(2πf_c t)

The resulting frequency response is represented by

G(f) = (√(2π)/(2α)) [exp(−π²(f − f_c)²/α²) + exp(−π²(f + f_c)²/α²)]

where f_c is the center frequency, uniformly spaced on the Hertz scale, and α is the bandwidth control parameter.

4.2 EMD: empirical mode decomposition

EMD decomposes a signal into a few intrinsic mode functions (IMFs). An IMF can be regarded as an AM-FM separable component. The EMD decomposes the original signal into a finite set of adaptive basis functions called the intrinsic mode functions. The essential properties of the decomposed signal are captured effectively by these IMFs, which meets the requirement for computing an accurate IF using the Hilbert transform. The EMD procedure comprises the following steps.

(1) From the original speech signal, the upper and lower envelopes u(t) and l(t) are obtained by spline interpolation through the local extrema.

(2) Estimate the mean of the envelopes obtained from step 1:

μ₁(t) = (u(t) + l(t)) / 2
h₁(t) = s(t) − μ₁(t)

where s(t) is the original speech signal, h₁(t) is the intermediate signal, u(t) is the upper envelope and l(t) is the lower envelope.

(3) Repeating the process of steps 1 and 2, the mean of the envelopes μ₁₁(t) for h₁(t) is computed, and h₁₁(t) is obtained by

h₁₁(t) = h₁(t) − μ₁₁(t)

(4) The above steps are repeated iteratively until the measure SD, computed between consecutive sifting results, falls below a chosen threshold; the first IMF h₁K(t) is then generated:

SD = Σ_t [h₁(K−1)(t) − h₁K(t)]² / h₁(K−1)²(t)

Correspondingly, the remaining IMFs are obtained by passing the residual signal r₁(t) back through the EMD above:

r₁(t) = s(t) − h₁K(t)

The speech signal is thereby decomposed into a mixture of intrinsic mode functions:

s(t) ≈ h₁K(t) + h₂K(t) + ⋯ + h_NK(t)

4.3 Speech demodulation

To calculate the IF, the input speech signal s(t) is passed through the filter bank, and each band-pass waveform is then demodulated using the Hilbert transform or the energy operator. This work implements speech demodulation with the Hilbert transform because it is less computationally expensive and gives lower error and smoother frequency estimates. The combination of EMD and the Hilbert transform is called the Hilbert-Huang transform (HHT); the HHT has been widely used in signal analysis, mechanical fault diagnosis, medical pathology, etc.

The Hilbert transform H(t) is given by

H(t) = (1/π) P ∫_{−∞}^{∞} s(τ)/(t − τ) dτ

where P is the Cauchy principal value of the singular integral and s(t) is the real input signal. The analytic signal S_a(t) can be calculated as

S_a(t) = s(t) + jH(t) = √(s²(t) + H²(t)) e^{jθ(t)},  with phase θ(t) = arctan(H(t)/s(t))

The IF f(t) can then be computed from the phase:

f(t) = (1/2π) dθ(t)/dt
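A single sifting iteration of the EMD procedure can be sketched as follows. This is a simplified illustration under stated assumptions: no boundary handling for the spline envelopes (which production EMD implementations require), and a synthetic two-tone test signal rather than real speech.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.interpolate import CubicSpline

def sift_once(s, t):
    """One sifting step: h1(t) = s(t) - mean of the spline envelopes."""
    maxima, _ = find_peaks(s)                      # indices of local maxima
    minima, _ = find_peaks(-s)                     # indices of local minima
    upper = CubicSpline(t[maxima], s[maxima])(t)   # upper envelope u(t)
    lower = CubicSpline(t[minima], s[minima])(t)   # lower envelope l(t)
    mu1 = (upper + lower) / 2.0                    # mean envelope mu1(t)
    return s - mu1                                 # intermediate signal h1(t)

def sd_value(h_prev, h_curr, eps=1e-12):
    """Stopping measure between consecutive sifting results."""
    return np.sum((h_prev - h_curr) ** 2 / (h_prev ** 2 + eps))

# Synthetic test signal: a slow and a fast oscillation mixed together.
t = np.linspace(0.0, 1.0, 4000)
s = np.sin(2 * np.pi * 5 * t) + 0.4 * np.sin(2 * np.pi * 40 * t)
h1 = sift_once(s, t)   # iterate sift_once until sd_value drops below a threshold
```

Repeating `sift_once` until `sd_value` falls below a threshold yields the first IMF; subtracting it and re-running the loop yields the remaining IMFs.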

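The Hilbert demodulation step above can be sketched directly with SciPy's analytic-signal helper. The 440 Hz test tone below is an assumption used purely as a sanity check, not part of the paper's pipeline.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(s, fs):
    """IF via the analytic signal: f(t) = (1/(2*pi)) * dtheta/dt."""
    analytic = hilbert(s)                        # s(t) + j*H{s}(t)
    theta = np.unwrap(np.angle(analytic))        # instantaneous phase theta(t)
    return np.diff(theta) * fs / (2.0 * np.pi)   # discrete derivative of the phase

fs = 16000
t = np.arange(fs) / fs
tone = np.cos(2 * np.pi * 440.0 * t)             # pure 440 Hz tone as a sanity check
f_inst = instantaneous_frequency(tone, fs)       # should hover near 440 Hz
```

For a pure tone the estimated IF is essentially constant away from the signal edges, which is why the median (rather than the mean) is the robust summary to check.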

4.4 Probability density modeling of instantaneous frequency

This proposed work employs both EMD and multiband Gabor filtering. Let S_k(t) denote the kth IMF obtained by the EMD, or the signal obtained after passing through the kth band-pass Gabor filter. By means of the Hilbert transform, the instantaneous frequency of S_k(t), denoted ω_k(t), is obtained. For the ith frame of the filtered signal S_k(t), the corresponding set of instantaneous frequencies is described by

W_i^k = {w_k[(i−1)L], w_k[(i−1)L + 1], …, w_k[(i−1)L + (L−1)]}

where L is the frame size. Frame i thus has k sets of frequencies, W_i = {W_i^1, W_i^2, …, W_i^k}. Each W_i is then represented as a PDF by performing probability density estimation on it. The two methods used in this proposed work are parametric density modeling (PDM) and nonparametric density modeling (NPDM).

4.4.1 Parametric density modeling

GMMs are frequently used to model the probability density of complex signals, for example images and audio, and yield excellent performance in many applications (Haim et al. 2006). In this work, GMMs are used to model the probability distribution of the IF. Given an IF ω, the GMM is defined as

f(ω) = Σ_{c=1}^{C} ν_c N(ω; μ_c, σ_c²)

where ν_c is the mixing parameter. The mixing parameters must satisfy the following requirements:

ν_c ≥ 0 and Σ_{c=1}^{C} ν_c = 1

To obtain the best parameters of the Gaussian probability density functions, the EM algorithm is used.

4.4.2 Nonparametric density modeling

Kernel density estimation (KDE) is a nonparametric density modeling method that has been used for essential problems in various applications (Oyang et al. 2005). Using KDE, an IF ω is modeled as follows:

f(ω) = Σ_{l=1}^{K×L} u_l Φ(ω; ω_l, σ_l)

where Φ(·) is the kernel function, ω_l is an IF instance belonging to W_i, and u_l and σ_l are the weight and bandwidth of the lth kernel. In the majority of cases a Gaussian function is used as the kernel, yielding

f(ω) = Σ_{l=1}^{K×L} u_l exp(−‖ω − ω_l‖² / (2σ_l²))

For the estimator, this work adopts a fixed-bandwidth kernel, which gives

f(ω) = (1 / ((K×L) √(2π) σ)) Σ_{l=1}^{K×L} u_l exp(−‖ω − ω_l‖² / (2σ²))

5 Extreme learning machine

In this paper, ELM is used as the training algorithm, examined on recognizing whispered speech and performing real speaker identification. ELM (Huang et al. 2006) is formulated for generalized single-hidden-layer feedforward networks (SLFNs) with a wide variety of hidden nodes. The general architecture of the ELM is shown in Fig. 3. In principle, this algorithm tends to provide fast learning speed, looks much simpler than neural networks and SVM classifiers, and does not need any iterative tuning or setting of parameters such as learning rate, momentum, etc.

5.1 Algorithm of ELM

Given a training set N = {(x_i, t_i) | x_i ∈ R^n, t_i ∈ R^m, i = 1, …, N}, an activation function g(x) and a hidden node number Ñ:

1. Randomly assign the input weights w_i and biases b_i, i = 1, …, Ñ.
2. Compute the hidden-layer output matrix H.
3. Find the output weights β = H† T, where H† is the Moore-Penrose pseudoinverse of H and T = [t₁, …, t_N]^T.
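The fixed-bandwidth Gaussian KDE above can be sketched as follows. Here uniform weights u_l = 1 and a set of synthetic IF samples stand in for real frame data; both are assumptions made for illustration.

```python
import numpy as np

def kde_pdf(grid, samples, sigma):
    """Fixed-bandwidth Gaussian KDE with uniform weights:
    f(w) = (1/(n*sqrt(2*pi)*sigma)) * sum_l exp(-(w - w_l)^2 / (2*sigma^2))."""
    diffs = grid[:, None] - samples[None, :]           # pairwise (w - w_l)
    kernels = np.exp(-diffs ** 2 / (2.0 * sigma ** 2))
    return kernels.sum(axis=1) / (len(samples) * np.sqrt(2.0 * np.pi) * sigma)

rng = np.random.default_rng(0)
if_samples = rng.normal(500.0, 20.0, size=400)         # synthetic IF values near 500 Hz
grid = np.linspace(350.0, 650.0, 301)
density = kde_pdf(grid, if_samples, sigma=10.0)        # smooth estimate of the IF pdf
```

With the 1/n normalization the estimate integrates to one, so it is a proper density over the IF axis.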

Fig. 3  General structure of ELM
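The three ELM steps can be sketched in a few lines of NumPy. The sigmoid activation, the hidden-layer size and the toy data below are assumptions for illustration; the essential point is that only β is learned, via the Moore-Penrose pseudoinverse, with no iterative tuning.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """ELM training: random hidden layer, analytic output weights."""
    W = rng.standard_normal((X.shape[1], n_hidden))   # step 1: random input weights w_i
    b = rng.standard_normal(n_hidden)                 # step 1: random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # step 2: hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ T                      # step 3: beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                   # class scores; argmax picks the speaker

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))                     # toy feature vectors (e.g., density-model features)
labels = rng.integers(0, 3, size=200)                 # three hypothetical speakers
T = np.eye(3)[labels]                                 # one-hot target matrix
W, b, beta = elm_train(X, T, n_hidden=40, rng=rng)
scores = elm_predict(X, W, b, beta)
```

Because training reduces to one pseudoinverse, the learning speed advantage over gradient-trained networks claimed above follows directly from the algorithm's structure.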


In this work, the identification performance is evaluated by testing the following identification systems: (1) an MFCC-GMM based system, (2) a PDM based fusion system, (3) an NPDM based fusion system, (4) a hybrid PNP based fusion system, (5) PDM based ELM, (6) NPDM based ELM, and (7) hybrid PNP based ELM.

6 Experimental results

The experiments have been carried out over a corpus of 10,000 Tamil speech samples. Voices were initially recorded using a mobile phone, and the recorded audio files were converted into WAV files using GoldWave software. The sampling rate of the speech signal is 44,100 Hz and the channel is stereo. The dataset is divided into a development corpus and an evaluation corpus. The development corpus, which consists of 7000 samples, is used for training the system and fine-tuning the parameters. The evaluation corpus is composed of 3000 speech samples and is used for validation. Metrics such as accuracy, precision, recall and average processing time are determined from the experiments.

6.1 Evaluation in terms of accuracy, precision, recall and processing time

• Accuracy is defined as follows:

Accuracy = (True positive + True negative) / (True positive + True negative + False positive + False negative)

• Precision is the fraction of retrieved instances that are relevant. It is defined as

Precision = True positive / (True positive + False positive)

• Recall is the fraction of relevant instances that are retrieved. It is defined as

Recall = True positive / (True positive + False negative)

Figures 4, 5, 6 and 7 respectively show the accuracy, precision, recall and processing time of the evaluated methods: MFCC with GMM, parametric probability density modeling, nonparametric probability density modeling, HPNP, PDM with ELM, NPDM with ELM and HPNP with the extreme learning machine. The results clearly show that HPNP with ELM achieves the highest accuracy, precision and recall rate, and the lowest processing time.

6.2 Overall performance analysis

From Fig. 8, our proposed system, whispered speaker identification using ELM, has high accuracy, precision and recall.

From Table 1, the accuracy, precision, recall and average processing time for ELM are better when compared

Fig. 4  Samples versus Accuracy


Fig. 5  Samples versus Precision

Fig. 6  Samples versus Recall


Fig. 7  Samples versus Processing Time

Fig. 8  System performance (accuracy, precision and recall of the MFCC-GMM, HPNP, PDM-ELM, NPDM-ELM and HPNP-ELM systems)

Table 1  Comparison between different systems

            MFCC-GMM (%)   HPNP (%)   PDM-ELM (%)   NPDM-ELM (%)   HPNP-ELM (%)
Accuracy    73.68          80.70      82.56         84.32          90.21
Precision   65.94          71.9       75.67         76.57          83.32
Recall      66.13          72.35      76.32         77.95          84.44
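The accuracy, precision and recall values reported in Table 1 follow the standard confusion-matrix definitions given in Sect. 6.1, which can be sketched as follows (the counts below are made-up examples, not values from the experiments):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # fraction of retrieved instances that are relevant
    recall = tp / (tp + fn)      # fraction of relevant instances that are retrieved
    return accuracy, precision, recall

# Hypothetical counts, purely for illustration.
acc, prec, rec = classification_metrics(tp=90, tn=80, fp=10, fn=20)
```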


with parametric probability density modeling, nonparametric probability density modeling and the MFCC-GMM based system.

7 Conclusion

This work presents an automatic speaker identification system based on whispered speech. The testing is carried out on the Tamil whispered speech corpus. The signals are filtered by means of a Gabor filter, decomposed with empirical mode decomposition, and then transformed using the Hilbert transform. The extreme learning machine is used as the classification method, and we compared the performance of parametric and nonparametric probability density modeling with the extreme learning machine. From the experimental results, the accuracy of the extreme learning machine is 89.5%, which is greater than that of the other methods, namely the parametric and nonparametric probability density modeling based speaker identification systems. The average processing time of the ELM is 1289.4 ms, which is low compared with the other three methods, and the precision of the proposed system improves by 10.99%. A possible future work is to enhance this framework with online learning for speaker adaptation.

References

Campbell, W. M., Campbell, J. P., Reynolds, D. A., Singer, E., & Torres-Carrasquillo, P. A. (2006a). Support vector machines for speaker and language recognition. Computer Speech & Language, 20(2), 210–229.
Campbell, W. M., Sturim, D. E., & Reynolds, D. A. (2006b). Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5), 308–311.
Fan, X., & Hansen, J. H. (2011). Speaker identification within whispered speech audio streams. IEEE Transactions on Audio, Speech, and Language Processing, 19(5), 1408–1421.
Gu, X., & Zhao, H. (2010). Whispered speech speaker identification based on SVM and FA. In Audio Language and Image Processing (ICALIP), 2010 International Conference on (pp. 757–760). IEEE.
Haim, P., Joseph, F., & Ian, J. (2006). A study of Gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition, 39(4), 695–706.
Huang, G. B., Wang, D., & Lan, Y. (2011). Extreme learning machine: A survey. International Journal of Machine Learning and Cybernetics, 2, 107–122.
Huang, G. B., Zhu, Q. Y., & Siew, C. K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70(1), 489–501.
Ito, T., Takeda, K., & Itakura, F. (2005). Analysis and recognition of whispered speech. Speech Communication, 45(2), 139–152.
Jain, K., Ross, A., & Prabhakar, S. (2004). An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1), 4–20.
Jin, Q., Jou, S.-C. S., & Schultz, T. (2007). Whispering speaker identification. In Multimedia and Expo, 2007 IEEE International Conference on (pp. 1027–1030).
John, H. L. (2007). Analysis and classification of speech mode: Whispered through shouted. In 8th Annual Conference of the International Speech Communication Association, Interspeech.
Jovičić, S. T. (1998). Formant feature differences between whispered and voiced sustained vowels. Acta Acustica United with Acustica, 84(4), 739–743.
Jovičić, S. T., & Šarić, Z. (2008). Acoustic analysis of consonants in whispered speech. Journal of Voice, 22(3), 263–274.
Li, Q. (2001). A detection approach to search-space reduction for HMM state alignment in speaker verification. IEEE Transactions on Speech and Audio Processing, 9(5), 569–578.
Mak, M. W., & Kung, S. Y. (2000). Estimation of elliptical basis function parameters by the EM algorithm with application to speaker verification. IEEE Transactions on Neural Networks, 11(4), 961–969.
Morris, R. W., & Clements, M. A. (2002). Reconstruction of speech from whispers. Medical Engineering & Physics, 24(7), 515–520.
Oyang, Y. J., Ou, Y. Y., Hwang, S. C., Chen, C. Y., & Chang, D. T. H. (2005). Data classification with a relaxed model of variable kernel density estimation. In Proc. IEEE Int. Joint Conf. Neural Netw. (Vol. 5, pp. 2831–2836).
Pellom, B. L., & Hansen, J. H. L. (1998). An efficient scoring algorithm for Gaussian mixture model based speaker identification. IEEE Signal Processing Letters, 5(11), 281–284.
Poignant, J., Besacier, L., & Quenot, G. (2014). Unsupervised speaker identification in TV broadcast based on written names. IEEE/ACM Transactions on Audio, Speech and Language Processing, 23, 57–68.
Sadjadi, S. O., & Hansen, J. H. L. (2014). Blind spectral weighting for robust speaker identification under reverberation mismatch. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(5), 937–945.
Wang, J. C., Chin, Y. H., Hsieh, W. C., Lin, C. H., Chen, Y. R., & Siahaan, E. (2015). Speaker identification with whispered speech for the access control system. IEEE Transactions on Automation Science and Engineering, 12(4), 1191–1199.
Wang, J. C., Yang, C. H., Wang, J. F., & Lee, H. P. (2007). Robust speaker identification and verification. IEEE Computational Intelligence Magazine, 2(2), 52–59.
Xu, J., & Zhao, H. (2012). Speaker identification with whispered speech using unvoiced-consonant phonemes. In Proc. Int. Conf. Image Anal. Signal Process. (pp. 9–11).
Zhang, C., & Hansen, J. H. (2007). Analysis and classification of speech mode: Whispered through shouted. In Interspeech (Vol. 7, pp. 2289–2292).
Zhao, X., Wang, Y., & Wang, D. (2014). Robust speaker identification in noisy and reverberant conditions. IEEE/ACM Transactions on Audio, Speech and Language Processing, 22(4), 836–845.
