Journal of Signal and Information Processing, 2013, 4, 106-110
doi:10.4236/jsip.2013.42014 Published Online May 2013 (/journal/jsip)
Research on Different Feature Parameters in Speaker Recognition
Qiyue Liu, Mingqiu Yao, Han Xu, Fang Wang
Department of Communication and Information System, Hebei University of Science and Technology, Shijiazhuang, China.
Received March 15th, 2013; revised April 16th, 2013; accepted April 25th, 2013
Copyright © 2013 Qiyue Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
Feature parameter extraction is critical for speaker recognition research. This paper presents the function of pitch, formant and Mel frequency cepstral coefficient (MFCC) in speaker recognition. Using feature parameters to sort the speech corpus into classes can effectively increase the identification rate. Using Euclidean distance to compare feature parameters is very effective.
Keywords: Pitch; Formant; MFCC; Euclid Distance
1. Introduction
People can distinguish different speakers by ear; since people can perceive the difference, a machine can also do so by some method. Speaker recognition makes a machine identify different people, that is, lets the machine know who is talking.
The ultimate purpose of speaker recognition is to identify who is speaking while ignoring the content of the speech. In fact, it is the recognition of the characteristics of speech.
The human voice is a natural property; each person’s speech organs have their own characteristics and pronunciation habits. Therefore, to identify a speaker exactly, parameters that fully reflect the speaker’s individual characteristics must be extracted from the speech signal.
The feature parameters should have the following characteristics [1,2]:
• They fully embody the large differences between different people, and remain relatively stable when the speaker’s speech changes.
• They remain robust when the voice suffers from outside interference.
• They cannot be imitated easily.
• They are easy to extract and compute, and the dimensions of the feature parameter are largely independent of one another.
Voice is different from a fingerprint: a fingerprint is fixed, but the voice changes, so no parameter has yet been found that fully meets all of the requirements above. The voice is connected with human emotion, health, environment, etc., and is also related to the speech content. Therefore, all of the feature parameters applied so far have some defects and cannot accurately represent the speaker’s personality traits.
2. Research on Different Feature Parameters
A speaker’s characteristics are generally reflected in the vocal tract (channel) feature and the glottal feature.
While the recognition rate must be maintained, it is very difficult to shorten recognition time by reducing computational complexity; in practice, computation time has been spent to improve the recognition rate. In speaker recognition, as the number of speakers increases, the time needed for identification increases linearly, because every recognition must match the input against every speaker model in turn and then take the closest speaker model as the final result. Thus, the more speakers are registered, the longer identification takes, until it eventually takes too long to meet requirements. This problem can be solved well by adopting classification.
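The effect of classification on search time can be illustrated with a minimal Python sketch. It is not the authors' implementation: the model layout (a dictionary with hypothetical `cls` and `features` fields) and the function names are assumptions made for illustration. The class label could come, for example, from a coarse male/female decision.

```python
import numpy as np

def nearest_speaker(features, models):
    """Linear search: compare the test vector with every enrolled model."""
    best_name, best_dist = None, float("inf")
    test = np.asarray(features, dtype=float)
    for name, model in models.items():
        dist = float(np.linalg.norm(test - np.asarray(model["features"], dtype=float)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

def identify_with_classes(features, speaker_class, models):
    """Search only the enrolled models whose class matches the test utterance."""
    subset = {n: m for n, m in models.items() if m["cls"] == speaker_class}
    # Fall back to the full search if the class has no enrolled speaker.
    return nearest_speaker(features, subset or models)
```

With N enrolled speakers split into classes of roughly equal size, the second function compares against only a fraction of the N models, at the cost of depending on a correct class decision.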
2.1. Pitch Frequency
Pitch arises from the periodic vibration of the vocal cords when voiced sound is produced. Pitch frequency is a very important parameter used to describe the characteristics of the voice excitation source. Pitch frequency generally ranges from 50 Hz to 500 Hz: 50 Hz - 300 Hz for the male voice and 100 Hz - 500 Hz for the female voice. Although each person’s vocal structure leads to a different fundamental frequency, the range of pitch frequency is rather narrow, so the gap between different people is small. Most importantly, pitch frequency is affected by many factors, such as emotion and tone, so it is very difficult to obtain an accurate fundamental frequency. Thus, the recognition rate achieved using the fundamental frequency alone for speaker recognition is currently very low. However, the male fundamental frequency is generally lower than the female, so it is a good basis for classification.
Since research on speech signal analysis began, pitch extraction has always been an important research topic. The speech signal changes in complex ways, is affected by the vocal tract, and has rich harmonic content. Although many methods have been proposed, they all have limitations: they cannot represent different speakers’ characteristics and cannot adapt to different requirements and environments.
There are a variety of methods to extract the pitch [1]. They can be roughly divided into three categories: waveform estimation, correlation processing and transform techniques [3]. This paper uses a transform technique to extract pitch: it transforms the speech signal to the cepstrum domain, eliminates the vocal tract effect using homomorphic analysis, obtains the information of the excitation part, and then determines the fundamental frequency.
Only voiced sound has a pitch period. When unvoiced sound is produced, the glottal excitation has less energy and is white noise with an evenly distributed spectrum; when voiced sound is produced, it is an impulse sequence with a certain period. This period is the pitch period. A finite-length periodic impulse sequence

$$s(n) = \sum_{r=0}^{M} \alpha_r \, \delta(n - rT_p),$$

where $M$ is a positive integer, $\alpha_r$ is the amplitude factor and $T_p$ is the pitch period, also has a periodic impulse sequence in the cepstrum domain. The period does not change in the cepstrum domain, while the amplitude decays as $r$ increases, and the rate of decay is faster than in the time domain. The cepstrum-based method can therefore be used to extract the fundamental frequency, and it gives good results.
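The cepstrum-based procedure described above can be sketched in Python with NumPy. This is a minimal illustration, not the MATLAB code used in the paper: the function name, the Hamming window and the default search range (taken from the 50 Hz to 500 Hz range quoted earlier) are assumptions. The frame is assumed to be voiced and longer than `fs / fmin` samples.

```python
import numpy as np

def cepstrum_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch (Hz) of one voiced frame from its real cepstrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    # Real cepstrum: inverse transform of the log magnitude spectrum.
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)
    # Search only the quefrencies that correspond to plausible pitch periods.
    qmin = int(fs / fmax)
    qmax = int(fs / fmin)
    period = qmin + int(np.argmax(cepstrum[qmin:qmax + 1]))
    return fs / period
```

The low-quefrency part of the cepstrum, which carries the vocal tract envelope, is skipped by starting the search at `qmin`; the largest remaining peak is taken as the pitch period in samples.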
Lab settings: Intel(R) Core(TM)2 Duo T6400, 2 GHz memory, Windows XP, MATLAB 7.0 development platform. The experimental voice data were recorded with Cool Edit Pro at a sampling frequency of 16,000 Hz and a sampling precision of 16 bits, single channel. The recorded speakers were 8 - 60 years old and spoke Mandarin. Each speaker spoke 7 sentences of 3 - 12 s each, covering vowels, consonants, Chinese, English and numbers.
The experimental results are shown in Table 1. Each speaker’s pitch frequency could not be determined accurately with this method: the result falls within a range rather than being an exact value. The ranges of different people’s pitch frequencies differ only slightly and overlap, so it is clearly not feasible to rely on a single frequency value for speaker recognition. However, the male voice’s pitch frequency is generally lower than the female’s, so pitch can be used to classify speakers.
2.2. Formant
Formant information is contained in the spectral envelope. A formant is generally a maximum of the spectral envelope, so the necessary step in extracting formants is to estimate the spectral envelope.
Formant extraction methods include the cepstrum method and the linear prediction method [1]. A formant is generally defined as a damped sinusoidal component of the vocal tract impulse response. A primary difficulty in extracting formants is that the impulse response of the vocal tract cannot be measured directly. The speech signal is the convolution of an all-pole model and a quasi-periodic glottal function, so the analysis must perform deconvolution to separate the impulse response from the excitation function.
This paper adopts the linear prediction method to estimate formants, specifically by peak detection. Analyzing formants with linear predictor coefficients (LPC) is faster and better than other methods. The vocal tract transfer function described by the LPC is computed first; this function is used to compute the spectrum, and from the spectrum the formant peaks, frequencies and bandwidths are computed [4].
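The peak-detection approach can be sketched as follows. This is an illustrative Python version under stated assumptions, not the procedure of [4]: the LPC coefficients are obtained with the autocorrelation method and the Levinson-Durbin recursion, and the prediction order, FFT size and function names are hypothetical choices.

```python
import numpy as np

def lpc(frame, order):
    """Coefficients of A(z) = 1 + a1*z^-1 + ... + ap*z^-p (autocorrelation method)."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the Levinson-Durbin step.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def formants_by_peak_picking(frame, fs, order=12, nfft=1024):
    """Frequencies (Hz) of the local maxima of the LPC spectral envelope."""
    a = lpc(frame, order)
    envelope = 1.0 / (np.abs(np.fft.rfft(a, nfft)) + 1e-12)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    inner = envelope[1:-1]
    is_peak = (inner > envelope[:-2]) & (inner > envelope[2:])
    return freqs[1:-1][is_peak]
```

The first peaks of the returned array, in increasing frequency, would be read as F1, F2 and F3. Bandwidth estimation, which the text also mentions, is left out of this sketch.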
The experimental environment is identical to that of the pitch experiment. Table 2 shows that each formant can change when the same person says different words. Although the same person’s values vary within a range, this range overlaps with those of other speakers. Therefore, the formant parameter cannot be an effective one for speaker recognition. The experimental data show that children’s F1 values are higher than adults’, so this parameter can be used to distinguish between children and adults.
Table 1. Pitch frequency (Hz) obtained with the cepstrum method.
          Woman 1  Woman 2  Woman 3  Woman 4  Man 1  Man 2  Man 3  Man 4
Voice 1   333      266      262      202      183    172    121    112
Voice 2   301      262      231      213      195    141    133    109
Voice 3   307      210      280      210      183    168    124    134
Voice 4   311      250      271      220      178    141    132    114
Voice 5   318      243      250      206      181    156    108    129
Table 2. Formants of speech signals (Hz).
Group  Speaker                    F1    F2    F3
Adult  Man 1                      704   1174  2456
Adult  Man 1 (different content)  616   1831  2891
Adult  Man 2                      652   1323  2721
Adult  Man 3                      581   1830  2519
Adult  Man 4                      438   1614  1780
Adult  Woman 1                    618   1814  2617
Adult  Woman 2                    544   1834  2960
Adult  Woman 3                    551   2653  2630
Adult  Woman 4                    590   1210  2279
Child  Girl 1                     749   1405  1643
Child  Girl 2                     1015  1733  2314
Child  Boy 1                      904   1353  2990
Child  Boy 2                      837   1379  2560
2.3. Mel Frequency Cepstral Coefficient
Even in a noisy environment, people can correctly identify different sounds by ear; an important reason is the role played by the cochlea. The cochlea is equivalent to a set of filters that act on the signal on a logarithmic frequency scale, so the human ear is more sensitive to low-frequency signals [5].
The set of Mel filters that imitates the role of the cochlea consists of triangular filters whose center frequencies are equally spaced on the Mel frequency axis and which have the same span on the Mel frequency scale. The number of filters in the bank is decided by the cutoff frequency of the signal, and together the filters cover the range between 0 and half the sampling frequency.
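A triangular Mel filter bank of this kind can be sketched in Python. The paper does not state which Mel mapping it uses; the sketch assumes the common formula mel = 2595 log10(1 + f/700), and the function names and the number of filters passed in are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    """Assumed Mel mapping: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    """Triangular filters equally spaced on the Mel axis, covering 0 to fs/2."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # Each filter rises from lo to mid, falls from mid to hi, and is 0 elsewhere.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, hz_edges
```

Because the edges are equally spaced in Mel, the filters are narrow at low frequencies and wide at high frequencies in Hz, which is how the bank gives more resolution to the low-frequency part of the signal.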
To emphasize the low-frequency information of the signal, MFCC converts the linear frequency scale into the Mel frequency scale, so that information useful for identification is highlighted and noise interference is effectively suppressed. If the Mel cepstrum is used, filtering and weighting in the cepstrum domain are based on linear spectrum processing [6].
MFCC generally reflects the static characteristics of speech, but the human ear is more sensitive to its dynamic characteristics. ΔMFCC reflects the dynamic properties; it can be obtained by computing the first-order and second-order differences. This paper uses a parameter that combines the 12-dimensional MFCC with ΔMFCC.
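The combination of static and dynamic coefficients can be sketched as follows. This is a minimal Python illustration that assumes a plain frame-to-frame difference; the paper does not give its exact difference formula, and the function names are hypothetical. Rows are frames and columns are the 12 MFCC dimensions.

```python
import numpy as np

def delta(features):
    """First-order difference along the frame axis, keeping the number of frames."""
    feats = np.asarray(features, dtype=float)
    # Repeat the first frame so that the first difference is zero.
    padded = np.vstack([feats[:1], feats])
    return np.diff(padded, axis=0)

def static_plus_delta(mfcc):
    """Concatenate static MFCC with its first-order difference (12 + 12 columns)."""
    mfcc = np.asarray(mfcc, dtype=float)
    return np.hstack([mfcc, delta(mfcc)])
```

A second-order difference would be obtained in this sketch by applying `delta` to the output of `delta`.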
The experimental environment is identical to that of the pitch experiment. Five methods of comparing two sets of MFCC were attempted.
1) Correlation coefficient
In theory, the correlation coefficient is highest when the same person speaks the same words and second highest when the same person speaks different words. Only then can the speaker be identified. The experimental results in Table 3 show that female speaker L cannot be identified by the correlation coefficient, because the value for the same person speaking different words is lower than that for different people speaking the same words. So this cannot be an effective method for speaker recognition.
Table 3. Correlation coefficient of MFCC of two voice signals.
                                     y       L       x
Same people, same content            0.5298  0.6947  0.6371
Same people, different content       0.3665  0.4116  0.4446
Different people, same content       0.4161  0.6666  0.4463
Different people, different content  0.3932  0.5084  0.4544
2) Comparing the similarity of the corresponding three-dimensional maps
The data in Figures 1 and 2 are from the same person. Although the maps are similar in general, the MFCC data are numerous and show no regularity, so the three-dimensional maps drawn from them are intricate, and even after smoothing it is difficult to compare their similarity.
3) Comparing the correlation coefficient of each column
Because the MFCC dimensions are uncorrelated with one another, they can be compared independently.
As shown in Table 4, this method is not useful for comparing MFCC. In the first dimension, the correlation coefficient for women in the different people, different content case is larger than that for same people, same content. For same people, same content, the correlation is negative in the first, second, fifth, sixth, tenth and twelfth dimensions. So this method cannot serve as a way of comparing MFCC.
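The two correlation-based comparisons (methods 1 and 3) can be sketched in Python. How the paper aligns utterances of different lengths is not stated; this sketch simply truncates both MFCC matrices to the shorter one, which is an assumption, as are the function names.

```python
import numpy as np

def _align(a, b):
    """Truncate two (frames x dimensions) matrices to the same number of frames."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = min(len(a), len(b))
    return a[:n], b[:n]

def overall_correlation(a, b):
    """Method 1: one correlation coefficient over all MFCC values."""
    a, b = _align(a, b)
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def per_dimension_correlation(a, b):
    """Method 3: one correlation coefficient per MFCC dimension (column)."""
    a, b = _align(a, b)
    return np.array([np.corrcoef(a[:, j], b[:, j])[0, 1] for j in range(a.shape[1])])
```

Both measures compare frames position by position, so they are sensitive to timing differences between utterances, which may be one reason they behave poorly on different speech content.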
4) Euclidean distance
Table 5 shows that the Euclidean distance is smallest when the same person speaks the same words and second smallest when the same person speaks different words. Regardless of what the speaker says, the enrolled speaker with the minimum Euclidean distance is taken as the recognition result.
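Minimum-distance identification can be sketched as follows. The paper does not say how MFCC matrices of different lengths are reduced before the distance is computed; this Python sketch assumes each utterance is summarized by the mean of its frames, and the function names are hypothetical.

```python
import numpy as np

def utterance_vector(mfcc_frames):
    """Assumed summary: mean MFCC vector over all frames of an utterance."""
    return np.asarray(mfcc_frames, dtype=float).mean(axis=0)

def identify(test_frames, enrolled):
    """Return the enrolled speaker with the smallest Euclidean distance."""
    test_vec = utterance_vector(test_frames)
    distances = {
        name: float(np.linalg.norm(test_vec - utterance_vector(frames)))
        for name, frames in enrolled.items()
    }
    best = min(distances, key=distances.get)
    return best, distances
```

Here `enrolled` maps each speaker name to a (frames x dimensions) MFCC matrix from that speaker's enrollment speech, and the returned dictionary allows the ranking of all candidates to be inspected.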
Figure 1. Three-dimensional map of female vowel.
Figure 2. Three-dimensional map of female consonant.
Table 4. Correlation coefficient of each dimension of MFCC.
           Same people,  Same people,       Different people,   Different people,
           same content  different content  same content        different content
Dimension                                   woman     man       woman     man
1          −0.0053       −0.3237            −0.1266   −0.1800   0.0782    −0.2611
2          −0.3008       −0.1839            −0.2274   −0.3389   0.2223    0.1788
3          0.4592        0.3923             0.5468    0.3992    0.3423    −0.1024
4          0.3984        −0.1197            −0.0152   0.4367    0.2005    0.2433
5          −0.0324       −0.1635            0.3900    0.2537    −0.2118   −0.1130
6          −0.0547       0.0859             −0.2082   −0.0695   0.1711    −0.0897
7          0.0890        −0.0764            0.1685    −0.2432   −0.0870   0.1036
8          0.0187        0.2787             0.1434    0.1532    0.1763    −0.1018
9          0.4090        0.0681             0.2142    0.0786    0.2639    −0.0544
10         −0.0865       −0.1368            −0.0766   −0.1188   0.2791    0.3179
11         0.1750        −0.3922            0.4273    −0.2079   −0.2229   0.3064
12         −0.1299       0.0124             0.3571    0.0360    0.2908    0.0185
Table 5. Euclidean distance of MFCC of two speech signals.
                                             y        L        x
Same people, same content                    92,338   41,214   79,086
Same people, different content               124,110  90,346   139,190
Different people, same content (woman)       141,340  94,724   199,120
Different people, same content (man)         182,240  183,370  140,270
Different people, different content (woman)  176,860  92,334   219,090
Different people, different content (man)    149,040  116,800  188,970
3. Conclusions
Pitch and formant are both among the most important parameters of the speech signal. In theory, because of differences in oral cavity structure and vocal tract, everyone should have their own characteristic pitch and formants. However, the speech signal changes in complex ways, the vocal tract and noise affect the signal, and extraction methods are imperfect, so at present neither pitch nor formant is an effective parameter for speaker recognition; they can only play a supporting role. MFCC is effective for speaker identification because it combines the perceptual features of the human ear with the production mechanism of the voice.
A speaker’s personality cannot be represented well by a single parameter; using only one describes just part of the speaker’s characteristics. Therefore, to improve the speaker recognition rate, multiple parameters should be combined for identification.
REFERENCES
[1] H. Hu, “Introduction to Speech Signal Processing,” Harbin Institute of Technology Press, Harbin, 2000.
[2] X. J. Yang and H. S. Chi, “Digital Processing of Speech Signals,” Electronic Industry Press, Beijing, 1995.
[3] M. M. Sondhi, “New Methods of Pitch Extraction,” IEEE Transactions on Audio and Electroacoustics, Vol. 16, No. 1, 1968, pp. 262-266.
[4] K. Du, “LPC Analysis on Formant of Speech Signal,” Natural Science Journal of Harbin Normal University, Vol. 2, 1998, pp. 49-52.
[5] N. Do Minh, “An Automatic Speaker Recognition System,” Audio Visual Communications Laboratory, Swiss Federal Institute of Technology, Lausanne, 2001.
[6] Y. Chen, Z. Y. Qu, Y. Liu, K. Jiu, A. P. Guo and Z. G. Yang, “Extraction and Application on One of Speech Parameters: MFCC,” Journal of Hunan Agricultural University (Natural Science), Vol. 35, No. 1, 2009, pp. 106-107.