Application of parameters of voice signal autoregressive models to solve speaker recognition problems

Journal of Science and Technology on Information Security, No 2.CS (10) 2019
Evgeny Novikov, Vladimir Trubitsyn
Abstract— An approach to the formation of 
informative features of the Vietnamese-language voice signal (VS) on the basis of stationary autoregressive model coefficients is described. An original VS segmentation algorithm, based on interval estimation of the numerical characteristics of speech samples, was developed to form locally stationary areas of the voice signal. A distinctive feature is the use of high-order autoregressive coefficients, the set of which is determined on the basis of discriminant analysis.
Keywords— voice signal; voice signal 
segmentation; informative speech features; 
autoregression models; discriminant data analysis. 
This manuscript was received on July 19, 2019. It was reviewed on October 22, 2019 and accepted on October 31, 2019 by the first reviewer; it was reviewed on December 17, 2019 and accepted on December 27, 2019 by the second reviewer.
I. INTRODUCTION
Despite a large number of studies in the field of speaker recognition, this problem has not yet been fully solved, since the required accuracy of identification or verification is not achieved.
The fundamental stage of speaker recognition is VS parameterization, that is, the formation of informative speech features that reflect the individual characteristics of the speaker. Currently, the following parameterization methods are most commonly used for speaker recognition [1-4]:
• Formant methods;
• Methods of analyzing primary tone statistics;
• Methods based on linear prediction coefficients;
• Cepstral methods.
These methods are widely studied for the 
Slavic and Germanic languages and automatic 
speaker recognition systems are developed on 
their basis. However, the recognition accuracy of 
such systems does not allow their industrial 
implementation. The main reasons for this 
situation are the following: 
• The absence of formalized criteria of selecting the length of the window for the original VS decomposition;
• Ambiguity of choosing the basic VS conversion functions;
• Instability of informative speech features relative to noise;
• Transformation of the original VS, leading to an increase in resource capacity and significant errors in calculating informative speech features;
• Significant variability of informative feature values for the same speaker.
At the same time, the existing systems have not 
been tested for the Vietnamese language, because 
they are not focused on the Vietnamese speech 
and do not take into account its features. 
Therefore, there is a contradiction between the capabilities of existing voice recognition methods and the need to ensure the required accuracy of voice biometrics.
When creating modern systems of the 
speaker’s voice recognition, the advantage is 
given to VS stochastic models, which are based on 
the assumption that the signal can be well described 
as a parametric random process and that its 
parameters can be estimated accurately enough [5, 6]. 
This approach, taking into account the peculiarities of Vietnamese speech (words are monosyllabic, speech is slower than in Russian or English, and the temporal structure of vocalized sounds is quite stable), makes extensive use of the autoregressive method.
The VS autoregressive model is the most common mathematical description of the vocal tract for solving the problems of VS analysis and synthesis. This is explained by the adequacy of the model to the acoustic representation of the vocal tract as a sequence of tube segments [5]. The method of estimating autoregressive model parameters allows them to be applied to the problems of VS recognition and synthesis. On the other hand, the coefficients of such a model can be interpreted as parameters of the vocal tract model that generates the VS. In this case, we can speak of a connection between the autoregressive method and the methods of identifying dynamic systems, which allow us to evaluate the structure and parameters of the identified models. Another prerequisite for the use of autoregressive models is the possibility of representing the VS as a time series with time-varying probabilistic characteristics.
The above suggests that VS can be described by autoregressive models in the time domain and that these models can be applied to speaker recognition problems.
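To make the time-domain description concrete, the following minimal Python sketch estimates the coefficients of an autoregressive model $x_t = \sum_{i=1}^{p} a_i x_{t-i} + e_t$ for one signal segment. It solves the Yule-Walker equations with the Levinson-Durbin recursion, a standard technique in linear prediction of speech; this excerpt does not state which estimator the authors use, so the choice of method, the function names (`autocorrelation`, `levinson_durbin`, `simulate_ar`), the model order, and the synthetic test signal are all assumptions made for illustration.

```python
import random

def autocorrelation(x, max_lag):
    """Biased sample autocovariance r[0..max_lag] of the mean-removed series."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[t] * c[t - k] for t in range(k, n)) / n for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; return (a_1..a_p, residual variance)."""
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for step m.
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients and append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err

def simulate_ar(coeffs, n, seed=0):
    """Synthetic stationary AR series (hypothetical stand-in for a VS segment)."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p
    for _ in range(n):
        x.append(sum(a * x[-1 - i] for i, a in enumerate(coeffs)) + rng.gauss(0.0, 1.0))
    return x[p:]

segment = simulate_ar([0.6, -0.3], 5000)
r = autocorrelation(segment, 2)
est, noise_var = levinson_durbin(r, 2)
print([round(a, 2) for a in est])
```

On the synthetic segment the estimates should lie close to the generating coefficients 0.6 and -0.3. For real speech the segment would first have to be locally stationary, which is the purpose of the segmentation step discussed in the next section.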
II. VOICE SIGNAL SEGMENTATION 
The analysis of existing approaches to speaker recognition based on autoregressive models showed that the main reason for the method's lack of reliability is the significant variation of the model parameters across different VS realizations. Among the influencing factors, the key one is the choice of the beginning and length of the speech segment that will provide stable parameters of the autoregressive models. At the same time, recent studies have shown that the best
recognition of the speaker co ... d on a step-by-step 
discriminant analysis with exceptions based on 
statistics of the following form [18]:
$$F = \frac{n - g - s}{g - 1} \cdot \frac{1 - \Lambda}{\Lambda}, \qquad (14)$$
where $\Lambda = |W| / |T|$, $W$ and $T$ are the intra-group and inter-group correlation matrices of the AC, respectively, $s$ is the number of AC, and $n$ is the number of VS realizations for all speakers.
Then, when a condition of the following form is satisfied:
$$F_{exp} > F_{ul}(\alpha,\ \nu_1 = g - 1,\ \nu_2 = n - g - s), \qquad (15)$$
the discriminant power of the AC is significant. Otherwise, the insignificant AC must be excluded from the list of informative features.
V. QUALITY ASSESSMENT OF SPEAKER 
RECOGNITION 
Assessing the quality of speaker recognition on the basis of the generated informative ACs will be considered using the classification problem as an example.
The quality of speaker classification can only be assessed a posteriori. For this purpose, at the training stage, informative ACs are calculated for each PW (password word) realization for all the speakers registered in the system. After that, the average AC values $\bar{\hat a}_{jk}$ are calculated for each j-th password word and k-th speaker.
At the classification stage, the AC values $\hat a_h, \hat a_{h+1}, \ldots, \hat a_p$ for the target speaker are calculated.
The obtained values using the selected 
proximity measure must be compared with the 
average AC values calculated at the training 
stage and, based on the results, decide on 
assigning the target speaker to a specific class. 
To assess the degree of proximity of AC values 
calculated for the target speaker to their 
reference values obtained at the training stage, it 
is necessary to use some measure to find the 
distance between two points in the 
multidimensional space of AC values. In the 
general case, it is advisable to use the 
Mahalanobis distance, since, firstly, the 
standard deviations of the AC values may be 
unequal, and secondly, these values can be 
correlated. To calculate the proximity degree of 
the target speaker to a particular class, an 
expression of the following form is used:
$$D^2(A, C_k) = \sum_{i=h}^{p} \sum_{j=h}^{p} (w^{-1})_{ij} \, (\hat a_{ik} - \bar{\hat a}_{ik})(\hat a_{jk} - \bar{\hat a}_{jk}), \qquad (16)$$
where $D^2(A, C_k)$ is the square of the distance from the vector $A$ of AC values of the target speaker to the center of the class of vectors characterizing the k-th speaker, $(w^{-1})_{ij}$ is the element of the matrix inverse to the intragroup covariance matrix of the AC, $\hat a_{ik}$ is the value of the i-th AC in class k, and $\bar{\hat a}_{ik}$ is the average value of the i-th AC in class k.
In the particular case when the intragroup covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.
According to expression (16), the degree of 
proximity of the target speaker to each speaker 
registered in the system is estimated, and the 
target speaker is assigned to the class for which 
the value of the proximity measure is minimal. 
Then the indicator of the classification quality will 
be the proportion of correctly assigned AC vectors 
to the corresponding class of speakers. It is clear 
that the closer its value is to 1, the more accurate 
is the separation of speakers. 
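The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-dimensional toy vectors stand in for AC vectors, the pooled within-class covariance is used as the intragroup covariance matrix, and all function names (`mean_vector`, `pooled_within_covariance`, `solve`, `mahalanobis_sq`, `classify`) are hypothetical. The linear system is solved directly instead of forming the inverse matrix, and the covariance matrix is assumed to be nonsingular.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equally long vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_within_covariance(classes):
    """Pooled intragroup covariance; classes maps label -> list of AC vectors."""
    dim = len(next(iter(classes.values()))[0])
    cov = [[0.0] * dim for _ in range(dim)]
    total = 0
    for rows in classes.values():
        mu = mean_vector(rows)
        for row in rows:
            d = [v - m for v, m in zip(row, mu)]
            for i in range(dim):
                for j in range(dim):
                    cov[i][j] += d[i] * d[j]
        total += len(rows)
    dof = total - len(classes)
    return [[c / dof for c in row] for row in cov]

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def mahalanobis_sq(vec, center, cov):
    """Squared Mahalanobis distance d^T * cov^{-1} * d with d = vec - center."""
    d = [v - c for v, c in zip(vec, center)]
    return sum(di * si for di, si in zip(d, solve(cov, d)))

def classify(vec, classes):
    """Assign vec to the class whose center is nearest in Mahalanobis distance."""
    cov = pooled_within_covariance(classes)
    dists = {k: mahalanobis_sq(vec, mean_vector(rows), cov) for k, rows in classes.items()}
    return min(dists, key=dists.get), dists

# Toy training data (not taken from the paper)
classes = {
    "speaker 1": [[0.0, 0.0], [1.0, 0.2], [0.2, 1.0], [1.2, 1.2]],
    "speaker 2": [[5.0, 5.0], [6.0, 5.2], [5.2, 6.0], [6.2, 6.2]],
}
label, dists = classify([0.9, 0.5], classes)
print(label)
```

With an identity covariance matrix, `mahalanobis_sq` returns the squared Euclidean distance, which matches the special case noted in the text.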
To check the quality of speaker recognition 
based on the selected ACs, we analyzed the 
classification matrix, which contains 
information on the number and percentage of 
correctly classified observations in each group. 
For example, Tables 5 and 6 show the results of applying the developed procedure to compare the set of coefficients $\hat a_7 \ldots \hat a_{12}$, calculated for the sound a in the word na, with the set $\hat a_8 \ldots \hat a_{11}$ obtained after removing the excess coefficients $\hat a_7$ and $\hat a_{12}$.
TABLE 5. SPEAKERS CLASSIFICATION RESULTS BY AC $\hat a_7 \ldots \hat a_{12}$ CALCULATED BY THE SEGMENT MODEL OF THE SOUND a IN THE WORD na
Classification matrix (rows: observed classes; columns: predicted classes)
| Group | Accuracy rate, % | speaker 1 | speaker 2 | speaker 3 | speaker 4 | speaker 5 | speaker 6 | speaker 7 | speaker 8 | speaker 9 | speaker 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| speaker 1 | 96.08 | 49 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| speaker 2 | 87.75 | 0 | 43 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 0 |
| speaker 3 | 84.00 | 0 | 0 | 42 | 2 | 0 | 0 | 0 | 5 | 1 | 0 |
| speaker 4 | 88.00 | 0 | 4 | 2 | 44 | 0 | 0 | 0 | 2 | 2 | 0 |
| speaker 5 | 88.00 | 2 | 2 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 0 |
| speaker 6 | 98.00 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 | 0 |
| speaker 7 | 98.00 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 |
| speaker 8 | 88.00 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 44 | 0 | 0 |
| speaker 9 | 98.00 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 |
| speaker 10 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 |
| Total | 92.60 | 52 | 49 | 50 | 47 | 48 | 50 | 51 | 51 | 52 | 50 |
TABLE 6. SPEAKERS CLASSIFICATION RESULTS BY AC $\hat a_8 \ldots \hat a_{11}$ CALCULATED BY THE SEGMENT MODEL OF THE SOUND ă IN THE WORD ăn
Classification matrix (rows: observed classes; columns: predicted classes)
| Group | Accuracy rate, % | speaker 1 | speaker 2 | speaker 3 | speaker 4 | speaker 5 | speaker 6 | speaker 7 | speaker 8 | speaker 9 | speaker 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| speaker 1 | 98.04 | 50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| speaker 2 | 97.96 | 0 | 48 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| speaker 3 | 100.00 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| speaker 4 | 100.00 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 |
| speaker 5 | 100.00 | 0 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 |
| speaker 6 | 100.00 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
| speaker 7 | 100.00 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 |
| speaker 8 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 |
| speaker 9 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 |
| speaker 10 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 |
| Total | 99.40 | 50 | 49 | 50 | 50 | 50 | 50 | 51 | 50 | 50 | 50 |
It can be seen from the tables that the quality of speaker classification using AC $\hat a_8 \ldots \hat a_{11}$ (Tab. 6) significantly exceeds the quality of speaker classification based on the initial set $\hat a_7 \ldots \hat a_{12}$ (Tab. 5).
In case of significant errors in speaker classification, it is advisable to analyze the sets of autoregression coefficients in order to identify the observations that caused these deviations.
Incorrectly classified observations must be excluded from the training set of VS realizations. The exclusion procedure is as follows: the AC set to be excluded, together with the number of the group it belongs to, is removed from the table of initial data, after which the assessment of speaker classification quality is repeated. When the next observation is deleted from a class, new incorrectly assigned coefficient vectors may appear that were counted as correctly assigned before the removal. The exclusion procedure must be continued until the classification accuracy indicator reaches its maximum value. When forming informative ACs, this approach makes it possible to identify and exclude, for each speaker, the VS realizations that for various reasons (illness, emotional state, etc.) differ from the others in the class.
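The exclusion procedure can be illustrated with a short Python sketch. To stay compact it substitutes a Euclidean nearest-centroid rule for expression (16), removes one misassigned observation at a time, and stops as soon as the accuracy no longer improves. The stopping rule, the toy one-dimensional data, and the function names (`centroid`, `nearest_class`, `accuracy_and_errors`, `prune_training_set`) are assumptions for illustration and are not taken from the paper.

```python
def centroid(rows):
    """Component-wise mean of the vectors of one class."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_class(vec, centers):
    """Label of the class center closest to vec (squared Euclidean distance)."""
    return min(centers, key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, centers[k])))

def accuracy_and_errors(samples):
    """samples: list of (label, vector). Returns (accuracy, indices of misassigned samples)."""
    by_label = {}
    for label, vec in samples:
        by_label.setdefault(label, []).append(vec)
    centers = {k: centroid(v) for k, v in by_label.items()}
    wrong = [i for i, (label, vec) in enumerate(samples)
             if nearest_class(vec, centers) != label]
    return 1 - len(wrong) / len(samples), wrong

def prune_training_set(samples):
    """Drop misassigned observations one at a time while accuracy keeps improving."""
    current = list(samples)
    best_acc, wrong = accuracy_and_errors(current)
    while wrong:
        # Remove the first misassigned observation and reassess the whole set,
        # because the class centers shift after every removal.
        trial = current[:wrong[0]] + current[wrong[0] + 1:]
        trial_acc, trial_wrong = accuracy_and_errors(trial)
        if trial_acc <= best_acc:
            break  # accuracy no longer improves, so stop excluding
        current, best_acc, wrong = trial, trial_acc, trial_wrong
    return current, best_acc

# Toy training set: the last "A" realization is atypical for its speaker.
training = [("A", [0.0]), ("A", [0.2]), ("A", [0.1]), ("A", [2.9]),
            ("B", [3.0]), ("B", [3.1]), ("B", [2.8])]
kept, acc = prune_training_set(training)
print(len(kept), acc)
```

Reassessing after each single removal reflects the remark in the text that deleting one observation can change which of the remaining vectors are assigned correctly.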
VI. CONCLUSION 
The article presents an approach to modeling VS to solve the speaker recognition problem. It is shown that the parameters of autoregressive time series models describing voice signals can be used as informative features characterizing speakers.
An algorithm is proposed for automatic VS segmentation into quasi-stationary sections based on interval estimation of the standard deviation of speech samples. To solve the speaker recognition problem, the segments with unchanged PTF and maximum energy are selected, since they contain the basic information about the speaker's individual features.
It is demonstrated that the segments formed by the developed algorithm are stationary time series, which allows autoregressive models of various orders to be used to describe them. To reduce the uncertainty in forming the decision rule for speaker recognition, it is proposed to include only higher-order ACs in the model, since they characterize high-frequency VS variations and contain the basic information about the speaker's features. The possibility of using multivariate discriminant analysis to substantiate the choice of the AC set is shown.
The results of assessing the quality of speaker classification allow us to conclude that the described approach can be used to create automatic speaker recognition systems for the Vietnamese language.
REFERENCES 
[1]. Sorokin V. N., Vyugin V. V., Tananykin A. A. "Voice recognition: analytical review" // Information Processes, Vol. 12, No. 1, pp. 1–13, 2012.
[2]. Pervushin E. A. "Review of the main methods of speaker recognition" // Mathematical Structures and Modeling, No. 3 (24), pp. 41–54, 2011.
[3]. Rohmanenko I. Algorithms and software for verifying an announcer using an arbitrary phrase: thesis ... cand. tech. sciences. Tomsk, 111 pp., 2017. [Electronic resource]. URL: https://postgraduate.tusur.ru/system/file_copies/files/000/000/262/original/dissertation.pdf
[4]. Ahmad K. S. "A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network" // Advances in Pattern Recognition (ICAPR), Eighth International Conference on, pp. 1–6, 2015.
[5]. Markel J. D., Gray A. H. Linear Prediction of Speech / trans. from English; ed. Yu. N. Prokhorov and V. S. Zvezdin. Moscow: Communication, 308 pp., 1980.
[6]. Lysak A. B. "Identification and authentication of a person: a review of the basic biometric methods of user authentication of computer systems" // Mathematical Structures and Modeling, No. 2 (26), pp. 124–134, 2012.
[7]. Meshcheryakov R. V., Konev A. A. "Algorithms for evaluating automatic segmentation of a speech signal" // Informatics and Control Systems, No. 1 (31), pp. 195–206, 2012.
[8]. Ding J., Yen C. T. "Enhancing GMM speaker identification by incorporating SVM speaker verification for intelligent web-based speech applications" // Multimedia Tools and Applications, Vol. 74, No. 14, pp. 5131–5140, 2015.
[9]. Trubitsyn V. G. Models and algorithms in speech signal analysis systems: dis. ... cand. tech. sciences. Belgorod, 134 pp., 2013. [Electronic resource]. URL: algoritmy-v-sistemakh-analiza-rechevykh-signalov.
[10]. Ganapathiraju A., Hamaker J., Picone J., Doddington G. R., Ordowski M. "Syllable-Based Large Vocabulary Continuous Speech Recognition" // IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 4, pp. 358–366, 2001.
[11]. Tomchuk K. K. Segmentation of speech signals for tasks of automatic speech processing: dis. cand. tech. sciences. St. Petersburg, 197 pp., 2017. [Electronic resource].
[12]. Sorokin V. N., Tsyplikhin A. I. "Segmentation and recognition of vowels" // Information Processes, Vol. 4, No. 2, pp. 202–220, 2004.
[13]. Nguyen An Tuan. Automatic analysis, recognition and synthesis of tonal speech (based on the material of the Vietnamese language): dissertation ... doctor of technical sciences. Moscow, 456 pp., 1984. [Electronic resource]. URL: https://dissercat.com/content/avtomaticheskii-analiz-raspoznavanie-i-sintez-tonalnoi-rechi-na-materiale-vetnamskogo-yazyka
[14]. Gmurman V. E. Probability theory and mathematical statistics: textbook for universities. 12th ed., revised. Moscow: Yurayt, 478 pp., 2010.
[15]. Box G., Jenkins G. Time Series Analysis / trans. from English; ed. V. F. Pisarenko. Moscow: Mir, 406 pp., 1974.
[16]. Kantorovich G. G. Analysis of time series. Moscow, 129 pp., 2003. [Electronic resource]. URL: http://biznesbooks.com/components/com_jshopping/files/demo_products/kantorovich-g-g-analiz-vremennykh-ryadov.pdf
[17]. Novikov E. I., Do Kao Khan. "Parameterization of a speech signal based on autoregressive models" // XI All-Russian Interdepartmental Scientific Conference "Actual problems of the development of security systems, special communications and information for the needs of public authorities of the Russian Federation": materials and reports (Oryol, February 5-6, 2019). In 10 parts / ed. P. L. Malyshev. Oryol: Academy of the Federal Security Service of Russia, pp. 127–130, 2019.
[18]. Kim J.-O., Mueller C. W., Klecka W. R., et al. Factor, discriminant and cluster analysis / trans. from English; ed. I. S. Enyukov. Moscow: Finance and Statistics, 215 pp., 1989.
ABOUT THE AUTHORS
PhD. Evgeny Novikov
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: nei05@rambler.ru
Education: received his Ph.D. degree at the Research Institute of Radio-Electronic Systems of the Russian Federation in Sep 2010.
Current research: modeling of random processes, statistical data processing and analysis, decision-making.
Education: received his Ph.D. degree in Engineering Sciences at the Academy of Federal Guard Service of the Russian Federation in Dec 2013.
Current research: information security, unauthorized access protection, mathematical cryptography, theoretical problems of computer science.
PhD. Vladimir Trubitsyn
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: gremlin.kop@mail.ru
Education: received his Ph.D. degree at Belgorod Technical University of the Russian Federation in Dec 2014.
Current research: modeling of random processes, information and coding theory, voice signal processing and analysis.
