Application of parameters of voice signal autoregressive models to solve speaker recognition problems
Journal of Science and Technology on Information Security, No 2.CS (10) 2019

Evgeny Novikov, Vladimir Trubitsyn

Abstract: An approach to forming informative features of the voice signal (VS) of the Vietnamese language on the basis of stationary autoregressive model coefficients is described. An original VS segmentation algorithm, based on interval estimation of the numerical characteristics of speech samples, was developed to form local stationarity areas of the voice signal. The distinctive feature of the approach is the use of high-order autoregressive coefficients, the set of which is determined by discriminant analysis.

Keywords: voice signal; voice signal segmentation; informative speech features; autoregression models; discriminant data analysis.

Manuscript received July 19, 2019. The first reviewer commented on October 22, 2019 and accepted the manuscript on October 31, 2019. The second reviewer commented on December 17, 2019 and accepted the manuscript on December 27, 2019.

I. INTRODUCTION

Despite a large number of studies in the field of speaker recognition, the problem has not yet been fully solved, because the required accuracy of identification or verification is not achieved. The fundamental stage of speaker recognition is VS parameterization, that is, the formation of informative speech features that reflect the individual characteristics of the speaker. Currently, the following parameterization methods are most commonly used for speaker recognition [1-4]:
- formant methods;
- methods that analyze fundamental tone (pitch) statistics;
- methods based on linear prediction coefficients;
- cepstral methods.

These methods have been widely studied for the Slavic and Germanic languages, and automatic speaker recognition systems have been developed on their basis. However, the recognition accuracy of such systems does not allow their industrial implementation. The main reasons are the following:
- the absence of formalized criteria for selecting the window length used to decompose the original VS;
- ambiguity in choosing the basis functions for the VS transformation;
- instability of informative speech features with respect to noise;
- transformation of the original VS, which increases resource consumption and introduces significant errors into the calculation of informative speech features;
- significant variability of informative feature values for the same speaker.

At the same time, the existing systems have not been tested for the Vietnamese language, because they are not designed for Vietnamese speech and do not take its features into account. Therefore, there is a gap between the capabilities of existing voice recognition methods and the accuracy required of voice biometrics.

Modern speaker recognition systems favor stochastic VS models, which assume that the signal can be well described as a parametric random process and that its parameters can be estimated accurately enough [5, 6].
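As a minimal sketch of what "a parametric random process whose parameters can be estimated" can mean in practice, the Python example below fits an autoregressive model to a sequence by solving the Yule-Walker equations with the Levinson-Durbin recursion. This is a textbook estimator chosen for illustration; the text does not state which estimation method the authors use. The function names, the sign convention x[t] = a1*x[t-1] + ... + ap*x[t-p] + e[t], and the synthetic AR(2) test signal are all assumptions of the sketch.

```python
import random


def autocorrelation(x, max_lag):
    """Biased sample autocovariance r[0..max_lag] of a mean-removed sequence."""
    n = len(x)
    mean = sum(x) / n
    c = [v - mean for v in x]
    return [sum(c[t] * c[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for AR coefficients a[1..order].

    Assumed model convention (not taken from the paper):
        x[t] = a[1]*x[t-1] + ... + a[order]*x[t-order] + e[t]
    Returns (coefficients, residual variance).
    """
    a = []
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for the step from order m-1 to order m.
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients and append the new one.
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


def simulate_ar(coeffs, n, seed=0):
    """Generate n samples of a synthetic AR process driven by unit Gaussian noise."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p
    for _ in range(n):
        past = x[-1:-p - 1:-1]  # x[t-1], x[t-2], ..., x[t-p]
        x.append(sum(a * v for a, v in zip(coeffs, past)) + rng.gauss(0.0, 1.0))
    return x[p:]


if __name__ == "__main__":
    # Hypothetical stand-in for one locally stationary speech segment.
    segment = simulate_ar([0.75, -0.5], 20000, seed=1)
    est, noise_var = levinson_durbin(autocorrelation(segment, 2), 2)
    print("estimated coefficients:", est)
    print("residual variance:", noise_var)
```

On a real recording the same two functions would be applied to each locally stationary segment, with the model order chosen by the analyst; the order 2 used here only keeps the demonstration short.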
The stochastic approach, taking into account the peculiarities of Vietnamese speech (words are monosyllabic, speech is slower than Russian or English, and the temporal structure of voiced sounds is quite stable), makes extensive use of the autoregressive method. The VS autoregressive model is the most common mathematical description of the vocal tract for VS analysis and synthesis problems. This is explained by the adequacy of the model to the acoustic representation of the vocal tract as a sequence of tube sections [5]. The parameters estimated for an autoregressive model can be applied to VS recognition and synthesis problems. On the other hand, the coefficients of such a model can be interpreted as parameters of the vocal tract model that generates the VS. In this sense the autoregressive method is connected with methods for identifying dynamic systems, which make it possible to evaluate the structure and parameters of the identified models.

Another prerequisite for the use of autoregressive models is the possibility of representing the VS as a time series with time-varying probabilistic characteristics.

The considerations above suggest that the VS can be described by autoregressive models in the time domain and that these models can be applied to speaker recognition problems.

II. VOICE SIGNAL SEGMENTATION

The analysis of existing approaches to speaker recognition based on autoregressive models showed that the main reason for the insufficient reliability of the method is the significant variation of model parameters across different VS realizations. Among the influencing factors, the key one is the choice of the beginning and length of the speech segment that will provide stable parameters of the autoregressive models. At the same time, recent studies have shown that the best recognition of the speaker co ...
d on a step-by-step discriminant analysis with exclusion, using a statistic of the following form [18]:

$$F = \frac{n - g - s}{g - 1} \cdot \frac{1 - \Lambda}{\Lambda}, \qquad (14)$$

where $\Lambda = |W| / |T|$; $W$ and $T$ are the intra-group and inter-group correlation matrices of the AC, respectively; $s$ is the number of AC; and $n$ is the number of VS realizations for all speakers. Then, when a condition of the following form is satisfied:

$$F_{\mathrm{exp}} > F_{\mathrm{ul}}(\alpha,\ \nu_1 = g - 1,\ \nu_2 = n - g - s), \qquad (15)$$

the discriminant power of the AC is significant. Otherwise, the insignificant AC must be excluded from the list of informative features.

V. QUALITY ASSESSMENT OF SPEAKER RECOGNITION

The quality of speaker recognition based on the generated informative ACs is considered here using the classification problem as an example. The quality of speaker classification can only be assessed a posteriori. For this purpose, at the training stage, informative ACs are calculated for each realization of the password word (PW) for all the speakers registered in the system. After that, the average AC values $\bar{\hat{a}}_{jk}$ are calculated for each $j$-th password word and $k$-th speaker. At the classification stage, the AC values $\hat{a}_h, \hat{a}_{h+1}, \ldots, \hat{a}_p$ are calculated for the target speaker. Using the selected proximity measure, the obtained values must be compared with the average AC values calculated at the training stage, and the result is used to decide which class the target speaker is assigned to.

To assess how close the AC values calculated for the target speaker are to the reference values obtained at the training stage, a measure of the distance between two points in the multidimensional space of AC values is needed. In the general case, it is advisable to use the Mahalanobis distance, since, firstly, the standard deviations of the AC values may be unequal, and secondly, these values can be correlated.
To calculate the degree of proximity of the target speaker to a particular class, an expression of the following form is used:

$$D^2(A, C_k) = \sum_{i=h}^{p} \sum_{j=h}^{p} (w^{-1})_{ij}\,(\hat{a}_{ik} - \bar{\hat{a}}_{ik})(\hat{a}_{jk} - \bar{\hat{a}}_{jk}), \qquad (16)$$

where $D^2(A, C_k)$ is the square of the distance from the vector $A$ of AC values of the target speaker to the center of the class of vectors characterizing the $k$-th speaker; $(w^{-1})_{ij}$ is an element of the matrix inverse to the intragroup covariance matrix of the AC; $\hat{a}_{ik}$ is the value of the $i$-th AC in class $k$; and $\bar{\hat{a}}_{ik}$ is the average value of the $i$-th AC in class $k$. In the particular case when the intragroup covariance matrix is the identity matrix, the Mahalanobis distance is the Euclidean distance.

According to expression (16), the degree of proximity of the target speaker to each speaker registered in the system is estimated, and the target speaker is assigned to the class for which the value of the proximity measure is minimal. The indicator of classification quality is then the proportion of AC vectors correctly assigned to the corresponding class of speakers. The closer its value is to 1, the more accurate the separation of speakers.

To check the quality of speaker recognition based on the selected ACs, we analyzed the classification matrix, which contains the number and percentage of correctly classified observations in each group. For example, Tables 5 and 6 show the results of applying the developed procedure to compare the set of coefficients $\hat{a}_7$–$\hat{a}_{12}$, calculated for the sound a in the word na, with the set $\hat{a}_8$–$\hat{a}_{11}$ obtained after removing the excess coefficients $\hat{a}_7$ and $\hat{a}_{12}$.

TABLE 5.
SPEAKERS CLASSIFICATION RESULTS BY AC $\hat{a}_7$–$\hat{a}_{12}$ CALCULATED BY THE SEGMENT MODEL OF THE SOUND a IN THE WORD na

Classification matrix (rows: observed classes; columns: predicted classes)

| Group | Accuracy rate, % | speaker 1 | speaker 2 | speaker 3 | speaker 4 | speaker 5 | speaker 6 | speaker 7 | speaker 8 | speaker 9 | speaker 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| speaker 1 | 96.08 | 49 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| speaker 2 | 87.75 | 0 | 43 | 0 | 0 | 4 | 0 | 2 | 0 | 0 | 0 |
| speaker 3 | 84.00 | 0 | 0 | 42 | 2 | 0 | 0 | 0 | 5 | 1 | 0 |
| speaker 4 | 88.00 | 0 | 4 | 2 | 44 | 0 | 0 | 0 | 2 | 2 | 0 |
| speaker 5 | 88.00 | 2 | 2 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 0 |
| speaker 6 | 98.00 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 | 0 |
| speaker 7 | 98.00 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 |
| speaker 8 | 88.00 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 44 | 0 | 0 |
| speaker 9 | 98.00 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 |
| speaker 10 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 |
| Total | 92.60 | 52 | 49 | 50 | 47 | 48 | 50 | 51 | 51 | 52 | 50 |

TABLE 6. SPEAKERS CLASSIFICATION RESULTS BY AC $\hat{a}_8$–$\hat{a}_{11}$ CALCULATED BY THE SEGMENT MODEL OF THE SOUND ă IN THE WORD ăn

Classification matrix (rows: observed classes; columns: predicted classes)

| Group | Accuracy rate, % | speaker 1 | speaker 2 | speaker 3 | speaker 4 | speaker 5 | speaker 6 | speaker 7 | speaker 8 | speaker 9 | speaker 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| speaker 1 | 98.04 | 50 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| speaker 2 | 97.96 | 0 | 48 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| speaker 3 | 100.00 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| speaker 4 | 100.00 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 |
| speaker 5 | 100.00 | 0 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 |
| speaker 6 | 100.00 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
| speaker 7 | 100.00 | 0 | 1 | 0 | 0 | 0 | 0 | 49 | 0 | 0 | 0 |
| speaker 8 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 |
| speaker 9 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 0 |
| speaker 10 | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 |
| Total | 99.40 | 50 | 49 | 50 | 50 | 50 | 50 | 51 | 50 | 50 | 50 |

It can be seen from the tables that the quality of speaker classification using the AC $\hat{a}_8$–$\hat{a}_{11}$ (Tab. 6) significantly exceeds the quality of speaker classification based on the initial set $\hat{a}_7$–$\hat{a}_{12}$ (Tab. 5).
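The decision rule of expression (16) and the overall accuracy figure reported in the tables can be sketched in Python as follows. The sketch assigns a feature vector to the speaker whose class mean is nearest in Mahalanobis distance, using a pooled within-group covariance matrix, and computes the share of correctly assigned vectors from a classification matrix. The function names, the toy two-speaker training data, the n - g divisor for the pooled covariance, and the use of a linear solve in place of an explicit matrix inverse are assumptions made for illustration; only the Table 6 counts are taken from the text.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def class_means(train):
    """train: {speaker: [feature vectors]} -> {speaker: mean vector}."""
    return {k: [sum(col) / len(vs) for col in zip(*vs)] for k, vs in train.items()}


def pooled_covariance(train, means):
    """Pooled within-group covariance matrix (assumed divisor: n - g)."""
    dim = len(next(iter(means.values())))
    w = [[0.0] * dim for _ in range(dim)]
    n = 0
    for k, vs in train.items():
        for v in vs:
            d = [a - m for a, m in zip(v, means[k])]
            for i in range(dim):
                for j in range(dim):
                    w[i][j] += d[i] * d[j]
        n += len(vs)
    dof = n - len(train)
    return [[e / dof for e in row] for row in w]


def mahalanobis_sq(vec, mean, w):
    """Squared Mahalanobis distance d^T W^{-1} d, computed via a linear solve."""
    d = [a - m for a, m in zip(vec, mean)]
    return sum(di * yi for di, yi in zip(d, solve(w, d)))


def classify(vec, means, w):
    """Return the class whose mean is nearest to vec in Mahalanobis distance."""
    return min(means, key=lambda k: mahalanobis_sq(vec, means[k], w))


def accuracy(confusion):
    """Share of observations on the diagonal of a classification matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Hypothetical two-speaker training set with two features per vector.
TOY_TRAIN = {
    "A": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]],
}

# Counts copied from Table 6 (rows: observed classes, columns: predicted classes).
TABLE_6 = [
    [50, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 48, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 50, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 50, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 50, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 50, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 49, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 50, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 50, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 50],
]

if __name__ == "__main__":
    means = class_means(TOY_TRAIN)
    w = pooled_covariance(TOY_TRAIN, means)
    print("assigned class:", classify([1.0, 2.0], means, w))
    print("Table 6 overall accuracy:", accuracy(TABLE_6))
```

With an identity covariance matrix, `mahalanobis_sq` reduces to the squared Euclidean distance, which matches the special case noted for expression (16).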
If the speaker classification contains significant errors, it is advisable to analyze the sets of autoregression coefficients in order to identify the observations that caused these deviations. Incorrectly classified observations must be excluded from the training set of VS realizations. To exclude an observation, the corresponding AC set and its group membership number are removed from the table of initial data, after which the assessment of classification quality is repeated. When the next observation is deleted from a class, new incorrectly assigned coefficient vectors may appear that were counted as correctly assigned before the removal. The exclusion procedure must be continued until the classification accuracy indicator reaches its maximum value. When forming informative ACs, this approach makes it possible to identify and exclude, for each speaker, the VS realizations that for various reasons (illness, emotional state, etc.) differ from the others in the class.

VI. CONCLUSION

The article presents an approach to VS modeling for solving the speaker recognition problem. It is shown that the parameters of autoregressive time series models describing voice signals can be used as informative features characterizing speakers. An algorithm is proposed for automatic VS segmentation into quasi-stationary sections based on interval estimation of the standard deviation of speech samples. To solve the speaker recognition problem, the segments selected are those with unchanged PTF and maximum energy, since they contain the basic information about the speaker's features. It is demonstrated that the segments formed by the developed algorithm are stationary time series, which allows autoregressive models of various orders to be used to describe them.
In order to reduce the uncertainty in forming the decision rule for speaker recognition, it is proposed to include only higher-order ACs in the model, since they characterize the high-frequency variations of the VS and contain basic information about the speaker's features. The possibility of using multivariate discriminant analysis to substantiate the set of ACs is shown.

The results of assessing the quality of speaker classification allow us to conclude that the described approach can be used to create automatic speaker recognition systems for the Vietnamese language.

REFERENCES

[1] Sorokin V. N., Vyugin V. V., Tananykin A. A. "Voice recognition: analytical review". Information Processes, Vol. 12, No. 1, pp. 1–13, 2012.
[2] Pervushin E. A. "Review of the main methods of speaker recognition". Mathematical Structures and Modeling, No. 3 (24), pp. 41–54, 2011.
[3] Rohmanenko I. "Algorithms and software for verifying an announcer using an arbitrary phrase": thesis ... cand. tech. sciences. Tomsk, 111 pp., 2017. [Electronic resource]. URL: https://postgraduate.tusur.ru/system/file_copies/files/000/000/262/original/dissertation.pdf
[4] Ahmad K. S. "A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network". Advances in Pattern Recognition (ICAPR), Eighth International Conference on, pp. 16, 2015.
[5] Markel J. D., Gray A. H. Linear Prediction of Speech [trans. from English]; ed. Yu. N. Prokhorov and V. S. Zvezdin. Moscow: Communication, 308 p., 1980.
[6] Lysak A. B. "Identification and authentication of a person: a review of the basic biometric methods of user authentication of computer systems". Mathematical Structures and Modeling, No. 2 (26), pp. 124–134, 2012.
[7] Meshcheryakov R. V., Konev A. A. "Algorithms for evaluating automatic segmentation of a speech signal". Informatics and Control Systems, No. 1 (31), pp. 195–206, 2012.
[8] Ding J., Yen C. T. "Enhancing GMM speaker identification by incorporating SVM speaker verification for intelligent web-based speech applications". Multimedia Tools and Applications, Vol. 74, No. 14, pp. 5131–5140, 2015.
[9] Trubitsyn V. G. Models and algorithms in speech signal analysis systems: dis. ... cand. tech. sciences. Belgorod, 134 pp., 2013. [Electronic resource]. URL: algoritmy-v-sistemakh-analiza-rechevykh-signalov.
[10] Ganapathiraju A., Hamaker J., Picone J., Doddington G. R., Ordowski M. "Syllable-Based Large Vocabulary Continuous Speech Recognition". IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 4, pp. 358–366, 2001.
[11] Tomchuk K. K. Segmentation of speech signals for tasks of automatic speech processing: dis. cand. tech. sciences. St. Petersburg, 197 pp., 2017. [Electronic resource].
[12] Sorokin V. N., Tsyplikhin A. I. "Segmentation and recognition of vowels". Information Processes, Vol. 4, No. 2, pp. 202–220, 2004.
[13] Nguyen An Tuan. Automatic analysis, recognition and synthesis of tonal speech (based on the material of the Vietnamese language): dissertation ... Doctor of technical sciences. Moscow, 456 pp., 1984. [Electronic resource]. URL: https://dissercat.com/content/avtomaticheskii-analiz-raspoznavanie-i-sintez-tonalnoi-rechi-na-materiale-vetnamskogo-yazyka
[14] Gmurman V. E. Probability theory and mathematical statistics: textbook for universities. 12th ed., revised. Moscow: Yurayt, 478 p., 2010.
[15] Box G., Jenkins G. Time Series Analysis; trans. from English; ed. V. F. Pisarenko. Moscow: Mir, 406 pp., 1974.
[16] Kantorovich G. G. Analysis of time series. Moscow, 129 pp., 2003. [Electronic resource]. URL: http://biznesbooks.com/components/com_jshopping/files/demo_products/kantorovich-g-g-analiz-vremennykh-ryadov.pdf
[17] Novikov E. I., Do Kao Khan. "Parameterization of a speech signal based on autoregressive models". XI All-Russian Interdepartmental Scientific Conference "Actual problems of the development of security systems, special communications and information for the needs of public authorities of the Russian Federation": materials and reports (Oryol, February 5-6, 2019). In 10 parts; ed. P. L. Malyshev. Oryol: Academy of the Federal Security Service of Russia, pp. 127–130, 2019.
[18] Kim J.-O., Mueller C. W., Klecka W. R. et al. Factor, discriminant and cluster analysis; trans. from English; ed. I. S. Enyukov. Moscow: Finance and Statistics, 215 pp., 1989.

ABOUT THE AUTHORS

PhD. Evgeny Novikov
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: nei05@rambler.ru
Education: received his Ph.D. degree at the Research Institute of Radio-Electronic Systems of the Russian Federation in Sep 2010.
Current research: modeling of random processes, statistical data processing and analysis, decision-making.
Education: received his Ph.D. degree in Engineering Sciences at the Academy of Federal Guard Service of the Russian Federation in Dec 2013.
Current research: information security, unauthorized access protection, mathematical cryptography, theoretical problems of computer science.

PhD. Vladimir Trubitsyn
Workplace: The Academy of Federal Guard Service of the Russian Federation.
Email: gremlin.kop@mail.ru
Education: received his Ph.D. degree at Belgorod Technical University of the Russian Federation in Dec 2014.
Current research: modeling of random processes, information and coding theory, voice signal processing and analysis.