Classification of sequences generated by compression and encryption algorithms
Bài báo này xem xét khả năng sử
dụng phương pháp thử nghiệm các đặc trưng của
dãy bít như là một trong các cách tiếp cận để giải
quyết bài toán phân loại các dãy giả ngẫu nhiên
và các dãy được tạo ra bởi các thuật toán nén và
mã hóa. Các kết quả của việc đánh giá dẫn tới kết
luận rằng không gian đặc trưng được đề xuất có
thể được sử dụng để xác định các thuật toán nén
ZIP, RAR và các thuật toán mã hóa AES, 3DES
với độ chính xác lớn hơn 95%.
Trang 1
Trang 2
Trang 3
Trang 4
Trang 5
Trang 6
Bạn đang xem tài liệu "Classification of sequences generated by compression and encryption algorithms", để tải tài liệu gốc về máy hãy click vào nút Download ở trên
Tóm tắt nội dung tài liệu: Classification of sequences generated by compression and encryption algorithms
Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin No 2.CS (10) 2019 3 Classification of Sequences Generated by Compression and Encryption Algorithms Alexander Kozachok, Spirin Andrey Andreevich Abstract—The paper considers the possibility of using the method of testing the properties of bit sequences as one of the possible approaches to solving the problem of classifying pseudo-random sequences and the sequences formed by encryption and compression algorithms. The results of the analysis led to the conclusion that the proposed feature space could be used to identify ZIP, RAR compression algorithms and AES, 3DES encryption algorithms with an accuracy of more than 0.95. Tóm tắt—Bài báo này xem xét khả năng sử dụng phương pháp thử nghiệm các đặc trưng của dãy bít như là một trong các cách tiếp cận để giải quyết bài toán phân loại các dãy giả ngẫu nhiên và các dãy được tạo ra bởi các thuật toán nén và mã hóa. Các kết quả của việc đánh giá dẫn tới kết luận rằng không gian đặc trưng được đề xuất có thể được sử dụng để xác định các thuật toán nén ZIP, RAR và các thuật toán mã hóa AES, 3DES với độ chính xác lớn hơn 95%. Keywords—Identification of compression and encryption algorithms, statistical testing of information. Từ khóa—Xác định các thuật toán nén và mã hóa, kiểm tra thông tin thống kê. I. INTRODUCTION According to reports from the Infowatch Analytics Center [1-3], the number of confidential data leaks is growing from year to year. Fig.1 provides statistics on the leaks occurred in 2011-2018. The total damage from leaks in 2013 amounted to more than US$7.5 billion. Equifax, the international credit reference bureau, has already spent US$1.4 billion to eliminate one information leakage that occurred in 2017 and the amount has continued to grow due to repetitive claims. The share of This manuscript is received April 22, 2019. It is commented on July 30, 2019 and is accepted on August 6, 2019 by the first reviewer. It is commented on August 20, 2019 and is accepted on August 27, 2019 by the second reviewer. intentional leaks carried out through network channels in 2018 is 86.6 %. One of the means to prevent leaks is using DLP (Data Leakage Prevention) systems. The methods used in them are divided into the methods of content analysis and the ones of analyzing the context of the information transmitted outside [4]. The methods of content analysis include the following: 1. Methods based on the predefined rules. 2. Hash function. 3. Statistical methods. Fig.1. Statistics on the leakages occurred for 2011-2018 Methods based on the predefined rules possess high performance rate, but they badly cope with the problem if there is unknown data. Hash methods calculate the hash values of the data sent outside and compare them with the value of the prepared template, this method works only if the data has not been changed after the hash value template has been compiled. Statistical methods are based on statistical properties of data, for example, the frequency of words in the text. Encryption can be used to hide the meaning and content of the data [4]. Other means of preventing leaks are intrusion detection systems (Intrusion Detection System – IDS), which are able to analyze traffic and warn of a potential threat, but when Journal of Science and Technology on Information Security 4 No 2.CS (10) 2019 transmitting data in encrypted form, they perform only limited verification based on batch headers. The complexity of analyzing the content of encrypted network traffic is one of the urgent problems for IDS. II. A REVIEW OF RESEARCH In a number of works [5-9] for traffic classification there have been applied machine learning methods using the characteristics extracted from the properties of the network streams (primarily from the packet headers). However, these methods have a significant drawback, their accuracy decreases when using encapsulation or encryption. In early 2018, Cisco estimated that 60% of Internet traffic was encrypted. According to Gartner forecasts, by the end of 2019, 80% of the traffic will be the same. Encryption can also be used to hide interactions with malware command servers and solve other problems. According to the Ponemon Institute report for 2016, almost half (41%) attackers use encryption to bypass mechanisms of detecting their unauthorized activity. Security tools cannot inspect encrypted traffic (according to the Ponemon Institute, 64% of companies cannot detect malicious code in encrypted traffic) [10]. The existing security solutions are able to analyze the content of encrypted traffic, for example, by means of a man-in-the-middle attack, but due to the high cost of implementing the methods they are practically not applicable in real conditions [10]. However, there is at least one way to analyze the content of encrypted traffic without decrypting it – Cisco ETA (Encrypted Traffic Analytics), which allows, on the basis of network telemetry received from the network equipment and machine learning algorithms, classifying encrypted traffic, simultaneously separating the pure traffic in it from the malicious one. It is not the data field in the encrypted batch that is used for analysis, but rather its header, from which the extended telemetry is obtained [10]: 1. from Netflow – addresses and ports of the source and destination (SrcIP, DstIP, SrcPort, DstPort), information on the protocol, the number of transferred packets and bytes; 2. intra-stream – packet sizes & time parameters, byte distribution (occurrence in the stream) and their entropy (the higher it is, the higher the expectation to see the encrypted traffic); 3. from TLS metadata – extensions, cryptographic algorithm sets, SNI, certif ... requests from the same host. It should be noted that in the case of data encapsulation in other protocols, the proposed method may give incorrect results. The authors in [11] considered a method of detecting hidden encrypted partitions of a personal computer hard disk drive created by the TrueCrypt data encryption program. To classify files the authors used the NIST statistical test suite indicating encrypted data with an accuracy of more than 0 95. , In order to prevent information leakage, it is necessary to block the transmission of encrypted data, which determines the relevance of solving the problem of classifying encrypted, compressed and pseudo-random sequences (PRS). III. PROBLEM STATEMENT In general, the research problem is formulated as follows: it is necessary to map the original set of bit sequences X to the new set of classes Y by using the selected feature space. To be able to estimate the accuracy of the classification, the accuracy characteristic calculated by Formula 1 was used. FNFPTNTP TNTP Accuracy , (1) where TP is the number of objects correctly assigned to class i, TN is the number of objects correctly assigned to class j, FP is the number of false positives (type I error), FN is the number of false negatives (type II error). In a formalized form, the classification task can be defined by the following expression: YXF : , (2) where X is the initial set of bit sequences to be classified, Y is the set of classes. Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin No 2.CS (10) 2019 5 The accuracy value in the classification must satisfy the condition presented in expression (3): ,95,0)|)(( jiyxFp ji (3) where p is the probability, F(х) is the function of displaying the i-th file of the set X, y is the mark of the class from the set Y, i, j are the indices in the set of X and Y, respectively. To develop and evaluate the classification method of encrypted, compressed and PRS the following particular tasks were set: 1. To check the possibility of binary classification of sequences generated by encryption and compression algorithms. 2. To check the possibility of multi-class classification of sequences generated by encryption, compression, and PRS algorithms. IV. TESTING THE POSSIBILITY OF A BINARY CLASSIFICATION OF SENQUENCES GENERATED BY COMPRESSION AND ENCRYPTION ALGORITHMS 4.1 Binary classification for features generated on the basis of NIST test results In the course of the research it was assumed that the results of NIST tests might be a feature space for constructing a classifier that allows distinguishing pseudo-random sequences by a source type. To test the possibility of solving the classification problem, the NIST SP 800- 22rev1a statistical test suite was used to evaluate random and pseudo-random sequence generators in cryptography [17]. To conduct the experiment the initial sample of 1000 files, 600 KB each, containing the text in Russian, was converted by encryption and compression algorithms. As a result, the resulting sample of 4000 files was divided into 2 classes: 1. Encrypted by AES, 3DES algorithms. 2. Compressed by RAR, ZIP algorithms. Next, the files from the sample were processed with a package of statistical tests, as a result, 188 features were obtained. To assess the applicability of the selected feature space, cross-validation (the number of subsamples equal to 10 were chosen) of machine learning algorithms with default parameters was used [18-23]: Decision Tree Classifier (DTC), Random Forest Classifier (RFC). The results of the experiments are presented in Table 2. TABLE 1. THE RESULTS OF CROSS- VALADATION OF MACHINE LEARNING ALGORITHMS IN THE NIST TEST FEATURE SPACE ON A SAMPLE COMPRESSED AND ENCRYPTED SEQUENCES Algorithm Accuracy value Decision Tree 0,57508 Random Forest 0,642579 From the analysis of the results presented in Table 2, it is reasonable to conclude that the RFC has greater accuracy, but the resulting accuracy value of detecting sequence types equal to 0 642 does not allow constructing a classifier that meets the requirement presented in expression 3. Thus, it was concluded that the feature space based on the NIST test results would not allow solving the problem of binary classification taking into account the selected restrictions. , 4.2 Binary classification on the feature space generated on the basis of the results of analyzing N length subsequence frequency. Then it was assumed in the research that as a feature space for solving the problem of bit sequence binary classification the results of analyzing the frequency of independent bit subsequences of different length N (in bits) without taking into account the complete overlap of each subsequence can be used. For example, for the sequence S = "00011011" and N = 2 bits, the frequency occurrence of N bit length subsequences is presented in Table 3. TABLE 2. EXAMPLE OF COUNTING THE FREQUENCIES OF BIT SUBSEQUENCES Subsequence Amount Frequency 00 1 0,142857143 01 2 0,285714286 10 1 0,142857143 11 2 0,285714286 In the works of the authors it was shown that for binary sequences it was enough to analyze Journal of Science and Technology on Information Security 6 No 2.CS (10) 2019 half of all possible subsequences [24,25]. Using this assumption, the dimension of the feature space for N bit length subsequences has been halved, the number of possible feature on the assumption is presented in Table 4. TABLE 3: THE DIMENSION OF THE FEATURE SPACE FOR N BIT LENGTH Length of subsequence (N), bit Amount of features Amount of features, total Cross- validation time on assumption, min 4 8 16 6 5 16 32 11 6 32 64 15 7 64 128 22 8 128 256 37 9 256 512 69 In the course of the experiments to assess the possibility of constructing binary classifiers,the RFC showed the highest accuracy, the results are presented in Table 5. An average accuracy of more than 0 95 is achieved at N> = 8 bits. , The dependence of the average accuracy of classifying files on the basis of the RFC on the length of the subsequence and the time spent on the selection of features from the training sample is shown in Fig.2. A sufficient ratio of accuracy and time is determined at N = 9 bits length of the subsequence, in this case there is a significant increase in the accuracy of sequence classification at maintaining an acceptable amount of time spent on extracting features. TABLE 4. ACCURACY OF DISTINGUISHING FILE TYPES WHEN USING RFC File type Algorithm accuracy for length sequence N=4 N=5 N=6 N=7 N=8 N=9 N=10 N=11 Time taken to retrieve features in minutes 6 11 15 22 37 69 135 268 AES/7-Z 0,821 0,843 0,834 0,835 0,938 0,993 0,998 1,000 AES/RAR 0,986 0,992 0,994 0,992 0,991 0,997 0,993 0,993 AES/ZIP 0,986 0,992 0,988 0,990 0,993 0,998 0,999 0,999 3DES/7-Z 0,834 0,846 0,865 0,865 0,947 0,993 0,998 1,000 3DES/RAR 0,984 0,990 0,991 0,993 0,991 0,996 0,994 0,994 3DES/ZIP 0,991 0,988 0,991 0,989 0,991 0,997 0,998 0,999 7-Z/ZIP 0,968 0,974 0,974 0,973 0,971 0,972 0,976 0,978 7-Z/RAR 0,948 0,960 0,964 0,963 0,964 0,983 0,996 0,999 RAR/ZIP 0,840 0,864 0,864 0,863 0,868 0,977 0,999 1,000 Mean value значение 0,929 0,939 0,941 0,940 0,962 0,990 0,995 0,996 Fig.2. Dependence of the file type classification accuracy at using the RFC on the subsequence length Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin No 2.CS (10) 2019 7 The AES/3DES pair was removed from the tested sample. Fig.3. shows the dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence N. Due to the ability of ciphers to dissipate the statistics of the source data, the accuracy of the classification of sequences encrypted with the AES and 3DES algorithms is on average 0,512, which makes it impossible to construct a classifier for sequences of this type. In the course of the research, the possibility of applying the results of NIST statistical tests as a feature space for the classifying binary sequences generated by encryption and compression algorithms was verified. The accuracy of distinguishing the RFC sequences was 0,64, which does not satisfy the requirement specified in expression (3). To solve the problem of binary sequence binary classification, a new feature space was proposed, it was formed by counting the frequency of different length bit subsequences. The classification accuracy of the RFC at the length of the subsequence N = 9 bits is 0,99, it satisfies the requirement specified in expression (3). Fig.3. Dependence of the accuracy of distinguishing sequences encrypted with the AES and 3DES algorithms on the length of the subsequence REFERENCES [1]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/reports.4.html (дата обращения: 30.05.2019). [2]. INFOWATCH company group site. URL: https://www.infowatch.ru/sites/default/files/report/ analytics/russ/infowatch_otchet_032014_smb_fin.p df (дата обращения: 30.05.2019). [3]. INFOWATCH company group site. URL: https://www.infowatch.ru/analytics/leaks_monitori ng/15678 (дата обращения: 30.05.2019). [4]. X. Huang, Y. Lu, D. Li, M. Ma. A novel mechanism for fast detection of transformed data leakage // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 6, 2018. pp. 35926-35936 [5]. Y. Miao, Z. Ruan, L. Pan, Y. Wang, J. Zhang, Y. Xiang. Automated Big Traffic Analytics for Cyber Security // Eprint arXiv:1804.09023, bibcode: 2018arXiv180409023M. 2018. [6]. S. Miller, K. Curran, T. Lunney. Multilayer Perceptron Neural Network for Detection of Encrypted VPN Network Traffic // International Conference on Cyber Situational Awarness, Data Analytics and Assessment. 2018. ISBN: 978-1- 5386-4565-9. [7]. P. Wang, X. Chen, F. Ye, Z. Sun. A Survey of Techniques for Mobile Service Encrypted Traffic Classification Using Deep Learning // IEEE Access. Special section on challenges and opportunities of big data against cyber crime. Vol. 7, 2019. pp. 54024-54033 doi: 10.1109/ACCESS.2019.2912896 Journal of Science and Technology on Information Security 8 No 2.CS (10) 2019 [8]. K. Demertzis, N. Tziritas, P. Kikiras, S.L. Sanchez, L. Iliadis. The Next Generation Cognitive Security Operations Center: Adaptive Analytic Lambda Architecture for Efficient Defense against Adversarial Attacks // Big Data and Cognitive Computing, 2019 3(6). [9]. H. Zhang, C. Papadopoulos, D. Massey. Detecting encrypted botnet traffic // 16th IEEE Global Internet Symposium. 2013. p. 3453. [10]. T. Radivilova, L. Kirichenko, D. Ageyev, M. Tawalbeh, V. Bulakh Decrypting SSL/TLS Traffic for Hidden Threats Detection // IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT), 2018. ISBN: 978-1-5386-5903-8. [11]. M. Piccinelli, P. Gubian. Detecting hidden encrypted volume files via statistical analysis // International Journal of Cyber-Security and Digital Forensics. Vol. 3(1). 2013 pp. 30-37. [12]. NIST STS manual. URL: https://csrc.nist.gov/Projects/Random-Bit- Generation/ (дата обращения: 14.01.2019). [13]. Toolkit for the transport layer security and secure sockets layer protocols. URL: (дата обращения: 14.01.2019) [14]. Archive manager WinRAR. URL: (дата обращения: 14.01.2019). [15]. Pedregosa F., et al. Scikit-learn: Machine Learning in Python // Journal of Machine Learning Research 12. 2011. pp. 2825-2830. [16]. Breiman L., Friedman J., Olshen R., Stone C. Classification and Regression Trees // Wadsworth, Belmont, CA. 1984. 368 p. ISBN: 9781351460491. [17]. Hastie T., Tibshirani R., Friedman J. Elements of Statistical Learning // Springer. 2009. pp. 587-601. ISBN: 978-0387848570. [18]. L. Breiman, A. Cutler. Random Forests // URL:https://www.stat.berkeley.edu/~breiman/Ran domForests/cc_home.htm (дата обращения: 14.01.2019). [19]. S. Raska. Python and machine learning // M .: DMK-Press. 2017. 418 p. ISBN: 978-5-97060- 409-0. [20]. L. Breiman. Random Forests // Journal Machine Learning 45(1). 2001. pp. 5-32. [21]. M.Yu. Konyshev. Formation of probability distributions of binary vectors of the source of errors of a Markov discrete communication channel with memory using the method of "grouping probabilities" of error vectors. / M.Yu. Konyshev, A.Yu. Barabashov, K.E. Petrov, A.A. Dvilyansky // Industrial ACS and controllers. 2018. № 3. P. 42-52. [22]. M.Yu. Konyshev. A compression algorithm for a series of distributions of binary multidimensional random variables. / M.Yu. Konyshev, A.A. Dvilyansky, K.E. Petrov, G.A. Ermishin // Industrial ACS and controllers. 2016. No. 8. P. 47-50. ABOUT THE AUTHORS D.S. Alexander Kozachok Workplace: The Academy of Federal Guard Service of the Russian Federation. Email: alex.totrin@gmail.com The education process: received PhD. degree in Engineering Sciences in Academy of Federal Guard Service of the Russian Federation in Dec. 2012. Research today: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer. Spirin Andrey Andreevich Workplace: The Academy of Federal Guard Service of the Russian Federation. Email: spirin_aa@bk.ru The education process: graduated from the Academy of the Federal Guard Service of the Russian Federation in 2010. Research today: Information security, information leakage prevention systems, statistical testing.
File đính kèm:
- classification_of_sequences_generated_by_compression_and_enc.pdf