Representation Model of Requests to Web Resources, Based on a Vector Space Model and Attributes of Requests for HTTP Protocol
Trong những năm gần đây, số
lượng sự cố liên quan đến các ứng dụng Web có
xu hướng tăng lên do sự gia tăng số lượng người
dùng thiết bị di động, sự phát triển của Internet
cũng như sự mở rộng của nhiều dịch vụ của nó.
Do đó càng làm tăng khả năng bị tấn công vào
thiết bị di động của người dùng cũng như hệ
thống máy tính. Mã độc thường được sử dụng để
thu thập thông tin về người dùng, dữ liệu cá
nhân nhạy cảm, truy cập vào tài nguyên Web
hoặc phá hoại các tài nguyên này. Mục đích của
nghiên cứu nhằm tăng cường độ chính xác phát
hiện các cuộc tấn công máy tính vào các ứng
dụng Web. Bài báo trình bày một mô hình biểu
diễn các yêu cầu Web, dựa trên mô hình không
gian vectơ và các thuộc tính của các yêu cầu đó
sử dụng giao thức HTTP. So sánh với các nghiên
cứu được thực hiện trước đây cho phép chúng
tôi ước tính độ chính xác phát hiện xấp xỉ 96%
cho các ứng dụng Web khi sử dụng bộ dữ liệu
KDD 99 trong đào tạo cũng như phát hiện tấn
công đi kèm với việc biểu diễn truy vấn dựa trên
không gian vectơ và phân loại dựa trên mô hình
cây quyết định.
Trang 1
Trang 2
Trang 3
Trang 4
Trang 5
Trang 6
Trang 7
Tóm tắt nội dung tài liệu: Representation Model of Requests to Web Resources, Based on a Vector Space Model and Attributes of Requests for HTTP Protocol
Journal of Science and Technology on Information Security 44 No 2.CS (10) 2019 Manh Thang Nguyen, Alexander Kozachok Abstract— Recently, the number of incidents related to Web applications, due to the increase in the number of users of mobile devices, the development of the Internet of things, the expansion of many services and, as a consequence, the expansion of possible computer attacks. Malicious programs can be used to collect information about users, personal data and gaining access to Web resources or blocking them. The purpose of the study is to enhance the detection accuracy of computer attacks on Web applications. In the work, a model for presenting requests to Web resources, based on a vector space model and attributes of requests via the HTTP protocol is proposed. Previously carried out research allowed us to obtain an estimate of the detection accuracy as well as 96% for Web applications for the dataset KDD 99, vector- based query representation and a classifier based on model decision trees Tóm tắt – Trong những năm gần đây, số lượng sự cố liên quan đến các ứng dụng Web có xu hướng tăng lên do sự gia tăng số lượng người dùng thiết bị di động, sự phát triển của Internet cũng như sự mở rộng của nhiều dịch vụ của nó. Do đó càng làm tăng khả năng bị tấn công vào thiết bị di động của người dùng cũng như hệ thống máy tính. Mã độc thường được sử dụng để thu thập thông tin về người dùng, dữ liệu cá nhân nhạy cảm, truy cập vào tài nguyên Web hoặc phá hoại các tài nguyên này. Mục đích của nghiên cứu nhằm tăng cường độ chính xác phát hiện các cuộc tấn công máy tính vào các ứng dụng Web. Bài báo trình bày một mô hình biểu diễn các yêu cầu Web, dựa trên mô hình không gian vectơ và các thuộc tính của các yêu cầu đó sử dụng giao thức HTTP. So sánh với các nghiên cứu được thực hiện trước đây cho phép chúng tôi ước tính độ chính xác phát hiện xấp xỉ 96% cho các ứng dụng Web khi sử dụng bộ dữ liệu KDD 99 trong đào tạo cũng như phát hiện tấn công đi kèm với việc biểu diễn truy vấn dựa trên This manuscript is received June 14, 2019. It is commented on June 17, 2019 and is accepted on June 24, 2019 by the first reviewer. It is commented on June 16, 2019 and is accepted on June 25, 2019 by the second reviewer. không gian vectơ và phân loại dựa trên mô hình cây quyết định. Keywords— Computer attacks; Web resources, classification; machine learning; attributes; HTTP protocol. Từ khóa— Tấn công mạng; tài nguyên web, học máy, thuộc tính, giao thức HTTP. I. INTRODUCTION Recently, the number of information security incidents has increased worldwide, related to the security of Web applications, due to the increase in the number of users of mobile devices, the development of the Internet of things, the expansion of many services and, as a result, the expansion of possible computer attacks. The web resources of state structures and departments are also subject to attacks. One of the reasons for the growth of these attacks is also an increase in the number of malicious programs. Malicious programs can be used to collect information about users, personal data and gaining access to Web resources or blocking them. Impact on the rate of spread of various malware and viruses is caused by such factors as: • widespread social networking; • increased resilience and stealth botnets; • cloud service distribution. According to the analyses [1], attacks on Web applications account for more than half of all Internet traffic for information security. The purpose of the study is to improve the accuracy of detecting computer attacks on Web applications. The main result is the presented model for submitting requests to Web resources, based on the vector space model and attributes of requests via the HTTP protocol. Representation Model of Requests to Web Resources, Based on a Vector Space Model and Attributes of Requests for HTTP Protocol Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin No 2.CS (10) 2019 45 II. WAYS TO DETECT COMPUTER ATTACKS ON WEB APPLICATIONS Many attack detection systems use 3 basic approaches: methods based on signature [2;3], anomaly detection methods [4–8] and machine learning methods. A. Signature methods The signature analysis based on the assumption that the attack scenario is known and an attempt to implement it can be detected in the event logs or by analyzing for network traffic with high reliability. There is a certain signature of attacks in the database of signatures. Intrusion detection systems (IDS) that use signature analysis methods are designed to solve the indicated problem, as in most cases they allow not only detecting but also preventing the implementation of known attacks at the initial stage of its implementation. The disadvantage of this approach is the impossibility of detecting unknown attacks, the signatures of which are missing in the database of signatures. B. Anomaly Detection Methods Anomaly detection method is a way to detect a typical behavior of subjects in the world. At the same time in the system of detection of computer attacks models of ¬ the behavior of the subjects (behavior profiles) should be determined. For this purpose, test or training data sets are used to simulate traffic, which is considered legitimate in the network. For the operation of an attack detection system based on the detection of anomalies, it is necessary to develop a criterion for distinguishing the normal behavior of subjects from the anomalous. If the behavior deviates from normal one by an amount greater than a certain threshold value, then the system notifies of this deviation. Training datasets are also used to simulate malicious traffic so that the system can recognize patterns of unknown threats and att ... ng lĩnh vực An toàn thông tin No 2.CS (10) 2019 47 in space. The dimension of space corresponds to the number of classifying signs, their value determining the position of elements (points) in space. The support vector machine method refers to linear classification methods. Two sets of points belonging to two different classes are separated by a hyperplane in space. At the same time, the hyperplane is constructed in such a way that the distances from it to the nearest instances of both classes (support vectors) were maximum, which ensures the strict accuracy of classification. The support vector machine method allows [22; 23]: • obtaining a classification function with a minimum upper estimate of the expected risk (level of classification error); • using a linear classifier to work with nonlinearly shared data. III. MODEL FOR PRESENTING REQUESTS TO WEB RESOURCES, BASED ON THE VECTOR SPACE MODEL AND ATTRIBUTES OF REQUESTS VIA HTTP The anomaly detection approach is based on the analysis of HTTP requests processed by most common Web servers (for example, Apache or nginx) and is intended to be built in Web Application Firewall (WAF). WAF analyzes all requests coming to the Web server and makes decisions about their execution on the server (Fig.1). Fig.1. WAF in Web Application Security System A. Formation of feature space for our model To set the model for presenting requests to Web resources, the author has carried out the formation of a corresponding feature space, that has allowed to evaluate its adequacy from the standpoint of solving the problem of detecting computer attacks on Web applications. In fig.2 the main stages of analyzing an HTTP request received at the Web server input are demonstrated. We divided the dataset into two parts: requests with information about attacks and normal requests. In the learning process, we will calculate all the necessary values such as the expected value and the variance of normal queries, then these values are stored in the database MySQL for the attack detection process. The analysis is performed on the appropriate fields of the protocol to ensure further possibility of its representation in the vector space model. It also analyzes and calculates a number of attributes selected by the author. Thus, the proposed query representation model allows moving from the text representation to the totality of features of the vector space model for the corresponding protocol fields and query attributes. The basic steps to form a model for each query are the following: • Extracting and analyzing data: analysis of all the incoming requests from the Web browser is carried out. • Transformation into a vector space model: it is used to transform text data into a vector representation using the TF-IDF algorithm [24], which allows estimating the weight of features for the entire text data array. Calculation of attribute values: the values of 8 attributes proposed by the author are calculated. 1. Extracting and analyzing data At the entrance of the Web server requests via HTTP are received. An example of the contents of a GET request is shown in Fig.3. Journal of Science and Technology on Information Security 48 No 2.CS (10) 2019 Fig. 2. Example of the content fields of HTTP request (GET method) 2. Conversion to a Vector Space Model To convert strings into a vector form, allowing further application of machine learning methods, an approach based on the TF-IDF method was chosen [24]. TF-IDF is a statistical measure used to assess the importance of words in the context of a document that is part of a document collection or corpus. The weight of a word is proportional to the number of uses of the word in the document and inversely proportional to the frequency of the word use in other documents of the collection. Application of the TF-IDF approach to the problem being solved is carried out for each request. For each word 𝑡 in the query 𝑑 in the total of queries 𝐷 the value tfidf is calculated according to the following expression: ( , ) ( , ) ( )tfidf t d tf t d idf t (2) The values of tf, idf are calculated in accordance with expressions (3), (4) respectively, where 𝑣 is the rest of the words in the query 𝑑. ( , ) ( , ) ( , ) d count t d tf t d count v d (3) | | ( ) log | : | D idf t d D t d (4) Thus, after converting the query 𝑑 ∈ 𝐷 into the vector representation | 𝑑 | it will be set using the set of weights {𝑤𝑡∈𝑇} for each value t from the dictionary T. 3. Calculation of attribute values In [25], 5 basic attributes were proposed for building a detection system computer attacks on web applications: The length of the request fields sent from the browser (A1). The distribution of characters in the request (A2). Structural inference (A3). Token finder (A4). Attribute order (A5). The author proposed to introduce 3 additional attributes to improve the accuracy of attack detection. The length of the request sent from the browser (A6) From the analysis of legitimate requests via the HTTP protocol, it was found out that their length varies slightly. However, in the event of an attack, the length of the data field may change significantly (for example, in the case of SQL injection or cross-site scripting). Therefore, to estimate the limiting thresholds for changing the length of requests, two of the parameters are evaluated: the expected value and variance 2 for the training set of legitimate data. Using Chebyshev's inequality, we can estimate the probability that a random variable will take a value far from its mean (expression (5)). 2 (| | )P x , (5) where 𝑥 is a random variable, 𝜏 is the threshold value of its change. Accordingly, for any probability distribution with mean and variance 2 , it is necessary to choose a value such that a deviation x from the Fig.3 - Analysis of incoming requests for Web applications within the framework of the proposed model Nghiên cứu Khoa học và Công nghệ trong lĩnh vực An toàn thông tin No 2.CS (10) 2019 49 mean 𝜇, when the threshold is exceeded, results in blocking the query with the lowest level of errors of the first and second kind. The attribute value is equal to the probability value from expression (5): 6 (| | )A P x . (6) Appearance of new characters (A7) From the training sample of legitimate requests, we have to select some non-repeating characters (including various encodings) in order to compose the set of symbols of the alphabet 𝐴. Thus, when the symbol b A appears in the query, the value of the counter for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the power of the alphabet set: 7 | | bpA A (7) The emergence of new keywords (A8) From the training sample of legitimate queries, we have to select some non-repeating terms (words) - 𝑡 in order to compose a set of terms of the dictionary. Thus, when the word T appears in the query, the counter value p for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the value of the counter to the power of the set of terms of the dictionary: 8 | | p A T (8) IV. CONCLUSION For testing the operation of machine learning methods, a data set from several data sources of system protection tools will be used, such as log files of the intrusion detection and prevention system, HTTP requests (GET, POST method) of the web application firewall, etc. Fig. 4. An example of the complete dangerous HTTP request with the POST method When analyzing a full HTTP request, the author focuses on the data in a red frame (Fig. 3). After the extraction process, the data will be saved in the appropriate files (good_request.txt and bad_request.txt). The structure of these files is shown in Fig. 4. Fig.5. File of dangerous HTTP request A preliminary study allowed us to obtain an estimate of the accuracy of detecting attacks on Web applications of 96% for the data set [15] using the entered query attributes, query vector representation models and classifier based on decision trees. This fact allows us to conclude that it is possible to build an algorithm for detecting computer attacks on Web applications based on the proposed model for presenting requests to Web resources based on the vector space model and differing in the attribute attributes of requests via HTTP. REFERENCES [1] ]. Kaspersky Lab. Security report. - 2019. - (дата обращения: 15.04.2019). http:/ / www. securelist. com / en / analysis / 204792244 / The - geography - of - cybercrime - Western - Europe- and-North-America. [2]. A survey of intrusion detection techniques in cloud / C. Modi [et al.] // Journal of Network and Computer Applications. - Vol. 36, no. 1. - P. 42-57, 2013. Journal of Science and Technology on Information Security 50 No 2.CS (10) 2019 [3]. Khamphakdee N., Benjamas N., Saiyod S. Improving intrusion detection system based on snort rules for network probe attack detection // Information and Communication Technology (IColCT), 2014 2nd International Conference On. - IEEE. - P. 69-74. 2014. [4]. A stateful intrusion detection system for world-wide web servers / G. Vigna [et al.] // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE.. - P. 34-43., 2003 [5]. Sekar R. An Efficient Black-box Technique for Defeating Web Application Attacks. // NDSS. - 2009. [6]. Mutz D., Vigna G., Kemmerer R. An experience developing an IDS stimulator for the blackbox testing of network intrusion detection systems // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE- P. 374-383, . 2003.. [7]. Li X., Xue Y. BLOCK: a black-box approach for detection of state violation attacks towards web applications // Proceedings of the 27th Annual Computer Security Applications Conference. - ACM - P. 247-256, 2011. [8]. Saxena P., Sekar R., Puranik V. Efficient fine-grained binary instrumentationwith applications to taint- tracking // Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. - ACM..- P. 74-83, 2008. [9]. Браницкий А. А., Котенко И. В. Анализ и классификация методов обнаружения сетевых атак // Труды СПИИРАН. - Т. 2, № 45. - С. 207—244, 2016. [10]. Heckerman D. A tutorial on learning with Bayesian networks // Innovations in Bayesian networks. - Springer. - P. 33-82, 2008. [11]. Friedman N., Geiger D., Goldszmidt M. Bayesian network classifiers // Machine learning. - - Vol. 29, no. 2-3. - P. 131-163, 1997. [12]. Goldszmidt M. Bayesian network classifiers // Wiley Encyclopedia of Operations Research and Management Science. - 2010. [13]. Barbara D., Wu N., Jajodia S. Detecting novel network intrusions using bayes estimators // Proceedings of the 2001 SIAM International Conference on Data Mining. - SIAM. - P. 1-17, . 2001 . [14]. Нейросетевая технология обнаружения сетевых атак на информационные ресурсы / Ю. Г. Емельянова [и др.] // Программные системы: теория и приложения. - Т. 2, № 3. - С. 3-15., 2011. [15]. A Detailed Analysis of the KDD CUP 99 Data Set / M. Tavallaee [и др.] // Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications. - Ottawa, Ontario, Canada: IEEE Press. - С. 53—58. - (CISDA’09). - URL: 1736481.17 36489, 2009. [16]. Васильев В.И., Шарабыров И.В. Интеллектуальная система обнаружения атак в ло¬кальных беспроводных сетях // Вестник Уфимского государственного авиационного тех¬нического университета. - 2015. - Т. 19, 4 (70). [17]. Su M.-Y. Real-time anomaly detection systems for Denial-of-Service attacks by weighted k- nearest- neighbor classifiers // Expert Systems with Applications. - Vol. 38, no. 4. - P. 3492-3498. - 2011. [18]. Lee C. H., Chung J. W., Shin S. W. Network intrusion detection through genetic feature selection // Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2006. SNPD 2006. Seventh ACIS International Conference on. - IEEE - P. 109-114, 2006. [19]. Intrusion detection with genetic algorithms and fuzzy logic / E. Ireland [et al.] // UMM CSci senior seminar conference..- Pp. 1-6, 2013. [20]. Kruegel C., Toth T. Using decision trees to improve signature-based intrusion detection // Recent Advances in Intrusion Detection. - Springer - P. 173-191, 2003. [21]. Bouzida Y., Cuppens F. Neural networks vs. ABOUT THE AUTHORS Manh Thang Nguyen Workplace: Information Technology Faculty – Academy of cryptography techniques. Email: chieumatxcova@gmail.com Training process: 2005-2007: Student at the Military Technical Academy. 2007-2013: Student at the Applied Mathematics and Informatics Faculty - Lipetsk State Pedagogical University – Russia Federation. 2017-present: Post-graduate student at the Military Academy of the Federal Guard Service Russian Federation. Research today: Computer network, network security, machine learning and data mining. D.S. Alexander Kozachok Workplace: The Academy of Federal Guard Service of the Russian Federation. Email: alex.totrin@gmail.com The education process: has received PhD. degree in Engineering Sciences in Academy of Federal Guard Service of the Russian Federation in Dec. 2012. Research today: Information security; Unauthorized access protection; Mathematical cryptography; theoretical problems of computer.
File đính kèm:
- representation_model_of_requests_to_web_resources_based_on_a.pdf