Representation Model of Requests to Web Resources, Based on a Vector Space Model and Attributes of Requests for HTTP Protocol

In recent years, the number of incidents involving Web applications has tended to increase, driven by the growing number of mobile device users, the development of the Internet, and the expansion of its many services. This in turn raises the likelihood of attacks on users' mobile devices and on computer systems. Malicious code is commonly used to collect information about users and sensitive personal data, to gain access to Web resources, or to sabotage those resources. The purpose of this study is to improve the accuracy of detecting computer attacks on Web applications. The paper presents a model for representing Web requests, based on the vector space model and on the attributes of requests sent over the HTTP protocol. Comparison with previously conducted research allows us to estimate a detection accuracy of approximately 96% for Web applications when the KDD 99 dataset is used for training and attack detection, together with a vector space representation of queries and a decision-tree classifier.

 

Journal of Science and Technology on Information Security, No 2.CS (10) 2019
Manh Thang Nguyen, Alexander Kozachok 
Abstract: Recently, the number of incidents related to Web applications has grown, driven by the increase in the number of mobile device users, the development of the Internet of Things, and the expansion of many services, which in turn widens the range of possible computer attacks. Malicious programs can be used to collect information about users and personal data, and to gain access to Web resources or block them. The purpose of the study is to improve the accuracy of detecting computer attacks on Web applications. The paper proposes a model for representing requests to Web resources, based on a vector space model and on attributes of requests sent via the HTTP protocol. Previously conducted research gave an estimated detection accuracy of about 96% for Web applications on the KDD 99 dataset, using a vector-based query representation and a classifier based on decision trees.
This manuscript was received on June 14, 2019. It was reviewed on June 17, 2019 and accepted on June 24, 2019 by the first reviewer, and reviewed on June 16, 2019 and accepted on June 25, 2019 by the second reviewer.
Keywords: computer attacks; Web resources; classification; machine learning; attributes; HTTP protocol.
I. INTRODUCTION 
Recently, the number of information security incidents related to Web applications has increased worldwide. The causes include the growing number of mobile device users, the development of the Internet of Things, and the expansion of many services, all of which widen the range of possible computer attacks.
The Web resources of state structures and departments are also subject to attacks. One of the reasons for the growth of these attacks is the increasing number of malicious programs. Malicious programs can be used to collect information about users and personal data, and to gain access to Web resources or block them.
The rate at which malware and viruses spread is influenced by factors such as:
• widespread social networking;
• increased resilience and stealth of botnets;
• the spread of cloud services.
According to the analysis in [1], attacks on Web applications account for more than half of all Internet traffic related to information security. The purpose of the study is to improve the accuracy of detecting computer attacks on Web applications. The main result is a model for representing requests to Web resources, based on the vector space model and on attributes of requests sent via the HTTP protocol.
II. WAYS TO DETECT COMPUTER 
ATTACKS ON WEB APPLICATIONS 
Most attack detection systems rely on three basic approaches: signature-based methods [2; 3], anomaly detection methods [4–8], and machine learning methods.
A. Signature methods 
Signature analysis is based on the assumption that the attack scenario is known, so that an attempt to carry it out can be detected with high reliability in event logs or by analyzing network traffic. Each known attack has a corresponding signature in the signature database.
Intrusion detection systems (IDS) that use signature analysis are designed to solve this problem: in most cases they can not only detect known attacks but also prevent them at the initial stage of their execution. The disadvantage of this approach is that it cannot detect unknown attacks whose signatures are missing from the signature database.
 B. Anomaly Detection Methods 
Anomaly detection is a way to detect atypical behavior of subjects. For this, the attack detection system must define models of the subjects' behavior (behavior profiles). Test or training data sets are used to simulate traffic that is considered legitimate in the network. An attack detection system based on anomaly detection also needs a criterion for distinguishing normal behavior of subjects from anomalous behavior: if the behavior deviates from the normal one by more than a certain threshold value, the system reports the deviation. Training data sets are also used to simulate malicious traffic so that the system can recognize patterns of unknown threats and att ... in space. The dimension of the space corresponds to the number of classification features, and their values determine the positions of the elements (points) in the space.
The support vector machine method belongs to the linear classification methods. Two sets of points belonging to two different classes are separated by a hyperplane in the space. The hyperplane is constructed so that the distances from it to the nearest instances of both classes (the support vectors) are maximal, which makes the separation of the classes as reliable as possible.
The support vector machine method allows [22; 23]:
• obtaining a classification function with a minimum upper estimate of the expected risk (the level of classification error);
• using a linear classifier to work with data that are not linearly separable.
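For reference, the maximum-margin hyperplane described above is usually written as the optimization problem below. This is the standard textbook hard-margin formulation, not a formula taken from this paper; the symbols $w$, $b$, $x_i$, $y_i$ and $n$ are introduced here only for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Here $x_i$ are the feature vectors of the training requests, $y_i \in \{-1, +1\}$ are their class labels, and the hyperplane is $w \cdot x + b = 0$. Minimizing $\lVert w \rVert$ is equivalent to maximizing the margin $2 / \lVert w \rVert$ between the two classes.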
III. MODEL FOR PRESENTING 
REQUESTS TO WEB RESOURCES, BASED 
ON THE VECTOR SPACE MODEL AND 
ATTRIBUTES OF REQUESTS VIA HTTP 
The proposed anomaly detection approach is based on the analysis of HTTP requests processed by the most common Web servers (for example, Apache or nginx) and is intended to be built into a Web Application Firewall (WAF). The WAF analyzes all requests coming to the Web server and decides whether they may be executed on the server (Fig. 1).
Fig.1. WAF in Web Application Security System 
A. Formation of feature space for our model 
To define the model for representing requests to Web resources, the author formed a corresponding feature space, which made it possible to evaluate the adequacy of the model for the problem of detecting computer attacks on Web applications.
Fig. 3 shows the main stages of analyzing an HTTP request received at the Web server input. We divided the dataset into two parts: requests containing attacks and normal requests. During training, all the necessary values, such as the expected value and the variance for normal queries, are calculated and stored in a MySQL database for use in the attack detection process. The analysis is performed on the appropriate fields of the protocol so that they can later be represented in the vector space model. A number of attributes selected by the author are also analyzed and calculated. Thus, the proposed query representation model makes it possible to move from the text representation to a set of features consisting of the vector space model for the corresponding protocol fields and the query attributes.
The basic steps to form the model for each query are the following:
• Extracting and analyzing data: all incoming requests from the Web browser are analyzed.
• Transformation into a vector space model: text data are transformed into a vector representation using the TF-IDF algorithm [24], which makes it possible to estimate the weight of features over the entire text data array.
• Calculation of attribute values: the values of 8 attributes are calculated (5 basic attributes from [25] and 3 proposed by the author).
1. Extracting and analyzing data 
HTTP requests arrive at the input of the Web server. An example of the contents of a GET request is shown in Fig. 2.
Fig. 2. Example of the content fields of 
HTTP request (GET method) 
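The extraction step can be sketched in Python as follows. This is a minimal illustration that assumes the request arrives as raw text with CRLF line endings; the function name `parse_http_request`, the returned field names and the sample request are hypothetical and are not taken from the paper.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_http_request(raw: str) -> dict:
    """Split a raw HTTP request into the fields used for later analysis."""
    # Headers and body are separated by an empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: METHOD SP request-target SP HTTP-version
    method, target, version = lines[0].split(" ", 2)
    url = urlsplit(target)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "method": method,
        "path": url.path,
        "params": parse_qsl(url.query, keep_blank_values=True),
        "version": version,
        "headers": headers,
        "body": body,
    }


# Hypothetical request, used only for illustration.
sample = (
    "GET /search.php?q=books&page=2 HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: Mozilla/5.0\r\n"
    "\r\n"
)
fields = parse_http_request(sample)
```

Each extracted field (path, parameter names and values, header values) can then be passed on separately to the vector space conversion and to the attribute calculations.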
 2. Conversion to a Vector Space Model 
To convert strings into a vector form that allows machine learning methods to be applied, an approach based on the TF-IDF method was chosen [24].
TF-IDF is a statistical measure used to assess the importance of a word in the context of a document that is part of a document collection or corpus. The weight of a word is proportional to the number of times it is used in the document and inversely proportional to the frequency of its use in the other documents of the collection. In the problem being solved, TF-IDF is applied to each request, which plays the role of a document.
For each word $t$ in the query $d$ from the set of queries $D$, the value tfidf is calculated according to the following expression:
$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \tag{2}$$
The values of tf and idf are calculated in accordance with expressions (3) and (4) respectively, where $v$ ranges over the words in the query $d$.
$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{v \in d} \mathrm{count}(v, d)} \tag{3}$$
$$\mathrm{idf}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|} \tag{4}$$
Thus, after the query $d \in D$ is converted into its vector representation $|d|$, it is defined by the set of weights $\{w_t\}_{t \in T}$, one for each term $t$ from the dictionary $T$.
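Expressions (2)-(4) can be sketched in plain Python as follows. This is a minimal illustration, assuming that a query is tokenized by splitting on non-alphanumeric characters and that the dictionary is built from the same corpus; the tokenizer, the function names and the sample queries are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(query):
    # Assumed tokenizer: lower-case alphanumeric runs.
    return re.findall(r"[a-z0-9_]+", query.lower())


def tf(term, tokens):
    # Expression (3): count of the term over the count of all words in the query.
    counts = Counter(tokens)
    return counts[term] / sum(counts.values())


def idf(term, corpus):
    # Expression (4): log of |D| over the number of queries containing the term.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)


def tfidf_vector(tokens, corpus, dictionary):
    # Expression (2), evaluated for every term t of the dictionary T.
    present = set(tokens)
    return [
        tf(t, tokens) * idf(t, corpus) if t in present else 0.0
        for t in dictionary
    ]


# Hypothetical queries, used only for illustration.
queries = [
    "GET /index.php?id=1",
    "GET /index.php?id=2",
    "GET /search.php?q=union select",
]
corpus = [tokenize(q) for q in queries]
dictionary = sorted({t for tokens in corpus for t in tokens})
vector = tfidf_vector(corpus[2], corpus, dictionary)
```

In this example the term `get` occurs in every query, so its idf and therefore its weight are zero, while rare terms such as `union` receive a positive weight.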
3. Calculation of attribute values 
In [25], 5 basic attributes were proposed for building a system for detecting computer attacks on Web applications:
• The length of the request fields sent from the browser (A1).
• The distribution of characters in the request (A2).
• Structural inference (A3).
• Token finder (A4).
• Attribute order (A5).
The author proposes 3 additional attributes to improve the accuracy of attack detection.
The length of the request sent from the 
browser (A6) 
From the analysis of legitimate requests via 
the HTTP protocol, it was found out that their 
length varies slightly. However, in the event of an 
attack, the length of the data field may change 
significantly (for example, in the case of SQL 
injection or cross-site scripting). 
Therefore, to estimate the limiting thresholds for changes in request length, two parameters are evaluated on the training set of legitimate data: the expected value $\mu$ and the variance $\sigma^2$. Using Chebyshev's inequality, we can bound the probability that a random variable takes a value far from its mean (expression (5)):
$$P(|x - \mu| \ge \tau) \le \frac{\sigma^2}{\tau^2}, \tag{5}$$
where $x$ is a random variable and $\tau$ is the threshold value of its change.
Accordingly, for any probability distribution with mean $\mu$ and variance $\sigma^2$, it is necessary to choose a value $\tau$ such that blocking a query when the deviation of $x$ from the mean $\mu$ exceeds this threshold gives the lowest level of errors of the first and second kind.
Fig. 3. Analysis of incoming requests for Web applications within the framework of the proposed model
The attribute value is equal to the probability value from expression (5):
$$A_6 = P(|x - \mu| \ge \tau). \tag{6}$$
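A minimal Python sketch of attribute A6 follows. It assumes that the request length is measured in characters and that the threshold τ is taken to be the deviation of the observed length from the mean; the paper does not fix τ, so this choice is an assumption of the sketch. The function returns the Chebyshev upper bound from expression (5), capped at 1, as an estimate of the probability. The function names are hypothetical, and the statistics are kept in memory rather than in a database.

```python
from statistics import fmean, pvariance


def fit_length_model(legit_requests):
    """Estimate the mean and variance of request length on legitimate data."""
    lengths = [len(r) for r in legit_requests]
    return fmean(lengths), pvariance(lengths)


def attribute_a6(request, mu, sigma2):
    """Chebyshev bound sigma^2 / tau^2 with tau = |len(request) - mu| (assumed)."""
    tau = abs(len(request) - mu)
    if tau == 0:
        return 1.0  # no deviation from the mean: the bound is trivial
    return min(1.0, sigma2 / tau ** 2)


# Hypothetical training lengths 20, 22, 18, 20: mean 20, variance 2.
legit = ["a" * 20, "a" * 22, "a" * 18, "a" * 20]
mu, sigma2 = fit_length_model(legit)

normal_score = attribute_a6("a" * 21, mu, sigma2)      # close to the mean
suspicious_score = attribute_a6("a" * 120, mu, sigma2)  # far from the mean
```

A value close to 1 means the length is consistent with the legitimate traffic, while a value close to 0 indicates a length that is very unlikely for a legitimate request.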
Appearance of new characters (A7)
From the training sample of legitimate requests, the distinct characters (including various encodings) are selected to compose the set of symbols of the alphabet $A$. When a symbol $b \notin A$ appears in the query, the value of the counter $p_b$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the alphabet set:
$$A_7 = \frac{p_b}{|A|} \tag{7}$$
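Expression (7) can be sketched in Python as follows. The sketch assumes that the counter is increased for every occurrence of an unseen character, which is one reading of the description above; the function names and sample requests are hypothetical.

```python
def build_alphabet(legit_requests):
    """Collect the distinct characters seen in legitimate requests (set A)."""
    return {ch for request in legit_requests for ch in request}


def attribute_a7(request, alphabet):
    """Ratio of the unseen-character counter p_b to |A|, as in expression (7)."""
    if not alphabet:
        raise ValueError("alphabet must not be empty")
    p_b = sum(1 for ch in request if ch not in alphabet)
    return p_b / len(alphabet)


# Hypothetical training data: the alphabet is {'i', 'd', '=', '1', '2'}.
alphabet = build_alphabet(["id=1", "id=22"])

known_only = attribute_a7("id=12", alphabet)    # no new characters
with_new = attribute_a7("id=1'<", alphabet)     # two new characters: ' and <
```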
The emergence of new keywords (A8)
From the training sample of legitimate queries, the distinct terms (words) $t$ are selected to compose the set of terms of the dictionary $T$. When a word $t \notin T$ appears in the query, the counter value $p$ for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the set of terms of the dictionary:
$$A_8 = \frac{p}{|T|} \tag{8}$$
IV. CONCLUSION 
To test the machine learning methods, a data set assembled from several system protection tools will be used, such as log files of the intrusion detection and prevention system and HTTP requests (GET and POST methods) from the Web application firewall.
Fig. 4. An example of a complete dangerous HTTP request with the POST method
When analyzing a full HTTP request, the author focuses on the data in the red frame (Fig. 4). After extraction, the data are saved in the appropriate files (good_request.txt and bad_request.txt). The structure of these files is shown in Fig. 5.
Fig. 5. File of dangerous HTTP requests
A preliminary study gave an estimated accuracy of 96% in detecting attacks on Web applications for the data set [15], using the introduced query attributes, the vector representation of queries and a classifier based on decision trees. This allows us to conclude that an algorithm for detecting computer attacks on Web applications can be built on the proposed model for representing requests to Web resources, which is based on the vector space model and is distinguished by its use of attributes of HTTP requests.
REFERENCES 
[1]. Kaspersky Lab. Security report. - 2019. - (accessed: 15.04.2019). http://www.securelist.com/en/analysis/204792244/The-geography-of-cybercrime-Western-Europe-and-North-America.
[2]. A survey of intrusion detection techniques in cloud / C. Modi [et al.] // Journal of Network and Computer Applications. - Vol. 36, no. 1. - P. 42-57, 2013.
[3]. Khamphakdee N., Benjamas N., Saiyod S. Improving intrusion detection system based on snort rules for network probe attack detection // Information and Communication Technology (ICoICT), 2014 2nd International Conference on. - IEEE. - P. 69-74, 2014.
[4]. A stateful intrusion detection system for world-wide web servers / G. Vigna [et al.] // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 34-43, 2003.
[5]. Sekar R. An Efficient Black-box Technique for Defeating Web Application Attacks // NDSS. - 2009.
[6]. Mutz D., Vigna G., Kemmerer R. An experience developing an IDS stimulator for the blackbox testing of network intrusion detection systems // Computer Security Applications Conference, 2003. Proceedings. 19th Annual. - IEEE. - P. 374-383, 2003.
[7]. Li X., Xue Y. BLOCK: a black-box approach for detection of state violation attacks towards web applications // Proceedings of the 27th Annual Computer Security Applications Conference. - ACM. - P. 247-256, 2011.
[8]. Saxena P., Sekar R., Puranik V. Efficient fine-grained binary instrumentation with applications to taint-tracking // Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. - ACM. - P. 74-83, 2008.
[9]. Браницкий А. А., Котенко И. В. Анализ и классификация методов обнаружения сетевых атак // Труды СПИИРАН. - Т. 2, № 45. - С. 207-244, 2016.
[10]. Heckerman D. A tutorial on learning with Bayesian networks // Innovations in Bayesian networks. - Springer. - P. 33-82, 2008.
[11]. Friedman N., Geiger D., Goldszmidt M. Bayesian network classifiers // Machine learning. - Vol. 29, no. 2-3. - P. 131-163, 1997.
[12]. Goldszmidt M. Bayesian network classifiers // Wiley Encyclopedia of Operations Research and Management Science. - 2010.
[13]. Barbara D., Wu N., Jajodia S. Detecting novel network intrusions using bayes estimators // Proceedings of the 2001 SIAM International Conference on Data Mining. - SIAM. - P. 1-17, 2001.
[14]. Нейросетевая технология обнаружения сетевых атак на информационные ресурсы / Ю. Г. Емельянова [и др.] // Программные системы: теория и приложения. - Т. 2, № 3. - С. 3-15, 2011.
[15]. A Detailed Analysis of the KDD CUP 99 Data Set / M. Tavallaee [et al.] // Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications. - Ottawa, Ontario, Canada: IEEE Press. - P. 53-58. - (CISDA'09). - URL: 1736481.17 36489, 2009.
[16]. Васильев В.И., Шарабыров И.В. Интеллектуальная система обнаружения атак в локальных беспроводных сетях // Вестник Уфимского государственного авиационного технического университета. - 2015. - Т. 19, 4 (70).
[17]. Su M.-Y. Real-time anomaly detection systems for Denial-of-Service attacks by weighted k-nearest-neighbor classifiers // Expert Systems with Applications. - Vol. 38, no. 4. - P. 3492-3498, 2011.
[18]. Lee C. H., Chung J. W., Shin S. W. Network intrusion detection through genetic feature selection // Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, 2006. SNPD 2006. Seventh ACIS International Conference on. - IEEE. - P. 109-114, 2006.
[19]. Intrusion detection with genetic algorithms and fuzzy logic / E. Ireland [et al.] // UMM CSci senior seminar conference. - P. 1-6, 2013.
[20]. Kruegel C., Toth T. Using decision trees to improve signature-based intrusion detection // Recent Advances in Intrusion Detection. - Springer. - P. 173-191, 2003.
[21]. Bouzida Y., Cuppens F. Neural networks vs.
ABOUT THE AUTHORS 
Manh Thang Nguyen
Workplace: Information Technology Faculty, Academy of Cryptography Techniques.
Email: chieumatxcova@gmail.com
Education:
2005-2007: Student at the Military Technical Academy.
2007-2013: Student at the Applied Mathematics and Informatics Faculty, Lipetsk State Pedagogical University, Russian Federation.
2017-present: Postgraduate student at the Military Academy of the Federal Guard Service of the Russian Federation.
Current research: computer networks, network security, machine learning and data mining.
D.S. Alexander Kozachok
Workplace: The Academy of the Federal Guard Service of the Russian Federation.
Email: alex.totrin@gmail.com
Education: received a PhD degree in Engineering Sciences from the Academy of the Federal Guard Service of the Russian Federation in Dec. 2012.
Current research: information security; unauthorized access protection; mathematical cryptography; theoretical problems of computer.
