Detecting web attacks based on clustering algorithm and multi - Branch cnn

Bài báo đề xuất và phát triển mô hình
phát hiện tấn công Web dựa trên kết hợp thuật
toán phân cụm và mạng nơ-ron tích chập (CNN)
đa nhánh. Tập đặc trưng ban đầu được phân cụm
thành các nhóm đặc trưng tương ứng. Mỗi nhóm
đặc trưng được khái quát hóa trong một nhánh
của mạng CNN đa nhánh để tạo thành một vector
đặc trưng thành phần. Các vector đặc trưng thành
phần được ghép lại thành một vector đặc trưng
tổng hợp và đưa vào lớp liên kết đầy đủ để phân
lớp. Sử dụng phương pháp kiểm thử chéo trên mô
hình đề xuất, độ chính xác đạt 98,8%, F1-score đạt
98,8% và tỉ lệ cải tiến độ chính xác là 1,479%
Download
Trang 1
Trang 2
Trang 3
Trang 4
Trang 5
Trang 6
Trang 7
7 trang minhkhanh 4360
Download
Bạn đang xem tài liệu "Detecting web attacks based on clustering algorithm and multi - Branch cnn", để tải tài liệu gốc về máy hãy click vào nút Download ở trên
Tóm tắt nội dung tài liệu: Detecting web attacks based on clustering algorithm and multi - Branch cnn

Khoa học và Công nghệ trong lĩnh vực An toàn thông tin 
Số 2.CS (12) 2020 31 
Pham Van Huong, Le Thi Hong Van, Pham Sy Nguyen 
Abstract—This paper proposes and develops a 
web attack detection model that combines a 
clustering algorithm and a multi-branch 
convolutional neural network (CNN). The original 
feature set was clustered into clusters of similar 
features. Each cluster of similar features was 
generalized in a convolutional structure of a 
branch of the CNN. The component feature 
vectors are assembled into a synthetic feature 
vector and included in a fully connected layer for 
classification. Using K-fold cross-validation, the 
accuracy of the proposed method 98.8%, 
F1-score is 98.9% and the improvement rate of 
accuracy is 1.479%. 
Tóm tắt—Bài báo đề xuất và phát triển mô hình 
phát hiện tấn công Web dựa trên kết hợp thuật 
toán phân cụm và mạng nơ-ron tích chập (CNN) 
đa nhánh. Tập đặc trưng ban đầu được phân cụm 
thành các nhóm đặc trưng tương ứng. Mỗi nhóm 
đặc trưng được khái quát hóa trong một nhánh 
của mạng CNN đa nhánh để tạo thành một vector 
đặc trưng thành phần. Các vector đặc trưng thành 
phần được ghép lại thành một vector đặc trưng 
tổng hợp và đưa vào lớp liên kết đầy đủ để phân 
lớp. Sử dụng phương pháp kiểm thử chéo trên mô 
hình đề xuất, độ chính xác đạt 98,8%, F1-score đạt 
98,8% và tỉ lệ cải tiến độ chính xác là 1,479%. 
Keywords—web attack detection; convolutional neural 
network (CNN); deep learning; K-means; multi-branch CNN. 
Từ khóa—phát hiện tấn công Web; mạng nơ-ron tích 
chập (CNN); học sâu; K-means; CNN đa nhánh. 
I. INTRODUCTION 
Along with the exponential growth in the 
number of websites worldwide, the forms of 
attacks on this type of network service are also 
increasingly diverse. According to the Internet 
Live Stats, in November 2020, there are more 
than 1.8 billion websites worldwide. The attack 
methods on the web are increasingly diverse, 
This manuscript is received on December 4, 2020. It is 
commented on December 22, 2020 and is accepted on 
December 22, 2020 by the first reviewer. It is commented on 
December 22, 2020 and is accepted on December 22, 2020 by 
the second reviewer. 
typically: XSS, HTTP Request Smuggling, DoS, 
SQL Injection, etc. At the same time, the world 
has also recorded a positive trend of website 
security globally. Specifically, the CyStack 
Attack Map system recorded 392,300 attacks on 
the website, decreased more than 20% compared 
to the same period last year. This is partly due to 
the fact that prevention and detection methods 
have been actively developed. These measures 
are aimed at minimizing the damage from attacks 
on websites, increasing the proactivity of coping 
as well as preventing specific prevention 
measures of each business or unit. 
There are many typical web attack detection 
methods such as static analysis, anomaly 
detection, using IDS/IPS, using Honey 
Pot/Honey Net, machine learning, deep learning, 
etc. Machine learning and deep learning are 
focused on development and application in most 
fields, such as image recognition, video 
recognition, medicine, entertainment, malware 
classification, etc. Web attack detection methods 
based on machine learning and deep learning 
have been applied vigorously and effectively 
since 2006 with a variety of attacks. 
In deep learning algorithms, CNN shows the 
highest efficiency in classifying problems. 
Therefore, the CNN architectural models have 
been studied continuously for about 10 years. 
Since 2017, multi-branch CNN architecture was 
launched and applied effectively to a number of 
classification problems such as JPEG image 
classification, lesion identification in medicine, 
etc. Therefore, this paper proposes a method of 
detecting a web attack that uses a combination 
of DBSCAN clustering algorithm and multi-
branch CNN. 
The rest of the paper is organized as follows: 
Section II – Survey, analysis, synthesis of related 
research; Section III – Presentation on the basic 
idea, process and content of method’s 
development; Section IV – Using K-means 
algorithm to cluster a feature set; Section V – 
Detecting Web Attacks Based on Clustering 
Algorithm and Multi-branch CNN 
Journal of Science and Technology on Information security 
32 No 2.CS (12) 2020 
Evaluation method; Section VI – Presenting our 
experiment; Section VII – Conclusion and trends 
of development. 
II. RELATED WORKS 
There have been many research results using 
machine learning models in web attack detection 
problems with accuracy from 92% to over 99%. 
Most of the machine learning algorithms are used 
and compared to each other. In phishing attack 
detection problem, Babagoli, Aghababa, and 
Solouk (2018) used SVM algorithm to achieve 
94.13% accuracy. Random Forest algorithm with 
only NLP-based features gives the best 
performance with the 97.98% accuracy rate for 
detection of phishing URLs [1]. In [2], the 
authors use most of machine learning algorithms 
to experiment with phishing detection using 
hyperlink information and the results show that 
Logistic Regression algorithm has the highest 
accuracy (98.42%). In SQL Injection attack 
detection, the authors used Naïve Bayes 
algorithm reached 93.3% [3]. In DoS, DDoS 
attack detection, the authors [4] uses an SVM 
algorithm based on web log traces. 
Deep learning is known as a subset of 
machine learning, with outstanding performance 
in classification problems. Common deep 
learning models have also been used to detect 
several types of web attacks with great 
efficiency. Feng et al. (2018) proposed a novel 
neural network based on a classification method 
for detection of phishing web pages using a 
Monte Carlo algorithm and risk minimization 
principle. The CNN model [5] is used to detect 
website anomalies based on HTTP requests. The 
Stacked Auto Encoder (SAE) model [6] is 
applie ... attack 
classification (Smadi, Aslam, and Zhang - 2018). 
However, most of the above research results 
focus on detecting and warning about one or a 
few specific types of attacks on the websites, yet 
to detect diverse types of attacks. 
Associative rule mining and clustering 
techniques using Apriori, FP-Growth or K-
means algorithms are not too new in the field of 
big data mining [9]-[11]. K-means was widely 
applied and integrated in many clustering tools 
such as ELKI, WEKA, etc. Recently, this 
clustering algorithm is still receiving growing 
attention in terms of parameter selection for 
meaningful research results and good 
performance [12], [13]. 
In 2017, multi-branch CNN was proposed by 
Amerini et al to detect double JPEG image 
compression. It is then further developed in the 
direction of proposing another feature set for 
relatively high accuracy (average between 95% - 
99%) [14]. In 2019, the research groups 
continued to propose branching CNN 
architecture for multiple sclerosis lesion 
segmentation [15], or for myocardial infarction 
screening from ECG images [16]. Therefore, it is 
used effectively in medicine. There are very few 
research results that use this architecture for the 
web attack detection problem [5]. 
Based on the above survey results, this paper 
proposes new methods to Web attack detection 
based on the combination of K-means clustering 
algorithm and Multi-branch CNN. Our method 
will be developed, experimented and evaluated 
in the following sections. 
III. IDEA AND THE MATHEMATICAL MODEL 
A. Basic idea 
The key idea of our paper is to use clustering 
algorithms to split an original feature set into the 
subsets corresponding to clusters; and put them 
to branches of a CNN to classify. Each cluster is 
put into a branch to generalize features to create 
a component feature vector. The component 
feature vectors are joined to generate a synthetic 
feature vector. This vector is put into a fully 
connected layer of CNN to classify. Because the 
features in a cluster have the closest metrics, it is 
more efficient to build the component feature 
vector for each cluster. 
Khoa học và Công nghệ trong lĩnh vực An toàn thông tin 
Số 2.CS (12) 2020 33 
B. Building the mathematical model of the problem 
Definition 1 – Component feature vector 
A component feature vector is the feature 
vector generated by a branch of a CNN, is 
described by Equation (3). 
Definition 2 – Synthetic feature vector 
A synthetic feature vector is the feature vector 
created by joining component feature vectors 
described by Equation (4). 
As shown in Fig. 1, the original feature set 𝐷 
is clustered by K-means algorithm to K clusters 
shown in Equation (2). And, the overall 
mathematical model of the problem is described 
by Equations (1) to (5). 
𝑓: 𝐷 → 𝑂 (1) 
𝐷 = ⋃ 𝐷𝑖
𝐾
𝑖=1
(2) 
𝑣𝑖 = 𝑓𝐶𝑁𝑁
𝑖 (𝐷𝑖) (3) 
𝑣 = ⋃ 𝑣𝑖
𝐾
𝑖=1
 (4) 
𝑓’: 𝑉 → 𝑂 and 𝑉 = {𝑣} (5) 
The features in each cluster have similarities, 
so when using convolution and filtering part of a 
CNN branch, we obtain better generalization 
features. At the same time, each component 
feature vector is generated on a CNN branch so 
it also carries the characteristics of each cluster. 
Each component feature vector is called vi. The 
synthetic vector v is formed by combining 
component features vi. 
Based on the overall model of the problem, 
the steps of building, analyzing, testing and 
evaluating methods will be presented in detail in 
the following sections. 
IV. FEATURE SET CLUSTERING 
BASED ON K-MEANS ALGORITHM 
K-means is one of the most popular clustering 
algorithms. K-means clustering algorithm 
computes the centroids and iterates until it finds 
Fig. 1. Overall research model. 
Journal of Science and Technology on Information security 
34 No 2.CS (12) 2020 
optimal centroid. It assumes that the number of 
clusters is already known. In this paper, we use 
K-means algorithm to cluster the original feature 
set to K subsets of features. K-means algorithm 
is described as follows. 
K-means algorithm: 
Input: 
 A set of features. 
 Number of clusters 𝐾. 
Output: 𝐾 subsets of features 
Algorithm: 
1 Initialize 𝑘 cluster centroids randomly 
(6) 
2 Put each point into the cluster which has 
the nearest centroid 
 (7) 
 Stop if clusters do not change from the 
previous step 
3 Update centroids 
(8) 
V. EVALUATING THE METHOD 
In order to evaluate the proposed method, we 
used a K-fold cross-validation method and 
measures such as Accuracy, Precision, Recall 
and F1-score. These measurements are 
calculated using Equation (9), (10) and (11). 
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 
𝑇𝑃
𝑇𝑃 + 𝐹𝑃
 (9) 
𝑅𝑒𝑐𝑎𝑙𝑙 = 
𝑇𝑃
𝑇𝑃 + 𝐹𝑁
 (10) 
𝐹1 − 𝑆𝑐𝑜𝑟𝑒 = 
2 ∗ 𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑅𝑒𝑐𝑎𝑙𝑙
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑅𝑒𝑐𝑎𝑙𝑙
 (11) 
where, 
 TP is the true number of classified 
patterns of attack state. 
 FP is the false number of classified 
patterns of attack state. 
 TN is the true number of classified 
patterns of normal state. 
 FN is the false number of classified 
patterns of normal state. 
VI. EXPERIMENT 
A. Experimental model 
To evaluate the proposed method, we 
conducted experiments as shown in Fig. 2. In 
Fig. 2. Experimental model. 
Khoa học và Công nghệ trong lĩnh vực An toàn thông tin 
Số 2.CS (12) 2020 35 
experiments, an original feature set is clustered 
into three clusters; the original feature set is 
passed through a one-branch CNN and each 
cluster is passed through a branch of a multi-
branch CNN. 
B. Experimental program and data 
In this experiment, we installed the web attack 
detection program according to CNN in Python 
language, using the TensorFlow library. The two 
CNN network structures installed in the program 
consist of a one-branch CNN and a multi-branch 
CNN, described in Fig. 4 and Fig. 3. The multi-
branches have three branches corresponding to 
the three clusters, with 585, 835 and 223 
elements. To do our experiment, we use the 
dataset in [17]. 
Fig. 3. Experimental Structure of 
CNN-multi-branches. 
Fig. 4. Experimental structure of a CNN-1branch. 
C. Feature conversion 
In order to create binary matrices inputted to 
a CNN, we convert the original feature set to a 
binary feature set as shown in Fig. 5. and Fig. 6. 
Fig. 5 shows a part of the query string, used as a 
raw feature, having Xpath and XSS labels. Fig. 6 
shows some binary features converted by 
raw features. 
Fig. 5. A part of query string in the original feature set. 
Fig. 6. A part of CNN feature set. 
D. Experimental results and evaluation 
The accuracy and relevant measurements 
when experimenting on the three data sets with 
CNN model by the cross-testing method are 
summarized in Table 1. The average 
improvement rate is 1.479%. Comparing the 
improvement level of the proposed method when 
experimenting on 3 clusters, it is summarized in 
chart form as Fig. 7. 
As shown in Table 2, compared with some 
machine learning models in the study [18], 
including SVM, PCA, etc., the proposed model 
has higher accuracy. At the same time, the use of 
the K-means algorithm to group the features also 
improves the accuracy. This is because after 
clustering, we obtain groups of similar features, 
so the generalization of features in the 
convolution layers is more efficient. 
Journal of Science and Technology on Information security 
36 No 2.CS (12) 2020 
TABLE 2. COMPARING TO OTHER METHODS 
Method 
Naive 
bayes 
AGGRE 
GATE_ANY 
Auto 
encoder 
PCA CNN 
Acc. 0.941 0.933 0.906 0.737 0.988 
Fig. 7. Comparison of CNN-1 branch and 
CNN-multi-branches 
VII. CONCLUSION 
The main contribution of this paper is to 
propose and develop the new method of web 
attack detections, associated clustering by K-
means algorithm and classifying by a multi-
branch CNN. The proposed method is evaluated 
using K-fold cross-validation with good results. 
Our method is better than the original method on 
both F1-score and accuracy. 
Despite the positive results, this paper still has 
some limitations such as: the number of classes 
is small, the number of samples is limited, and 
the cluster number is fixed. Therefore, we will 
continue to research and improve the 
methodology in the paper including: 
experimenting with other machine learning/deep 
learning models; studying on dynamic cluster 
numbers; experimenting with other actual data 
sets with a higher number of classes and more 
diverse forms of attacks. 
REFERENCES 
[1] Ozgur Koray Sahingoz, Ebubekir Buber, Onder 
Demir, Banu Diri, Machine learning based 
phishing detection from URLs, Expert Systems 
With Applications 117, 2019, pp. 345–357. 
[2] Ankit Kumar Jain1 · B. B. Gupta, A Machine 
Learning based Approach for phishing detection 
using hyperlinks information, © Springer-
Verlag GmbH Germany, part of Springer 
Nature 2018. 
[3] Anamika Joshi, Geetha V, SQL Injection 
Detection using Machine Learning, 2014 
International Conference on Control, 
Instrumentation, Communication and 
Computational Technologies (ICCICCT), 2014. 
[4] Yuchun Tang, Zhenyu Zhong, Yuanchen He, 
System and Method for Detection of DoS 
Attacks, Apr. 25, 2013. 
[5] Ming Zhang, Boyi Xu, Shuai Bai, Shuaibing Lu, 
and Zhechao Lin, A Deep Learning Method to 
Detect Web Attacks Using a Specially Designed 
CNN, ICONIP 2017, Part V, LNCS 10638, 
2017, pp. 828–836. 
[6] Ali Moradi Vartouni, Saeed Sedighian Kashi, 
Mohammad Teshnehlab, An Anomaly 
Detection Method to Detect Web Attacks Using 
Stacked Auto-Encoder, 6th Iranian Joint 
Congress on Fuzzy and Intelligent Systems 
(CFIS), 2018. 
[7] Ruibo Yan, Xi Xiao, Guangwu Hu, Sancheng 
Peng, Yong Jiang, New deep learning method to 
TABLE 1. EXPERIMENTAL RESULTS 
Models 
Times 
Average 
1 2 3 4 5 
F1-
Score 
Acc 
F1-
Score 
Acc 
F1-
Score 
Acc 
F1-
Score 
Acc 
F1-
Score 
Acc 
F1-
Score 
Acc 
CNN-
1branch 
0.962 0.967 0.974 0.965 0.968 0.983 0.975 0.981 0.969 0.973 0.970 0.974 
CNN-multi-
branches 
0.985 0.986 0.989 0.991 0.983 0.984 0.991 0.995 0.995 0.985 0.989 0.988 
Improvement 
rate (%) 
2.391 1.965 1.540 2.694 1.550 0.102 1.641 1.427 2.683 1.233 1.960 1.479 
Khoa học và Công nghệ trong lĩnh vực An toàn thông tin 
Số 2.CS (12) 2020 37 
detect code injection attacks on hybrid 
applications, The Journal of Systems and 
Software 137, 2018, pp. 67–77. 
[8] Yadigar Imamverdiyev, Fargana Abdullayeva, 
Deep Learning Method for Denial of Service 
Attack Detection Based on Restricted 
Boltzmann Machine, Mary Ann Liebert, Inc., 
Big Data, Volume 6 Number 2, 2018. 
[9] Coenen, F., Goulbourne, G. and Leng, P., Tree 
Structures for Mining association Rules, Journal 
of Data Mining and Knowledge Discovery, Vol 
8, No 1, 2003, pp. 25-51. 
[10] Asantha Thilina, Shakthi Attanayake, Sacith 
Samarakoon, Dahami Nawodya, Lakmal 
Rupasinghe, Nadith Pathirage, Tharindu 
Edirisinghe, Kesavan Krishnadeva, Intruder 
Detection Using Deep Learning and Association 
Rule Mining, IEEE International Conference on 
Computer and Information Technology, 2016. 
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, 
and Xiaowei Xu, A density-based algorithm for 
discovering clusters in large spatial databases 
with noise, In Proceedings of the 2nd ACM 
International Conference on Knowledge 
Discovery and Data Mining (KDD), 1996, pp. 
226–231. 
[12] Junhao Gan, Yufei Tao, DBSCAN revisited: 
Mis-Claim, Un-fixability and Approximation, 
SIGMODE 2015. 
[13] Erich Schubert, Jorg Sander, Martin Ester, Hans-
Peter Kriegel, Xiaowei Xu, DBSCAN Revisited, 
Revisited: Why and How You Should (Still) Use 
DBSCAN, ACM Trans. Database Syst. 42, 3, 
Article 19, 2017. 
[14] 14. Bin Li, Hu Luo, Haoxin Zhang, Shunquan 
Tan, Zhongzhou Ji, A multi-branch 
convolutional neural network for detecting 
double JPEG compression, Arxiv, 2017. 
[15] Shahab Aslani, Michael Dayan, Loredana 
Storelli, Massimo Filippi, Vittorio Murino, 
Maria A Rocca, Diego Sona, Multi-branch 
Convolutional Neural Network for Multiple 
Sclerosis Lesion Segmentation, Arxiv, 
April 2019. 
[16] Pengyi Hao, Xiang Gao, Zhihe Li, Jinglin 
Zhang, Fuli Wu, Cong Bai, Multi-branch fusion 
network for Myocardial infarction screening 
from 12-lead ECG images, Computer Methods 
and Programs in Biomedicine 184, 2020. 
[17] Web attack detection dataset: 
https://github.com/DuckDuckBug/cnn_waf 
[18] Pan Yao, Sun Fangzhou, Teng Zhongwei, White 
Jules, Schmidt Douglas, Staples Jacob and 
Krause Lee, Detecting web attacks with end-to-
end deep learning. Journal of Internet Services 
and Applications, 2019. 
ABOUT THE AUTHOR 
Pham Van Huong 
Workplace: Academy of 
Cryptography Techniques 
Email: huongpv@actvn.edu.vn 
Education: Received Bachelor's 
degree in 2005, Master's degree in 
2008 and PhD in 2015 in Information 
Technology from University of Engineering and 
Technology, VNU. 
Recent research direction: IoT, AIoT, embedded 
software optimization and big data, deep learning for 
information security. 
 Le Thi Hong Van 
Workplace: Academy of 
Cryptography Techniques 
Email: lthvan@actvn.edu.vn 
Education: Received Engineer's 
degree in 2009 and Master's degree in 
2013 in Information Security from 
Academy of Cryptography Techniques. 
Recent research direction: information security, 
cryptography, IoT and application of AI, machine 
learning for information security. 
Pham Sy Nguyen 
Workplace: Informatics center, The 
Government Office 
Email: phamsynguyen@chinhphu.vn 
Education: Received Engineer’s 
degree in Information Security in 
2013; received Master’s degree in 
Information Security in 2016 from Academy of 
Cryptography Techniques. 
Recent research direction: web hacking, malware 
detection, information security.
File đính kèm:
detecting_web_attacks_based_on_clustering_algorithm_and_mult.pdf