Detecting Web Attacks Based on Clustering Algorithm and Multi-branch CNN
Journal of Science and Technology on Information Security
No. 2.CS (12) 2020
Pham Van Huong, Le Thi Hong Van, Pham Sy Nguyen
Abstract—This paper proposes and develops a web attack detection model that combines a clustering algorithm with a multi-branch convolutional neural network (CNN). The original feature set is clustered into groups of similar features. Each group is generalized by the convolutional structure of one branch of the CNN to produce a component feature vector. The component feature vectors are concatenated into a synthetic feature vector, which is fed into a fully connected layer for classification. Under K-fold cross-validation, the proposed method achieves an accuracy of 98.8% and an F1-score of 98.9%, and the improvement rate of accuracy is 1.479%.
Keywords—web attack detection; convolutional neural
network (CNN); deep learning; K-means; multi-branch CNN.
I. INTRODUCTION
Along with the exponential growth in the number of websites worldwide, attacks on this type of network service have become increasingly diverse. According to Internet Live Stats, there were more than 1.8 billion websites worldwide in November 2020. Typical web attack methods include XSS, HTTP Request Smuggling, DoS, and SQL Injection. At the same time, a positive trend in website security has been recorded globally. Specifically, the CyStack Attack Map system recorded 392,300 attacks on websites, a decrease of more than 20% compared with the same period of the previous year. This is partly because prevention and detection methods have been actively developed. These measures aim to minimize the damage caused by attacks on websites and to make the response of each business or organization more proactive, alongside its own specific prevention measures.
This manuscript was received on December 4, 2020. It was reviewed on December 22, 2020 and accepted on December 22, 2020 by the first reviewer. It was reviewed on December 22, 2020 and accepted on December 22, 2020 by the second reviewer.
There are many typical web attack detection methods, such as static analysis, anomaly detection, IDS/IPS, honeypots and honeynets, machine learning, and deep learning. Machine learning and deep learning are being developed and applied in most fields, such as image recognition, video recognition, medicine, entertainment, and malware classification. Web attack detection methods based on machine learning and deep learning have been applied vigorously and effectively since 2006 to a variety of attacks.
Among deep learning algorithms, CNN shows the highest efficiency on classification problems. Therefore, CNN architectures have been studied continuously for about 10 years. The multi-branch CNN architecture was introduced in 2017 and has been applied effectively to a number of classification problems, such as JPEG image classification and lesion identification in medicine. Therefore, this paper proposes a web attack detection method that combines the K-means clustering algorithm with a multi-branch CNN.
The rest of the paper is organized as follows: Section II surveys and analyzes related research; Section III presents the basic idea and the mathematical model of the method; Section IV uses the K-means algorithm to cluster a feature set; Section V describes the evaluation method; Section VI presents our experiments; Section VII gives the conclusion and directions for future development.
II. RELATED WORKS
Many studies have applied machine learning models to web attack detection, with accuracy ranging from 92% to over 99%. Most machine learning algorithms have been used and compared with one another. For phishing attack detection, Babagoli, Aghababa, and Solouk (2018) used an SVM algorithm and achieved 94.13% accuracy. A Random Forest algorithm with only NLP-based features gave the best performance, with a 97.98% accuracy rate for detecting phishing URLs [1]. In [2], the authors experimented with most machine learning algorithms for phishing detection using hyperlink information, and the results show that the Logistic Regression algorithm has the highest accuracy (98.42%). For SQL Injection attack detection, the authors of [3] used a Naïve Bayes algorithm and reached 93.3%. For DoS and DDoS attack detection, the authors of [4] use an SVM algorithm based on web log traces.
Deep learning is a subset of machine learning with outstanding performance on classification problems. Common deep learning models have also been used to detect several types of web attacks with great efficiency. Feng et al. (2018) proposed a novel neural-network-based classification method for detecting phishing web pages, using a Monte Carlo algorithm and the risk minimization principle. The CNN model [5] is used to detect website anomalies based on HTTP requests. The Stacked Auto Encoder (SAE) model [6] is applied ... attack classification (Smadi, Aslam, and Zhang, 2018). However, most of the above research results focus on detecting and warning about one or a few specific types of attacks on websites and do not yet detect diverse types of attacks.
Association rule mining and clustering techniques using the Apriori, FP-Growth, or K-means algorithms are not new in the field of big data mining [9]-[11]. K-means has been widely applied and integrated into many clustering tools such as ELKI and WEKA. This clustering algorithm continues to receive growing attention regarding parameter selection for meaningful results and good performance [12], [13].
In 2017, a multi-branch CNN was proposed by Amerini et al. to detect double JPEG image compression. It was then developed further by proposing another feature set, achieving relatively high accuracy (between 95% and 99% on average) [14]. In 2019, research groups continued to propose branching CNN architectures for multiple sclerosis lesion segmentation [15] and for myocardial infarction screening from ECG images [16]. The architecture has therefore been used effectively in medicine. However, very few studies use this architecture for the web attack detection problem [5].
Based on the above survey, this paper proposes a new method for web attack detection that combines the K-means clustering algorithm with a multi-branch CNN. The method is developed, tested, and evaluated in the following sections.
III. IDEA AND THE MATHEMATICAL MODEL
A. Basic idea
The key idea of our paper is to use a clustering algorithm to split the original feature set into subsets corresponding to clusters and to feed these subsets into the branches of a CNN for classification. Each cluster is fed into one branch, which generalizes its features into a component feature vector. The component feature vectors are joined to form a synthetic feature vector, which is passed to a fully connected layer of the CNN for classification. Because the features within a cluster are the closest to one another under the clustering metric, it is more efficient to build a component feature vector for each cluster.
B. Building the mathematical model of the problem
Definition 1 – Component feature vector
A component feature vector is the feature vector generated by one branch of a CNN, as described by Equation (3).
Definition 2 – Synthetic feature vector
A synthetic feature vector is the feature vector created by joining the component feature vectors, as described by Equation (4).
As shown in Fig. 1, the original feature set D is clustered by the K-means algorithm into K clusters, as expressed in Equation (2). The overall mathematical model of the problem is described by Equations (1) to (5).
$$f: D \to O \tag{1}$$
$$D = \bigcup_{i=1}^{K} D_i \tag{2}$$
$$v_i = f_{CNN}^{i}(D_i) \tag{3}$$
$$v = \bigcup_{i=1}^{K} v_i \tag{4}$$
$$f': V \to O \quad \text{and} \quad V = \{v\} \tag{5}$$
The features in each cluster are similar, so the convolution and filtering stages of a CNN branch produce better generalized features. At the same time, each component feature vector is generated by its own CNN branch, so it also carries the characteristics of its cluster. Each component feature vector is denoted v_i. The synthetic vector v is formed by combining the component feature vectors v_i.
Based on the overall model of the problem, the steps of building, analyzing, testing, and evaluating the method are presented in detail in the following sections.
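The data flow of Equations (2) to (5) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' network: each branch is reduced to a single 1-D convolution, a ReLU, and a max-pooling step, the kernels and weights are fixed toy values rather than trained parameters, and all function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def conv1d_valid(x: Sequence[float], kernel: Sequence[float]) -> Vector:
    """1-D convolution without padding (cross-correlation, as in CNN layers)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]


def relu(x: Sequence[float]) -> Vector:
    return [max(0.0, a) for a in x]


def max_pool(x: Sequence[float], size: int) -> Vector:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def branch(d_i: Sequence[float], kernel: Sequence[float], pool: int = 2) -> Vector:
    """Stand-in for f_CNN^i in Equation (3): maps cluster D_i to v_i."""
    return max_pool(relu(conv1d_valid(d_i, kernel)), pool)


def synthetic_vector(sample: Sequence[float],
                     clusters: Sequence[Sequence[int]],
                     kernels: Sequence[Sequence[float]]) -> Vector:
    """Equations (2) and (4): split the features by cluster, run one branch
    per cluster, and join the component vectors into one vector v."""
    v: Vector = []
    for feature_indices, kernel in zip(clusters, kernels):
        d_i = [sample[j] for j in feature_indices]  # features of cluster i
        v.extend(branch(d_i, kernel))               # append v_i
    return v


def dense(v: Sequence[float],
          weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> Vector:
    """Stand-in for f' in Equation (5): one fully connected layer (class scores)."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, biases)]


# Toy example: 10 binary features grouped into 3 hypothetical clusters.
sample = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
clusters = [[0, 2, 3, 6], [1, 4, 5], [7, 8, 9]]
kernels = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
v = synthetic_vector(sample, clusters, kernels)
scores = dense(v, weights=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], biases=[0.0, 0.0])
```

In this sketch the join in Equation (4) is implemented as concatenation, which matches the description of the component vectors being joined before the fully connected layer.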
IV. FEATURE SET CLUSTERING BASED ON K-MEANS ALGORITHM
Fig. 1. Overall research model.
K-means is one of the most popular clustering algorithms. It computes the centroids and iterates until it finds the optimal centroids. It assumes that the number of clusters is known in advance. In this paper, we use the K-means algorithm to cluster the original feature set into K subsets of features. The K-means algorithm is described as follows.
K-means algorithm:
Input:
  A set of features.
  Number of clusters K.
Output: K subsets of features.
Algorithm:
  1. Initialize K cluster centroids randomly. (6)
  2. Put each point into the cluster that has the nearest centroid. (7)
     Stop if the clusters do not change from the previous step.
  3. Update the centroids (8) and return to step 2.
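The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each feature is represented as a numeric point, uses squared Euclidean distance, and draws the initial centroids by sampling K of the input points. The function names and the `seed` and `max_iter` parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points: Sequence[Point], k: int,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centroids) for k clusters."""
    rng = random.Random(seed)
    # Step 1: initialize k centroids randomly (here: k distinct input points).
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        # Step 2: put each point into the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop if the clusters did not change from the previous step.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: update each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels or [], centroids


# Toy example: four points forming two tight groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, k=2)
```

Because the initial centroids are random, the cluster indices can differ between seeds even when the grouping itself is the same.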
V. EVALUATING THE METHOD
To evaluate the proposed method, we use K-fold cross-validation and measures such as Accuracy, Precision, Recall, and F1-score. These measures are calculated using Equations (9), (10), and (11).
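$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{9}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{11}$$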
where:
TP is the number of attack samples correctly classified as attacks.
FP is the number of normal samples incorrectly classified as attacks.
TN is the number of normal samples correctly classified as normal.
FN is the number of attack samples incorrectly classified as normal.
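As a small worked example, Equations (9) to (11) can be computed directly from the four counts. The function name and the sample counts below are illustrative assumptions. Accuracy is not given as an equation in the text, so the sketch uses the standard definition (TP + TN) / (TP + FP + TN + FN).

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute Precision (9), Recall (10), F1-score (11), and accuracy.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + tn + fn
    # Standard accuracy definition; assumed here, not taken from the paper.
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts for one validation fold.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

Under K-fold cross-validation, these values would be computed once per fold and then averaged.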
VI. EXPERIMENT
A. Experimental model
To evaluate the proposed method, we conducted experiments as shown in Fig. 2. In the experiments, the original feature set is clustered into three clusters; the original feature set is passed through a one-branch CNN, and each cluster is passed through one branch of a multi-branch CNN.
Fig. 2. Experimental model.
B. Experimental program and data
In this experiment, we implemented the CNN-based web attack detection program in Python using the TensorFlow library. Two CNN structures are implemented in the program: a one-branch CNN and a multi-branch CNN, described in Fig. 4 and Fig. 3, respectively. The multi-branch CNN has three branches corresponding to the three clusters, which have 585, 835, and 223 elements. The experiment uses the dataset in [17].
Fig. 3. Experimental structure of CNN-multi-branches.
Fig. 4. Experimental structure of CNN-1branch.
C. Feature conversion
To create the binary matrices used as input to a CNN, we convert the original feature set to a binary feature set, as shown in Fig. 5 and Fig. 6. Fig. 5 shows part of a query string, used as a raw feature, with Xpath and XSS labels. Fig. 6 shows some binary features converted from raw features.
Fig. 5. A part of a query string in the original feature set.
Fig. 6. A part of the CNN feature set.
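The text does not spell out the conversion rule, so the following is only a hypothetical sketch of one way a query string could be turned into a binary matrix: each byte of the string becomes a row of 8 bits, and the matrix is truncated or zero-padded to a fixed number of rows. The function name, the 8-bit encoding, and the row count are assumptions for illustration, not the encoding used in the experiments.

```python
from typing import List


def query_to_binary_matrix(query: str, rows: int = 16) -> List[List[int]]:
    """Encode a query string as a rows x 8 matrix of 0/1 values.

    Hypothetical encoding: one UTF-8 byte per row, most significant bit first.
    Longer strings are truncated and shorter ones are padded with zero rows.
    """
    data = query.encode("utf-8")[:rows]
    matrix = [[(byte >> shift) & 1 for shift in range(7, -1, -1)]
              for byte in data]
    matrix.extend([0] * 8 for _ in range(rows - len(matrix)))
    return matrix


# Example with a short, made-up query fragment.
matrix = query_to_binary_matrix("id=1'", rows=8)
```

A fixed matrix shape matters because every sample fed to a CNN branch must have the same input dimensions.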
D. Experimental results and evaluation
The accuracy and related measures obtained when experimenting on the three data sets with the CNN models under cross-validation are summarized in Table 1. The average improvement rate of accuracy is 1.479%. The improvement of the proposed method when experimenting on three clusters is summarized as a chart in Fig. 7.
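The improvement rates in Table 1 are consistent with a relative gain of the multi-branch CNN over the one-branch CNN, (proposed - baseline) / baseline × 100. The short check below recomputes them from the per-run values in Table 1. The formula is inferred from the table rather than stated in the text, and the names are illustrative.

```python
from statistics import mean
from typing import Dict, List

# Per-run values copied from Table 1 (five runs).
TABLE_1: Dict[str, Dict[str, List[float]]] = {
    "f1": {
        "single": [0.962, 0.974, 0.968, 0.975, 0.969],
        "multi": [0.985, 0.989, 0.983, 0.991, 0.995],
    },
    "acc": {
        "single": [0.967, 0.965, 0.983, 0.981, 0.973],
        "multi": [0.986, 0.991, 0.984, 0.995, 0.985],
    },
}


def improvement_rate(baseline: float, proposed: float) -> float:
    """Relative gain of `proposed` over `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


def per_run_rates(metric: str) -> List[float]:
    """Improvement rate for each of the five runs."""
    rows = TABLE_1[metric]
    return [round(improvement_rate(b, p), 3)
            for b, p in zip(rows["single"], rows["multi"])]


def average_rate(metric: str) -> float:
    """Improvement rate computed from the run-averaged scores."""
    rows = TABLE_1[metric]
    return round(improvement_rate(mean(rows["single"]), mean(rows["multi"])), 3)


acc_rate = average_rate("acc")
f1_rate = average_rate("f1")
```

Computed this way, the averaged rates match the Average column of Table 1 (1.479 for accuracy and 1.960 for F1-score), which suggests the average rate is derived from the averaged scores rather than from the mean of the five per-run rates.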
As shown in Table 2, the proposed model has higher accuracy than the machine learning models in the study [18], including SVM, PCA, etc. At the same time, using the K-means algorithm to group the features also improves the accuracy. This is because clustering yields groups of similar features, so the generalization of features in the convolution layers is more efficient.
TABLE 2. COMPARING TO OTHER METHODS
Method  Naive Bayes  AGGREGATE_ANY  Autoencoder  PCA    CNN
Acc.    0.941        0.933          0.906        0.737  0.988
Fig. 7. Comparison of CNN-1branch and CNN-multi-branches.
VII. CONCLUSION
The main contribution of this paper is a new web attack detection method that combines clustering by the K-means algorithm with classification by a multi-branch CNN. The proposed method is evaluated using K-fold cross-validation with good results. Our method is better than the original method (the one-branch CNN) on both F1-score and accuracy.
Despite the positive results, this paper still has some limitations: the number of classes is small, the number of samples is limited, and the number of clusters is fixed. Therefore, we will continue to research and improve the methodology, including experimenting with other machine learning and deep learning models, studying dynamic cluster numbers, and experimenting with other real-world data sets that have more classes and more diverse forms of attack.
TABLE 1. EXPERIMENTAL RESULTS
Models                Time 1        Time 2        Time 3        Time 4        Time 5        Average
                      F1     Acc    F1     Acc    F1     Acc    F1     Acc    F1     Acc    F1     Acc
CNN-1branch           0.962  0.967  0.974  0.965  0.968  0.983  0.975  0.981  0.969  0.973  0.970  0.974
CNN-multi-branches    0.985  0.986  0.989  0.991  0.983  0.984  0.991  0.995  0.995  0.985  0.989  0.988
Improvement rate (%)  2.391  1.965  1.540  2.694  1.550  0.102  1.641  1.427  2.683  1.233  1.960  1.479
REFERENCES
[1] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri, Machine learning based phishing detection from URLs, Expert Systems With Applications 117, 2019, pp. 345–357.
[2] Ankit Kumar Jain, B. B. Gupta, A Machine Learning based Approach for phishing detection using hyperlinks information, Springer-Verlag GmbH Germany, part of Springer Nature, 2018.
[3] Anamika Joshi, Geetha V, SQL Injection Detection using Machine Learning, 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), 2014.
[4] Yuchun Tang, Zhenyu Zhong, Yuanchen He, System and Method for Detection of DoS Attacks, Apr. 25, 2013.
[5] Ming Zhang, Boyi Xu, Shuai Bai, Shuaibing Lu, and Zhechao Lin, A Deep Learning Method to Detect Web Attacks Using a Specially Designed CNN, ICONIP 2017, Part V, LNCS 10638, 2017, pp. 828–836.
[6] Ali Moradi Vartouni, Saeed Sedighian Kashi, Mohammad Teshnehlab, An Anomaly Detection Method to Detect Web Attacks Using Stacked Auto-Encoder, 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), 2018.
[7] Ruibo Yan, Xi Xiao, Guangwu Hu, Sancheng Peng, Yong Jiang, New deep learning method to detect code injection attacks on hybrid applications, The Journal of Systems and Software 137, 2018, pp. 67–77.
[8] Yadigar Imamverdiyev, Fargana Abdullayeva,
Deep Learning Method for Denial of Service
Attack Detection Based on Restricted
Boltzmann Machine, Mary Ann Liebert, Inc.,
Big Data, Volume 6 Number 2, 2018.
[9] Coenen, F., Goulbourne, G. and Leng, P., Tree
Structures for Mining association Rules, Journal
of Data Mining and Knowledge Discovery, Vol
8, No 1, 2003, pp. 25-51.
[10] Asantha Thilina, Shakthi Attanayake, Sacith
Samarakoon, Dahami Nawodya, Lakmal
Rupasinghe, Nadith Pathirage, Tharindu
Edirisinghe, Kesavan Krishnadeva, Intruder
Detection Using Deep Learning and Association
Rule Mining, IEEE International Conference on
Computer and Information Technology, 2016.
[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander,
and Xiaowei Xu, A density-based algorithm for
discovering clusters in large spatial databases
with noise, In Proceedings of the 2nd ACM
International Conference on Knowledge
Discovery and Data Mining (KDD), 1996, pp.
226–231.
[12] Junhao Gan, Yufei Tao, DBSCAN revisited: Mis-Claim, Un-fixability and Approximation, SIGMOD 2015.
[13] Erich Schubert, Jorg Sander, Martin Ester, Hans-
Peter Kriegel, Xiaowei Xu, DBSCAN Revisited,
Revisited: Why and How You Should (Still) Use
DBSCAN, ACM Trans. Database Syst. 42, 3,
Article 19, 2017.
[14] Bin Li, Hu Luo, Haoxin Zhang, Shunquan Tan, Zhongzhou Ji, A multi-branch convolutional neural network for detecting double JPEG compression, arXiv, 2017.
[15] Shahab Aslani, Michael Dayan, Loredana
Storelli, Massimo Filippi, Vittorio Murino,
Maria A Rocca, Diego Sona, Multi-branch
Convolutional Neural Network for Multiple
Sclerosis Lesion Segmentation, Arxiv,
April 2019.
[16] Pengyi Hao, Xiang Gao, Zhihe Li, Jinglin
Zhang, Fuli Wu, Cong Bai, Multi-branch fusion
network for Myocardial infarction screening
from 12-lead ECG images, Computer Methods
and Programs in Biomedicine 184, 2020.
[17] Web attack detection dataset:
https://github.com/DuckDuckBug/cnn_waf
[18] Pan Yao, Sun Fangzhou, Teng Zhongwei, White
Jules, Schmidt Douglas, Staples Jacob and
Krause Lee, Detecting web attacks with end-to-
end deep learning. Journal of Internet Services
and Applications, 2019.
ABOUT THE AUTHOR
Pham Van Huong
Workplace: Academy of
Cryptography Techniques
Email: huongpv@actvn.edu.vn
Education: Received Bachelor's
degree in 2005, Master's degree in
2008 and PhD in 2015 in Information
Technology from University of Engineering and
Technology, VNU.
Recent research direction: IoT, AIoT, embedded
software optimization and big data, deep learning for
information security.
Le Thi Hong Van
Workplace: Academy of
Cryptography Techniques
Email: lthvan@actvn.edu.vn
Education: Received Engineer's
degree in 2009 and Master's degree in
2013 in Information Security from
Academy of Cryptography Techniques.
Recent research direction: information security,
cryptography, IoT and application of AI, machine
learning for information security.
Pham Sy Nguyen
Workplace: Informatics center, The
Government Office
Email: phamsynguyen@chinhphu.vn
Education: Received Engineer’s
degree in Information Security in
2013; received Master’s degree in
Information Security in 2016 from Academy of
Cryptography Techniques.
Recent research direction: web hacking, malware
detection, information security.

