Anomaly detection system of web access using user behavior features



Southeast Asian J. of Sciences: Vol 7, No 2, (2019) pp. 115-132
ANOMALY DETECTION SYSTEM OF WEB ACCESS USING USER BEHAVIOR FEATURES
Pham Hoang Duy, Nguyen Thi Thanh Thuy and Nguyen Ngoc Diep
Department of Information Technology
Posts and Telecommunications Institute of Technology (PTIT)
Hanoi, Vietnam
duyph@ptit.edu.vn; thuyntt@ptit.edu.vn; diepnguyenngoc@ptit.edu.vn
Abstract
The growth and accessibility of the Internet and the explosion of personal
computing devices have driven robust growth in web applications,
especially for e-commerce and public services. Unfortunately, the
vulnerabilities of these web services have also increased rapidly. This leads
to the need to monitor users' accesses to these services and to distinguish
abnormal and malicious behaviors in the log data, in order to ensure the
quality of these web services as well as their safety. This work presents
methods to build and develop a rule-based system that allows service
administrators to detect abnormal and malicious accesses to their web
services from web logs. The proposed method investigates characteristics
of user behaviors in the form of HTTP requests and extracts efficient
features to precisely detect abnormal accesses. Furthermore, this report
proposes a way to collect and build datasets for applying machine learning
techniques to generate detection rules automatically. The anomaly
detection system was tested and its performance evaluated on 4 different
websites with approximately one million log lines per day.
Key words: Anomaly detection system, web log, rule generation, user behavior, TF-IDF.
2010 AMS Mathematics classification: 91D10, 68U115, 68U35, 68M14, 68M115, 68T99.
1 Introduction
Anomaly detection refers to the problem of finding patterns in data
that do not match the expected behavior. These nonconforming patterns are
often referred to as anomalies, exceptions, contradictory observations, or
irregularities, depending on the characteristics of the application domain.
With the development of the Internet and web applications, anomaly detection
in web services can range from detecting misuse to detecting malicious intent
that degrades the quality of the website service or commits fraud. With web
services, analytic techniques need to transform the original raw data into an
appropriate form that describes the session information or the amount of time
a user interacts with the services provided by the website.
In monitoring users' access to web services, rule-based anomaly detection
is commonly used because it is accessible and readable to service
administrators. There are two basic approaches to generating rules. The first
relies on rules created manually and statically by service administrators
as they analyze users' behaviors. The second creates rules dynamically and
automatically using data mining or machine learning techniques.
For static rule generation, it is first necessary to construct a scenario of
the situation that the administrator wants to simulate. For example, if
one process running on one device and another process running on another
device at the same time together cause a security issue, the administrator
needs to model this scenario. Besides creating these rules, administrators
must enforce the correlation among them to verify whether a case is an
anomaly or not. A rule can contain many parameters, such as time frame,
repeating pattern, service type, port, etc. The algorithm then checks the
data from the log files and finds attack scenarios or unusual behaviors.
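As an illustration of such a static rule, the following minimal Python sketch checks log entries against one hand-written rule with a time frame and a repeating pattern. The rule itself, the field layout of the log entries, and the names `RULE` and `apply_rule` are hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict, deque

# Hypothetical static rule: the same client sends path-traversal-like
# requests at least `threshold` times within `window_seconds`.
RULE = {
    "pattern": re.compile(r"\.\.(/|%2f)", re.IGNORECASE),
    "window_seconds": 60,
    "threshold": 3,
}


def apply_rule(entries, rule):
    """Return the set of client IPs that trigger the rule.

    `entries` is an iterable of (timestamp_seconds, client_ip, request_path)
    tuples sorted by timestamp (an assumed, simplified log layout).
    """
    recent_hits = defaultdict(deque)  # client_ip -> timestamps of matching requests
    flagged = set()
    for timestamp, client_ip, path in entries:
        if not rule["pattern"].search(path):
            continue
        hits = recent_hits[client_ip]
        hits.append(timestamp)
        # Drop matches that fell out of the time frame.
        while hits and timestamp - hits[0] > rule["window_seconds"]:
            hits.popleft()
        if len(hits) >= rule["threshold"]:
            flagged.add(client_ip)
    return flagged


log = [
    (0, "10.0.0.1", "/index.html"),
    (5, "10.0.0.2", "/a?f=..%2f..%2fetc%2fpasswd"),
    (10, "10.0.0.3", "/b?f=../secret"),
    (20, "10.0.0.2", "/a?f=..%2f..%2fetc%2fshadow"),
    (40, "10.0.0.2", "/a?f=../../boot.ini"),
    (200, "10.0.0.3", "/b?f=../other"),
]
flagged = apply_rule(log, RULE)
print(flagged)
```

Each new attack scenario would need another rule of this kind, which is the maintenance cost discussed next.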
The advantage of this approach is the ability to detect anomalous access
behaviors through correlation analysis, and thus to detect intruders that are
otherwise difficult to detect. There are specific languages for writing rules,
as well as tools for creating rules effectively and easily. For companies with
appropriate resources and budget, it is easier and more convenient to buy a
set of rules than to use several systems to create specific rules.
The downside of this approach is its high cost, especially for maintaining the
set of rules. Modeling each attack scenario is not a trivial task. The same
type of attack is likely to be carried out again, and sometimes the anomaly
cannot be identified. In addition, attack patterns can change, and new attack
forms are invented every day. As such, the set of rules must evolve over
time, and even then some unspecified attacks, which could easily occur,
may go unrecognized.
The dynamic rule-generation approach has been used for anomaly detection
for quite some time. The generated rules are usually in if-then form.
First, the algorithm generates patterns that can be further processed into a
set of rules that determine which action should be taken. Methods based on
dynamic rule generation can solve the problem of continually updating attack
patterns (or rules) by looking for potential unknown attacks. The disadvantage
of this approach is the complexity of analyzing multi-dimensional data.
Algorithms capable of processing such data often have high computational
complexity. Therefore, it is necessary to reduce the dimension of the data
as much as possible.
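To make the if-then form concrete, the sketch below learns a small decision tree with scikit-learn and prints it as nested conditions using `export_text`. The two features (`url_length`, `special_char_count`) and the six training rows are hypothetical toy values chosen only for illustration; they are not the features or data used in this work.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features describing a request: URL length and number of
# special characters. Labels: 0 = normal, 1 = abnormal.
feature_names = ["url_length", "special_char_count"]
X = [[12, 0], [20, 1], [15, 0], [64, 9], [80, 12], [55, 7]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the learned tree as nested "feature <= threshold"
# conditions ending in a class, which reads as a set of if-then rules.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Applying the learned rules to a new, unseen request.
prediction = tree.predict([[70, 10]])[0]
```

Each root-to-leaf path in the printed tree corresponds to one rule, so an administrator can read the learned conditions directly.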
Our work proposes the development of an anomaly detection system for web
services based on dynamically generating rules using machine learning
techniques. The data used in the analysis process are log entries from web
services. In particular, Section 2 presents research on the use of machine
learning techniques for the generation of anomaly detection rules. Section 3
describes the proposed anomaly detection system of web access as well as how
to collect and build the dataset for building model that automaticall ... ZAP
is set to maximum operating mode to obtain the most types of access
(hijacking, XSS, SQL injection, etc.) as well as the maximum number of
generated samples. These samples are saved as semi-structured files for later
processing (e.g., anomaly.csv). At the same time, ZAP generates reports that
show specific security issues of the web services of interest. These are
collected and saved in a separate file (e.g., blacklist.csv).
In addition to security analysis tools, ZAP provides a mechanism for scanning
web service structures through its web spider and AJAX spider services.
These tools collect information about the structure of service pages
and save it in a text file (e.g., normal.csv). This work used the ZAP tool
to collect web service access data for 4 test websites with the following
volumes: about 300,000 abnormal queries, 200,000 normal accesses, and about
6,000 web pages with security issues (blacklist).
In addition to the data generated through the ZAP toolkit, log data from the
four test websites is also used to generate the dataset. An access by a user
with an HTTP code outside of the normal range (return code > 200) is
considered an abnormal access due to an error. In addition, accesses in the
log files that coincide with those in the anomaly.csv and blacklist.csv files
are also considered abnormal. This way of identifying unusual accesses from
the log file is not entirely straightforward, but it is still useful because,
from an administrative point of view, accesses that cause web service errors
should raise an alert.
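The labeling criteria described above can be sketched in a few lines of Python. The function name `label_access`, the (path, status) layout of a log entry, and the in-memory sets standing in for anomaly.csv and blacklist.csv are assumptions made for this example; the column layout of the actual files is not given in the text.

```python
def label_access(path, status, anomaly_paths, blacklist_paths):
    """Label one log entry as 'abnormal' or 'normal' using the two criteria."""
    # Criterion 1: an HTTP return code above 200 is treated as an error.
    if status > 200:
        return "abnormal"
    # Criterion 2: the request coincides with a known abnormal or blacklisted one.
    if path in anomaly_paths or path in blacklist_paths:
        return "abnormal"
    return "normal"


# Hypothetical stand-ins for the contents of anomaly.csv and blacklist.csv.
anomaly_paths = {"/Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468"}
blacklist_paths = {"/admin/backup.zip"}

# Hypothetical (request path, HTTP return code) pairs parsed from a web log.
log_entries = [
    ("/index.html", 200),
    ("/missing-page", 404),
    ("/admin/backup.zip", 200),
    ("/Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468", 200),
]
labels = [label_access(p, s, anomaly_paths, blacklist_paths) for p, s in log_entries]
print(labels)
```

Taken literally, the return-code criterion also marks redirects (3xx) as abnormal, which is one reason this labeling is described as not entirely straightforward.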
Data from the web log files of the 4 test sites was sampled over 4 to 5 days
for each site. The data collected included nearly 340,000 normal and nearly
56,000 abnormal accesses. Among the test sites, 1 site has an average data
volume significantly lower (about 25%) compared to the remaining sites.
The data collected from the log files, combined with the data generated by
the ZAP toolkit and with duplicates eliminated, forms the dataset for building
the classification model. This dataset is fairly balanced, including nearly
470,000 normal and 380,000 abnormal accesses.
4.4 Experiment and evaluation
4.4.1 Performance experiment
As mentioned above, an anomaly detection system for web access was developed
using Python 3.6 and the scikit-learn library. The dataset described above is
divided in a ratio of 7 : 3 for training and testing, corresponding to
554,000 : 238,000 accesses. These accesses are represented by TF-IDF
features, and the model is learned with a random forest. The parameters used
in building the classification model are set to their default values. Test
results show that the accuracy of the classification model is about 95%.
Table 2 details the classification of normal and abnormal accesses by the
learned model.
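The training setup described above can be sketched with scikit-learn as follows. This is only a minimal illustration under stated assumptions: the request strings are hypothetical toy data, and the character n-gram tokenization passed to `TfidfVectorizer` is an assumption, since the tokenization used in this work is not given here. The random forest keeps its default parameters, as in the text; `random_state` is fixed only to make the example repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy requests; 0 = normal, 1 = abnormal.
normal = [
    "/index.html", "/news/list?page=2", "/contact", "/images/logo.png",
    "/products?id=17", "/about-us", "/search?q=laptop", "/login",
]
abnormal = [
    "/Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd",
    "/search?q=%3Cscript%3Ealert(1)%3C%2Fscript%3E",
    "/item?id=1%27+OR+%271%27%3D%271",
    "/Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini",
    "/view?file=..%2f..%2fboot.ini",
    "/page?id=1%3B+DROP+TABLE+users",
    "/q?x=%2527%253e%253cscript%253e",
    "/download?path=..%2f..%2f..%2fetc%2fshadow",
]
requests = normal + abnormal
labels = [0] * len(normal) + [1] * len(abnormal)

# 7 : 3 split between training and testing, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    requests, labels, test_size=0.3, stratify=labels, random_state=0
)

model = make_pipeline(
    # Assumption: character 1- to 3-grams as the TF-IDF terms.
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    RandomForestClassifier(random_state=0),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

On a dataset this small the predictions say nothing about real accuracy; the point is only the shape of the pipeline, from raw request strings to TF-IDF vectors to a random forest.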
Table 2: Confusion matrix
            Normal     Abnormal
Normal      139,167    3,039
Abnormal    8,311      105,578
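As a worked check, the figures in Table 2 can be turned into the usual summary metrics. The sketch assumes that rows are the actual classes and columns the predicted classes, which the table does not state explicitly; under that assumption the accuracy comes out at roughly 95.6%, consistent with the reported value of about 95%.

```python
# Counts from Table 2. Assumption: rows = actual class, columns = predicted class.
normal_as_normal, normal_as_abnormal = 139_167, 3_039
abnormal_as_normal, abnormal_as_abnormal = 8_311, 105_578

total = (normal_as_normal + normal_as_abnormal
         + abnormal_as_normal + abnormal_as_abnormal)

# Share of all test accesses that were classified correctly.
accuracy = (normal_as_normal + abnormal_as_abnormal) / total
# Of the accesses flagged as abnormal, the share that really were abnormal.
precision_abnormal = abnormal_as_abnormal / (abnormal_as_abnormal + normal_as_abnormal)
# Of the truly abnormal accesses, the share that were flagged.
recall_abnormal = abnormal_as_abnormal / (abnormal_as_abnormal + abnormal_as_normal)

print(f"accuracy={accuracy:.4f}")
print(f"precision(abnormal)={precision_abnormal:.4f}")
print(f"recall(abnormal)={recall_abnormal:.4f}")
```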
The results obtained from building the classification model are relatively
good. The resulting model was then applied to the access data obtained from
the log files, producing the results shown in Table 3.
The data in the table shows that at some points the number of user accesses
is double the average at other times. The number of abnormal accesses
fluctuates and is not proportional to the users' access volume. Figure 5
shows the percentage of hits checked (those that do not coincide with other
accesses) and of anomalies among the queries tested. During the observed
period, the number of anomalies fluctuated around 2% of the total number of
queries examined. In particular, in the 3rd and 4th observation periods, the
rate of abnormalities increased sharply.
Several abnormal user accesses are shown in Table 4. Accesses 1 to 4
are relatively clear attacks on web services. Accesses 5 and 6 are less
obvious, or are attempts to browse the directory structure of a website.
However, these behaviors still need to be monitored by administrators.
Table 3: Anomaly detection results
Total access    Total check    Abnormal
3,759,770       39,311         238
917,004         7,989          102,445
1,486,577       30,884         2,718
682,988         25,936         1,291
1,640,803       32,530         262
1,356,999       30,716         147
1,007,064       27,002         60
1,864,161       30,051         74
2,434,863       35,028         204
1,788,273       30,232         458
1,895,390       44,230         116
1,553,298       38,047         90
Figure 5: Percentage of prediction results
4.4.2 Run-time experiment
Although the TF-IDF model gives better detection quality on the CSIC
dataset, it performs about as well as the common-feature model on the
dataset produced with the ZAP tool, with accuracies of 95.57% and 96.01%,
respectively. For the training phase, using the same machine and dataset,
it takes almost 7 minutes to construct the TF-IDF model, compared with about
half a minute to build the common-feature model. Therefore, it is more
important and interesting to investigate the run-time of these two models
during the testing phase (anomaly detection).
Table 4: Abnormal access
No.  Abnormal access
1    /Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd&sid=1293&pageid=32306
2    /Default.aspx?sname=http%3a%2f%2fwww.google.com+&sid=1293&pageid=32306
3    /wps/wcm/connect/309b0a0042eaedc58881ccd8919db02e/HINHLON.bmp?MOD=AJPERES
4    /Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468
5    /public/upload nhieuanh/server/php/ index.php
6    /vanban.aspx?type=%2527%253e%253c%2500rhLvZ%253e
These models, namely the TF-IDF and common-feature models, are used to
detect anomalies in the datasets collected on 8 different days, and each
day's dataset is run 5 times. The average run-time of these models
is recorded in seconds and shown in the table. As shown in the table, the
common-feature model performs better provided that the number of log records
is below 600,000, but it is considerably slower when the number of log
records exceeds 1 million. Despite the simplicity of computing its features,
the common-feature model cannot keep pace with the TF-IDF model as the data
size increases.
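The measurement procedure described above, running detection 5 times on a dataset and averaging the run-time in seconds, can be sketched as follows. The helper `average_runtime` and the stand-in `detect` function are hypothetical; in the experiment the detector would be the trained TF-IDF or common-feature model.

```python
import time
from statistics import mean


def average_runtime(detect, records, repeats=5):
    """Run `detect` on `records` `repeats` times and return the mean time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        detect(records)
        durations.append(time.perf_counter() - start)
    return mean(durations)


calls = []


def detect(records):
    """Hypothetical stand-in detector: flags unusually long requests."""
    calls.append(len(records))
    return [len(record) > 40 for record in records]


# Hypothetical one-day dataset of request strings.
one_day = ["/index.html?page=%d" % i for i in range(10_000)]
avg_seconds = average_runtime(detect, one_day)
print(f"average run-time over 5 runs: {avg_seconds:.4f} s")
```

Averaging over repeated runs smooths out one-off delays such as disk caching, which matters when comparing two models on the same machine.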
4.5 Discussions
Applying rule generation techniques to anomaly detection helps administrators
easily visualize how the detection system works. Machine learning techniques
using the decision tree algorithm allow anomaly detection rules to be
developed quickly and efficiently. This technique is also a common and
typical one for detecting anomalies, based on generating a classification
tree of anomalous behavior. The experimental results showed that using TF-IDF
features to represent user behavior from access log data achieves good
results compared to other representations.
The advantage of a decision-tree-based technique is that training is
fast, but the effectiveness of the detection system depends on the quality of
the dataset used to build the analytical model. The report proposes a way to
build a dataset that meets the need for detecting anomalies and monitoring
user access based on the ZAP security tool. This dataset is stored in
semi-structured files including anomaly (anomaly.csv) and normal (normal.csv)
samples, and blacklists (blacklist.csv). This greatly supports the analysis
and monitoring of web services with limited resources. The simple
semi-structured file format allows the administrator to append malicious or
normal accesses. In other words, administrators or operators of web services
can maintain a library of user access behaviors suited to their own needs.
Figure 6: Comparison of the run-time of the two models (TF-IDF and
common-feature) during testing phase (detecting anomaly)
The proposed anomaly detection system uses an ensemble of decision tree
algorithms, namely random forests, chosen on the basis of training speed,
overall performance, and anomaly detection performance. On the other hand,
other algorithms, such as support vector machines (SVM) or deep learning
techniques, could also improve the detection performance of the resulting
anomaly classification model. However, these techniques are limited by the
complexity of deployment and by training time, and may require special
hardware and software (especially deep learning techniques).
The proposed system has been tested experimentally on a fairly large amount
of data, up to millions of user accesses. The Python environment and the
NoSQL MongoDB database combine well when handling such volumes. However, the
proposed system is not really geared towards handling big data in the way the
Apache Spark platform is. Even so, large data processing platforms often
provide toolkits that connect to the Python environment because of its
popularity. It is therefore possible to integrate and extend the proposed
system with large data processing platforms such as Apache Spark.
5 Conclusion
With the increasing popularity of web services, administering and
monitoring user behaviors becomes ever more urgent to ensure the quality of
service as well as the security of the web services. Anomaly detection in web
services can range from detecting users' misuse to detecting malicious intent
that degrades the quality of the website service or commits fraud.
The paper explores how to detect unusual accesses from log data on a web
server based on automatic rule generation using the random forest algorithm.
User access to the services is represented by TF-IDF features because of
their detection performance. In addition, the report presented a method
to build and maintain a dataset for developing an anomaly classification
model based on the ZAP security tool. Administrators can easily maintain and
update datasets according to their own management and supervision needs.
Testing of the proposed anomaly detection system shows that the system works
relatively well, reaches 95% detection accuracy, and is capable of processing
and monitoring query data volumes of up to millions of user query records.
In the future, anomaly detection systems could be further investigated to
incorporate more advanced machine learning algorithms to enhance detection
performance. In addition, the analysis of anomalous access behavior could be
made more detailed, distinguishing classes such as XSS, SQL injection, or
hijacking instead of only normal and abnormal as at present. Integration with
large data processing platforms is also a practical task to meet the needs of
administration and monitoring for large-scale web services.
