Anomaly detection system of web access using user behavior features
Southeast Asian J. of Sciences: Vol 7, No 2, (2019) pp. 115-132

Pham Hoang Duy, Nguyen Thi Thanh Thuy and Nguyen Ngoc Diep
Department of Information Technology
Posts and Telecommunications Institute of Technology (PTIT)
Hanoi, Vietnam
duyph@ptit.edu.vn; thuyntt@ptit.edu.vn; diepnguyenngoc@ptit.edu.vn

Abstract

The growth and accessibility of the Internet and the explosion of personal computing devices have driven robust growth in web applications, especially for e-commerce and public services. Unfortunately, the vulnerabilities of these web services have also increased rapidly. This creates a need to monitor user access to these services and to distinguish abnormal and malicious behaviors in the log data, in order to ensure both the quality and the safety of the services. This work presents methods to build a rule-based system that allows service administrators to detect abnormal and malicious accesses to their web services from web logs. The proposed method investigates the characteristics of user behavior in the form of HTTP requests and extracts effective features to detect abnormal accesses precisely. Furthermore, this report proposes a way to collect and build datasets so that machine learning techniques can generate detection rules automatically. The anomaly detection system was tested and its performance evaluated on 4 different websites with approximately one million log lines per day.

Key words: Anomaly detection system, web log, rule generation, user behavior, TF-IDF.

2010 AMS Mathematics classification: 91D10, 68U115, 68U35, 68M14, 68M115, 68T99.

1 Introduction

Anomaly detection refers to the problem of finding patterns in data that do not match expected behavior.
These nonconforming patterns are often referred to as anomalies, exceptions, contradictory observations, or irregularities, depending on the application domain. With the development of the Internet and web applications, anomaly detection in web services can range from detecting misuse to detecting malicious intent that degrades the quality of the website service or commits fraud. For web services, analytic techniques need to transform the original raw data into an appropriate form that describes the session information or the amount of time a user interacts with the services provided by the website.

In monitoring users' access to web services, rule-based anomaly detection is commonly used because it is accessible and readable to service administrators. There are two basic approaches to generating rules. The first relies on rules created manually and statically by service administrators as they analyze users' behaviors. The second creates rules dynamically and automatically using data mining or machine learning techniques.

For static rule generation, it is first necessary to construct a scenario of the situation that the administrator wants to simulate. For example, if one process running on one device and another process running on another device at the same time together cause a security issue, the administrator needs to model this scenario. Besides creating these rules, administrators must enforce the correlation among them to verify whether a case is an anomaly or not. A rule can contain many parameters, such as time frame, repeating pattern, service type, port, etc. The algorithm then checks the data from the log files and finds attack scenarios or unusual behaviors. The advantage of this approach is the ability to detect anomalous access behaviors through correlation analysis, and thus to detect intruders that are otherwise difficult to detect.
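
As an illustration of the static approach, the sketch below encodes one hand-written rule with a time frame and a repetition threshold and checks it against parsed log events. The `StaticRule` fields, the `apply_rule` helper, the regular expression and the sample events are all hypothetical choices made for this example; they are not taken from the system presented in this paper.

```python
import re
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

# Hypothetical event shape: (timestamp in seconds, client address, request path).
Event = Tuple[float, str, str]


@dataclass(frozen=True)
class StaticRule:
    """One hand-written rule: too many matching requests inside a time frame."""
    name: str
    pattern: str           # regular expression applied to the request path
    window_seconds: float  # time frame in which hits are counted
    max_hits: int          # more hits than this inside the window raises a flag


def apply_rule(rule: StaticRule, events: Iterable[Event]) -> Set[str]:
    """Return the clients that exceed rule.max_hits within rule.window_seconds.

    Events are assumed to be sorted by timestamp.
    """
    regex = re.compile(rule.pattern, re.IGNORECASE)
    recent = defaultdict(deque)  # client -> timestamps of recent matching hits
    flagged = set()
    for timestamp, client, path in events:
        if not regex.search(path):
            continue
        hits = recent[client]
        hits.append(timestamp)
        # Drop hits that have fallen out of the time frame.
        while timestamp - hits[0] > rule.window_seconds:
            hits.popleft()
        if len(hits) > rule.max_hits:
            flagged.add(client)
    return flagged


# Illustrative rule: repeated path-traversal probes within one minute.
traversal_rule = StaticRule(
    name="repeated path traversal",
    pattern=r"(\.\.|%2e%2e)(/|%2f)",
    window_seconds=60,
    max_hits=2,
)

sample_events = [
    (0.0, "10.0.0.5", "/Default.aspx?sname=..%2f..%2fetc%2fpasswd"),
    (5.0, "10.0.0.9", "/index.html"),
    (10.0, "10.0.0.5", "/Default.aspx?sname=..%2f..%2fboot.ini"),
    (20.0, "10.0.0.5", "/Default.aspx?sname=../../etc/shadow"),
    (30.0, "10.0.0.9", "/Default.aspx?sname=..%2fetc%2fpasswd"),
]

flagged_clients = apply_rule(traversal_rule, sample_events)
print(flagged_clients)
```

Each additional attack scenario needs another rule of this kind, written and tuned by hand.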
There are specific languages for creating rules, as well as tools that make rule creation effective and easy. For companies with appropriate resources and budget, it is easier and more convenient to buy a set of rules than to use several systems to create specific rules. The downside of this approach is high cost, especially for maintaining the set of rules. Modeling each attack scenario is not an easy or trivial task. The same type of attack is likely to be repeated, and sometimes the anomaly cannot be identified. In addition, attack patterns can change, and new forms of attack are invented every day. As such, the set of rules must evolve over time, and even then some unspecified attacks that could easily occur may go unrecognized.

The dynamic rule-generation approach has been used for anomaly detection for quite some time. The generated rules are usually in if-then form. First, the algorithm generates patterns that can be further processed into a set of rules that determine which action should be taken. Methods based on dynamic rule generation can solve the problem of continually updating attack patterns (or rules) by looking for potential unknown attacks. The disadvantage of this approach is the complexity of analyzing data in a multi-dimensional space. Algorithms capable of processing such data often have high computational complexity. Therefore, it is necessary to reduce the dimension of the data as much as possible.

Our work proposes the development of an anomaly detection system for web services based on dynamically generating rules using machine learning techniques. The data used in the analysis process is log entries from web services. In particular, Section 2 presents research on the use of machine learning techniques for the generation of anomaly detection rules.
Section 3 describes the proposed anomaly detection system of web access as well as how to collect and build the dataset for building model that automaticall ...

ZAP is set to its maximum operating mode to obtain the widest range of access types (hijacking, XSS, SQL injection, etc.) as well as the maximum number of generated samples. These samples are saved as semi-structured files for later processing (e.g., anomaly.csv). At the same time, ZAP generates reports that show specific security issues of the web services of interest. These are collected and saved in a separate file (e.g., blacklist.csv).

In addition to security analysis tools, ZAP provides a mechanism for scanning web service structures through its web spider and AJAX spider services. These tools collect information about the structure of service pages and save it in a text file (e.g., normal.csv). This work used the ZAP tool to collect web service access data for 4 test websites, with the following volume: about 300,000 abnormal queries, 200,000 normal accesses, and about 6,000 web pages with security issues (blacklist).

In addition to the data generated through the ZAP toolkit, log data from the four test websites is also used to generate the dataset. An access with an HTTP code outside of the normal range (return code > 200) is considered an abnormal access due to an error. In addition, accesses in the log files that coincide with entries in the anomaly.csv and blacklist.csv files are considered abnormal. This way of identifying unusual accesses from the log file is not entirely straightforward, but it is still useful because, from an administrative point of view, access to a faulty web service should raise an alert.

Data from the web log files of the 4 test sites was sampled over 4 to 5 days for each site. The data collected included nearly 340,000 normal and nearly 56,000 abnormal accesses. Among the test sites, 1 site has an average data volume significantly lower (about 25%) than the remaining sites.

The data collected from the log files, combined with the data generated from the ZAP toolkit and with duplicates eliminated, forms the dataset for building the classification model. This dataset is fairly balanced, with nearly 470,000 normal and 380,000 abnormal accesses.

4.4 Experiment and evaluation

4.4.1 Performance experiment

As mentioned above, the anomaly detection system for web access was developed with Python 3.6 and the scikit-learn library. The dataset described above is divided in a ratio of 7 : 3 for training and testing, corresponding to 554,000 : 238,000 accesses. These accesses are represented by TF-IDF features, and the classifier is trained with random forest learning. The parameters used in building the classification model are left at their default values. Test results show that the accuracy of the classification model is about 95%. Table 2 details the classification of normal and abnormal accesses by the learned model.

Table 2: Confusion matrix

            Normal     Abnormal
Normal      139,167    3,039
Abnormal    8,311      105,578

The results obtained from building the classification model are relatively good. The classification model was then applied to the access data obtained from the log files, producing the results shown in Table 3. The data in the table shows that at some points the number of user accesses is double the average at other times. The number of anomalies does not vary in proportion to the user access volume. Figure 5 shows the percentage of hits checked (those that do not coincide with other accesses) and of anomalies among the queries tested. During the observed period, the number of anomalies fluctuated around 2% of the total number of queries examined. In particular, in the 3rd and 4th observation periods, the rate of abnormalities increased sharply. Several abnormal user accesses are shown in Table 4.
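
The accuracy figure can be checked directly from the counts in Table 2. The short calculation below does this; the per-class precision and recall it also prints assume that the rows of the table are true labels and the columns are predicted labels, which the table does not state explicitly.

```python
# Counts copied from Table 2.
# Assumption: rows are true labels, columns are predicted labels.
normal_as_normal = 139_167
normal_as_abnormal = 3_039
abnormal_as_normal = 8_311
abnormal_as_abnormal = 105_578

total = (normal_as_normal + normal_as_abnormal
         + abnormal_as_normal + abnormal_as_abnormal)

# Accuracy does not depend on the row/column orientation.
accuracy = (normal_as_normal + abnormal_as_abnormal) / total

# Precision and recall for the abnormal class (orientation-dependent).
abnormal_precision = abnormal_as_abnormal / (abnormal_as_abnormal + normal_as_abnormal)
abnormal_recall = abnormal_as_abnormal / (abnormal_as_abnormal + abnormal_as_normal)

print(f"accuracy           = {accuracy:.4f}")
print(f"abnormal precision = {abnormal_precision:.4f}")
print(f"abnormal recall    = {abnormal_recall:.4f}")
```

The accuracy evaluates to roughly 0.956, in line with the reported figure of about 95%.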
Accesses 1 to 4 are relatively clear attacks on web services. Accesses 5 and 6 are not obvious attacks; they may be attempts to browse the directory structure of a website. However, these behaviors still need to be monitored by administrators.

Table 3: Anomaly detection results

Total access    Total check    Abnormal
3,759,770       39,311         238
917,004         7,989          102,445
1,486,577       30,884         2,718
682,988         25,936         1,291
1,640,803       32,530         262
1,356,999       30,716         147
1,007,064       27,002         60
1,864,161       30,051         74
2,434,863       35,028         204
1,788,273       30,232         458
1,895,390       44,230         116
1,553,298       38,047         90

Figure 5: Percentage of prediction results

Table 4: Abnormal access

No.  Abnormal access
1    /Default.aspx?sname=..%2f..%2f..%2fetc%2fpasswd&sid=1293&pageid=32306
2    /Default.aspx?sname=http%3a%2f%2fwww.google.com+&sid=1293&pageid=32306
3    /wps/wcm/connect/309b0a0042eaedc58881ccd8919db02e/HINHLON.bmp?MOD=AJPERES
4    /Default.aspx?sname=c%3A%2FWindows%2Fsystem.ini&sid=4&pageid=468
5    /public/upload nhieuanh/server/php/index.php
6    /vanban.aspx?type=%2527%253e%253c%2500rhLvZ%253e

4.4.2 Run-time experiment

Although the TF-IDF model gives better detection quality on the CSIC dataset, it performs about as well as the common-feature model on the dataset built with the ZAP tool, with accuracies of 95.57% and 96.01%, respectively. For the training phase, using the same machine and dataset, it takes almost 7 minutes to construct the TF-IDF model compared with about half a minute to build the common-feature model. Therefore, it is more important and interesting to investigate the run-time of the two models during the testing phase (anomaly detection).

The two models, TF-IDF and common-feature, were used to detect anomalies in datasets collected on 8 different days, and each day's dataset was run 5 times.
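
The measurement procedure described above (several runs per dataset, averaged) can be sketched as follows. The `average_runtime` helper and the placeholder detector are hypothetical stand-ins; the paper does not describe its actual timing harness, and a real measurement would pass in the prediction function of the trained TF-IDF or common-feature model.

```python
import statistics
import time
from typing import Callable, Sequence


def average_runtime(detect: Callable[[Sequence[str]], object],
                    records: Sequence[str],
                    repeats: int = 5) -> float:
    """Run detect(records) `repeats` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        detect(records)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def placeholder_detector(records: Sequence[str]) -> list:
    """Stand-in for a trained model: flags requests containing '..' (illustrative only)."""
    return [".." in record for record in records]


# Hypothetical one-day sample of request lines.
one_day = ["/index.html", "/Default.aspx?sname=..%2f..%2fetc%2fpasswd"] * 1000

mean_seconds = average_runtime(placeholder_detector, one_day, repeats=5)
print(f"mean run-time over 5 runs: {mean_seconds:.6f} s")
```

Averaging over repeated runs reduces the influence of one-off delays such as disk caching or other processes on the same machine.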
The average run-time of these models is recorded in seconds and illustrated in the table. As shown in the table, the common-feature model is faster when the number of log records is below 600,000 but considerably slower when the number of log records is above 1 million. Despite the simplicity of computing its features, the common-feature model cannot keep pace with the TF-IDF model as the data size increases.

Figure 6: Comparison of the run-time of the two models (TF-IDF and common-feature) during testing phase (detecting anomaly)

4.5 Discussions

Applying rule generation techniques to anomaly detection helps administrators easily visualize how the detection system works. Machine learning techniques using the decision tree algorithm allow anomaly detection rules to be developed quickly and efficiently. This technique is also one of the common and typical ones for detecting anomalies, based on generating a classification tree of anomalous behaviors. The experimental results showed that using TF-IDF features to represent user behavior from access log data achieves good results compared to other representations.

The advantage of a decision-tree-based technique is that training is fast, but the effectiveness of the detection system depends on the quality of the dataset used to build the analytical model. The report proposes a way to build a dataset that meets the need for detecting anomalies and monitoring user access, based on the ZAP security tool. This dataset is stored in semi-structured files comprising anomalous samples (anomaly.csv), normal samples (normal.csv), and blacklists (blacklist.csv). This greatly supports administrators in analyzing and monitoring web services with limited resources. The simple structure of these semi-structured files allows the administrator to append malicious or normal accesses. In other words, administrators or operators of web services can maintain a library of user access behaviors that suits their own needs.

The proposed anomaly detection system uses a group of decision tree algorithms, namely random forests, based on an assessment of training speed, overall performance and anomaly detection performance. On the other hand, other algorithms, such as support vector machines (SVM) or deep learning techniques, could also improve the detection performance of the resulting anomaly classification model. However, these techniques are limited by the complexity of deployment and by training time, and may require special hardware and software (especially deep learning techniques).

The proposed system has been used for an experimental analysis of a fairly large amount of data, up to millions of user accesses. The Python environment and the NoSQL MongoDB database combine quite well when handling such volumes. However, the proposed system is not really geared towards handling big data in the way the Apache Spark platform is. Even so, big data processing platforms often provide toolkits that connect to the Python environment because of its popularity, so it is possible to integrate and extend the proposed system with platforms like Apache Spark.

5 Conclusion

With the increasing popularity of web services, administering them and monitoring user behavior become ever more urgent in order to ensure the quality of service as well as the security of the web services. Anomaly detection in web services can range from detecting misuse by users to detecting malicious intent that degrades the quality of the website service or commits fraud.

The paper explores how to detect unusual accesses from log data on a web server based on automatic rule generation by applying random forest algorithms.
User access to the services is represented by TF-IDF features because of their detection performance. In addition, the report presented a method to build and maintain a dataset for developing an anomaly classification model, based on the ZAP security tool. Administrators can easily maintain and update datasets according to their own management and supervision needs. Testing shows that the proposed anomaly detection system works relatively well, reaches 95% detection accuracy, and is capable of processing and monitoring volumes of query data of up to millions of user query records.

In the future, anomaly detection systems could be further investigated to incorporate more advanced machine learning algorithms to enhance anomaly detection performance. In addition, the analysis of anomalous access behavior could be made more detailed, distinguishing categories such as XSS, SQL injection, or hijacking instead of only normal and abnormal as at present. Integration with big data processing platforms is also a practical task to meet the needs of administering and monitoring large-scale web services.