
LETTERS Int. J. of Recent Trends in Engineering and Technology, Vol. 2, No. 2, Nov 2009

Network Intrusion Detection Using Association Rules


Flora S. Tsai
Nanyang Technological University, School of Electrical & Electronic Engineering, Singapore
Email: fst1@columbia.edu

Abstract: Network intrusion detection includes identifying a set of malicious actions that compromise the integrity, confidentiality, and availability of information resources. The tremendous increase in novel cyber attacks has made data mining based intrusion detection techniques extremely useful for detecting them. This paper describes a system that is able to detect network intrusion using association rules. The technique is used to generate attack rules that detect the attacks in network audit data using anomaly detection. The results show that the modified association rules algorithm is capable of detecting network intrusions.

Index Terms: intrusion detection, association rules, cyber security, data mining

I. INTRODUCTION

With the proliferation of cyber security threats, such as malicious viruses and worms, denial of service (DoS) attacks, and online Internet fraud, achieving efficient network intrusion security is critical in protecting our information infrastructure. Intrusion detection systems (IDS) are a category of defense tools that provide warnings indicating that a system is under attack or intrusion. The IDS monitors activities within a network and alerts security administrators of suspicious activity [7]. Suspicious activities include intrusions on integrity, confidentiality, and availability. Integrity is compromised when intruders are able to modify the data, making the information unreliable. Confidentiality is breached when a non-privileged user is able to access the information. Availability is compromised when an attack, such as a DoS attack on the Internet, disrupts a web service and makes information unavailable.

Current methods of intrusion detection rely on analyzing network traffic data (or audit logs). The analysis consists of anomaly detection, where a user deviates from normal usage, or misuse detection, where the computer searches the logs for pre-defined attacks. In both areas, some form of user intervention is required, such as coding the attacks into the system or checking whether a deviation from normal usage is a true attack. In our work, we focus on anomaly detection.

Since audit logs contain rich information, data mining techniques, which can discover meaningful knowledge in large volumes of data, can be used to analyze the network audit logs and extract the data needed to build better detection models.
When data mining is applied to large volumes of network traffic data to search for patterns, it can provide valuable insights into attack patterns, thus allowing us to build a more effective detection model. These insights can identify new signatures as and when they appear, reduce the reliance on administrators' experience and intuition in detecting a new intrusion, and protect against constantly changing future threats.

Earlier studies on network anomaly detection include a variant of the single-linkage clustering algorithm used to discover outliers in the training dataset [6]. Once the normal patterns are separated from outlier patterns, the clusters of normal data are used to construct a supervised detection model. Other work includes a geometric framework for anomaly detection, in which patterns are mapped to a new feature space and anomalies are detected by searching for patterns that lie in sparse regions of that space [3]. Three different classification techniques are used: a clustering algorithm, a kNN algorithm and an SVM-based one-class classifier. Past studies on detection of cyber threats include analyzing blogs using Probabilistic Latent Semantic Analysis [8] and Latent Dirichlet Allocation [9]. Other studies on intrusion detection include a multi-stage classification system for detecting intrusions in computer networks [2] and a latent class modeling approach to detect network intrusion [10]. In contrast to earlier work, our work focuses on association rules, which can reveal associations in network security data.

II. DESIGN AND METHODOLOGY

A. Dataset

The network audit data used is the 1999 KDD intrusion detection contest dataset [4], which is a modified version of the data from the 1998 DARPA Intrusion Detection Evaluation Program. The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records in a text file. Each entry in the text file corresponds to a connection, where a connection is a sequence of TCP packets starting and ending at some well-defined times. Each connection runs between a source IP address and a target IP address under some well-defined protocol.
Each connection record consists of about 100 bytes and is labelled either as normal or as an attack, with exactly one of 22 specific attack types. In order to know how to read the audit data, we need to analyze how it is recorded. The audit data is processed for data mining purposes and split into two files: the training set, which contains around five million rows, and the test set, with 10% of the training set.

B. Association Rules Data Mining

Association rules mining started as a technique for finding interesting rules from transactional databases [1]. It was initially used to reveal associations in commercial data from a database of transactions, each representing the set of items purchased by a customer. The association analysis identifies items purchased together.

Association rules mining finds correlations between attributes. In our case, we are trying to determine the correlations between the attributes within the network data. As the correlation differs between attributes of the network data, association rules give the flexibility of determining different relations. The results of the mining are expressed as rules. The data mining algorithm used is modified from the apriori association rules algorithm. Rules can be viewed simply as an [If-Then-Else] structure, such as: If protocol=TCP Then AttackType=smurf.

Apriori is an influential algorithm for mining frequent item sets. It uses a level-wise search, where k-item sets (item sets that contain k items) are used to explore (k+1)-item sets. Put simply, the first-item sets are used to generate the second-item sets, which in turn generate the third-item sets, and so on until no more k-item sets can be found.

In this paper, we developed a network intrusion detection system that analyzes the various item sets generated, specifically the relation of each attribute with the AttackType attribute. This allows users to zoom in on the rules generated specifically with AttackType and use them as a basis to test whether each generated rule can detect the intrusion on the test set. Thus, we have modified the apriori algorithm so that it looks specifically at the relation with the AttackType column only.

III. INTRUSION DETECTION SYSTEM

A system to detect network intrusions was successfully built by integrating a relational database structure for the KDD network intrusion data, and was implemented using Java and MySQL.
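The level-wise search described in Section II-B can be illustrated with a short sketch. This is not the system's Java and MySQL implementation: the function names `frequent_item_sets` and `with_attack_type`, the dictionary record layout, and the `max_size` cap are assumptions made for illustration, and keeping only item sets that contain an AttackType item is just one plausible reading of the modification described in the text.

```python
from collections import Counter
from itertools import combinations

LABEL = "AttackType"  # label column name, as used in the text


def frequent_item_sets(records, min_support, max_size=3):
    """Level-wise (apriori-style) search over attribute=value items.

    records: list of dicts mapping attribute name -> value (hypothetical layout).
    min_support: minimum number of records that must contain an item set.
    Returns {frozenset of (attribute, value) pairs: support count}.
    """
    transactions = [frozenset(r.items()) for r in records]

    # Level 1: count every single attribute=value item.
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): n for i, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 1
    while current and k < max_size:
        k += 1
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) != k:
                continue
            # A record holds one value per attribute, so skip clashes early.
            if len({attr for attr, _ in union}) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(union, k - 1)):
                candidates.add(union)

        # Count the support of each surviving candidate.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
    return frequent


def with_attack_type(frequent):
    """Keep only item sets that include an AttackType item (assumed filter)."""
    return {s: n for s, n in frequent.items()
            if any(attr == LABEL for attr, _ in s)}
```

Filtering after mining keeps the sketch short. A variant that discards candidates without an AttackType item during the join step would do less counting, at the cost of no longer having the antecedent supports available.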
The system uses a modified version of apriori association rules mining to determine whether association rules can provide attack type rules that yield good intrusion detection. After the audit and training data are loaded, the Data Processing screen allows the user to generate the item sets. The item sets must be built before the rules can be generated from the training set. Once the training set is loaded into the database, rule building can proceed. The item sets are built according to the support level specified by the user.

Once the item sets are built, the next step is to extract useful rules from them. These are the attack rules that will be used to detect attacks. The confidence level is specified in the range from 1 to 100%. The program then finds rules that predict with at least that accuracy (the confidence level) in the training set. Once the attack rules are built, they must be tested on the test set, which gives the accuracy of each rule. The confidence is the accuracy in the training set, and the accuracy column gives the accuracy in the test set. The rules are read in a simple form, with an (If Clause) and a (Then Outcome), for example: If countConnection=511 Then AttackType=smurf. These rules allow the user to see which attributes, or which combinations of attributes, generate the most effective rules for detecting attacks.

The overall results show how the rules fare in general, by grouping the results into attack categories with respect to the rules generated by each item set. Instead of being reported per attack type, the attack types are grouped into their respective categories and shown with respect to the item sets. The ground truths are the number of attacks available per category. The percentage in each item set shows how the detection rate fares with respect to the ground truths.
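The distinction between confidence (accuracy on the training set) and accuracy (on the test set) can be made concrete with a small sketch. It is a simplified illustration rather than the system's implementation: it only builds rules with a single attribute in the If clause, and the names `extract_rules` and `accuracy_on` as well as the record layout are hypothetical.

```python
from collections import Counter

LABEL = "AttackType"  # label column name, as used in the text


def extract_rules(train, min_support, min_confidence):
    """Build 'If attribute=value Then AttackType=label' rules.

    min_support is a record count; min_confidence is a percentage (1 to 100).
    Only single-attribute If clauses are considered (a simplifying assumption).
    """
    antecedent_counts = Counter()  # records matching the If clause
    rule_counts = Counter()        # records matching If clause and outcome
    for record in train:
        for attr, value in record.items():
            if attr == LABEL:
                continue
            antecedent_counts[(attr, value)] += 1
            rule_counts[(attr, value, record[LABEL])] += 1

    rules = []
    for (attr, value, label), n in rule_counts.items():
        if n < min_support:
            continue
        confidence = 100.0 * n / antecedent_counts[(attr, value)]
        if confidence >= min_confidence:
            rules.append({"if": (attr, value), "then": label,
                          "confidence": confidence})
    return rules


def accuracy_on(rule, test):
    """Percentage of test records matching the If clause that carry the
    predicted attack type; None when the rule never fires on the test set."""
    attr, value = rule["if"]
    covered = [r for r in test if r.get(attr) == value]
    if not covered:
        return None
    hits = sum(1 for r in covered if r[LABEL] == rule["then"])
    return 100.0 * hits / len(covered)
```

A rule can reach 100% confidence on the training subset and still score lower on the test set, which is why the two figures are reported in separate columns.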
Using these, we can analyze the results to build a better intrusion detection model. For example, we can see which item sets generate the best rules, that is, the rules capable of detecting the most attacks.

The overall audit data is close to 5 million records (actual count = 4,898,430). Two different subsets of the 5 million records are taken for analysis. Subsets are sampled at each interval, tested and analyzed for a period of 3 days. The criterion for subset selection is that generation of the 1, 2 and 3 item sets should complete. The Support Level is set at 1948, 405 and 1, with the Confidence Level set at 90%. Using the results from the rule test, the View Percentage is first set at 100%, so that only rules that scored 100% accuracy are used; the other two settings use a View Percentage of 50% and 1%.

If the View Percentage is lowered, the resulting category detection rate does not differ much. As the item sets expand up to the 42-item set, the accuracy of detecting the category also increases. However, lowering the View Percentage does allow additional categories to be detected, as seen with Support Level=1 when the View Percentage drops from 100% to 50%. Note, though, that the rules responsible for this detection have accuracies between 50% and 100%: the category is detected, but never by a rule with 100% accuracy. If such a rule were deployed, it might generate false positives.

When the View Percentage is set to a lower limit, it allows lower-accuracy rules to pass, and the number of attack categories detected increases. The detection rate can reach 100% of all the attacks within the test set even at lower item sets, for example with 1% Support, 90% Confidence and 50% View Percentage.

Because the rules are trained from a subset, they are characterized by that subset. Comparing the two 50K subsets, the first set generates rules that can predict probing but does not predict DoS attacks.
However, the second set only predicts DoS attacks and does not predict probing. Furthermore, the test set also plays an important part: even though the sampled test set for the second 50K is different, it does not contain any attacks for probing, U2R and R2L, so no detection is available for these attack categories. Thus, even rules with 50% accuracy are able to detect 100% of the attack category. However, if the test set is changed, the rules may not detect any intrusions. For example, if the second test set is used with the first rule set, nothing will be detected, because the attacks those rules predict are absent from that test set. Therefore, to generate good rules, either several subsets need to be taken and processed to give more complete results, or fewer but larger subsets are needed.

The results generated from the 10% and 1% subsets show that as the item sets get larger (towards the 42-item set limit), the rule accuracy is expected to increase dramatically. As the item set increases, the accuracy of the category detection increases. Furthermore, by varying the View Percentage, it is possible to detect new attack categories.

The training of rules depends very much on the data. Different 50K subsets of training data can result in different rules. Therefore, we recommend taking a larger subset to generate a more comprehensive set of attack rules. Furthermore, several subsets are required so that the results can be averaged to provide a more complete picture. With a more comprehensive training subset, the rules generated can be more reliable.
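The effect of the View Percentage on the per-category detection rate can be sketched as follows. The text does not spell out how a detection is counted, so this sketch assumes that an attack record counts as detected when at least one retained rule matches it and names its attack type. The function names, the rule layout, and the sample values are hypothetical; the `CATEGORY` mapping lists only four attack types, using their usual KDD Cup 1999 categories.

```python
from collections import Counter

LABEL = "AttackType"  # label column name, as used in the text

# Partial attack-type-to-category mapping (illustrative subset only).
CATEGORY = {
    "smurf": "DoS",
    "neptune": "DoS",
    "portsweep": "probing",
    "ipsweep": "probing",
}


def fires(rule, record):
    """True when the record satisfies the If clause and the rule names the
    record's attack type (the assumed definition of a detection)."""
    matched = all(record.get(a) == v for a, v in rule["if"].items())
    return matched and rule["then"] == record[LABEL]


def detection_rate(rules, test, view_percentage):
    """Per-category share (in %) of attack records detected by rules whose
    test accuracy is at least view_percentage."""
    kept = [r for r in rules if r["accuracy"] >= view_percentage]
    ground_truth = Counter()  # attacks available per category
    detected = Counter()
    for record in test:
        category = CATEGORY.get(record[LABEL])
        if category is None:
            continue  # normal traffic or an attack type outside the mapping
        ground_truth[category] += 1
        if any(fires(rule, record) for rule in kept):
            detected[category] += 1
    # Categories with no attacks in the test set have no ground truth and
    # therefore no detection rate at all.
    return {c: 100.0 * detected[c] / n for c, n in ground_truth.items()}
```

Because the rate is computed only over categories present in the test set, a category with no attacks in that set does not appear in the result, which mirrors the observation that detection depends on what the sampled test set contains.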

IV. CONCLUSIONS

A system has been successfully developed to detect network intrusions using association rules. Modified association rule mining was used to generate the attack rules from the network audit data. The algorithms applied have effectively built the item sets from the training set. The rules are then built from the item sets and further tested on the test set. The system can display the results of the processing in terms of rule accuracy, which shows how the rules fare in the test set. In addition, the system can show the overall results, which display the item set versus the attack category accuracy; this allows the user or administrator to filter out unnecessary item sets and concentrate on those that produce more accurate results.

Rule accuracy is best viewed at 100%, as this provides more stable results than 90% and below. However, by allowing rules with lower accuracy to pass, more attack categories can be detected. Furthermore, the rule accuracy and the attack category accuracy depend heavily on how much training data is provided and how the training data is sampled. Different 50K subsets of training data produced different attack rules and ultimately detected differently. However, the system is able to detect intrusions depending on which rules are deployed into the intrusion detection attack list, based on the accuracy of the rules in predicting the attack category.

Therefore, we have demonstrated that the modified association rules algorithm is capable of generating rules for anomaly-based detection. The rules generated provide the means to detect attacks and ultimately reduce the need for human intervention.

REFERENCES
[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in very large databases, Proceedings of the ACM SIGMOD Conference, 1993.
[2] L.P. Cordella and C. Sansone, A multi-stage classification system for detecting intrusions in computer networks, Pattern Anal Applic, 10: 83-100, 2007.
[3] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data, In: D. Barbara, S. Jajodia (Eds.), Applications of Data Mining in Computer Security, Kluwer, 2002.
[4] D. Newman, KDD Cup 1999 Data, The UCI KDD Archive, Information and Computer Science, University of California, Irvine. Source: http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html
[5] Lincoln Laboratory, DARPA Intrusion Detection Evaluation, Massachusetts Institute of Technology. Source: http://www.ll.mit.edu/IST/ideval/index.html
[6] L. Portnoy, E. Eskin, S. Stolfo, Intrusion detection with unlabeled data using clustering, Proceedings of ACM CSS Workshop on Data Mining Applied to Security, DMSA-2001.
[7] F.S. Tsai and C.K. Chan (eds), Cyber Security, Pearson Education, Singapore, 2006.
[8] F.S. Tsai, K.L. Chan, Detecting cyber security threats in weblogs using probabilistic models, in: Intelligence and Security Informatics, vol. 4430, 2007, pp. 46-57.
[9] F.S. Tsai, K.L. Chan, Blog data mining for cyber security threats, in: Data Mining for Business Applications, 2009, pp. 169-182.
[10] Y. Wang, I. Kim, G. Mbateng, S.-Y. Ho, A latent class modeling approach to detect network intrusion, Computer Communications, 30, 93-100, 2006.
