An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology. Pub Date: 2013-05-15. DOI: 10.1109/ECTICON.2013.6559650

Jakrarin Therdphapiyanak, K. Piromsopa


Citations: 21

Abstract

In this paper, we determine the appropriate number of clusters and the proper number of entries for applying K-means clustering to a TCPdump data set using the Apache Mahout/Hadoop framework. We aim to find a suitable configuration for efficiently analyzing a large data set in a limited amount of time. Our implementation applies Hadoop to large-scale log analysis, with the data set from the KDD'99 competition as test data. With the distributed system framework, we can analyze the whole KDD'99 data set after first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection more accurate. For the K-means algorithm, a key challenge is to set an appropriate initial number of clusters (K). Moreover, we discuss whether the number of entries in the log files affects the accuracy and detection rate of the system. Our implementation and experimental results therefore identify the appropriate number of clusters and the proper number of entries in log files. Finally, we present our experimental results as a graph of accuracy rate against the initial number of clusters (K), an ROC curve, and a table of detection rates and false alarm rates.
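The evaluation loop described in the abstract (cluster the records for a given K, then report detection rate and false alarm rate) can be illustrated with a small single-machine sketch. This is not the authors' Mahout/Hadoop implementation: the function names are hypothetical, the K-means is a plain in-memory Lloyd's iteration, and the rule that flags a cluster as anomalous when most of its labeled members are attacks is an assumption made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's K-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def flag_anomalous_clusters(assignments, labels, k):
    """Assumed rule: a cluster is anomalous if most members are attacks (label 1)."""
    flagged = set()
    for c in range(k):
        member_labels = [l for l, a in zip(labels, assignments) if a == c]
        if member_labels and 2 * sum(member_labels) > len(member_labels):
            flagged.add(c)
    return flagged


def detection_and_false_alarm_rate(labels, predictions):
    """Detection rate = TP / (TP + FN); false alarm rate = FP / (FP + TN)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)
    tn = sum(1 for l, p in pairs if l == 0 and p == 0)
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_alarm_rate


def evaluate_k(points, labels, k):
    """Cluster with a given K and score the resulting anomaly predictions."""
    _, assignments = kmeans(points, k)
    flagged = flag_anomalous_clusters(assignments, labels, k)
    predictions = [1 if a in flagged else 0 for a in assignments]
    return detection_and_false_alarm_rate(labels, predictions)
```

Calling `evaluate_k` over a range of K values, and over subsets of different sizes, would give the kind of per-K comparison the abstract describes, under the assumptions stated above.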


