Testing of Algorithms for Anomaly Detection in Big Data Using Apache Spark

2017 9th International Conference on Computational Intelligence and Communication Networks (CICN). Pub Date: 2017-09-01. DOI: 10.1109/CICN.2017.8319364

S. Lighari, D. Hussain


Citations: 13

Abstract

The steady growth in the size of networks and in the volume of data they produce has made data analysis very challenging, particularly once the data reaches big-data scale, where detecting intrusions becomes even more difficult. At present, experts have very few tools and methods for analyzing big data for security purposes. We must either devise new tools or use existing tools in novel ways to achieve big data security analysis. In this paper, we use Apache Spark, a big data tool, to analyze a large dataset for anomaly detection. Anomaly detection is performed with several machine learning algorithms: logistic regression, support vector machine, naïve Bayes, decision trees, random forest, and k-means. All of these algorithms can detect anomalies in big data to some degree, but we need to know how efficiently each performs. The main objective of this investigation is to find the most efficient algorithm for anomaly detection. To that end, we compare their training time, prediction time, and accuracy. The analysis was carried out on the KDD Cup 99 (Kddcup99) dataset. Although this dataset is only megabytes in size, it serves our purpose of big data security analytics.
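The comparison described in the abstract (training time, prediction time, and accuracy per algorithm) can be sketched as a small timing harness. This is a minimal illustration only, not the authors' code: it uses plain Python rather than Apache Spark, the two toy classifiers (`MajorityClass`, `NearestCentroid`) are hypothetical stand-ins for the Spark models named above, and the data is a tiny synthetic set, not KDD Cup 99.

```python
import time
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Result:
    name: str
    train_seconds: float
    predict_seconds: float
    accuracy: float


class MajorityClass:
    """Toy baseline (hypothetical): always predicts the most frequent training label."""

    def fit(self, X: Sequence[Sequence[float]], y: List[int]) -> "MajorityClass":
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        return [self.label for _ in X]


class NearestCentroid:
    """Toy model (hypothetical): assigns each point to the closest per-class mean."""

    def fit(self, X: Sequence[Sequence[float]], y: List[int]) -> "NearestCentroid":
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda label: dist2(x, self.centroids[label]))
            for x in X
        ]


def benchmark(name, model, X_train, y_train, X_test, y_test) -> Result:
    """Time fit and predict separately, then score accuracy on the test set."""
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    t1 = time.perf_counter()
    predictions = model.predict(X_test)
    t2 = time.perf_counter()
    correct = sum(p == t for p, t in zip(predictions, y_test))
    return Result(name, t1 - t0, t2 - t1, correct / len(y_test))


# Synthetic data (assumption): label 0 = normal traffic, label 1 = anomaly.
X_train = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [5.0, 5.1], [4.8, 5.3]]
y_train = [0, 0, 0, 1, 1]
X_test = [[0.1, 0.1], [5.2, 4.9], [0.3, 0.2], [4.9, 5.0]]
y_test = [0, 1, 0, 1]

results = [
    benchmark("majority-class", MajorityClass(), X_train, y_train, X_test, y_test),
    benchmark("nearest-centroid", NearestCentroid(), X_train, y_train, X_test, y_test),
]

for r in results:
    print(f"{r.name}: train={r.train_seconds:.6f}s "
          f"predict={r.predict_seconds:.6f}s accuracy={r.accuracy:.2f}")
```

Timing training and prediction separately matters because a model that is slow to train may still be fast to apply. On a distributed engine such as Spark, the timed region would also need to force evaluation of the result, since transformations there are evaluated lazily.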




Source journal

2017 9th International Conference on Computational Intelligence and Communication Networks (CICN)

Self-citation rate

0.00%

Articles published