Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream

2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery. Pub Date: 2008-10-18. DOI: 10.1109/FSKD.2008.374

Manzoor Elahi, Kun Li, Wasif Nisar, Xinjie Lv, Hongan Wang


Citations: 96

Abstract

Anomaly detection is an important and active research problem in many fields, with numerous applications. Most existing methods are based on distance measures, but on data streams these methods are not very efficient from a computational point of view. Because memory is limited relative to the size of the stream, most existing work on outlier detection in data streams declares a point an outlier or inlier as soon as it arrives. Given the dynamic nature of the incoming data, deciding on arrival can often lead to a wrong decision. In this paper we introduce a clustering-based approach that divides the stream into chunks and clusters each chunk using k-means into a fixed number of clusters. Instead of keeping only summary information, as is common when clustering data streams, we keep the candidate outliers and the mean value of every cluster for the next fixed number of stream chunks, to make sure that the detected candidate outliers are real outliers. By using the mean values of the clusters of the previous chunk together with the mean values of the current chunk, we decide the outlierness of data stream objects more reliably. Several experiments on different datasets confirm that our technique finds better outliers at lower computational cost than other existing distance-based approaches to outlier detection in data streams.
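The chunk-wise procedure described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the abstract does not give the candidate rule or any parameter values, so the distance threshold `radius`, the retention window `patience`, the deterministic k-means seeding, and the names `kmeans` and `detect_outliers` are all assumptions made for this example. Chunks are assumed to be non-empty lists of numeric tuples.

```python
import math
from collections import deque


def kmeans(points, k, iters=20):
    """Lloyd's k-means with a deterministic start; returns the k cluster means."""
    ordered = sorted(points)
    n = len(ordered)
    # Assumption: seed the means with evenly spaced points of the sorted chunk.
    means = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, means[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous mean.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else m
            for g, m in zip(groups, means)
        ]
        if updated == means:
            break
        means = updated
    return means


def detect_outliers(chunks, k=2, radius=3.0, patience=2):
    """Return (confirmed outliers, still-pending candidates) for a chunked stream.

    Assumed rule: a point is a candidate if it lies farther than `radius` from
    every known cluster mean, and it is confirmed only after staying that far
    from the means of the next `patience` chunks.
    """
    recent_means = deque(maxlen=patience)  # means of the last `patience` chunks
    candidates = []                        # (point, later chunks survived)
    outliers = []
    for chunk in chunks:
        means = kmeans(chunk, k)
        # Re-examine earlier candidates against the current chunk's means.
        surviving = []
        for point, age in candidates:
            if min(math.dist(point, m) for m in means) <= radius:
                continue  # a later cluster explains the point: treat as inlier
            if age + 1 >= patience:
                outliers.append(point)
            else:
                surviving.append((point, age + 1))
        candidates = surviving
        # New candidates: far from the current means and the retained earlier means.
        known = means + [m for ms in recent_means for m in ms]
        for p in chunk:
            if min(math.dist(p, m) for m in known) > radius:
                candidates.append((p, 0))
        recent_means.append(means)
    return outliers, [p for p, _ in candidates]


# Toy stream: (16, 11) looks isolated in chunk 1 but a cluster forms around it
# in chunk 2, while (5, -6) never gains neighbours.
a = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
b = [(x + 10, y + 10) for x, y in a]
c = [(x + 16, y + 10) for x, y in a]
chunks = [a + b + [(16, 11), (5, -6)], a + c, a + b]
outliers, pending = detect_outliers(chunks)
print(outliers, pending)  # [(5, -6)] []
```

In this toy run both isolated points become candidates in the first chunk, but only `(5, -6)` is confirmed. Declaring `(16, 11)` an outlier on arrival would have been the kind of premature decision the abstract warns about.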




Source journal

2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery

Self-citation rate

0.00%

Number of publications