Distributed Local Outlier Detection in Big Data

Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2017-08-13. DOI: 10.1145/3097983.3098179

Yizhou Yan, Lei Cao, C. Kuhlman, Elke A. Rundensteiner

Citations: 41

Abstract

In this work, we present the first distributed solution for the Local Outlier Factor (LOF) method -- a popular outlier detection technique shown to be very effective for datasets with skewed distributions. As datasets increase radically in size, highly scalable LOF algorithms leveraging modern distributed infrastructures are required. This poses significant challenges due to the complexity of the LOF definition, and a lack of access to the entire dataset at any individual compute machine. Our solution features a distributed LOF pipeline framework, called DLOF. Each stage of the LOF computation is conducted in a fully distributed fashion by leveraging our invariant observation for intermediate value management. Furthermore, we propose a data assignment strategy which ensures that each machine is self-sufficient in all stages of the LOF pipeline, while minimizing the number of data replicas. Based on the convergence property derived from analyzing this strategy in the context of real world datasets, we introduce a number of data-driven optimization strategies. These strategies not only minimize the computation costs within each stage, but also eliminate unnecessary communication costs by aggressively pushing the LOF computation into the early stages of the DLOF pipeline. Our comprehensive experimental study using both real and synthetic datasets confirms the efficiency and scalability of our approach to terabyte level data.
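The abstract refers to the LOF definition without restating it. For orientation, the following is a minimal single-machine sketch of the standard LOF computation (k-distance, reachability distance, local reachability density, LOF score). It is not the DLOF pipeline described in the paper; the function names `knn` and `lof_scores` are hypothetical, and the sketch assumes a small in-memory dataset with no duplicate points, breaking distance ties by index.

```python
import math


def knn(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs.

    Assumption: ties at the k-th distance are broken by index rather than
    including every tied point, which simplifies the textbook definition.
    """
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k):
    """Compute a LOF score for every point (centralized, O(n^2 log n))."""
    n = len(points)

    # Step 1: k nearest neighbours and k-distance of every point.
    neighbours = [knn(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neighbours]

    # Step 2: local reachability density (lrd).
    # reach-dist(p, o) = max(k-distance(o), d(p, o)); lrd(p) is the inverse
    # of the mean reach-dist from p to its neighbours.
    # Assumption: no duplicate points, so the sum below is never zero.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # Step 3: LOF(p) = mean of lrd(o) / lrd(p) over the neighbours o of p.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


if __name__ == "__main__":
    # Four points on a unit square plus one isolated point.
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
    for p, s in zip(pts, lof_scores(pts, k=2)):
        print(p, round(s, 3))
```

In this toy input the four clustered points score about 1 and the isolated point scores well above 1. Each step consumes values computed for a point's neighbours in the previous step (k-distances feed the densities, densities feed the scores), which is plausibly part of what makes the computation hard when no single machine holds the whole dataset.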
