Parallel Continuous Outlier Mining in Streaming Data

2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). Pub Date: 2018-10-01. DOI: 10.1109/DSAA.2018.00033

Theodoros Toliopoulos, A. Gounaris, K. Tsichlas, A. Papadopoulos, Sandra Sampaio


Citations: 12

Abstract



In this work, we focus on distance-based outliers in a metric space, where whether an entity is an outlier depends on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques addresses a massively parallel setting. Our proposal fills this gap and studies how to transfer state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups of up to 117 times over a naive parallel solution and up to 2076 times over a non-parallel solution in Flink, using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open source.
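The distance-based definition used in the abstract can be illustrated with a minimal single-threaded sketch. This is not the paper's Flink implementation: the parameter names `radius`, `k`, `window_size` and `slide`, the count-based sliding window, the Euclidean distance, and the quadratic neighbor count are all assumptions made here for illustration. A point is reported as an outlier when fewer than `k` other points in the current window lie within `radius` of it.

```python
from collections import deque
from math import dist


def window_outliers(window, radius, k):
    """Return the points in `window` with fewer than `k` neighbors within `radius`."""
    points = list(window)
    outliers = []
    for i, p in enumerate(points):
        # Count every other point of the window that lies within the radius.
        neighbors = sum(
            1 for j, q in enumerate(points) if i != j and dist(p, q) <= radius
        )
        if neighbors < k:
            outliers.append(p)
    return outliers


def continuous_outliers(stream, window_size, slide, radius, k):
    """Yield the outliers of the current window each time it slides.

    Assumes a count-based window: the report is produced once the window
    is full, and then again after every `slide` new points.
    """
    window = deque(maxlen=window_size)  # oldest points expire automatically
    for n, point in enumerate(stream, start=1):
        window.append(point)
        if n >= window_size and (n - window_size) % slide == 0:
            yield window_outliers(window, radius, k)


if __name__ == "__main__":
    # Hypothetical 2-D stream: a tight cluster near the origin plus one far point.
    stream = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (0.1, 0.1), (0.2, 0.0)]
    for report in continuous_outliers(stream, window_size=4, slide=1, radius=0.5, k=2):
        print(report)
```

Recomputing every neighbor count on each slide costs quadratic time per window, which is the kind of baseline that incremental and parallel techniques aim to improve on.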

