流数据中的并行连续离群值挖掘

Theodoros Toliopoulos, A. Gounaris, K. Tsichlas, A. Papadopoulos, Sandra Sampaio
{"title":"流数据中的并行连续离群值挖掘","authors":"Theodoros Toliopoulos, A. Gounaris, K. Tsichlas, A. Papadopoulos, Sandra Sampaio","doi":"10.1109/DSAA.2018.00033","DOIUrl":null,"url":null,"abstract":"In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In the recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques refer to a massively parallel setting. Our proposal fills this gap and studies transferring state-of-the-art techniques in Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, by using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that oulier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available in open-source","PeriodicalId":208455,"journal":{"name":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Parallel Continuous Outlier Mining in Streaming Data\",\"authors\":\"Theodoros Toliopoulos, A. Gounaris, K. Tsichlas, A. Papadopoulos, Sandra Sampaio\",\"doi\":\"10.1109/DSAA.2018.00033\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In the recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques refer to a massively parallel setting. Our proposal fills this gap and studies transferring state-of-the-art techniques in Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, by using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that oulier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available in open-source\",\"PeriodicalId\":208455,\"journal\":{\"name\":\"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)\",\"volume\":\"103 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSAA.2018.00033\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSAA.2018.00033","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

摘要

在这项工作中,我们关注度量空间中基于距离的离群值,其中实体的状态是否为离群值是基于其邻居中其他实体的数量。近年来,有几种解决方案解决了数据流中基于距离的异常值问题,随着新元素的出现,必须不断挖掘异常值。一个有趣的研究问题是将流环境与大规模并行系统相结合,以提供可扩展的基于流的算法。然而,之前提出的技术都不涉及大规模并行设置。我们的建议填补了这一空白,并研究了在Apache Flink中转移最先进的技术,这是一个用于密集流分析的现代平台。我们将全面介绍所遇到的技术挑战和可能应用的替代方案。我们显示加速高达117(每分钟)。2076)倍于朴素并行(参见。通过使用普通的4核机器和真实世界的数据集,Flink中的非并行)解决方案。我们的结果表明,可以以高效和可扩展的方式实现异常挖掘。由此产生的技术已经在开源中公开可用
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Parallel Continuous Outlier Mining in Streaming Data
In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In the recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques refer to a massively parallel setting. Our proposal fills this gap and studies transferring state-of-the-art techniques in Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, by using just an ordinary 4-core machine and a real-world dataset. Our results demonstrate that oulier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available in open-source
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信