DPiSAX: Massively Distributed Partitioned iSAX

D. Yagoubi, Reza Akbarinia, F. Masseglia, Themis Palpanas
{"title":"DPiSAX: Massively Distributed Partitioned iSAX","authors":"D. Yagoubi, Reza Akbarinia, F. Masseglia, Themis Palpanas","doi":"10.1109/ICDM.2017.151","DOIUrl":null,"url":null,"abstract":"Indexing is crucial for many data mining tasks that rely on efficient and effective similarity query processing. Consequently, indexing large volumes of time series, along with high performance similarity query processing, have became topics of high interest. For many applications across diverse domains though, the amount of data to be processed might be intractable for a single machine, making existing centralized indexing solutions inefficient. We propose a parallel indexing solution that gracefully scales to billions of time series, and a parallel query processing strategy that, given a batch of queries, efficiently exploits the index. Our experiments, on both synthetic and real world data, illustrate that our index creation algorithm works on 1 billion time series in less than 2 hours, while the state of the art centralized algorithms need more than 5 days. Also, our distributed querying algorithm is able to efficiently process millions of queries over collections of billions of time series, thanks to an effective load balancing mechanism.","PeriodicalId":254086,"journal":{"name":"2017 IEEE International Conference on Data Mining (ICDM)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Conference on Data Mining (ICDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2017.151","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 49

Abstract

Indexing is crucial for many data mining tasks that rely on efficient and effective similarity query processing. Consequently, indexing large volumes of time series, along with high performance similarity query processing, have became topics of high interest. For many applications across diverse domains though, the amount of data to be processed might be intractable for a single machine, making existing centralized indexing solutions inefficient. We propose a parallel indexing solution that gracefully scales to billions of time series, and a parallel query processing strategy that, given a batch of queries, efficiently exploits the index. Our experiments, on both synthetic and real world data, illustrate that our index creation algorithm works on 1 billion time series in less than 2 hours, while the state of the art centralized algorithms need more than 5 days. Also, our distributed querying algorithm is able to efficiently process millions of queries over collections of billions of time series, thanks to an effective load balancing mechanism.
DPiSAX:大规模分布式分区iSAX
索引对于许多依赖于高效相似度查询处理的数据挖掘任务至关重要。因此,索引大量时间序列以及高性能相似度查询处理已成为人们非常感兴趣的话题。但是,对于跨不同领域的许多应用程序,要处理的数据量对于一台机器来说可能是难以处理的,这使得现有的集中式索引解决方案效率低下。我们提出了一种可以优雅地扩展到数十亿时间序列的并行索引解决方案,以及一种并行查询处理策略,该策略可以在给定一批查询的情况下有效地利用索引。我们在合成数据和真实世界数据上的实验表明,我们的索引创建算法在不到2小时的时间内就可以处理10亿个时间序列,而最先进的集中式算法需要5天以上的时间。此外,由于有效的负载平衡机制,我们的分布式查询算法能够有效地处理数十亿个时间序列集合上的数百万个查询。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信