DPiSAX: Massively Distributed Partitioned iSAX

2017 IEEE International Conference on Data Mining (ICDM). Pub Date: 2017-11-18. DOI: 10.1109/ICDM.2017.151

D. Yagoubi, Reza Akbarinia, F. Masseglia, Themis Palpanas

Citations: 49

Abstract

Indexing is crucial for many data mining tasks that rely on efficient and effective similarity query processing. Consequently, indexing large volumes of time series, together with high-performance similarity query processing, has become a topic of high interest. For many applications across diverse domains, however, the amount of data to be processed may be intractable for a single machine, which makes existing centralized indexing solutions inefficient. We propose a parallel indexing solution that gracefully scales to billions of time series, and a parallel query processing strategy that, given a batch of queries, efficiently exploits the index. Our experiments, on both synthetic and real-world data, show that our index creation algorithm processes 1 billion time series in less than 2 hours, whereas state-of-the-art centralized algorithms need more than 5 days. In addition, thanks to an effective load balancing mechanism, our distributed querying algorithm efficiently processes millions of queries over collections of billions of time series.
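The abstract does not describe the representation being indexed. As background only, the following minimal sketch shows the general SAX summarization that iSAX-style indexes build on: z-normalize a series, average it over equal-width segments (PAA), and map each segment mean to a symbol using breakpoints of the standard normal distribution. It is not the DPiSAX algorithm. The function names, the fixed cardinality of 4, and the requirement that the series length be divisible by the number of segments are assumptions made for illustration.

```python
from bisect import bisect_left
from statistics import mean, pstdev

# Assumption: cardinality 4. These breakpoints split a standard normal
# distribution into four equiprobable regions (its quartiles).
BREAKPOINTS_CARD4 = [-0.6745, 0.0, 0.6745]


def znormalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series carries no shape information; map it to zeros.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    n = len(series)
    if segments <= 0 or n % segments != 0:
        raise ValueError("series length must be a positive multiple of segments")
    width = n // segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(segments)]


def sax_word(series, segments, breakpoints=BREAKPOINTS_CARD4):
    """Return one integer symbol per segment (0 = lowest value region)."""
    return [bisect_left(breakpoints, v) for v in paa(znormalize(series), segments)]


if __name__ == "__main__":
    # A step from low to high values summarized with two segments.
    print(sax_word([1, 1, 1, 1, 5, 5, 5, 5], segments=2))  # prints [0, 3]
```

Series that share the same word fall into the same region of the summary space, which is what lets an index group similar series together without comparing raw values.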



