Autonomously improving query evaluations over multidimensional data in distributed hash tables

ACM Cloud and Autonomic Computing Conference. Pub Date: 2013-08-09. DOI: 10.1145/2494621.2494638

Matthew Malensek, S. Pallickara, S. Pallickara


Citations: 11

Abstract

The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.
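The abstract names two ideas that a short sketch can make concrete: hierarchical placement in a distributed hash table, and tracking which query patterns dominate. The code below is a minimal illustration under stated assumptions, not Galileo's implementation. The class names `HierarchicalDHT` and `QueryPatternTracker`, the node names, and the use of a coarse `region` string as the first-tier key are all hypothetical and do not come from the paper.

```python
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() for str is salted per process."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HierarchicalDHT:
    """Two-tier placement (assumed scheme): a coarse key picks a group,
    then the full key picks a node inside that group."""

    def __init__(self, groups: list[list[str]]):
        self.groups = groups  # each group is a list of hypothetical node names

    def locate(self, region: str, timestamp: str) -> str:
        # Tier 1: records from the same region land in the same group.
        group = self.groups[stable_hash(region) % len(self.groups)]
        # Tier 2: the full key spreads records across nodes in that group.
        return group[stable_hash(f"{region}/{timestamp}") % len(group)]


class QueryPatternTracker:
    """Counts which sets of dimensions queries constrain (assumed heuristic)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, constrained_dimensions: set[str]) -> None:
        self.counts[frozenset(constrained_dimensions)] += 1

    def dominant(self) -> frozenset:
        # The most frequently seen combination of constrained dimensions.
        pattern, _ = self.counts.most_common(1)[0]
        return pattern


# Usage: place a record, then observe a few queries.
dht = HierarchicalDHT([["n0", "n1"], ["n2", "n3"]])
node = dht.locate("9xj", "2013-08-09")

tracker = QueryPatternTracker()
tracker.record({"time"})
tracker.record({"lat", "lon"})
tracker.record({"lat", "lon"})
top_pattern = tracker.dominant()  # the {"lat", "lon"} combination
```

A system that tunes itself could use the dominant pattern to decide which dimensions deserve in-memory indexing. How Galileo actually makes that decision is described in the paper, not here.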




Source journal

ACM Cloud and Autonomic Computing Conference

Self-citation rate

0.00%

Publication count