Autonomously improving query evaluations over multidimensional data in distributed hash tables

Matthew Malensek, S. Pallickara, S. Pallickara
{"title":"Autonomously improving query evaluations over multidimensional data in distributed hash tables","authors":"Matthew Malensek, S. Pallickara, S. Pallickara","doi":"10.1145/2494621.2494638","DOIUrl":null,"url":null,"abstract":"The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.","PeriodicalId":190559,"journal":{"name":"ACM Cloud and Autonomic Computing Conference","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Cloud and Autonomic Computing Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2494621.2494638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.
自主改进对分布式哈希表中多维数据的查询计算
具有联网功能的观测设备和传感器的激增导致了数据速率和来源的增长,最终导致了极端规模的数据量。在这种情况下生成的数据集通常是多维的,每个维度代表一个感兴趣的特征。我们假设对这些数据集的查询的有效评估必须考虑数据值的分布和查询本身的模式。考虑到数据量、维数以及新数据和查询到达的速度,手动配置查询评估是不可行的。在本文中,我们描述了我们的算法来自主改进对大量分布式数据集的查询评估。我们的方法自动调整最主要的查询模式和值在一个维度上的分布。我们在伽利略系统的背景下评估我们的算法,伽利略系统是一个用于管理地理空间、时间序列数据的分层分布式哈希表。我们的系统在内存利用率、快速评估和搜索空间缩减之间取得了平衡。这里报告的经验评估是在多维数据集上执行的,该数据集包含10亿个文件。本工作中描述的方案广泛适用于利用分布式散列表作为存储机制的任何系统。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信