Optimizing multiple queries on scientific datasets with partial replicas

L. Weng, Ümit V. Çatalyürek, T. Kurç, G. Agrawal, J. Saltz
Published in: 2007 8th IEEE/ACM International Conference on Grid Computing, 2007-09-19
DOI: 10.1109/GRID.2007.4354141 (https://doi.org/10.1109/GRID.2007.4354141)
Citation count: 4

Abstract

We propose strategies to efficiently execute a query workload, consisting of multiple related queries submitted against a scientific dataset, on a distributed-memory system in the presence of partial dataset replicas. Partial replication re-organizes and re-distributes one or more subsets of a dataset across the storage system to reduce I/O overheads and increase I/O parallelism. Our work targets a class of queries, called range queries, in which the query predicate specifies lower and upper bounds on the values of all or a subset of the attributes of a dataset. Data elements whose attribute values fall within the specified bounds are retrieved from the dataset. If we think of the attributes of a dataset as forming a multi-dimensional space, where each attribute corresponds to one dimension, a range query defines a bounding box in this multi-dimensional space. We evaluate our strategies in two scenarios involving range queries. The first scenario represents the case in which queries have overlapping regions of interest, such as those arising from an exploratory analysis of the dataset by multiple users. In the second scenario, queries represent adjacent rectilinear sections that together capture an irregular subregion in the multi-dimensional space. This scenario corresponds to a case where the user wants to query and retrieve a spatial feature from the dataset. We propose cost models and an algorithm for optimizing such queries. Our results, obtained with queries for subsetting and analysis of medical image datasets, show that effective use of partial replicas can reduce query execution times.
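The bounding-box view of a range query described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' cost model or optimization algorithm, which the abstract does not detail. The names RangeQuery, contains, intersect, and select are hypothetical, the bounds are assumed to be inclusive, and the sample points are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# A data element is modeled as a tuple of attribute values, one per dimension.
Point = Tuple[float, ...]


@dataclass(frozen=True)
class RangeQuery:
    """Hypothetical range query: a lower and an upper bound per attribute."""
    lower: Point
    upper: Point

    def contains(self, element: Point) -> bool:
        # Assumption: bounds are inclusive on both ends.
        return all(lo <= v <= hi
                   for lo, v, hi in zip(self.lower, element, self.upper))

    def intersect(self, other: "RangeQuery") -> Optional["RangeQuery"]:
        # The overlap of two bounding boxes is itself a box, or empty.
        lower = tuple(max(a, b) for a, b in zip(self.lower, other.lower))
        upper = tuple(min(a, b) for a, b in zip(self.upper, other.upper))
        if any(lo > hi for lo, hi in zip(lower, upper)):
            return None  # the two regions of interest do not overlap
        return RangeQuery(lower, upper)


def select(elements: Sequence[Point], query: RangeQuery) -> List[Point]:
    """Return the elements whose attribute values fall inside the box."""
    return [e for e in elements if query.contains(e)]


# Made-up two-attribute dataset and two overlapping queries.
elements = [(1.0, 1.0), (2.0, 3.0), (4.0, 4.0), (6.0, 2.0)]
q1 = RangeQuery((0.0, 0.0), (3.0, 3.0))
q2 = RangeQuery((2.0, 2.0), (5.0, 5.0))

print(select(elements, q1))  # [(1.0, 1.0), (2.0, 3.0)]
print(q1.intersect(q2))      # RangeQuery(lower=(2.0, 2.0), upper=(3.0, 3.0))
```

A query that constrains only a subset of attributes could be expressed in this sketch by using float("-inf") and float("inf") as the bounds of the unconstrained dimensions. The region returned by intersect is the part of the data that both queries in the first scenario would need.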