Optimizing multiple queries on scientific datasets with partial replicas

2007 8th IEEE/ACM International Conference on Grid Computing. Pub Date: 2007-09-19. DOI: 10.1109/GRID.2007.4354141

L. Weng, Ümit V. Çatalyürek, T. Kurç, G. Agrawal, J. Saltz

Citations: 4

Abstract

We propose strategies to efficiently execute a query workload, which consists of multiple related queries submitted against a scientific dataset, on a distributed-memory system in the presence of partial dataset replicas. Partial replication re-organizes and re-distributes one or more subsets of a dataset across the storage system to reduce I/O overheads and increase I/O parallelism. Our work targets a class of queries, called range queries, in which the query predicate specifies lower and upper bounds on the values of all or a subset of the attributes of a dataset. Data elements whose attribute values fall within the specified bounds are retrieved from the dataset. If we think of the attributes of a dataset as forming a multi-dimensional space, where each attribute corresponds to one of the dimensions, a range query defines a bounding box in this multi-dimensional space. We evaluate our strategies in two scenarios involving range queries. The first scenario represents the case in which queries have overlapping regions of interest, such as those arising from an exploratory analysis of the dataset by multiple users. In the second scenario, queries represent adjacent rectilinear sections that together capture an irregular subregion in the multi-dimensional space. This scenario corresponds to a case where the user wants to query and retrieve a spatial feature from the dataset. We propose cost models and an algorithm for optimizing such queries. Our results, using queries for subsetting and analysis of medical image datasets, show that effective use of partial replicas can reduce query execution times.
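To make the bounding-box view of a range query concrete, the following is a minimal sketch in Python. It is not the paper's cost model or optimization algorithm, which the abstract does not detail. The names `Box`, `contains`, `intersect`, and `range_query`, as well as the sample points and bounds, are hypothetical and chosen only for illustration. The sketch shows a range query as inclusive lower and upper bounds per attribute, and how the overlap between two queries (or between a query and the region held by a partial replica) is itself a bounding box.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box: inclusive lower/upper bound per attribute.

    Hypothetical helper type for illustration, not taken from the paper.
    """
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]

    def contains(self, point: Sequence[float]) -> bool:
        # A data element matches when every attribute lies within its bounds.
        return all(l <= x <= h for l, x, h in zip(self.lo, point, self.hi))

    def intersect(self, other: "Box") -> Optional["Box"]:
        # Overlap of two boxes is the per-dimension max of lows and min of highs.
        lo = tuple(max(a, b) for a, b in zip(self.lo, other.lo))
        hi = tuple(min(a, b) for a, b in zip(self.hi, other.hi))
        if any(l > h for l, h in zip(lo, hi)):
            return None  # the boxes do not overlap
        return Box(lo, hi)


def range_query(data: Sequence[Sequence[float]], box: Box) -> List[Sequence[float]]:
    """Return the data elements whose attribute values fall inside the box."""
    return [p for p in data if box.contains(p)]


# Made-up two-attribute dataset and two overlapping queries.
points = [(1, 1), (2, 5), (4, 3), (6, 6)]
q1 = Box((0, 0), (4, 4))
q2 = Box((3, 2), (7, 7))

print(range_query(points, q1))  # [(1, 1), (4, 3)]
print(q1.intersect(q2))         # Box(lo=(3, 2), hi=(4, 4))
```

In the overlapping-queries scenario, a shared region such as `q1.intersect(q2)` is data that both queries need. Whether to read it once, and from which replica, is the kind of decision a cost model would inform. This sketch only identifies the region.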
