科学数据索引与查询的实证评价

J. Inf. Data Manag. Pub Date : 2018-06-20 DOI:10.5753/jidm.2018.1638

Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira

{"title":"科学数据索引与查询的实证评价","authors":"Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira","doi":"10.5753/jidm.2018.1638","DOIUrl":null,"url":null,"abstract":"Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Towards an Empirical Evaluation of Scientific Data Indexing and Querying\",\"authors\":\"Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira\",\"doi\":\"10.5753/jidm.2018.1638\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.\",\"PeriodicalId\":301338,\"journal\":{\"name\":\"J. Inf. Data Manag.\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Inf. Data Manag.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5753/jidm.2018.1638\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Inf. Data Manag.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/jidm.2018.1638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

计算模拟通常在有规律的时间步长的基础上产生大量的数据。异构仿真输出以不同的文件格式存储在不同的存储设备上。因此，访问模拟数据的主要挑战与查询时间有关，即将所有数据设置到公共框架、发布高级查询语句和获取结果集所花费的时间。将模拟数据加载到数据库管理系统(DataBase Management Systems, DBMS)中是不切实际的，因为它们需要大量的时间来准备数据，或者是不可行的，因为数据文件仍然需要原始形式(科学应用程序仍然需要读取和写入这些文件的内容)。在本文中，我们讨论了自适应查询和原始数据文件索引的互补方法，用于在不加载数据的情况下访问存储在多个源(例如，原始数据文件)中的模拟结果。我们特别回顾了(i)用于自适应查询处理的NoDB PostgresRAW例程，以及(ii)用于原始数据文件索引和查询的FastBit方法。我们检查关于在泥沙沉积领域的计算流体动力学模拟的实际案例研究两种策略的行为。在这个实验评估中，我们在62个时间步骤中测量了6个不同查询类别的索引构建和查询处理所花费的时间，计算模拟产生的44,160个文件(12.2 GB)上总共有372个不同的查询。结果表明，在除低选择性查询场景外的所有查询执行中，FastBit都比PostgresRAW快。以一种互补的方式，结果还显示，当用户对减少查询时间而不是查询执行时间本身感兴趣时，PostgresRAW优于FastBit。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Towards an Empirical Evaluation of Scientific Data Indexing and Querying

Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

J. Inf. Data Manag.

自引率

0.00%

发文量