Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira
{"title":"科学数据索引与查询的实证评价","authors":"Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira","doi":"10.5753/jidm.2018.1638","DOIUrl":null,"url":null,"abstract":"Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.","PeriodicalId":301338,"journal":{"name":"J. Inf. Data Manag.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Towards an Empirical Evaluation of Scientific Data Indexing and Querying\",\"authors\":\"Thaylon Guedes, V. Silva, J. Camata, M. Bedo, M. Mattoso, Daniel de Oliveira\",\"doi\":\"10.5753/jidm.2018.1638\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.\",\"PeriodicalId\":301338,\"journal\":{\"name\":\"J. Inf. Data Manag.\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Inf. Data Manag.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.5753/jidm.2018.1638\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Inf. Data Manag.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5753/jidm.2018.1638","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Towards an Empirical Evaluation of Scientific Data Indexing and Querying
Computational simulations usually produce large amounts of data on a regular time-step basis. Heterogeneous simulation outputs are stored in different file formats and on distinct storage devices. Therefore, the main challenges for accessing simulation data are related to time-to-query, which is the effort spent for setting all data into a common framework, the issuing of a high-level query statement, and obtaining the result set. The simulation data loading into DataBase Management Systems (DBMS) are either unpractical, as they demand a prohibitive time for data preparation, or unfeasible, as data files are still needed in their original form (scientific applications still need to read and write contents to those files). In this article, we discuss the complementary approaches of adaptive querying and raw data file indexing for accessing simulation results stored in multiple sources (e.g., raw data files) without data loading. In particular, we review (i) NoDB PostgresRAW routines for adaptive query processing, and (ii) FastBit methods for raw data file indexing and querying. We examine the behavior of both strategies regarding a real case study of computational fluid dynamics simulation in the domain of sediment deposition. In this experimental evaluation, we measured the elapsed time for index construction and query processing regarding six distinct query categories over 62 time steps, which sums up to different 372 queries on 44,160 files (12.2 GB) produced by the computational simulation. Results show that FastBit is faster than PostgresRAW for query execution in all but low-selectivity query scenarios. In a complementary manner, results also show PostgresRAW outperforms FastBit whenever users are interested in reducing time-to-query rather than the query execution time itself.