Simulation and Evaluation of Cloud Storage Caching for Data Intensive Science

JCR: Q1 (Computer Science)

Computing and Software for Big Science. Pub Date: 2022-01-01. DOI: 10.1007/s41781-021-00076-w

Tobias Wegner, Mario Lassnig, Peer Ueberholz, Christian Zeitnitz

Volume 6, issue 1, article 5. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9805534/pdf/

Citations: 4

Abstract



Data reduction is a common task in scientific computing. The workflow extracts the most important information from large input data and stores it in smaller derived data objects, which can then be used for further analysis. Typically, these workflows use distributed storage and computing resources. A straightforward storage setup combines low-cost tape storage with higher-cost disk storage: the large, infrequently accessed input data is stored on tape, and the smaller, frequently accessed derived data is stored on disk. In the best case, the large input data is accessed only rarely and in a well-planned pattern. In practice, however, the data often has to be processed continuously and unpredictably, which can significantly reduce tape storage performance. A common countermeasure is to store copies of the large input data on disk. This contribution evaluates an approach that uses cloud storage resources as a flexible cache or buffer, depending on the computational workflow. The proposed model is explored for the case of continuously processed data. For the evaluation, a simulation tool was developed that can be used to analyse models related to storage and network resources. We show that using commercial cloud storage can reduce on-premises disk storage requirements while maintaining an equal job throughput. Moreover, the key metrics of the model are discussed, and an approach is described that uses the simulation to support the decision on whether to use commercial cloud storage. The goal is to investigate approaches and propose new evaluation methods to overcome future data challenges.
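The tiering idea in the abstract can be illustrated with a toy sketch. This is not the authors' simulation tool, and every name in it (`LruTier`, `simulate`, `disk_files`, `cloud_files`) is hypothetical. It assumes least-recently-used eviction, equal-sized files counted as whole units, and no modelling of network bandwidth, latency, or cost, none of which the abstract specifies. It only counts how many accesses are served from on-premises disk, from cloud storage, or by a tape recall.

```python
from collections import OrderedDict


class LruTier:
    """Fixed-capacity storage tier that evicts the least recently used file."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()

    def touch(self, name):
        """Return True on a hit and mark the file as most recently used."""
        if name in self.files:
            self.files.move_to_end(name)
            return True
        return False

    def add(self, name):
        """Store a file; return the name of the file pushed out, or None."""
        if self.capacity == 0:
            return name  # nothing fits, so the file is pushed out at once
        self.files[name] = True
        if len(self.files) > self.capacity:
            evicted, _ = self.files.popitem(last=False)
            return evicted
        return None


def simulate(accesses, disk_files, cloud_files):
    """Count how many accesses are served from disk, cloud, or tape."""
    disk, cloud = LruTier(disk_files), LruTier(cloud_files)
    served = {"disk": 0, "cloud": 0, "tape": 0}
    for name in accesses:
        if disk.touch(name):
            served["disk"] += 1
            continue
        if cloud.touch(name):
            served["cloud"] += 1
            del cloud.files[name]  # promote the file back to disk
        else:
            served["tape"] += 1  # assumed fallback: recall from tape
        evicted = disk.add(name)
        if evicted is not None:
            cloud.add(evicted)  # spill to the cloud tier instead of dropping
    return served
```

For twelve accesses cycling over four files (`list("abcd") * 3`), a two-file disk alone leads to 12 tape recalls in this sketch, whereas a two-file disk backed by a two-file cloud tier needs only 4, the same number as a four-file disk without cloud storage. Under these simplified assumptions, that mirrors the abstract's claim that cloud storage can stand in for part of the on-premises disk capacity.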


Source journal

Computing and Software for Big Science (Computer Science: Computer Science (miscellaneous))

CiteScore

6.20

Self-citation rate

0.00%

Articles published