Six degrees of scientific data: reading patterns for extreme scale science IO

IEEE International Symposium on High-Performance Parallel Distributed Computing

Pub Date: 2011-06-08

DOI: 10.1145/1996130.1996139

J. Lofstead, Milo Polte, Garth A. Gibson, S. Klasky, K. Schwan, R. Oldfield, M. Wolf, Qing Liu


Citations: 97

Abstract

Petascale science simulations generate tens of terabytes of application data per day, much of it devoted to checkpoint/restart fault tolerance. Previous work demonstrated the importance of carefully managing such output, both to prevent application slowdown caused by IO blocking and resource contention and to fully exploit the IO bandwidth available to the petascale machine. This paper takes a further step in understanding and managing extreme-scale IO. Specifically, its evaluations seek to understand how to efficiently read data for subsequent data analysis, visualization, checkpoint restart after a failure, and other read-intensive operations. Together, these operations support the 'end-to-end' needs of scientists and enable the scientific processes being undertaken. Contributions include the following. First, working with application scientists, we define 'read' benchmarks that capture the common read patterns used by analysis codes. Second, we use these read patterns to evaluate different IO techniques at scale and to understand how alternative data sizes and organizations affect the performance seen by end users. Third, we introduce the novel notion of a 'data district' to characterize how data is organized for reads, and we experimentally compare the read performance of the ADIOS middleware's log-based BP format with that of the logically contiguous NetCDF or HDF5 formats commonly used by analysis tools. Measurements assess performance across read patterns and with different data sizes, organizations, and read process counts. Outcomes demonstrate that high end-to-end IO performance requires data organizations that offer flexibility in data layout and placement on parallel storage targets, including in ways that trade off the performance of data writes against that of reads.
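To make the contrast between a log-based and a logically contiguous organization concrete, the following is a minimal toy sketch, not the paper's benchmark and not the actual on-disk layout of BP, NetCDF, or HDF5. It assumes a square 2D array of `n` x `n` elements written by processes that each own a `b` x `b` block. In the "contiguous" model the whole array is stored row-major; in the "log" model each writer's block is appended whole, one block after another. All function names are hypothetical. The sketch counts how many separate contiguous extents a reader must fetch for a given read pattern.

```python
from itertools import product


def contiguous_offset(r, c, n):
    """Offset of element (r, c) when the n x n array is stored row-major."""
    return r * n + c


def log_offset(r, c, n, b):
    """Offset of element (r, c) when each b x b writer block is appended whole.

    Blocks are ordered row-major by block coordinates, and each block is
    stored row-major internally (an assumption of this toy model).
    """
    blocks_per_row = n // b
    block_id = (r // b) * blocks_per_row + (c // b)
    return block_id * b * b + (r % b) * b + (c % b)


def count_extents(offsets):
    """Count maximal runs of consecutive offsets (one run = one read request)."""
    offs = sorted(offsets)
    return sum(1 for i, o in enumerate(offs) if i == 0 or o != offs[i - 1] + 1)


def extents(rows, cols, n, b):
    """Return (contiguous_layout_extents, log_layout_extents) for a read region."""
    cells = list(product(rows, cols))
    contig = count_extents(contiguous_offset(r, c, n) for r, c in cells)
    log = count_extents(log_offset(r, c, n, b) for r, c in cells)
    return contig, log


if __name__ == "__main__":
    n, b = 8, 4  # 8 x 8 array, written as four 4 x 4 blocks
    patterns = {
        "two full rows": (range(0, 2), range(0, n)),
        "one writer's block": (range(0, b), range(0, b)),
        "one full column": (range(0, n), range(0, 1)),
    }
    for name, (rows, cols) in patterns.items():
        print(name, extents(rows, cols, n, b))
```

In this toy model, reading two full rows needs fewer extents in the contiguous layout, reading back a single writer's block (as in a checkpoint restart) needs fewer in the log layout, and reading one column is equally fragmented in both. This only illustrates why read cost depends on how the pattern aligns with the data organization; it says nothing about measured performance.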



