An Algorithmic and Software Pipeline for Very Large Scale Scientific Data Compression with Error Guarantees

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC). Pub Date: 2022-12-01. DOI: 10.1109/HiPC56025.2022.00039

Tania Banerjee, J. Choi, Jaemoon Lee, Qian Gong, Ruonan Wang, S. Klasky, A. Rangarajan, Sanjay Ranka


Citations: 5

Abstract

Efficient data compression is becoming increasingly critical for storing scientific data because many scientific applications produce vast amounts of data. This paper presents an end-to-end algorithmic and software pipeline for data compression that guarantees error bounds on both primary data (PD) and derived data, known as Quantities of Interest (QoIs). We demonstrate the effectiveness of the pipeline by compressing fusion data generated by a large-scale fusion code, XGC, which produces tens of petabytes of data in a single day. We demonstrate that compression can be conducted on computational resources set aside for that purpose, known as staging nodes, so that it does not impact simulation performance. For efficient parallel I/O, the pipeline uses ADIOS2, which many codes such as XGC already use for their parallel I/O. We show that our approach can compress the data by two orders of magnitude while guaranteeing high accuracy on both the PD and the QoIs. Further, compression requires only a few percent of the resources required by the simulation, and the compression time for each stage is less than the corresponding simulation time. The pipeline consists of three main steps. The first step uses domain decomposition to split the data into small subdomains, each of which is then compressed independently to achieve a high level of parallelism. The second step uses existing techniques that guarantee error bounds on the primary data for each subdomain. The third step uses a post-processing optimization technique based on Lagrange multipliers to reduce the QoI errors for the data in each subdomain. The Lagrange multipliers generated can be further quantized or truncated to increase the compression level. Together, these characteristics make it highly practical to apply on-the-fly compression while guaranteeing error bounds on the QoIs that are critical to scientists.
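The three steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, since the abstract does not specify the compressor or the QoIs: the field is one-dimensional, uniform scalar quantization stands in for the "existing techniques" of step two, and the only QoI is the plain sum over each subdomain. All function names are hypothetical.

```python
import math

def split_into_subdomains(data, size):
    """Step 1: cut a 1-D field into independent fixed-size subdomains."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def quantize(block, eb):
    """Step 2 stand-in (assumption): uniform quantization, pointwise error <= eb."""
    step = 2.0 * eb
    return [round(x / step) for x in block]

def dequantize(codes, eb):
    """Map integer codes back to the centers of their quantization bins."""
    step = 2.0 * eb
    return [c * step for c in codes]

def qoi(block):
    """Hypothetical QoI: the plain sum over one subdomain."""
    return math.fsum(block)

def compress(data, size, eb):
    """Return one (codes, multiplier) pair per subdomain."""
    packed = []
    for block in split_into_subdomains(data, size):
        codes = quantize(block, eb)
        recon = dequantize(codes, eb)
        # Step 3: minimize 0.5 * ||y - recon||^2 subject to sum(y) == sum(block).
        # Setting the gradient of the Lagrangian
        #   0.5 * ||y - recon||^2 - lam * (sum(y) - sum(block))
        # to zero gives y = recon + lam, and the constraint then fixes lam:
        lam = (qoi(block) - qoi(recon)) / len(block)
        packed.append((codes, lam))
    return packed

def decompress(packed, eb):
    """Dequantize each subdomain and apply its stored multiplier."""
    restored = []
    for codes, lam in packed:
        restored.extend(x + lam for x in dequantize(codes, eb))
    return restored

if __name__ == "__main__":
    field = [math.sin(0.1 * i) for i in range(64)]
    packed = compress(field, size=16, eb=0.01)
    restored = decompress(packed, eb=0.01)
    print("max pointwise error:", max(abs(a - b) for a, b in zip(field, restored)))
    print("QoI error, first subdomain:", abs(qoi(field[:16]) - qoi(restored[:16])))
```

In this sketch each subdomain stores a single extra number, the multiplier, which could itself be quantized or truncated at the cost of a small residual QoI error. Note that the correction shifts every reconstructed value, so the pointwise error can grow to as much as twice the quantization bound; a pipeline that guarantees both bounds has to account for that interaction.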

