An Algorithmic and Software Pipeline for Very Large Scale Scientific Data Compression with Error Guarantees
Tania Banerjee, J. Choi, Jaemoon Lee, Qian Gong, Ruonan Wang, S. Klasky, A. Rangarajan, Sanjay Ranka
2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2022
DOI: 10.1109/HiPC56025.2022.00039
Citations: 5
Abstract
Efficient data compression is becoming increasingly critical for storing scientific data because many scientific applications produce vast amounts of data. This paper presents an end-to-end algorithmic and software pipeline for data compression that guarantees error bounds on both the primary data (PD) and derived quantities, known as Quantities of Interest (QoIs). We demonstrate the effectiveness of the pipeline by compressing fusion data generated by XGC, a large-scale fusion code that produces tens of petabytes of data in a single day. Compression is conducted on dedicated computational resources known as staging nodes, so it does not impact simulation performance. For efficient parallel I/O, the pipeline uses ADIOS2, which many codes, including XGC, already use for their parallel I/O. We show that our approach can compress the data by two orders of magnitude while guaranteeing high accuracy on both the PD and the QoIs. Further, compression requires only a few percent of the resources required by the simulation, while the compression time for each stage remains less than the corresponding simulation time.

The pipeline consists of three main steps. The first step uses domain decomposition to partition the data into small subdomains; each subdomain is then compressed independently, yielding a high degree of parallelism. The second step applies existing techniques that guarantee error bounds on the primary data of each subdomain. The third step applies a post-processing optimization based on Lagrange multipliers to reduce the QoI errors within each subdomain; the resulting Lagrange multipliers can be further quantized or truncated to increase the compression level. Together, these characteristics make our approach highly practical for on-the-fly compression while guaranteeing error bounds on the QoIs that are critical to scientists.
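The abstract only summarizes the Lagrange-multiplier post-processing step. As an illustrative sketch, not the authors' implementation, the Python snippet below shows the standard closed-form correction for the common case where each QoI is a linear functional of the data. The names `qoi_correct`, `f_hat`, `A`, and `b` are hypothetical: `A` stacks the QoI weight vectors, `b` holds the QoI values computed from the original data, and `f_hat` is the decompressed subdomain.

```python
import numpy as np

def qoi_correct(f_hat, A, b):
    """Minimally perturb decompressed data so that linear QoIs are restored.

    Solves   minimize ||f - f_hat||^2   subject to   A @ f = b,
    whose closed-form solution is f = f_hat + A.T @ lam, with Lagrange
    multipliers lam = (A @ A.T)^{-1} (b - A @ f_hat).
    """
    residual = b - A @ f_hat                  # QoI error left after decompression
    lam = np.linalg.solve(A @ A.T, residual)  # one multiplier per QoI constraint
    return f_hat + A.T @ lam, lam             # lam is what gets stored per subdomain
```

Because only the multipliers `lam` (one scalar per QoI constraint) need to be stored alongside the compressed PD, quantizing or truncating them, as the abstract notes, trades a small increase in QoI error for a higher compression level.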
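The abstract also does not detail how simulation data reaches the staging nodes. The following minimal sketch, assuming the pre-2.10 full Python bindings of ADIOS2 and a hypothetical stream name `xgc_stream`, shows the general pattern of streaming a distributed array through a staging engine (SST) so that a consumer on the staging nodes can compress it without blocking the simulation.

```python
from mpi4py import MPI
import numpy as np
import adios2

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one subdomain of a (hypothetical) 1-D global field.
local_n = 1000
f = np.random.rand(local_n)

adios = adios2.ADIOS(comm)
io = adios.DeclareIO("xgc-stage")
io.SetEngine("SST")  # streaming engine: readers on staging nodes pull the data

var = io.DefineVariable("f", f,
                        [size * local_n],   # global shape
                        [rank * local_n],   # local offset
                        [local_n],          # local count
                        adios2.ConstantDims)

engine = io.Open("xgc_stream", adios2.Mode.Write)
engine.BeginStep()
engine.Put(var, f)
engine.EndStep()   # the step becomes visible to the staging-side reader here
engine.Close()
```

A staging-side process would open the same stream in read mode, pull each step, and run the per-subdomain compression and QoI correction there, which is how dedicated staging resources can keep compression off the simulation's critical path.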