S-preconditioner for Multi-fold Data Reduction with Guaranteed User-Controlled Accuracy

2011 IEEE 11th International Conference on Data Mining. Pub Date: 2011-12-11. DOI: 10.1109/ICDM.2011.138

Ye Jin, Sriram Lakshminarasimhan, Neil Shah, Zhenhuan Gong, Choong-Seock Chang, Jackie H. Chen, S. Ethier, H. Kolla, S. Ku, S. Klasky, R. Latham, R. Ross, K. Schuchardt, N. Samatova


Citations: 3

Abstract

The growing gap between the massive amounts of data generated by petascale scientific simulation codes and the capability of system hardware and software to effectively analyze this data necessitates data reduction. Yet the increasing data complexity challenges most, if not all, existing data compression methods. In fact, lossless compression techniques offer no more than a 10% reduction on the scientific data that we have experience with, which is widely regarded as effectively incompressible. To bridge this gap, we advocate a transformative strategy that enables fast, accurate, and multi-fold reduction of double-precision floating-point scientific data. The intuition behind our method is inspired by the effective use of preconditioners for linear algebra solvers optimized for a particular class of computational "dwarfs" (e.g., dense or sparse matrices). Focusing on a commonly used multi-resolution wavelet compression technique as the underlying "solver" for data reduction, we propose the S-preconditioner, which transforms scientific data into a form with high global regularity to ensure a significant decrease in the number of wavelet coefficients stored for a segment of data. Combined with the subsequent EQ-calibrator, our resultant method, called S-Preconditioned EQ-Calibrated Wavelets (SW), robustly achieved a 4- to 5-fold data reduction while guaranteeing user-defined accuracy of the reconstructed data: within 1% point-by-point relative error, lower than 0.01 Normalized RMSE, and higher than 0.99 Pearson Correlation. In this paper, we show the results obtained by testing our method on six petascale simulation codes, including fusion, combustion, climate, astrophysics, and subsurface groundwater, in addition to 13 publicly available scientific datasets. We also demonstrate that application-driven data mining tasks performed on decompressed variables or their derived quantities produce results of quality comparable to those for the original data.
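The three accuracy bounds quoted in the abstract can be checked on any pair of original and reconstructed arrays. The following is a minimal sketch and not the authors' code: the function names `accuracy_metrics` and `within_bounds` are hypothetical, normalizing the RMSE by the value range of the original data is an assumed convention (the abstract does not define the normalization), and the relative error assumes no original value is exactly zero.

```python
import math


def accuracy_metrics(original, reconstructed):
    """Return (max point-by-point relative error, normalized RMSE, Pearson correlation)."""
    if len(original) != len(reconstructed) or len(original) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(original)
    pairs = list(zip(original, reconstructed))

    # Point-by-point relative error; assumes no original value is exactly zero.
    max_rel_err = max(abs(o - r) / abs(o) for o, r in pairs)

    # RMSE divided by the value range of the original data (assumed convention).
    rmse = math.sqrt(sum((o - r) ** 2 for o, r in pairs) / n)
    nrmse = rmse / (max(original) - min(original))

    # Pearson correlation between original and reconstructed values.
    mean_o = sum(original) / n
    mean_r = sum(reconstructed) / n
    cov = sum((o - mean_o) * (r - mean_r) for o, r in pairs)
    var_o = sum((o - mean_o) ** 2 for o in original)
    var_r = sum((r - mean_r) ** 2 for r in reconstructed)
    pearson = cov / math.sqrt(var_o * var_r)

    return max_rel_err, nrmse, pearson


def within_bounds(original, reconstructed,
                  rel_tol=0.01, nrmse_tol=0.01, min_corr=0.99):
    """Check the bounds stated in the abstract (defaults: 1%, 0.01, 0.99)."""
    max_rel_err, nrmse, pearson = accuracy_metrics(original, reconstructed)
    return max_rel_err <= rel_tol and nrmse < nrmse_tol and pearson > min_corr


if __name__ == "__main__":
    original = [1.0, 2.0, 3.0, 4.0, 5.0]
    reconstructed = [1.005, 1.99, 3.01, 4.02, 4.98]  # made-up illustration data
    print(accuracy_metrics(original, reconstructed))
    print(within_bounds(original, reconstructed))
```

The point-by-point bound is the strictest of the three, since a single badly reconstructed value violates it even when the NRMSE and correlation remain well within their limits.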
