NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing

Zhengzhang Chen, S. Son, W. Hendrix, Ankit Agrawal, W. Liao, A. Choudhary
{"title":"NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing","authors":"Zhengzhang Chen, S. Son, W. Hendrix, Ankit Agrawal, W. Liao, A. Choudhary","doi":"10.1109/SC.2014.65","DOIUrl":null,"url":null,"abstract":"Data check pointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As the HPC systems move towards exascale, the storage space and time costs of check pointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data in which high-precision data is used and hence common patterns are rare to find. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next are not very significantly different from each other. Thus, capturing the distribution of relative changes in data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, North western University Machine learning Algorithm for Resiliency and Check pointing, that makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate a superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per point basis, while compressing the data by an order of magnitude.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"126 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"50","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2014.65","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 50

Abstract

Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, in which high-precision values are used and repeated patterns are therefore rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next do not differ significantly from each other. Thus, capturing the distribution of relative changes in the data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order-of-magnitude data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and Checkpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
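To make the encoding idea concrete, the following Python sketch illustrates the change-ratio binning described above. It is a minimal simplification, not the paper's implementation: it uses uniform bins of width 2E instead of the learned, clustered bin centers NUMARCK derives from the observed change distribution, it assumes the previous iteration's values are nonzero, and the names `numarck_encode` and `numarck_decode` are hypothetical.

```python
import numpy as np

def numarck_encode(prev, curr, error_bound=0.01):
    """Encode `curr` as per-point bin indices over its relative change
    from `prev`. Simplified sketch: uniform bins stand in for NUMARCK's
    learned bin centers, and `prev` is assumed nonzero everywhere."""
    # Relative change of each data point between consecutive iterations.
    ratio = (curr - prev) / prev
    # Bins of width 2 * error_bound: snapping a ratio to its bin center
    # perturbs the ratio by at most error_bound.
    width = 2.0 * error_bound
    bins = np.floor(ratio / width).astype(np.int64)
    # Keep only the bins that actually occur; each point then stores a
    # small index into this table instead of a full floating-point value.
    occupied, codes = np.unique(bins, return_inverse=True)
    centers = (occupied + 0.5) * width
    return codes, centers

def numarck_decode(prev, codes, centers):
    """Reconstruct the current iteration from the previous checkpoint
    plus the per-point indices and the small table of bin centers."""
    return prev * (1.0 + centers[codes])

rng = np.random.default_rng(0)
prev = rng.random(1_000_000) + 1.0                         # previous checkpoint
curr = prev * (1.0 + rng.normal(0.0, 0.005, prev.shape))   # next iteration
codes, centers = numarck_encode(prev, curr, error_bound=0.01)
recon = numarck_decode(prev, codes, centers)
print(len(centers), np.max(np.abs((recon - curr) / curr)))
```

When the occupied bins number 256 or fewer, each point costs a one-byte index plus a shared table of centers, roughly an eightfold reduction over 64-bit values before any further entropy coding. Two caveats keep this a sketch: the bound here is on the change ratio, which only approximately bounds the per-point relative error for small changes, and the paper additionally stores exactly any points that no bin can represent within the user-defined bound, a fallback this sketch omits.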