Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion

M. Jiang, Brian Gallagher, Albert Chu, G. Abdulla, Timothy Bender
DOI: 10.1145/3368474.3368482
Published in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
Publication date: 2020-01-15
Citations: 2

Abstract

In this paper, we address the challenge of analyzing simulation data on HPC systems by using Apache Spark, which is a Big Data framework. One of the main problems we encountered with using Spark on HPC systems is the ephemeral data explosion, which is brought about by the curse of persistence in the Spark framework. Data persistence is essential in reducing I/O, but it comes at the cost of storage space. We show that in some cases, Spark scratch data can consume an order of magnitude more space than the input data being analyzed, leading to fatal out-of-disk errors. We investigate the real-world application of scaling machine learning algorithms to predict and analyze failures in multi-physics simulations on 76TB of data (over one trillion training examples). This problem is 2-3 orders of magnitude larger than prior work. Based on extensive experiments at scale, we provide several concrete recommendations as state-of-the-practice, and demonstrate a 7x reduction in disk utilization with negligible increases or even decreases in runtime.
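To make the scale of the problem concrete, the figures reported in the abstract can be combined into a back-of-the-envelope estimate. The 10x scratch multiplier below is an assumption standing in for the "order of magnitude" worst case the authors cite; the paper's actual per-job measurements are not reproduced here.

```python
# Rough estimate of Spark scratch-space pressure at the paper's scale,
# using only figures stated in the abstract. The 10x multiplier is an
# illustrative assumption for "an order of magnitude more space".

input_data_tb = 76               # simulation data analyzed (TB)
scratch_multiplier = 10          # assumed worst-case scratch blowup
naive_scratch_tb = input_data_tb * scratch_multiplier

reduction_factor = 7             # reported disk-utilization reduction
tuned_scratch_tb = naive_scratch_tb / reduction_factor

print(f"naive scratch footprint: {naive_scratch_tb} TB")
print(f"tuned scratch footprint: {tuned_scratch_tb:.0f} TB")
```

Even after the reported 7x reduction, the scratch footprint under these assumptions still exceeds the 76 TB input, which is consistent with the abstract's framing of persistence as a storage-space trade-off rather than a solved problem.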