Exploiting Spark for HPC Simulation Data: Taming the Ephemeral Data Explosion

M. Jiang, Brian Gallagher, Albert Chu, G. Abdulla, Timothy Bender
DOI: 10.1145/3368474.3368482
Published in: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
Publication date: 2020-01-15
Citations: 2

Abstract

In this paper, we address the challenge of analyzing simulation data on HPC systems by using Apache Spark, which is a Big Data framework. One of the main problems we encountered with using Spark on HPC systems is the ephemeral data explosion, which is brought about by the curse of persistence in the Spark framework. Data persistence is essential in reducing I/O, but it comes at the cost of storage space. We show that in some cases, Spark scratch data can consume an order of magnitude more space than the input data being analyzed, leading to fatal out-of-disk errors. We investigate the real-world application of scaling machine learning algorithms to predict and analyze failures in multi-physics simulations on 76TB of data (over one trillion training examples). This problem is 2-3 orders of magnitude larger than prior work. Based on extensive experiments at scale, we provide several concrete recommendations as state-of-the-practice, and demonstrate a 7x reduction in disk utilization with negligible increases or even decreases in runtime.
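To make the scale of the problem concrete, the figures reported in the abstract can be combined into a back-of-the-envelope estimate. The 10x scratch multiplier below is an assumption standing in for the "order of magnitude" worst case the authors cite; the paper's actual per-job measurements are not reproduced here.

```python
# Rough estimate of Spark scratch-space pressure at the paper's scale,
# using only figures stated in the abstract. The 10x multiplier is an
# illustrative assumption for "an order of magnitude more space".

input_data_tb = 76               # simulation data analyzed (TB)
scratch_multiplier = 10          # assumed worst-case scratch blowup
naive_scratch_tb = input_data_tb * scratch_multiplier

reduction_factor = 7             # reported disk-utilization reduction
tuned_scratch_tb = naive_scratch_tb / reduction_factor

print(f"naive scratch footprint: {naive_scratch_tb} TB")
print(f"tuned scratch footprint: {tuned_scratch_tb:.0f} TB")
```

Even after the reported 7x reduction, the scratch footprint under these assumptions still exceeds the 76 TB input, which is consistent with the abstract's framing of persistence as a storage-space trade-off rather than a solved problem.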