Landrush: Rethinking In-Situ Analysis for GPGPU Workflows

Anshuman Goswami, Yuan Tian, K. Schwan, F. Zheng, Jeffrey S. Young, M. Wolf, G. Eisenhauer, S. Klasky
{"title":"Landrush: Rethinking In-Situ Analysis for GPGPU Workflows","authors":"Anshuman Goswami, Yuan Tian, K. Schwan, F. Zheng, Jeffrey S. Young, M. Wolf, G. Eisenhauer, S. Klasky","doi":"10.1109/CCGrid.2016.58","DOIUrl":null,"url":null,"abstract":"In-situ analysis on the output data of scientific simulations has been made necessary by ever-growing output data volumes and increasing costs of data movement as supercomputing is moving towards exascale. With hardware accelerators like GPUs becoming increasingly common in high end machines, new opportunities arise to co-locate scientific simulations and online analysis performed on the scientific data generated by the simulations. However, the asynchronous nature of GPGPU programming models and the limited context-switching capabilities on the GPU pose challenges to co-locating the scientific simulation and analysis on the same GPU. This paper dives deeper into these challenges to understand how best to co-locate analysis with scientific simulations on the GPUs in HPC clusters. Specifically, our 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data. Landrush is demonstrated with experimental results obtained from leadership high-end applications on ORNL's Titan supercomputer, which show that (i) GPU-based scientific simulations have varying degrees of idle cycles to afford useful analysis task co-location, and (ii) the inability to context switch on the GPU at instruction granularity can be overcome by careful control of the analysis kernel launches and software-controlled early completion of analysis kernel executions. Results show that Landrush is superior in terms of time-to-answer compared to serially running simulations followed by analysis or by relying on the GPU driver and hardwired thread dispatcher to run analysis concurrently on a single GPU.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2016.58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

In-situ analysis on the output data of scientific simulations has been made necessary by ever-growing output data volumes and increasing costs of data movement as supercomputing is moving towards exascale. With hardware accelerators like GPUs becoming increasingly common in high end machines, new opportunities arise to co-locate scientific simulations and online analysis performed on the scientific data generated by the simulations. However, the asynchronous nature of GPGPU programming models and the limited context-switching capabilities on the GPU pose challenges to co-locating the scientific simulation and analysis on the same GPU. This paper dives deeper into these challenges to understand how best to co-locate analysis with scientific simulations on the GPUs in HPC clusters. Specifically, our 'Landrush' approach to GPU sharing proposes a solution that utilizes idle cycles on the GPU to provide an improved time-to-answer, that is, the total time to run the scientific simulation and analysis of the generated data. Landrush is demonstrated with experimental results obtained from leadership high-end applications on ORNL's Titan supercomputer, which show that (i) GPU-based scientific simulations have varying degrees of idle cycles to afford useful analysis task co-location, and (ii) the inability to context switch on the GPU at instruction granularity can be overcome by careful control of the analysis kernel launches and software-controlled early completion of analysis kernel executions. Results show that Landrush is superior in terms of time-to-answer compared to serially running simulations followed by analysis or by relying on the GPU driver and hardwired thread dispatcher to run analysis concurrently on a single GPU.
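The abstract's key mechanism is cooperative sharing of a single GPU: analysis kernels are launched into the simulation's idle windows, and because the GPU cannot context switch at instruction granularity, the analysis is asked to complete early under software control when the simulation needs the device back. The sketch below illustrates that idea under stated assumptions; it is not the Landrush implementation. The kernel and variable names (`analysis_kernel`, `stop_flag`) are hypothetical, and the "analysis" body is a stand-in computation.

```cuda
// Minimal, hypothetical sketch of software-controlled early completion:
// an analysis kernel launched in its own stream polls a host-visible flag
// between chunks of work and drains early when the simulation needs the
// GPU back. Names are illustrative, not taken from Landrush.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void analysis_kernel(const float *data, float *result, int n,
                                volatile int *stop_flag)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: re-checking the flag between elements provides
    // software "preemption points" that the hardware scheduler lacks.
    for (int i = idx; i < n; i += stride) {
        if (*stop_flag)                // host signalled: complete early
            return;
        result[i] = data[i] * data[i]; // stand-in for real analysis work
    }
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr, *result = nullptr;
    int *stop_flag = nullptr, *d_stop_flag = nullptr;

    cudaMallocManaged(&data, n * sizeof(float));
    cudaMallocManaged(&result, n * sizeof(float));
    // Pinned, mapped allocation so the host can flip the flag while the
    // kernel is running and the device observes the update.
    cudaHostAlloc((void **)&stop_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_stop_flag, stop_flag, 0);
    *stop_flag = 0;
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    cudaStream_t analysis_stream;
    cudaStreamCreate(&analysis_stream);

    // Launch the analysis kernel during an idle window of the simulation.
    analysis_kernel<<<128, 256, 0, analysis_stream>>>(data, result, n,
                                                      d_stop_flag);

    // ... simulation work would happen here; when it needs the GPU again,
    // request early completion instead of waiting for the kernel to finish.
    *stop_flag = 1;
    cudaDeviceSynchronize();           // wait for the analysis to drain

    printf("analysis drained; result[0] = %f\n", result[0]);
    cudaStreamDestroy(analysis_stream);
    cudaFree(data);
    cudaFree(result);
    cudaFreeHost(stop_flag);
    return 0;
}
```

Using a separate stream lets the driver overlap analysis with simulation kernels where resources allow, while the host-set flag supplies the software-controlled early completion that the hardware dispatcher cannot provide on its own.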