{"title":"Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models","authors":"Cheng Chen, Yao Wu, Stéphane Zuckerman, G. Gao","doi":"10.1109/IPDPSW.2013.47","DOIUrl":null,"url":null,"abstract":"The codelet model is a fine-grain, dataflow-inspired program execution model that balances parallelism against runtime-system overhead. It plays an important role in performance, scalability, and energy efficiency in exascale studies such as the DARPA UHPC project and the DOE X-Stack project. As an important application, the Fast Fourier Transform (FFT) has been studied in depth under fine-grain models, including the codelet model. However, existing work focuses on how fine-grain models achieve a more balanced workload compared to traditional coarse-grain models. In this paper, we make the important observation that the flexibility in task execution order afforded by fine-grain models also improves memory-bandwidth utilization. Using the codelet model and the FFT application as a case study, we show that a proper execution order of tasks (or codelets) can significantly reduce memory contention and thus improve performance. We propose an algorithm that provides heuristic guidance on the execution order of codelets to reduce memory contention. We implemented our algorithm on the IBM Cyclops-64 architecture. 
Experimental results show that our algorithm improves performance by up to 46% over a state-of-the-art coarse-grain implementation of the FFT on Cyclops-64.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2013.47","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
The codelet model is a fine-grain, dataflow-inspired program execution model that balances parallelism against runtime-system overhead. It plays an important role in performance, scalability, and energy efficiency in exascale studies such as the DARPA UHPC project and the DOE X-Stack project. As an important application, the Fast Fourier Transform (FFT) has been studied in depth under fine-grain models, including the codelet model. However, existing work focuses on how fine-grain models achieve a more balanced workload compared to traditional coarse-grain models. In this paper, we make the important observation that the flexibility in task execution order afforded by fine-grain models also improves memory-bandwidth utilization. Using the codelet model and the FFT application as a case study, we show that a proper execution order of tasks (or codelets) can significantly reduce memory contention and thus improve performance. We propose an algorithm that provides heuristic guidance on the execution order of codelets to reduce memory contention. We implemented our algorithm on the IBM Cyclops-64 architecture. Experimental results show that our algorithm improves performance by up to 46% over a state-of-the-art coarse-grain implementation of the FFT on Cyclops-64.
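The core observation — that the order in which independent fine-grain tasks execute changes how often concurrent tasks collide on the same memory bank — can be illustrated with a small sketch. This is a toy model under stated assumptions, not the paper's algorithm: it assumes a hypothetical 4-bank, block-interleaved memory, an issue width of 4 concurrent tasks, and radix-2 butterfly tasks from the first FFT stage that each touch the pair `(i, i + N/2)`. The heuristic shown (round-robin over bank classes) is one simple way to spread a group's accesses across distinct banks.

```python
# Toy illustration (not the paper's algorithm): task issue order vs.
# memory-bank contention for one radix-2 FFT stage.
# All constants are hypothetical.
from collections import defaultdict

NUM_BANKS = 4     # hypothetical banked on-chip memory
LINE = 8          # words per bank-interleaving unit
ISSUE_WIDTH = 4   # tasks that execute concurrently
N = 32            # FFT size; the first stage pairs (i, i + N//2)
HALF = N // 2

def bank(addr):
    # Block-interleaved mapping: consecutive LINE words share a bank.
    return (addr // LINE) % NUM_BANKS

def conflicts(order):
    """Count same-bank collisions within each concurrently issued group."""
    total = 0
    for g in range(0, len(order), ISSUE_WIDTH):
        group = order[g:g + ISSUE_WIDTH]
        accesses = [bank(i) for i in group] + [bank(i + HALF) for i in group]
        total += len(accesses) - len(set(accesses))
    return total

tasks = list(range(HALF))   # one butterfly task per pair (i, i + HALF)

# Heuristic reorder: round-robin across bank classes, so each issued
# group spreads its accesses over as many distinct banks as possible.
by_bank = defaultdict(list)
for t in tasks:
    by_bank[bank(t)].append(t)
queues = [by_bank[b] for b in sorted(by_bank)]
reordered = []
while any(queues):
    for q in queues:
        if q:
            reordered.append(q.pop(0))

print(conflicts(tasks), conflicts(reordered))  # prints: 24 16
```

In this toy setting, issuing butterflies in plain index order puts all four concurrent tasks on the same pair of banks (6 collisions per group), while the round-robin order covers all four banks (4 collisions per group, the minimum for 8 accesses on 4 banks) — the same kind of contention reduction the paper reports at full scale on Cyclops-64.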