MCAMP: Communication optimization on Massively Parallel Machines with hierarchical scratch-pad memory

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT), Pub Date: 2008-10-25, DOI: 10.1145/1454115.1454132

H. Hayashizaki, Yutaka Sugawara, M. Inaba, K. Hiraki



Abstract

Massively parallel machines that integrate a large number of simple processors and small scratch-pad memories (SPMs) into a single chip can achieve high peak performance per watt. In these machines, communication optimization is important because communication bandwidth tends to be the bottleneck. Previously proposed communication optimizations based on copy candidates, which have been shown to be effective, detect frequently reused array regions through compile-time analysis and copy those regions to scratch-pad memories nearer to the processors. However, they were proposed for uniprocessor systems or small parallel machines with one or more layers of scratch-pad memory, and their analysis time increases when they are applied to massively parallel machines. In this paper, we propose Multilayer Copy-candidate Analysis for Massively Parallel machines (MCAMP), a communication optimization method for massively parallel machines. MCAMP re-formalizes the framework used in earlier work and improves the scalability of the analysis by assuming that the target system is homogeneous. We implemented an MCAMP optimizer, which takes an input program consisting of perfectly nested loops that contain array references and computation code, and generates optimized communication. We measured the performance of the optimizer's output programs by executing them on a real massively parallel machine, GRAPE-DR, using a software tool chain that we also implemented. We showed that MCAMP can achieve optimal data transfer patterns and performance comparable to that of hand-optimized code, with a short analysis time.
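The copy-candidate idea mentioned in the abstract can be illustrated with a small sketch. This is not the MCAMP algorithm; it is a simplified, single-layer illustration written under stated assumptions: the loop nest (a matrix multiply), the trip counts, the function names `copy_candidates` and `best_candidate`, and the cost model (a copy is refilled once per iteration of the enclosing loops) are all hypothetical and chosen only to show how a compile-time analysis might weigh region size against transfer volume.

```python
from math import prod

# Hypothetical loop nest (outermost first):
#   for i in 8: for j in 8: for k in 8: C[i][j] += A[i][k] * B[k][j]
LOOPS = [("i", 8), ("j", 8), ("k", 8)]
# Iterators that index each array reference (assumed for illustration).
REFS = {"A": ("i", "k"), "B": ("k", "j"), "C": ("i", "j")}


def copy_candidates(loops, index_vars):
    """List one copy candidate per loop depth for a single array reference.

    The candidate at depth d holds the region touched by loops d.. (the inner
    loops) while the iterators of loops 0..d-1 stay fixed.
    """
    candidates = []
    for depth in range(len(loops) + 1):
        outer, inner = loops[:depth], loops[depth:]
        # Region size: extents of the inner iterators that index the array.
        size = prod(n for v, n in inner if v in index_vars)
        # Simplifying assumption: refill the copy on every outer iteration.
        refills = prod(n for _, n in outer)
        candidates.append({"depth": depth, "size": size, "traffic": size * refills})
    return candidates


def best_candidate(loops, index_vars, capacity):
    """Pick the candidate that fits in the SPM and moves the least data."""
    fitting = [c for c in copy_candidates(loops, index_vars) if c["size"] <= capacity]
    # Break ties in favour of the smaller region.
    return min(fitting, key=lambda c: (c["traffic"], c["size"]))


if __name__ == "__main__":
    for name, index_vars in REFS.items():
        print(name, best_candidate(LOOPS, index_vars, capacity=16))
```

With a 16-element scratch-pad, the sketch selects for `A` the row `A[i][0..7]` copied once per `i` iteration, because that row is reused across the whole `j` loop. A real analysis would also have to handle multiple memory layers, overlap between successive copies, and many processors, which is where the scalability concern raised in the abstract comes in.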



