Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines

M. Kandemir, J. Ramanujam, A. Choudhary
{"title":"Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines","authors":"M. Kandemir, J. Ramanujam, A. Choudhary","doi":"10.1109/PACT.1997.644019","DOIUrl":null,"url":null,"abstract":"Distributed memory message passing machines can deliver scalable performance but are difficult to program. Shared memory machines, on the other hand, are easier to program but obtaining scalable performance with a large number of processors is difficult. Previously, some scalable architectures based on logically-shared physically-distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loops can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (it) optimizes locality (cache performance) by assigning appropriate storage layout for each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework. We present simulation results for cache miss rates, and empirical results on SUN SPARCstation 5, IBM SP-2, SGI Challenge and Convex Exemplar to validate the effectiveness of our approach on different architectures.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.1997.644019","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 24

Abstract

Distributed memory message passing machines can deliver scalable performance but are difficult to program. Shared memory machines, on the other hand, are easier to program but obtaining scalable performance with a large number of processors is difficult. Previously, some scalable architectures based on logically shared, physically distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loops can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (4) optimizes locality (cache performance) by assigning an appropriate storage layout for each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework. We present simulation results for cache miss rates, and empirical results on SUN SPARCstation 5, IBM SP-2, SGI Challenge and Convex Exemplar to validate the effectiveness of our approach on different architectures.
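The abstract gives no code, but steps (1) and (4) are easy to picture on a small example. The sketch below is purely illustrative and not taken from the paper: a two-deep nest with a recurrence along `i` is interchanged so that the dependence-free `j` loop becomes outermost (and can be distributed across processors), while the inner `i` loop walks each row of `a` with unit stride under C's row-major layout. The array names, the sizes, and the OpenMP pragma standing in for the compiler's parallelization are all assumptions.

```c
#define N 1024

/* Before: the i loop carries a recurrence (a[j][i] uses a[j][i-1]), so the
 * outermost loop cannot run in parallel, and the innermost j loop accesses
 * a[j][i] with stride N, touching a new cache line on every iteration. */
void scan_rows_before(double a[N][N], const double b[N][N])
{
    for (int i = 1; i < N; i++)          /* sequential: carries the dependence */
        for (int j = 0; j < N; j++)      /* parallel, but innermost */
            a[j][i] = a[j][i - 1] + b[j][i];
}

/* After interchange: the dependence-free j loop is outermost, so each
 * processor can own a block of rows; the recurrence stays inside and now
 * runs with unit-stride, cache-friendly accesses. */
void scan_rows_after(double a[N][N], const double b[N][N])
{
    #pragma omp parallel for             /* stand-in for compiler-generated parallelism */
    for (int j = 0; j < N; j++)
        for (int i = 1; i < N; i++)
            a[j][i] = a[j][i - 1] + b[j][i];
}
```

When interchange is not legal, step (4) can pursue the same locality goal from the other direction, by assigning the array a different storage layout (e.g., column-major for `a`) instead of reordering the iterations.

Step (3), message vectorization, can be sketched the same way. Again this is an assumed illustration using plain MPI point-to-point calls, not anything prescribed by the paper: the compiler replaces one send per element with a single send of the whole boundary section, amortizing the per-message start-up cost.

```c
#include <mpi.h>

#define N 1024

/* Naive: N one-element messages, paying the message start-up cost N times. */
void exchange_naive(double boundary[N], int peer)
{
    for (int i = 0; i < N; i++)
        MPI_Send(&boundary[i], 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}

/* Vectorized: the same data in one message. */
void exchange_vectorized(double boundary[N], int peer)
{
    MPI_Send(boundary, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}
```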