Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Interprocedural array remapping
Michal Cierniak, Wei Li
DOI: 10.1109/PACT.1997.644011 | Published: 1997-11-11
Abstract: Programming languages like Fortran or C define exactly the layout of array elements in memory. Programmers often use that definition to access the same memory via variables of different types. For many real programs this practice makes changing the layout of an array impossible without violating the semantics of the program, since the same memory block may be accessed via variables of different types; such accesses may now receive wrong array elements. On the other hand, changing array layout is often necessary to obtain good parallel performance or even to improve sequential performance by providing better cache locality. The paper demonstrates that the problem of changing array layouts in the presence of multiple variables of different types accessing the same memory can be solved with the algorithms for 1) detecting overlapping arrays, 2) using procedure cloning to reduce overlapping, 3) array-type coercion, and 4) code structure recovery.
Cited by: 13
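The hazard the abstract describes, the same memory being read through variables of different types, can be made concrete with a small C sketch. This is only an illustration of the problem, not the paper's detection or remapping algorithm: a 2-D array is also read through a flat pointer view, so silently remapping the 2-D layout would change what the flat view returns.

```c
/* Minimal illustration (not the paper's algorithm) of why overlapping
 * accesses through different types pin an array's memory layout:
 * a 2-D array is also read through a flat double pointer, so changing
 * the 2-D layout would change what the flat view returns. */
#include <stdio.h>

#define N 3

/* The "other type" view: treats the same storage as a flat vector. */
static double sum_flat(const double *flat, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += flat[i];            /* depends on element order in memory */
    return s;
}

int main(void) {
    double a[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 10.0 * i + j;

    /* Row-major C layout: flat index i*N + j. A compiler that silently
     * remapped a[][] (e.g. to column-major) would break this aliasing
     * view, which is the hazard overlap detection must guard against. */
    printf("a[1][2] via 2-D view  : %g\n", a[1][2]);
    printf("a[1][2] via flat view : %g\n", ((const double *)a)[1 * N + 2]);
    printf("sum via flat view     : %g\n", sum_flat((const double *)a, N * N));
    return 0;
}
```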
MDL: a language and compiler for dynamic program instrumentation
J. Hollingsworth, B. Miller, M. J. R. Gonçalves, Oscar Naim, Zhichen Xu, Ling Zheng
DOI: 10.1109/PACT.1997.644016 | Published: 1997-11-11
Abstract: We use a form of dynamic code generation, called dynamic instrumentation, to collect data about the execution of an application program. Dynamic instrumentation allows us to instrument running programs to collect performance and other types of information. The instrumentation code is generated incrementally and can be inserted and removed at any time. Our instrumentation currently runs on the SPARC, PA-RISC, Power 2, Alpha, and x86 architectures. Specifications of what data to collect are written in a specialized language called the Metric Description Language, which is part of the Paradyn Parallel Performance Tools. This language allows platform-independent descriptions of how to collect performance data. It also provides a concise way to specify how to constrain performance data to particular resources such as modules, procedures, nodes, files, or message channels (or combinations of these resources). We also describe the details of how we weave instrumentation into a running program.
Cited by: 127
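The C fragment below is a conceptual stand-in only: it is neither MDL syntax nor the Paradyn API, and the names in it are made up for the illustration. It imitates, with an ordinary hand-written wrapper, the counter-and-timer snippet that dynamic instrumentation would splice around a procedure in a running program without source changes.

```c
/* Conceptual sketch only: not MDL and not the Paradyn API. A hand-written
 * wrapper stands in for the trampoline code that dynamic instrumentation
 * inserts at procedure entry and exit to maintain metrics. */
#include <stdio.h>
#include <time.h>

static long call_count;          /* metric: number of calls */
static double total_seconds;     /* metric: inclusive wall-clock time */

static void work(int n) {        /* the procedure being "instrumented" */
    volatile double x = 0.0;
    for (int i = 0; i < n; i++) x += i * 0.5;
}

/* Stand-in for dynamically inserted entry/exit snippets. */
static void work_instrumented(int n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* entry snippet */
    work(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* exit snippet */
    call_count++;
    total_seconds += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    for (int i = 0; i < 5; i++)
        work_instrumented(100000);
    printf("calls=%ld, time=%.6fs\n", call_count, total_seconds);
    return 0;
}
```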
Buffer-safe communication optimization based on data flow analysis and performance prediction
T. Fahringer, E. Mehofer
DOI: 10.1109/PACT.1997.644015 | Published: 1997-11-11
Abstract: The paper presents a novel approach to reduce the communication costs of programs for distributed memory machines. The techniques are based on uni-directional bit-vector data flow analyses that enable vectorizing and coalescing communication, overlapping communication with computation, and eliminating redundant messages and redundant data transfers both within and across loop nests. The data flow analysis differs from previous techniques in that it does not require explicitly modeling balanced communication placement and loops, and does not employ interval analysis. The techniques are based on simple yet highly effective data flow equations which are solved iteratively for arbitrary control flow graphs. Moving communication earlier to hide latency has been shown to dramatically increase communication buffer sizes and can even cause run-time errors. The authors use P³T, a state-of-the-art performance estimator, to create a buffer-safe program. By accurately estimating both the communication buffer sizes required and the implied communication time of every single communication in a program, one can selectively choose communication that must be delayed in order to ensure a correct communication placement while maximizing communication latency hiding. Experimental results are presented to demonstrate the efficacy of the communication optimization strategy.
Cited by: 7
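The abstract mentions simple data flow equations solved iteratively over arbitrary control flow graphs. The sketch below shows that machinery for a generic forward bit-vector problem (IN is the meet of predecessor OUTs, OUT = GEN | (IN & ~KILL)) on a small diamond-shaped CFG. The paper's actual communication-placement equations are different, and the GEN/KILL values here are arbitrary example data.

```c
/* Generic iterative bit-vector data-flow solver, sketched as a forward
 * "must"-style problem. Only the fixed-point machinery is shown; the
 * equations and facts are example data, not the paper's analysis. */
#include <stdio.h>
#include <stdint.h>

#define NNODES 4

int main(void) {
    /* Small diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3 */
    int npred[NNODES]     = { 0, 1, 1, 2 };
    int pred[NNODES][2]   = { {0,0}, {0,0}, {0,0}, {1,2} };
    uint32_t gen[NNODES]  = { 0x1, 0x2, 0x2, 0x0 };   /* facts generated */
    uint32_t kill[NNODES] = { 0x0, 0x0, 0x1, 0x0 };   /* facts killed    */
    uint32_t in[NNODES]   = { 0 }, out[NNODES] = { 0 };

    int changed = 1;
    while (changed) {                       /* iterate to a fixed point */
        changed = 0;
        for (int n = 0; n < NNODES; n++) {
            uint32_t i = (npred[n] == 0) ? 0 : 0xFFFFFFFFu;
            for (int p = 0; p < npred[n]; p++)
                i &= out[pred[n][p]];       /* meet over predecessors */
            uint32_t o = gen[n] | (i & ~kill[n]);
            if (i != in[n] || o != out[n]) {
                in[n] = i; out[n] = o; changed = 1;
            }
        }
    }
    for (int n = 0; n < NNODES; n++)
        printf("node %d: IN=0x%X OUT=0x%X\n", n, (unsigned)in[n], (unsigned)out[n]);
    return 0;
}
```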
Improving the memory bandwidth of highly-integrated, wide-issue, microprocessor-based systems
D. Albonesi, I. Koren
DOI: 10.1109/PACT.1997.644009 | Published: 1997-11-11
Abstract: Next-generation wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches and dedicated memory ports while minimizing the cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512-kByte L2 cache, board-level caches provide significant enough performance gains to justify their higher cost.
Cited by: 2
Parallel execution of radix sort program using fine-grain communication
Yuetsu Kodama, H. Sakane, H. Koike, M. Sato, S. Sakai, Y. Yamaguchi
DOI: 10.1109/PACT.1997.644010 | Published: 1997-11-11
Abstract: The report presents empirical results of fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low latency, high throughput fine-grain communication: this hardware support includes packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. The authors study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X, and compare it to the Fujitsu AP1000+ and the Cray Server CS6400. The experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as one increases the number of processors, other coarse-grain message-passing machines exhibit fluctuation and performance degradation for larger configurations due to network contention.
Cited by: 3
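For reference, the sequential kernel being parallelized is a standard least-significant-digit radix sort, shown below as a plain C sketch over 8-bit digits. The paper's contribution, distributing the histogram and permutation phases with fine-grain remote operations on EM-X, is not reproduced here.

```c
/* Sequential LSD radix sort over 8-bit digits (counting sort per pass).
 * The parallel version in the paper distributes keys and performs the
 * histogram/permutation phases with fine-grain remote memory operations;
 * this sketch shows only the sequential kernel. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void radix_sort_u32(uint32_t *a, size_t n) {
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = { 0 };
        for (size_t i = 0; i < n; i++)          /* histogram of this digit */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int d = 0; d < 256; d++)           /* prefix sum -> start offsets */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)          /* stable scatter by digit */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

int main(void) {
    uint32_t keys[] = { 170, 45, 75, 90, 802, 24, 2, 66 };
    size_t n = sizeof keys / sizeof keys[0];
    radix_sort_u32(keys, n);
    for (size_t i = 0; i < n; i++) printf("%u ", (unsigned)keys[i]);
    printf("\n");
    return 0;
}
```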
Empirical evaluation of deterministic and adaptive routing with constant-area routers
Dianne Miller, W. Najjar
DOI: 10.1109/PACT.1997.644004 | Published: 1997-11-11
Abstract: This paper addresses the issue of how router complexity affects the overall performance of deterministic and adaptive routing under virtual cut-through switching in k-ary n-cube networks. First, the performance of various adaptive routers with constant area is compared. Second, the performance of adaptive and deterministic routers is compared under the same conditions. Finally, it is shown that, under certain conditions, deterministic routers can reach saturation points comparable to adaptive routers.
Cited by: 9
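As a concrete baseline, the sketch below implements one step of dimension-order routing, the canonical deterministic scheme for k-ary n-cubes, assuming minimal routing with wraparound links. It is meant only to fix terminology; the routers evaluated in the paper are hardware designs, not this code.

```c
/* Dimension-order (deterministic) routing step in a k-ary n-cube:
 * correct the lowest dimension in which current and destination
 * coordinates differ, moving the shorter way around the ring
 * (wraparound links assumed). Illustration only. */
#include <stdio.h>

#define K 4   /* radix: nodes per dimension */
#define N 3   /* number of dimensions       */

/* Writes the next node's coordinates into next[]; returns the dimension
 * corrected, or -1 if cur already equals dst. */
static int next_hop(const int cur[N], const int dst[N], int next[N]) {
    for (int d = 0; d < N; d++) next[d] = cur[d];
    for (int d = 0; d < N; d++) {
        if (cur[d] == dst[d]) continue;
        int fwd = (dst[d] - cur[d] + K) % K;      /* hops in + direction */
        int step = (fwd <= K - fwd) ? 1 : -1;     /* pick the shorter way */
        next[d] = (cur[d] + step + K) % K;
        return d;
    }
    return -1;
}

int main(void) {
    int cur[N] = { 0, 3, 1 }, dst[N] = { 2, 0, 1 }, nxt[N];
    while (next_hop(cur, dst, nxt) >= 0) {
        printf("(%d,%d,%d) -> (%d,%d,%d)\n",
               cur[0], cur[1], cur[2], nxt[0], nxt[1], nxt[2]);
        for (int d = 0; d < N; d++) cur[d] = nxt[d];
    }
    return 0;
}
```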
Effective usage of vector registers in advanced vector architectures
L. Villa, R. Espasa, M. Valero
DOI: 10.1109/PACT.1997.644021 | Published: 1997-11-11
Abstract: This paper presents data confirming that traditional vector architectures cannot reduce their vector register length without suffering a severe performance penalty. However, we show that by combining the vector register length reduction with two different ILP techniques, decoupling and multithreading, the performance penalty can be made very small. We show that each resulting architecture tolerates long memory latencies very well and also makes better use of the available storage space in each vector register. Using decoupling and short vectors, each register can be halved while still providing speedups in the range 1.04-1.49 over a traditional architecture with long registers. Using multithreading, we split a vector register file in two halves and show that two independent threads running on such a machine can yield speedups in the range 1.23-1.29. The paper also explores configurations with 1/4 and 1/8 the original vector register size aimed at cost-conscious designs, and shows that even at 1/4 the original size, the resulting architectures can outperform a traditional machine. We also present results across a wide range of memory latencies, and show that the combination of short vectors and ILP techniques results in very good tolerance of slow memory systems.
Cited by: 7
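What a shorter vector register length means for generated code can be seen in a strip-mined loop: the loop is executed in chunks of at most VL elements, so halving VL doubles the number of strips. The C sketch below models only that structural effect; the decoupling and multithreading the paper uses to hide the extra latency are not modeled, and VL = 64 is an arbitrary example value.

```c
/* Strip-mining sketch: a vector loop runs in chunks of at most VL
 * elements, where VL models the architectural vector register length.
 * Halving VL doubles the number of strips (and vector instructions). */
#include <stdio.h>

#define N  1000
#define VL 64          /* assumed vector register length, in elements */

int main(void) {
    static double x[N], y[N];
    double a = 2.0;
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    for (int base = 0; base < N; base += VL) {     /* one strip per vector op */
        int len = (N - base < VL) ? (N - base) : VL;
        for (int i = 0; i < len; i++)              /* body maps to one vector instr. */
            y[base + i] += a * x[base + i];
    }
    printf("y[%d] = %g\n", N - 1, y[N - 1]);
    return 0;
}
```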
Direct generation of data-driven program for stream-oriented processing
K. Karasawa, M. Iwata, H. Terada
DOI: 10.1109/PACT.1997.644025 | Published: 1997-11-11
Abstract: This paper proposes a scheme that directly transforms unified system specifications into highly parallel dynamic data-driven programs incrementally in an interactive fashion. In this paper the scheme proposed is described with special emphasis on its application to stream-oriented processing such as multimedia signal processing. An abstract data type for generalized multiple data streams is introduced in order to facilitate interpretations of hierarchical and diagrammatic specifications. Also an optimization technique applicable in fitting the specifications to a specific hardware configuration is shown. Finally, practicability of the methodology is illustrated through a design process of an HDTV signal decoder.
Cited by: 3
Optimally synchronizing DOACROSS loops on shared memory multiprocessors
R. Rajamony, A. Cox
DOI: 10.1109/PACT.1997.644017 | Published: 1997-11-11
Abstract: We present two algorithms to minimize the amount of synchronization added when parallelizing a loop with loop-carried dependences. In contrast to existing schemes, our algorithms add less synchronization, while preserving the parallelism that can be extracted from the loop. Our first algorithm uses an interval graph representation of the dependence "overlap" to find a synchronization placement in time almost linear in the number of dependences. Although this solution may be suboptimal, it is still better than that obtained using existing methods, which first eliminate redundant dependences and then synchronize the remaining ones. Determining the optimal synchronization is an NP-complete problem. Our second algorithm therefore uses integer programming to determine the optimal solution. We first use a polynomial-time algorithm to find a minimal search space that must contain the optimal solution. Then, we formulate the problem of choosing the minimal synchronization from the search space as a set-cover problem, and solve it exactly using 0-1 integer programming. We show the performance impact of our algorithms by synchronizing a set of synthetic loops on an 8-processor Convex Exemplar. The greedily synchronized loops ran between 7% and 22% faster than those synchronized by the best existing algorithm. Relative to the same base, the optimally synchronized loops ran between 10% and 22% faster.
Cited by: 12
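The mechanism being minimized can be illustrated with a naive DOACROSS parallelization in C: iterations carrying the dependence a[i] = a[i-1] + b[i] are distributed cyclically over two threads and ordered with post/wait, implemented here with one POSIX semaphore per iteration. This worst-case placement of one synchronization per iteration is exactly the kind of redundancy the paper's algorithms remove; it is a sketch, not the paper's code.

```c
/* DOACROSS sketch: a loop with the carried dependence a[i] = a[i-1] + b[i]
 * is distributed cyclically over two threads and ordered with post/wait,
 * one POSIX semaphore per iteration. Build with: cc doacross.c -lpthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N        16
#define NTHREADS 2

static double a[N + 1], b[N + 1];
static sem_t done[N + 1];                /* done[i] posted when a[i] is ready */

static void *worker(void *arg) {
    long tid = (long)arg;
    for (int i = 1 + (int)tid; i <= N; i += NTHREADS) {
        sem_wait(&done[i - 1]);          /* wait: a[i-1] must be ready */
        a[i] = a[i - 1] + b[i];
        sem_post(&done[i]);              /* post: a[i] is now ready    */
    }
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (int i = 0; i <= N; i++) { b[i] = 1.0; sem_init(&done[i], 0, 0); }
    a[0] = 0.0;
    sem_post(&done[0]);                  /* iteration 0's value exists */

    for (long t = 0; t < NTHREADS; t++) pthread_create(&th[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++) pthread_join(th[t], NULL);

    printf("a[%d] = %g (expected %d)\n", N, a[N], N);
    return 0;
}
```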
Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines
M. Kandemir, J. Ramanujam, A. Choudhary
DOI: 10.1109/PACT.1997.644019 | Published: 1997-11-11
Abstract: Distributed memory message passing machines can deliver scalable performance but are difficult to program. Shared memory machines, on the other hand, are easier to program but obtaining scalable performance with a large number of processors is difficult. Previously, some scalable architectures based on logically-shared physically-distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loops can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (4) optimizes locality (cache performance) by assigning an appropriate storage layout to each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework. We present simulation results for cache miss rates, and empirical results on the SUN SPARCstation 5, IBM SP-2, SGI Challenge and Convex Exemplar to validate the effectiveness of our approach on different architectures.
Cited by: 24
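The locality effect this algorithm exploits is easy to reproduce: in C's row-major layout a j-then-i traversal strides through memory, while the interchanged i-then-j order is unit-stride and cache-friendly. The sketch below only demonstrates that effect; choosing between loop transformations and per-array storage layouts across whole programs is what the paper's unified algorithm does.

```c
/* Locality sketch: with C's row-major layout, the j-i loop order walks
 * the array with stride N (poor cache behavior), while the interchanged
 * i-j order is unit-stride. Illustration of the effect only. */
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Column-by-column traversal of a row-major array: stride-N accesses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = i + j;

    /* Interchanged loops: unit-stride, cache-friendly traversal. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("sum = %g\n", sum);
    return 0;
}
```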