Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques: Latest Articles, Page 2

Path prediction for high issue-rate processors

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644014

Kishore N. Menezes, Sumedh W. Sathaye, T. Conte

Abstract: Rapid developments in the exploitation of instruction-level parallelism are prompting deeper-pipelined, wider machines with high issue rates. Speculative execution has been used to provide the required issue bandwidth. Current methods predict a single branch at a time. Performance improvement is possible by predicting multiple branches in a single cycle. The paper presents a technique to predict paths in a single access. The correlation of a path with the branches executed before it is exploited to provide high prediction accuracy. A novel path prediction automaton is presented. The automaton is easily scalable to predict long paths through arbitrary subgraphs. It also predicts a path through a subgraph in a single access. The automaton requires only n+1 bits for predicting the 2^n paths in a subgraph of depth n. The performance of the proposed path predictor is measured. The full path accuracy (accuracy in predicting all the branches in a path) is higher than or equal to that of other predictors found in the literature. This performance is achieved at a low hardware cost. The scalability, single-access prediction, and low hardware cost of the path prediction technique presented in the paper make it suitable for machines requiring high issue bandwidth.
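The abstract above states the storage cost (n+1 bits for the 2^n paths of a depth-n subgraph) but not the automaton's states. The sketch below is therefore only one hypothetical way to spend n+1 bits, not the authors' automaton: n bits hold the predicted branch directions and one extra bit acts as hysteresis. The history-indexed table and the names `PathPredictorEntry`, `PathPredictor`, and `directions` are assumptions made for illustration.

```python
def directions(path, depth):
    """Expand an n-bit path into per-branch directions (1 = taken), branch 0 first."""
    return [(path >> i) & 1 for i in range(depth)]


class PathPredictorEntry:
    """Hypothetical entry: n path bits plus 1 hysteresis bit (an assumption)."""

    def __init__(self, depth):
        self.depth = depth       # n: number of branches along a path
        self.path = 0            # n-bit predicted path
        self.confident = False   # the extra (n+1)-th bit

    def update(self, actual_path):
        if actual_path == self.path:
            self.confident = True
        elif self.confident:
            self.confident = False    # first miss: keep the path, drop confidence
        else:
            self.path = actual_path   # miss without confidence: replace the path

    def storage_bits(self):
        return self.depth + 1


class PathPredictor:
    """Table of entries indexed by the directions of previously executed branches."""

    def __init__(self, depth, history_bits):
        self.depth = depth
        self.mask = (1 << history_bits) - 1
        self.history = 0
        self.table = [PathPredictorEntry(depth) for _ in range(1 << history_bits)]

    def predict(self):
        # One table access yields all n branch directions of the path at once.
        return self.table[self.history].path

    def update(self, actual_path):
        self.table[self.history].update(actual_path)
        # Shift the executed directions into the branch history.
        self.history = ((self.history << self.depth) | actual_path) & self.mask


predictor = PathPredictor(depth=3, history_bits=6)
for step in range(8):
    actual = 0b101 if step % 2 == 0 else 0b010
    guess = predictor.predict()   # must be read before update() changes the history
    predictor.update(actual)
```

With depth 3, each entry stores 4 bits while covering 2^3 = 8 possible paths; `directions(predictor.predict(), 3)` turns a prediction into individual branch outcomes.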

Citations: 12

Design of heterogenous multi-processor embedded systems: applying functional pipelining

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644012

I. Karkowski, H. Corporaal

Citations: 25

Locality analysis for parallel C programs

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.643999

Yingchun Zhu, L. Hendren

Abstract: Many parallel architectures support a memory model where some memory accesses are local, and thus inexpensive, while other memory accesses are remote, and potentially quite expensive. In order to achieve good parallel performance, it is often necessary to reduce the number of remote memory accesses. This can be done by the programmer, the compiler, or a combination of both. The overall goal is to minimize the work required by the programmer, and have the compiler automate the process as much as possible. The paper reports on compiler techniques for decreasing the number of remote memory accesses using locality analysis for a parallel dialect of C called EARTH-C. The locality analysis uses an algorithm inspired by type inference algorithms for fast points-to analysis. The algorithm estimates when an indirect reference via a pointer can be safely assumed to be a local access. The locality inference algorithm is also used to guide the automatic specialization of functions in order to take advantage of locality specific to particular calling contexts. The locality analysis and automatic specialization have been implemented in the EARTH-C compiler, which produces low-level threaded code for the EARTH-C multithreaded architecture. Experimental results are presented for a set of benchmarks that operate on irregular, dynamically allocated data structures. The techniques give moderate to significant speedups, and they do lessen the burden on the programmer.
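The abstract above describes a locality inference "inspired by type inference algorithms for fast points-to analysis" without giving its rules. As a rough illustration of that family of techniques, the sketch below uses unification over a union-find structure: pointers that are assigned to each other share one equivalence class, and a class counts as local only if none of its members may be remote. The class name `LocalityInference`, its methods, and the default that unmarked pointers are local are all assumptions; this is not the EARTH-C algorithm.

```python
class LocalityInference:
    """Flow-insensitive, unification-style locality inference (illustrative only)."""

    def __init__(self):
        self.parent = {}         # union-find parent links
        self.maybe_remote = {}   # per-representative flag

    def _find(self, var):
        self.parent.setdefault(var, var)
        self.maybe_remote.setdefault(var, False)  # assumption: local unless marked
        while self.parent[var] != var:
            self.parent[var] = self.parent[self.parent[var]]  # path halving
            var = self.parent[var]
        return var

    def assign(self, dst, src):
        """Record `dst = src`: both pointers join one equivalence class."""
        a, b = self._find(dst), self._find(src)
        if a != b:
            self.parent[a] = b
            self.maybe_remote[b] = self.maybe_remote[a] or self.maybe_remote[b]

    def mark_remote(self, var):
        """Record that `var` may point into another node's memory."""
        self.maybe_remote[self._find(var)] = True

    def is_local(self, var):
        """True if a dereference of `var` can be treated as a local access."""
        return not self.maybe_remote[self._find(var)]


analysis = LocalityInference()
analysis.assign("q", "p")     # q = p; neither is known to be remote
analysis.mark_remote("r")     # r may refer to remote memory
analysis.assign("s", "r")     # s = r
print(analysis.is_local("q"), analysis.is_local("s"))  # True False
```

Because the merge is flow-insensitive, a later `analysis.assign("q", "s")` would conservatively make `p` and `q` non-local as well. That is the usual price of a near-linear unification pass.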

Citations: 18

Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644023

J. Skeppstedt

Abstract: In this paper we first identify limitations of compiler-controlled prefetching in a CC-NUMA multiprocessor with a write-invalidate cache coherence protocol. Compiler-controlled prefetch techniques for CC-NUMAs are often focused only on stride accesses, and this introduces a major limitation. We consider combining prefetch with two other compiler-controlled techniques to partly remedy the situation: (1) load-exclusive to reduce write latency and (2) store-update to reduce read latency. The purpose of each of these techniques in a machine with prefetch is to reduce latency for accesses that the prefetch technique could not handle. We evaluate two different scenarios, first with a hybrid compiler/hardware prefetch technique and second with an optimal stride prefetcher. We find that the combined gains under the hybrid prefetch technique are significant for the six applications we have studied: on average, 71% of the original write-stall time remains after using the hybrid prefetcher, and of these ownership requests, 60% would be eliminated using load-exclusive; on average, 68% of the read-stall time remains after using the hybrid prefetcher, and of these read misses, 34% were serviced by remote caches and would be converted by store-update into misses serviced by a clean copy in memory, which reduces the read latency. With an optimal stride prefetcher, our results show that it is beneficial to complement prefetch with the two techniques here as well.
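A back-of-envelope reading of the write-side figures in the abstract above can be sketched as follows. It rests on an assumption the abstract does not state: that every remaining ownership request contributes equally to write-stall time, so removing 60% of the requests removes 60% of the remaining stall. The function name `residual_write_stall` is hypothetical.

```python
def residual_write_stall(remaining_after_prefetch_pct, eliminated_by_load_exclusive_pct):
    """Estimated write-stall time left, as a percentage of the original.

    Assumes stall time scales linearly with the number of ownership requests.
    """
    return remaining_after_prefetch_pct * (100 - eliminated_by_load_exclusive_pct) / 100


print(residual_write_stall(71, 60))  # 28.4
```

Under that assumption, roughly 28% of the original write-stall time would remain. The read side is not estimated here, because the abstract gives no latency figures for misses serviced from memory versus remote caches.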

Citations: 2

The effect of limited network bandwidth and its utilization by latency hiding techniques in large-scale shared memory systems

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644002

Sunil Kim, A. Veidenbaum

Citations: 4

VLIW across multiple superscalar processors on a single chip

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644013

Soohong P. Kim, R. Hoare, H. Dietz

Citations: 2

Efficient personalized communication on wormhole networks

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644003

F. Petrini, M. Vanneschi

Citations: 4

Heap analysis and optimizations for threaded programs

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644000

Xinan Tang, R. Ghiya, L. Hendren, G. Gao

Citations: 26

Towards a time and space efficient functional implementation of a Monte Carlo photon transport code

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644024

J. Hammes, A. Böhm

Citations: 0

A parallel algorithm for compile-time scheduling of parallel programs on multiprocessors

Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 1997-11-11 DOI: 10.1109/PACT.1997.644006

Yu-Kwong Kwok, I. Ahmad

Abstract: This paper proposes a parallel randomized algorithm, called PFAST (Parallel Fast Assignment using Search Technique), for scheduling parallel programs represented by directed acyclic graphs (DAGs) at compile time. The PFAST algorithm has O(e) time complexity, where e is the number of edges in the DAG. This linear-time algorithm works by first generating an initial solution and then refining it using a parallel random search. Using a prototype computer-aided parallelization and scheduling tool called CASCH (Computer-Aided SCHeduling), the algorithm is found to outperform numerous previous algorithms while taking dramatically shorter execution times. The distinctive feature of this research is that, instead of simulations, the proposed algorithm is evaluated and compared with other algorithms using the CASCH tool with real applications running on an Intel Paragon. The PFAST algorithm is also evaluated with randomly generated DAGs for which optimal schedules are known. The algorithm generated optimal solutions for a majority of the test cases and close-to-optimal solutions for the others. The proposed algorithm is the fastest scheduling algorithm known to the authors and is an attractive choice for scheduling under running-time constraints.
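The two-phase idea in the abstract above (build an initial schedule, then refine it by random search) can be illustrated with a small sequential sketch. It is not PFAST: the greedy initial placement, the single-task move, the acceptance rule, and the names `simulate` and `schedule` are assumptions. The search here is neither parallel nor linear-time.

```python
import random


def simulate(order, assign, cost, preds, comm):
    """Return finish times when tasks run in `order` on their assigned processors."""
    proc_free, finish = {}, {}
    for task in order:
        ready = 0
        for pred in preds.get(task, ()):
            # Communication is paid only when producer and consumer are on different processors.
            delay = comm.get((pred, task), 0) if assign[pred] != assign[task] else 0
            ready = max(ready, finish[pred] + delay)
        start = max(ready, proc_free.get(assign[task], 0))
        finish[task] = start + cost[task]
        proc_free[assign[task]] = finish[task]
    return finish


def schedule(order, cost, preds, comm, num_procs, steps=200, seed=0):
    """`order` must be a topological order of the DAG, given as a list."""
    rng = random.Random(seed)
    assign = {}
    # Phase 1: initial solution, placing each task where it finishes earliest.
    for i, task in enumerate(order):
        def finish_on(proc):
            trial = dict(assign)
            trial[task] = proc
            return simulate(order[:i + 1], trial, cost, preds, comm)[task]
        assign[task] = min(range(num_procs), key=finish_on)
    best = max(simulate(order, assign, cost, preds, comm).values())
    # Phase 2: random search, keeping a move only if the schedule gets shorter.
    for _ in range(steps):
        task, proc = rng.choice(order), rng.randrange(num_procs)
        old = assign[task]
        if proc == old:
            continue
        assign[task] = proc
        length = max(simulate(order, assign, cost, preds, comm).values())
        if length < best:
            best = length
        else:
            assign[task] = old
    return assign, best


# A small fork-join DAG: a feeds b, c, d, which all feed e.
order = ["a", "b", "c", "d", "e"]
cost = {"a": 1, "b": 3, "c": 3, "d": 3, "e": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c", "d"]}
comm = {(p, t): 1 for t, ps in preds.items() for p in ps}
assignment, length = schedule(order, cost, preds, comm, num_procs=2)
print(assignment, length)
```

Calling `schedule(..., steps=0)` returns the greedy schedule alone, which makes it easy to see how much the random refinement contributes on a given graph.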

Citations: 5