Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques: Latest Papers

Path prediction for high issue-rate processors
Kishore N. Menezes, Sumedh W. Sathaye, T. Conte
{"title":"Path prediction for high issue-rate processors","authors":"Kishore N. Menezes, Sumedh W. Sathaye, T. Conte","doi":"10.1109/PACT.1997.644014","DOIUrl":"https://doi.org/10.1109/PACT.1997.644014","url":null,"abstract":"Rapid developments in the exploitation of instruction-level parallelism are prompting deeper-pipelined, wider machines with high issue rates. Speculative execution has been used to provide the required issue bandwidth. Current methods predict a single branch at a time. Performance improvement is possible by predicting multiple branches in a single cycle. The paper presents a technique to predict paths in a single access. The correlation of a path with the branches executed before it is exploited to provide high prediction accuracy. A novel path prediction automaton is presented. The automaton is easily scalable to predict long paths through arbitrary subgraphs. It also predicts a path through a subgraph in a single access. The automaton requires only n+1 bits for predicting the 2^n paths in a subgraph of depth n. The performance of the proposed path predictor is measured. The full path accuracy (accuracy in predicting all the branches in a path) is higher than or equal to that of other predictors found in the literature. This performance is achieved at a low hardware cost. The scalability, single-access prediction, and low hardware cost of the path prediction technique presented in the paper make it suitable for machines requiring high issue bandwidth.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126086790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Design of heterogenous multi-processor embedded systems: applying functional pipelining
I. Karkowski, H. Corporaal
{"title":"Design of heterogenous multi-processor embedded systems: applying functional pipelining","authors":"I. Karkowski, H. Corporaal","doi":"10.1109/PACT.1997.644012","DOIUrl":"https://doi.org/10.1109/PACT.1997.644012","url":null,"abstract":"Practice shows that increasing the amount of instruction level parallelism (ILP) offered by an architecture (like adding instruction slots to VLIW instructions) does not necessarily lead to significant performance gains. Instead, high hardware costs and inefficient use of this hardware may occur. Mapping embedded applications onto multiprocessor systems forms a very interesting extension to ILP. The authors describe their approach to the mapping of embedded programs written in ANSI C onto a pipeline of application specific processors. An efficient algorithm for functional pipelining of loops is presented. To validate its applicability the frequency tracking system is used as a case study. This typical embedded application is mapped onto a two-processor system delivering a speedup of 1.88 in comparison with a highly optimized single core solution.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128059243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Locality analysis for parallel C programs
Yingchun Zhu, L. Hendren
{"title":"Locality analysis for parallel C programs","authors":"Yingchun Zhu, L. Hendren","doi":"10.1109/PACT.1997.643999","DOIUrl":"https://doi.org/10.1109/PACT.1997.643999","url":null,"abstract":"Many parallel architectures support a memory model where some memory accesses are local, and thus inexpensive, while other memory accesses are remote, and potentially quite expensive. In order to achieve good parallel performance, it is often necessary to reduce the number of remote memory accesses. This can be done by the programmer, the compiler, or a combination of both. The overall goal is to minimize the work required by the programmer, and have the compiler automate the process as much as possible. The paper reports on compiler techniques for decreasing the number of remote memory accesses using locality analysis for a parallel dialect of C called EARTH-C. The locality analysis uses an algorithm inspired by type inference algorithms for fast points-to analysis. The algorithm estimates when an indirect reference via a pointer can be safely assumed to be a local access. The locality inference algorithm is also used to guide the automatic specialization of functions in order to take advantage of locality specific to particular calling contexts. The locality analysis and automatic specialization have been implemented in the EARTH-C compiler, which produces low-level threaded code for the EARTH-C multithreaded architecture. Experimental results are presented for a set of benchmarks that operate on irregular, dynamically allocated data structures. The techniques give moderate to significant speedups and they do lessen the burden on the programmer.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128309951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions
J. Skeppstedt
{"title":"Overcoming limitations of prefetching in multiprocessors by compiler-initiated coherence actions","authors":"J. Skeppstedt","doi":"10.1109/PACT.1997.644023","DOIUrl":"https://doi.org/10.1109/PACT.1997.644023","url":null,"abstract":"In this paper we first identify limitations of compiler-controlled prefetching in a CC-NUMA multiprocessor with a write-invalidate cache coherence protocol. Compiler-controlled prefetch techniques for CC-NUMAs are often focused only on stride accesses, and this introduces a major limitation. We consider combining prefetch with two other compiler-controlled techniques to partly remedy the situation: (1) load-exclusive to reduce write-latency and (2) store-update to reduce read-latency. The purpose of each of these techniques in a machine with prefetch is to reduce latency for accesses which the prefetch technique could not handle. We evaluate two different scenarios, first with a hybrid compiler/hardware prefetch technique and second with an optimal stride-prefetcher. We find that the combined gains under the hybrid prefetch technique are significant for six applications we have studied: on average, 71% of the original write-stall time remains after using the hybrid prefetcher, and of these ownership-requests, 60% would be eliminated using load-exclusive; on average, 68% of the read-stall time remains after using the hybrid prefetcher, and of these read-misses, 34% were serviced by remote caches and would be converted by store-update into misses serviced by a clean copy in memory, which reduces the read-latency. With an optimal stride-prefetcher, our results show that it is beneficial to complement prefetch with the two techniques here as well.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133834526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
The effect of limited network bandwidth and its utilization by latency hiding techniques in large-scale shared memory systems
Sunil Kim, A. Veidenbaum
{"title":"The effect of limited network bandwidth and its utilization by latency hiding techniques in large-scale shared memory systems","authors":"Sunil Kim, A. Veidenbaum","doi":"10.1109/PACT.1997.644002","DOIUrl":"https://doi.org/10.1109/PACT.1997.644002","url":null,"abstract":"Addresses the use of two latency hiding techniques, prefetching and weak consistency, for large-scale shared memory multiprocessors with compiler-controlled cache coherence management and the interaction of latency hiding techniques and network bandwidth. The performance effect of latency hiding is evaluated and compared varying the network channel bandwidth. The interaction of reads, writes and prefetches, given a limited bandwidth, is studied, and an approach to better network bandwidth utilization by limiting the number of outstanding requests in each node is investigated. Increasing network (channel) bandwidth helps both prefetching and non-prefetching systems, with the initial 2× increase in bandwidth giving the most improvement. The use of prefetching can deliver a much larger improvement than increasing network bandwidth for a 128-processor system for some benchmarks, even with the minimal bandwidth. Controlling bandwidth utilization is shown to be important when prefetch and write request rates are high.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114250861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
VLIW across multiple superscalar processors on a single chip
Soohong P. Kim, R. Hoare, H. Dietz
{"title":"VLIW across multiple superscalar processors on a single chip","authors":"Soohong P. Kim, R. Hoare, H. Dietz","doi":"10.1109/PACT.1997.644013","DOIUrl":"https://doi.org/10.1109/PACT.1997.644013","url":null,"abstract":"Advances in IC technology increase the integration density for higher clock rates and provide more opportunities for microprocessor design. The authors propose a new paradigm to exploit instruction-level parallelism (ILP) across multiple superscalar processors on a single chip by taking advantage of both VLIW-style static scheduling techniques and dynamic scheduling of superscalar architecture. In the proposed paradigm, ILP is exploited by a compiler from a sequential program and this VLIW-like-parallelized code is further parallelized by 2-way superscalar engines at run-time. Superscalar processors are connected by an aggregate function network, which can enforce the necessary static timing constraints and provide appropriate inter-processor data communication mechanisms that are needed for ILP. The aggregate function operations are statically scheduled and implement not only fine-grain communication and control, but also simple global computations resembling systolic array operations within the network.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130527790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Efficient personalized communication on wormhole networks
F. Petrini, M. Vanneschi
{"title":"Efficient personalized communication on wormhole networks","authors":"F. Petrini, M. Vanneschi","doi":"10.1109/PACT.1997.644003","DOIUrl":"https://doi.org/10.1109/PACT.1997.644003","url":null,"abstract":"Bridging models, such as the BSP (bulk synchronous parallel) model, tend to abstract the characteristics of interconnection networks using a small set of parameters, by dividing the computation into supersteps and organizing the communication into global patterns called h-relations. In this paper, we evaluate (through experimental results conducted on a wormhole-routed 2D torus and a quaternary fat-tree with 256 processing nodes) the execution time of three families of h-relations with variable degree of imbalance. We also prove a strong result that links the communication performance of the fat-tree with the BSP abstraction of the interconnection network. Given a generic h-relation, we can provide a value of g (the gap) that, in the worst case, slightly overestimates the completion time and is very close to optimality.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"417 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122480354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
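The abstract above refers to h-relations and the BSP gap parameter g without spelling out how they combine. A minimal sketch, assuming the textbook BSP cost model (communication time of a superstep estimated as g*h + l, where h is the largest number of words any node sends or receives): the function names, the message tuple layout, and the sample values of g and l are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def h_of_relation(messages):
    """Return h for a communication pattern.

    messages: iterable of (src, dst, words) tuples (hypothetical layout).
    h is the maximum, over all nodes, of words sent or words received.
    """
    sent = Counter()
    received = Counter()
    for src, dst, words in messages:
        sent[src] += words
        received[dst] += words
    nodes = set(sent) | set(received)
    return max((max(sent[n], received[n]) for n in nodes), default=0)


def bsp_comm_cost(messages, g, l):
    """Textbook BSP estimate: g * h plus the synchronization latency l."""
    return g * h_of_relation(messages) + l


if __name__ == "__main__":
    # Balanced pattern: each of 4 nodes sends one word to its neighbour.
    ring = [(i, (i + 1) % 4, 1) for i in range(4)]
    # Imbalanced pattern: nodes 1..3 all send one word to node 0.
    gather = [(1, 0, 1), (2, 0, 1), (3, 0, 1)]
    print(h_of_relation(ring))    # 1
    print(h_of_relation(gather))  # 3
    print(bsp_comm_cost(gather, g=2, l=10))  # 16
```

The two patterns move a similar number of words in total, yet the imbalanced one has three times the h value, which is the kind of imbalance the abstract says the experiments vary.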
Heap analysis and optimizations for threaded programs
Xinan Tang, R. Ghiya, L. Hendren, G. Gao
{"title":"Heap analysis and optimizations for threaded programs","authors":"Xinan Tang, R. Ghiya, L. Hendren, G. Gao","doi":"10.1109/PACT.1997.644000","DOIUrl":"https://doi.org/10.1109/PACT.1997.644000","url":null,"abstract":"Traditional compiler optimizations such as loop invariant removal and common sub-expression elimination are standard in all optimizing compilers. The purpose of the paper is to present new versions of these optimizations that apply to programs using dynamically allocated data structures, and to show the effect of these optimizations on the performance of multithreaded programs. We show how heap pointer analyses can be used to support better dependence testing, new applications of the above traditional optimizations, and high quality code generation for multithreaded architectures. We have implemented these analyses and optimizations in the EARTH-C compiler to study their impact on the performance of generated multithreaded code. We provide both static and dynamic measurements showing the effect of the optimizations applied individually, and together. We note several general trends, and discuss the performance tradeoffs and suggest when specific optimizations are generally beneficial.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"418 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121447165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Towards a time and space efficient functional implementation of a Monte Carlo photon transport code
J. Hammes, A. Böhm
{"title":"Towards a time and space efficient functional implementation of a Monte Carlo photon transport code","authors":"J. Hammes, A. Böhm","doi":"10.1109/PACT.1997.644024","DOIUrl":"https://doi.org/10.1109/PACT.1997.644024","url":null,"abstract":"In this paper we present three Sisal versions of a large Monte Carlo radiation transport code: a straightforward version, a stream version, and a stripmined loop version. We compare these versions with respect to their time and space efficiency and their parallelism. We discuss the compiler used in this project, which generates multithreaded shared memory code. We discuss the effect of strictness on program behavior. Sisal provides the fastest, purely functional, sequential code we have seen for this benchmark, using a constant amount of space. The stream version suffers from the fact that streams have a strict implementation in the Sisal compiler, so programs using long streams are both space inefficient and can show limited parallel speedup. The stripmined version of our code uses relatively small amounts of space, and shows a speedup of only around two for four processors, as it exhibits significant reference count lock contention.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123392938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A parallel algorithm for compile-time scheduling of parallel programs on multiprocessors
Yu-Kwong Kwok, I. Ahmad
{"title":"A parallel algorithm for compile-time scheduling of parallel programs on multiprocessors","authors":"Yu-Kwong Kwok, I. Ahmad","doi":"10.1109/PACT.1997.644006","DOIUrl":"https://doi.org/10.1109/PACT.1997.644006","url":null,"abstract":"Proposes a parallel randomized algorithm, called PFAST (Parallel Fast Assignment using Search Technique), for scheduling parallel programs represented by directed acyclic graphs (DAGs) during compile-time. The PFAST algorithm has O(e) time complexity, where e is the number of edges in the DAG. This linear-time algorithm works by first generating an initial solution and then refining it using a parallel random search. Using a prototype computer-aided parallelization and scheduling tool called CASCH (Computer-Aided SCHeduling), the algorithm is found to outperform numerous previous algorithms while taking dramatically shorter execution times. The distinctive feature of this research is that, instead of simulations, our proposed algorithm is evaluated and compared with other algorithms using the CASCH tool with real applications running on an Intel Paragon. The PFAST algorithm is also evaluated with randomly generated DAGs for which optimal schedules are known. The algorithm generated optimal solutions for a majority of the test cases and close-to-optimal solutions for the others. The proposed algorithm is the fastest scheduling algorithm known to us and is an attractive choice for scheduling under running time constraints.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132371884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
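The scheduling abstract above describes a two-phase idea: build an initial schedule for a task DAG, then refine it by random search. The sketch below illustrates that general idea only. It is not PFAST: it is sequential, it does not run in O(e) time, and every name, the cost model (task weights plus an edge communication delay paid only across processors), and the acceptance rule are assumptions made for illustration.

```python
import random


def schedule_length(order, assign, weight, preds, comm):
    """Makespan of a schedule.

    order:  tasks in a topological order of the DAG
    assign: task -> processor index
    weight: task -> execution time
    preds:  task -> list of predecessor tasks
    comm:   (pred, task) -> delay paid only if the two run on different processors
    """
    finish = {}
    proc_free = {}
    for t in order:
        ready = 0
        for p in preds.get(t, ()):
            delay = 0 if assign[p] == assign[t] else comm.get((p, t), 0)
            ready = max(ready, finish[p] + delay)
        start = max(ready, proc_free.get(assign[t], 0))
        finish[t] = start + weight[t]
        proc_free[assign[t]] = finish[t]
    return max(finish.values(), default=0)


def random_search(order, weight, preds, comm, num_procs, steps=200, seed=0):
    """Start with every task on processor 0, then keep random moves that help."""
    rng = random.Random(seed)
    assign = {t: 0 for t in order}
    best = schedule_length(order, assign, weight, preds, comm)
    for _ in range(steps):
        t = rng.choice(order)
        old = assign[t]
        assign[t] = rng.randrange(num_procs)
        cand = schedule_length(order, assign, weight, preds, comm)
        if cand < best:
            best = cand
        else:
            assign[t] = old  # undo a move that did not shorten the schedule
    return assign, best


if __name__ == "__main__":
    # Fork-join DAG: a -> b, a -> c, b -> d, c -> d.
    order = ["a", "b", "c", "d"]
    weight = {"a": 1, "b": 3, "c": 3, "d": 1}
    preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
    assign, best = random_search(order, weight, preds, comm, num_procs=2)
    print(assign, best)
```

On this small fork-join graph the all-on-one-processor start has length 8, and placing c and d on a second processor gives length 6, so the search has room to improve on its starting point.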