Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

Interprocedural array remapping
Michal Cierniak, Wei Li
DOI: 10.1109/PACT.1997.644011 | Published: 1997-11-11
Abstract: Programming languages like Fortran or C define exactly the layout of array elements in memory. Programmers often use that definition to access the same memory via variables of different types. For many real programs this practice makes changing the layout of an array impossible without violating the semantics of the program, since the same memory block may be accessed via variables of different types; such accesses may now receive wrong array elements. On the other hand, changing array layout is often necessary to obtain good parallel performance or even to improve sequential performance by providing better cache locality. The paper demonstrates that the problem of changing array layouts in the presence of multiple variables of different types accessing the same memory can be solved with the algorithms for 1) detecting overlapping arrays, 2) using procedure cloning to reduce overlapping, 3) array-type coercion, and 4) code structure recovery.
Cited by: 13
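The hazard the abstract describes, the same memory being read through variables of different types, can be made concrete with a small C sketch. This is only an illustration of the problem, not the paper's detection or remapping algorithm: a 2-D array is also read through a flat pointer view, so silently remapping the 2-D layout would change what the flat view returns.

```c
/* Minimal illustration (not the paper's algorithm) of why overlapping
 * accesses through different types pin an array's memory layout:
 * a 2-D array is also read through a flat double pointer, so changing
 * the 2-D layout would change what the flat view returns. */
#include <stdio.h>

#define N 3

/* The "other type" view: treats the same storage as a flat vector. */
static double sum_flat(const double *flat, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += flat[i];            /* depends on element order in memory */
    return s;
}

int main(void) {
    double a[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 10.0 * i + j;

    /* Row-major C layout: flat index i*N + j. A compiler that silently
     * remapped a[][] (e.g. to column-major) would break this aliasing
     * view, which is the hazard overlap detection must guard against. */
    printf("a[1][2] via 2-D view  : %g\n", a[1][2]);
    printf("a[1][2] via flat view : %g\n", ((const double *)a)[1 * N + 2]);
    printf("sum via flat view     : %g\n", sum_flat((const double *)a, N * N));
    return 0;
}
```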
MDL: a language and compiler for dynamic program instrumentation
J. Hollingsworth, B. Miller, M. J. R. Gonçalves, Oscar Naim, Zhichen Xu, Ling Zheng
DOI: 10.1109/PACT.1997.644016 | Published: 1997-11-11
Abstract: We use a form of dynamic code generation, called dynamic instrumentation, to collect data about the execution of an application program. Dynamic instrumentation allows us to instrument running programs to collect performance and other types of information. The instrumentation code is generated incrementally and can be inserted and removed at any time. Our instrumentation currently runs on the SPARC, PA-RISC, Power 2, Alpha, and x86 architectures. Specifications of what data to collect are written in a specialized language called the Metric Description Language, which is part of the Paradyn Parallel Performance Tools. This language allows platform-independent descriptions of how to collect performance data. It also provides a concise way to specify how to constrain performance data to particular resources such as modules, procedures, nodes, files, or message channels (or combinations of these resources). We also describe the details of how we weave instrumentation into a running program.
Cited by: 127
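The C fragment below is a conceptual stand-in only: it is neither MDL syntax nor the Paradyn API, and the names in it are made up for the illustration. It imitates, with an ordinary hand-written wrapper, the counter-and-timer snippet that dynamic instrumentation would splice around a procedure in a running program without source changes.

```c
/* Conceptual sketch only: not MDL and not the Paradyn API. A hand-written
 * wrapper stands in for the trampoline code that dynamic instrumentation
 * inserts at procedure entry and exit to maintain metrics. */
#include <stdio.h>
#include <time.h>

static long call_count;          /* metric: number of calls */
static double total_seconds;     /* metric: inclusive wall-clock time */

static void work(int n) {        /* the procedure being "instrumented" */
    volatile double x = 0.0;
    for (int i = 0; i < n; i++) x += i * 0.5;
}

/* Stand-in for dynamically inserted entry/exit snippets. */
static void work_instrumented(int n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* entry snippet */
    work(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* exit snippet */
    call_count++;
    total_seconds += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    for (int i = 0; i < 5; i++)
        work_instrumented(100000);
    printf("calls=%ld, time=%.6fs\n", call_count, total_seconds);
    return 0;
}
```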
Buffer-safe communication optimization based on data flow analysis and performance prediction
T. Fahringer, E. Mehofer
DOI: 10.1109/PACT.1997.644015 | Published: 1997-11-11
Abstract: The paper presents a novel approach to reduce the communication costs of programs for distributed memory machines. The techniques are based on uni-directional bit-vector data flow analyses that enable vectorizing and coalescing communication, overlapping communication with computation, and eliminating redundant messages and redundant data transfers both within and across loop nests. The data flow analysis differs from previous techniques in that it does not require explicitly modeling balanced communication placement and loops, and does not employ interval analysis. The techniques are based on simple yet highly effective data flow equations which are solved iteratively for arbitrary control flow graphs. Moving communication earlier to hide latency has been shown to dramatically increase communication buffer sizes and can even cause run-time errors. The authors use P³T, a state-of-the-art performance estimator, to create a buffer-safe program. By accurately estimating both the communication buffer sizes required and the implied communication time of every single communication in a program, one can selectively choose communication that must be delayed in order to ensure a correct communication placement while maximizing communication latency hiding. Experimental results are presented to demonstrate the efficacy of the communication optimization strategy.
Cited by: 7
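The abstract mentions simple data flow equations solved iteratively over arbitrary control flow graphs. The sketch below shows that machinery for a generic forward bit-vector problem (IN is the meet of predecessor OUTs, OUT = GEN | (IN & ~KILL)) on a small diamond-shaped CFG. The paper's actual communication-placement equations are different, and the GEN/KILL values here are arbitrary example data.

```c
/* Generic iterative bit-vector data-flow solver, sketched as a forward
 * "must"-style problem. Only the fixed-point machinery is shown; the
 * equations and facts are example data, not the paper's analysis. */
#include <stdio.h>
#include <stdint.h>

#define NNODES 4

int main(void) {
    /* Small diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3 */
    int npred[NNODES]     = { 0, 1, 1, 2 };
    int pred[NNODES][2]   = { {0,0}, {0,0}, {0,0}, {1,2} };
    uint32_t gen[NNODES]  = { 0x1, 0x2, 0x2, 0x0 };   /* facts generated */
    uint32_t kill[NNODES] = { 0x0, 0x0, 0x1, 0x0 };   /* facts killed    */
    uint32_t in[NNODES]   = { 0 }, out[NNODES] = { 0 };

    int changed = 1;
    while (changed) {                       /* iterate to a fixed point */
        changed = 0;
        for (int n = 0; n < NNODES; n++) {
            uint32_t i = (npred[n] == 0) ? 0 : 0xFFFFFFFFu;
            for (int p = 0; p < npred[n]; p++)
                i &= out[pred[n][p]];       /* meet over predecessors */
            uint32_t o = gen[n] | (i & ~kill[n]);
            if (i != in[n] || o != out[n]) {
                in[n] = i; out[n] = o; changed = 1;
            }
        }
    }
    for (int n = 0; n < NNODES; n++)
        printf("node %d: IN=0x%X OUT=0x%X\n", n, (unsigned)in[n], (unsigned)out[n]);
    return 0;
}
```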
Improving the memory bandwidth of highly-integrated, wide-issue, microprocessor-based systems
D. Albonesi, I. Koren
DOI: 10.1109/PACT.1997.644009 | Published: 1997-11-11
Abstract: Next-generation wide-issue processors will require greater memory bandwidth than provided by present memory hierarchy designs. We propose techniques for increasing the memory bandwidth of multi-ported L1 D-caches, large on-chip L2 caches and dedicated memory ports while minimizing the cycle time impact. These approaches are evaluated within the context of an 8-way superscalar processor design and next-generation VLSI, packaging and RAM technologies. We show that the combined L1 and L2 cache enhancements can outperform conventional techniques by over 80%, and that even with an on-chip 512-kByte L2 cache, board-level caches provide significant enough performance gains to justify their higher cost.
Cited by: 2
Parallel execution of radix sort program using fine-grain communication
Yuetsu Kodama, H. Sakane, H. Koike, M. Sato, S. Sakai, Y. Yamaguchi
DOI: 10.1109/PACT.1997.644010 | Published: 1997-11-11
Abstract: The report presents empirical results of fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low latency, high throughput fine-grain communication: this hardware support includes packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. The authors study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X, and compare it to the Fujitsu AP1000+ and the Cray Server CS6400. The experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as one increases the number of processors, other coarse-grain message-passing machines exhibit fluctuation and performance degradation for larger configurations due to network contention.
Cited by: 3
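For reference, the sequential kernel being parallelized is a standard least-significant-digit radix sort, shown below as a plain C sketch over 8-bit digits. The paper's contribution, distributing the histogram and permutation phases with fine-grain remote operations on EM-X, is not reproduced here.

```c
/* Sequential LSD radix sort over 8-bit digits (counting sort per pass).
 * The parallel version in the paper distributes keys and performs the
 * histogram/permutation phases with fine-grain remote memory operations;
 * this sketch shows only the sequential kernel. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void radix_sort_u32(uint32_t *a, size_t n) {
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[257] = { 0 };
        for (size_t i = 0; i < n; i++)          /* histogram of this digit */
            count[((a[i] >> shift) & 0xFF) + 1]++;
        for (int d = 0; d < 256; d++)           /* prefix sum -> start offsets */
            count[d + 1] += count[d];
        for (size_t i = 0; i < n; i++)          /* stable scatter by digit */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}

int main(void) {
    uint32_t keys[] = { 170, 45, 75, 90, 802, 24, 2, 66 };
    size_t n = sizeof keys / sizeof keys[0];
    radix_sort_u32(keys, n);
    for (size_t i = 0; i < n; i++) printf("%u ", (unsigned)keys[i]);
    printf("\n");
    return 0;
}
```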
Empirical evaluation of deterministic and adaptive routing with constant-area routers
Dianne Miller, W. Najjar
DOI: 10.1109/PACT.1997.644004 | Published: 1997-11-11
Abstract: This paper addresses the issue of how router complexity affects the overall performance of deterministic and adaptive routing under virtual cut-through switching in k-ary n-cube networks. First, the performance of various adaptive routers with constant area is compared. Second, the performance of adaptive and deterministic routers is compared under the same conditions. Finally, it is shown that, under certain conditions, deterministic routers can reach saturation points comparable to adaptive routers.
Cited by: 9
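As a concrete baseline, the sketch below implements one step of dimension-order routing, the canonical deterministic scheme for k-ary n-cubes, assuming minimal routing with wraparound links. It is meant only to fix terminology; the routers evaluated in the paper are hardware designs, not this code.

```c
/* Dimension-order (deterministic) routing step in a k-ary n-cube:
 * correct the lowest dimension in which current and destination
 * coordinates differ, moving the shorter way around the ring
 * (wraparound links assumed). Illustration only. */
#include <stdio.h>

#define K 4   /* radix: nodes per dimension */
#define N 3   /* number of dimensions       */

/* Writes the next node's coordinates into next[]; returns the dimension
 * corrected, or -1 if cur already equals dst. */
static int next_hop(const int cur[N], const int dst[N], int next[N]) {
    for (int d = 0; d < N; d++) next[d] = cur[d];
    for (int d = 0; d < N; d++) {
        if (cur[d] == dst[d]) continue;
        int fwd = (dst[d] - cur[d] + K) % K;      /* hops in + direction */
        int step = (fwd <= K - fwd) ? 1 : -1;     /* pick the shorter way */
        next[d] = (cur[d] + step + K) % K;
        return d;
    }
    return -1;
}

int main(void) {
    int cur[N] = { 0, 3, 1 }, dst[N] = { 2, 0, 1 }, nxt[N];
    while (next_hop(cur, dst, nxt) >= 0) {
        printf("(%d,%d,%d) -> (%d,%d,%d)\n",
               cur[0], cur[1], cur[2], nxt[0], nxt[1], nxt[2]);
        for (int d = 0; d < N; d++) cur[d] = nxt[d];
    }
    return 0;
}
```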
Effective usage of vector registers in advanced vector architectures
L. Villa, R. Espasa, M. Valero
DOI: 10.1109/PACT.1997.644021 | Published: 1997-11-11
Abstract: This paper presents data confirming that traditional vector architectures cannot reduce their vector register length without suffering a severe performance penalty. However, we show that by combining the vector register length reduction with two different ILP techniques, decoupling and multithreading, the performance penalty can be made very small. We show that each resulting architecture tolerates long memory latencies very well and also makes better use of the available storage space in each vector register. Using decoupling and short vectors, each register can be halved while still providing speedups in the range 1.04-1.49 over a traditional architecture with long registers. Using multithreading, we split a vector register file in two halves and show that two independent threads running on such a machine can yield speedups in the range 1.23-1.29. The paper also explores configurations with 1/4 and 1/8 the original vector register size aimed at cost-conscious designs, and shows that even at 1/4 the original size, the resulting architectures can outperform a traditional machine. We also present results across a wide range of memory latencies, and show that the combination of short vectors and ILP techniques results in very good tolerance of slow memory systems.
Cited by: 7
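What a shorter vector register length means for generated code can be seen in a strip-mined loop: the loop is executed in chunks of at most VL elements, so halving VL doubles the number of strips. The C sketch below models only that structural effect; the decoupling and multithreading the paper uses to hide the extra latency are not modeled, and VL = 64 is an arbitrary example value.

```c
/* Strip-mining sketch: a vector loop runs in chunks of at most VL
 * elements, where VL models the architectural vector register length.
 * Halving VL doubles the number of strips (and vector instructions). */
#include <stdio.h>

#define N  1000
#define VL 64          /* assumed vector register length, in elements */

int main(void) {
    static double x[N], y[N];
    double a = 2.0;
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    for (int base = 0; base < N; base += VL) {     /* one strip per vector op */
        int len = (N - base < VL) ? (N - base) : VL;
        for (int i = 0; i < len; i++)              /* body maps to one vector instr. */
            y[base + i] += a * x[base + i];
    }
    printf("y[%d] = %g\n", N - 1, y[N - 1]);
    return 0;
}
```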
Direct generation of data-driven program for stream-oriented processing
K. Karasawa, M. Iwata, H. Terada
DOI: 10.1109/PACT.1997.644025 | Published: 1997-11-11
Abstract: This paper proposes a scheme that directly transforms unified system specifications into highly parallel dynamic data-driven programs incrementally in an interactive fashion. In this paper the scheme proposed is described with special emphasis on its application to stream-oriented processing such as multimedia signal processing. An abstract data type for generalized multiple data streams is introduced in order to facilitate interpretations of hierarchical and diagrammatic specifications. Also an optimization technique applicable in fitting the specifications to a specific hardware configuration is shown. Finally, practicability of the methodology is illustrated through a design process of an HDTV signal decoder.
Cited by: 3
Optimally synchronizing DOACROSS loops on shared memory multiprocessors
R. Rajamony, A. Cox
DOI: 10.1109/PACT.1997.644017 | Published: 1997-11-11
Abstract: We present two algorithms to minimize the amount of synchronization added when parallelizing a loop with loop-carried dependences. In contrast to existing schemes, our algorithms add less synchronization, while preserving the parallelism that can be extracted from the loop. Our first algorithm uses an interval graph representation of the dependence "overlap" to find a synchronization placement in time almost linear in the number of dependences. Although this solution may be suboptimal, it is still better than that obtained using existing methods, which first eliminate redundant dependences and then synchronize the remaining ones. Determining the optimal synchronization is an NP-complete problem. Our second algorithm therefore uses integer programming to determine the optimal solution. We first use a polynomial-time algorithm to find a minimal search space that must contain the optimal solution. Then, we formulate the problem of choosing the minimal synchronization from the search space as a set-cover problem, and solve it exactly using 0-1 integer programming. We show the performance impact of our algorithms by synchronizing a set of synthetic loops on an 8-processor Convex Exemplar. The greedily synchronized loops ran between 7% and 22% faster than those synchronized by the best existing algorithm. Relative to the same base, the optimally synchronized loops ran between 10% and 22% faster.
Cited by: 12
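The mechanism being minimized can be illustrated with a naive DOACROSS parallelization in C: iterations carrying the dependence a[i] = a[i-1] + b[i] are distributed cyclically over two threads and ordered with post/wait, implemented here with one POSIX semaphore per iteration. This worst-case placement of one synchronization per iteration is exactly the kind of redundancy the paper's algorithms remove; it is a sketch, not the paper's code.

```c
/* DOACROSS sketch: a loop with the carried dependence a[i] = a[i-1] + b[i]
 * is distributed cyclically over two threads and ordered with post/wait,
 * one POSIX semaphore per iteration. Build with: cc doacross.c -lpthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N        16
#define NTHREADS 2

static double a[N + 1], b[N + 1];
static sem_t done[N + 1];                /* done[i] posted when a[i] is ready */

static void *worker(void *arg) {
    long tid = (long)arg;
    for (int i = 1 + (int)tid; i <= N; i += NTHREADS) {
        sem_wait(&done[i - 1]);          /* wait: a[i-1] must be ready */
        a[i] = a[i - 1] + b[i];
        sem_post(&done[i]);              /* post: a[i] is now ready    */
    }
    return NULL;
}

int main(void) {
    pthread_t th[NTHREADS];
    for (int i = 0; i <= N; i++) { b[i] = 1.0; sem_init(&done[i], 0, 0); }
    a[0] = 0.0;
    sem_post(&done[0]);                  /* iteration 0's value exists */

    for (long t = 0; t < NTHREADS; t++) pthread_create(&th[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++) pthread_join(th[t], NULL);

    printf("a[%d] = %g (expected %d)\n", N, a[N], N);
    return 0;
}
```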
Compiler algorithms for optimizing locality and parallelism on shared and distributed memory machines
M. Kandemir, J. Ramanujam, A. Choudhary
DOI: 10.1109/PACT.1997.644019 | Published: 1997-11-11
Abstract: Distributed memory message passing machines can deliver scalable performance but are difficult to program. Shared memory machines, on the other hand, are easier to program but obtaining scalable performance with a large number of processors is difficult. Previously, some scalable architectures based on logically-shared physically-distributed memory have been designed and implemented. While some of the performance issues like parallelism and locality are common to the different parallel architectures, issues such as data decomposition are unique to specific types of architectures. One of the most important challenges compiler writers face is to design compilation techniques that can work on a variety of architectures. In this paper, we propose an algorithm that can be employed by optimizing compilers for different types of parallel architectures. Our optimization algorithm does the following: (1) transforms loop nests such that, where possible, the outermost loops can be run in parallel across processors; (2) decomposes each array across processors; (3) optimizes interprocessor communication by vectorizing it whenever possible; and (4) optimizes locality (cache performance) by assigning an appropriate storage layout to each array. Depending on the underlying hardware system, some or all of these steps can be applied in a unified framework. We present simulation results for cache miss rates, and empirical results on the SUN SPARCstation 5, IBM SP-2, SGI Challenge and Convex Exemplar to validate the effectiveness of our approach on different architectures.
Cited by: 24
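The locality effect this algorithm exploits is easy to reproduce: in C's row-major layout a j-then-i traversal strides through memory, while the interchanged i-then-j order is unit-stride and cache-friendly. The sketch below only demonstrates that effect; choosing between loop transformations and per-array storage layouts across whole programs is what the paper's unified algorithm does.

```c
/* Locality sketch: with C's row-major layout, the j-i loop order walks
 * the array with stride N (poor cache behavior), while the interchanged
 * i-j order is unit-stride. Illustration of the effect only. */
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Column-by-column traversal of a row-major array: stride-N accesses. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = i + j;

    /* Interchanged loops: unit-stride, cache-friendly traversal. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("sum = %g\n", sum);
    return 0;
}
```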