{"title":"Two techniques for static array partitioning on message-passing parallel machines","authors":"Eric Hung-Yu Tseng, J. Gaudiot","doi":"10.1109/PACT.1997.644018","DOIUrl":"https://doi.org/10.1109/PACT.1997.644018","url":null,"abstract":"We present two techniques for partitioning arrays in parallel DoAll loops for message-passing parallel machines. (1) Communication-free array partitioning: a general solution of communication-free partitioning is derived for arrays in a DoAll loop. The derivation is based on the Smith normal form decomposition of the matrix which characterizes the array references in a DoAll loop. (2) One block-communication partitioning: when communication-free partitioning is not possible, we derive the partitioning equations which allocate all remote data to a unique processor. Thus, at most one block-communication is required for each processor to obtain the remote data it needs during computation.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123003930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A register pressure sensitive instruction scheduler for dynamic issue processors","authors":"Rad Silvera, Jian Wang, R. Govindarajan, G. Gao","doi":"10.1109/PACT.1997.644005","DOIUrl":"https://doi.org/10.1109/PACT.1997.644005","url":null,"abstract":"Several modern superscalar processors contain an out-of-order (OOO) instruction issue mechanism, which resolves dependencies between instructions to expose greater instruction-level parallelism (ILP). How to extend a traditional instruction scheduler to take advantage of these hardware resources has presented both a challenge and an opportunity for compiler design. In this paper, we present a new approach for instruction scheduling, which reorders the instructions in a traditional instruction schedule to reduce its register pressure while maintaining the amount of ILP exploitable by the target OOO processor. This may prevent the introduction of spill code, thus producing a performance improvement. We have implemented our instruction scheduler under the MOST scheduling testbed. Our experiments show that the proposed approach reduces the register pressure by 12.81% in SPEC92 benchmark loops which do not require any spill code. For loops with a high register pressure, our approach reduced the amount of spill code required by an average of 32.08% and produced an average performance improvement of 8.79%.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121760896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Path profile guided partial dead code elimination using predication","authors":"Rajiv Gupta, David A. Berson, J. Fang","doi":"10.1109/PACT.1997.644007","DOIUrl":"https://doi.org/10.1109/PACT.1997.644007","url":null,"abstract":"Presents a path-profile-guided partial dead code elimination algorithm that uses predication to enable sinking for the removal of deadness along frequently executed paths at the expense of adding additional instructions along infrequently executed paths. Our approach to optimization is particularly suitable for VLIW architectures since it directs the efforts of the optimizer towards aggressively enabling generation of fast schedules along frequently executed paths by reducing their critical path lengths. The paper presents a cost-benefit data flow analysis that uses path profiling information to determine the profitability of using predication-enabled sinking. The cost of predication-enabled sinking of a statement past a merge point is determined by identifying paths along which an additional statement is introduced. The benefit of predication-enabled sinking is determined by identifying paths along which additional dead code elimination is achieved due to predication. The results of this analysis are incorporated in a code sinking framework in which predication-enabled sinking is allowed past merge points only if its benefit is determined to be greater than the cost. It is also demonstrated that a trade-off can be made between the compile-time cost and the precision of the cost-benefit analysis.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129938609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determining the idle time of a tiling: new results","authors":"F. Desprez, J. Dongarra, F. Rastello, Y. Robert","doi":"10.1109/PACT.1997.644026","DOIUrl":"https://doi.org/10.1109/PACT.1997.644026","url":null,"abstract":"In the framework of fully permutable loops, tiling has been studied extensively as a source-to-source program transformation. We build upon recent results by Hogsted, Carter, and Ferrante (1997), who aim at determining the cumulated idle time spent by all processors while executing the partitioned (tiled) computation domain. We propose new, much shorter proofs of all their results and extend these in several important directions. More precisely, we provide an accurate solution for all values of the rise parameter that relates the shape of the iteration space to that of the tiles, and for all possible distributions of the tiles to processors. In contrast, the authors in Hogsted, Carter, and Ferrante (1997) deal only with a limited number of cases and provide upper bounds rather than exact formulas.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130397600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interprocedural distribution assignment placement: more than just enhancing intraprocedural placing techniques","authors":"J. Knoop, E. Mehofer","doi":"10.1109/PACT.1997.644001","DOIUrl":"https://doi.org/10.1109/PACT.1997.644001","url":null,"abstract":"Avoiding unnecessary remappings at run-time by means of a strategic distribution assignment placement (DAP) is a major means for improving the run-time efficiency of data-parallel programs on distributed-memory architectures. In Proc. Euro-Par '97, pp. 364-73 (1997), we presented a novel and aggressive intraprocedural algorithm achieving this by eliminating partially redundant and partially dead distribution assignments. In this paper, we show how to enhance this approach interprocedurally. Surprisingly at first sight, it turns out that a straightforward adaptation of the intraprocedural approach fails because central properties that hold in the intraprocedural case do not carry over to the interprocedural one, revealing severe anomalies. After discussing the essential differences and analogies of DAP in the intraprocedural and interprocedural cases, we show how to overcome these anomalies in order to arrive at a powerful and flexible approach for interprocedural DAP (IDAP). As in the intraprocedural case, we get a hierarchy of IDAP algorithms of varying power and efficiency supporting user-customized solutions. First practical experiences underline its importance and effectiveness.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123877594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The PROMIS compiler prototype","authors":"Carrie J. Brownhill, A. Nicolau, S. Novack, C. Polychronopoulos","doi":"10.1109/PACT.1997.644008","DOIUrl":"https://doi.org/10.1109/PACT.1997.644008","url":null,"abstract":"Source-code parallelizers and instruction-level parallelizers each have specific advantages. Usually, a compiler is designed to be one or the other, based on the target architecture and/or algorithms. A compiler that is designed to generate near-optimal code for modern, multi-level machines must have the capabilities of both. This paper describes the prototype of the PROMIS compiler. The prototype was designed to show that loop-level and instruction-level parallelization can be combined to produce results better than either one alone. In addition, it shows how communication between the levels can produce additional speedup.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129010503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static locality analysis for cache management","authors":"F. Sánchez, Antonio González, M. Valero","doi":"10.1109/PACT.1997.644022","DOIUrl":"https://doi.org/10.1109/PACT.1997.644022","url":null,"abstract":"Most memory references in numerical codes correspond to array references whose indices are affine functions of surrounding loop indices. These array references follow a regular, predictable memory pattern that can be analysed at compile time. This analysis can provide valuable information, such as the locality exhibited by the program, which can be used to implement more intelligent caching strategies. In this paper we propose a static locality analysis oriented to the management of data caches. We show that previous proposals on locality analysis are not appropriate when the programs have a high conflict miss ratio. This paper improves on those proposals by introducing a compile-time interference analysis that significantly improves their performance. We first show how this analysis can be used to characterize the dynamic locality properties of numerical codes. This evaluation shows, for instance, that a large percentage of references exhibit any type of locality. This motivates the use of two cache organizations: a dual data cache, which has a module specialized to exploit temporal locality, and a selective cache. Then, the performance provided by these two cache organizations is evaluated. In both organizations, the static locality analysis is responsible for tagging each memory instruction according to the particular type(s) of locality that it exhibits.","PeriodicalId":177411,"journal":{"name":"Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127333874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}