MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123509
B. Su, Jian Wang
GURPR*: a new global software pipelining algorithm
Abstract: Software pipelining is an effective loop optimization technique that has been widely used in various optimizing compilers. Although some software pipelining algorithms can optimize complicated loops globally, they still do not achieve both time efficiency and space efficiency simultaneously. In this paper, we present a new global software pipelining algorithm, GURPR*, which is applied in the URPR-1 optimizing compiler. Preliminary experiments show that GURPR* has good time efficiency as well as good space efficiency, which is quite important for a single-chip VLIW machine with limited on-chip control memory.
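For readers unfamiliar with the transformation, the following Python sketch illustrates the generic idea of software pipelining (overlapping loop iterations to form a steady-state kernel), not the GURPR* algorithm itself; the stage names and iteration count are hypothetical.

# Illustration only: overlap iterations of a three-stage loop body so the
# steady-state kernel starts one new iteration per cycle.
STAGES = ["load", "compute", "store"]    # one-cycle stages of the loop body

def pipeline_schedule(iterations):
    """Return a list of cycles; each cycle lists (iteration, stage) pairs."""
    depth = len(STAGES)
    schedule = []
    for cycle in range(iterations + depth - 1):
        ops = []
        for stage_idx, stage in enumerate(STAGES):
            it = cycle - stage_idx       # iteration executing this stage this cycle
            if 0 <= it < iterations:
                ops.append((it, stage))
        schedule.append(ops)
    return schedule

for cycle, ops in enumerate(pipeline_schedule(5)):
    print(cycle, ops)    # cycles 2-4 form the kernel: three iterations overlap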
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123475
Tse-Yu Yeh, Y. Patt
Two-level adaptive training branch prediction
Abstract: High-performance microarchitectures use, among other structures, deep pipelines to help speed up execution. The importance of a good branch predictor to the effectiveness of a deep pipeline in the presence of conditional branches is well known. In fact, the literature contains proposals for a number of branch prediction schemes. Some are static in that they use opcode information and profiling statistics to make predictions. Others are dynamic in that they use run-time execution history to make predictions. This paper proposes a new dynamic branch predictor, the Two-Level Adaptive Training scheme, which alters the branch prediction algorithm on the basis of information collected at run time. Several configurations of the Two-Level Adaptive Training Branch Predictor are introduced, simulated, and compared to simulations of other known static and dynamic branch prediction schemes. Two-Level Adaptive Training Branch Prediction achieves 97 percent accuracy on nine of the ten SPEC benchmarks, compared to less than 93 percent for other schemes. Since a prediction miss requires flushing the speculative execution already in progress, the relevant metric is the miss rate. The miss rate is 3 percent for the Two-Level Adaptive Training scheme vs. 7 percent (best case) for the other schemes. This represents more than a 100 percent improvement in reducing the number of pipeline flushes required.
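A minimal Python sketch of a two-level predictor in the spirit of the paper: a per-branch history register records recent outcomes, and the history pattern indexes a table of 2-bit saturating counters that supplies the prediction. The table sizes and indexing below are illustrative assumptions, not the configurations evaluated in the paper.

class TwoLevelPredictor:
    def __init__(self, history_bits=4, num_branches=1024):
        self.history_bits = history_bits
        self.num_branches = num_branches
        self.histories = [0] * num_branches               # per-branch history registers
        self.pattern_table = [2] * (1 << history_bits)    # 2-bit counters, init weakly taken

    def predict(self, pc):
        hist = self.histories[pc % self.num_branches]
        return self.pattern_table[hist] >= 2              # True = predict taken

    def update(self, pc, taken):
        idx = pc % self.num_branches
        hist = self.histories[idx]
        ctr = self.pattern_table[hist]
        self.pattern_table[hist] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # shift the actual outcome into this branch's history register
        self.histories[idx] = ((hist << 1) | int(taken)) & ((1 << self.history_bits) - 1)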
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123486
D. Bernstein, D. Cohen, H. Krawczyk
Code duplication: an assist for global instruction scheduling
Abstract: The recent appearance of superscalar machines (like the IBM RISC System/6000, Intel i860, etc.) dictates that instruction scheduling must be done by the compiler well beyond basic block boundaries. Moreover, when performing global instruction scheduling of the program, techniques which include speculative execution, duplication of code, software pipelining, etc. must be employed to further enhance the performance of the generated code. Recently, a scheme for such global instruction scheduling was proposed in [BR91]. Here we describe an efficient technique for supporting duplication of code in the presence of (general) acyclic control flow, as required by the global instruction scheduling framework. The algorithms have been implemented in the context of the IBM XL family of compilers, and we are in the process of evaluating them on IBM RISC System/6000 machines.
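A simplified Python sketch of the code-duplication idea (not the IBM XL implementation): when the scheduler moves an instruction out of a block upward past a join point, copies must be placed on every incoming path so that all paths still execute it. The CFG representation and block names are hypothetical, and the dependence and speculation-safety checks a real scheduler performs are omitted.

# Illustration only: moving an instruction out of a join block requires a
# copy on every incoming path.  A real scheduler would first verify that the
# move is legal (dependences, speculation safety) on each path.
def hoist_with_duplication(cfg, block, instr):
    """cfg maps block name -> {'preds': [...], 'code': [...]}."""
    cfg[block]["code"].remove(instr)
    for pred in cfg[block]["preds"]:
        cfg[pred]["code"].append(instr)    # one copy per incoming path

cfg = {
    "B1": {"preds": [], "code": ["..."]},
    "B2": {"preds": [], "code": ["..."]},
    "B3": {"preds": ["B1", "B2"], "code": ["x = a + b"]},   # join block
}
hoist_with_duplication(cfg, "B3", "x = a + b")
print(cfg["B1"]["code"], cfg["B2"]["code"])   # the moved instruction appears on both paths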
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123495
G. Singh
GRIP: graphics reduced instruction processor
Abstract: This paper presents an original approach for designing a new 2D graphics processor. The paradigm of Reduced Instruction Set Computers (RISC) is applied in the design of this graphics processor. First, a set of 2D graphics operations that are commonly encountered in graphics processing is delineated. Next, these operations are evaluated from the perspective of implementing them in a RISC-style, single-cycle, pipelined execution. Subsequently, the GRIP datapath is designed with the objective of implementing both the commonly encountered general-purpose operations and the fundamental graphics operations. The motivation behind the integrated design of a general-purpose processor with graphics capability is the author's belief in an increasing role of graphics and window-based interfaces for the ergonomics-geared software applications of the future. Such an integration also results in a reduction of system development and integration cost. The paper demonstrates that the RISC-based approach to the design of a processor with graphics capability also results in considerable performance improvements, in addition to the elimination of communication delays. The net result of adopting the proposed approach is thus a polynomial increase in the performance/cost ratio, compared to a system incorporating a separate 2D graphics co-processor.
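As a generic illustration (not taken from the paper) of the kind of 2D graphics operation such a processor targets, the following Python sketch performs a rectangular block transfer that combines source and destination pixels with a raster operation; the frame-buffer layout and the XOR raster op are assumptions.

# Illustration only: rectangular block transfer (bitblt) combining source and
# destination pixels with a raster op, here XOR, on 2-D lists of pixel words.
def bitblt(dst, src, dst_x, dst_y, width, height, rop=lambda d, s: d ^ s):
    for row in range(height):
        for col in range(width):
            dst[dst_y + row][dst_x + col] = rop(
                dst[dst_y + row][dst_x + col], src[row][col])

frame = [[0] * 8 for _ in range(8)]        # hypothetical 8x8 frame buffer
sprite = [[0xFF] * 4 for _ in range(4)]    # hypothetical 4x4 source block
bitblt(frame, sprite, 2, 2, 4, 4)
print(frame[2][2:6])                       # [255, 255, 255, 255]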
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123507
S. Beaty
Genetic algorithms and instruction scheduling
Abstract: Many difficulties are encountered when developing an instruction scheduler that produces efficacious code for multiple architectures. Heuristic-based methods were found to produce disappointing results; indeed, the goals of validity and length compete. This led to the introduction of another method to search the solution space of valid schedules: genetic algorithms. Their application to this domain proved fruitful.
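A minimal Python sketch of the general approach, under assumed details: chromosomes are instruction priority orderings, a greedy list scheduler decodes each ordering into a valid schedule, and fitness is the resulting schedule length. The dependence graph, latencies, and GA parameters are hypothetical, and the paper's own encoding and operators are not reproduced here.

import random

# Illustration only: a tiny dependence graph with one long-latency operation.
DEPS = {"a": [], "b": [], "c": [], "d": ["a"]}   # instruction -> predecessors
LAT  = {"a": 3, "b": 1, "c": 1, "d": 1}          # issue-to-result latencies

def schedule_length(priority):
    """Decode a priority ordering with greedy single-issue list scheduling."""
    issue_time, done, cycle = {}, set(), 0
    while len(done) < len(DEPS):
        ready = [i for i in priority if i not in done and
                 all(p in done and issue_time[p] + LAT[p] <= cycle for p in DEPS[i])]
        if ready:
            instr = ready[0]                     # highest-priority ready instruction
            issue_time[instr] = cycle
            done.add(instr)
        cycle += 1                               # otherwise stall for a cycle
    return max(issue_time[i] + LAT[i] for i in DEPS)

def crossover(p1, p2):
    """Order crossover: keep a prefix of p1, fill the rest in p2's order."""
    cut = random.randrange(1, len(p1))
    head = p1[:cut]
    return head + [i for i in p2 if i not in head]

def genetic_schedule(generations=30, pop_size=12):
    pop = [random.sample(list(DEPS), len(DEPS)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=schedule_length)            # shorter schedules are fitter
        survivors = pop[: pop_size // 2]
        pop = survivors + [crossover(*random.sample(survivors, 2)) for _ in survivors]
    return min(pop, key=schedule_length)

best = genetic_schedule()
print(best, schedule_length(best))   # orderings that issue "a" first reach length 4, not 6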
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123471
P. Chang, William Y. Chen, S. Mahlke, Wen-mei W. Hwu
Comparing static and dynamic code scheduling for multiple-instruction-issue processors
Abstract: This paper examines two alternative approaches to supporting code scheduling for multiple-instruction-issue processors. One is to provide a set of non-trapping instructions so that the compiler can perform aggressive static code scheduling. The application of this approach to existing commercial architectures typically requires extending the instruction set. The other approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic code scheduling. This approach usually does not require modifying the instruction set but requires complex hardware support. In this paper, we analyze the performance of the two alternative approaches using a set of important non-numerical C benchmark programs. A distinguishing feature of the experiment is that the code for the dynamic approach has been optimized and scheduled as much as allowed by the architecture. The hardware is only responsible for the additional reordering that cannot be performed by the compiler. The overall result is that the dynamic and static approaches are comparable in performance. When applied to a four-instruction-issue processor, both methods achieve more than two times speedup over a high-performance single-instruction-issue processor. However, the performance of each scheme varies among the benchmark programs. To explain this variation, we have identified the conditions in these programs that make one approach perform better than the other.
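The following toy Python model illustrates the trade-off being compared, under simplifying assumptions (unit latencies, issue stops for the cycle at the first blocked instruction): an in-order multiple-issue machine reaches a high issue rate only if the compiler has already grouped independent instructions, whereas out-of-order hardware can perform the equivalent reordering at run time. The instruction names and dependences are made up.

ISSUE_WIDTH = 4

def inorder_issue_cycles(instrs, deps):
    """Cycles to issue instrs in program order, up to ISSUE_WIDTH per cycle;
    issue stops for the cycle at the first instruction that depends on one
    issued in the same cycle (unit latencies assumed)."""
    cycles, group = 0, []
    for i in instrs:
        blocked = any(d in group for d in deps.get(i, []))
        if blocked or len(group) == ISSUE_WIDTH:
            cycles += 1
            group = []
        group.append(i)
    return cycles + (1 if group else 0)

deps = {"c": ["a"], "d": ["c"]}                               # dependence chain a -> c -> d
unscheduled = ["a", "c", "d", "b", "e", "f", "g", "h", "i"]   # chain issued back to back
scheduled   = ["a", "b", "e", "f", "c", "g", "h", "i", "d"]   # independents fill the slots
print(inorder_issue_cycles(unscheduled, deps))   # 4 cycles
print(inorder_issue_cycles(scheduled, deps))     # 3 cycles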
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123481
Reese B. Jones, V. Allan
Software pipelining: an evaluation of enhanced pipelining
Abstract: Software pipelining is a fine-grain loop optimization technique for architectures that support synchronous parallel execution. We compare Lam's software pipelining algorithm with Ebcioğlu and Nakatani's technique. This research seems to indicate that the Enhanced Pipeline Scheduling algorithm is a good general-purpose software pipelining algorithm, because it performs only slightly worse than Lam's algorithm on single basic block loops and should perform better than Lam's algorithm on multiple basic block loops. However, if pipelining single basic block loops is the goal, it appears that it would be better to use Lam's algorithm. We also propose a technique for changing the resource-constrained scheduling priority of operations to prevent operations from future iterations from being significantly delayed due to resource conflicts.
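As background for such evaluations, the initiation interval of a software-pipelined loop is bounded below by both resource usage and loop-carried recurrences; the Python sketch below computes this standard lower bound (it is not either paper's algorithm). The operation counts, resource mix, and recurrence are hypothetical.

import math

def min_initiation_interval(op_uses, num_units, recurrences):
    """op_uses: resource -> uses per iteration; num_units: resource -> available units;
    recurrences: list of (total latency, iteration distance) for dependence cycles."""
    res_mii = max(math.ceil(op_uses[r] / num_units[r]) for r in op_uses)
    rec_mii = max(math.ceil(lat / dist) for lat, dist in recurrences) if recurrences else 1
    return max(res_mii, rec_mii)

# Hypothetical loop: 6 ALU ops and 2 memory ops per iteration on a machine with
# 2 ALUs and 1 memory port, plus a recurrence of latency 3 spanning 1 iteration.
print(min_initiation_interval({"alu": 6, "mem": 2}, {"alu": 2, "mem": 1}, [(3, 1)]))   # 3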
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123473
B. Bray, M. Flynn
Strategies for branch target buffers
Abstract: Achieving high instruction issue rates depends on the ability to dynamically predict branches. We compare two schemes for dynamic branch prediction: a separate branch target buffer and an instruction cache based branch target buffer. For instruction caches of 4KB and greater, instruction cache based branch prediction performance is a strong function of line size and a weak function of instruction cache size. An instruction cache based branch target buffer with a line size of 8 (or 4) instructions performs about as well as a separate branch target buffer structure which has 64 (or 256, respectively) entries. Software can rearrange basic blocks in a procedure to reduce the number of taken branches, thus reducing the amount of branch prediction hardware needed. With software assistance, predicting all branches as not taken performs as well as a 4-entry branch target buffer without assistance, and a 4-entry branch target buffer with assistance performs as well as a 32-entry branch target buffer without assistance. The instruction cache based branch target buffer also benefits from the software, but only for line sizes of more than 4 instructions.
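A minimal Python sketch of a separate, direct-mapped branch target buffer of the kind compared in the paper: the fetch PC selects an entry, and on a tag match the stored target is used as the predicted next fetch address. The entry count, indexing, and replacement policy are illustrative assumptions.

class BranchTargetBuffer:
    def __init__(self, entries=64):
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [None] * entries

    def predict(self, pc):
        """Return the predicted target address, or None to predict fall-through."""
        idx = pc % self.entries
        if self.tags[idx] == pc:
            return self.targets[idx]
        return None

    def update(self, pc, taken, target):
        idx = pc % self.entries
        if taken:
            self.tags[idx], self.targets[idx] = pc, target
        elif self.tags[idx] == pc:
            self.tags[idx] = None      # drop branches that fell through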
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123482
M. Smotherman, Sanjay M. Krishnamurthy, P. Aravind, David Hunnicutt
Efficient DAG construction and heuristic calculation for instruction scheduling
Abstract: A number of heuristic algorithms for DAG-based instruction scheduling have been proposed over the past few years. In this paper, we explore the efficiency of three DAG construction algorithms and survey 26 proposed heuristics and their methods of calculation. Six scheduling algorithms are analyzed in terms of DAG construction and heuristic use. DAG structural statistics and scheduling times for the three construction algorithms are given for several popular benchmarks. The table-building algorithms are shown to be extremely efficient for programs with large basic blocks and yet appropriately handle the problem of retaining important transitive arcs. The node revisitation overhead of intermediate heuristic calculation steps is also investigated and shown to be negligible.
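A small Python sketch of table-driven DAG construction of the sort the abstract alludes to: a table of the most recent definition of each register turns dependence detection into a lookup rather than a pairwise comparison of instructions, and a longest-path heuristic is then computed over the DAG. The instruction format is hypothetical and only flow dependences are recorded; a full scheduler would also record anti and output dependences.

def build_dag(block):
    """block: list of (dest, [sources]).  Returns edges as (producer, consumer) index pairs."""
    last_def = {}          # register -> index of the instruction that last wrote it
    edges = []
    for i, (dest, srcs) in enumerate(block):
        for s in srcs:
            if s in last_def:
                edges.append((last_def[s], i))   # flow dependence via table lookup
        last_def[dest] = i
    return edges

def critical_path_lengths(block, edges):
    """Common scheduling heuristic: longest path from each node to a leaf."""
    succs = {i: [] for i in range(len(block))}
    for p, c in edges:
        succs[p].append(c)
    length = [1] * len(block)
    for i in reversed(range(len(block))):        # block order is a topological order
        for c in succs[i]:
            length[i] = max(length[i], 1 + length[c])
    return length

block = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r1", "r3"]), ("r5", ["r6"])]
print(build_dag(block))                                  # [(0, 1), (0, 2), (1, 2)]
print(critical_path_lengths(block, build_dag(block)))    # [3, 2, 1, 1]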
MICRO 24 · Pub Date: 1991-09-01 · DOI: 10.1145/123465.123488
M. Breternitz, John Paul Shen
Implementation optimization techniques for architecture synthesis of application-specific processors
Abstract: An architecture synthesis method for the automated design of high-performance application-specific processors has been proposed. This method divides the design task into the Specification Optimization (behavioral) and Implementation Optimization (structural) phases. In an earlier paper, powerful algorithms for performing specification optimization are presented. High performance is achieved via exploitation of fine-grain parallelism. The architecture design style uses a template resembling a scalable Very Long Instruction Word (VLIW) processor. This paper presents new algorithms for performing implementation optimization, which map the optimized specification, in the form of highly parallelized code, to efficient hardware implementations. A scalable implementation template is used to constrain the implementation style. Graph coloring algorithms are employed to produce the optimized implementations. The entire architecture synthesis procedure has been implemented and applied to numerous examples, and results on these examples are presented. Speedups in the range of 2.6 to 7.7 over contemporary RISC processors have been obtained. The computation times needed for the synthesis of these examples are on the order of a few seconds.
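A generic greedy graph coloring sketch in Python, since the abstract names graph coloring as the core technique: nodes represent values or operations whose lifetimes or time slots conflict, edges join conflicting pairs, and colors correspond to physical registers or functional-unit instances. The conflict graph and the most-constrained-first ordering are assumptions; the paper's own coloring algorithms are not reproduced here.

def greedy_color(conflicts):
    """conflicts: dict node -> set of conflicting nodes.  Returns node -> color index."""
    color = {}
    # color the most-constrained (highest-degree) nodes first
    for node in sorted(conflicts, key=lambda n: len(conflicts[n]), reverse=True):
        used = {color[n] for n in conflicts[node] if n in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color

conflicts = {
    "v1": {"v2", "v3"},
    "v2": {"v1", "v3"},
    "v3": {"v1", "v2", "v4"},
    "v4": {"v3"},
}
print(greedy_color(conflicts))   # three colors suffice: v1, v2, v3 differ; v4 reuses one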