{"title":"Hardware Support For Large Atomic Units in Dynamically Scheduled Machines","authors":"S. Melvin, M. Shebanow, Y. Patt","doi":"10.1145/62504.62535","DOIUrl":"https://doi.org/10.1145/62504.62535","url":null,"abstract":"Microarchitectures that implement conventional instruction set architectures are usually limited in that they are only able to execute a small number of microoperations concurrently. This limitation is due in part to the fact that the units of work that the hardware treats as indivisible are small. While this limitation is not important for microarchitectures with a low level of functionality, it can be significant if the goal is to build hardware that can support a large number of microoperations executing concurrently. In this paper we address the tradeoffs associated with the sizes of the various units of work that a processor considers indivisible, or atomic. We argue that by allowing larger units of work to be atomic, restrictions on concurrent operation are reduced and performance is increased. We outline the implementation of a front end for a dynamically scheduled processor with hardware support for large atomic units. We discuss tradeoffs in the design and show that with a modest investment in hardware, the run-time advantages of large atomic units can be realized without the need to alter the instruction set architecture.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116959562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling The Effects Of Instruction Queue Loading On A Static Instruction Stream Micro-architecture","authors":"J. Jacobs, A. Uht, R. C. Ord","doi":"10.1145/62504.62509","DOIUrl":"https://doi.org/10.1145/62504.62509","url":null,"abstract":"Increased processor performance requires the exploitation of the parallelism that exists within the instruction stream and within the processor itself: A static instruction stream micro-architecture, CONDEL, extracts and uses the machine instruction level concurrency implicit in the instruction stream. A major source of intraprocessor parallelism is the overlap of instruction execution with instruction loading. The effect of several methods of utilizing the execution/loading parallelism within the static instruction stream machine are studied: pipelining, buffered pipelining, branch buffering and instruction load limiting. The results of incorporating the different methods into the micro-architecture are shown. In addition, the results provide a more realistic performance comparison with conventional machine designs than the upper limits presented in previous work.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128270696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Organization Of Array Data For Concurrent Memory Access","authors":"M. Breternitz, John Paul Shen","doi":"10.1145/62504.62672","DOIUrl":"https://doi.org/10.1145/62504.62672","url":null,"abstract":"","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128595329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Microprogramming In A Multiprocessor Data Acquisition System","authors":"S. D'Angelo, L. Lisca, A. Proserpio, G. Sechi","doi":"10.1109/MICRO.1988.639272","DOIUrl":"https://doi.org/10.1109/MICRO.1988.639272","url":null,"abstract":"","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124507795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lazy Data Routing And Greedy Scheduling For Application-specific Signal Processors","authors":"K. Rimey, P. Hilfinger","doi":"10.1145/62504.62676","DOIUrl":"https://doi.org/10.1145/62504.62676","url":null,"abstract":"This paper concerns code generation for a troublesome class of horizontal-instruction-word architectures (whose machine language resembles horizontal microcode). These are application-specifrcprocessors, minimalistic programmable processors to be incorporated into application-specific signal processing chips. The processors of interest afford some opportunity for pipelined and for parallel operation of functional units, but do not provide enough bandwidth to store intermediate results in memory or in a register file. Instead, a typical datapath provides direct connections between functional units (often through pipeline registers), forming an irregular network. The usual way to generate horizontal code is to fist generate a loose sequence of microoperations (vertical code) and then pack these tightly into instructions in a compaction post-pass. Local compaction, which packs one straight-line code segment at a time, is now well-understood; theresearch community has largely shifted its attention to global compaction. For our application-specific processors, however, packing microoperations in a separate pass works poorly and generating good horizontal code for even straight-line code segments presents a challenge. Not only must the code generator choose which functional units to use; it must also choose how to route each intermediate result from the output of one functional unit to the input of another. This task is called data routing. How best to route a particular value depends on the time interval between its definition and use or uses, as well as on the datapath resources that are free during that interval. For this reason we abandon the compaction post-pass, and instead pack or schedule microoperations as they are generated. We consider only local scheduling in this paper. Our local scheduler is similar to the “operation scheduler” developed by Fisher et al. [l] for use in a trace-scheduling compiler for a VLIW supercomputer. However, we consider machines in which intermediate results must often reside in hot spots such as busses and latches as well as registers that would obstruct computation if tied up. Like Fisher et al.,","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115008580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Trap As A Control Flow Mechanism","authors":"J. A. Chandross, H. Jagadish, A. Asthana","doi":"10.1145/62504.62527","DOIUrl":"https://doi.org/10.1145/62504.62527","url":null,"abstract":"In this paper we show how traditional hardware trap handlers can be generalized into an efficient vehicle for conditional branches. These ideas are being used in a VLSI processor under design.\u0000Conditional branches are often a major bottleneck in scheduling microinstructions on a horizontally microcoded machine. Several tests and conditional branches are frequently ready for scheduling simultaneously, but only one test and branch is possible in a given cycle.\u0000The trap facility is traditionally treated as an interrupt scheme for the notification of exceptional conditions. In this paper we study how the role of the trap mechanism may be expanded to include the parallel evaluation of arbitrary user-specified tests, and the concomitant performance benefits.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123219769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-speed Hardware Unit For A Subset of Logic Resolution","authors":"D. Wong","doi":"10.1145/62504.62542","DOIUrl":"https://doi.org/10.1145/62504.62542","url":null,"abstract":"High-speed engines for logic programming have been the target of much recent research. Here, we present a high-level hardware design and its custom data formats for directly performing a subset of logic resolution. This design uses parallelism in unifying arguments and substituting variable bindings which is distinct from the widely discussed OR and AND parallelism.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122950460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing a Prolog Machine with Multiple Functional Units","authors":"A. Singhal, Y. Patt","doi":"10.1145/62504.62517","DOIUrl":"https://doi.org/10.1145/62504.62517","url":null,"abstract":"This paper describes the microarchitecture of the PLUM, a high performance Prolog machine. Multiple specialized functional units, each with a port to memory, operate in parallel using data driven control. Instructions are dynamically scheduled by a Prefetch Unit to execute on several specialized functional units. Out of order execution is allowed, and instructions execute when their operands are available. Special synchronization techniques that ensure correct parallel unification and pipelined operation are discussed. The performance of the PLUM is limited by unification, since almost all other operations execute in parallel with unification. Unification time is reduced by parallel unification, resulting in an overall speedup of approximately a factor of 4 over the Berkeley PLM.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121812930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Dependency Graph Bracing","authors":"V. Allan","doi":"10.1145/62504.62670","DOIUrl":"https://doi.org/10.1145/62504.62670","url":null,"abstract":"The Sunburst compiler refined at Utah State University employs a powerful mechanism for management of data anti-dependencies in data dependency graphs, <italic>DDG's</italic>: the <italic>DDG Bracer</italic>. The term <italic>bracing</italic><supscrpt>1</supscrpt> is used to mean the fastening of two or more parts together. There are two major goals in bracing: 1) semantic correctness, and 2) creation of an optimal DDG. Bracing provides necessary joining of code fragments, produced by a divide and conquer code generation algorithm, while yielding multiple code sequences.\u0000Since no anti-dependency arcs are present, the input <italic>DDG's</italic> are said to be in <italic>normal form</italic>. Because anti-dependency arcs occur only when a resource must be reused, a <italic>DDG</italic> in normal form represents infinite resources. The output <italic>DDG</italic> is a merging of the two input <italic>DDG's</italic> such that data dependency arcs between the two <italic>DDG's</italic> are inserted and data anti-dependency arcs are added to sequentialize the use of common resources.\u0000Vegdahl [Veg82] was one of the first to recognize the importance of live track manipulation. A <italic>live track</italic> is an ordered pair: the first component is the microoperation node ( MO) in which a resource is born, and the second component is the set of nodes in which the resource dies.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114820590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trace Selection For Compiling Large C Application Programs To Microcode","authors":"P. Chang, W. Hwu","doi":"10.1145/62504.62511","DOIUrl":"https://doi.org/10.1145/62504.62511","url":null,"abstract":"Microcode optimization techniques such as code scheduling and resource allocation can benefit significantly by reducing uncertainties in program control flow. A trace selection algorithm with profiling information reduces the uncertainties in program control flow by identifying sequences of frequently invoked basic blocks as traces. These traces are treated as sequential codes for optimization purposes. Optimization based on traces is especially useful when the code size is large and the control structure is complicated enough to defeat hand optimizations. However, most of the experimental results reported to date are based on small benchmarks with simple control structures.\u0000For different trace selection algorithms, we report the distribution of control transfers categorized according to their potential impact on the microcode optimizations. The experimental results are based on ten C application programs which exhibit large code size and complicated control structure. The measured data for each program is accumulated across a large number of input files to ensure the reliability of the result. All experiments are performed automatically using our IMPACT C compiler which contains integrated profiling and analysis tools.","PeriodicalId":378625,"journal":{"name":"[1988] Proceedings of the 21st Annual Workshop on Microprogramming and Microarchitecture - MICRO '21","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1988-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128245904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}