Dynamic Speculation and Synchronization of Data Dependences
Andreas Moshovos, S. E. Breach, T. N. Vijaykumar, G. Sohi
DOI: https://doi.org/10.1145/264107.264189

Abstract: Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. Modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.

Hardware Fault Containment in Scalable Shared-Memory Multiprocessors
D. Teodosiu, J. Baxter, Kinshuk Govil, J. Chapin, M. Rosenblum, M. Horowitz
DOI: https://doi.org/10.1145/264107.264141

Abstract: Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size. The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine. Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that, in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems of up to 128 processors.

{"title":"Memory-system Design Considerations For Dynamically-scheduled Processors","authors":"K. Farkas, P. Chow, N. Jouppi, Z. Vranesic","doi":"10.1145/264107.264156","DOIUrl":"https://doi.org/10.1145/264107.264156","url":null,"abstract":"In this paper, we identify performance trends and design relationships between the following components of the data memory hierarchy in a dynamically-scheduled processor: the register file, the lockup-free data cache, the stream buffers, and the interface between these components and the lower levels of the memory hierarchy. Similar performance was obtained from all systems having support for fewer than four in-flight misses, irrespective of the register-file size, the issue width of the processor, and the memory bandwidth. While providing support for more than four in-flight misses did increase system performance, the improvement was less than that obtained by increasing the number of registers. The addition of stream buffers to the investigated systems led to a significant performance increase, with the larger increases for systems having less in-flight-miss support, greater memory bandwidth, or more instruction issue capability. The performance of these systems was not significantly affected by the inclusion of traffic filters, dynamic-stride calculators, or the inclusion of the per-load non-unity stride-predictor and the incremental-prefetching techniques, which we introduce. However, the incremental prefetching technique reduces the bandwidth consumed by stream buffers by 50% without a significant impact on performance.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116948490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Daisy: Dynamic Compilation For 10o?40 Architectural Compatibility","authors":"K. Ebcioglu, E. Altman","doi":"10.1145/264107.264126","DOIUrl":"https://doi.org/10.1145/264107.264126","url":null,"abstract":"Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129648481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Synchronization: Let Them Eat QOLB /sup1/","authors":"A. Kägi, Doug Burger, J. Goodman","doi":"10.1145/264107.264166","DOIUrl":"https://doi.org/10.1145/264107.264166","url":null,"abstract":"Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base typed of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127380543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing High Bandwidth On-chip Caches","authors":"Kenneth M. Wilson, K. Olukotun","doi":"10.1145/264107.264153","DOIUrl":"https://doi.org/10.1145/264107.264153","url":null,"abstract":"In this paper we evaluate the performance of high bandwidth caches that employ multiple ports, multiple cycle hit times, on-chip DRAM, and a line buffer to find the organization that provides the best processor performance. Processor performance is measured in execution time using a dynamic superscalar processor running realistic benchmarks that include operating system references. The results show that a large dual-ported multi-cycle pipelined SRAM cache with a line buffer maximizes processor performance. A large pipelined cache provides both a low miss rate and a high CPU clock frequency. Dual-porting the cache and the use of a line buffer provide the bandwidth needed by a dynamic superscalar processor. In addition, the line buffer makes the pipelined dual-ported cache the best option by increasing cache port bandwidth and hiding cache latency.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127485056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems
Parthasarathy Ranganathan, Vijay S. Pai, Hazim Abdel-Shafi, S. Adve
DOI: https://doi.org/10.1145/264107.264158

Abstract: Current microprocessors aggressively exploit instruction-level parallelism (ILP) through techniques such as multiple issue, dynamic scheduling, and non-blocking reads. Recent work has shown that memory latency remains a significant performance bottleneck for shared-memory multiprocessor systems built of such processors. This paper provides the first study of the effectiveness of software-controlled non-binding prefetching in shared-memory multiprocessors built of state-of-the-art ILP-based processors. We find that software prefetching results in significant reductions in execution time (12% to 31%) for three out of five applications on an ILP system. However, compared to previous-generation systems, software prefetching is significantly less effective in reducing the memory stall component of execution time on an ILP system. Consequently, even after adding software prefetching, memory stall time accounts for over 30% of the total execution time in four out of five applications on our ILP system. This paper also investigates the interaction of software prefetching with memory consistency models on ILP-based multiprocessors. In particular, we seek to determine whether software prefetching can equalize the performance of sequential consistency (SC) and release consistency (RC). We find that even with software prefetching, for three out of five applications, RC provides a significant reduction in execution time (15% to 40%) compared to SC.

Coherence Controller Architectures for SMP-Based CC-NUMA Multiprocessors
Maged M. Michael, Ashwini K. Nanda, B. Lim, M. Scott
DOI: https://doi.org/10.1145/264107.264203

Abstract: Scalable distributed shared-memory architectures rely on coherence controllers on each processing node to synthesize cache-coherent shared memory across the entire machine. The coherence controllers execute coherence protocol handlers that may be hardwired in custom hardware or programmed in a protocol processor within each coherence controller. Although custom hardware runs faster, a protocol processor allows the coherence protocol to be tailored to specific application needs and may shorten hardware development time. Previous research shows that the increase in application execution time due to protocol processors over custom hardware is minimal. With the advent of SMP nodes and faster processors and networks, the tradeoff between custom hardware and protocol processors needs to be reexamined. This paper studies the performance of custom-hardware and protocol-processor-based coherence controllers in SMP-node-based CC-NUMA systems on applications from the SPLASH-2 suite. Using realistic parameters and detailed models of existing state-of-the-art system components, it shows that the occupancy of coherence controllers can limit the performance of applications with high communication requirements, where the execution time using protocol processors can be twice as long as using custom hardware. To gain a deeper understanding of the tradeoff, we investigate the effect of varying several architectural parameters that influence the communication characteristics of the applications and the underlying system on coherence controller performance. We identify measures of applications' communication requirements and their impact on the performance penalty of protocol processors, which can help system designers predict performance penalties for other applications. We also study the potential of improving the performance of hardware-based and protocol-processor-based coherence controllers by separating or duplicating critical components.

{"title":"The Design And Analysis Of A Cache Architecture For Texture Mapping","authors":"Z. S. Hakura, Anoop Gupta","doi":"10.1145/264107.264152","DOIUrl":"https://doi.org/10.1145/264107.264152","url":null,"abstract":"The effectiveness of texture mapping in enhancing the realism of computer generated imagery has made support for real-time texture mapping a critical part of graphics pipelines. Despite a recent surge in interest in three-dimensional graphics from computer architects, high-quality high-speed texture mapping has so far been confined to costly hardware systems that use brute-force techniques to achieve high performance. One obstacle faced by designers of texture mapping systems is the requirement of extremely high bandwidth to texture memory. High bandwidth is necessary since there are typically tens to hundreds of millions of accesses to texture memory per second. In addition, to achieve the high clock rates required in graphics pipelines, low-latency access to texture memory is needed. In this paper, we propose the use of texture image caches to alleviate the above bottlenecks, and evaluate various tradeoffs that arise in such designs.We find that the factors important to cache behavior are (i) the representation of texture images in memory, (ii) the rasterization order on screen and (iii) the cache organization. Through a detailed investigation of these issues, we explore the best way to exploit locality of reference and determine whether this technique is robust with respect to different scenes and different amounts of texture. Overall, we observe that there is a significant amount of temporal and spatial locality and that the working set sizes are relatively small (at most 16KB) across all cases that we studied. Consequently, the memory bandwidth requirements of a texture cache system are substantially lower (at least three times and as much as fifteen times) than the memory bandwidth requirements of a system which achieves equivalent performance but does not utilize a cache. These results are very encouraging and indicate that caching is a promising approach to designing memory systems for texture mapping.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124953496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Deadlocks In Interconnection Networks","authors":"T. Pinkston, Sugath Warnakulasuriya","doi":"10.1145/264107.264127","DOIUrl":"https://doi.org/10.1145/264107.264127","url":null,"abstract":"Deadlock avoidance-based and deadlock recovery-based routing algorithms have been proposed in recent years without full understanding of the likelihood and characteristics of actual deadlocks in interconnection networks. This work models the interrelationships between routing freedom, message blocking, correlated resource dependencies and deadlock formation. We empirically show that increasing routing freedom, as achieved by allowing unrestricted routing over multiple virtual channels, makes deadlocks highly improbable and reduces the likelihood of other types of correlated message blocking behavior that can degrade performance. Our results further substantiate that recovery-based routing algorithms have a higher potential performance advantage over deadlock avoidance-based routing algorithms which, inherently, allow less routing freedom.","PeriodicalId":405506,"journal":{"name":"Conference Proceedings. The 24th Annual International Symposium on Computer Architecture","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127138642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}