{"title":"WildFire: a scalable path for SMPs","authors":"Erik Hagersten, M. Koster","doi":"10.1109/HPCA.1999.744361","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744361","url":null,"abstract":"Researchers have searched for scalable alternatives to the symmetric multiprocessor (SMP) architecture since it was first introduced in 1982. The paper introduces an alternative view of the relationship between scalable technologies and SMPs. Instead of replacing large SMPs with scalable technology, we propose new scalable techniques that allow large SMPs to be tied together efficiently, while maintaining the compatibility with, and performance characteristics of, an SMP. The trade-offs of such an architecture differ from those of traditional, scalable, Non-Uniform Memory Architecture (cc-NUMA) approaches. WildFire is a distributed shared memory (DSM) prototype implementation based on large SMPs. It relies on two techniques for creating application-transparent locality: Coherent Memory Replication (CMR), which is a variation of Simple COMA/Reactive NUMA, and Hierarchical Affinity Scheduling (HAS). These two optimizations create extra node locality, which blurs the node boundaries to an application such that SMP-like performance can be achieved with no NUMA-specific optimizations. We present a performance study of a large OLTP benchmark running on DSMs built from various sized nodes and with varying amounts of application-transparent locality. WildFire's measured performance is shown to be more than two times that of an unoptimized NUMA implementation built from small nodes and within 13% of the performance of the ideal implementation: a large SMP with the same access time to its entire shared memory as the local memory access time of WildFire.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"2013 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127382672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs","authors":"Koji Inoue, K. Kai, K. Murakami","doi":"10.1109/HPCA.1999.744366","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744366","url":null,"abstract":"This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called \"dynamically variable line-size cache (D-VLS cache)\". The D-VLS cache can optimize its line-size according to the characteristic of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth. In our evaluation, it is observed that the performance improvement achieved by a direct-mapped D-VLS cache is about 27%, compared to a conventional direct-mapped cache with fixed 32-byte lines.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133513609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction pre-processing in trace processors","authors":"Quinn Jacobson, James E. Smith","doi":"10.1109/HPCA.1999.744347","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744347","url":null,"abstract":"In trace processors, a sequential program is partitioned at run time into \"traces\". A trace is an encapsulation of a dynamic sequence of instructions. A processor that uses traces as the unit of sequencing and execution achieves high instruction fetch rates and can support very wide-issue execution engines. We propose a new class of hardware optimizations that transform the instructions within traces to increase the performance of trace processors. Traces are \"pre-processed\" to optimize the instructions for execution together. We propose three specific optimizations: instruction scheduling, constant propagation, and instruction collapsing. Together, these optimizations offer substantial performance benefit, increasing performance by up to 24%.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126329516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RAPID-Cache-a reliable and inexpensive write cache for disk I/O systems","authors":"Yimin Hu, Qing Yang, Tycho Nightingale","doi":"10.1109/HPCA.1999.744364","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744364","url":null,"abstract":"This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two level hierarchy: a small NVRAM buffer on top of a log disk. The backup cache has nearly equivalent write performance as the primary RAM cache, while the read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetric cache architecture allows cost-effective designs for very large write caches for high-end disk I/O systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability and cost.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129895127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sung I. Hong, S. Mckee, M. H. Salinas, R. Klenke, J. Aylor, W. Wulf
{"title":"Access order and effective bandwidth for streams on a Direct Rambus memory","authors":"Sung I. Hong, S. Mckee, M. H. Salinas, R. Klenke, J. Aylor, W. Wulf","doi":"10.1109/HPCA.1999.744337","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744337","url":null,"abstract":"Processor speeds are increasing rapidly and memory speeds are not keeping up. Streaming computations (such as multimedia or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6 Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44-76% of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128973773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Plaat, H. Bal, Rutger F. H. Hofman, T. Kielmann
{"title":"Sensitivity of parallel applications to large differences in bandwidth and latency in two-layer interconnects","authors":"A. Plaat, H. Bal, Rutger F. H. Hofman, T. Kielmann","doi":"10.1109/HPCA.1999.744376","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744376","url":null,"abstract":"This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect-the \"NUMA gap\"-is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current generation NUMAs, performance suffers considerably (for applications that were designed for a uniform access interconnect). For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical-like the interconnect. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severely impacting speedup. We analyze why the improvements are needed, why they work so well, and how much non-uniformity they can mask.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131708222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically exploiting narrow width operands to improve processor power and performance","authors":"D. Brooks, M. Martonosi","doi":"10.1109/HPCA.1999.744314","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744314","url":null,"abstract":"In general-purpose microprocessors, recent trends have pushed towards 64 bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64 bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced \"multimedia\" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these \"narrow-bitwidth\" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off portions of integer arithmetic units that will be unnecessary for narrow bitwidth operations. This optimization results in an over 50% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging together narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132242110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Sohn, Y. Paek, Jui-Yuan Ku, Yuetsu Kodama, Y. Yamaguchi
{"title":"Communication studies of single-threaded and multithreaded distributed-memory machines","authors":"A. Sohn, Y. Paek, Jui-Yuan Ku, Yuetsu Kodama, Y. Yamaguchi","doi":"10.1109/HPCA.1999.744384","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744384","url":null,"abstract":"This report explicates the communication overlapping capabilities of three distributed-memory machines, SGI/Cray T3E, IBM SP-2 with wide nodes, and the ETL EM-X. Bitonic sorting and Fast Fourier Transform are selected for experiments. Various message sizes are used to determine when, where, how much and why the overlapping takes place. Experimental results with up to 64 processors indicated that the communication performance of EM-X is insensitive to various message sizes while SP-2 is the most sensitive. T3E stayed in between. The EM-X gave the highest communication overlapping capability while T3E did the lowest. The experimental results are compared with the analytical results based on LogP and LogGP communication models.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126902253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MP-LOCKs: replacing H/W synchronization primitives with message passing","authors":"Chen-Chi Kuo, J. Carter, R. Kuramkote","doi":"10.1109/HPCA.1999.744381","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744381","url":null,"abstract":"Shared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide implementation guidelines. MP-LOCKs reduce the design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms: test-and-test-and-set and MCS locks and found that MP-LOCKs scale better. For machines with 16 to 32 nodes, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), three applications with MP-LOCKs slow down by no more than 18%, while the other two slowed by no more than 180% due to higher software overhead. We conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that support efficient message passing mechanisms.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128364442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Dispatch Queue: a queue-based programming abstraction to parallelize fine-grain communication protocols","authors":"B. Falsafi, D. Wood","doi":"10.1109/HPCA.1999.744362","DOIUrl":"https://doi.org/10.1109/HPCA.1999.744362","url":null,"abstract":"This paper proposes a novel queue-based programming abstraction, Parallel Dispatch Queue (PDQ), that enables efficient parallel execution of fine-grain software communication protocols. Parallel systems often use fine-grain software handlers to integrate a network message into computation. Executing such handlers in parallel requires access synchronization around resources. Much as a monitor construct in a concurrent language protects accesses to a set of data structures, PDQ allows messages to include a synchronization key protecting handler accesses to a group of protocol resources. By simply synchronizing messages in a queue prior to dispatch, PDQ not only eliminates the overhead of acquiring/releasing synchronization primitives but also prevents busy-waiting within handlers. In this paper, we study PDQ's impact on software protocol performance in the context of fine-grain distributed shared memory (DSM) on an SMP cluster. Simulation results running shared-memory applications indicate that: (i) parallel software protocol execution using PDQ significantly improves performance in fine-grain DSM, (ii) tight integration of PDQ and embedded processors into a single custom device can offer performance competitive or better than an all-hardware DSM, and (iii) PDQ best benefits cost-effective systems that use idle SMP processors (rather than custom embedded processors) to execute protocols. On a cluster of 4 16-way SMPs, a PDQ-based parallel protocol running on idle SMP processors improves application performance by a factor of 2.6 over a system running a serial protocol on a single dedicated processor.","PeriodicalId":287867,"journal":{"name":"Proceedings Fifth International Symposium on High-Performance Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125941542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}