Proceedings Fifth International Symposium on High-Performance Computer Architecture: latest papers, page 3

WildFire: a scalable path for SMPs

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744361

Erik Hagersten, M. Koster

Abstract: Researchers have searched for scalable alternatives to the symmetric multiprocessor (SMP) architecture since it was first introduced in 1982. The paper introduces an alternative view of the relationship between scalable technologies and SMPs. Instead of replacing large SMPs with scalable technology, we propose new scalable techniques that allow large SMPs to be tied together efficiently, while maintaining the compatibility with, and performance characteristics of, an SMP. The trade-offs of such an architecture differ from those of traditional, scalable, cache-coherent Non-Uniform Memory Architecture (cc-NUMA) approaches. WildFire is a distributed shared memory (DSM) prototype implementation based on large SMPs. It relies on two techniques for creating application-transparent locality: Coherent Memory Replication (CMR), which is a variation of Simple COMA/Reactive NUMA, and Hierarchical Affinity Scheduling (HAS). These two optimizations create extra node locality, which blurs the node boundaries to an application such that SMP-like performance can be achieved with no NUMA-specific optimizations. We present a performance study of a large OLTP benchmark running on DSMs built from various sized nodes and with varying amounts of application-transparent locality. WildFire's measured performance is shown to be more than two times that of an unoptimized NUMA implementation built from small nodes and within 13% of the performance of the ideal implementation: a large SMP with the same access time to its entire shared memory as the local memory access time of WildFire.

Citations: 152

Dynamically variable line-size cache exploiting high on-chip memory bandwidth of merged DRAM/logic LSIs

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744366

Koji Inoue, K. Kai, K. Murakami

Citations: 38

Instruction pre-processing in trace processors

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744347

Quinn Jacobson, James E. Smith

Citations: 62

RAPID-Cache - a reliable and inexpensive write cache for disk I/O systems

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744364

Yimin Hu, Qing Yang, Tycho Nightingale

Abstract: This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two-level hierarchy: a small NVRAM buffer on top of a log disk. The backup cache has write performance nearly equivalent to that of the primary RAM cache, while the read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetric cache architecture allows cost-effective designs for very large write caches for high-end disk I/O systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability and cost.
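The asymmetric write path described in the abstract above can be illustrated with a toy model. The following Python sketch is illustrative only: the class name `RapidCacheSketch`, the parameter `nvram_capacity`, and the flush policy (flush when the buffer is full) are assumptions made for this example, and the devices are modelled as in-memory dictionaries and lists rather than real hardware. It is not the authors' implementation.

```python
class RapidCacheSketch:
    """Toy model: a primary RAM cache plus a backup of a small NVRAM buffer over a log disk."""

    def __init__(self, nvram_capacity=2):
        self.primary = {}       # fast-write, fast-read RAM cache: block -> data
        self.nvram = []         # small NVRAM buffer of (block, data) records
        self.log_disk = []      # append-only log: fast to write, slow to read
        self.nvram_capacity = nvram_capacity  # hypothetical flush threshold

    def write(self, block, data):
        # Every write lands in both redundant buffers.
        self.primary[block] = data
        self.nvram.append((block, data))
        if len(self.nvram) >= self.nvram_capacity:
            # Flush the NVRAM buffer to the log disk as one sequential append.
            self.log_disk.extend(self.nvram)
            self.nvram.clear()

    def read(self, block):
        # Normal reads are served by the primary cache only.
        return self.primary.get(block)

    def recover(self):
        # After losing the primary cache, replay the log and then the NVRAM
        # buffer, so that the newest record for each block wins.
        self.primary = {}
        for block, data in self.log_disk + self.nvram:
            self.primary[block] = data


cache = RapidCacheSketch(nvram_capacity=2)
cache.write(7, "a")
cache.write(8, "b")     # fills the NVRAM buffer and triggers a flush to the log
cache.write(7, "c")     # newer version of block 7 stays in NVRAM
cache.primary.clear()   # simulate failure of the primary RAM cache
cache.recover()
print(cache.read(7), cache.read(8))  # prints: c b
```

The backup is only read inside `recover()`, which mirrors the abstract's point that slow reads from the backup hierarchy matter only during error recovery.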

Citations: 42

Access order and effective bandwidth for streams on a Direct Rambus memory

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744337

Sung I. Hong, S. Mckee, M. H. Salinas, R. Klenke, J. Aylor, W. Wulf

Citations: 66

Sensitivity of parallel applications to large differences in bandwidth and latency in two-layer interconnects

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744376

A. Plaat, H. Bal, Rutger F. H. Hofman, T. Kielmann

Abstract: This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect (the "NUMA gap") is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current generation NUMAs, performance suffers considerably (for applications that were designed for a uniform access interconnect). For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical, like the interconnect. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severely impacting speedup. We analyze why the improvements are needed, why they work so well, and how much non-uniformity they can mask.
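To make the idea of hierarchical communication patterns concrete, the sketch below counts how many messages of a single broadcast cross the slow inter-cluster links. The broadcast pattern, the function name `slow_link_messages`, and the cluster sizes are assumptions chosen for illustration; the abstract above does not say which patterns the six applications use.

```python
def slow_link_messages(cluster_sizes, hierarchical):
    """Count broadcast messages that cross the slow inter-cluster links.

    The root of the broadcast is assumed to live in cluster 0.
    """
    remote = cluster_sizes[1:]
    if hierarchical:
        # One message per remote cluster goes to a local coordinator,
        # which forwards it over the fast intra-cluster links.
        return sum(1 for size in remote if size > 0)
    # Flat pattern: the root sends directly to every remote node.
    return sum(remote)


clusters = [8, 8, 8, 8]  # hypothetical: four clusters of eight nodes
flat = slow_link_messages(clusters, hierarchical=False)
tree = slow_link_messages(clusters, hierarchical=True)
print(flat, tree)  # prints: 24 3
```

In this toy case the hierarchical pattern sends 3 messages over slow links instead of 24, which is the kind of traffic reduction the abstract attributes to matching the communication pattern to the interconnect.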

Citations: 97

Dynamically exploiting narrow width operands to improve processor power and performance

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744314

D. Brooks, M. Martonosi

Abstract: In general-purpose microprocessors, recent trends have pushed towards 64-bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64-bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off portions of integer arithmetic units that will be unnecessary for narrow bitwidth operations. This optimization results in a reduction of over 50% in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging together narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.
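The abstract above does not spell out how a narrow operand is recognized. As a rough software analogue, the sketch below assumes that an operand counts as narrow when all bits above the low 16 are a pure sign extension (all zeros or all ones). The function names `is_narrow` and `can_gate_upper_bits` are hypothetical, and the sketch covers only the detection step, not clock gating or operation merging.

```python
WORD_BITS = 64


def is_narrow(value, narrow_bits=16):
    """Return True if a 64-bit two's-complement word fits in `narrow_bits` bits."""
    value &= (1 << WORD_BITS) - 1        # treat the input as a raw 64-bit word
    upper = value >> (narrow_bits - 1)   # sign bit of the narrow field plus all higher bits
    all_ones = (1 << (WORD_BITS - narrow_bits + 1)) - 1
    # Narrow means the upper bits carry no information beyond sign/zero extension.
    return upper == 0 or upper == all_ones


def can_gate_upper_bits(a, b, narrow_bits=16):
    """Both operands are narrow, so the upper operand bits carry no extra information."""
    return is_narrow(a, narrow_bits) and is_narrow(b, narrow_bits)


print(is_narrow(1234), is_narrow(-5), is_narrow(1 << 20), can_gate_upper_bits(17, -3))
# prints: True True False True
```

A real design would also have to account for results that grow beyond the narrow width (for example, the carry out of a 16-bit add), which this sketch deliberately leaves out.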

Citations: 306

Communication studies of single-threaded and multithreaded distributed-memory machines

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744384

A. Sohn, Y. Paek, Jui-Yuan Ku, Yuetsu Kodama, Y. Yamaguchi

Citations: 1

MP-LOCKs: replacing H/W synchronization primitives with message passing

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744381

Chen-Chi Kuo, J. Carter, R. Kuramkote

Abstract: Shared memory programs guarantee the correctness of concurrent accesses to shared data using interprocessor synchronization operations. The most common synchronization operators are locks, which are traditionally implemented via a mix of shared memory accesses and hardware synchronization primitives like test-and-set. In this paper, we argue that synchronization operations implemented using fast message passing and kernel-embedded lock managers are an attractive alternative to dedicated synchronization hardware. We propose three message passing lock (MP-LOCK) algorithms (centralized, distributed, and reactive) and provide implementation guidelines. MP-LOCKs reduce the design complexity and runtime occupancy of DSM controllers and can exploit software's inherent flexibility to adapt to differing applications' lock access patterns. We compared the performance of MP-LOCKs with two common shared memory lock algorithms, test-and-test-and-set and MCS locks, and found that MP-LOCKs scale better. For machines with 16 to 32 nodes, applications using MP-LOCKs ran up to 186% faster than the same applications with shared memory locks. For small systems (up to 8 nodes), three applications with MP-LOCKs slow down by no more than 18%, while the other two slowed by no more than 180% due to higher software overhead. We conclude that locks based on message passing should be considered as a replacement for hardware locks in future scalable multiprocessors that support efficient message passing mechanisms.
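A centralized message-passing lock can be sketched as a manager that reacts to acquire and release messages and replies with grants. The code below is a minimal single-process simulation under stated assumptions: the class `CentralLockManager`, the message tuples, and the FIFO hand-off policy are invented for illustration, and the abstract above does not describe the authors' algorithms at this level of detail.

```python
from collections import deque


class CentralLockManager:
    """Toy centralized lock manager driven by acquire/release messages."""

    def __init__(self):
        self.holder = {}    # lock id -> node currently holding the lock
        self.waiters = {}   # lock id -> FIFO queue of waiting nodes
        self.outbox = []    # (destination node, "grant", lock id) replies

    def handle(self, sender, kind, lock_id):
        if kind == "acquire":
            if lock_id not in self.holder:
                # Lock is free: grant it immediately.
                self.holder[lock_id] = sender
                self.outbox.append((sender, "grant", lock_id))
            else:
                # Lock is held: queue the requester instead of letting it spin.
                self.waiters.setdefault(lock_id, deque()).append(sender)
        elif kind == "release":
            if self.holder.get(lock_id) != sender:
                raise ValueError("release from a node that does not hold the lock")
            queue = self.waiters.get(lock_id)
            if queue:
                # Hand the lock directly to the oldest waiter.
                next_node = queue.popleft()
                self.holder[lock_id] = next_node
                self.outbox.append((next_node, "grant", lock_id))
            else:
                del self.holder[lock_id]
        else:
            raise ValueError(f"unknown message kind: {kind}")


manager = CentralLockManager()
messages = [(0, "acquire", "L"), (1, "acquire", "L"), (2, "acquire", "L"),
            (0, "release", "L"), (1, "release", "L")]
for message in messages:
    manager.handle(*message)
print(manager.outbox)
# prints: [(0, 'grant', 'L'), (1, 'grant', 'L'), (2, 'grant', 'L')]
```

Waiting nodes receive exactly one grant message each and never poll shared memory, which is the property that distinguishes this style from test-and-test-and-set spinning.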

Citations: 13

Parallel Dispatch Queue: a queue-based programming abstraction to parallelize fine-grain communication protocols

Proceedings Fifth International Symposium on High-Performance Computer Architecture Pub Date : 1999-01-09 DOI: 10.1109/HPCA.1999.744362

B. Falsafi, D. Wood

Abstract: This paper proposes a novel queue-based programming abstraction, Parallel Dispatch Queue (PDQ), that enables efficient parallel execution of fine-grain software communication protocols. Parallel systems often use fine-grain software handlers to integrate a network message into computation. Executing such handlers in parallel requires access synchronization around resources. Much as a monitor construct in a concurrent language protects accesses to a set of data structures, PDQ allows messages to include a synchronization key protecting handler accesses to a group of protocol resources. By simply synchronizing messages in a queue prior to dispatch, PDQ not only eliminates the overhead of acquiring/releasing synchronization primitives but also prevents busy-waiting within handlers. In this paper, we study PDQ's impact on software protocol performance in the context of fine-grain distributed shared memory (DSM) on an SMP cluster. Simulation results running shared-memory applications indicate that: (i) parallel software protocol execution using PDQ significantly improves performance in fine-grain DSM, (ii) tight integration of PDQ and embedded processors into a single custom device can offer performance competitive with or better than an all-hardware DSM, and (iii) PDQ best benefits cost-effective systems that use idle SMP processors (rather than custom embedded processors) to execute protocols. On a cluster of four 16-way SMPs, a PDQ-based parallel protocol running on idle SMP processors improves application performance by a factor of 2.6 over a system running a serial protocol on a single dedicated processor.
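The key-based dispatch idea in the abstract above can be sketched as a queue that only hands out a message when no handler is currently running with the same synchronization key. The class `ParallelDispatchQueueSketch` and its method names are hypothetical, and the sketch is a sequential simulation of the dispatch decision rather than a parallel implementation.

```python
from collections import deque


class ParallelDispatchQueueSketch:
    """Toy dispatch queue: messages with the same key never run concurrently."""

    def __init__(self):
        self.pending = deque()   # (key, payload) in arrival order
        self.in_flight = set()   # keys whose handler is currently running

    def enqueue(self, key, payload):
        self.pending.append((key, payload))

    def dispatch(self):
        """Return the oldest message whose key is free, or None if all are blocked."""
        for index, (key, payload) in enumerate(self.pending):
            if key not in self.in_flight:
                del self.pending[index]
                self.in_flight.add(key)
                return key, payload
        return None

    def complete(self, key):
        # Called when a handler finishes; frees the key for later messages.
        self.in_flight.discard(key)


queue = ParallelDispatchQueueSketch()
queue.enqueue("page1", "m1")
queue.enqueue("page1", "m2")
queue.enqueue("page2", "m3")
first = queue.dispatch()    # page1 is free, so m1 is dispatched
second = queue.dispatch()   # m2 is blocked behind m1, so m3 is dispatched
third = queue.dispatch()    # only m2 remains and its key is busy
queue.complete("page1")
fourth = queue.dispatch()   # m2 can now run
print(first, second, third, fourth)
# prints: ('page1', 'm1') ('page2', 'm3') None ('page1', 'm2')
```

Because conflicts are resolved before a handler starts, the handlers themselves take no locks and never busy-wait, which is the benefit the abstract describes.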

Citations: 11