23rd Annual International Symposium on Computer Architecture (ISCA'96)最新文献_第3页

Early Experience with Message-Passing on the SHRIMP Multicomputer 在SHRIMP多计算机上进行消息传递的早期经验

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI: 10.1145/232973.233004

E. Felten, R. Alpert, A. Bilas, M. Blumrich, D. Clark, Stefanos N. Damianakis, C. Dubnicki, L. Iftode, Kai Li

{"title":"Early Experience with Message-Passing on the SHRIMP Multicomputer","authors":"E. Felten, R. Alpert, A. Bilas, M. Blumrich, D. Clark, Stefanos N. Damianakis, C. Dubnicki, L. Iftode, Kai Li","doi":"10.1145/232973.233004","DOIUrl":"https://doi.org/10.1145/232973.233004","url":null,"abstract":"The SHRIMP multicomputer provides virtual memory-mapped communication (VMMC), which supports protected, user-level message passing, allows user programs to perform their own buffer management, and separates data transfers from control transfers so that a data transfer can be done without the intervention of the receiving node CPU. An important question is whether such a mechanism can indeed deliver all of the available hardware performance to applications which use conventional message-passing libraries.This paper reports our early experience with message-passing on a small, working SHRIMP multicomputer. We have implemented several user-level communication libraries on top of the VMMC mechanism, including the NX message-passing interface, Sun RPC, stream sockets, and specialized RPC. The first three are fully compatible with existing systems. Our experience shows that the VMMC mechanism supports these message-passing interfaces well. When zero-copy protocols are allowed by the semantics of the interface, VMMC can effectively deliver to applications almost all of the raw hardware's communication performance.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130511611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 64

High-Bandwidth Address Translation for Multiple-Issue Processors 多问题处理器的高带宽地址转换

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI: 10.1145/232973.232990

T. Austin, G. Sohi

{"title":"High-Bandwidth Address Translation for Multiple-Issue Processors","authors":"T. Austin, G. Sohi","doi":"10.1145/232973.232990","DOIUrl":"https://doi.org/10.1145/232973.232990","url":null,"abstract":"In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency.We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multiported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times.We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single-ported TLB performs almost as well as a same-sized multi-level TLB with the added benefit of decreased access latency for physically indexed caches.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"267 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116587165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 48

Memory Bandwidth Limitations of Future Microprocessors 未来微处理器的内存带宽限制

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI: 10.1145/232973.232983

D. Burger, J. Goodman, A. Kägi

{"title":"Memory Bandwidth Limitations of Future Microprocessors","authors":"D. Burger, J. Goodman, A. Kägi","doi":"10.1145/232973.232983","DOIUrl":"https://doi.org/10.1145/232973.232983","url":null,"abstract":"This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches---implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"170 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114112654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 463

Missing the Memory Wall: The Case for Processor/Memory Integration 错过内存墙:处理器/内存集成的案例

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI: 10.1145/232973.232984

Ashley Saulsbury, Fong Pong, A. Nowatzyk

{"title":"Missing the Memory Wall: The Case for Processor/Memory Integration","authors":"Ashley Saulsbury, Fong Pong, A. Nowatzyk","doi":"10.1145/232973.232984","DOIUrl":"https://doi.org/10.1145/232973.232984","url":null,"abstract":"Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance.This paper argues for an integrated system approach that uses less-powerful CPUs that are tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor memory integration can be used to build competitive, scalable and cost-effective MP systems.We present results from execution driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct mapped instruction caches with long lines are very effective, as are column buffer data caches augmented with a victim cache.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"230 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114209234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 254

Correlation and Aliasing in Dynamic Branch Predictors 动态分支预测器中的关联和混叠

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI: 10.1145/232973.232978

S. Sechrest, Chih-Chieh Lee, T. Mudge

{"title":"Correlation and Aliasing in Dynamic Branch Predictors","authors":"S. Sechrest, Chih-Chieh Lee, T. Mudge","doi":"10.1145/232973.232978","DOIUrl":"https://doi.org/10.1145/232973.232978","url":null,"abstract":"Previous branch prediction studies have relied primarily upon the SPECint89 and SPECint92 benchmarks for evaluation. Most of these benchmarks exercise a very small amount of code. As a consequence, the resources required by these schemes for accurate predictions of larger programs have not been clear. Moreover, many of these studies have simulated a very limited number of configurations. Here we report on simulations of a variety of branch prediction schemes using a set of relatively large benchmark programs that we believe to be more representative of likely system workloads. We have examined the sensitivity of these prediction schemes to variation in workload, in resources, and in design and configuration. We show that for predictors with small available resources, aliasing between distinct branches can have the dominant influence on prediction accuracy. For global history based schemes, such as GAs and gshare, aliasing in the predictor table can eliminate any advantage gained through inter branch correlation. For self-history based prediction scheme, such as PAs, it is aliasing in the buffer recording branch history, rather than the predictor table, that poses problems. Past studies have sometimes confused these effects and allocated resources incorrectly.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128953315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 110

Decoupled Hardware Support for Distributed Shared Memory 分布式共享内存的解耦硬件支持

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-01 DOI: 10.1145/232973.232979

S. Reinhardt, Robert W. Pfile, D. Wood

{"title":"Decoupled Hardware Support for Distributed Shared Memory","authors":"S. Reinhardt, Robert W. Pfile, D. Wood","doi":"10.1145/232973.232979","DOIUrl":"https://doi.org/10.1145/232973.232979","url":null,"abstract":"This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices.We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance.We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125050367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 82

Coherent Network Interfaces for Fine-Grain Communication 用于细粒度通信的相干网络接口

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-01 DOI: 10.1145/232973.232999

Shubhendu S. Mukherjee, B. Falsafi, M. Hill, D. Wood

{"title":"Coherent Network Interfaces for Fine-Grain Communication","authors":"Shubhendu S. Mukherjee, B. Falsafi, M. Hill, D. Wood","doi":"10.1145/232973.232999","DOIUrl":"https://doi.org/10.1145/232973.232999","url":null,"abstract":"Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve the performance by 17-53% on the memory bus and 30-88% on the I/O bus.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114930966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 104