23rd Annual International Symposium on Computer Architecture (ISCA'96): Latest Publications

Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232989
Kenneth M. Wilson, K. Olukotun, M. Rosenblum
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache.
Citations: 68
MGS: A Multigrain Shared Memory System
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232980
D. Yeung, J. Kubiatowicz, A. Agarwal
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs).

This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced.

Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.
Citations: 77
An Analysis of Dynamic Branch Prediction Schemes on System Workloads
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232977
Nicolas Gloy, C. Young, Bradley Chen, Michael D. Smith
Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing our own Atom-generated system traces and the system traces from the Instruction Benchmark Suite, we quantify the effects of kernel and user interactions on branch prediction accuracy. We find that user-only traces yield accurate prediction results only when the kernel accounts for less than 5% of the total executed instructions. Schemes that appear to predict well under user-only traces are not always the most effective on full-system traces: the recently-proposed two-level adaptive schemes can suffer from higher aliasing than the original per-branch 2-bit counter scheme. We also find that flushing the branch history state at fixed intervals does not accurately model the true effects of user/kernel interaction.
Citations: 76
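The per-branch 2-bit counter scheme that the abstract above uses as its baseline can be sketched in a few lines. The table size, PC hashing, and toy trace below are illustrative assumptions, not details from the paper; the sketch also shows where aliasing comes from, since distinct branches that hash to the same entry share one counter.

```python
class TwoBitPredictor:
    """Classic table of 2-bit saturating counters, indexed by PC bits."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0,1 predict not-taken; 2,3 predict taken.
        self.table = [1] * entries

    def _index(self, pc):
        # Low-order PC bits index the table; branches that share an
        # index alias and pollute each other's counters.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# Toy trace: a loop branch at a hypothetical PC that is taken nine
# times and then falls through once.
pred = TwoBitPredictor()
trace = [(0x400, True)] * 9 + [(0x400, False)]
hits = 0
for pc, taken in trace:
    hits += pred.predict(pc) == taken
    pred.update(pc, taken)
print(hits)  # 8 of 10: one warm-up miss plus the final fall-through
```

Kernel references enter the same table in a full-system trace, which is why the paper finds user-only simulation misleading once kernel instructions are a significant fraction of the total.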
Performance Comparison of ILP Machines with Cycle Time Evaluation
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232995
Tetsuya Hara, H. Ando, Chikako Nakanishi, M. Nakaya
Many studies have investigated performance improvement through exploiting instruction-level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate performance improvement using the number of cycles required to execute a program, but do not quantitatively estimate the penalty the architecture imposes on the cycle time. Since the performance of a microprocessor must be measured by its execution time, a cycle time evaluation is required as well as a cycle count speedup evaluation. Currently, superscalar machines are widely accepted as the machines which achieve the highest performance. On the other hand, because of hardware simplicity and instruction scheduling sophistication, there is a perception that the next generation of microprocessors will be implemented with a VLIW architecture. A simple VLIW machine, however, has a serious weakness regarding speculative execution. Thus, it is an open question whether a simple VLIW machine really outperforms a superscalar machine. We recently proposed a mechanism called predicating that supports speculative execution for the VLIW machine, and showed a significant cycle count speedup over a scalar machine. Although the mechanism is simple, it is unknown how large a penalty it imposes on the cycle time, and how much performance improves as a result. This paper evaluates both the cycle count speedup and the cycle time for three ILP machines: a superscalar machine, a simple VLIW machine, and the VLIW machine with predicating. The evaluation results show that the simple VLIW machine slightly outperforms the superscalar machine, while the VLIW machine with predicating achieves a significant speedup of 1.41x over the superscalar machine.
Citations: 25
COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232981
C. Morin, A. Gefflaut, M. Banâtre, Anne-Marie Kermarrec
Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they must be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMA) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMA designs by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution-driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, results also show that the proposed scheme preserves the architecture's scalability and that the memory overhead remains low for parallel applications using mostly shared data.
Citations: 34
The Difference-Bit Cache
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232986
Toni Juan, T. Lang, J. Navarro
The difference-bit cache is a two-way set-associative cache with an access time that is smaller than that of a conventional one and close or equal to that of a direct-mapped cache. This is achieved by noticing that the two tags for a set have to differ in at least one bit, and by using this bit to select the way. In contrast with previous approaches that predict the way and have two types of hits (primary of one cycle and secondary of two to four cycles), all hits of the difference-bit cache take one cycle. The evaluation of the access time of our cache organization has been performed using a recently proposed on-chip cache access model.
Citations: 43
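The core observation above (two valid tags in a set must differ somewhere, so one recorded bit position can steer the lookup to a single way) can be sketched for one set. Field widths, the fallback path, and the class names are assumptions for illustration, not the paper's hardware design.

```python
def first_diff_bit(a, b):
    """Position of the highest bit where tags a and b differ."""
    return (a ^ b).bit_length() - 1 if a != b else None

class DiffBitSet:
    """One set of a two-way cache using the difference-bit idea."""

    def __init__(self):
        self.tags = [None, None]   # the two ways of this set
        self.diff = None           # bit position distinguishing them

    def insert(self, tag, way):
        self.tags[way] = tag
        if self.tags[0] is not None and self.tags[1] is not None:
            self.diff = first_diff_bit(self.tags[0], self.tags[1])

    def lookup(self, tag):
        if self.diff is None:
            way = 0 if self.tags[0] == tag else 1  # fallback path
        else:
            # Select the way from one bit of the incoming tag, then
            # confirm with a single tag comparison (one-cycle hit).
            bit = (tag >> self.diff) & 1
            way = 0 if ((self.tags[0] >> self.diff) & 1) == bit else 1
        return way if self.tags[way] == tag else None

s = DiffBitSet()
s.insert(0b1010, 0)
s.insert(0b1000, 1)   # differs from way 0 in bit 1
print(s.lookup(0b1010), s.lookup(0b1000), s.lookup(0b1111))
# 0 1 None
```

Only one tag comparison is needed per access, which is how the scheme avoids the two-comparator critical path of a conventional two-way cache.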
DCD --- Disk Caching Disk: A New Approach for Boosting I/O Performance
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232991
Yimin Hu, Qing Yang
This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk, referred to as the cache-disk, as a secondary disk cache to optimize write performance. While the cache-disk and the normal data disk have the same physical properties, the access speed of the former differs dramatically from the latter because of different data units and different ways in which data are accessed. Our objective is to exploit this speed difference by using the log disk as a cache to build a reliable and smooth disk hierarchy. A small RAM buffer is used to collect small write requests to form a log, which is transferred onto the cache-disk whenever the cache-disk is idle. Because of the temporal locality that exists in office/engineering workload environments, the DCD system shows write performance close to that of the same-size RAM (i.e. solid-state disk) for the cost of a disk. Moreover, the cache-disk can also be implemented as a logical disk, in which case a small portion of the normal data disk is used as the log disk. Trace-driven simulation experiments are carried out to evaluate the performance of the proposed disk architecture. Under the office/engineering workload environment, the DCD shows superb disk performance for writes as compared to existing disk systems. Performance improvements of up to two orders of magnitude are observed in terms of average response time for write operations. Furthermore, DCD is very reliable and works at the device or device driver level. As a result, it can be applied directly to current file systems without the need to change the operating system.
Citations: 167
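The write path described above (small writes absorbed by a RAM buffer, appended to the cache-disk as one sequential log segment, and destaged to the data disk during idle time) can be modeled as a toy hierarchy. The buffer size, flush trigger, and class layout are illustrative assumptions, not the paper's implementation.

```python
class DCD:
    """Toy model of the Disk Caching Disk write hierarchy."""

    def __init__(self, buffer_slots=4):
        self.ram = {}              # block -> data, pending small writes
        self.buffer_slots = buffer_slots
        self.cache_disk_log = []   # sequential log segments
        self.data_disk = {}       # final home locations of blocks

    def write(self, block, data):
        self.ram[block] = data    # fast path: no seek, returns at once
        if len(self.ram) >= self.buffer_slots:
            self.flush_log()

    def flush_log(self):
        # One large sequential log write replaces many small random
        # writes, which is where the speed difference comes from.
        if self.ram:
            self.cache_disk_log.append(dict(self.ram))
            self.ram.clear()

    def destage(self):
        # During idle periods, move logged blocks to their home slots.
        for segment in self.cache_disk_log:
            self.data_disk.update(segment)
        self.cache_disk_log.clear()

    def read(self, block):
        # Check the hierarchy from fastest to slowest level.
        if block in self.ram:
            return self.ram[block]
        for segment in reversed(self.cache_disk_log):
            if block in segment:
                return segment[block]
        return self.data_disk.get(block)

d = DCD()
for i in range(5):
    d.write(i, f"blk{i}")   # blocks 0-3 trigger a log flush
d.destage()
print(d.read(0), d.read(4))  # blk0 blk4
```

Because every level holds durable copies once flushed, reads stay correct regardless of when destaging happens, which mirrors the reliability argument in the abstract.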
STiNG: A CC-NUMA Computer System for the Commercial Marketplace
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.233006
Thomas D. Lovett, R. Clapp
"STiNG" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four-processor Symmetric Multi-Processor (SMP) nodes (called Quads) using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to four P6 processors, each Quad may contain up to 4 GBytes of system memory, two Peripheral Component Interface (PCI) busses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures system-wide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed.
Citations: 275
Don't Use the Page Number, but a Pointer to It
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232985
André Seznec
Most newly announced high performance microprocessors support 64-bit virtual addresses, and the width of physical addresses is also growing. As a result, the size of the address tags in the L1 cache is increasing. The impact on on-chip area is particularly dramatic when small block sizes are used. At the same time, the performance of high performance microprocessors depends more and more on the accuracy of branch prediction, and for reasons similar to those in the case of caches, the size of the Branch Target Buffer is also increasing linearly with the address width. In this paper, we apply the simple principle stated in the title for limiting the tag size of on-chip caches. In the resulting indirect-tagged cache, the duplication of the page number in the processor (in the TLB and in cache tags) is removed. The tag check is then simplified, and the tag cost does not depend on the address width. Applying the same principle to Branch Target Buffers, we propose the Reduced Branch Target Buffer. The storage size in a Reduced Branch Target Buffer does not depend on the address width and is dramatically smaller than that of the conventional implementation of a Branch Target Buffer.
Citations: 25
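The indirect-tagging principle above can be illustrated with a minimal sketch: instead of storing a full physical page number in each cache tag, store the index of the TLB entry that holds it, so the tag width stops growing with the address width. The TLB size, field names, and layout below are assumptions for illustration.

```python
TLB_ENTRIES = 64  # a 6-bit pointer replaces a page number of ~50+ bits

class IndirectTaggedLine:
    """Cache line tag holding a TLB pointer rather than a page number."""

    def __init__(self, tlb_index, offset_tag):
        self.tlb_index = tlb_index    # pointer into the TLB
        self.offset_tag = offset_tag  # remaining in-page tag bits

def hit(line, access_tlb_index, access_offset_tag):
    # The hit check compares only the small TLB pointer and the
    # in-page bits: two addresses pointing at the same TLB entry lie
    # in the same page by construction, so no wide page-number
    # comparison is needed.
    return (line.tlb_index == access_tlb_index
            and line.offset_tag == access_offset_tag)

line = IndirectTaggedLine(tlb_index=5, offset_tag=0b1101)
print(hit(line, 5, 0b1101), hit(line, 6, 0b1101))  # True False
```

The same trade-off applies to the Reduced Branch Target Buffer: the comparison width is fixed by the TLB size, not by how wide virtual or physical addresses become.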
Understanding Application Performance on Shared Virtual Memory Systems
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232987
L. Iftode, J. Singh, Kai Li
Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap by studying the performance of a range of applications in detail and understanding it in light of application characteristics.

We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC)---with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less optimized program implementations. We find that SVM can indeed perform quite well for systems of at least up to 32 processors for several nontrivial applications. However, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger. The hardware-assisted AURC system tends to perform significantly better than the all-software LRC under our system assumptions, particularly when realistic cache hierarchies are used.
Citations: 80