23rd Annual International Symposium on Computer Architecture (ISCA'96): Latest Publications

Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232989
Kenneth M. Wilson, K. Olukotun, M. Rosenblum
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache.
Citations: 68
MGS: A Multigrain Shared Memory System
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232980
D. Yeung, J. Kubiatowicz, A. Agarwal
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs).

This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced.

Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.
Citations: 77
An Analysis of Dynamic Branch Prediction Schemes on System Workloads
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232977
Nicolas Gloy, C. Young, Bradley Chen, Michael D. Smith
Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing our own Atom-generated system traces and the system traces from the Instruction Benchmark Suite, we quantify the effects of kernel and user interactions on branch prediction accuracy. We find that user-only traces yield accurate prediction results only when the kernel accounts for less than 5% of the total executed instructions. Schemes that appear to predict well under user-only traces are not always the most effective on full-system traces: the recently-proposed two-level adaptive schemes can suffer from higher aliasing than the original per-branch 2-bit counter scheme. We also find that flushing the branch history state at fixed intervals does not accurately model the true effects of user/kernel interaction.
Citations: 76
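The per-branch 2-bit counter scheme that the abstract above uses as its baseline can be sketched in a few lines. The table size, PC hashing, and toy trace below are illustrative assumptions, not details from the paper; the sketch also shows where aliasing comes from, since distinct branches that hash to the same entry share one counter.

```python
class TwoBitPredictor:
    """Classic table of 2-bit saturating counters, indexed by PC bits."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Counter values 0,1 predict not-taken; 2,3 predict taken.
        self.table = [1] * entries

    def _index(self, pc):
        # Low-order PC bits index the table; branches that share an
        # index alias and pollute each other's counters.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# Toy trace: a loop branch at a hypothetical PC that is taken nine
# times and then falls through once.
pred = TwoBitPredictor()
trace = [(0x400, True)] * 9 + [(0x400, False)]
hits = 0
for pc, taken in trace:
    hits += pred.predict(pc) == taken
    pred.update(pc, taken)
print(hits)  # 8 of 10: one warm-up miss plus the final fall-through
```

Kernel references enter the same table in a full-system trace, which is why the paper finds user-only simulation misleading once kernel instructions are a significant fraction of the total.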
Performance Comparison of ILP Machines with Cycle Time Evaluation
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232995
Tetsuya Hara, H. Ando, Chikako Nakanishi, M. Nakaya
Many studies have investigated performance improvement through exploiting instruction-level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate performance improvement using the number of cycles required to execute a program, but do not quantitatively estimate the penalty the architecture imposes on the cycle time. Since the performance of a microprocessor must be measured by its execution time, a cycle time evaluation is required as well as a cycle count speedup evaluation. Currently, superscalar machines are widely accepted as the machines which achieve the highest performance. On the other hand, because of hardware simplicity and instruction scheduling sophistication, there is a perception that the next generation of microprocessors will be implemented with a VLIW architecture. A simple VLIW machine, however, has a serious weakness regarding speculative execution. Thus, it is an open question whether a simple VLIW machine really outperforms a superscalar machine. We recently proposed a mechanism called predicating that supports speculative execution for the VLIW machine, and showed a significant cycle count speedup over a scalar machine. Although the mechanism is simple, it is unknown how large a penalty it imposes on the cycle time, and how much performance improves as a result. This paper evaluates both the cycle count speedup and the cycle time for three ILP machines: a superscalar machine, a simple VLIW machine, and the VLIW machine with predicating. The evaluation results show that the simple VLIW machine slightly outperforms the superscalar machine, while the VLIW machine with predicating achieves a significant speedup of 1.41x over the superscalar machine.
Citations: 25
COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232981
C. Morin, A. Gefflaut, M. Banâtre, Anne-Marie Kermarrec
Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they must be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMA) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMA designs by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution-driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, results also show that the proposed scheme preserves the architecture's scalability and that the memory overhead remains low for parallel applications using mostly shared data.
Citations: 34
The Difference-Bit Cache
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232986
Toni Juan, T. Lang, J. Navarro
The difference-bit cache is a two-way set-associative cache with an access time that is smaller than that of a conventional one and close or equal to that of a direct-mapped cache. This is achieved by noticing that the two tags for a set have to differ in at least one bit, and by using this bit to select the way. In contrast with previous approaches that predict the way and have two types of hits (primary of one cycle and secondary of two to four cycles), all hits of the difference-bit cache take one cycle. The evaluation of the access time of our cache organization has been performed using a recently proposed on-chip cache access model.
Citations: 43
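The core observation above (two valid tags in a set must differ somewhere, so one recorded bit position can steer the lookup to a single way) can be sketched for one set. Field widths, the fallback path, and the class names are assumptions for illustration, not the paper's hardware design.

```python
def first_diff_bit(a, b):
    """Position of the highest bit where tags a and b differ."""
    return (a ^ b).bit_length() - 1 if a != b else None

class DiffBitSet:
    """One set of a two-way cache using the difference-bit idea."""

    def __init__(self):
        self.tags = [None, None]   # the two ways of this set
        self.diff = None           # bit position distinguishing them

    def insert(self, tag, way):
        self.tags[way] = tag
        if self.tags[0] is not None and self.tags[1] is not None:
            self.diff = first_diff_bit(self.tags[0], self.tags[1])

    def lookup(self, tag):
        if self.diff is None:
            way = 0 if self.tags[0] == tag else 1  # fallback path
        else:
            # Select the way from one bit of the incoming tag, then
            # confirm with a single tag comparison (one-cycle hit).
            bit = (tag >> self.diff) & 1
            way = 0 if ((self.tags[0] >> self.diff) & 1) == bit else 1
        return way if self.tags[way] == tag else None

s = DiffBitSet()
s.insert(0b1010, 0)
s.insert(0b1000, 1)   # differs from way 0 in bit 1
print(s.lookup(0b1010), s.lookup(0b1000), s.lookup(0b1111))
# 0 1 None
```

Only one tag comparison is needed per access, which is how the scheme avoids the two-comparator critical path of a conventional two-way cache.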
DCD --- Disk Caching Disk: A New Approach for Boosting I/O Performance
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232991
Yimin Hu, Qing Yang
This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk, referred to as the cache-disk, as a secondary disk cache to optimize write performance. While the cache-disk and the normal data disk have the same physical properties, the access speed of the former differs dramatically from the latter because of different data units and different ways in which data are accessed. Our objective is to exploit this speed difference by using the log disk as a cache to build a reliable and smooth disk hierarchy. A small RAM buffer is used to collect small write requests to form a log, which is transferred onto the cache-disk whenever the cache-disk is idle. Because of the temporal locality that exists in office/engineering workload environments, the DCD system shows write performance close to that of the same-size RAM (i.e. solid-state disk) for the cost of a disk. Moreover, the cache-disk can also be implemented as a logical disk, in which case a small portion of the normal data disk is used as the log disk. Trace-driven simulation experiments are carried out to evaluate the performance of the proposed disk architecture. Under the office/engineering workload environment, the DCD shows superb disk performance for writes as compared to existing disk systems. Performance improvements of up to two orders of magnitude are observed in terms of average response time for write operations. Furthermore, DCD is very reliable and works at the device or device driver level. As a result, it can be applied directly to current file systems without the need to change the operating system.
Citations: 167
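The write path described above (small writes absorbed by a RAM buffer, appended to the cache-disk as one sequential log segment, and destaged to the data disk during idle time) can be modeled as a toy hierarchy. The buffer size, flush trigger, and class layout are illustrative assumptions, not the paper's implementation.

```python
class DCD:
    """Toy model of the Disk Caching Disk write hierarchy."""

    def __init__(self, buffer_slots=4):
        self.ram = {}              # block -> data, pending small writes
        self.buffer_slots = buffer_slots
        self.cache_disk_log = []   # sequential log segments
        self.data_disk = {}       # final home locations of blocks

    def write(self, block, data):
        self.ram[block] = data    # fast path: no seek, returns at once
        if len(self.ram) >= self.buffer_slots:
            self.flush_log()

    def flush_log(self):
        # One large sequential log write replaces many small random
        # writes, which is where the speed difference comes from.
        if self.ram:
            self.cache_disk_log.append(dict(self.ram))
            self.ram.clear()

    def destage(self):
        # During idle periods, move logged blocks to their home slots.
        for segment in self.cache_disk_log:
            self.data_disk.update(segment)
        self.cache_disk_log.clear()

    def read(self, block):
        # Check the hierarchy from fastest to slowest level.
        if block in self.ram:
            return self.ram[block]
        for segment in reversed(self.cache_disk_log):
            if block in segment:
                return segment[block]
        return self.data_disk.get(block)

d = DCD()
for i in range(5):
    d.write(i, f"blk{i}")   # blocks 0-3 trigger a log flush
d.destage()
print(d.read(0), d.read(4))  # blk0 blk4
```

Because every level holds durable copies once flushed, reads stay correct regardless of when destaging happens, which mirrors the reliability argument in the abstract.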
STiNG: A CC-NUMA Computer System for the Commercial Marketplace
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.233006
Thomas D. Lovett, R. Clapp
"STiNG" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four-processor Symmetric Multi-Processor (SMP) nodes (called Quads) using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to four P6 processors, each Quad may contain up to 4 GBytes of system memory, two Peripheral Component Interface (PCI) busses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures system-wide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed.
Citations: 275
Don't Use the Page Number, but a Pointer to It
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232985
André Seznec
Most newly announced high performance microprocessors support 64-bit virtual addresses, and the width of physical addresses is also growing. As a result, the size of the address tags in the L1 cache is increasing. The impact on on-chip area is particularly dramatic when small block sizes are used. At the same time, the performance of high performance microprocessors depends more and more on the accuracy of branch prediction, and for reasons similar to those in the case of caches, the size of the Branch Target Buffer is also increasing linearly with the address width. In this paper, we apply the simple principle stated in the title for limiting the tag size of on-chip caches. In the resulting indirect-tagged cache, the duplication of the page number in the processor (in the TLB and in cache tags) is removed. The tag check is then simplified, and the tag cost does not depend on the address width. Applying the same principle to Branch Target Buffers, we propose the Reduced Branch Target Buffer. The storage size in a Reduced Branch Target Buffer does not depend on the address width and is dramatically smaller than that of the conventional implementation of a Branch Target Buffer.
Citations: 25
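The indirect-tagging principle above can be illustrated with a minimal sketch: instead of storing a full physical page number in each cache tag, store the index of the TLB entry that holds it, so the tag width stops growing with the address width. The TLB size, field names, and layout below are assumptions for illustration.

```python
TLB_ENTRIES = 64  # a 6-bit pointer replaces a page number of ~50+ bits

class IndirectTaggedLine:
    """Cache line tag holding a TLB pointer rather than a page number."""

    def __init__(self, tlb_index, offset_tag):
        self.tlb_index = tlb_index    # pointer into the TLB
        self.offset_tag = offset_tag  # remaining in-page tag bits

def hit(line, access_tlb_index, access_offset_tag):
    # The hit check compares only the small TLB pointer and the
    # in-page bits: two addresses pointing at the same TLB entry lie
    # in the same page by construction, so no wide page-number
    # comparison is needed.
    return (line.tlb_index == access_tlb_index
            and line.offset_tag == access_offset_tag)

line = IndirectTaggedLine(tlb_index=5, offset_tag=0b1101)
print(hit(line, 5, 0b1101), hit(line, 6, 0b1101))  # True False
```

The same trade-off applies to the Reduced Branch Target Buffer: the comparison width is fixed by the TLB size, not by how wide virtual or physical addresses become.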
Understanding Application Performance on Shared Virtual Memory Systems
23rd Annual International Symposium on Computer Architecture (ISCA'96) · Pub Date: 1996-05-15 · DOI: 10.1145/232973.232987
L. Iftode, J. Singh, Kai Li
Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap by studying the performance of a range of applications in detail and understanding it in light of application characteristics.

We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC)---with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less optimized program implementations. We find that SVM can indeed perform quite well for systems of at least up to 32 processors for several nontrivial applications. However, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger. The hardware-assisted AURC system tends to perform significantly better than the all-software LRC under our system assumptions, particularly when realistic cache hierarchies are used.
Citations: 80