Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235): latest articles

Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1145/279358.279399

S. Keckler, W. Dally, D. Maskit, N. Carter, Andrew Chang, W. S. Lee

Abstract: Much of the improvement in computer performance over the last twenty years has come from faster transistors and architectural advances that increase parallelism. Historically, parallelism has been exploited either at the instruction level with a grain-size of a single instruction or by partitioning applications into coarse threads with grain-sizes of thousands of instructions. Fine-grain threads fill the parallelism gap between these extremes by enabling tasks with run lengths as small as 20 cycles. As this fine-grain parallelism is orthogonal to ILP and coarse threads, it complements both methods and provides an opportunity for greater speedup. This paper describes the efficient communication and synchronization mechanisms implemented in the Multi-ALU Processor (MAP) chip, including a thread creation instruction, register communication, and a hardware barrier. These register-based mechanisms provide 10 times faster communication and 60 times faster synchronization than mechanisms that operate via a shared on-chip cache. With a three-processor implementation of the MAP, fine-grain speedups of 1.2-2.1 are demonstrated on a suite of applications.

Citations: 88

The effect of instruction fetch bandwidth on value prediction

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1145/279358.278058

F. Gabbay, A. Mendelson

Citations: 64

Execution characteristics of desktop applications on Windows NT

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1145/279358.279366

Dennis Lee, P. Crowley, J. Baer, T. Anderson, B. Bershad

Citations: 142

Confidence estimation for speculation control

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1145/279358.279376

D. Grunwald, A. Klauser, Srilatha Manne, A. Pleszkun

Abstract: Modern processors improve instruction level parallelism by speculation. The outcome of data and control decisions is predicted, and the operations are speculatively executed and only committed if the original predictions were correct. There are a number of other ways that processor resources could be used, such as threading or eager execution. As the use of speculation increases, we believe more processors will need some form of speculation control to balance the benefits of speculation against other possible activities. Confidence estimation is one technique that can be exploited by architects for speculation control. In this paper, we introduce performance metrics to compare confidence estimation mechanisms, and argue that these metrics are appropriate for speculation control. We compare a number of confidence estimation mechanisms, focusing on mechanisms that have a small implementation cost and gain benefit by exploiting characteristics of branch predictors, such as clustering of mispredicted branches. We compare the performance of the different confidence estimation methods using detailed pipeline simulations. Using these simulations, we show how to improve some confidence estimators, providing better insight for future investigations comparing and applying confidence estimators.
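The abstract above mentions low-cost confidence estimators that exploit the clustering of mispredicted branches. As a rough illustration only (not the authors' design), the Python sketch below uses a single saturating counter that resets on every misprediction, so predictions made shortly after a misprediction are labeled low confidence. The class name, the threshold of 4, and the tally categories are all assumptions made for this example.

```python
class ResettingConfidenceEstimator:
    """Hypothetical estimator: counts correct predictions since the last miss."""

    def __init__(self, threshold=4, max_count=15):
        self.threshold = threshold  # assumed cut-off for "high confidence"
        self.max_count = max_count  # saturation limit of the counter
        self.count = 0

    def is_high_confidence(self):
        return self.count >= self.threshold

    def update(self, prediction_was_correct):
        if prediction_was_correct:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = 0  # mispredictions cluster, so distrust what follows


def evaluate(outcomes, threshold=4):
    """Tally confidence label against actual outcome.

    outcomes: iterable of booleans, True meaning the branch prediction was correct.
    """
    estimator = ResettingConfidenceEstimator(threshold)
    tally = {"high_correct": 0, "high_incorrect": 0,
             "low_correct": 0, "low_incorrect": 0}
    for correct in outcomes:
        label = "high" if estimator.is_high_confidence() else "low"
        tally[label + ("_correct" if correct else "_incorrect")] += 1
        estimator.update(correct)
    return tally


if __name__ == "__main__":
    trace = [True] * 6 + [False, False] + [True] * 3
    print(evaluate(trace))
```

The four tallies form a confusion matrix. Any metric for comparing estimators could be derived from such counts, though the specific metrics the paper proposes are not stated in the abstract.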

Citations: 169

Improving trace cache effectiveness with branch promotion and trace packing

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1109/ISCA.1998.694786

Sanjay J. Patel, M. Evers, Y. Patt

Abstract: The increasing widths of superscalar processors are placing greater demands upon the fetch mechanism. The trace cache meets these demands by placing logically contiguous instructions in physically contiguous storage. As a result, the trace cache delivers instructions at a high rate by supplying multiple fetch blocks each cycle. In this paper we examine two techniques to improve the number of instructions delivered each cycle by the trace cache. The first technique, branch promotion, dynamically converts strongly biased branches into branches with static predictions. Because these promoted branches require no dynamic prediction, the branch predictor suffers less from the negative effects of interference. Branch promotion unlocks the potential of the second technique: trace packing. With trace packing, trace segments are packed with as many instructions as will fit, without regard to naturally occurring fetch block boundaries. With both techniques, the effective fetch rate of the trace cache jumps up 17% over a trace cache which implements neither on a machine where the execution engine has a very aggressive memory disambiguator; the performance of a machine using branch promotion and trace packing is on average 11% higher than a machine using neither technique.
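Branch promotion, as described in the abstract above, turns strongly biased branches into statically predicted ones. A minimal Python sketch of one way such a mechanism might be modeled follows. The run-length threshold, the demote-on-first-contrary-outcome rule, and all names are assumptions for illustration; the abstract does not give the paper's actual promotion or demotion policy.

```python
class BranchPromoter:
    """Hypothetical model: promote a branch after a long same-direction run."""

    def __init__(self, threshold=8):
        self.threshold = threshold  # assumed run length needed for promotion
        self.runs = {}              # pc -> (last direction, current run length)
        self.promoted = {}          # pc -> static direction for promoted branches

    def observe(self, pc, taken):
        """Record one resolved outcome of the branch at address pc."""
        last, run = self.runs.get(pc, (taken, 0))
        run = run + 1 if taken == last else 1
        self.runs[pc] = (taken, run)
        if pc in self.promoted:
            if self.promoted[pc] != taken:
                del self.promoted[pc]     # assumed policy: demote on any fault
        elif run >= self.threshold:
            self.promoted[pc] = taken     # branch now carries a static prediction

    def static_prediction(self, pc):
        """Return the static direction, or None if the branch needs the dynamic predictor."""
        return self.promoted.get(pc)


if __name__ == "__main__":
    promoter = BranchPromoter(threshold=3)
    for _ in range(3):
        promoter.observe(0x40, True)
    print(promoter.static_prediction(0x40))
```

Branches for which `static_prediction` returns a direction would not consult the dynamic predictor, which is the source of the reduced interference the abstract mentions.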

Citations: 57

Multi-level texture caching for 3D graphics hardware

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1109/ISCA.1998.694765

M. Cox, Narendra Bhandari, M. Shantz

Abstract: Traditional graphics hardware architectures implement what we call the push architecture for texture mapping. Local memory is dedicated to the accelerator for fast local retrieval of texture during rasterization, and the application is responsible for managing this memory. The push architecture has a bandwidth advantage, but disadvantages of limited texture capacity, escalation of accelerator memory requirements (and therefore cost), and poor memory utilization. The push architecture also requires the programmer to solve the bin-packing problem of managing accelerator memory each frame. More recently, graphics hardware on PC-class machines has moved to an implementation of what we call the pull architecture. Texture is stored in system memory and downloaded by the accelerator as needed. The pull architecture has advantages of texture capacity, stems the escalation of accelerator memory requirements, and has good memory utilization. It also frees the programmer from accelerator texture memory management. However, the pull architecture suffers escalating requirements for bandwidth from main memory to the accelerator. In this paper we propose multi-level texture caching to provide the accelerator with the bandwidth advantages of the push architecture combined with the capacity advantages of the pull architecture. We have studied the feasibility of 2-level caching and found the following: (1) significant re-use of texture between frames; (2) L2 caching requires significantly less memory than the push architecture; (3) L2 caching requires significantly less bandwidth from host memory than the pull architecture; (4) L2 caching enables implementation of smaller L1 caches that would otherwise bandwidth-limit accelerators on the workloads in this paper. Results suggest that an L2 cache achieves the original advantage of the pull architecture (stemming the growth of local texture memory) while at the same time stemming the current explosion in demand for texture bandwidth between host memory and the accelerator.
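To make the two-level idea in the abstract above concrete, here is a small Python sketch of a toy simulation, assuming LRU replacement at both levels and texture traffic expressed as lists of block identifiers per frame. The replacement policy, the block granularity, and every name here are illustrative assumptions; the abstract does not specify them.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache of texture block ids with LRU replacement (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, block_id):
        """Return True on a hit; on a miss, install the block and evict the LRU one."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return True
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return False


def render(frames, l1_blocks, l2_blocks):
    """Count where each texture block access is served from.

    frames: list of frames, each a list of texture block ids touched in order.
    An L2 capacity of 0 models a pull architecture with only a small L1.
    """
    l1, l2 = LRUCache(l1_blocks), LRUCache(l2_blocks)
    stats = {"l1_hits": 0, "l2_hits": 0, "host_fetches": 0}
    for frame in frames:
        for block_id in frame:
            if l1.access(block_id):
                stats["l1_hits"] += 1
            elif l2.access(block_id):
                stats["l2_hits"] += 1
            else:
                stats["host_fetches"] += 1  # traffic from host memory
    return stats


if __name__ == "__main__":
    frame = list(range(8))  # two identical frames: full inter-frame re-use
    print(render([frame, frame], l1_blocks=2, l2_blocks=16))
    print(render([frame, frame], l1_blocks=2, l2_blocks=0))
```

In this toy setup the second frame is served entirely from L2 when one is present, whereas without L2 every access goes back to host memory. This mirrors, in miniature, the inter-frame re-use and host-bandwidth findings listed in the abstract.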

Citations: 57

Selective eager execution on the PolyPath architecture

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1109/ISCA.1998.694785

A. Klauser, A. Paithankar, D. Grunwald

Citations: 92

Performance characterization of a quad Pentium Pro SMP using OLTP workloads

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1109/ISCA.1998.694759

K. Keeton, D. Patterson, Yongchao He, R. Raphael, W. Baker

Abstract: Commercial applications are an important, yet often overlooked, workload with significantly different characteristics from technical workloads. The potential impact of these differences is that computers optimized for technical workloads may not provide good performance for commercial applications, and these applications may not fully exploit advances in processor design. To evaluate these issues, we use hardware counters to measure architectural features of a four-processor Pentium Pro-based server running a TPC-C-like workload on an Informix database. We examine the effectiveness of out-of-order execution, branch prediction, speculative execution, superscalar issue and retire, caching and multiprocessor scaling. We find that out-of-order execution, superscalar issue and retire, and branch prediction are not as effective for database workloads as they are for technical workloads, such as SPEC. We find that caches are effective at reducing processor traffic to memory; even larger caches would be helpful to satisfy more data requests. Multiprocessor scaling of this workload is good, but even modest bus utilization degrades application memory latency, limiting database throughput.

Citations: 278

Threaded multiple path execution

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1109/ISCA.1998.694778

S. Wallace, B. Calder, D. Tullsen

Citations: 155

Options for dynamic address translation in COMAs

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235) Pub Date : 1998-04-16 DOI: 10.1145/279358.279390

Xiaogang Qiu, M. Dubois

Abstract: In modern processors, the dynamic translation of virtual addresses to support virtual memory is done before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB (Translation Lookaside Buffer) are getting more and more difficult to meet. The situation is worse in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We evaluate and compare five options for virtual address translation in the context of COMAs (Cache Only Memory Architectures). The dynamic address translation mechanism can be located after the cache access provided the cache is virtual. In a particular design, which we call V-COMA for Virtual COMA, the physical address concept and the traditional TLB are eliminated. While still supporting virtual memory, V-COMA reduces the address translation overhead to a minimum. V-COMA scales well and works better in systems with a large number of processors. As a machine running on virtual addresses, V-COMA provides a simple and consistent hardware model to the operating system and the compiler, in which further optimization opportunities are possible.

Citations: 24