The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.最新文献_第3页

Incorporating predicate information into branch predictors 将谓词信息合并到分支预测器中

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183524

B. Simon, B. Calder, J. Ferrante

引用次数: 16

Scalar operand networks: on-chip interconnect for ILP in partitioned architectures 标量操作数网络:分区体系结构中ILP的片上互连

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183551

M. Taylor, Walter Lee, Saman P. Amarasinghe, A. Agarwal

{"title":"Scalar operand networks: on-chip interconnect for ILP in partitioned architectures","authors":"M. Taylor, Walter Lee, Saman P. Amarasinghe, A. Agarwal","doi":"10.1109/HPCA.2003.1183551","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183551","url":null,"abstract":"The bypass paths and multiported register files in microprocessors serve as an implicit interconnect to communicate operand values among pipeline stages and multiple ALU. Previous superscalar designs implemented this interconnect using centralized structures that do not scale with increasing ILP demands. In search of scalability, recent microprocessor designs in industry and academia exhibit a trend towards distributed resources such as partitioned register files, banked caches, multiple independent compute pipelines, and even multiple program counters. Some of these partitioned microprocessor designs have begun to implement bypassing and operand transport using point-to-point interconnects rather than centralized networks. We call interconnects optimized for scalar data transport, whether centralized or distributed, scalar operand networks. Although these networks share many of the challenges of multiprocessor networks such as scalability and deadlock avoidance, they have many unique requirements, including ultra-low latencies (a few cycles versus tens of cycles) and ultra-fast operation-operand matching. This paper discusses the unique properties of scalar operand networks, examines alternative ways of implementing them, and describes in detail the implementation of one such network in the Raw microprocessor. The paper analyzes the performance of these networks for ILP workloads and the sensitivity of overall ILP performance to network properties.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128000870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 164

Billion transistor chips in mainstream enterprise platforms of the future 未来主流企业平台的十亿晶体管芯片

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183519

D. Bhandarkar

{"title":"Billion transistor chips in mainstream enterprise platforms of the future","authors":"D. Bhandarkar","doi":"10.1109/HPCA.2003.1183519","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183519","url":null,"abstract":"Today’s leading edge microprocessors like the Intel’s Itanium ® 2 Processor feature over 220 million transistors in 0.18µm semiconductor process technology. Nanotechnology that continues to drive Moore’s Law provides a doubling of the transistor density every two years. This indicates that a Billion transistor chip is possible in the 65 nm technology within the next 3 to 4 years. Such chips can be used in mainstream enterprise server platforms. This talk will review the progress in semiconductor technology over the last 3 decades since the introduction of the first microprocessor in 1971. A short video tape will provide a historical perspective on Moore’s Law in the form of an interview with co-founder Gordon Moore, and his thoughts for the future of semiconductor technology. Key trends in high end microprocessor design including multi-threading and multi-core will be covered. We have started to see “SMP-on-a-chip” designs for high-end enterprise servers where two processors with Level 2 (L2) cache are incorporated on a single chip. Future microprocessors will offer higher levels of multiprocessor capability on chip as the transistor density increases.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128274379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Inter-cluster communication models for clustered VLIW processors 集群VLIW处理器的集群间通信模型

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183552

A. Terechko, Erwan Le Thenaff, Manish Garg, J. V. Eijndhoven, H. Corporaal

引用次数: 45

A methodology for designing efficient on-chip interconnects on well-behaved communication patterns 一种在良好的通信模式下设计高效片上互连的方法

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183554

W. Ho, T. Pinkston

{"title":"A methodology for designing efficient on-chip interconnects on well-behaved communication patterns","authors":"W. Ho, T. Pinkston","doi":"10.1109/HPCA.2003.1183554","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183554","url":null,"abstract":"As the level of chip integration continues to advance at a fast pace, the desire for efficient interconnects - whether on-chip or off-chip - is rapidly increasing. Traditional interconnects like buses, point-to-point wires and regular topologies may suffer from poor resource sharing in the time and space domains, leading to high contention or low resource utilization. In this paper, we propose a design methodology for constructing networks for special-purpose computer systems with well-behaved (known) communication characteristics. A temporal and spatial model is proposed to define the sufficient condition for contention-free communication. Based upon this model, a design methodology using a recursive bisection technique is applied to systematically partition a parallel system such that the required number of links and switches is minimized while achieving low contention. Results show that the design methodology can generate more optimized on-chip networks with up to 60% fewer resources than meshes or tori while providing blocking performance closer to that of a fully connected crossbar.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123729617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 110

TCP: tag correlating prefetchers TCP:标签关联预取器

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183549

Zhigang Hu, M. Martonosi, S. Kaxiras

{"title":"TCP: tag correlating prefetchers","authors":"Zhigang Hu, M. Martonosi, S. Kaxiras","doi":"10.1109/HPCA.2003.1183549","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183549","url":null,"abstract":"Although caches for decades have been the backbone of the memory system, the speed gap between CPU and main memory suggests their augmentation with prefetching mechanisms. Recently, sophisticated hardware correlating prefetching mechanisms have been proposed, in some cases coupled with some form of dead-block prediction. In many proposals, however correlating prefetchers demand a significant investment in hardware. In this paper we show that correlating prefetchers that work with tags instead of cache-line addresses are significantly more resource-efficient, providing equal or better performance than previous proposals. We support this claim by showing that per-set tag sequences exhibit highly repetitive patterns both within a set and across different sets. Because a single tag sequence can capture multiple address sequences spread over different cache sets, significant space savings can be achieved. We propose a tag-based prefetcher called a tag correlating prefetcher (TCP). Even with very small history tables, TCP outperforms address-based correlating prefetchers many times larger. In addition, we show that such a prefetcher can yield most of its performance benefits if placed at the L2 level of an aggressive out-of-order processor. Only if one wants prefetching all the way up to L1, is dead-block prediction required. Finally, we draw parallels between the two-level structure of TCP and similar structures for branch prediction mechanisms; these parallels raise interesting opportunities for improving correlating memory prefetchers by harnessing lessons already learned for correlating branch predictors.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128198928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 97

Runahead execution: an alternative to very large instruction windows for out-of-order processors 超前执行:无序处理器非常大的指令窗口的替代方案

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183532

O. Mutlu, J. Stark, C. Wilkerson, Y. Patt

{"title":"Runahead execution: an alternative to very large instruction windows for out-of-order processors","authors":"O. Mutlu, J. Stark, C. Wilkerson, Y. Patt","doi":"10.1109/HPCA.2003.1183532","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183532","url":null,"abstract":"Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel/spl reg/ Pentium/spl reg/ processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130823120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 452

Slipstream execution mode for CMP-based multiprocessors 基于cmp的多处理器的滑流执行模式

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183536

K. Ibrahim, G. Byrd, E. Rotenberg

{"title":"Slipstream execution mode for CMP-based multiprocessors","authors":"K. Ibrahim, G. Byrd, E. Rotenberg","doi":"10.1109/HPCA.2003.1183536","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183536","url":null,"abstract":"Scalability of applications on distributed shared-memory (DSM) multiprocessors is limited by communication overheads. At some point, using more processors to increase parallelism yields diminishing returns or even degrades performance. When increasing concurrency is futile, we propose an additional mode of execution, called slipstream mode, that instead enlists extra processors to assist parallel tasks by reducing perceived overheads. We consider DSM multiprocessors built from dual-processor chip multiprocessor (CMP) nodes with shared L2 cache. A task is allocated on one processor of each CMP node. The other processor of each node executes a reduced version of the same task. The reduced version skips shared-memory stores and synchronization, running ahead of the true task. Even with the skipped operations, the reduced task makes accurate forward progress and generates an accurate reference stream, because branches and addresses depend primarily on private data. Slipstream execution mode yields two benefits. First, the reduced task prefetches data on behalf of the true task. Second, reduced tasks provide a detailed picture of future reference behavior, enabling a number of optimizations aimed at accelerating coherence events, e.g., self-invalidation. For multiprocessor systems with up to 16 CMP nodes, slipstream mode outperforms running one or two conventional tasks per CMP in 7 out of 9 parallel scientific benchmarks. Slipstream mode is 12-19% faster with prefetching only and up to 29% faster with self-invalidation enabled.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131330806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

Active I/O switches in system area networks 系统局域网络中的主动I/O交换机

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183553

M. Hao, Mark A. Heinrich

引用次数: 4

Reconsidering complex branch predictors 重新考虑复杂的分支预测器

The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. Pub Date : 2003-02-08 DOI: 10.1109/HPCA.2003.1183523

Daniel A. Jiménez

{"title":"Reconsidering complex branch predictors","authors":"Daniel A. Jiménez","doi":"10.1109/HPCA.2003.1183523","DOIUrl":"https://doi.org/10.1109/HPCA.2003.1183523","url":null,"abstract":"To sustain instruction throughput rates in more aggressively clocked microarchitectures, microarchitects have incorporated larger and more complex branch predictors into their designs, taking advantage of the increasing numbers of transistors available on a chip. Unfortunately, because of penalties associated with their implementations, the extra accuracy provided by many branch predictors does not produce a proportionate increase in performance. Specifically, we show that the techniques used to hide the latency of a large and complex branch predictor do not scale well and will be unable to sustain IPC for deeper pipelines. We investigate a different way to build large branch predictors. We propose an alternative predictor design that completely hides predictor latency so that accuracy and hardware budget are the only factors that affect the efficiency of the predictor. Our simple design allows the predictor to be pipelined efficiently by avoiding difficulties introduced by complex predictors. Because this predictor eliminates the penalties associated with complex predictors, overall performance exceeds that of even the most accurate known branch predictors in the literature at large hardware budgets. We conclude that as chip densities increase in the next several years, the accuracy of complex branch predictors must be weighed against the performance benefits of simple branch predictors.","PeriodicalId":150992,"journal":{"name":"The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130730396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 75