2007 IEEE 13th International Symposium on High Performance Computer Architecture: Latest Publications

Researching Novel Systems: To Instantiate, Emulate, Simulate, or Analyticate?
Pub Date: 2007-08-10 | DOI: 10.1109/HPCA.2007.346203
Authors: D. Burger, J. Emer, Phil Emma, S. Keckler, Y. Patt, D. Patterson
Abstract: The computer architecture research community has a rich menu of methodological options, which includes building full system prototypes, measuring in simulation, emulating on FPGAs, or constructing sophisticated analytic models. However, building custom systems has become enormously expensive, especially given the current funding climate. Simulations have become enormously complex as well, often including full operating systems. Analytic models have become less popular as system complexity has grown. Finally, some argue that FPGA emulation of hardware is the right approach for the future, while others opine that it is the worst of all worlds. This panel will debate these various points of view, which are of great interest to the funding sponsors of our community.
Citations: 0
Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346197
Authors: Kiran Puttaswamy, G. Loh
Abstract: 3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of thermal herding techniques that (1) reduce 3D power density and (2) locate a majority of the power on the top die closest to the heat sink. Our 3D/thermal-aware microarchitecture contributions include a significance-partitioned datapath that places the frequently switching low-order 16 bits on the top die, a 3D-aware instruction scheduler allocation scheme, an address memorization approach for the load and store queues, a partial value encoding for the L1 data cache, and a branch target buffer that exploits a form of frequent partial value locality in target addresses. Compared to a conventional planar processor, our 3D processor achieves a 47.9% frequency increase, which results in a 47.0% performance improvement (min 7%, max 77% on individual benchmarks), while simultaneously reducing total power by 20% (min 15%, max 30%). Without our thermal herding techniques, the worst-case 3D temperature increases by 17 degrees. With our thermal herding techniques, the temperature increase is only 12 degrees (a 29% reduction in the worst-case 3D temperature increase).
Citations: 166
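The significance-partitioned datapath above rests on the observation that many 64-bit operands are narrow: their upper 48 bits are just the sign extension of bit 15, so only the low 16 bits switch frequently. A minimal illustration of that narrowness test (the 16-bit split point comes from the abstract; the function itself is an illustration, not the paper's hardware):

```python
def upper48_is_sign_extension(v: int) -> bool:
    """Return True if a 64-bit value is 'narrow', i.e. its upper 48 bits
    merely sign-extend bit 15, so only the low 16 bits carry information."""
    v &= (1 << 64) - 1                  # treat input as a 64-bit bit pattern
    upper = v >> 16                     # the 48 high-order bits
    sign = (v >> 15) & 1                # sign bit of the low 16-bit slice
    return upper == (0 if sign == 0 else (1 << 48) - 1)

print(upper48_is_sign_extension(42))         # True  (small positive value)
print(upper48_is_sign_extension(2**64 - 5))  # True  (-5 as a 64-bit pattern)
print(upper48_is_sign_extension(1 << 20))    # False (wide value)
```

For narrow operands, a 3D datapath like the one described only needs to activate the die holding the low 16 bits, which is what moves most of the switching power toward the heat sink.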
A Memory-Level Parallelism Aware Fetch Policy for SMT Processors
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346201
Authors: Stijn Eyerman, L. Eeckhout
Abstract: A thread executing on a simultaneous multithreading (SMT) processor that experiences a long-latency load will eventually stall while holding execution resources. Existing long-latency-load-aware SMT fetch policies limit the amount of resources allocated by a stalled thread by identifying long-latency loads and preventing the given thread from fetching more instructions; in some implementations, instructions beyond the long-latency load may even be flushed, which frees allocated resources. This paper proposes an SMT fetch policy that takes into account the available memory-level parallelism (MLP) in a thread. The key idea proposed in this paper is that in case of an isolated long-latency load, i.e., there is no MLP, the thread should be prevented from allocating additional resources. However, in case multiple independent long-latency loads overlap, i.e., there is MLP, the thread should allocate as many resources as needed in order to fully expose the available MLP. The proposed MLP-aware fetch policy achieves better performance for MLP-intensive threads on an SMT processor and achieves a better overall balance between performance and fairness than previously proposed fetch policies.
Citations: 61
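The key decision rule in the abstract can be caricatured in a few lines: stall fetch for a thread only when its long-latency miss is isolated, and keep fetching when several independent misses overlap so the available MLP is exposed. The function name and action strings below are illustrative, not the paper's implementation:

```python
def fetch_action(overlapping_long_latency_loads: int) -> str:
    """Fetch decision for one SMT thread, given the number of independent
    long-latency (off-chip) loads it currently has in flight."""
    if overlapping_long_latency_loads == 0:
        return "fetch"   # no long-latency load: fetch normally
    if overlapping_long_latency_loads == 1:
        return "stall"   # isolated miss, no MLP: stop allocating resources
    return "fetch"       # overlapping misses: keep allocating to expose MLP

print(fetch_action(0))  # fetch
print(fetch_action(1))  # stall
print(fetch_action(3))  # fetch
```

The contrast with earlier policies is in the middle case: a purely long-latency-load-aware policy would stall (or flush) on any miss, sacrificing the memory-level parallelism that overlapping misses could have provided.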
Interactions Between Compression and Prefetching in Chip Multiprocessors
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346200
Authors: Alaa R. Alameldeen, D. Wood
Abstract: In chip multiprocessors (CMPs), multiple cores compete for shared resources such as on-chip caches and off-chip pin bandwidth. Stride-based hardware prefetching increases demand for these resources, causing contention that can degrade performance (up to 35% for one of our benchmarks). In this paper, we first show that cache and link (off-chip interconnect) compression can increase the effective cache capacity (thereby reducing off-chip misses) and increase the effective off-chip bandwidth (reducing contention). On an 8-processor CMP with no prefetching, compression improves performance by up to 18% for commercial workloads. Second, we propose a simple adaptive prefetching mechanism that uses cache compression's extra tags to detect useless and harmful prefetches. Furthermore, in the central result of this paper, we show that compression and prefetching interact in a strong positive way, resulting in a combined performance improvement of 10-51% for seven of our eight workloads.
Citations: 74
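As a rough sketch of the adaptive mechanism described above: the extra tags that a compressed cache can retain make it possible to notice that a prefetched block was evicted before it was ever used, and to throttle the prefetcher when such useless prefetches dominate. The counters and the 0.5 threshold below are invented for illustration only:

```python
class PrefetchMonitor:
    """Toy accuracy tracker for an adaptive prefetcher. In the paper's
    design the 'evicted unused' events are detected via the spare tags a
    compressed cache keeps; here they are just reported by the caller."""

    def __init__(self):
        self.useful = 0    # prefetched blocks that saw a demand hit
        self.useless = 0   # prefetched blocks evicted without any use

    def demand_hit_on_prefetch(self):
        self.useful += 1

    def evicted_unused_prefetch(self):
        self.useless += 1

    def should_prefetch(self) -> bool:
        total = self.useful + self.useless
        return total == 0 or self.useful / total >= 0.5  # throttle if mostly useless

mon = PrefetchMonitor()
mon.demand_hit_on_prefetch()
mon.evicted_unused_prefetch()
mon.evicted_unused_prefetch()
print(mon.should_prefetch())  # False (only 1 of 3 prefetches was useful)
```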
Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346190
Authors: B. Ganesh, A. Jaleel, David T. Wang, B. Jacob
Abstract: Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the fully-buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx-based memory controller policies for scheduling and row buffer management perform on a fully-buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization resulted in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes the system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that the FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.
Citations: 88
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346210
Authors: Liqun Cheng, J. Carter, Donglai Dai
Abstract: Shared memory multiprocessors play an increasingly important role in enterprise and scientific computing facilities. Remote misses limit the performance of shared memory applications, and their significance is growing as network latency increases relative to processor speeds. This paper proposes two mechanisms that improve shared memory performance by eliminating remote misses and/or reducing the amount of communication required to maintain coherence. We focus on improving the performance of applications that exhibit producer-consumer sharing. We first present a simple hardware mechanism for detecting producer-consumer sharing. We then describe a directory delegation mechanism whereby the "home node" of a cache line can be delegated to a producer node, thereby converting 3-hop coherence operations into 2-hop operations. We then extend the delegation mechanism to support speculative updates for data accessed in a producer-consumer pattern, which can convert 2-hop misses into local misses, thereby eliminating the remote memory latency. Both mechanisms can be implemented without changes to the processor. We evaluate our directory delegation and speculative update mechanisms on seven benchmark programs that exhibit producer-consumer sharing, using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor. We find that the mechanisms proposed in this paper reduce the average remote miss rate by 40%, reduce network traffic by 15%, and improve performance by 21%. Finally, we use Murphi to verify that each mechanism is error-free and does not violate sequential consistency.
Citations: 62
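The first mechanism, detecting producer-consumer sharing, can be sketched as a directory-side pattern detector: a cache line whose writes keep coming from one node while reads come from other nodes scores as producer-consumer, making it a candidate for delegation. The class, counter, and threshold below are illustrative, not the paper's hardware:

```python
class LineState:
    """Toy per-cache-line detector of a producer-consumer access pattern,
    as a directory might track it. Threshold and scoring are invented."""

    def __init__(self, threshold: int = 2):
        self.producer = None      # node currently acting as the writer
        self.score = 0            # confidence in the pattern
        self.threshold = threshold

    def access(self, node: int, is_write: bool):
        if is_write:
            if node == self.producer:
                self.score += 1   # same writer again: pattern strengthens
            else:
                self.producer, self.score = node, 0  # writer changed: reset
        elif self.producer is not None and node != self.producer:
            self.score += 1       # another node consumed the producer's data

    def is_producer_consumer(self) -> bool:
        return self.score >= self.threshold

line = LineState()
for _ in range(3):
    line.access(0, True)    # node 0 writes (produces)
    line.access(1, False)   # node 1 reads (consumes)
print(line.is_producer_consumer())  # True
```

Once a line is flagged this way, the protocol described above can delegate the line's home to the producer, turning 3-hop read misses by consumers into 2-hop operations.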
Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346182
Authors: Hongtao Zhong, Steven A. Lieberman, S. Mahlke
Abstract: Chip multiprocessors with multiple simpler cores are gaining popularity because they have the potential to drive future performance gains without exacerbating the problems of power dissipation and complexity. Current chip multiprocessors increase throughput by utilizing multiple cores to perform computation in parallel. These designs provide real benefits for server-class applications that are explicitly multi-threaded. However, for desktop and other systems where single-thread applications dominate, multicore systems have yet to offer much benefit. Chip multiprocessors are most efficient at executing coarse-grain threads that have little communication. However, general-purpose applications do not provide many opportunities for identifying such threads, due to frequent use of pointers, recursive data structures, if-then-else branches, small function bodies, and loops with small trip counts. To attack this mismatch, this paper proposes a multicore architecture, referred to as Voltron, that extends traditional multicore systems in two ways. First, it provides a dual-mode scalar operand network to enable efficient inter-core communication and lightweight synchronization. Second, Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, the cores execute multiple instruction streams in lock-step to collectively function as a wide-issue VLIW. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. This paper describes the Voltron architecture and associated compiler support for orchestrating bi-modal execution.
Citations: 117
Evaluating MapReduce for Multi-core and Multiprocessor Systems
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346181
Authors: Colby Ranger, R. Raghuraman, Arun Penmetsa, G. Bradski, C. Kozyrakis
Abstract: This paper evaluates the suitability of the MapReduce model for multi-core and multi-processor systems. MapReduce was created by Google for application development on data-centers with thousands of servers. It allows programmers to write functional-style code that is automatically parallelized and scheduled in a distributed system. We describe Phoenix, an implementation of MapReduce for shared-memory systems that includes a programming API and an efficient runtime system. The Phoenix runtime automatically manages thread creation, dynamic task scheduling, data partitioning, and fault tolerance across processor nodes. We study Phoenix with multi-core and symmetric multiprocessor systems and evaluate its performance potential and error recovery features. We also compare MapReduce code to code written in lower-level APIs such as Pthreads. Overall, we establish that, given a careful implementation, MapReduce is a promising model for scalable performance on shared-memory systems with simple parallel code.
Citations: 1092
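Phoenix's actual API is C-based and not reproduced here, but the MapReduce model it implements is easy to sketch: a user-supplied map function emits key-value pairs, the runtime groups them by key, and a user-supplied reduce function folds each group. A toy sequential version in Python (Phoenix itself partitions the data and schedules the map and reduce tasks across threads):

```python
from collections import defaultdict

def map_phase(chunks, map_fn):
    """Apply map_fn to each input chunk; collect all emitted (key, value) pairs."""
    return [pair for chunk in chunks for pair in map_fn(chunk)]

def reduce_phase(pairs, reduce_fn):
    """Group values by key, then fold each group with reduce_fn."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Classic word-count example.
docs = ["the cat sat", "the dog sat"]
counts = reduce_phase(
    map_phase(docs, lambda doc: [(word, 1) for word in doc.split()]),
    lambda word, ones: sum(ones),
)
print(counts["the"])  # 2
```

The appeal the paper evaluates is exactly this: the programmer writes only the two lambdas, while parallelization, partitioning, and fault tolerance live in the runtime.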
Modeling and Managing Thermal Profiles of Rack-mounted Servers with ThermoStat
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346198
Authors: Jeonghwan Choi, Youngjae Kim, A. Sivasubramaniam, J. Srebric, Qian Wang, Joonwon Lee
Abstract: High power densities and the implications of high operating temperatures on the failure rates of components are key driving factors of temperature-aware computing. Computer architects and system software designers need to understand the thermal consequences of their proposals, and develop techniques to lower operating temperatures to reduce both transient and permanent component failures. Tools for understanding temperature ramifications of designs have been mainly restricted to industry for studying packaging and cooling mechanisms, with little access to such toolsets for academic researchers. Developing such tools is an arduous task since it usually requires cross-cutting areas of expertise spanning architecture, systems software, thermodynamics, and cooling systems. Recognizing the need for such tools, there has been work on modeling temperatures of processors at the micro-architectural level which can be easily understood and employed by computer architects for processor designs. However, there is a dearth of such tools in the academic/research community for undertaking architectural/systems studies beyond a processor: a server box, rack or even a machine room. This paper presents a detailed 3-dimensional computational fluid dynamics based thermal modeling tool, called ThermoStat, for rack-mounted server systems. Using this tool, we model a 20-node rack-mounted server system (each node with dual Xeon processors), and validate it with over 30 temperature sensor measurements at different points in the servers/rack. We conduct several experiments with this tool to show how different load conditions affect the thermal profile, and also illustrate how this tool can help design dynamic thermal management techniques.
Citations: 59
Optical Interconnect Opportunities for Future Server Memory Systems
Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346184
Authors: Y. Katayama, A. Okazaki
Abstract: This paper deals with alternative server memory architecture options in multicore CPU generations using optically-attached memory systems. Thanks to its large bandwidth-distance product, optical interconnect technology enables CPUs and local memory to be placed meters away from each other without sacrificing bandwidth. This topologically-local but physically-remote main memory, attached via an ultra-high-bandwidth parallel optical interconnect, can lead to flexible memory architecture options using low-cost commodity memory technologies.
Citations: 21