2008 International Symposium on Computer Architecture: Latest Papers

Achieving Out-of-Order Performance with Almost In-Order Complexity

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382169

F. Tseng, Y. Patt

Citations: 39

Microcoded Architectures for Ion-Tap Quantum Computers

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382136

Lucas Kreger-Stickles, M. Oskin

Citations: 20

iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382142

Avinash Karanth Kodi, Ashwini Sarathy, A. Louri

Abstract: Network-on-Chip (NoC) architectures have been adopted by a growing number of multi-core designs as a flexible and scalable solution to the increasing wire delay constraints in the deep sub-micron regime. However, the shrinking feature size limits the performance of NoCs due to power and area constraints. Research into the optimization of NoCs has shown that a reduction in the number of buffers in the NoC routers reduces the power and area overhead but degrades the network performance. In this paper, we propose iDEAL, a low-power, area-efficient NoC architecture that reduces the number of buffers within the router. To overcome the performance degradation caused by the reduced buffer size, we propose to use adaptive dual-function links capable of data transmission as well as data storage when required. Simulation results for the proposed architecture show that reducing the router buffer size by half and using the adaptive dual-function links achieves nearly 40% savings in buffer power, 30% savings in overall network power and about 41% savings in the router area, with only a marginal 1-3% drop in performance. Moreover, the performance in iDEAL can be further improved by aggressive and speculative flow control techniques.

Citations: 81
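To make the dual-function link idea more concrete, here is a toy Python model. It is hypothetical and is not the paper's router or flow-control design. It assumes a router input buffer of fixed depth whose upstream link has repeater stages that can each park one flit when the buffer is full. It keeps flits in FIFO order by sending new arrivals into the link while any flit is already parked there. The class name `DualFunctionPort` and its methods are illustrative only.

```python
from collections import deque


class DualFunctionPort:
    """Toy model (hypothetical): a shallow router input buffer backed by link
    stages that switch from transmission to storage when the buffer is full."""

    def __init__(self, buffer_depth: int, link_stages: int):
        self.buffer = deque()            # flits held in the router buffer
        self.buffer_depth = buffer_depth
        self.link = deque()              # flits parked in the link (storage mode)
        self.link_stages = link_stages

    def accept(self, flit) -> bool:
        """Accept an incoming flit; return False if it must stall upstream."""
        # Only fill the buffer directly when nothing is parked in the link,
        # otherwise a newer flit would overtake an older one.
        if len(self.buffer) < self.buffer_depth and not self.link:
            self.buffer.append(flit)
            return True
        if len(self.link) < self.link_stages:
            self.link.append(flit)       # use the link as extra storage
            return True
        return False

    def drain(self):
        """Forward one flit downstream and refill the buffer from the link."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        if self.link:
            self.buffer.append(self.link.popleft())
        return flit
```

With `link_stages=0` the same class behaves like a conventional buffer-only port, so a half-depth buffer plus link storage can be compared against a full-depth buffer under the same traffic.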

Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382152

R. Teodorescu, J. Torrellas

Citations: 341

VEAL: Virtualized Execution Accelerator for Loops

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382155

Nathan Clark, Amir Hormati, S. Mahlke

Abstract: Performance improvement solely through transistor scaling is becoming more and more difficult; thus it is increasingly common to see domain-specific accelerators used in conjunction with general-purpose processors to achieve future performance goals. There is a serious drawback to accelerators, though: binary compatibility. An application compiled to utilize an accelerator cannot run on a processor without that accelerator, and applications that do not utilize an accelerator will never use it. To overcome this problem, we propose decoupling the instruction set architecture from the underlying accelerators. Computation to be accelerated is expressed using a processor's baseline instruction set, and light-weight dynamic translation maps the representation to whatever accelerators are available in the system. In this paper, we describe the changes to a compilation framework and processor system needed to support this abstraction for an important set of accelerator designs that support innermost loops. In this analysis, we investigate the dynamic overheads associated with abstraction as well as the static/dynamic tradeoffs to improve the dynamic mapping of loop-nests. As part of the exploration, we also provide a quantitative analysis of the hardware characteristics of effective loop accelerators. We conclude that using a hybrid static-dynamic compilation approach to map computation onto loop-level accelerators is a practical way to increase computation efficiency, without the overheads associated with instruction set modification.

Citations: 110
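In the VEAL abstract, the binary carries only baseline instructions and a run-time translator decides where each loop runs. The sketch below is a deliberately simplified, hypothetical mapper built on that idea. It summarizes a loop by operation counts and estimates a resource-constrained minimum initiation interval (ResMII, as used in modulo scheduling) for each available accelerator. When no accelerator can host the loop, it falls back to the baseline core. None of these names or resource fields come from the paper.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoopBody:
    """Hypothetical summary of an innermost loop written in the baseline ISA."""
    int_ops: int
    fp_ops: int
    mem_streams: int


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical resource description of one loop accelerator in the system."""
    name: str
    int_units: int
    fp_units: int
    mem_streams: int


def res_mii(loop: LoopBody, acc: Accelerator) -> Optional[int]:
    """Resource-constrained minimum initiation interval, or None if acc cannot run the loop."""
    if loop.mem_streams > acc.mem_streams:
        return None
    ii = 1
    for ops, units in ((loop.int_ops, acc.int_units), (loop.fp_ops, acc.fp_units)):
        if ops == 0:
            continue
        if units == 0:
            return None  # a required functional-unit type is missing
        ii = max(ii, math.ceil(ops / units))
    return ii


def map_loop(loop: LoopBody, available: list[Accelerator]) -> str:
    """Pick the accelerator with the smallest ResMII; otherwise stay on the baseline core."""
    best: Optional[tuple[str, int]] = None
    for acc in available:
        ii = res_mii(loop, acc)
        if ii is not None and (best is None or ii < best[1]):
            best = (acc.name, ii)
    return best[0] if best is not None else "baseline"
```

The same loop description is handed to whatever accelerators are present, so a system with none simply returns `"baseline"`, and the program binary never changes. A real translator would also have to account for loop-carried recurrences (RecMII), register pressure, and translation cost, all of which this sketch ignores.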

Online Estimation of Architectural Vulnerability Factor for Soft Errors

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382150

Xiaodong Li, S. Adve, P. Bose, J. Rivers

Abstract: As CMOS technology scales and more transistors are packed onto the same chip, soft error reliability has become an increasingly important design issue for processors. Prior research has shown that there is significant architecture-level masking, and many soft error solutions take advantage of this effect. Prior work has also shown that the degree of such masking can vary significantly across workloads and between individual workload phases, motivating dynamic adaptation of reliability solutions for optimal cost and benefit. For such adaptation, it is important to be able to accurately estimate the amount of masking, or the architectural vulnerability factor (AVF), online while the program is running. Unfortunately, existing solutions for estimating AVF are often based on offline simulators and are hard to implement in real processors. This paper proposes a novel way of estimating AVF online, using simple modifications to the processor. The estimation method applies to both logic and storage structures on the processor. Compared to previous methods for estimating AVF, our method does not require any offline simulation or calibration for different workloads. We tested our method with a widely used simulator from industry, for four processor structures and for 100 to 200 intervals of each of eleven SPEC benchmarks. The results show that our method provides acceptably accurate AVF estimates at runtime. The absolute error rarely exceeds 0.08 across all application intervals for all structures, and the mean absolute error for a given application and structure combination is always within 0.05.

Citations: 85
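The abstract does not describe the online estimation mechanism itself, so this sketch only illustrates two quantities it relies on. The first is the standard definition of AVF: the fraction of a structure's bits that are required for architecturally correct execution (ACE), averaged over the cycles considered. The second is the pair of absolute and mean absolute error figures used to report accuracy. The function names are illustrative, and the per-cycle ACE-bit counts are assumed inputs rather than something this code measures.

```python
def avf(ace_bits_per_cycle: list[int], total_bits: int) -> float:
    """Average fraction of a structure's bits that are ACE over the given cycles."""
    if not ace_bits_per_cycle or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    if any(not 0 <= n <= total_bits for n in ace_bits_per_cycle):
        raise ValueError("ACE-bit count out of range")
    return sum(ace_bits_per_cycle) / (total_bits * len(ace_bits_per_cycle))


def absolute_errors(estimated: list[float], reference: list[float]) -> list[float]:
    """Per-interval |estimated AVF - reference AVF|."""
    if len(estimated) != len(reference):
        raise ValueError("interval counts differ")
    return [abs(e - r) for e, r in zip(estimated, reference)]


def mean_absolute_error(estimated: list[float], reference: list[float]) -> float:
    """Mean of the per-interval absolute errors for one application/structure pair."""
    errors = absolute_errors(estimated, reference)
    if not errors:
        raise ValueError("need at least one interval")
    return sum(errors) / len(errors)
```

For example, a 64-bit structure holding 32, 64, 0, and 32 ACE bits over four cycles has an AVF of 128 / (64 * 4) = 0.5.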

A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382151

Jeonghee Shin, V. Zyuban, P. Bose, T. Pinkston

Abstract: Microarchitectural redundancy has been proposed as a means of improving chip lifetime reliability. It is typically used in a reactive way, allowing chips to maintain operability in the presence of failures by detecting and isolating, correcting, and/or replacing components on a first-come, first-served basis only after they become faulty. In this paper, we explore an alternative, preferable method of exploiting microarchitectural redundancy to enhance chip lifetime reliability. In our proposed approach, redundancy is used proactively to allow non-faulty microarchitecture components to be temporarily deactivated, on a rotating basis, to suspend and/or recover from certain wearout effects. This approach improves chip lifetime reliability by warding off the onset of wearout failures rather than reacting to them after the fact. Applied to on-chip cache SRAM for combating NBTI-induced wearout failure, our proactive wearout recovery approach increases the lifetime reliability of the cache (measured as mean time to failure) by about a factor of seven relative to no use of microarchitectural redundancy, and by a factor of five relative to conventional reactive use of redundancy with similar area overhead.

Citations: 110
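The rotating deactivation described in the abstract above can be sketched as a round-robin schedule. This is a hypothetical illustration. It assumes a structure of `num_components` interchangeable subarrays, of which `num_spares` can be idle at once without losing function. It is not the paper's controller, and it says nothing about how much NBTI recovery an idle period actually yields.

```python
from collections import Counter


def rotation_schedule(num_components: int, num_spares: int, epochs: int) -> list[list[int]]:
    """Round-robin list of the components deactivated (left to recover) in each epoch."""
    if not 0 < num_spares < num_components:
        raise ValueError("need at least one spare and at least one active component")
    # Epoch t idles the next num_spares components after those idled in epoch t-1.
    return [[(t * num_spares + k) % num_components for k in range(num_spares)]
            for t in range(epochs)]


def idle_fraction(schedule: list[list[int]], num_components: int) -> list[float]:
    """Fraction of epochs each component spent deactivated."""
    counts = Counter(i for epoch in schedule for i in epoch)
    return [counts[i] / len(schedule) for i in range(num_components)]
```

Over any `num_components` consecutive epochs, each component is idle exactly `num_spares` times, so the recovery time is spread evenly across all parts instead of being reserved for whichever part fails first.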

Rerun: Exploiting Episodes for Lightweight Memory Race Recording

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382144

Derek Hower, M. Hill

Citations: 164

Corona: System Implications of Emerging Nanophotonic Technology

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382135

D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi, Marco Fiorentino, A. Davis, N. Binkert, R. Beausoleil, Jung Ho Ahn

Abstract: We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on-stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense-wavelength-division-multiplexed, optically connected memory modules provide 10 terabytes per second of memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores with 20 terabytes per second of bandwidth. We have simulated a 1024-thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory-intensive workloads, while simultaneously reducing power.

Citations: 690
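Here is a quick back-of-the-envelope check of the figures quoted in the Corona abstract. It assumes decimal units (1 teraflop = 10^12 flop/s, 1 terabyte = 10^12 bytes) and that peak compute and crossbar bandwidth are split evenly across the 256 cores. The derived per-core numbers are simple divisions, not figures reported by the paper.

```python
TERA = 10**12

# Figures quoted in the Corona abstract (decimal units assumed).
PEAK_FLOPS = 10 * TERA            # 10 teraflops peak floating-point performance
MEMORY_BYTES_PER_S = 10 * TERA    # 10 TB/s optically connected memory bandwidth
CROSSBAR_BYTES_PER_S = 20 * TERA  # 20 TB/s photonic crossbar bandwidth
CORES = 256
THREADS = 1024                    # size of the simulated system


def corona_ratios() -> dict[str, float]:
    """Even-split per-core figures and the machine-wide bytes-per-flop balance."""
    return {
        "gflops_per_core": PEAK_FLOPS / CORES / 1e9,
        "crossbar_gb_per_s_per_core": CROSSBAR_BYTES_PER_S / CORES / 1e9,
        "memory_bytes_per_flop": MEMORY_BYTES_PER_S / PEAK_FLOPS,
        "threads_per_core": THREADS / CORES,
    }
```

Under these assumptions, each core gets about 39 GFLOPS and 78 GB/s of crossbar bandwidth. The memory system supplies 1 byte per peak flop, and the 1024 simulated threads correspond to 4 threads per core.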

DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently

2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382146

Pablo Montesinos, L. Ceze, J. Torrellas

Abstract: Support for deterministic replay of multithreaded execution can greatly help in finding concurrency bugs. For highest effectiveness, replay schemes should (i) record at production-run speed, (ii) keep their logging requirements minute, and (iii) replay at a speed similar to that of the initial execution. In this paper, we propose a new substrate for deterministic replay that provides substantial advances along these axes. In our proposal, processors execute blocks of instructions atomically, as in transactional memory or speculative multithreading, and the system only needs to record the commit order of these blocks. We call our scheme DeLorean. Our results show that DeLorean records execution at a speed similar to that of release consistency (RC) execution and replays at about 82% of its speed. In contrast, most current schemes only record at the speed of Sequential Consistency (SC) execution. Moreover, DeLorean only needs 7.5% of the log size needed by a state-of-the-art scheme. Finally, DeLorean can be configured to need only 0.6% of the log size of the state-of-the-art scheme at the cost of recording at 86% of RC's execution speed, which is still faster than SC. In this configuration, the log of an 8-processor 5-GHz machine is estimated to be only about 20 GB per day.

Citations: 213
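The DeLorean abstract claims that when blocks (chunks) of instructions execute atomically, recording their commit order is enough to reproduce the execution. The toy Python sketch below illustrates that claim under strong simplifying assumptions. Each hypothetical chunk is a list of `(dst, src, k)` operations meaning `mem[dst] = mem[src] + k`, and chunks commit one at a time. There is no other source of nondeterminism: no interrupts, I/O, or external inputs. The log stores only the committing processor's ID. The names `record`, `replay`, and `apply_chunk` are illustrative, not the paper's interface.

```python
import random

# Hypothetical chunk format: each op (dst, src, k) performs mem[dst] = mem[src] + k.
Chunk = list[tuple[str, str, int]]


def apply_chunk(memory: dict[str, int], chunk: Chunk) -> None:
    """Execute one chunk atomically against shared memory (missing addresses read as 0)."""
    for dst, src, k in chunk:
        memory[dst] = memory.get(src, 0) + k


def record(programs: dict[int, list[Chunk]], seed: int) -> tuple[dict[str, int], list[int]]:
    """Commit chunks in a seeded, arbitrary interleaving; log only the committing processor ID."""
    rng = random.Random(seed)
    memory: dict[str, int] = {}
    next_chunk = {pid: 0 for pid in programs}
    log: list[int] = []
    while True:
        ready = [pid for pid, i in next_chunk.items() if i < len(programs[pid])]
        if not ready:
            return memory, log
        pid = rng.choice(ready)  # stands in for nondeterministic commit arbitration
        apply_chunk(memory, programs[pid][next_chunk[pid]])
        next_chunk[pid] += 1
        log.append(pid)


def replay(programs: dict[int, list[Chunk]], log: list[int]) -> dict[str, int]:
    """Re-execute chunks in the logged commit order; no other information is needed."""
    memory: dict[str, int] = {}
    next_chunk = {pid: 0 for pid in programs}
    for pid in log:
        apply_chunk(memory, programs[pid][next_chunk[pid]])
        next_chunk[pid] += 1
    return memory
```

With two processors that each increment `x` and then copy it (for example `{0: [[("x", "x", 1)], [("y", "x", 0)]], 1: [[("x", "x", 10)], [("z", "x", 0)]]}`), different seeds can leave different values in `y` and `z`. Replaying any recorded log still reproduces that run's memory exactly.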