2008 International Symposium on Computer Architecture最新文献

筛选
英文 中文
Achieving Out-of-Order Performance with Almost In-Order Complexity 用几乎有序的复杂度实现无序的性能
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382169
F. Tseng, Y. Patt
{"title":"Achieving Out-of-Order Performance with Almost In-Order Complexity","authors":"F. Tseng, Y. Patt","doi":"10.1145/1394608.1382169","DOIUrl":"https://doi.org/10.1145/1394608.1382169","url":null,"abstract":"There is still much performance to be gained by out-of-order processors with wider issue widths. However, traditional methods of increasing issue width do not scale; that is, they drastically increase design complexity and power requirements. This paper introduces the braid, a compile-time identified entity that enables the execution core to scale to wider widths by exploiting the small fanout and short lifetime of values produced by the program. Braid processing requires identification by the compiler, minor extensions to the ISA, and support by the microarchitecture. The result from processing braids is performance within 9% of a very aggressive conventional out-of-order microarchitecture with almost the complexity of an in-order implementation.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"182 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127023866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39
Microcoded Architectures for Ion-Tap Quantum Computers 离子抽头量子计算机的微编码体系结构
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382136
Lucas Kreger-Stickles, M. Oskin
{"title":"Microcoded Architectures for Ion-Tap Quantum Computers","authors":"Lucas Kreger-Stickles, M. Oskin","doi":"10.1145/1394608.1382136","DOIUrl":"https://doi.org/10.1145/1394608.1382136","url":null,"abstract":"In this paper we present the first ever systematic design space exploration of microcoded software fault tolerant ion-trap quantum computers. This exploration reveals the critical importance of a well-tuned microcode for providing high performance and ensuring system reliability. In addition, we find that, despite recent advances in the reliability of quantum memory, the impact of errors due to stored quantum data is now, and will continue to be, a major source of systemic error. Finally, our exploration reveals a single design which out performs all others we considered in run time, fidelity and area. For completeness our design space exploration includes designs from prior work [13] and we find a novel design that is 1/2 the size, 3 times as fast, and an order of magnitude more reliable.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132960054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 20
iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures 理想:用于片上网络(NoC)架构的路由器间双功能节能和区域高效链路
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382142
Avinash Karanth Kodi, Ashwini Sarathy, A. Louri
{"title":"iDEAL: Inter-router Dual-Function Energy and Area-Efficient Links for Network-on-Chip (NoC) Architectures","authors":"Avinash Karanth Kodi, Ashwini Sarathy, A. Louri","doi":"10.1145/1394608.1382142","DOIUrl":"https://doi.org/10.1145/1394608.1382142","url":null,"abstract":"Network-on-Chip (NoC) architectures have been adopted by a growing number of multi-core designs as a flexible and scalable solution to the increasing wire delay constraints in the deep sub-micron regime. However, the shrinking feature size limits the performance of NoCs due to power and area constraints. Research into the optimization of NoCs has shown that a reduction in the number of buffers in the NoC routers reduces the power and area overhead but degrades the network performance. In this paper, we propose iDEAL, a low-power area-efficient NoC architecture by reducing the number of buffers within the router. To overcome the performance degradation caused by the reduced buffer size, we propose to use adaptive dual-function links capable of data transmission as well as data storage when required. Simulation results for the proposed architecture show that reducing the router buffer size in half and using the adaptive dual-function links achieves nearly 40% savings in buffer power, 30% savings in overall network power and about 41% savings in the router area, with only a marginal 1-3% drop in performance. Moreover, the performance in iDEAL can be further improved by aggressive and speculative flow control techniques.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132949157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 81
Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors 芯片多处理器的可变感知应用程序调度和电源管理
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382152
R. Teodorescu, J. Torrellas
{"title":"Variation-Aware Application Scheduling and Power Management for Chip Multiprocessors","authors":"R. Teodorescu, J. Torrellas","doi":"10.1145/1394608.1382152","DOIUrl":"https://doi.org/10.1145/1394608.1382152","url":null,"abstract":"Within-die process variation causes individual cores in a chip multiprocessor (CMP) to differ substantially in both static power consumed and maximum frequency supported. In this environment, ignoring variation effects when scheduling applications or when managing power with dynamic voltage and frequency scaling (DVFS) is suboptimal. This paper proposes variation-aware algorithms for application scheduling and power management. One such power management algorithm, called LinOpt, uses linear programming to find the best voltage and frequency levels for each of the cores in the CMP - maximizing throughput at a given power budget. In a 20-core CMP, the combination of variation-aware application scheduling and LinOpt increases the average throughput by 12-17% and reduces the average ED2 by 30-38% - all relative to using variation-aware scheduling together with a simple extension to Intel's Foxton power management algorithm.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115420178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 341
VEAL: Virtualized Execution Accelerator for Loops VEAL:循环的虚拟执行加速器
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382155
Nathan Clark, Amir Hormati, S. Mahlke
{"title":"VEAL: Virtualized Execution Accelerator for Loops","authors":"Nathan Clark, Amir Hormati, S. Mahlke","doi":"10.1145/1394608.1382155","DOIUrl":"https://doi.org/10.1145/1394608.1382155","url":null,"abstract":"Performance improvement solely through transistor scaling is becoming more and more difficult, thus it is increasingly common to see domain specific accelerators used in conjunction with general purpose processors to achieve future performance goals. There is a serious drawback to accelerators, though: binary compatibility. An application compiled to utilize an accelerator cannot run on a processor without that accelerator, and applications that do not utilize an accelerator will never use it. To overcome this problem, we propose decoupling the instruction set architecture from the underlying accelerators. Computation to be accelerated is expressed using a processorpsilas baseline instruction set, and light-weight dynamic translation maps the representation to whatever accelerators are available in the system. In this paper, we describe the changes to a compilation framework and processor system needed to support this abstraction for an important set of accelerator designs that support innermost loops. In this analysis, we investigate the dynamic overheads associated with abstraction as well as the static/dynamic tradeoffs to improve the dynamic mapping of loop-nests. As part of the exploration, we also provide a quantitative analysis of the hardware characteristics of effective loop accelerators. We conclude that using a hybrid static-dynamic compilation approach to map computation on to loop-level accelerators is an practical way to increase computation efficiency, without the overheads associated with instruction set modification.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114535369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 110
Online Estimation of Architectural Vulnerability Factor for Soft Errors 软错误体系结构脆弱性因子的在线估计
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382150
Xiaodong Li, S. Adve, P. Bose, J. Rivers
{"title":"Online Estimation of Architectural Vulnerability Factor for Soft Errors","authors":"Xiaodong Li, S. Adve, P. Bose, J. Rivers","doi":"10.1145/1394608.1382150","DOIUrl":"https://doi.org/10.1145/1394608.1382150","url":null,"abstract":"As CMOS technology scales and more transistors are packed on to the same chip, soft error reliability has become an increasingly important design issue for processors. Prior research has shown that there is significant architecture-level masking, and many soft error solutions take advantage of this effect. Prior work has also shown that the degree of such masking can vary significantly across workloads and between individual workload phases, motivating dynamic adaptation of reliability solutions for optimal cost and benefit. For such adaptation, it is important to be able to accurately estimate the amount of masking or the architecture vulnerability factor (AVF) online, while the program is running. Unfortunately, existing solutions for estimating AVF are often based on offline simulators and hard to implement in real processors. This paper proposes a novel way of estimating AVF online, using simple modifications to the processor. The estimation method applies to both logic and storage structures on the processor. Compared to previous methods for estimating AVF, our method does not require any offline simulation or calibration for different workloads. We tested our method with a widely used simulator from industry, for four processor structures and for 100 to 200 intervals of each of eleven SPEC benchmarks. The results show that our method provides acceptably accurate AVF estimates at runtime. The absolute error rarely exceeds 0.08 across all application intervals for all structures, and the mean absolute error for a given application and structure combination is always within 0.05.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115034893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 85
A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime 一种利用微架构冗余延长缓存SRAM寿命的主动磨损恢复方法
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382151
Jeonghee Shin, V. Zyuban, P. Bose, T. Pinkston
{"title":"A Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime","authors":"Jeonghee Shin, V. Zyuban, P. Bose, T. Pinkston","doi":"10.1145/1394608.1382151","DOIUrl":"https://doi.org/10.1145/1394608.1382151","url":null,"abstract":"Microarchitectural redundancy has been proposed as a means of improving chip lifetime reliability. It is typically used in a reactive way, allowing chips to maintain operability in the presence of failures by detecting and isolating, correcting, and/or replacing components on a first-come, first-served basis only after they become faulty. In this paper, we explore an alternative, more preferred method of exploiting microarchitectural redundancy to enhance chip lifetime reliability. In our proposed approach, redundancy is used proactively to allow non-faulty microarchitecture components to be temporarily deactivated, on a rotating basis, to suspend and/or recover from certain wearout effects. This approach improves chip lifetime reliability by warding off the onset of wearout failures as opposed to reacting to them posteriorly. Applied to on-chip cache SRAM for combating NBTI-induced wearout failure, our proactive wearout recovery approach increases lifetime reliability (measured in mean-time-to-failure) of the cache by about a factor of seven relative to no use of microarchitectural redundancy and a factor of five relative to conventional reactive use of redundancy having similar area overhead.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127369433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 110
Rerun: Exploiting Episodes for Lightweight Memory Race Recording 重播:利用情节轻量级内存比赛记录
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382144
Derek Hower, M. Hill
{"title":"Rerun: Exploiting Episodes for Lightweight Memory Race Recording","authors":"Derek Hower, M. Hill","doi":"10.1145/1394608.1382144","DOIUrl":"https://doi.org/10.1145/1394608.1382144","url":null,"abstract":"Multiprocessor deterministic replay has many potential uses in the era of multicore computing, including enhanced debugging, fault tolerance, and intrusion detection. While sources of nondeterminism in a uniprocessor can be recorded efficiently in software, it seems likely that hardware support will be needed in a multiprocessor environment where the outcome of memory races must also be recorded. We develop a memory race recording mechanism, called Rerun, that uses small hardware state (~166 bytes/core), writes a small race log (~4 bytes/kilo- instruction), and operates well as the number of cores per system scales (e.g., to 16 cores). Rerun exploits the dual of conventional wisdom in race recording: Rather than record information about individual memory accesses that conflict, we record how long a thread executes without conflicting with other threads. In particular, Rerun passively creates atomic episodes. Each episode is a dynamic instruction sequence that a thread happens to execute without interacting with other threads. Rerun uses Lamport Clocks to order episodes and enable replay of an equivalent execution.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132293532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 164
Corona: System Implications of Emerging Nanophotonic Technology 电晕:新兴纳米光子技术的系统含义
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382135
D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi, Marco Fiorentino, A. Davis, N. Binkert, R. Beausoleil, Jung Ho Ahn
{"title":"Corona: System Implications of Emerging Nanophotonic Technology","authors":"D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi, Marco Fiorentino, A. Davis, N. Binkert, R. Beausoleil, Jung Ho Ahn","doi":"10.1145/1394608.1382135","DOIUrl":"https://doi.org/10.1145/1394608.1382135","url":null,"abstract":"We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on-stack bandwidth requirements at acceptable power levels. Corona is a 3 D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024 thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory intensive workloads, while simultaneously reducing power.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114729687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 690
DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently DeLorean:记录和确定性重放共享内存多处理器执行Ef?地
2008 International Symposium on Computer Architecture Pub Date : 2008-06-01 DOI: 10.1145/1394608.1382146
Pablo Montesinos, L. Ceze, J. Torrellas
{"title":"DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently","authors":"Pablo Montesinos, L. Ceze, J. Torrellas","doi":"10.1145/1394608.1382146","DOIUrl":"https://doi.org/10.1145/1394608.1382146","url":null,"abstract":"Support for deterministic replay of multithreaded execution can greatly help in finding concurrency bugs. For highest effectiveness, replay schemes should (i) record at production-run speed, (ii) keep their logging requirements minute, and (iii) replay at a speed similar to that of the initial execution. In this paper, we propose a new substrate for deterministic replay that provides substantial advances along these axes. In our proposal, processors execute blocks of instructions atomically, as in transactional memory or speculative multithreading, and the system only needs to record the commit order of these blocks. We call our scheme DeLorean. Our results show that DeLorean records execution at a speed similar to that of release consistency (RC) execution and replays at about 82% of its speed. In contrast, most current schemes only record at the speed of Sequential Consistency (SC) execution. Moreover, DeLorean only needs 7.5% of the log size needed by a state-of-the-art scheme. Finally, DeLorean can be configured to need only 0.6% of the log size of the state-of-the-art scheme at the cost of recording at 86% of RCpsilas execution speed - still faster than SC. In this configuration, the log of an 8-processor 5-GHz machine is estimated to be only about 20GB per day.","PeriodicalId":190082,"journal":{"name":"2008 International Symposium on Computer Architecture","volume":"83 1-2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116734809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 213
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信