Proceedings of the 37th annual international symposium on Computer architecture: Latest Publications

Relax: an architectural framework for software recovery of hardware faults
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816026
M. Kruijf, Shuou Nomura, K. Sankaralingam
Abstract: As technology scales ever further, device unreliability is creating excessive complexity for hardware to maintain the illusion of perfect operation. In this paper, we consider whether exposing hardware fault information to software and allowing software to control fault recovery simplifies hardware design and helps technology scaling. The combination of emerging applications and emerging many-core architectures makes software recovery a viable alternative to hardware-based fault recovery. Emerging applications tend to have few I/O and memory side-effects, which limits the amount of information that needs checkpointing, and they allow discarding individual sub-computations with small qualitative impact. Software recovery can harness these properties in ways that hardware recovery cannot. We describe Relax, an architectural framework for software recovery of hardware faults. Relax includes three core components: (1) an ISA extension that allows software to mark regions of code for software recovery, (2) a hardware organization that simplifies reliability considerations and provides energy efficiency with hardware recovery support removed, and (3) software support for compilers and programmers to utilize the Relax ISA. Applying Relax to counter the effects of process variation, our results show a 20% energy efficiency improvement for PARSEC applications with only minimal source code changes and simpler hardware.
Citations: 233
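The Relax abstract centers on marking regions of code whose results can be discarded or recomputed when hardware reports a fault. The sketch below illustrates that programming pattern in plain C; the RELAX_TRY/RELAX_FAULT macros and the setjmp-based fallback are illustrative assumptions, not the paper's actual ISA extension or compiler interface.

```c
#include <setjmp.h>
#include <stdio.h>

/* Illustrative stand-ins for region markers; the real proposal uses ISA
 * support and compiler assistance, not setjmp/longjmp. */
static jmp_buf relax_ctx;
#define RELAX_TRY()   (setjmp(relax_ctx) == 0)
#define RELAX_FAULT() longjmp(relax_ctx, 1)   /* would be raised by hardware */

/* A discardable sub-computation: one block of some approximate kernel. */
static float compute_block(int i)
{
    float v = i * 0.5f;
    if (i == 3)           /* pretend the hardware reported a fault here */
        RELAX_FAULT();
    return v;
}

int main(void)
{
    float out[8];
    for (int i = 0; i < 8; i++) {
        if (RELAX_TRY()) {
            out[i] = compute_block(i);   /* recoverable region */
        } else {
            out[i] = 0.0f;               /* recovery: discard the block */
        }
        printf("block %d -> %.1f\n", i, out[i]);
    }
    return 0;
}
```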
Using hardware vulnerability factors to enhance AVF analysis
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816023
Vilas Sridharan, D. Kaeli
Abstract: Fault tolerance is now a primary design constraint for all major microprocessors. One step in determining a processor's compliance to its failure rate target is measuring the Architectural Vulnerability Factor (AVF) of each on-chip structure. The AVF of a hardware structure is the probability that a fault in the structure will affect the output of a program. While AVF generates meaningful insight into system behavior, it cannot quantify the vulnerability of an individual system component (hardware, user program, etc.), limiting the amount of insight that can be generated. To address this, prior work has introduced the Program Vulnerability Factor (PVF) to quantify the vulnerability of software. In this paper, we introduce and analyze the Hardware Vulnerability Factor (HVF) to quantify the vulnerability of hardware. HVF has three concrete benefits which we examine in this paper. First, HVF analysis can provide insight to hardware designers beyond that gained from AVF analysis alone. Second, separating AVF analysis into HVF and PVF steps can accelerate the AVF measurement process. Finally, HVF measurement enables runtime AVF estimation that combines compile-time PVF estimates with runtime HVF measurements. A key benefit of this technique is that it allows software developers to influence the runtime AVF estimates. We demonstrate that this technique can estimate AVF at runtime with an average absolute error of less than 3%.
Citations: 74
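The runtime AVF estimation described above combines compile-time PVF estimates with runtime HVF measurements. The sketch below shows one simple way such a combination could be computed over measurement intervals; treating each interval's estimate as the product hvf * pvf and averaging is a simplifying assumption for illustration, not the paper's exact per-bit formulation.

```c
#include <stdio.h>

/* Sketch of combining runtime HVF samples with compile-time PVF estimates
 * into a runtime AVF estimate.  The per-interval product and the averaging
 * below are illustrative assumptions. */
struct interval_sample {
    double hvf;   /* measured at runtime for a hardware structure */
    double pvf;   /* estimated at compile time for the code running then */
};

static double estimate_avf(const struct interval_sample *s, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += s[i].hvf * s[i].pvf;
    return n ? sum / n : 0.0;
}

int main(void)
{
    struct interval_sample trace[] = {
        { 0.40, 0.30 }, { 0.55, 0.10 }, { 0.25, 0.60 }, { 0.35, 0.20 },
    };
    printf("estimated AVF = %.3f\n",
           estimate_avf(trace, (int)(sizeof trace / sizeof trace[0])));
    return 0;
}
```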
The virtual write queue: coordinating DRAM and last-level cache policies
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1815972
Jeffrey Stuecheli, Dimitris Kaseridis, D. Daly, H. Hunter, L. John
Abstract: In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that the performance-limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.
Citations: 139
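The coordination idea above is to give the memory controller visibility into dirty last-level-cache lines so writebacks can be scheduled to suit DRAM, for example drained toward a row that is already open. A minimal sketch of that policy follows; the address-to-row mapping, structure sizes, and drain budget are assumptions for illustration, not the paper's design.

```c
#include <stdio.h>
#include <stdint.h>

#define LINES     16
#define ROW_SHIFT 12             /* assume 4 KB DRAM rows */

struct llc_line { uint64_t addr; int dirty; };

static uint64_t dram_row(uint64_t addr) { return addr >> ROW_SHIFT; }

/* Drain up to 'budget' dirty lines that map to the currently open row,
 * so the writes hit the DRAM row buffer. */
static int scheduled_writeback(struct llc_line *llc, int n,
                               uint64_t open_row, int budget)
{
    int drained = 0;
    for (int i = 0; i < n && drained < budget; i++) {
        if (llc[i].dirty && dram_row(llc[i].addr) == open_row) {
            printf("writeback 0x%llx (row %llu, row-buffer hit)\n",
                   (unsigned long long)llc[i].addr,
                   (unsigned long long)open_row);
            llc[i].dirty = 0;
            drained++;
        }
    }
    return drained;
}

int main(void)
{
    struct llc_line llc[LINES];
    for (int i = 0; i < LINES; i++) {
        llc[i].addr  = (uint64_t)i * 0x940;   /* scatter across a few rows */
        llc[i].dirty = (i % 3 != 0);
    }
    uint64_t open_row = dram_row(llc[5].addr);
    int n = scheduled_writeback(llc, LINES, open_row, 4);
    printf("drained %d dirty lines to the open row\n", n);
    return 0;
}
```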
Session details: Memory subsystems
D. Tullsen
Pub Date: 2010-06-19. DOI: 10.1145/3258105
Citations: 0
Evolution of thread-level parallelism in desktop applications
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816000
G. Blake, R. Dreslinski, T. Mudge, K. Flautner
Abstract: As the effective limits of frequency and instruction level parallelism have been reached, the strategy of microprocessor vendors has changed to increase the number of processing cores on a single chip each generation. The implicit expectation is that software developers will write their applications with concurrency in mind to take advantage of this sudden change in direction. In this study we analyze whether software developers for laptop/desktop machines have followed the recent hardware trends by creating software for chip multi-processing. We conduct a study of a wide range of applications on Microsoft Windows 7 and Apple's OS X Snow Leopard, measuring Thread Level Parallelism on a high performance workstation and a low power desktop. In addition, we explore graphics processing units (GPUs) and their impact on chip multi-processing. We compare our findings to a study done 10 years ago which concluded that a second core was sufficient to improve system responsiveness. Our results on today's machines show that, 10 years later, surprisingly 2-3 cores are more than adequate for most applications and that the GPU often remains under-utilized. However, in some application-specific domains an 8-core SMT system with a 240-core GPU can be effectively utilized. Overall these studies suggest that many-core architectures are not a natural fit for current desktop/laptop applications.
Citations: 112
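Thread-level parallelism in this line of work is usually reported as concurrency averaged over non-idle time: TLP = sum_i(c_i * i) / (1 - c_0), where c_i is the fraction of time exactly i cores are busy. Assuming that conventional definition (the paper's measurement infrastructure is more elaborate), a minimal computation looks like this, with a made-up sample distribution:

```c
#include <stdio.h>

/* TLP = sum_i (c_i * i) / (1 - c_0), averaged over non-idle time.
 * c[i] is the fraction of samples with exactly i busy cores. */
static double tlp(const double *c, int max_threads)
{
    double busy = 1.0 - c[0];    /* fraction of time at least one core busy */
    if (busy <= 0.0)
        return 0.0;
    double weighted = 0.0;
    for (int i = 1; i <= max_threads; i++)
        weighted += c[i] * i;
    return weighted / busy;
}

int main(void)
{
    /* Illustrative distribution: 40% idle, mostly 1-2 threads running. */
    double c[5] = { 0.40, 0.30, 0.20, 0.07, 0.03 };
    printf("TLP = %.2f\n", tlp(c, 4));   /* ~1.72 on this example */
    return 0;
}
```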
Understanding sources of inefficiency in general-purpose chips
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1815968
R. Hameed, W. Qadeer, Megan Wachs, Omid Azizi, A. Solomatnikov, Benjamin C. Lee, S. Richardson, C. Kozyrakis, M. Horowitz
Abstract: Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to a particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s of fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x, and the final customized CMP matches an ASIC solution's performance while coming within 3x of its energy and within comparable area.
Citations: 479
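The "over 90% overhead" point follows from simple arithmetic once per-instruction energy is split into the arithmetic op itself and everything around it (fetch, decode, register and cache access). The numbers below are illustrative assumptions consistent with the abstract's "100s of fJ" figure, not values from the paper:

```c
#include <stdio.h>

/* Back-of-the-envelope version of the overhead argument.  Both energy
 * values are assumed for illustration only. */
int main(void)
{
    double op_fj       = 200.0;    /* actual core op: a few hundred fJ at 90nm */
    double overhead_fj = 2300.0;   /* assumed per-instruction pipeline overhead */
    double useful = op_fj / (op_fj + overhead_fj);
    printf("useful fraction = %.1f%%, overhead = %.1f%%\n",
           100.0 * useful, 100.0 * (1.0 - useful));
    return 0;
}
```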
Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816025
Guihai Yan, Xiaoyao Liang, Yinhe Han, Xiaowei Li
Abstract: Process, Voltage, and Temperature (PVT) variations can significantly degrade the performance benefits expected from next-generation nanoscale technology. The primary circuit implication of PVT variations is the resulting timing emergencies. In a multi-core processor running multiple programs, variations create spatial and temporal imbalance across the processing cores. Most prior schemes are dedicated to tolerating PVT variations individually for a single core, but ignore the opportunity of leveraging the complementary effects between variations and the intrinsic variation imbalance among individual cores. We find that the delay impacts from different variations are not necessarily aggregated. Cores with mild variations can take over demanding workload from cores suffering large variations. If managed correctly, variations on different cores can help mitigate each other and result in a variation-mild environment. In this paper, we propose Timing Emergency Aware Thread Migration (TEA-TM), a delay-sensor-based scheme to reduce system timing emergencies under PVT variations. Fourier transform and frequency-domain analysis are conducted to provide insight into the potential of the PVT co-optimization scheme. Experimental results show that, on average, TEA-TM can save up to 24% of throughput loss while at the same time improving system fairness by 85%.
Citations: 18
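TEA-TM's core move is to use delay sensors to spot the core closest to a timing emergency and shift its demanding thread to a core with more timing slack. A minimal sketch of one migration step follows; the sensor readings, emergency threshold, and swap policy are assumptions for illustration, not the paper's mechanism.

```c
#include <stdio.h>

#define CORES 4

struct core { int thread_id; double slack_ps; };   /* larger slack = milder core */

/* One decision step: if the tightest core is below the emergency threshold,
 * swap its thread with the thread on the core that has the most slack. */
static void tea_tm_step(struct core *c, int n, double emergency_ps)
{
    int worst = 0, best = 0;
    for (int i = 1; i < n; i++) {
        if (c[i].slack_ps < c[worst].slack_ps) worst = i;
        if (c[i].slack_ps > c[best].slack_ps)  best  = i;
    }
    if (worst != best && c[worst].slack_ps < emergency_ps) {
        int t = c[worst].thread_id;
        c[worst].thread_id = c[best].thread_id;
        c[best].thread_id  = t;
        printf("migrated thread %d to core %d (slack %.0f ps)\n",
               t, best, c[best].slack_ps);
    }
}

int main(void)
{
    struct core cores[CORES] = {
        { 0, 45.0 }, { 1, 12.0 }, { 2, 80.0 }, { 3, 33.0 },
    };
    tea_tm_step(cores, CORES, 20.0);   /* core 1 is near an emergency */
    return 0;
}
```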
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816021
V. Lee, Changkyu Kim, J. Chhugani, M. Deisher, Daehyun Kim, A. Nguyen, N. Satish, M. Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, P. Dubey
Abstract: Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.
Citations: 830
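Aggregate figures such as the 2.5x average summarize many per-kernel GPU/CPU speedup ratios. The sketch below aggregates made-up ratios with a geometric mean, a common convention for averaging speedups; whether the paper used exactly this aggregation, and the ratios themselves, are assumptions here.

```c
#include <stdio.h>
#include <math.h>

/* Geometric mean of per-kernel speedups (illustrative data only). */
int main(void)
{
    double speedup[] = { 1.2, 0.8, 5.0, 2.1, 3.4, 14.0, 1.0, 2.8 };
    int n = (int)(sizeof speedup / sizeof speedup[0]);
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(speedup[i]);
    printf("geometric-mean speedup = %.2fx\n", exp(log_sum / n));
    return 0;
}
```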
Sentry: light-weight auxiliary memory access control
Proceedings of the 37th annual international symposium on Computer architecture. Pub Date: 2010-06-19. DOI: 10.1145/1815961.1816016
Arrvindh Shriraman, S. Dwarkadas
Abstract: Light-weight, flexible access control, which allows software to regulate reads and writes to any granularity of memory region, can help improve the reliability of today's multi-module multi-programmer applications, as well as the efficiency of software debugging tools. Unfortunately, access control in today's processors is tied to support for virtual memory, making its use both heavy weight and coarse grain. In this paper, we propose Sentry, an auxiliary level of virtual memory tagging that is entirely subordinate to existing virtual memory-based protection mechanisms and can be manipulated at the user level. We implement these tags in a complexity-effective manner using an M-cache (metadata cache) structure that only intervenes on L1 misses, thereby minimizing changes to the processor core. Existing cache coherence states are repurposed to implicitly validate permissions for L1 hits. Sentry achieves its goal of flexible and light-weight access control without disrupting existing inter-application protection, sidestepping the challenges associated with adding a new protection framework to an existing operating system. We illustrate the benefits of our design point using 1) an Apache-based web server that uses the M-cache to enforce protection boundaries among its modules and 2) a watchpoint-based tool to demonstrate low-overhead debugging. Protection is achieved with very few changes to the source code, no changes to the programming model, minimal modifications to the operating system, and with low overhead incurred only when accessing memory regions for which the additional level of access control is enabled.
Citations: 14
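Sentry's key placement decision is that L1 hits need no extra check (the coherence state implicitly vouches for permissions granted at fill time), while L1 misses consult a small metadata cache of per-region read/write rights. A software sketch of that check follows; the region table, default-allow policy, and lookup interface are illustrative assumptions rather than the paper's hardware design.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MCACHE_ENTRIES 4

struct mcache_entry { uint64_t base, len; bool allow_read, allow_write; };

/* Assumed per-region permissions, as a user-level tool might install them. */
static const struct mcache_entry mcache[MCACHE_ENTRIES] = {
    { 0x10000, 0x1000, true,  true  },   /* module-private heap   */
    { 0x20000, 0x1000, true,  false },   /* read-only shared data */
    { 0x30000, 0x0100, false, false },   /* watchpoint region     */
    { 0, 0, false, false },
};

static bool access_allowed(uint64_t addr, bool is_write, bool l1_hit)
{
    if (l1_hit)                  /* permissions validated when the line was filled */
        return true;
    for (int i = 0; i < MCACHE_ENTRIES; i++) {
        if (addr >= mcache[i].base && addr < mcache[i].base + mcache[i].len)
            return is_write ? mcache[i].allow_write : mcache[i].allow_read;
    }
    return true;                 /* untagged regions: default allow */
}

int main(void)
{
    printf("write 0x20010 (miss): %s\n",
           access_allowed(0x20010, true, false) ? "ok" : "trap");
    printf("read  0x30008 (miss): %s\n",
           access_allowed(0x30008, false, false) ? "ok" : "trap");
    return 0;
}
```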
Session details: Productivity and debugging
J. Torrellas
Pub Date: 2010-06-19. DOI: 10.1145/3258106
Citations: 0