ASPLOS XII: Latest Papers (Page 3)

A defect tolerant self-organizing nanoscale SIMD architecture

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168888

Jaidev P. Patwardhan, Vijeta Johri, C. Dwyer, A. Lebeck

Abstract: The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera- to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing to the exponentially increasing cost of top-down lithography. Bottom-up manufacturing techniques, like self assembly, may provide a viable lower-cost alternative to top-down lithography, but may also be prone to higher defects. Therefore, regardless of fabrication methodology, defect tolerant architectures are necessary to exploit the full potential of future increased device densities.

This paper explores a defect tolerant SIMD architecture. A key feature of our design is the ability of a large number of limited capability nodes with high defect rates (up to 30%) to self-organize into a set of SIMD processing elements. Despite node simplicity and high defect rates, we show that by supporting the familiar data parallel programming model the architecture can execute a variety of programs. The architecture efficiently exploits a large number of nodes and higher device densities to keep device switching speeds and power density low. On a medium sized system (~1 cm² area), the performance of the proposed architecture on our data parallel programs matches or exceeds the performance of an aggressively scaled out-of-order processor (128-wide, 8k reorder buffer, perfect memory system). For larger systems (>1 cm²), the proposed architecture can match the performance of a chip multiprocessor with 16 aggressively scaled out-of-order cores.

Citations: 46

Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168877

Michael I. Gordon, W. Thies, Saman P. Amarasinghe

Citations: 603

SlicK: slice-based locality exploitation for efficient redundant multithreading

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168870

A. Parashar, A. Sivasubramaniam, S. Gurumurthi

Abstract: Transient faults are expected to be a major design consideration in future microprocessors. Recent proposals for transient fault detection in processor cores have revolved around the idea of redundant threading, which involves redundant execution of a program across multiple execution contexts. This paper presents a new approach to redundant threading by bringing together the concepts of slice-level execution and value and control-flow locality into a novel partial redundant threading mechanism called SlicK.

The purpose of redundant execution is to check the integrity of the outputs propagating out of the core (typically through stores). SlicK implements redundancy at the granularity of backward-slices of these output instructions and exploits value and control-flow locality to avoid redundantly executing slices that lead to predictable outputs, thereby avoiding redundant execution of a significant fraction of instructions while maintaining extremely low vulnerabilities for critical processor structures.

We propose the microarchitecture of a backward-slice extractor called SliceEM that is able to identify backward slices without interrupting the instruction flow, and show how this extractor and a set of predictors can be integrated into a redundant threading mechanism to form SlicK. Detailed simulations with SPEC CPU2000 benchmarks show that SlicK can provide around 10.2% performance improvement over a well known redundant threading mechanism, buying back over 50% of the loss suffered due to redundant execution. SlicK can keep the Architectural Vulnerability Factors of processor structures to typically 0%-2%. More importantly, SlicK's slice-based mechanisms provide future opportunities for exploring interesting points in the performance-reliability design space based on market segment needs.

Citations: 56

Impact of virtualization on computer architecture and operating systems

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168858

M. Rosenblum

Citations: 3

Accelerator: using data parallelism to program GPUs for general-purpose uses

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168898

D. Tarditi, Sidd Puri, Jose Oglesby

Citations: 351

Combinatorial sketching for finite programs

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168907

Armando Solar-Lezama, Liviu Tancau, R. Bodík, S. Seshia, V. Saraswat

Abstract: Sketching is a software synthesis approach where the programmer develops a partial implementation - a sketch - and a separate specification of the desired functionality. The synthesizer then completes the sketch to behave like the specification. The correctness of the synthesized implementation is guaranteed by the compiler, which allows, among other benefits, rapid development of highly tuned implementations without the fear of introducing bugs.

We develop SKETCH, a language for finite programs with linguistic support for sketching. Finite programs include many high-performance kernels, including cryptocodes. In contrast to prior synthesizers, which had to be equipped with domain-specific rules, SKETCH completes sketches by means of a combinatorial search based on generalized boolean satisfiability. Consequently, our combinatorial synthesizer is complete for the class of finite programs: it is guaranteed to complete any sketch in theory, and in practice has scaled to realistic programming problems.

Freed from domain rules, we can now write sketches as simple-to-understand partial programs, which are regular programs in which difficult code fragments are replaced with holes to be filled by the synthesizer. Holes may stand for index expressions, lookup tables, or bitmasks, but the programmer can easily define new kinds of holes using a single versatile synthesis operator.

We have used SKETCH to synthesize an efficient implementation of the AES cipher standard. The synthesizer produces the most complex part of the implementation and runs in about an hour.
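
The idea of completing a sketch against a specification can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: it is written in Python rather than the SKETCH language, it fills the hole by plain exhaustive enumeration rather than the satisfiability-based search the abstract describes, and the names `spec`, `sketch`, `complete`, and `WIDTH` are hypothetical.

```python
from typing import Callable, Optional

WIDTH = 8                    # hypothetical word size; finite, so every input can be checked
MASK = (1 << WIDTH) - 1


def spec(x: int) -> int:
    """Specification: multiply by 5, truncated to WIDTH bits."""
    return (x * 5) & MASK


def sketch(x: int, hole: int) -> int:
    """Partial implementation: the shift amount is left open as a hole."""
    return ((x << hole) + x) & MASK


def complete(sketch_fn: Callable[[int, int], int],
             spec_fn: Callable[[int], int]) -> Optional[int]:
    """Try every candidate hole value; accept one that agrees with the spec on all inputs."""
    for hole in range(WIDTH):
        if all(sketch_fn(x, hole) == spec_fn(x) for x in range(1 << WIDTH)):
            return hole
    return None              # no hole value makes the sketch match the spec


hole_value = complete(sketch, spec)
```

Because the program is finite (fixed word size, bounded hole range), the check over all inputs is itself a proof of equivalence for the chosen hole value, which is the property that makes the search complete for this toy case.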

Citations: 790

Tradeoffs in transactional memory virtualization

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168903

Jaewoong Chung, C. Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, C. Kozyrakis, K. Olukotun

Abstract: For transactional memory (TM) to achieve widespread acceptance, transactions should not be limited to the physical resources of any specific hardware implementation. TM systems should guarantee correct execution even when transactions exceed scheduling quanta, overflow the capacity of hardware caches and physical memory, or include more independent nesting levels than what is supported in hardware. Existing proposals for TM virtualization are either incomplete or rely on complex hardware implementations, which are overkill if virtualization is invoked infrequently in the common case.

We present eXtended Transactional Memory (XTM), the first TM virtualization system that virtualizes all aspects of transactional execution (time, space, and nesting depth). XTM is implemented in software using virtual memory support. It operates at page granularity, using private copies of overflowed pages to buffer memory updates until the transaction commits and snapshots of pages to detect interference between transactions. We also describe two enhancements to XTM that use limited hardware support to address key performance bottlenecks.

We compare XTM to hardware-based virtualization using both real applications and synthetic microbenchmarks. We show that despite being software-based, XTM and its enhancements are competitive with hardware-based alternatives. Overall, we demonstrate that XTM provides a complete, flexible, and low-cost mechanism for practical TM virtualization.
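
The page-granularity scheme mentioned in the abstract (private copies to buffer updates, snapshots to detect interference) can be pictured with a small single-threaded model. This is an illustrative toy, not the XTM implementation: the class `PageTx`, the dictionary-based "memory", and the commit rule are all assumptions made for the example, and real virtual-memory support, overflow handling, and nesting are omitted.

```python
class PageTx:
    """Toy transaction that buffers writes in private page copies (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory      # shared state: page id -> list of words
        self.snapshots = {}       # page contents as first seen by this transaction
        self.private = {}         # private copies that absorb this transaction's writes

    def _touch(self, page):
        # On first access, record a snapshot and create a private working copy.
        if page not in self.private:
            self.snapshots[page] = list(self.memory[page])
            self.private[page] = list(self.memory[page])

    def read(self, page, offset):
        self._touch(page)
        return self.private[page][offset]

    def write(self, page, offset, value):
        self._touch(page)
        self.private[page][offset] = value

    def commit(self):
        # Interference check: a page that no longer matches its snapshot was
        # changed by another transaction, so this one must abort.
        for page, snap in self.snapshots.items():
            if self.memory[page] != snap:
                return False
        # No interference: publish the private copies.
        for page, data in self.private.items():
            self.memory[page] = list(data)
        return True


memory = {0: [0, 0]}
t1, t2 = PageTx(memory), PageTx(memory)
t1.write(0, 0, 1)
t2.write(0, 1, 2)
first = t1.commit()    # commits; page 0 now differs from t2's snapshot
second = t2.commit()   # aborts, even though the two writes hit different words
```

The second transaction aborts although the two writes touch different words, which shows the false-conflict cost of comparing whole pages in this simplified model.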

Citations: 87

A regulated transitive reduction (RTR) for longer memory race recording

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168865

Min Xu, M. Hill, R. Bodík

Author note: Now at VMware.

Abstract: Multithreaded deterministic replay has important applications in cyclic debugging, fault tolerance and intrusion analysis. Memory race recording is a key technology for multithreaded deterministic replay. In this paper, we considerably improve our previous always-on Flight Data Recorder (FDR) in four ways:

• Longer recording by reducing the log size growth rate to approximately one byte per thousand dynamic instructions.
• Lower hardware cost by reducing the cost to 24 KB per processor core.
• Simpler design by modifying only the cache coherence protocol, but not the cache.
• Broader applicability by supporting both Sequential Consistency (SC) and Total Store Order (TSO) memory consistency models (existing recorders support only SC).

These improvements stem from several ideas: (1) a Regulated Transitive Reduction (RTR) recording algorithm that creates stricter and vectorizable dependencies to reduce the log growth rate; (2) a Set/LRU timestamp approximation method that better approximates timestamps of uncached memory locations to reduce the hardware cost; (3) an order-value-hybrid recording method that explicitly logs the value of potential SC-violating load instructions to support multiprocessor systems with TSO.
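
To see why transitive reduction shrinks a race log, consider the simplest case: dependences from one thread to another, each written as a pair of instruction counts. The snippet below is a minimal sketch of plain transitive reduction only. It does not implement the regulated variant the abstract describes, and the function name `reduce_log` and the `(src, dst)` encoding are assumptions made for illustration.

```python
from typing import List, Tuple

Dep = Tuple[int, int]   # (src, dst): instruction src of thread A precedes instruction dst of thread B


def reduce_log(deps: List[Dep]) -> List[Dep]:
    """Drop dependences already implied by a kept one plus program order.

    (s, d) is implied by a kept (s2, d2) when s <= s2 and d2 <= d, because
    s precedes s2 in thread A and d2 precedes d in thread B.
    """
    kept: List[Dep] = []
    newest_src = -1
    # Visit in destination order; on ties, take the largest source first so
    # that weaker dependences on the same destination are filtered out.
    for src, dst in sorted(deps, key=lambda dep: (dep[1], -dep[0])):
        if src > newest_src:      # not ordered by anything kept so far
            kept.append((src, dst))
            newest_src = src
    return kept


log = reduce_log([(1, 2), (1, 5), (3, 4), (2, 6), (7, 8)])
```

In this example `(1, 5)` and `(2, 6)` are dropped because `(3, 4)` already forces instructions 1 to 3 of the first thread to precede instruction 4 onward of the second, so only three of the five dependences need to be logged.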

Citations: 115

Recording shared memory dependencies using strata

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168886

S. Narayanasamy, C. Pereira, B. Calder

Citations: 133

Instruction scheduling for a tiled dataflow architecture

ASPLOS XII Pub Date : 2006-10-23 DOI: 10.1145/1168857.1168876

M. Kim, S. Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, M. Oskin, S. Eggers

Citations: 43