ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168888
Jaidev P. Patwardhan, Vijeta Johri, C. Dwyer, A. Lebeck
{"title":"A defect tolerant self-organizing nanoscale SIMD architecture","authors":"Jaidev P. Patwardhan, Vijeta Johri, C. Dwyer, A. Lebeck","doi":"10.1145/1168857.1168888","DOIUrl":"https://doi.org/10.1145/1168857.1168888","url":null,"abstract":"The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing to the exponentially increasing cost of top-down lithography. Bottom-up manufacturing techniques, like self assembly, may provide a viable lower-cost alternative to top-down lithography, but may also be prone to higher defects. Therefore, regardless of fabrication methodology, defect tolerant architectures are necessary to exploit the full potential of future increased device densities. This paper explores a defect tolerant SIMD architecture. A key feature of our design is the ability of a large number of limited capability nodes with high defect rates (up to 30%) to self-organize into a set of SIMD processing elements. Despite node simplicity and high defect rates, we show that by supporting the familiar data parallel programming model the architecture can execute a variety of programs. The architecture efficiently exploits a large number of nodes and higher device densities to keep device switching speeds and power density low. On a medium sized system (~1 cm² area), the performance of the proposed architecture on our data parallel programs matches or exceeds the performance of an aggressively scaled out-of-order processor (128-wide, 8k reorder buffer, perfect memory system). For larger systems (>1 cm²), the proposed architecture can match the performance of a chip multiprocessor with 16 aggressively scaled out-of-order cores.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128532379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168877
Michael I. Gordon, W. Thies, Saman P. Amarasinghe
{"title":"Exploiting coarse-grained task, data, and pipeline parallelism in stream programs","authors":"Michael I. Gordon, W. Thies, Saman P. Amarasinghe","doi":"10.1145/1168857.1168877","DOIUrl":"https://doi.org/10.1145/1168857.1168877","url":null,"abstract":"As multicore architectures enter the mainstream, there is a pressing demand for high-level programming models that can effectively map to them. Stream programming offers an attractive way to expose coarse-grained parallelism, as streaming applications (image, video, DSP, etc.) are naturally represented by independent filters that communicate over explicit data channels. In this paper, we demonstrate an end-to-end stream compiler that attains robust multicore performance in the face of varying application characteristics. As benchmarks exhibit different amounts of task, data, and pipeline parallelism, we exploit all types of parallelism in a unified manner in order to achieve this generality. Our compiler, which maps from the StreamIt language to the 16-core Raw architecture, attains an 11.2x mean speedup over a single-core baseline, and a 1.84x speedup over our previous work.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130731774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168870
A. Parashar, A. Sivasubramaniam, S. Gurumurthi
{"title":"SlicK: slice-based locality exploitation for efficient redundant multithreading","authors":"A. Parashar, A. Sivasubramaniam, S. Gurumurthi","doi":"10.1145/1168857.1168870","DOIUrl":"https://doi.org/10.1145/1168857.1168870","url":null,"abstract":"Transient faults are expected to be a major design consideration in future microprocessors. Recent proposals for transient fault detection in processor cores have revolved around the idea of redundant threading, which involves redundant execution of a program across multiple execution contexts. This paper presents a new approach to redundant threading by bringing together the concepts of slice-level execution and value and control-flow locality into a novel partial redundant threading mechanism called SlicK. The purpose of redundant execution is to check the integrity of the outputs propagating out of the core (typically through stores). SlicK implements redundancy at the granularity of backward-slices of these output instructions and exploits value and control-flow locality to avoid redundantly executing slices that lead to predictable outputs, thereby avoiding redundant execution of a significant fraction of instructions while maintaining extremely low vulnerabilities for critical processor structures. We propose the microarchitecture of a backward-slice extractor called SliceEM that is able to identify backward slices without interrupting the instruction flow, and show how this extractor and a set of predictors can be integrated into a redundant threading mechanism to form SlicK. Detailed simulations with SPEC CPU2000 benchmarks show that SlicK can provide around 10.2% performance improvement over a well-known redundant threading mechanism, buying back over 50% of the loss suffered due to redundant execution. SlicK can keep the Architectural Vulnerability Factors of processor structures to typically 0%-2%. More importantly, SlicK's slice-based mechanisms provide future opportunities for exploring interesting points in the performance-reliability design space based on market segment needs.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168858
M. Rosenblum
{"title":"Impact of virtualization on computer architecture and operating systems","authors":"M. Rosenblum","doi":"10.1145/1168857.1168858","DOIUrl":"https://doi.org/10.1145/1168857.1168858","url":null,"abstract":"This talk describes how virtualization is changing the way computing is done in the industry today and how it is causing users to rethink how they view hardware, operating systems, and application programs. The talk will describe this new view on computing and the benefits driving users to adopt it. The changing roles for hardware and operating systems will be discussed along with what changes will be needed to efficiently and simply support this new computing model. I will conclude with a discussion of areas where industry could use input from the ASPLOS research community.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123537882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168898
D. Tarditi, Sidd Puri, Jose Oglesby
{"title":"Accelerator: using data parallelism to program GPUs for general-purpose uses","authors":"D. Tarditi, Sidd Puri, Jose Oglesby","doi":"10.1145/1168857.1168898","DOIUrl":"https://doi.org/10.1145/1168857.1168898","url":null,"abstract":"GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that uses data parallelism to program GPUs for general-purpose uses instead. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124326556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168907
Armando Solar-Lezama, Liviu Tancau, R. Bodík, S. Seshia, V. Saraswat
{"title":"Combinatorial sketching for finite programs","authors":"Armando Solar-Lezama, Liviu Tancau, R. Bodík, S. Seshia, V. Saraswat","doi":"10.1145/1168857.1168907","DOIUrl":"https://doi.org/10.1145/1168857.1168907","url":null,"abstract":"Sketching is a software synthesis approach where the programmer develops a partial implementation - a sketch - and a separate specification of the desired functionality. The synthesizer then completes the sketch to behave like the specification. The correctness of the synthesized implementation is guaranteed by the compiler, which allows, among other benefits, rapid development of highly tuned implementations without the fear of introducing bugs. We develop SKETCH, a language for finite programs with linguistic support for sketching. Finite programs include many high-performance kernels, including cryptocodes. In contrast to prior synthesizers, which had to be equipped with domain-specific rules, SKETCH completes sketches by means of a combinatorial search based on generalized boolean satisfiability. Consequently, our combinatorial synthesizer is complete for the class of finite programs: it is guaranteed to complete any sketch in theory, and in practice has scaled to realistic programming problems. Freed from domain rules, we can now write sketches as simple-to-understand partial programs, which are regular programs in which difficult code fragments are replaced with holes to be filled by the synthesizer. Holes may stand for index expressions, lookup tables, or bitmasks, but the programmer can easily define new kinds of holes using a single versatile synthesis operator. We have used SKETCH to synthesize an efficient implementation of the AES cipher standard. The synthesizer produces the most complex part of the implementation and runs in about an hour.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124757815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168903
Jaewoong Chung, C. Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, C. Kozyrakis, K. Olukotun
{"title":"Tradeoffs in transactional memory virtualization","authors":"Jaewoong Chung, C. Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D. Carlstrom, C. Kozyrakis, K. Olukotun","doi":"10.1145/1168857.1168903","DOIUrl":"https://doi.org/10.1145/1168857.1168903","url":null,"abstract":"For transactional memory (TM) to achieve widespread acceptance, transactions should not be limited to the physical resources of any specific hardware implementation. TM systems should guarantee correct execution even when transactions exceed scheduling quanta, overflow the capacity of hardware caches and physical memory, or include more independent nesting levels than what is supported in hardware. Existing proposals for TM virtualization are either incomplete or rely on complex hardware implementations, which are an overkill if virtualization is invoked infrequently in the common case. We present eXtended Transactional Memory (XTM), the first TM virtualization system that virtualizes all aspects of transactional execution (time, space, and nesting depth). XTM is implemented in software using virtual memory support. It operates at page granularity, using private copies of overflowed pages to buffer memory updates until the transaction commits and snapshots of pages to detect interference between transactions. We also describe two enhancements to XTM that use limited hardware support to address key performance bottlenecks. We compare XTM to hardware-based virtualization using both real applications and synthetic microbenchmarks. We show that despite being software-based, XTM and its enhancements are competitive with hardware-based alternatives. Overall, we demonstrate that XTM provides a complete, flexible, and low-cost mechanism for practical TM virtualization.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134579272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168865
Min Xu, M. Hill, R. Bodík
{"title":"A regulated transitive reduction (RTR) for longer memory race recording","authors":"Min Xu, M. Hill, R. Bodík","doi":"10.1145/1168857.1168865","DOIUrl":"https://doi.org/10.1145/1168857.1168865","url":null,"abstract":"Now at VMware. Multithreaded deterministic replay has important applications in cyclic debugging, fault tolerance and intrusion analysis. Memory race recording is a key technology for multithreaded deterministic replay. In this paper, we considerably improve our previous always-on Flight Data Recorder (FDR) in four ways: • Longer recording by reducing the log size growth rate to approximately one byte per thousand dynamic instructions. • Lower hardware cost by reducing the cost to 24 KB per processor core. • Simpler design by modifying only the cache coherence protocol, but not the cache. • Broader applicability by supporting both Sequential Consistency (SC) and Total Store Order (TSO) memory consistency models (existing recorders support only SC). These improvements stem from several ideas: (1) a Regulated Transitive Reduction (RTR) recording algorithm that creates stricter and vectorizable dependencies to reduce the log growth rate; (2) a Set/LRU timestamp approximation method that better approximates timestamps of uncached memory locations to reduce the hardware cost; (3) an order-value-hybrid recording method that explicitly logs the value of potential SC-violating load instructions to support multiprocessor systems with TSO.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123475322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168886
S. Narayanasamy, C. Pereira, B. Calder
{"title":"Recording shared memory dependencies using strata","authors":"S. Narayanasamy, C. Pereira, B. Calder","doi":"10.1145/1168857.1168886","DOIUrl":"https://doi.org/10.1145/1168857.1168886","url":null,"abstract":"Significant time is spent by companies trying to reproduce and fix bugs. BugNet and FDR are recent architecture proposals that provide architecture support for deterministic replay debugging. They focus on continuously recording information about the program's execution, which can be communicated back to the developer. Using that information, the developer can deterministically replay the program's execution to reproduce and fix the bugs. In this paper, we propose using Strata to efficiently capture the shared memory dependencies. A stratum creates a time layer across all the logs for the running threads, which separates all the memory operations executed before and after the stratum. A strata log allows us to determine all the shared memory dependencies during replay and thereby supports deterministic replay debugging for multi-threaded programs.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123973301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPLOS XII | Pub Date: 2006-10-23 | DOI: 10.1145/1168857.1168876
M. Kim, S. Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, M. Oskin, S. Eggers
{"title":"Instruction scheduling for a tiled dataflow architecture","authors":"M. Kim, S. Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, M. Oskin, S. Eggers","doi":"10.1145/1168857.1168876","DOIUrl":"https://doi.org/10.1145/1168857.1168876","url":null,"abstract":"This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule. Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.","PeriodicalId":270694,"journal":{"name":"ASPLOS XII","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129760349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}