ASPLOS XI: latest papers, page 2

Scalable selective re-execution for EDGE architectures

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024408

R. Desikan, S. Sethumadhavan, D. Burger, S. Keckler

Abstract: Pipeline flushes are becoming increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculations by re-executing only instructions affected by the mis-speculation, instead of all instructions. In this paper we introduce a new selective re-execution mechanism that exploits the properties of a dataflow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery, while scaling to window sizes of thousands of instructions with high performance. This distributed selective re-execution (DSRE) protocol permits multiple speculative waves of computation to be traversing a dataflow graph simultaneously, with a commit wave propagating behind them to ensure correct execution. We evaluate one application of this protocol to provide efficient recovery for load-store dependence speculation. Unlike traditional dataflow architectures, which resorted to single-assignment memory semantics, the DSRE protocol combines dataflow execution with speculation to enable high performance and conventional sequential memory semantics. Our experiments show that the DSRE protocol results in an average 17% speedup over the best dependence predictor proposed to date, and obtains 82% of the performance possible with a perfect oracle directing the issue of loads.

Citations: 16

Coherence decoupling: making use of incoherence

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024406

Jaehyuk Huh, Jichuan Chang, D. Burger, G. Sohi

Abstract: This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce, if not eliminate, the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness. The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.
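
The speculate-then-verify idea in this abstract can be illustrated with a toy model. The sketch below is not the paper's protocol: the names `CacheLine`, `speculative_load`, and `verify` are hypothetical, a line is modeled as a short list of words, and the backing coherence protocol is reduced to a single call that delivers the coherent data. It assumes an invalidated line keeps its stale contents, so a load can use them speculatively and be checked later.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    VALID = "valid"
    INVALID = "invalid"


@dataclass
class CacheLine:
    state: State
    words: list  # assumption: stale data is retained after invalidation


def speculative_load(line, offset):
    """Lookup step: return (value, is_speculative) without waiting for coherence."""
    return line.words[offset], line.state is State.INVALID


def verify(line, offset, speculative_value, coherent_words):
    """Backing step: install the coherent data and check the earlier speculation."""
    line.words = list(coherent_words)
    line.state = State.VALID
    return speculative_value == coherent_words[offset]


# False sharing: a remote writer changed word 1, but this core reads word 0.
line = CacheLine(State.INVALID, [10, 20])
value, speculative = speculative_load(line, 0)
false_sharing_ok = verify(line, 0, value, [10, 99])

# True sharing: the remote writer changed the very word this core reads.
line2 = CacheLine(State.INVALID, [10, 20])
value2, _ = speculative_load(line2, 1)
true_sharing_ok = verify(line2, 1, value2, [10, 99])

print(false_sharing_ok, true_sharing_ok)
```

In this toy model the false-sharing load verifies successfully because the word it read was never modified, while the true-sharing load fails verification and would have to be recovered.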

Citations: 77

An ultra low-power processor for sensor networks

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024397

Virantha N. Ekanayake, Clinton Kelly IV, R. Manohar

Abstract: We present a novel processor architecture designed specifically for use in low-power wireless sensor-network nodes. Our sensor network asynchronous processor (SNAP/LE) is based on an asynchronous data-driven 16-bit RISC core with an extremely low-power idle state, and a wakeup response latency on the order of tens of nanoseconds. The processor instruction set is optimized for sensor-network applications, with support for event scheduling, pseudo-random number generation, bitfield operations, and radio/sensor interfaces. SNAP/LE has a hardware event queue and event coprocessors, which allow the processor to avoid the overhead of operating system software (such as task schedulers and external interrupt servicing), while still providing a straightforward programming interface to the designer. The processor can meet performance levels required for data monitoring applications while executing instructions with tens of picojoules of energy. We evaluate the energy consumption of SNAP/LE with several applications representative of the workload found in data-gathering wireless sensor networks. We compare our architecture and software against existing platforms for sensor networks, quantifying both the software and hardware benefits of our approach.

Citations: 137

Compiler orchestrated prefetching via speculation and predication

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024416

R. Rabbah, Hariharan Sandanagobalane, M. Ekpanyapong, W. Wong

Abstract: This paper introduces a compiler orchestrated prefetching system as a unified framework geared toward ameliorating the gap between processing speeds and memory access latencies. We focus the scope of the optimization on specific subsets of the program dependence graph that succinctly characterize the memory access pattern of both regular array-based applications and irregular pointer-intensive programs. We illustrate how program embedded precomputation via speculative execution can accurately predict and effectively prefetch future memory references with negligible overhead. The proposed techniques reduce the total running time of seven SPEC benchmarks and two OLDEN benchmarks by 27% on an Itanium 2 processor. The improvements are in addition to several state-of-the-art optimizations including software pipelining and data prefetching. In addition, we use cycle-accurate simulations to identify important and lightweight architectural innovations that further mitigate the memory system bottleneck. In particular, we focus on the notoriously challenging class of pointer-chasing applications, and demonstrate how they may benefit from a novel scheme of sentineled prefetching. Our results for twelve SPEC benchmarks demonstrate that 45% of the processor stalls that are caused by the memory system are avoidable. The techniques in this paper can effectively mask long memory latencies with little instruction overhead, and can readily contribute to the performance of processors today.

Citations: 50

Programming with transactional coherence and consistency (TCC)

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1037949.1024395

Lance Hammond, Brian D. Carlstrom, Vicky Wong, Ben Hertzberg, Michael K. Chen, C. Kozyrakis, K. Olukotun

Citations: 145

HOIST: a system for automatically deriving static analyzers for embedded systems

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024410

J. Regehr, A. Reid

Abstract: Embedded software must meet conflicting requirements such as being highly reliable, running on resource-constrained platforms, and being developed rapidly. Static program analysis can help meet all of these goals. People developing analyzers for embedded object code face a difficult problem: writing an abstract version of each instruction in the target architecture(s). This is currently done by hand, resulting in abstract operations that are both buggy and imprecise. We have developed Hoist: a novel system that solves these problems by automatically constructing abstract operations using a microprocessor (or simulator) as its own specification. With almost no input from a human, Hoist generates a collection of C functions that are ready to be linked into an abstract interpreter. We demonstrate that Hoist generates abstract operations that are correct, having been extensively tested, sufficiently fast, and substantially more precise than manually written abstract operations. Hoist is currently limited to eight-bit machines due to costs exponential in the word size of the target architecture. It is essential to be able to analyze software running on these small processors: they are important and ubiquitous, with many embedded and safety-critical systems being based on them.
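
To make the idea of deriving an abstract operation from a concrete one more tangible, here is a small brute-force sketch. It is not Hoist's actual algorithm or output (the abstract says Hoist emits C functions). It assumes a hypothetical 4-bit word, a known-bits abstract domain written as strings such as `00x1` (`x` meaning unknown), and a Python function `simulated_add` standing in for the processor or simulator. All names are illustrative.

```python
from itertools import product

WIDTH = 4  # hypothetical tiny word size; enumeration cost grows exponentially with it
MASK = (1 << WIDTH) - 1


def concretize(abstract):
    """All integers matching a string like '1x0x' (MSB first, 'x' = unknown)."""
    choices = [("0", "1") if bit == "x" else (bit,) for bit in abstract]
    return [int("".join(bits), 2) for bits in product(*choices)]


def abstract_of(values):
    """Most precise known-bits string covering every value in `values`."""
    out = []
    for pos in range(WIDTH - 1, -1, -1):
        seen = {(v >> pos) & 1 for v in values}
        out.append(str(seen.pop()) if len(seen) == 1 else "x")
    return "".join(out)


def derive_abstract_op(concrete_op):
    """Build an abstract operation by running the concrete one exhaustively."""
    def abstract_op(a, b):
        results = {concrete_op(x, y) & MASK
                   for x in concretize(a) for y in concretize(b)}
        return abstract_of(results)
    return abstract_op


def simulated_add(x, y):
    """Stand-in for the processor or simulator acting as the specification."""
    return x + y


abstract_add = derive_abstract_op(simulated_add)
result = abstract_add("00x1", "0001")
print(result)  # prints 0xx0
```

Here the inputs `00x1` and `0001` stand for the concrete pairs (1, 1) and (3, 1), whose sums 2 and 4 agree only in the top and bottom bits. Because every unknown input bit doubles the number of cases, this kind of enumeration also hints at why word size matters so much for cost.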

Citations: 53

Devirtualizable virtual machines enabling general, single-node, online maintenance

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024419

David E. Lowell, Yasushi Saito, Eileen J. Samberg

Abstract: Maintenance is the dominant source of downtime at high availability sites. Unfortunately, the dominant mechanism for reducing this downtime, cluster rolling upgrade, has two shortcomings that have prevented its broad acceptance. First, cluster-style maintenance over many nodes is typically performed a few nodes at a time, making maintenance slow and often impractical. Second, cluster-style maintenance does not work on single-node systems, despite the fact that their unavailability during maintenance can be painful for organizations. In this paper, we propose a novel technique for online maintenance that uses virtual machines to provide maintenance on single nodes, allowing parallel maintenance over multiple nodes, and online maintenance for standalone servers. We present the Microvisor, our prototype virtual machine system that is custom tailored to the needs of online maintenance. Unlike general purpose virtual machine environments that induce continual 10-20% overhead, the Microvisor virtualizes the hardware only during periods of active maintenance, letting the guest OS run at full speed most of the time. Unlike past attempts at virtual machine optimization, we do not compromise OS transparency. We instead give up generality and tailor our virtual machine system to the minimum needs of online maintenance, eschewing features, such as I/O and memory virtualization, that it does not strictly require. The result is a very thin virtual machine system that induces only 5.6% CPU overhead when virtualizing the hardware, and zero CPU overhead when devirtualized. Using the Microvisor, we demonstrate an online OS upgrade on a live, single-node web server, reducing downtime from one hour to less than one minute.

Citations: 91

D-SPTF: decentralized request distribution in brick-based storage systems

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024399

Christopher R. Lumb, Richard A. Golding

Citations: 38

Spatial computation

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024396

M. Budiu, Girish Venkataramani, Tiberiu Chelcea, S. Goldstein

Abstract: This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient. In this work we demonstrate three features of ASH: (1) that such architectures can be built by automatic compilation of C programs; (2) that distributed computation is in some respects fundamentally different from monolithic superscalar processors; and (3) that ASIC implementations of ASH use three orders of magnitude less energy compared to high-end superscalar processors, while being on average only 33% slower in performance (3.5x worst-case).

Citations: 150

Secure program execution via dynamic information flow tracking

ASPLOS XI Pub Date : 2004-10-07 DOI: 10.1145/1024393.1024404

Edward Suh, Jaewook Lee, Srini Devadas, David Zhang

Citations: 830