Latest publications: International Conference on Compilers, Architecture, and Synthesis for Embedded Systems

Towards scalable reliability frameworks for error prone CMPs
Joseph Sloan, Rakesh Kumar
{"title":"Towards scalable reliability frameworks for error prone CMPs","authors":"Joseph Sloan, Rakesh Kumar","doi":"10.1145/1629395.1629432","DOIUrl":"https://doi.org/10.1145/1629395.1629432","url":null,"abstract":"As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates would be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core-level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required for attaining system reliability goals in face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error prone chip multiprocessors(CMPs). The framework supports in-network fault tolerance where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router utilizes the expected redundancy in vote messages, to reduce some of the blocking overhead incurred at the leader, and also provide a mechanism to trade-off network bandwidth with latency. Our framework also supports proactive checkpoint deallocation which allows cores participating in voting to continue on with execution instead of waiting on notification from the voting logic. Finally, the framework supports dynamic constitution that allows an arbitrary core on this chip to be a part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance/bandwidth benefits from these optimizations.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123013402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
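As a rough illustration of the majority-voting step at the heart of any NMR scheme (a minimal software sketch, not the paper's in-network router logic), consider voting over the redundant results of one NMR group:

```python
from collections import Counter

def nmr_vote(results):
    """Majority-vote over N redundant results from an NMR group.

    Returns (winner, agreed); `agreed` is False when no value reaches a
    strict majority, i.e. the error must be handled some other way,
    such as rollback to a checkpoint.
    """
    counts = Counter(results)
    winner, votes = counts.most_common(1)[0]
    return winner, votes > len(results) // 2

# Example: a 5-modular group in which one replica's result was corrupted.
replica_outputs = [0xBEEF, 0xBEEF, 0xBEEF, 0xBEEF, 0xDEAD]
value, ok = nmr_vote(replica_outputs)
print(value, ok)   # 48879 True -> voting masks the single error
```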
CGRA express: accelerating execution using dynamic operation fusion
Yongjun Park, Hyunchul Park, S. Mahlke
{"title":"CGRA express: accelerating execution using dynamic operation fusion","authors":"Yongjun Park, Hyunchul Park, S. Mahlke","doi":"10.1145/1629395.1629433","DOIUrl":"https://doi.org/10.1145/1629395.1629433","url":null,"abstract":"Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been effectively used for innermost loops that contain an abundant of instruction-level parallelism. Conversely, non-loop and outer-loop code are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective as the majority of the resources remain idle. In this paper, dynamic operation fusion is introduced to enable CGRAs to effectively accelerate latency-constrained code regions. Dynamic operation fusion is enabled through the combination of a small bypass network added between function units in a conventional CGRA and a sub-cycle modulo scheduler to automatically identify opportunities for fusion. Results show that dynamic operation fusion reduced total application run-time by up to 17% on a 4x4 CGRA.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125752363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 65
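The enabling observation behind dynamic operation fusion is that several dependent, short-latency operations can complete within one clock cycle when a bypass network forwards results between function units. A hedged sketch of that packing decision on a dependence chain (the delay numbers are illustrative, and this is a greedy toy scheduler, not the paper's sub-cycle modulo scheduler):

```python
# Greedily pack a chain of dependent operations into clock cycles: ops are
# "fused" into the same cycle as long as their combined delay still fits.
CYCLE_TIME_NS = 1.0
OP_DELAY_NS = {"add": 0.35, "sub": 0.35, "and": 0.2, "shift": 0.25, "mul": 0.9}

def schedule_chain(ops):
    cycles, current, used = [], [], 0.0
    for op in ops:
        delay = OP_DELAY_NS[op]
        if current and used + delay <= CYCLE_TIME_NS:
            current.append(op)          # fuse into the current cycle
            used += delay
        else:
            if current:
                cycles.append(current)
            current, used = [op], delay  # start a new cycle
    if current:
        cycles.append(current)
    return cycles

print(schedule_chain(["add", "and", "shift", "mul", "add", "sub"]))
# [['add', 'and', 'shift'], ['mul'], ['add', 'sub']] -> 3 cycles instead of 6
```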
Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware
Hoseok Chang, Wonyong Sung
{"title":"Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware","authors":"Hoseok Chang, Wonyong Sung","doi":"10.1145/1450095.1450121","DOIUrl":"https://doi.org/10.1145/1450095.1450121","url":null,"abstract":"Automatic vectorization of programs for partitioned-ALU SIMD (Single Instruction Multiple Data) processors has been difficult because of not only data dependency issues but also non-aligned and irregular data access problems. A non-aligned or irregular data access operation incurs many overhead cycles for data alignment. Moreover, this causes difficulty in efficient code generation and hinders automatic vectorization. In this paper, we employ special memory access hardware for improving the performance of SIMD processors; one is the split line buffer and the other is the packing buffer. The former solves the non-aligned memory access problem, while the latter simplifies irregular and stride data access. The addition of these hardware units not only requires very small changes to the instruction set architecture but also contributes to the significant performance improvement by vectorizing more loops and reducing the overhead cycles. We have also developed an auto-vectorization compiler which utilizes these special hardware units. Experiments have been conducted to compare the proposed method with the conventional one, which show 50% increase in the number of vectorized loops and 77% increase in the total performance of an MPEG2 encoder program.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127862545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
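The idea behind a split line buffer is that an unaligned vector access can be served by fetching the two aligned lines it straddles and splicing the requested bytes together. A minimal model of that splicing, assuming an illustrative 16-byte line size (not the paper's hardware parameters):

```python
LINE = 16  # bytes per aligned memory line (illustrative)

def unaligned_load(memory: bytes, addr: int, width: int = 8) -> bytes:
    """Model a split-line-buffer load: fetch the two aligned lines that the
    access straddles, then extract `width` bytes starting at `addr`."""
    base = (addr // LINE) * LINE
    lo = memory[base:base + LINE]             # first aligned line
    hi = memory[base + LINE:base + 2 * LINE]  # second aligned line (may be unused)
    window = lo + hi
    offset = addr - base
    return window[offset:offset + width]

mem = bytes(range(64))
print(list(unaligned_load(mem, 13)))  # [13..20]: the access crosses a line boundary
```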
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip
A. Reid, K. Flautner, Edmund Grimley-Evans, Yuan Lin
{"title":"SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip","authors":"A. Reid, K. Flautner, Edmund Grimley-Evans, Yuan Lin","doi":"10.1145/1450095.1450112","DOIUrl":"https://doi.org/10.1145/1450095.1450112","url":null,"abstract":"The architectures of system-on-chip (SoC) platforms found in high-end consumer devices are getting more and more complex as designers strive to deliver increasingly compute-intensive applications on near-constant energy budgets. Workloads running on these platforms require the exploitation of heterogeneous parallelism and increasingly irregular memory hierarchies. The conventional approach to programming such hardware is very lowlevel but this yields software which is intimately and inseparably tied to the details of the platform it was originally designed for, limiting the software's portability, and, ultimately, the architectural choices available to designers of future platform generations. The key insight of this paper is that many of the problems experienced in mapping applications onto SoC platforms come not from deciding how to map a program onto the hardware but from the need to restructure the program and the number of interdependencies introduced in the process of implementing those decisions. We tackle this complexity with a set of language extensions which allows the programmer to introduce pipeline parallelism into sequential programs, manage distributed memories, and express the desired mapping of tasks to resources. The compiler takes care of the complex, error-prone details required to implement that mapping. We demonstrate the effectiveness of SoC-C and its compiler with a \"software defined radio\" example (the PHY layer of a Digital Video Broadcast receiver) achieving a 3.4x speedup on 4 cores.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127734701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
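The pipeline parallelism that SoC-C lets the programmer express can be pictured as a sequential program split into stages connected by queues, with the compiler generating the inter-processor data movement. The sketch below is a generic queue-and-thread model of that execution structure, not SoC-C syntax (the stage functions are placeholders):

```python
import queue
import threading

def stage(fn, inq, outq):
    """Run one pipeline stage: pull items, apply fn, push results downstream."""
    while True:
        item = inq.get()
        if item is None:                 # sentinel: propagate shutdown downstream
            if outq is not None:
                outq.put(None)
            break
        result = fn(item)
        if outq is not None:
            outq.put(result)

# Three-stage pipeline with placeholder stage functions.
q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)),
    threading.Thread(target=stage, args=(lambda x: x + 1, q2, q3)),
    threading.Thread(target=stage, args=(print, q3, None)),
]
for t in threads:
    t.start()
for sample in range(4):
    q1.put(sample)                       # feed input samples into the first stage
q1.put(None)                             # shut the pipeline down
for t in threads:
    t.join()
```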
Exploring and predicting the architecture/optimising compiler co-design space
Christophe Dubach, Timothy M. Jones, M. O’Boyle
{"title":"Exploring and predicting the architecture/optimising compiler co-design space","authors":"Christophe Dubach, Timothy M. Jones, M. O’Boyle","doi":"10.1145/1450095.1450103","DOIUrl":"https://doi.org/10.1145/1450095.1450103","url":null,"abstract":"Embedded processor performance is dependent on both the underlying architecture and the compiler optimisations applied. However, designing both simultaneously is extremely difficult to achieve due to the time constraints designers must work under. Therefore, current methodology involves designing compiler and architecture in isolation, leading to sub-optimal performance of the final product.\u0000 This paper develops a novel approach to this co-design space problem. For any microarchitectural configuration we automatically predict the performance that an optimising compiler would achieve without actually building it. Once trained, a single run of -O1 on the new architecture is enough to make a prediction with just a 1.6% error rate. This allows the designer to accurately choose an architectural configuration with knowledge of how an optimising compiler will perform on it. We use this to find the best optimising compiler/architectural configuration in our co-design space and demonstrate that it achieves an average 13% performance improvement and energy savings of 23% compared to the baseline, leading to an energy-delay (ED) value of 0.67.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133825578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 32
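The prediction step amounts to a learned model that maps cheap-to-obtain measurements (such as the single -O1 run on a candidate configuration) to the performance an optimising compiler would eventually reach. The paper's actual model and features are not reproduced here; a hedged least-squares sketch of the modelling idea, with made-up numbers:

```python
import numpy as np

# Toy training data: each row holds [cycles at -O1, code size] measured on
# some architectural configuration; the target is cycles after full
# optimisation. All numbers are invented purely to illustrate the step.
X_train = np.array([[100.0, 40.0], [200.0, 55.0], [150.0, 48.0], [300.0, 70.0]])
y_train = np.array([70.0, 135.0, 101.0, 198.0])

# Fit a linear predictor y ~= [X, 1] @ w with ordinary least squares.
A = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Predict optimised performance for a new configuration from one -O1 run.
x_new = np.array([180.0, 50.0, 1.0])
print(float(x_new @ w))
```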
StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems
S. Gupta, Shuguang Feng, Amin Ansari, J. Blome, S. Mahlke
{"title":"StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems","authors":"S. Gupta, Shuguang Feng, Amin Ansari, J. Blome, S. Mahlke","doi":"10.1145/1450095.1450099","DOIUrl":"https://doi.org/10.1145/1450095.1450099","url":null,"abstract":"Although CMOS feature size scaling has been the source of dramatic performance gains, it has lead to mounting reliability concerns due to increasing power densities and on-chip temperatures. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Traditional techniques for dealing with device failures have relied on coarse-grained redundancy to maintain service in the face of failed components. In this work, we challenge this practice by identifying its inability to scale to high failure rate scenarios and investigate the advantages of finer-grained configurations. We use this study to motivate the design of StageNet, an embedded CMP architecture designed from its inception with reliability as a first class design constraint. StageNet relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of the chip, gracefully degrading performance toward end of life. This paper addresses the microarchitecture of the basic building block of StageNet, named StageNetSlice, which is a processor core comprised of networked pipeline stages. A naive slice design results in approximately 4X slowdown verses a traditional processor due to longer communication delays in the pipeline. However, several small design changes that eliminate inter-stage communication paths and minimize communication bandwidth reduce this overhead to 11% on average while providing high levels of fine-grain adaptability.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
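The resilience argument rests on the fact that a logical pipeline no longer has to be a fixed column of stages: any surviving stage can be borrowed across slices through the interconnect. A minimal counting sketch of that effect (a generic model of stage pooling, not the StageNet microarchitecture itself):

```python
def form_logical_cores(stage_status):
    """stage_status maps a stage name to a list of booleans, one per physical
    slice. The number of complete logical pipelines that can be stitched
    together is the minimum number of working units over all stage types."""
    return min(sum(ok) for ok in stage_status.values())

# Four physical slices; False marks a worn-out stage.
status = {
    "fetch":     [True,  True,  False, True],
    "decode":    [True,  True,  True,  True],
    "execute":   [False, True,  True,  True],
    "writeback": [True,  False, True,  True],
}
# Core-level redundancy would keep only slices with *all* stages alive (here: 1),
# while stage-level reconfiguration still yields 3 logical cores.
print(form_logical_cores(status))   # 3
```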
Non-intrusive dynamic application profiler for detailed loop execution characterization
Ajay Nair, Roman L. Lysecky
{"title":"Non-intrusive dynamic application profiler for detailed loop execution characterization","authors":"Ajay Nair, Roman L. Lysecky","doi":"10.1145/1450095.1450102","DOIUrl":"https://doi.org/10.1145/1450095.1450102","url":null,"abstract":"Application profiling - the process of monitoring an application to determine the frequency of execution within specific regions - is an essential step within the design process for many software and hardware systems. In this paper, we present an efficient innovative, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application's short backwards branches and providing detailed profiling statistics for characterizing loop execution behavior. DAProf is ideally suited for hardware/software partitioning approaches in which detailed loop execution information is needed to provide accurate performance estimates. DAProf provides a profiling accuracy of greater than 90% with only an 11% area overhead compared to a small ARM9.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132486873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
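The profiler's trigger is a short backward branch: a taken branch whose target lies a short distance before the branch itself almost always closes a loop, so each occurrence counts one loop iteration. A small software model of that detection over a branch trace (DAProf itself is a hardware unit; the 256-byte threshold is an assumed illustrative value):

```python
def profile_loops(branch_trace, max_offset=256):
    """branch_trace: iterable of (branch_pc, target_pc, taken) tuples.
    Returns iteration counts per loop, keyed by the loop's backward-branch PC."""
    loops = {}
    for pc, target, taken in branch_trace:
        if taken and target < pc and pc - target <= max_offset:
            loops[pc] = loops.get(pc, 0) + 1   # short backward branch == one iteration
    return loops

# A loop body spanning 0x100..0x120 that iterates 10 times, plus one long
# backward branch that is too far away to be treated as a loop.
trace = [(0x120, 0x100, True)] * 10 + [(0x120, 0x100, False), (0x400, 0x0, True)]
print(profile_loops(trace))   # {288: 10}
```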
Cache-aware cross-profiling for Java processors
Walter Binder, A. Villazón, Martin Schoeberl, Philippe Moret
{"title":"Cache-aware cross-profiling for java processors","authors":"Walter Binder, A. Villazón, Martin Schoeberl, Philippe Moret","doi":"10.1145/1450095.1450116","DOIUrl":"https://doi.org/10.1145/1450095.1450116","url":null,"abstract":"Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. Prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this paper, we introduce a customizable cross-profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java Virtual Machine, but the generated profiles represent the execution time metric of the target system. Our cross-profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross-profiling framework also enables the rapid evaluation of the performance impact of possible optimizations, such as different caching strategies.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132690909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
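The target-specific part of such a cross-profile is a cost model: each executed bytecode contributes its cycle cost on the target, and a method invocation additionally pays a miss penalty when the callee is not resident in the method cache. A simplified sketch of that accounting, with invented cycle numbers and a toy FIFO cache (not JOP's actual timing or cache organization):

```python
# Illustrative per-bytecode cycle costs and a FIFO method-cache model.
CYCLES = {"iload": 1, "iadd": 1, "istore": 1, "invoke": 4}
MISS_PENALTY = 20          # extra cycles to load a method into the cache
CACHE_SLOTS = 2            # number of methods the cache can hold (toy value)

def cross_profile(events):
    """events: list of ("bytecode", name) or ("call", method) records, as
    would be produced by instrumentation running on the host JVM."""
    cycles, cache = 0, []
    for kind, name in events:
        if kind == "bytecode":
            cycles += CYCLES[name]
        else:                               # method call: check the method cache
            if name not in cache:
                cycles += MISS_PENALTY
                cache.append(name)
                if len(cache) > CACHE_SLOTS:
                    cache.pop(0)            # evict the oldest method (FIFO)
            cycles += CYCLES["invoke"]
    return cycles

events = [("call", "main"), ("bytecode", "iload"), ("call", "f"),
          ("bytecode", "iadd"), ("call", "f"), ("bytecode", "istore")]
print(cross_profile(events))   # second call to "f" hits the method cache
```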
Compiling custom instructions onto expression-grained reconfigurable architectures
Paolo Bonzini, G. Ansaloni, L. Pozzi
{"title":"Compiling custom instructions onto expression-grained reconfigurable architectures","authors":"Paolo Bonzini, G. Ansaloni, L. Pozzi","doi":"10.1145/1450095.1450106","DOIUrl":"https://doi.org/10.1145/1450095.1450106","url":null,"abstract":"While customizable processors aim at combining the flexibility of general purpose processors with the speed and power advantages of custom circuits, commercially available processors are often limited by the inability to reconfigure the application-specific features after manufacturing. Even though reconfigurable array-based accelerators are available, their performance is often unacceptable, and comes with other disadvantages such as the size of the configuration bitstream. Additionally, compilation support is limited for existing Coarse Grain Reconfigurable Arrays (CGRAs).\u0000 We propose to target a different reconfigurable fabric, the EGRA (Expression-Grained Reconfigurable Array), to realize custom instructions in a customizable processor. The EGRA is based on arithmetic processing elements that can compute entire subexpressions in a single cycle and can be connected in both combinational or sequential manners. We present here a compilation flow for this architecture, including novel algorithms for subgraph enumeration and scheduling. The compilation flow proposed is used here to efficiently explore the design space of the EGRA processing element; furthermore, its modularity and flexibility suggest suitability to generic CGRA retargetable compilation.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124171731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
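Subgraph enumeration in this context means finding candidate groups of dataflow operations that could be collapsed into one custom instruction. A simplified sketch of that enumeration (connected subgraphs up to a size bound; the paper's algorithm additionally handles convexity and I/O constraints, which are omitted here):

```python
def connected_subgraphs(adj, max_size):
    """Enumerate connected node subsets (up to max_size nodes) of an
    undirected dataflow graph given as adj: node -> set of neighbours."""
    found = set()

    def grow(current, frontier):
        key = frozenset(current)
        if key in found or len(current) > max_size:
            return
        found.add(key)
        for n in frontier:
            grow(current | {n}, (frontier | adj[n]) - current - {n})

    for node in adj:
        grow({node}, set(adj[node]))
    return found

# Dataflow graph of the expression (a + b) * (a - c): nodes are operations.
adj = {"add": {"mul"}, "sub": {"mul"}, "mul": {"add", "sub"}}
for s in sorted(connected_subgraphs(adj, 3), key=len):
    print(sorted(s))
# {add}, {sub}, {mul}, {add,mul}, {sub,mul}, {add,sub,mul} -> candidate fused ops
```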
Power management of MEMS-based storage devices for mobile systems
Mohammed G. Khatib, P. Hartel
{"title":"Power management of MEMS-based storage devices for mobile systems","authors":"Mohammed G. Khatib, P. Hartel","doi":"10.1145/1450095.1450131","DOIUrl":"https://doi.org/10.1145/1450095.1450131","url":null,"abstract":"Because of its small form factor, high capacity, and expected low cost, MEMS-based storage is a suitable storage technology for mobile systems. MEMS-based storage devices should also be energy efficient for deployment in mobile systems. The problem is that MEMS-based storage devices are mechanical, and thus consume a large amount of energy when idle. Therefore, a power management (PM) policy is needed that maximizes energy saving while minimizing performance degradation. In this work, we quantitatively demonstrate the optimality of the fixed-timeout PM policy for MEMS-based storage devices. Because the media sled is suspended by springs across the head array in MEMS-based storage devices, we show that these devices (1) lack mechanical startup overhead and (2) exhibit small shutdown overhead. As a result, we show that the combination of a PM policy, that fixes the timeout in the range of 1--10 ms, and a shutdown policy, that exploits the springs, results in a near-optimal energy saving yet at a negligible loss in performance.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116709702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
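A fixed-timeout policy shuts the device down once it has been idle longer than the timeout; its quality is usually judged against an oracle that knows the length of each idle period in advance. A small sketch comparing the two on a trace of idle-period lengths (the power and overhead figures are illustrative only, not the MEMS device's actual numbers):

```python
# Energy of a fixed-timeout shutdown policy vs. an oracle, over a trace of
# idle-period lengths in seconds. Power/overhead numbers are illustrative.
P_IDLE, P_SLEEP = 1.0, 0.05      # watts while idle vs. while shut down
E_TRANSITION = 0.002             # joules of overhead per shutdown/wakeup cycle

def energy_fixed_timeout(idle_periods, timeout):
    e = 0.0
    for t in idle_periods:
        if t <= timeout:
            e += t * P_IDLE      # idle period too short: device never shuts down
        else:
            e += timeout * P_IDLE + (t - timeout) * P_SLEEP + E_TRANSITION
    return e

def energy_oracle(idle_periods):
    # The oracle shuts down immediately whenever that is cheaper than idling.
    return sum(min(t * P_IDLE, t * P_SLEEP + E_TRANSITION) for t in idle_periods)

trace = [0.0005, 0.003, 0.050, 0.200, 0.002, 1.5]    # seconds of idleness
for timeout in (0.001, 0.005, 0.010, 0.100):
    print(f"timeout {timeout * 1000:5.1f} ms -> {energy_fixed_timeout(trace, timeout):.4f} J")
print(f"oracle           -> {energy_oracle(trace):.4f} J")
```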