International Conference on Compilers, Architecture, and Synthesis for Embedded Systems最新文献

筛选
英文 中文
Towards scalable reliability frameworks for error prone CMPs 面向易出错cmp的可扩展可靠性框架
Joseph Sloan, Rakesh Kumar
{"title":"Towards scalable reliability frameworks for error prone CMPs","authors":"Joseph Sloan, Rakesh Kumar","doi":"10.1145/1629395.1629432","DOIUrl":"https://doi.org/10.1145/1629395.1629432","url":null,"abstract":"As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates would be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core-level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required for attaining system reliability goals in face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error prone chip multiprocessors(CMPs). The framework supports in-network fault tolerance where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router utilizes the expected redundancy in vote messages, to reduce some of the blocking overhead incurred at the leader, and also provide a mechanism to trade-off network bandwidth with latency. Our framework also supports proactive checkpoint deallocation which allows cores participating in voting to continue on with execution instead of waiting on notification from the voting logic. Finally, the framework supports dynamic constitution that allows an arbitrary core on this chip to be a part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance/bandwidth benefits from these optimizations.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123013402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
CGRA express: accelerating execution using dynamic operation fusion CGRA express:利用动态运算融合加速执行
Yongjun Park, Hyunchul Park, S. Mahlke
{"title":"CGRA express: accelerating execution using dynamic operation fusion","authors":"Yongjun Park, Hyunchul Park, S. Mahlke","doi":"10.1145/1629395.1629433","DOIUrl":"https://doi.org/10.1145/1629395.1629433","url":null,"abstract":"Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing programmability with the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs have been effectively used for innermost loops that contain an abundant of instruction-level parallelism. Conversely, non-loop and outer-loop code are latency constrained and do not offer significant amounts of instruction-level parallelism. In these situations, CGRAs are ineffective as the majority of the resources remain idle. In this paper, dynamic operation fusion is introduced to enable CGRAs to effectively accelerate latency-constrained code regions. Dynamic operation fusion is enabled through the combination of a small bypass network added between function units in a conventional CGRA and a sub-cycle modulo scheduler to automatically identify opportunities for fusion. Results show that dynamic operation fusion reduced total application run-time by up to 17% on a 4x4 CGRA.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125752363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 65
Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware 具有非对齐和不规则数据访问硬件的SIMD程序的有效矢量化
Hoseok Chang, Wonyong Sung
{"title":"Efficient vectorization of SIMD programs with non-aligned and irregular data access hardware","authors":"Hoseok Chang, Wonyong Sung","doi":"10.1145/1450095.1450121","DOIUrl":"https://doi.org/10.1145/1450095.1450121","url":null,"abstract":"Automatic vectorization of programs for partitioned-ALU SIMD (Single Instruction Multiple Data) processors has been difficult because of not only data dependency issues but also non-aligned and irregular data access problems. A non-aligned or irregular data access operation incurs many overhead cycles for data alignment. Moreover, this causes difficulty in efficient code generation and hinders automatic vectorization. In this paper, we employ special memory access hardware for improving the performance of SIMD processors; one is the split line buffer and the other is the packing buffer. The former solves the non-aligned memory access problem, while the latter simplifies irregular and stride data access. The addition of these hardware units not only requires very small changes to the instruction set architecture but also contributes to the significant performance improvement by vectorizing more loops and reducing the overhead cycles. We have also developed an auto-vectorization compiler which utilizes these special hardware units. Experiments have been conducted to compare the proposed method with the conventional one, which show 50% increase in the number of vectorized loops and 77% increase in the total performance of an MPEG2 encoder program.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127862545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip SoC-C:芯片上异构多核系统的高效编程抽象
A. Reid, K. Flautner, Edmund Grimley-Evans, Yuan Lin
{"title":"SoC-C: efficient programming abstractions for heterogeneous multicore systems on chip","authors":"A. Reid, K. Flautner, Edmund Grimley-Evans, Yuan Lin","doi":"10.1145/1450095.1450112","DOIUrl":"https://doi.org/10.1145/1450095.1450112","url":null,"abstract":"The architectures of system-on-chip (SoC) platforms found in high-end consumer devices are getting more and more complex as designers strive to deliver increasingly compute-intensive applications on near-constant energy budgets. Workloads running on these platforms require the exploitation of heterogeneous parallelism and increasingly irregular memory hierarchies. The conventional approach to programming such hardware is very lowlevel but this yields software which is intimately and inseparably tied to the details of the platform it was originally designed for, limiting the software's portability, and, ultimately, the architectural choices available to designers of future platform generations. The key insight of this paper is that many of the problems experienced in mapping applications onto SoC platforms come not from deciding how to map a program onto the hardware but from the need to restructure the program and the number of interdependencies introduced in the process of implementing those decisions. We tackle this complexity with a set of language extensions which allows the programmer to introduce pipeline parallelism into sequential programs, manage distributed memories, and express the desired mapping of tasks to resources. The compiler takes care of the complex, error-prone details required to implement that mapping. We demonstrate the effectiveness of SoC-C and its compiler with a \"software defined radio\" example (the PHY layer of a Digital Video Broadcast receiver) achieving a 3.4x speedup on 4 cores.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127734701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 29
Exploring and predicting the architecture/optimising compiler co-design space 探索和预测架构/优化编译器协同设计空间
Christophe Dubach, Timothy M. Jones, M. O’Boyle
{"title":"Exploring and predicting the architecture/optimising compiler co-design space","authors":"Christophe Dubach, Timothy M. Jones, M. O’Boyle","doi":"10.1145/1450095.1450103","DOIUrl":"https://doi.org/10.1145/1450095.1450103","url":null,"abstract":"Embedded processor performance is dependent on both the underlying architecture and the compiler optimisations applied. However, designing both simultaneously is extremely difficult to achieve due to the time constraints designers must work under. Therefore, current methodology involves designing compiler and architecture in isolation, leading to sub-optimal performance of the final product.\u0000 This paper develops a novel approach to this co-design space problem. For any microarchitectural configuration we automatically predict the performance that an optimising compiler would achieve without actually building it. Once trained, a single run of -O1 on the new architecture is enough to make a prediction with just a 1.6% error rate. This allows the designer to accurately choose an architectural configuration with knowledge of how an optimising compiler will perform on it. We use this to find the best optimising compiler/architectural configuration in our co-design space and demonstrate that it achieves an average 13% performance improvement and energy savings of 23% compared to the baseline, leading to an energy-delay (ED) value of 0.67.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133825578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 32
Non-intrusive dynamic application profiler for detailed loop execution characterization 非侵入式动态应用分析器,用于详细的循环执行特性描述
Ajay Nair, Roman L. Lysecky
{"title":"Non-intrusive dynamic application profiler for detailed loop execution characterization","authors":"Ajay Nair, Roman L. Lysecky","doi":"10.1145/1450095.1450102","DOIUrl":"https://doi.org/10.1145/1450095.1450102","url":null,"abstract":"Application profiling - the process of monitoring an application to determine the frequency of execution within specific regions - is an essential step within the design process for many software and hardware systems. In this paper, we present an efficient innovative, non-intrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application's short backwards branches and providing detailed profiling statistics for characterizing loop execution behavior. DAProf is ideally suited for hardware/software partitioning approaches in which detailed loop execution information is needed to provide accurate performance estimates. DAProf provides a profiling accuracy of greater than 90% with only an 11% area overhead compared to a small ARM9.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132486873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems StageNetSlice:用于弹性CMP系统的可重构微架构构建块
S. Gupta, Shuguang Feng, Amin Ansari, J. Blome, S. Mahlke
{"title":"StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems","authors":"S. Gupta, Shuguang Feng, Amin Ansari, J. Blome, S. Mahlke","doi":"10.1145/1450095.1450099","DOIUrl":"https://doi.org/10.1145/1450095.1450099","url":null,"abstract":"Although CMOS feature size scaling has been the source of dramatic performance gains, it has lead to mounting reliability concerns due to increasing power densities and on-chip temperatures. Given that most wearout mechanisms that plague semiconductor devices are highly dependent on these parameters, significantly higher failure rates are projected for future technology generations. Traditional techniques for dealing with device failures have relied on coarse-grained redundancy to maintain service in the face of failed components. In this work, we challenge this practice by identifying its inability to scale to high failure rate scenarios and investigate the advantages of finer-grained configurations. We use this study to motivate the design of StageNet, an embedded CMP architecture designed from its inception with reliability as a first class design constraint. StageNet relies on a reconfigurable network of replicated processor pipeline stages to maximize the useful lifetime of the chip, gracefully degrading performance toward end of life. This paper addresses the microarchitecture of the basic building block of StageNet, named StageNetSlice, which is a processor core comprised of networked pipeline stages. A naive slice design results in approximately 4X slowdown verses a traditional processor due to longer communication delays in the pipeline. However, several small design changes that eliminate inter-stage communication paths and minimize communication bandwidth reduce this overhead to 11% on average while providing high levels of fine-grain adaptability.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 21
Cache-aware cross-profiling for java processors java处理器的缓存感知交叉分析
Walter Binder, A. Villazón, Martin Schoeberl, Philippe Moret
{"title":"Cache-aware cross-profiling for java processors","authors":"Walter Binder, A. Villazón, Martin Schoeberl, Philippe Moret","doi":"10.1145/1450095.1450116","DOIUrl":"https://doi.org/10.1145/1450095.1450116","url":null,"abstract":"Performance evaluation of embedded software is essential in an early development phase so as to ensure that the software will run on the embedded device's limited computing resources. Prevailing approaches either require the deployment of the software on the embedded target, which can be tedious and may be impossible in an early development phase, or rely on simulation, which can be very slow. In this paper, we introduce a customizable cross-profiling framework for embedded Java processors, including processors featuring a method cache. The developer profiles the embedded software in the host environment, completely decoupled from the target system, on any standard Java Virtual Machine, but the generated profiles represent the execution time metric of the target system. Our cross-profiling framework is based on bytecode instrumentation. We identify several pointcuts in the execution of bytecode that need to be instrumented in order to estimate the CPU cycle consumption on the target system. An evaluation using the JOP embedded Java processor as target confirms that our approach reconciles high profile accuracy with moderate overhead. Our cross-profiling framework also enables the rapid evaluation of the performance impact of possible optimizations, such as different caching strategies.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132690909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Compiling custom instructions onto expression-grained reconfigurable architectures 将自定义指令编译到表达式粒度的可重构体系结构中
Paolo Bonzini, G. Ansaloni, L. Pozzi
{"title":"Compiling custom instructions onto expression-grained reconfigurable architectures","authors":"Paolo Bonzini, G. Ansaloni, L. Pozzi","doi":"10.1145/1450095.1450106","DOIUrl":"https://doi.org/10.1145/1450095.1450106","url":null,"abstract":"While customizable processors aim at combining the flexibility of general purpose processors with the speed and power advantages of custom circuits, commercially available processors are often limited by the inability to reconfigure the application-specific features after manufacturing. Even though reconfigurable array-based accelerators are available, their performance is often unacceptable, and comes with other disadvantages such as the size of the configuration bitstream. Additionally, compilation support is limited for existing Coarse Grain Reconfigurable Arrays (CGRAs).\u0000 We propose to target a different reconfigurable fabric, the EGRA (Expression-Grained Reconfigurable Array), to realize custom instructions in a customizable processor. The EGRA is based on arithmetic processing elements that can compute entire subexpressions in a single cycle and can be connected in both combinational or sequential manners. We present here a compilation flow for this architecture, including novel algorithms for subgraph enumeration and scheduling. The compilation flow proposed is used here to efficiently explore the design space of the EGRA processing element; furthermore, its modularity and flexibility suggest suitability to generic CGRA retargetable compilation.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124171731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
Power management of MEMS-based storage devices for mobile systems 基于mems的移动系统存储设备的电源管理
Mohammed G. Khatib, P. Hartel
{"title":"Power management of MEMS-based storage devices for mobile systems","authors":"Mohammed G. Khatib, P. Hartel","doi":"10.1145/1450095.1450131","DOIUrl":"https://doi.org/10.1145/1450095.1450131","url":null,"abstract":"Because of its small form factor, high capacity, and expected low cost, MEMS-based storage is a suitable storage technology for mobile systems. MEMS-based storage devices should also be energy efficient for deployment in mobile systems. The problem is that MEMS-based storage devices are mechanical, and thus consume a large amount of energy when idle. Therefore, a power management (PM) policy is needed that maximizes energy saving while minimizing performance degradation. In this work, we quantitatively demonstrate the optimality of the fixed-timeout PM policy for MEMS-based storage devices. Because the media sled is suspended by springs across the head array in MEMS-based storage devices, we show that these devices (1) lack mechanical startup overhead and (2) exhibit small shutdown overhead. As a result, we show that the combination of a PM policy, that fixes the timeout in the range of 1--10 ms, and a shutdown policy, that exploits the springs, results in a near-optimal energy saving yet at a negligible loss in performance.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116709702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信