{"title":"Application-aware adaptive cache architecture for power-sensitive mobile processors","authors":"Garo Bournoutian, A. Orailoglu","doi":"10.1145/2539036.2539037","DOIUrl":"https://doi.org/10.1145/2539036.2539037","url":null,"abstract":"Today, mobile smartphones are expected to be able to run the same complex, algorithm-heavy, memory-intensive applications that were originally designed and coded for general-purpose processors. All the while, it is also expected that these mobile processors be power-conscientious as well as of minimal area impact. These devices pose unique usage demands of ultra-portability but also demand an always-on, continuous data access paradigm. As a result, this dichotomy of continuous execution versus long battery life poses a difficult challenge. This article explores a novel approach to mitigating mobile processor power consumption while abating any significant degradation in execution speed. The concept relies on efficiently leveraging both compile-time and runtime application memory behavior to intelligently target adjustments in the cache to significantly reduce overall processor power, taking into account both the dynamic and leakage power footprint of the cache subsystem. The simulation results show a significant reduction in power consumption of approximately 13% to 29%, while only incurring a nominal increase in execution time and area.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130908401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-optimized volume ray tracing for massive numbers of rays in radiotherapy","authors":"Bo Zhou, K. Xiao, D. Chen, X. Hu","doi":"10.1145/2539036.2539038","DOIUrl":"https://doi.org/10.1145/2539036.2539038","url":null,"abstract":"Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many applications, especially radiation-dose calculation methods in radiotherapy. However, the conflicting features between the GPU memory architecture and the memory-accessing patterns of volume ray tracing lead to inefficient usage of GPU memory bandwidth and waste of capability of modern GPUs. To improve the ray tracing performance on GPU, we propose a lookup-table-based ray tracing method which is specially optimized towards the GPU memory system for processing a massive number of rays. The proposed method is based on a key observation that many of these applications normally involve a massive number of rays, but their ray tracing may not need to follow a specific execution order. Therefore, we divide the 3D space into many regions (called pyramids) and group together the rays falling into the same pyramid. For each ray group, the volume is rotated and resampled for their raytracing. This divide-and-rotate strategy allows the memory access of the ray tracing process to adopt a table-lookup approach and leads to better memory coalescing on GPU. Our proposed method was thoroughly evaluated in four volume setups with randomly-generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with/without the proposed approach to verify the feasibility of our method. Compared with the direct GPU implementation of the popular 3DDDA algorithm, our method provides a speedup in the range of 1.91--2.94X for the volume settings we used. Major performance factors, including ray origins, volume size, and pyramid size, are also analyzed. 
The proposed technique was also found to be able to give a speedup of 1.61--2.17X over the original GPU implementation of the CCCS algorithm. Our experiment results indicate that the proposed approach is capable of offering better coalesced memory access which eventually boosts the raytracing performance on GPU. Moreover, our approach is conceptually simple and can be readily included into various applications.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"816 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133251191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software-based register file vulnerability reduction for embedded processors","authors":"Jongeun Lee, Aviral Shrivastava","doi":"10.1145/2536747.2536760","DOIUrl":"https://doi.org/10.1145/2536747.2536760","url":null,"abstract":"Register File (RF) is extremely vulnerable to soft errors, and traditional redundancy based schemes to protect the RF are prohibitive not only because RF is often in the timing critical path of the processor, but also since it is one of the hottest blocks on the chip. Software approaches would be ideal in this case, but previous approaches based on instruction scheduling are only moderately effective due to local scope. In this article we present a compiler approach, based on interprocedural program analysis, to reduce the vulnerability of registers by temporarily writing live variables to protected memory. We formulate the problem as an integer linear programming problem and also present a very efficient heuristic algorithm. Further we present an iterative optimization method based on Kernighan-Lin's graph partitioning algorithm. Our experiments demonstrate that our proposed techniques can reduce the vulnerability of a RF by 33 ∼ 37% on average and up to 66%, with a small 2% increase in runtime. In addition, our overhead reduction optimization can effectively reduce the code size overhead, by more than 40% on average, to a mere 5 ∼ 6%, compared to highly optimized binaries.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. 
Syst.","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127506107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the special section on ESTIMedia'10","authors":"N. Chang, Jian-Jia Chen","doi":"10.1145/2536747.2536748","DOIUrl":"https://doi.org/10.1145/2536747.2536748","url":null,"abstract":"Embedded multimedia systems are one of the most representative segments of today’s electronics markets. However, their design complexity is increasing every year due to the convergence of a range of technologies as well as rapidly exploding user demands. Furthermore, their time-to-market is becoming more challenging. For holistic systemwide optimization, hardware and software are tightly coupled and separation between hardware and software design is no longer feasible. This special section presents the state-of-the-art results for designing embedded multimedia systems. The eighth edition of the IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia) featured 14 high-quality technical papers, which were carefully selected from 27 submissions from all over the world. This special section features five articles, three of which were extended from ESTIMedia 2010. The range of the featured articles provides a glimpse of the current state-of-the-art in embedded systems for real-time multimedia. They cover a variety of topics including design-space exploration for MPSoC, automated generation of process networks, on-chip interconnects, scheduling of data flow, and fault resilience design for embedded multimedia applications. The article entitled A System-Level Infrastructure for Multidimensional MPSoC Design Space Co-Exploration by Z. J. Jia et al. presents a flexible and extensible systemlevel MP-SoC design space exploration (DSE) infrastructure. The article Automated Generation of Polyhedral Process Networks from Affine Nested-Loop Programs with Dynamic Loop Bounds by D. Nadezhkin et al. presents a first approach for automated translation of affine nested-loop programs with dynamic loop bounds into input-output equivalent Polyhedral Process Networks. 
Moreover, the method for calculating the first-in-first-out (FIFO) buffer sizes in such networks is also provided. The article entitled An Analytical Model for On-Chip Interconnects in Multimedia Embedded Systems by Y. Wu et al. presents a new analytical model to investigate the performance of the fat-tree based on-chip interconnection networks under bursty multimedia traffic and nonuniform message destinations. The article Scheduling of Synchronous Data Flow Models onto Scratchpad Memory-Based Embedded Processors by W. Che and K. S. Chatha presents a heuristic algorithm for scheduling synchronous data flow models on scratch pad memory-enhanced processors to minimize the steady-state execution time. The article Improving the Fault Resilience of an H.264 Decoder using Static Analysis Methods by F. Schmoll et al. presents a fault-tolerance approach for a H.264 decoder to select correction methods appropriate to error impact and current timing conditions. For this special section, at least three reviewers who are experts in their area provided numerous comments to the authors. We would like to thank all the authors who submitted manuscripts for this special section. Special thanks go to all the ","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125537069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power-aware dynamic memory management on many-core platforms utilizing DVFS","authors":"Iraklis Anagnostopoulos, Jean-Michel Chabloz, Ioannis Koutras, A. Bartzas, A. Hemani, D. Soudris","doi":"10.1145/2536747.2536762","DOIUrl":"https://doi.org/10.1145/2536747.2536762","url":null,"abstract":"Today multicore platforms are already prevalent solutions for modern embedded systems. In the future, embedded platforms will have an even more increased processor core count, composing many-core platforms. In addition, applications are becoming more complex and dynamic and try to efficiently utilize the amount of available resources on the embedded platforms. Efficient memory utilization is a key challenge for application developers, especially since memory is a scarce resource and often becomes the system's bottleneck. To cope with this dynamism and achieve better memory footprint utilization (low memory fragmentation) application developers resort to the usage of dynamic memory (heap) management techniques, by allocating and deallocating data at runtime. Moreover, overall power consumption is another key challenge that needs to be taken into consideration. Towards this, designers employ the usage of Dynamic Voltage and Frequency Scaling (DVFS) mechanisms, adapting to the application's computational demands at runtime. In this article, we propose the combination of dynamic memory management techniques with DVFS ones. This is performed by integrating, within the memory manager, runtime monitoring mechanisms that steer the DVFS mechanisms to adjust clock frequency and voltage supply based on heap performance. The proposed approach has been evaluated on a distributed shared-memory many-core platform composed of multiple LEON3 processors interconnected by a Network-on-Chip infrastructure, supporting DVFS. 
Experimental results show that by using the proposed method for monitoring and applying DVFS mechanisms the power consumption concerning dynamic memory management was reduced by approximately 37%. In addition we present the trade-offs of the proposed approach. Last, by combining the developed method with heap fragmentation-aware dynamic memory managers, we achieve low heap fragmentation values combined with low power consumption.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130976990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analytical model for on-chip interconnects in multimedia embedded systems","authors":"Yulei Wu, G. Min, Dakai Zhu, L. Yang","doi":"10.1145/2536747.2536751","DOIUrl":"https://doi.org/10.1145/2536747.2536751","url":null,"abstract":"The traffic pattern has significant impact on the performance of network-on-chip. Many recent studies have shown that multimedia applications can be supported in on-chip interconnects. Driven by the motivation of evaluating on-chip interconnects in multimedia embedded systems, a new analytical model is proposed to investigate the performance of the fat-tree based on-chip interconnection network under bursty multimedia traffic and nonuniform message destinations. Extensive simulation experiments are conducted to validate the accuracy of the model, which is then adopted as a cost-efficient tool to investigate the effects of bursty multimedia traffic with nonuniform destinations on the network performance.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129288010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Models for characterizing noise based PCMOS circuits","authors":"Anshul Singh, A. Basu, K. Ling, V. Mooney","doi":"10.1145/2536747.2536761","DOIUrl":"https://doi.org/10.1145/2536747.2536761","url":null,"abstract":"Quick and accurate error-rate prediction of Probabilistic CMOS (PCMOS) circuits is crucial for their systematic design and performance evaluation. While still in the early stage of research, PCMOS has shown potential to drastically reduce energy consumption at a cost of increased errors. Recently, a methodology has been proposed which could predict the error rates of cascade structures of blocks in PCMOS. This methodology requires error rates of unique blocks to predict the error rates of multiblock cascade structures composed of these unique blocks. In this article we present a new model for characterization of probabilistic circuits/blocks and present a procedure to find and characterize unique circuits/blocks. Unlike prior approaches, our new model distinguishes distinct filtering effects per output, thereby improving prediction accuracy by an average of 95% over the prior art by Palem and coauthors. Furthermore, we show two models where our new model with three stages is 18% more accurate, on average, than our simpler two-stage model. We apply our proposed models to Ripple Carry Adders and Wallace Tree Multipliers and show that using our models, the methodology of cascade structures can predict error rates of PCMOS circuits with reasonable accuracy (within 9%) in PCMOS for uniform voltages as well as multiple voltages. Finally, our approach takes seconds of simulation time whereas using HSPICE would take days of simulation time.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. 
Syst.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127823842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the fault resilience of an H.264 decoder using static analysis methods","authors":"F. Schmoll, A. Heinig, P. Marwedel, M. Engel","doi":"10.1145/2536747.2536753","DOIUrl":"https://doi.org/10.1145/2536747.2536753","url":null,"abstract":"Fault tolerance rapidly evolves into one of the most significant design objectives for embedded systems due to reduced semiconductor structures and supply voltages. However, resource-constrained systems cannot afford traditional error correction for overhead and cost reasons. New methods are required to sustain acceptable service quality in case of errors while avoiding crashes. We present a flexible fault-tolerance approach that is able to select correction actions depending on error semantics using application annotations and static analysis approaches. We verify the validity of our approach by analyzing the vulnerability and improving the reliability of an H.264 decoder using flexible error handling.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128308963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predictable and configurable component-based scheduling in the Composite OS","authors":"Gabriel Parmer, R. West","doi":"10.1145/2536747.2536754","DOIUrl":"https://doi.org/10.1145/2536747.2536754","url":null,"abstract":"This article presents the design of user-level scheduling hierarchies in the Composite component-based system. The motivation for this is centered around the design of a system that is both dependable and predictable, and which is configurable to the needs of specific applications. Untrusted application developers can safely develop services and policies, that are isolated in protection domains outside the kernel. To ensure predictability, Composite enforces timing control over user-space services. Moreover, it must provide a means by which asynchronous events, such as interrupts, are handled in a timely manner without jeopardizing the system. Towards this end, we describe the features of Composite that allow user-defined scheduling policies to be composed for the purposes of combined interrupt and task management. A significant challenge arises from the need to synchronize access to shared data structures (e.g., scheduling queues), without allowing untrusted code to disable interrupts. Additionally, efficient upcall mechanisms are needed to deliver asynchronous event notifications in accordance with policy-specific priorities, without undue recourse to schedulers. We show how these issues are addressed in Composite, by comparing several hierarchies of scheduling polices, to manage both tasks and the interrupts on which they depend. Studies show how it is possible to implement guaranteed differentiated services as part of the handling of I/O requests from a network device while diminishing livelock. 
Microbenchmarks indicate that the costs of implementing and invoking user-level schedulers in Composite are on par with, or less than, those in other systems, with thread switches more than twice as fast as in Linux.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. Syst.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122892601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating radiation dose calculation: A multi-FPGA solution","authors":"Bo Zhou, X. Hu, D. Chen, Cedric X. Yu","doi":"10.1145/2536747.2536755","DOIUrl":"https://doi.org/10.1145/2536747.2536755","url":null,"abstract":"Remarkable progress has been made in the past few decades in various aspects of radiation therapy (RT). However, some of these promising technologies, such as image-guided online replanning and arc therapy, rely heavily on the availability of fast dose calculation. In this article, based on a popular dose calculation algorithm, the Collapsed-Cone Convolution/Superposition (CCCS) algorithm, we present a multi-FPGA accelerator to speed up radiation dose calculation. Our performance-driven design strategy yields a fully pipelined architecture, which includes a resource-economic raytracing engine and high-performance energy deposition pipeline. An evaluation based on a set of clinical treatment planning cases confirms that our FPGA design almost fully utilizes the available external memory bandwidth and achieves close to the best possible performance for the CCCS algorithm while using fewer resources. Compared with an existing FPGA design which aimed to accelerate the identical algorithm, the proposed design achieved 1.9X speedup by providing better memory bandwidth utilization (81.7% vs. 43% of the available external memory bandwidth), higher working frequency (90MHz vs. 70MHz) and less logic resource usage (25K vs. 55K logic cells). Furthermore, it obtains a speedup of 20X over a commercial multithreaded software on a quad-core system and 15X performance improvement over closely related results. In terms of accuracy, the measured less than 1% statistical fluctuation indicates that our solution is practical in real medical scenarios.","PeriodicalId":183677,"journal":{"name":"ACM Trans. Embed. Comput. 
Syst.","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114564012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}