{"title":"Cyber physical systems: Systems engineering of industrial embedded systems - Barriers, enablers and opportunities","authors":"C. Jacobson, R. Schooler, M. Laurence","doi":"10.1109/CASES.2013.6662503","DOIUrl":"https://doi.org/10.1109/CASES.2013.6662503","url":null,"abstract":"Cyber physical systems: systems engineering of industrial embedded systems-barriers, enablers and opportunities; high-performance, scalable, general-purpose processors to accelerate high-throughput networking and security applications; Low-power high-performance asynchronous processors.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129705430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The RACECAR heuristic for automatic function specialization on multi-core heterogeneous systems","authors":"J. Wernsing, G. Stitt","doi":"10.1145/2380403.2380423","DOIUrl":"https://doi.org/10.1145/2380403.2380423","url":null,"abstract":"Embedded systems increasingly combine multi-core processors and heterogeneous resources such as graphics-processing units and field-programmable gate arrays. However, significant application design complexity for such systems caused by parallel programming and device-specific challenges has often led to untapped performance potential. Application developers targeting such systems currently must determine how to parallelize computation, create different device-specialized implementations for each heterogeneous resource, and then determine how to apportion work to each resource. In this paper, we present the RACECAR heuristic to automate the optimization of applications for multi-core heterogeneous systems by automatically exploring implementation alternatives that include different algorithms, parallelization strategies, and work distributions. 
Experimental results show that RACECAR-specialized implementations can effectively incorporate provided implementations and parallelize computation across multiple cores, graphics-processing units, and field-programmable gate arrays, improving performance by an average of 47x compared to a CPU, while the fastest provided implementations average only 33x.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115138732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Function inlining and loop unrolling for loop acceleration in reconfigurable processors","authors":"Narasinga Rao Miniskar, Pankaj Shailendra Gode, Soma Kohli, Donghoon Yoo","doi":"10.1145/2380403.2380426","DOIUrl":"https://doi.org/10.1145/2380403.2380426","url":null,"abstract":"The next generation SoCs for consumer electronics need software solutions for faster time-to-market, lower development cost and higher performance while maintaining lower energy consumption and area. As a result, reconfigurable processors (RPs) have become increasingly important, which enables just enough exibility of accepting software solutions and providing application-specific hardware reconfigurability. Samsung Electronics has developed a reconfigurable processor called Samsung Reconfigurable Processor (SRP), which is the basis of our work. Though, the SRP is a powerful processor, it requires a smart and intelligent compiler to compile the application software while exploring its reconfigurable architecture. The existing compiler for the SRP does not support functional inlining and loop unrolling, and no study has yet been done on these optimizations for the RPs. In this paper, we study the impact of these optimizations on the performance of applications for the SRP processor and we also show how these optimizations are supported in the SRP compiler. We analyze the performance improvement due to these optimizations on various benchmarks namely Sobel Edge filter, JPEG decoder, and Luma Deblocking filter of the H.264 standard. 
Our experimental results show about an 83% performance gain from the function inlining and loop unrolling optimizations, compared to the original code, for the Sobel filter and JPEG encoder, and an 11% performance gain for the Luma deblocking filter.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124604586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static task partitioning for locked caches in multi-core real-time systems","authors":"Abhik Sarkar, F. Mueller, H. Ramaprasad","doi":"10.1145/2380403.2380434","DOIUrl":"https://doi.org/10.1145/2380403.2380434","url":null,"abstract":"Locking cache lines in hard real-time systems is a common means to ensure timing predictability of data references and to lower bounds on worst-case execution time, especially in a multi-tasking environment. Growing processing demand on multi-tasking real-time systems can be met by employing scalable multi-core architectures, like the recently introduced tile-based architectures. This paper studies the use of cache locking on massive multi-core architectures with private caches in the context of hard real-time systems. In shared cache architectures, a single resource is shared among {em all} the tasks. However, in scalable cache architectures with private caches, conflicts exist only among the tasks scheduled on one core. This calls for a cache-aware allocation of tasks onto cores. Our work extends the cache-unaware First Fit Decreasing (FFD) algorithm with a Naive locked First Fit Decreasing (NFFD) policy. We further propose two cache-aware static scheduling schemes: (1) Greedy First Fit Decreasing (GFFD) and (2) Colored First Fit Decreasing (CoFFD). This work contributes an adaptation of these algorithms for conflict resolution of partially locked regions. Experiments indicate that NFFD is capable of scheduling high utilization task sets that FFD cannot schedule. Experiments also show that CoFFD consistently outperforms GFFD resulting in lower number of cores and lower system utilization. CoFFD reduces the number of core requirements from 30% to 60% compared to NFFD. With partial locking, the number of cores in some cases is reduced by almost 50% with an increase in system utilization of 10%. 
Overall, this work is unique in considering the challenges of future multi-core architectures for real-time systems and provides key insights into task partitioning with locked caches for architectures with private caches.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"581 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123937703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid just-in-time compiler for android: comparing JIT types and the result of cooperation","authors":"Guillermo A. Pérez, Chung-Min Kao, Yeh-Ching Chung, W. Hsu","doi":"10.1145/2380403.2380418","DOIUrl":"https://doi.org/10.1145/2380403.2380418","url":null,"abstract":"The Dalvik virtual machine is the main application platform running on Google's Android operating system for mobile devices and tablets. It is a Java Virtual Machine running a basic trace-based JIT compiler, unlike web browser JavaScript engines that usually run a combination of both method and trace-based JIT types. We developed a method-based JIT compiler based on the Low Level Virtual Machine framework that delivers performance improvement comparable to that of an Ahead-Of-Time compiler. We compared our method-based JIT against Dalvik's own trace-based JIT using common benchmarks available in the Android Market. Our results show that our method-based JIT is better than a basic trace-based JIT, and that, by sharing profiling and compilation information among each other, a smart combination of both JIT techniques can achieve a great performance gain.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130473879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting level-0 caches in embedded processors","authors":"Nam Duong, Taesu Kim, Dali Zhao, A. Veidenbaum","doi":"10.1145/2380403.2380435","DOIUrl":"https://doi.org/10.1145/2380403.2380435","url":null,"abstract":"Level-0 (L0) caches have been proposed in the past as an inexpensive way to improve performance and reduce energy consumption in resource-constrained embedded processors. This paper proposes new L0 data cache organizations using the assumption that an L0 hit/miss determination can be completed prior to the L1 access. This is a realistic assumption for very small L0 caches that can nevertheless deliver significant miss rate and/or energy reduction. The key issue for such caches is how and when to move data between the L0 and L1 caches. The first new cache, a flow cache, targets a conflict miss reduction in a direct-mapped L1 cache. It offers a simpler hardware design and uses on average 10% less dynamic energy than the victim cache with nearly identical performance. The second new cache, a hit cache, reduces the dynamic energy consumption in a set-associative L1 cache by 30% without impacting performance. A variant of this policy reduces the dynamic energy consumption by up to 50%, with 5% performance degradation.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129661485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cost-effective tag design for memory data authentication in embedded systems","authors":"Mei Hong, Hui Guo, X. Hu","doi":"10.1145/2380403.2380414","DOIUrl":"https://doi.org/10.1145/2380403.2380414","url":null,"abstract":"This paper presents a tag design approach for memory data integrity protection. The approach is area, power and memory efficient, suitable to embedded systems that often suffer from stringent resource restriction. Experiments have been performed to compare the proposed approach with the state-of-the-art designs, which demonstrate that the approach can produce a memory data protection design with a low resource cost - achieving overhead savings of about 39% on chip area, 45% on power consumption, 65% on performance, and 12% on memory cost while maintaining the same or higher security level.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130966351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analytical approaches for performance evaluation of networks-on-chip","authors":"A. E. Kiasari, A. Jantsch, M. Bekooij, A. Burns, Zhonghai Lu","doi":"10.1145/2380403.2380442","DOIUrl":"https://doi.org/10.1145/2380403.2380442","url":null,"abstract":"This tutorial reviews four popular mathematical formalisms -- dataflow analysis, schedulability analysis, network calculus, and queueing theory -- and how they have been applied to the analysis of Network-on-Chip (NoC) performance. We review the basic concepts and results of each formalism and provide examples of how they have been used in on-chip communication performance analysis. The tutorial also discusses the respective strengths and weaknesses of each formalism, their suitability for a specific purpose, and the attempts that have been made to bridge these analytical approaches. Finally, we conclude the tutorial by discussing open research issues.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115140473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy efficient special instruction support in an embedded processor with compact isa","authors":"Dongrui She, Yifan He, H. Corporaal","doi":"10.1145/2380403.2380430","DOIUrl":"https://doi.org/10.1145/2380403.2380430","url":null,"abstract":"The use of special instructions that execute complex operation patterns is a common approach in application specific processor design to improve performance and efficiency. However, in an embedded generic processor with compact instruction set architecture (ISA), such instructions may lead to large overhead as: i) more bits are needed to encode the extra opcodes and operands, resulting in wider instructions; ii) more register file (RF) ports are required to provide the extra operands to the function units. Such overhead may increase energy consumption considerably.\u0000 In this paper, we propose to support flexible operation pair patterns in a processor with a compact 24-bit RISC-like ISA using: i) a partially reconfigurable decoder that exploits the locality of patterns to reduce the requirement for opcode space; ii) a software controlled bypass network to reduce the requirement for operand encoding and RF ports. We also propose an energy-aware compiler backend design for the proposed architecture that performs pattern selection and bypass-aware scheduling to generate energy efficient codes. Though proposed design imposes extra constraints on the operation patterns, the experimental results show that the average dynamic instruction count is reduced by over 25%, which is only about 2% less than the architecture without such constraints. 
Due to the low overhead, the total energy of the proposed architecture is reduced by an average of 15.8% compared to the RISC baseline, while the one without constraints achieves almost no energy improvement.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134521359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"When less is more (LIMO):controlled parallelism forimproved efficiency","authors":"Gaurav Chadha, S. Mahlke, S. Narayanasamy","doi":"10.1145/2380403.2380431","DOIUrl":"https://doi.org/10.1145/2380403.2380431","url":null,"abstract":"While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive contention for shared resources, wasting energy, which is of primary concern for embedded devices. Furthermore, thermal and power constraints prevent us from operating all the processor cores at the highest possible frequency, favoring fewer threads. The best number of threads to run depends on the application, user input and hardware resources available. It can also change at runtime making it infeasible for the programmer to determine this number.\u0000 To address this problem, we propose LIMO, a runtime system that dynamically manages the number of running threads of an application for maximizing peformance and energy-efficiency. LIMO monitors threads' progress along with the usage of shared hardware resources to determine the best number of threads to run and the voltage and frequency level. 
With dynamic adaptation, LIMO provides an average of 21% performance improvement and a 2x improvement in energy-efficiency on a 32-core system over the default configuration of 32 threads for a set of concurrent applications from the PARSEC suite, the Apache web server, and the Sphinx speech recognition system.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132574865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}