{"title":"The PROMPT design principles for predictable multi-core architectures","authors":"R. Wilhelm","doi":"10.1145/1543820.1543826","DOIUrl":"https://doi.org/10.1145/1543820.1543826","url":null,"abstract":"Embedded hard real-time systems need reliable guarantees for the satisfaction of their timing constraints. The precision of the results and the efficiency of timing-analysis methods are highly dependent on the predictability of the execution platform. The possibility of proving the safety of embedded systems is seriously compromised by architectural developments aiming exclusively at improving average-case performance. Proving the correctness of a modern high-performance processor is beyond the reach of verification methods. Even the chances to derive reliable and precise bounds on execution times are endangered by exactly these developments. We propose design principles for multi-core architectures to provide efficiently predictable good worst-case performance as needed for embedded control in the aeronautics and automotive industries supporting the Integrated Modular Avionics (IMA) and the Automotive Open System Architecture (AUTOSAR) development trends. This talk presents a development process oriented at achieving predictability at all levels of the architecture hierarchy.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117033961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication between nested loop programs via circular buffers in an embedded multiprocessor system","authors":"T. Bijlsma, M. Bekooij, P. Jansen, G. Smit","doi":"10.1145/1361096.1361104","DOIUrl":"https://doi.org/10.1145/1361096.1361104","url":null,"abstract":"Multimedia applications, executed by embedded multiprocessor systems, can in some cases be represented as task graphs, with the tasks containing nested loop programs. The nested loop programs communicate via arrays and can be executed on different processors. Typically an array can be communicated via a circular buffer with a capacity smaller than the array. For such buffers, the communicating nested loop programs have to synchronize and a sufficient buffer capacity needs to be computed. In a circular buffer we use a write and a read window to support rereading, out-of-order reading or writing, and skipping of locations. A cyclo static dataflow model is derived from the application and used to compute buffer capacities that guarantee deadlock free execution. Our case-study applies circular buffers in a Digital Audio Broadcasting channel decoder application, where the frequency deinterleaver reads according to a non-affine pseudo-random function. For this application, buffer capacities are calculated that guarantee deadlock free execution.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134559478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal vs. heuristic integrated code generation for clustered VLIW architectures","authors":"Mattias V. Eriksson, Oskar Skoog, C. Kessler","doi":"10.1145/1361096.1361099","DOIUrl":"https://doi.org/10.1145/1361096.1361099","url":null,"abstract":"In this paper we present two algorithms for integrated code generation for clustered VLIW architectures. One algorithm is a heuristic based on genetic algorithms, the other algorithm is based on integer linear programming. The performance of the algorithms is compared on a portion of the Mediabench [10] benchmark suite. We found the results of the genetic algorithm to be within one or two clock cycles from optimal for the cases where the optimum is known. In addition, the heuristic algorithm produces results in predictable time even when the optimal integer linear program fails.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115541880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast source-level data assignment to dual memory banks","authors":"A. Murray, Björn Franke","doi":"10.1145/1361096.1361105","DOIUrl":"https://doi.org/10.1145/1361096.1361105","url":null,"abstract":"Due to their streaming nature memory bandwidth is critical for most digital signal processing applications. To accommodate for these bandwidth requirements digital signal processors are typically equipped with dual memory banks that enable simultaneous access to two operands if the data is partitioned appropriately. Fully automated and compiler integrated approaches to data partitioning and memory bank assignment, however, have found little acceptance by DSP software developers. This is partly due to their inflexibility and inability to cope with certain manual data pre-assignments, e.g. due to I/O constraints. In this paper we present a different and more flexible approach, namely source-level dual memory assignment where code generation targets DSP-C, a standardised C language extension widely supported by industrial C compilers for DSPs. Additionally, we present a novel partitioning algorithm based on soft colouring that is more efficient and scalable than the currently known best integer linear programming algorithm, whilst achieving competitive code quality. We have evaluated our scheme on an Analog Devices TigerSHARC DSP and achieved speedups of up to 1.57 on 13 UTDSP benchmarks.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123504181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory footprint reduction for embedded systems","authors":"K. D. Bosschere","doi":"10.1145/1361096.1361102","DOIUrl":"https://doi.org/10.1145/1361096.1361102","url":null,"abstract":"The memory footprint is considered an important constraint for embedded systems. This is especially important in the context of increasing sophistication of embedded software, and the increasing use of modern software engineering techniques like component-based design. Since reusability is the major motivation for using components, most components are not optimized for the (limited) functionality they have to realize in an embedded system. All this leads to an increasing amount of code and data that might not be needed for a given functionality. The memory footprint of an embedded system consists of 2 parts: the footprint of the application and the footprint of the operating system. In this keynote talk, I will focus on the memory footprint reduction of application as well as the Linux kernel. I will report memory footprint reductions that have been obtained by the Diablo binary rewriter, which has been used to substantially reduce the memory footprint of both applications and of the system software. For the applications, the optimizer is capable of reducing the code size of programs compiled with two proprietary ARM tool chains (ADS 1.1 and RVCT 2.1) with on average 16% for statically linked ARM programs, while making them 12.8% faster. Execution of the rewritten programs also consumes on average 10.7% less energy. For the system software, we specialize the kernel both for the system calls that are actually occurring in the application program, and for the boot parameters of the kernel. We also assume that the hardware is fixed so that part of the bootstrap process is completely deterministic and can be optimized based on actual trace information. Finally, we compress frozen code, and we swap cold code to flash memory. All combined, these compaction techniques on the kernel can reduce the kernel's RAM footprint with up to 48% for the Linux kernel. The slowdown was limited to 1-2%. This proves that binary rewriting can help in substantially reducing the memory footprint of both the application and the system software. The nice thing is that it can be done automatically, and that it also reduces the execution time and the power consumption.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133343628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new heuristic for SOA problem based on effective tie break function","authors":"H. Shokry, H. M. El-Boghdadi, S. Shaheen","doi":"10.1145/1361096.1361106","DOIUrl":"https://doi.org/10.1145/1361096.1361106","url":null,"abstract":"Producing efficient and compact code for embedded DSP processors is very important for today's faster and smaller devices. Because such processors have highly irregular data paths, conventional code generation techniques typically result in inefficient code. Embedded software compilers are expected to make use of the Address Generation Unit (AGU), a feature commonly found in modern embedded DSP processors. This helps in generating optimized offset assignments to program variables in memory, and consequently minimizes the overhead instructions dedicated to address computations. This paper addresses one of the problems of code optimization, namely the Simple Offset Assignment (SOA) problem. In this paper, we study the tie break function introduced by Leupers and Marwedel [1] and show that this function does not represent the actual tie break that could happen in the graph. Then we introduce the notion of Effective Tie Break Function (ETBF) and use it in proposing a new algorithm for solving the SOA problem. We apply the algorithm to randomly generated graphs. Our results show improvement in offset assignment cost of up to 7% over well-known offset assignment algorithms [1,2,3].","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129740099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast cycle-approximate instruction set simulation","authors":"Björn Franke","doi":"10.1145/1361096.1361109","DOIUrl":"https://doi.org/10.1145/1361096.1361109","url":null,"abstract":"Instruction set simulators are indispensable tools in both ASIP design space exploration and the software development and optimisation process for existing platforms. Despite the recent progress in improving the speed of functional instruction set simulators, cycle-accurate simulation is still prohibitively slow for all but the most simple programs. This severely limits the applicability of cycle-accurate simulators in the performance evaluation of complex embedded applications. In this paper we present a novel approach, namely the prediction of cycle counts based on information gathered during fast functional simulation and prior training. We have evaluated our approach against a cycle-accurate ARM v5 architecture simulator and a large set of benchmarks. We demonstrate its capability of providing highly accurate performance predictions with an average error of less than 5.8% at a fraction of the time for cycle-accurate simulation.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133706598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WCET-driven, code-size critical procedure cloning","authors":"Paul Lokuciejewski, H. Falk, P. Marwedel, Henrik Theiling","doi":"10.1145/1361096.1361100","DOIUrl":"https://doi.org/10.1145/1361096.1361100","url":null,"abstract":"In the domain of the worst-case execution time (WCET) analysis, loops are an inherent source of unpredictability and loss of precision since the determination of tight and safe information on the number of loop iterations is a difficult task. In particular, data-dependent loops whose iteration counts depend on function parameters cannot be precisely handled by a timing analysis. Procedure Cloning can be exploited to make these loops explicit within the source code allowing a highly precise WCET analysis. In this paper we extend the standard Procedure Cloning optimization by WCET-aware concepts with the objective to improve the tightness of the WCET estimation. Our novel approach is driven by WCET information which successively eliminates code structures leading to overestimated timing results, thus making the code more suitable for the analysis. In addition, the code size increase during the optimization is monitored and large increases are avoided. The effectiveness of our optimization is shown by tests on real-world benchmarks. After performing our optimization, the estimated WCET is reduced by up to 64.2% while the employed code transformations yield an additional code size increase of 22.6% on average. In contrast, the average-case performance, being the original objective of Procedure Cloning, showed a slight decrease.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124081880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fully-non-transparent approach to the code location problem","authors":"Hugo Venturini, F. Riss, Jean-Claude Fernandez, M. Santana","doi":"10.1145/1361096.1361108","DOIUrl":"https://doi.org/10.1145/1361096.1361108","url":null,"abstract":"In the context of embedded systems such as cell phones, PDAs, or car and aircraft software, optimizations of code are required because of the timing and memory constraints imposed. Many problems arise when trying to debug optimized code. One of them is the irrelevance of the mapping between the source code and the optimized target program: the Code Location Problem. This paper proposes a solution to this problem in the case of highly optimized code in the context of embedded systems. Two approaches exist: non-transparent and transparent debugging. Our approach is non-transparent. The idea is to reveal the execution of the optimized program to the user so the latter understands the mapping to the source code in spite of transformations applied to the program. We do not emulate the execution of the unoptimized program. We make good use of the programmer's knowledge of their development platform. Standard debuggers do not provide the required mechanisms while compilers do not provide the relevant debug information. We propose a novel method to maintain accurate debug information when optimizing at compilation and we evaluate this method on the MMDSP+ C compiler and the IDBug debugger.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126106257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrated code generation by using fuzzy control system","authors":"Xiaoyan Jia, Jie Guo, G. Fettweis","doi":"10.1145/1361096.1361098","DOIUrl":"https://doi.org/10.1145/1361096.1361098","url":null,"abstract":"High quality code generation for DSPs that consist of irregular architectures is a challenge in terms of problem complexity. Since such problems are divided into several separated subtasks in the traditional compiler backends, the code quality is decreased owing to the ignorance of the interdependencies among these subtasks. Thus, an integrated compiler backend by using fuzzy control system is developed for an irregular architecture which is called Synchronous Transfer Architecture (STA). According to the experimental results, our novel method is proved to be more efficient than the traditional method. The code size and execution time of the generated code are reduced to be about 42.7% to 62.5% of those achieved by traditional compiler backend. Moreover, the power consumption is greatly reduced concerning the efficient utilization of the STA data paths.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122742895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}