Christakis Lezos, I. Latifis, G. Dimitroulakos, K. Masselos
{"title":"Compiler-Directed Data Locality Optimization in MATLAB","authors":"Christakis Lezos, I. Latifis, G. Dimitroulakos, K. Masselos","doi":"10.1145/2906363.2906378","DOIUrl":"https://doi.org/10.1145/2906363.2906378","url":null,"abstract":"Array programming languages, such as MATLAB, are often used for algorithm development by scientists and engineers without taking into consideration implementation related issues and with limited emphasis on relevant optimizations. Application code optimization, especially in terms of data storage and transfer behavior, is still an important issue and heavily affects implementations' quality in terms of performance, power consumption etc. Efficient approaches for the optimization of high level application code are required to derive high quality implementations while still reducing development time and cost. This paper presents MemAssist, a software tool supporting application developers in detecting parts of the application code in MATLAB that do not exploit efficiently the targeted processor architecture and especially the memory hierarchy. Furthermore, the proposed tool guides application developers in applying code transformations in MATLAB for the optimization of the algorithm's temporal data locality. An image processing algorithm has been optimized using MemAssist as a practical usage scenario. Experimental results prove that the use of MemAssist can heavily reduce cache misses (up to 40%) and improve execution time (up to 30% speedup) on two different processor architectures. Thus, MemAssist can be used for optimized application code development that can lead to efficient implementations while still reducing development time and cost.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130525752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ghassan Shobaki, Najm Eldeen Abu Rmaileh, J. Jamal
{"title":"Studying the Impact of Bit Switching on CPU Energy","authors":"Ghassan Shobaki, Najm Eldeen Abu Rmaileh, J. Jamal","doi":"10.1145/2906363.2906382","DOIUrl":"https://doi.org/10.1145/2906363.2906382","url":null,"abstract":"It has been proposed in previous work that compiler instruction scheduling may reduce energy consumption by reordering instructions to minimize bit switching. Multiple algorithms have been proposed in the literature for performing this form of instruction scheduling. However, the impact of these algorithms on actual energy consumption has not been quantified using real hardware measurements; only simulation results have been reported. In this paper, we study the impact of bit switching on the CPU energy consumption using direct hardware measurements on a modern ARM processor. The measurements are performed using an energy probe provided by ARM. The experimental results show that the switching energy is significant and measurable, thus negating the hypothesis that compiling for performance is equivalent to compiling for energy. Yet, our experimental evaluation of multiple bit-switching-aware algorithms suggests that developing a compiler scheduling algorithm for reducing energy consumption by minimizing bit switching is quite challenging, because bit switching may conflict with execution time. An instruction order that minimizes bit switching but increases execution time may result in an overall increase in CPU energy, because the execution time has a higher impact on CPU energy than bit switching. In conclusion, our experimental results show that although performance is a primary factor that affects energy, it is not the only factor; switching energy is another significant factor.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129196163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Task-Level Monitoring Framework for Multi-Processor Platforms","authors":"Philipp Ittershagen, Kim Grüttner, W. Nebel","doi":"10.1145/2906363.2906373","DOIUrl":"https://doi.org/10.1145/2906363.2906373","url":null,"abstract":"In this paper, a monitoring framework for observing properties of tasks running on a multi-processor platform is proposed. We describe the implementation of the framework on a TLM-based virtual platform containing an ARM Cortex A9 multi-core instruction-set simulator and shared memory modules. An application model consisting of periodic tasks and communication channels is used to demonstrate the applicability of the monitoring framework. Based on the application model, we describe a method for deriving a monitor implementation at design time that is able to check the execution order and the platform mapping during run-time. The model is implemented on top of a POSIX-compatible real-time operating system and the monitor is instantiated as a TLM component in the virtual platform. The monitor implementation is then able to check the execution order and the platform mapping of the application against the specification at run-time. Finally, we discuss the monitoring capability and its contribution to a safety concept for fail-safe systems.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130608523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning Approach to Generate Pareto Front for List-scheduling Algorithms","authors":"Pham Nam Khanh, Akash Kumar, Khin Mi Mi Aung","doi":"10.1145/2906363.2906380","DOIUrl":"https://doi.org/10.1145/2906363.2906380","url":null,"abstract":"List Scheduling is one of the most widely used techniques for scheduling due to its simplicity and efficiency. In traditional list-based schedulers, a cost/priority function is used to compute the priority of tasks/jobs and put them in an ordered list. The cost function has been becoming more and more complex to cover increasing number of constraints in the system design. However, most of the existing list-based schedulers implement a static priority function that usually provides only one schedule for each task graph input. Therefore, they may not be able to satisfy the desire of system designers, who want to examine the trade-off between a number of design requirements (performance, power, energy, reliability ...). To address this problem, we propose a framework to utilize the Genetic Algorithm (GA) for exploring the design space and obtaining Pareto-optimal design points. Furthermore, multiple regression techniques are used to build predictive models for the Pareto fronts to limit the execution time of GA. The models are built using training task graph datasets and applied on incoming task graphs. The Pareto fronts for incoming task graphs are produced in time 2 orders of magnitude faster than the traditional GA, with only 4% degradation in the quality.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134494809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuoxin Lin, Yanzhou Liu, W. Plishker, S. Bhattacharyya
{"title":"A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms","authors":"Shuoxin Lin, Yanzhou Liu, W. Plishker, S. Bhattacharyya","doi":"10.1145/2906363.2906374","DOIUrl":"https://doi.org/10.1145/2906363.2906374","url":null,"abstract":"Heterogeneous computing platforms with multicore central processing units (CPUs) and graphics processing units (GPUs) are of increasing interest to designers of embedded signal processing systems since they offer the potential for significant performance boost while maintaining the flexibility of software-based design flows. Developing optimized implementations for CPU-GPU platforms is challenging due to complex, inter-related design issues, including task scheduling, interprocessor communication, memory management, and modeling and exploitation of different forms of parallelism. In this paper, we present an automated, dataflow based, design framework called DIF-GPU for application mapping and software synthesis on heterogeneous CPU-GPU platforms. DIF-GPU is based on novel extensions to the dataflow interchange format (DIF) package, which is a software environment for developing and experimenting with dataflow-based design methods and synthesis techniques for embedded signal processing systems. DIF-GPU exploits multiple forms of parallelism by deeply incorporating efficient vectorization and scheduling techniques for synchronous dataflow specifications, and incorporating techniques for streamlining interprocessor communication. DIF-GPU also provides software synthesis capabilities to help accelerate the process of moving from high-level application models to optimized implementations.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114885798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Single Source Shortest Path Parallelization on Shared Memory Accelerators","authors":"D. Palossi, A. Marongiu","doi":"10.1145/2906363.2915925","DOIUrl":"https://doi.org/10.1145/2906363.2915925","url":null,"abstract":"Single Source Shortest Path (SSSP) algorithms are widely used in embedded systems for several applications. The emerging trend towards the adoption of heterogeneous designs in embedded devices, where low-power parallel accelerators are coupled to the main processor, opens new opportunities to deliver superior performance/watt, but calls for efficient parallel SSSP implementation. In this work we provide a detailed exploration of the Δ-stepping algorithm performance on a representative heterogeneous embedded system, TI Keystone II, considering the impact of several parallelization parameters (threading, load balancing, synchronization).","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"559 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117053525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Place Update in a Dataflow Synchronous Language: A Retiming-Enabled Language Experiment","authors":"Ulysse Beaugnon, Albert Cohen, Marc Pouzet","doi":"10.1145/2906363.2906379","DOIUrl":"https://doi.org/10.1145/2906363.2906379","url":null,"abstract":"Dataflow synchronous languages such as Lustre have a purely functional semantics. This incurs a high overhead when dealing with arrays, as they have to be copied at each update. We propose to tackle this problem at the source, by constraining programs so that every functional array definition can be optimized into an in-place update. Our solution handles aliasing between function arguments. It also allows more programs with in-place updates to be accepted thanks to a new retiming framework, effectively rescheduling computations across time steps. Our proposed language and compilation method enforces zero-copy purely functional arrays while preserving expressiveness and programmer control through explicit copies.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127055406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Extensible Platform Description Language Supporting Retargetable Toolchains and Adaptive Execution","authors":"C. Kessler, Lu Li, A. Atalar, A. Dobre","doi":"10.1145/2906363.2906366","DOIUrl":"https://doi.org/10.1145/2906363.2906366","url":null,"abstract":"XPDL is a modular, extensible platform description language for heterogeneous multicore systems and clusters. XPDL models provide metadata about hardware and installed system software that are relevant for adaptive static and dynamic optimizations of application programs and system settings for improved performance and energy efficiency. XPDL is based on XML and uses hyperlinks and inheritance to create modular, distributed libraries of platform models. We also provide a retargetable toolchain that browses and processes XPDL models and generates driver code for microbenchmarking to bootstrap empirical performance and energy models at deployment time. A C++ API enables convenient introspection of platform models, even at run-time, which allows for adaptive dynamic program optimizations such as tuned selection of implementation variants.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125486441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introducing MoC Drivers for the Integration of Sensor-Actuator Behaviors in Model-Based Design Flows of Embedded Systems","authors":"Omair Rafique, K. Schneider","doi":"10.1145/2906363.2906368","DOIUrl":"https://doi.org/10.1145/2906363.2906368","url":null,"abstract":"Model-based design flows for embedded systems have been introduced to allow late design changes while still keeping tight time-to-market deadlines. In general, these design flows start with abstract models and refine these to a final implementation maintaining already implemented properties. However, essentially all of these design flows suffer from a deployment gap in the sense that the finally generated files are general program files which assume a particular model of computation (MoC) that may not be provided by the chosen target architecture. For this reason, the final deployment is usually a non-trivial manual design step that can break all correctness-by-construction guarantees of the previous model-based design. In this paper, we therefore introduce the idea of MoC drivers which wraps the real sensor and actuator interaction in a shell that provides the MoC of the generated software. As a particular example, we discuss in this paper how MoC drivers bridge the deployment gap between automatically generated dataflow programs and event-driven behaviors of the target architecture. The approach is illustrated with a Speedometer application on a distributed automotive embedded platform.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122800417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Challenges of ILP-based SPM Allocation Optimizations","authors":"Dominic Oehlert, Arno Luppold, H. Falk","doi":"10.1145/2906363.2906371","DOIUrl":"https://doi.org/10.1145/2906363.2906371","url":null,"abstract":"Scratchpad Memory (SPM) allocation is a well-known technique for compiler-based code optimizations. Integer-Linear Programming has been proven to be a powerful technique to determine which parts of a program should be moved to the SPM. Although the idea is quite straight-forward in theory, the technique features several challenges when being applied to modern embedded systems. In this paper, we aim to bring out the main issues and possible solutions which arise when trying to apply those optimizations to existing hardware platforms.","PeriodicalId":344390,"journal":{"name":"Proceedings of the 19th International Workshop on Software and Compilers for Embedded Systems","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127011924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}