Title: Plasmon-based Virus Detection on Heterogeneous Embedded Systems
Authors: Olaf Neugebauer, Pascal Libuschewski, M. Engel, H. Müller, P. Marwedel
Published in: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, June 2015
DOI: https://doi.org/10.1145/2764967.2764976
Abstract: Embedded systems, e.g. in computer vision applications, are expected to provide significant amounts of computing power to process large data volumes. Many of these systems, such as those used in medical diagnosis, are mobile devices and face significant challenges in providing sufficient performance while operating on a constrained energy budget. Modern embedded MPSoC platforms use heterogeneous CPU and GPU cores, providing a large number of optimization parameters. This makes it possible to find useful trade-offs between energy consumption and performance for a given application. In this paper, we describe how the complex data processing required for PAMONO, a novel type of biosensor for the detection of biological viruses, can be implemented efficiently on a state-of-the-art heterogeneous MPSoC platform. An additional optimization dimension explored is the achieved quality of service: reducing the virus detection accuracy enables optimizations not achievable by modifying hardware or software parameters alone. Instead of relying on often inaccurate simulation models, our design space exploration employs a hardware-in-the-loop approach to evaluate performance and energy consumption on the embedded target platform. Trade-offs between performance, energy, and accuracy are controlled by a genetic algorithm running on a PC control system, which deploys the evaluation tasks to a number of connected embedded boards. Using our optimization approach, we are able to achieve frame rates meeting the requirements without losing accuracy. Furthermore, our approach is able to reduce the energy consumption by 93% while retaining a still reasonable detection quality.

{"title":"Adaptive Isolation for Predictable MPSoC Stream Processing","authors":"J. Teich","doi":"10.1145/2764967.2771821","DOIUrl":"https://doi.org/10.1145/2764967.2771821","url":null,"abstract":"Resource sharing and interferences of multiple threads of one, but even worse between multiple application programs running concurrently on a Multi-Processor System-on-a-Chip (MPSoC) today make it very hard to provide any timing or throughput-critical applications with time bounds. Additional interferences result from the interaction of OS functions such as thread multiplexing and scheduling as well as complex resource (e.g., cache) reservation protocols used heavily today. Finally, dynamic power and temperature management on a chip might also throttle down processor speed at arbitrary times leading to additional variations and jitter in execution time. This may be intolerable for many safety-critical applications such as medical imaging or automotive driver assistance systems. Static solutions to provide the required isolation by allocating distinct resources to safety-critical applications may not be feasible for reasons of cost and due to the lack of efficiency and inflexibility. Also, shutting off or restricting temperature and power management might not be tolerable. In this keynote, we propose new techniques for adaptive isolation of resources including processor, I/O, memory as well as communication resources on demand on an MPSoC based on the paradigm of Invasive Computing. In Invasive Computing, a programmer may specify bounds on the execution quality of a program or even single segments of a program followed by an invade command. This system returns a constellation of exclusive resources called a claim that is subsequently used in a by-default non-shared way until being released again by the invader. Through this principle, it becomes possible to isolate applications automatically and in an on-demand manner. In invasive computing, isolation is supported on all levels of hardware and software including an invasive OS. In case of an abundant number of cores available on an MPSoC today, the problem still becomes how to find suitable claims that will guarantee a performance bound in a negligible amount of time? For a broad class of streaming applications, we propose a combined static/dynamic approach based on a static design space exploration phase to extract a set of satisfying claim characteristics for which program execution is guaranteed to stay within the desired performance bounds. For a class of compositional and heterogeneous MPSoC systems, only very little information must then be passed to the OS for run-time claim search in the form of so-called CCGs (claim constraint graphs). A special role here plays a compositional Network-on-a-Chip (NoC) architecture that allows to invade guaranteed bandwith between processor, memory and I/O tiles independently from other applications. 
We demonstrate the above concepts for a complex object detection application algorithm chain taken from robot vision to show jitter-minimized implementations become possible, even for statically unknown arrivals of other concurrent applications.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"253 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116391555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
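The invade/claim/release life cycle described in the keynote can be pictured with the small Python sketch below. It is a conceptual illustration only; the class and method names are made up and do not reflect any actual invasive-computing API.

class Claim:
    def __init__(self, cores):
        self.cores = cores            # resources granted exclusively

class InvasiveRuntime:
    def __init__(self, free_cores):
        self.free_cores = free_cores

    def invade(self, min_cores):
        """Return a claim satisfying the constraint, or None if infeasible."""
        if self.free_cores >= min_cores:
            self.free_cores -= min_cores
            return Claim(min_cores)
        return None

    def infect(self, claim, kernel, data):
        # Run the program segment on the claimed (non-shared) resources.
        return [kernel(item) for item in data]   # stand-in for parallel execution

    def retreat(self, claim):
        self.free_cores += claim.cores           # release the resources again

rt = InvasiveRuntime(free_cores=8)
claim = rt.invade(min_cores=4)                   # e.g. derived from a claim constraint graph
if claim is not None:
    result = rt.infect(claim, kernel=lambda x: x * x, data=range(10))
    rt.retreat(claim)
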
{"title":"Efficient Compilation of Stream Programs for Heterogeneous Architectures: A Model-Checking based approach","authors":"R. K. Thakur, Y. Srikant","doi":"10.1145/2764967.2764968","DOIUrl":"https://doi.org/10.1145/2764967.2764968","url":null,"abstract":"Stream programming based on the synchronous data flow (SDF) model naturally exposes data, task and pipeline parallelism. Statically scheduling stream programs for homogeneous architectures has been an area of extensive research. With graphic processing units (GPUs) now emerging as general purpose co-processors, scheduling and distribution of these stream programs onto heterogeneous architectures (having both GPUs and CPUs) provides for challenging research. Exploiting this abundant parallelism in hardware, and providing a scalable solution is a hard problem. In this paper we describe a coarse-grained software pipelined scheduling algorithm for stream programs which statically schedules a stream graph onto heterogeneous architectures. We formulate the problem of partitioning the work between the CPU cores and the GPU as a model-checking problem. The partitioning process takes into account the costs of the required buffer layout transformations associated with the partitioning and the distribution of the stream graph. The solution trace result from the model checking provides a map for the distribution of actors across different processors/-cores. This solution is then divided into stages, and then a coarse grained software-pipelined code is generated. We use CUDA streams to map these programs synergistically onto the CPU and GPUs. We use a performance model for data transfers to determine the optimal number of CUDA streams on GPUs. Our software-pipelined schedule yields a speedup of upto 55.86X and a geometric mean speedup of 9.62X over a single threaded CPU.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125479194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is dynamic compilation possible for embedded systems?","authors":"H. Charles, V. Lomüller","doi":"10.1145/2764967.2782785","DOIUrl":"https://doi.org/10.1145/2764967.2782785","url":null,"abstract":"JIT compilation and dynamic compilation are powerful techniques allowing to delay the final code generation to the runtime. There is many benefits: improved portability, virtual machine security, etc. Unforturnately the tools used for JIT compilation and dynamic compilation does not met the classical requirement for embedded platforms: memory size is huge and code generation has big overheads. In this paper we show how dynamic code specialization (JIT) can be used and be beneficial in terms of execution speed and energy consumption with memory footprint kept under control. We based our approaches on our tool deGoal and on LLVM, that we extended to be able to produce lightweight runtime specializers from annotated LLVM programs. Benchmarks are manipulated and transformed into templates and a specialization routine is build to instantiate the routines. Such approach allows to produce efficient specializations routines, with a minimal energy consumption and memory footprint compare to a generic JIT application. Through some benchmarks, we present its efficiency in terms of speed, energy and memory footprint. We show that over static compilation we can achieve a speed-up of 21 % in terms of execution speed but also a 10 % energy reduction with a moderate memory footprint.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114081048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Energy Efficient Message Passing Synchronization Algorithm for Concurrent Data Structures in Embedded Systems","authors":"Lazaros Papadopoulos, D. Soudris","doi":"10.1145/2764967.2771931","DOIUrl":"https://doi.org/10.1145/2764967.2771931","url":null,"abstract":"Nowadays, modern multicore embedded systems often execute complex applications that rely heavily on concurrent data structures. Databases on embedded microservers, file systems and stream processing algorithms belong in application domains that normally utilize concurrent data structures to store and process their data. The prevalent lock-based synchronization methods based on mutexes provide poor scalability and, most importantly, they lead to high energy consumption, which is an important constraint on embedded systems. In this work, we propose an energy efficient synchronization model for embedded system architectures based on message-passing communication. Our results show that concurrent data structures based on the proposed model provide lower power consumption in comparison with the corresponding lock-based implementations, along with comparable performance.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133871892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Throughput-optimizing Compilation of Dataflow Applications for Multi-Cores using Quasi-Static Scheduling
Authors: T. Schwarzer, J. Falk, M. Glaß, J. Teich, C. Zebelein, C. Haubelt
Published in: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, June 2015
DOI: https://doi.org/10.1145/2764967.2764972
Abstract: Application modeling using dynamic dataflow graphs is well-suited for multi-core platforms. However, there is often a mismatch between the fine granularity of the application and the platform. Tailoring this granularity to the platform promises performance gains by (a) reducing dynamic scheduling overhead and (b) exploiting compiler optimizations. In this paper, we propose a throughput-optimizing compilation approach that uses Quasi-Static Schedules (QSSs) to combine actors of static dataflow subgraphs. Our approach combines core allocation, QSS, and actor binding in a Design Space Exploration (DSE), optimizing throughput for a given number of available cores. During the DSE, each implementation candidate is compiled to and evaluated on the target hardware, here an Intel i7 and an ARM Cortex-A9. Experimental results on synthetic benchmarks as well as a real-world control application show that our holistic compilation approach outperforms classic DSEs that are agnostic of QSS, as well as a DSE that employs QSS as a post-processing step. Among other results, we show a case where our compilation approach obtains a speedup of 9.91x for a 4-core implementation, while a classic DSE only obtains a speedup of 2.12x.

Title: Application-Specific Architecture Exploration Based on Processor-Agnostic Performance Estimation
Authors: Juan Fernando Eusse Giraldo, L. Murillo, C. McGirr, R. Leupers, G. Ascheid
Published in: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, June 2015
DOI: https://doi.org/10.1145/2764967.2771932
Abstract: Early design decisions such as the selection of the architectural class and the instruction set largely determine the performance and energy consumption of application-specific processors (ASIPs). However, making decisions that effectively translate into high performance requires a careful analysis of the target application by an experienced designer. This process is extremely time consuming, and confirmation that the processor meets the application requirements can only be obtained after costly architectural implementation, synthesis, and simulation. To shorten design times, this work couples High-Level Synthesis (HLS) with pre-architectural performance estimation, with the aim of providing designers with an initial architectural seed together with quantitative feedback about its performance. This enables a light-weight refinement process based on the obtained feedback, such that the time-consuming microarchitectural implementation is done only once, at the end of the refinement steps. We employed our flow to generate four candidate ASIPs for a 1024-point FFT. Validation of the estimates and evaluation of the gains are performed on actual ASIP implementations, which achieve performance gains of up to 8.42x and energy gains of up to 1.32x over an existing VLIW processor.

Title: Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays
Authors: É. Sousa, Frank Hannig, J. Teich, Qingqing Chen, Ulf Schlichtmann
Published in: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, June 2015
DOI: https://doi.org/10.1145/2764967.2771933
Abstract: Massively Parallel Processor Arrays (MPPAs) lend themselves well to portable devices such as tablets and smartphones. However, applications running on mobile platforms require a certain performance level or quality (e.g., high-resolution image processing) that needs to be satisfied while adhering to a power budget and a temperature threshold. As a solution to these challenges, we consider a resource-aware computing paradigm to exploit runtime adaptation without violating any thermal or power constraint in a programmable MPPA. For estimating the power consumption, we developed a mathematical model based on the post-synthesis implementation of an MPPA in different CMOS technologies, while the temperature variation was emulated. We showcase our hardware/software mechanism for loading new configurations into the accelerator on the fly, considering quality/throughput trade-offs for image processing applications. The results show that the average power consumption of Sobel and Laplace operators using different numbers of processing elements amounts to 1.24 mW and 10.35 mW, respectively. Furthermore, only 1.64 μs are necessary for configuring a class of MPPA running at 550 MHz.

Title: VLIW Code Generation for a Convolutional Network Accelerator
Authors: Maurice Peemen, W. Pramadi, B. Mesman, H. Corporaal
Published in: Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems, June 2015
DOI: https://doi.org/10.1145/2764967.2771928
Abstract: This paper presents a compiler flow that maps Deep Convolutional Networks (ConvNets) to a highly specialized VLIW accelerator core targeting the low-power embedded market. Earlier works have focused on energy-efficient accelerators for this class of algorithms, but none of them provides a complete and practical programming model. Due to the large parameter set of a ConvNet, it is essential that the user can abstract from the accelerator architecture and does not have to rely on an error-prone, ad-hoc assembly programming model. By using modulo scheduling for software pipelining, we demonstrate that our automatically generated code achieves hardware utilization equal to, or within 5-20% of, code written manually by experts. Our compiler removes the huge manual workload of efficiently mapping ConvNets to an energy-efficient core for the next generation of mobile and wearable devices.

{"title":"A model-based, single-source approach to design-space exploration and synthesis of mixed-criticality systems","authors":"F. Herrera, P. Peñil, E. Villar","doi":"10.1145/2764967.2784777","DOIUrl":"https://doi.org/10.1145/2764967.2784777","url":null,"abstract":"While the Moore's Law is still in place, the complexity of embedded systems continues to growth exponentially. Embedded Systems are implemented on complex HW/SW platforms, requiring more powerful design methods and tools. Electronic System Level (ESL) design [1] proposes to raise the level of abstraction in which the system is modeled in order to allow the analysis and optimization of the system at earlier stages of the design process.","PeriodicalId":110157,"journal":{"name":"Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133331169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}