{"title":"Exploiting critical data regions to reduce data cache energy consumption","authors":"K. Vardhan, Y. Srikant","doi":"10.1145/2609248.2609253","DOIUrl":"https://doi.org/10.1145/2609248.2609253","url":null,"abstract":"In this paper we propose an energy aware optimization that exploits latency tolerance of data regions in programs. We propose techniques to identify data regions and rate their criticality using a dynamic critical path model. We compare latency tolerance of data regions to existing characteristics like access frequency and size of data regions. We leverage previously proposed drowsy cache lines to design an optimization that can reduce energy consumption in a data cache. We target this optimization to a simplified single-core with a private cache and single-threaded system which can be part of any type of a multi-core processor. We compare this technique to existing optimizations that use drowsy caches. We experimentally show that this technique can yield total power savings close to 38% and leakage power savings of 20% in the data cache when compared to a baseline configuration without any significant performance penalty.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"220 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122468352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimizing the cost of synchronisations in the WCET of real-time parallel programs","authors":"Haluk Ozaktas, Christine Rochange, P. Sainrat","doi":"10.1145/2609248.2609261","DOIUrl":"https://doi.org/10.1145/2609248.2609261","url":null,"abstract":"Designing time-predictable architectures to support the requirements of hard real-time systems is the goal of several research projects. In this paper we assume that such platforms exist and we focus on the timing analysis of parallel real-time applications. One of the main challenges is to determine how much the delays induced by software constructs such as synchronisations can impact the worst-case execution times (WCETs) of parallel threads. In this paper, we refine state-of-the-art analysis: first, we derive more accurate estimations of stalls at critical sections; second, we introduce new locking primitives that minimise stall times on the worst-case path. Experimental results show noticeable improvements on the WCETs of benchmarks.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122723352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel action language for embedded applications and its compilation flow","authors":"Ivan Llopard, Albert Cohen, Christian Fabre, N. Hili","doi":"10.1145/2609248.2609257","DOIUrl":"https://doi.org/10.1145/2609248.2609257","url":null,"abstract":"The complexity of Embedded System (ES) development is increasing dramatically. This has several cumulative sources: the intricate combination of data-intensive, computational and control aspects; the ubiquity of parallelism and heterogeneity of modern architectures; and the diversity of target-specific, non-deterministic programming models (e.g., C++ with explicit message passing, OpenCL, VHDL). Model-Driven Engineering (MDE) proposes to manage complexity by raising the level of abstraction for designers and developers, and refining the implementation for a particular context and platform through model transformations. In such frameworks, behavior is often specified by means of Hierarchical State Machines (HSMs) equipped with an action language. However, although such models represent some level of control parallelism through objects and HSMs, data parallelism, compound data, and the exploitation and optimization thereof remain very limited. In this paper, we propose an action language that seamlessly combines HSMs with data parallelism and operations on compound data. It preserves the expressivity of HSM and captures a layout-neutral description of data organisation. It also extends message-passing with an intuitive semantics for this additional parallelism and provides a strong foundation for array-based optimisation techniques. We present this language together with a baseline code generation flow to enable the production of efficient, low-level imperative code.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122267265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Single-rate approximations of cyclo-static synchronous dataflow graphs","authors":"R. D. Groote, P. Hölzenspies, J. Kuper, G. Smit","doi":"10.1145/2609248.2609249","DOIUrl":"https://doi.org/10.1145/2609248.2609249","url":null,"abstract":"Exact analysis of synchronous dataflow (sdf) graphs is often considered too costly, because of the expensive transformation of the graph into a single-rate equivalent. As an alternative, several authors have proposed approximate analyses. Existing approaches to approximation are based on the operational semantics of an sdf graph. We propose an approach to approximation that is based on functional semantics. This generalises earlier work done on multi-rate sdf graphs towards cyclo-static sdf (csdf) graphs. We take, as a starting point, a mathematical characterisation, and derive two transformations of a csdf graph into hsdf graphs. These hsdf graphs have the same size as the csdf graph, and are approximations: their respective temporal behaviours are optimistic and pessimistic with respect to the temporal behaviour of the csdf graph. Analysis results computed for these single-rate approximations give bounds on the analysis results of the csdf graph. As an illustration, we show how these single-rate approximations may be used to compute bounds on the buffer sizes required to reach a given throughput.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116158566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for dynamic parallelization of FPGA-accelerated applications","authors":"J. Fowers, Jianye Liu, G. Stitt","doi":"10.1145/2609248.2609256","DOIUrl":"https://doi.org/10.1145/2609248.2609256","url":null,"abstract":"High-level synthesis and compiler studies have introduced many compile-time techniques for parallelizing applications. However, one fundamental limitation of compile-time optimization is the requirement for pessimistic dependence assumptions that can significantly restrict parallelism. To avoid this limitation, many compilers require a restrictive coding style that is not practical for many designers. We present a more transparent approach that aggressively parallelizes applications by dynamically analyzing actual runtime dependencies and scheduling functions onto multiple devices when dependencies allow. In addition, the approach applies FPGA-specific pipelining optimizations to exploit deep parallelism in chains of dependent functions. Experimental results show a speedup of 4.9x for a video-processing application compared to sequential software execution, a speedup of 5.6x compared to traditional FPGA execution, with a framework overhead of only 4%.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121881524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Temporal analysis model extraction for optimizing modal multi-rate stream processing applications","authors":"Stefan J. Geuns, J. Hausmans, M. Bekooij","doi":"10.1145/2609248.2609252","DOIUrl":"https://doi.org/10.1145/2609248.2609252","url":null,"abstract":"Modern real-time stream processing applications, such as Software Defined Radio (SDR) applications, typically have multiple modes and multi-rate behavior. Modes are often described using while-loops whereas multi-rate behavior is frequently described using arrays with pseudo-random indexing patterns. The temporal properties of these applications have to be analyzed in order to determine whether optimizations improve throughput. However, no method exists in which a temporal analysis model is derived from these applications that is suitable for temporal analysis and optimization. In this paper an approach is presented in which a concurrency model for the temporal analysis and optimization of stream processing applications is automatically extracted from a parallelized sequential application. With this model it can be determined whether a program transformation improves the worst-case temporal behavior. The key feature of the presented approach is that arrays with arbitrary indexing patterns can be described, allowing the description of multi-rate behavior, while still supporting the description of modes using while-loops. In the model, an over-approximation of the synchronization dependencies is used in case of arrays with pseudo-random indexing patterns. Despite the use of this approximation, we show that deadlock is only concluded from the model if there is also deadlock in the parallelized application. The relevance and applicability of the presented approach are demonstrated using an Orthogonal Frequency-Division Multiplexing (OFDM) transmitter application.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113962126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A data parallel view on polyhedral process networks","authors":"A. Balevic, B. Kienhuis","doi":"10.1145/1988932.1988939","DOIUrl":"https://doi.org/10.1145/1988932.1988939","url":null,"abstract":"Emerging architectures in embedded space are expected to make use of a diverse mix of multicores, vector-based units, GPU cores and special function accelerators. In order to facilitate mapping onto diverse architectures, different models of computation have been considered. Polyhedral Process Networks (PPNs) have been extensively used in automatic generation of task and pipeline parallel programs for embedded architectures. However, the single program multiple data (SPMD) type of data parallelism has not been addressed in the PPN model. In this paper, we propose a Data Parallel View (DPV) on PPNs which introduces abstractions necessary for capturing and exploiting data parallelism on top of the PPN model. As a proof of concept, we demonstrate how a PPN can be mapped onto a modern GPU using the DPV. By complementing the native PPN support for task and pipeline parallelism with the DPV support for data parallelism, we expect to make the best use of different types of architectural components and types of parallelism on heterogeneous architectures.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132063810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource-aware programming and simulation of MPSoC architectures through extension of X10","authors":"Frank Hannig, Sascha Roloff, G. Snelting, J. Teich, Andreas Zwinkau","doi":"10.1145/1988932.1988941","DOIUrl":"https://doi.org/10.1145/1988932.1988941","url":null,"abstract":"The efficient use of future MPSoCs with 1000 or more processor cores requires new means of resource-aware programming to deal with increasing imperfections such as process variation, fault rates, aging effects, and power as well as thermal problems. In this paper, we apply a new approach called invasive computing that enables an application programmer to spread computations to processors deliberately and on purpose at certain points of the program. Such decisions can be made depending on the degree of application parallelism and the state of the underlying resources such as utilization, load, and temperature. The introduced programming constructs for resource-aware programming are embedded into the parallel computing language X10 as developed by IBM using a library-based approach. Moreover, we show how individual heterogeneous MPSoC architectures may be modeled for subsequent functional simulation by defining compute resources such as processors themselves by lightweight threads that are executed in parallel together with the application threads by the X10 run-time system. Thus, the state changes of each hardware resource may be simulated including temperature, aging, and other useful monitor functionality to provide a first high-level programming test-bed for invasive computing.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125904285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static run-time mode extraction by state partitioning in synchronous process networks","authors":"M. Beyer, S. Glesner","doi":"10.1145/1988932.1988938","DOIUrl":"https://doi.org/10.1145/1988932.1988938","url":null,"abstract":"Process Networks (PNs) are used for modeling streaming-oriented applications with changing behavior, which must be mapped on a concurrent architecture to meet the performance and energy constraints of embedded devices. Finding an optimal mapping of Process Networks to the constrained architecture presumes that the behavior of the PN is statically known. In this paper we present a static analysis for synchronous PNs that partitions the state space to extract run-time modes based on a Data Augmented Control Flow Automaton (DACFA). The result is a mode automaton whose nodes describe identified program modes and whose edges represent transitions among them. Optimizing back-ends mapping from PNs to concurrent architectures can be guided by these analysis results.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130270537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The case for application specific compilers","authors":"M. Beemster","doi":"10.1145/1988932.1988943","DOIUrl":"https://doi.org/10.1145/1988932.1988943","url":null,"abstract":"We believe it makes sense to develop compilers that are specific for particular application domains. In many areas of embedded computing, processor architectures are designed specifically to run a narrow band of application code very well. These architectures are unlike any the world has seen before and to program them is a challenge to say the least. CoSy's flexible compiler technology thrives in this area. By viewing the compiler as a means and not as a goal, it is possible to achieve spectacular results in a very short time-frame.","PeriodicalId":375451,"journal":{"name":"Software and Compilers for Embedded Systems","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121393001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}