{"title":"Fine-grain dynamic instruction placement for L0 scratch-pad memory","authors":"Jongsoo Park, J. Balfour, W. Dally","doi":"10.1145/1878921.1878943","DOIUrl":"https://doi.org/10.1145/1878921.1878943","url":null,"abstract":"We present a fine-grain dynamic instruction placement algorithm for small L0 scratch-pad memories (SPMs), whose unit of transfer can be an individual instruction. Our algorithm captures a large fraction of instruction reuse missed by coarse-grain placement algorithms whose unit of transfer is restricted to loops or functions within the capacity of SPMs. Evaluation of L0 SPMs with our fine-grain algorithm in 17 applications shows that the energy consumed by instruction storage hierarchy is reduced by 38% and 31% compared to that of L0 instruction caches and L0 SPMs with an ideal coarse-grain algorithm, respectively.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131798086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parsimonious information technologies for pixels, perception, wetware and simulation: issues for Petrasek's global virtual hospital system","authors":"A. Barr","doi":"10.1145/1878921.1878931","DOIUrl":"https://doi.org/10.1145/1878921.1878931","url":null,"abstract":"New types of \"engaging\" embedded systems and devices will greatly assist future medical care, as for Petrasek's envisioned Global Virtual Hospital System. The most effective devices will need to be designed in a \"parsimonious\" way for their economic use of energy, digital bits, communication time, and in terms of trading more expensive physical structures for less expensive computational ones. At the technological level, each device needs a carefully selected \"matched set\" of technological tradeoffs between the particular medical and user ends and means. The matched set of choices would carefully make sure that the device \"methods\" and implementations lead reliably to the device \"goals\" and purposes. In addition, however, there is a critical user-oriented aspect where the devices will also need to utilize highly \"engaging environments\" that are not too cumbersome or too tiring to use. 
People are becoming increasingly sophisticated with regard to the interactive requirements they have for their devices, from their experience with digital media, iPhones, video computer games and other types of environments that \"engage\" a person's attention for long periods of time, and without annoying delays and frustrations. It is an absolute requirement that the devices incorporate highly engaging environments so that using them does not tire the user or cause unnecessary medical errors and delays. This improved type of portable device, scanners, services and information methods would efficiently and more accurately gather sufficiently detailed medical information from the patient's body, help relay sufficient parts of the patient information electronically to a worldwide net of physicians and relay appropriate results and prescriptions back to the patient","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124689974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vertical stealing: robust, locality-aware do-all workload distribution for 3D MPSoCs","authors":"A. Marongiu, P. Burgio, L. Benini","doi":"10.1145/1878921.1878952","DOIUrl":"https://doi.org/10.1145/1878921.1878952","url":null,"abstract":"In this paper we address the issue of efficient do-all workload distribution on an embedded 3D MPSoC. 3D stacking technology enables low-latency, high-bandwidth access to multiple large memory banks in close spatial proximity. In our implementation one silicon layer contains multiple processors, whereas one or more DRAM layers on top host a NUMA memory subsystem. To obtain high locality and a balanced workload we consider a two-step approach. First, a compiler pass analyzes memory references in a loop and schedules each iteration to the processor owning the most frequently accessed data. Second, if locality-aware loop parallelization has generated an unbalanced workload, we allow idle processors to execute part of the remaining work from neighbors by implementing runtime support for work stealing.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127882145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Slicing based code parallelization for minimizing inter-processor communication","authors":"M. Kandemir, Yuanrui Zhang, Sai Prashanth Muralidhara, O. Ozturk, S. Narayanan","doi":"10.1145/1629395.1629409","DOIUrl":"https://doi.org/10.1145/1629395.1629409","url":null,"abstract":"One of the critical problems in distributed memory multi-core architectures is scalable parallelization that minimizes inter-processor communication. Using the concept of iteration space slicing, this paper presents a new code parallelization scheme for data-intensive applications. This scheme targets distributed memory multi-core architectures, and formulates the problem of data-computation distribution (partitioning) across parallel processors using slicing such that, starting with the partitioning of the output arrays, it iteratively determines the partitions of other arrays as well as iteration spaces of the loop nests in the application code. The goal is to minimize inter-processor data communications. Based on this iteration space slicing based formulation of the problem, we also propose a solution scheme. The proposed data-computation scheme is evaluated using six data-intensive benchmark programs. In our experimental evaluation, we also compare this scheme against three alternate data-computation distribution schemes. 
The results obtained are very encouraging, indicating around 10% better speedup, with 16 processors, over the next-best scheme when averaged over all benchmark codes we tested.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121343233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A platform for developing adaptable multicore applications","authors":"D. Fay, L. Shang, D. Grunwald","doi":"10.1145/1629395.1629418","DOIUrl":"https://doi.org/10.1145/1629395.1629418","url":null,"abstract":"Computer systems are resource constrained. Application adaptation is a useful way to optimize system resource usage while satisfying the application performance constraints. Previous application adaptation efforts, however, were ad-hoc, time-consuming, and highly application-specific with limited portability between computer systems. In this work, our goal is to provide a development platform to systematically explore and rigorously apply portable application-specific runtime optimization. We present OCCAM, a software platform for developing multicore adaptive applications. OCCAM's design-time platform consists of APIs and data structures that allow application developers to specify the performance constraints and application-specific optimization techniques. OCCAM's run-time system dynamically manages the application behavior and optimizes system resource usage. OCCAM targets emerging Recognition, Mining, and Synthesis Applications (RMS). Using a set of RMS benchmarks, the experimental study demonstrates that OCCAM can successfully optimize resource usage under application performance constraints across a wide range of computer platforms, with an average of 38% energy savings on an Intel Atom-based, energy-constrained portable system, and an average of 24% energy savings on a high-performance, dual-core computer platform. These savings are accomplished with low overhead. 
We have also successfully extended OCCAM applications to run on a 16-core setup.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127176714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal loop parallelization for maximizing iteration-level parallelism","authors":"Duo Liu, Z. Shao, M. Wang, M. Guo, Jingling Xue","doi":"10.1145/1629395.1629407","DOIUrl":"https://doi.org/10.1145/1629395.1629407","url":null,"abstract":"This paper solves the open problem of extracting the maximal number of iterations from a loop that can be executed in parallel on chip multiprocessors. Our algorithm solves it optimally by migrating the weights of parallelism-inhibiting dependences on dependence cycles in two phases. First, we model dependence migration with retiming and formulate this classic loop parallelization into a graph optimization problem, i.e., one of finding retiming values for its nodes so that the minimum non-zero edge weight in the graph is maximized. We present our algorithm in three stages with each being built incrementally on the preceding one. Second, the optimal code for a loop is generated from the retimed graph of the loop found in the first phase. We demonstrate the effectiveness of our optimal algorithm by comparing with a number of representative non-optimal algorithms using a set of benchmarks frequently used in prior work.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133190278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatial complexity of reversibly computable DAG","authors":"Mouad Bahi, C. Eisenbeis","doi":"10.1145/1629395.1629404","DOIUrl":"https://doi.org/10.1145/1629395.1629404","url":null,"abstract":"In this paper we address the issue of making a program reversible in terms of spatial complexity. Spatial complexity is the number of memory/register locations required for performing the computation in both forward and backward directions. Spatial complexity has an important relationship with the intrinsic power consumption required at run time; this was our primary motivation. It also has an important relationship with the trade-off between storing and recomputing reused intermediate values, also known as the rematerialization problem in the context of compiler register allocation, or the checkpointing issue in the general case. We present a lower bound on the spatial complexity of a DAG (directed acyclic graph) with reversible operations, as well as a heuristic aimed at finding the minimum number of registers required for a forward and backward execution of a DAG. We define energetic garbage as the additional number of registers needed for the reversible computation with respect to the original computation. 
We have run experiments that suggest that the garbage size is never more than 50% of the DAG size for DAGs with unary/binary operations.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114181183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-aware probabilistic multiplier: design and analysis","authors":"Mark S. K. Lau, K. Ling, Y. Chu","doi":"10.1145/1629395.1629434","DOIUrl":"https://doi.org/10.1145/1629395.1629434","url":null,"abstract":"Probabilistic CMOS is considered to be a promising technology for substantial energy savings for computing devices, such as DSPs and graphics chips. The basic principle is to relax the energy requirement by allowing possibly incorrect computation results. For devices with probabilistic components, energy should be assigned to each component wisely, in order to achieve a good trade-off between energy consumption and correctness of the outputs. Recently, a few schemes have been proposed for energy assignment of ripple-carry adders, which are often based on intuitive arguments. In the present paper, we extend the idea of energy assignment to probabilistic multipliers. We focus on a fundamental type of multiplier, the array multiplier. We derive some analytical results. Guided by these results, we devise an energy assignment scheme. We also find that energy assignment for array multipliers and ripple-carry adders can be quite different, due to differences in their structures. To the best of our knowledge, our work is the first attempt in the literature to consider energy assignment for multipliers. 
Some examples, including digital image enhancement, are presented to demonstrate the effectiveness of the proposed scheme.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"407 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121811898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fault tolerant cache architecture for sub 500mV operation: resizable data composer cache (RDC-cache)","authors":"Avesta Sasan, H. Homayoun, A. Eltawil, F. Kurdahi","doi":"10.1145/1629395.1629431","DOIUrl":"https://doi.org/10.1145/1629395.1629431","url":null,"abstract":"In this paper we introduce Resizable Data Composer-Cache (RDC-Cache). This novel cache architecture operates correctly at sub 500 mV in 65 nm technology, tolerating a large number of Manufacturing Process Variation induced defects. Based on a smart relocation methodology, RDC-Cache decomposes the data that is targeted for a defective cache way and relocates one or a few words to a new location, avoiding a write to defective bits. Upon a read request, the requested data is recomposed through an inverse operation. For the purpose of fault tolerance at low voltages the cache size is reduced; however, in this architecture the final cache size is considerably higher compared to previously suggested resizable cache organizations [2][3]. The following three features, a) compaction of relocated words, b) ability to use defective words for fault tolerance, and c) \"linking\" (relocating the defective word to any row in the next bank), allow this architecture to achieve far larger fault tolerance in comparison to [2][3]. 
In high voltage mode, the fault tolerant mechanism of RDC-Cache is turned off with minimal (0.91%) latency overhead compared to a traditional cache.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114056710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tight WCRT analysis of synchronous C programs","authors":"P. Roop, Sidharta Andalam, R. V. Hanxleden, S. Yuan, C. Traulsen","doi":"10.1145/1629395.1629424","DOIUrl":"https://doi.org/10.1145/1629395.1629424","url":null,"abstract":"Accurate estimation of the tick length of a synchronous program is essential for efficient and predictable implementations that are devoid of timing faults. The techniques to determine the tick length statically are classified as worst case reaction time (WCRT) analysis. While a plethora of techniques exist for worst case execution time (WCET) analysis of procedural programs, there are only a handful of techniques for determining the WCRT value of synchronous programs. Most of these techniques produce overestimates and hence are unsuitable for the design of systems that are predictable while being also efficient. In this paper, we present an approach for the accurate estimation of the exact WCRT value of a synchronous program, called its tight WCRT value, using model checking. For our input specifications we have selected a synchronous C based language called PRET-C that is designed for programming Precision Timed (PRET) architectures. We then present an approach for static WCRT analysis of these programs via an intermediate format called TCCFG. This intermediate representation is then compiled to produce the input for the model checker. Experimental results that compare our approach to existing approaches demonstrate the benefits of the proposed approach. The proposed approach, while presented for PRET-C, is also applicable for WCRT analysis of Esterel using simple adjustments to the generated model. 
The proposed approach thus paves the way for a generic approach for determining the tight WCRT value of synchronous programs at compile time.","PeriodicalId":136293,"journal":{"name":"International Conference on Compilers, Architecture, and Synthesis for Embedded Systems","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115774391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}