{"title":"Data reuse driven energy-aware MPSoC co-synthesis of memory and communication architecture for streaming applications","authors":"I. Issenin, N. Dutt","doi":"10.1145/1176254.1176326","DOIUrl":"https://doi.org/10.1145/1176254.1176326","url":null,"abstract":"The memory subsystem of a complex multiprocessor systems- on-chip (MPSoC) is an important contributor to the chip power consumption. The selection of memory architecture, as well as of communication architecture, both affect the power efficiency of the design. In this paper we propose a novel approach that enables energy-aware co-synthesis of both memory and communication architecture for streaming applications. As opposed to earlier techniques, we employ a powerful compile-time analysis of memory access behavior that adds flexibility in selecting memory architectures. Additionally, we target TDMA bus-based communication architectures, which not only guarantee performance, but also greatly reduce the design time and allow us to find the energy optimal system configuration. We propose and compare three techniques: an optimal mixed ILP- based co-synthesis technique, a mixed ILP-based traditional two- step synthesis approach where memory and communication synthesis is performed sequentially, and a co-synthesis heuristic that synthesizes energy-efficient hierarchical bus-based communication architectures with guaranteed throughput. Our experimental results on a number of streaming applications show that both the traditional two-step synthesis approach and heuristic result in up to 50% worse power consumption in comparison with proposed co-synthesis approach. However, on some of the streaming benchmarks, our co-synthesis heuristic approach was able to find optimal or near-optimal results in a much shorter time than the MILP co-synthesis approach.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126741423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic phase detection for stochastic on-chip traffic generation","authors":"A. Scherrer, A. Fraboulet, T. Risset","doi":"10.1145/1176254.1176277","DOIUrl":"https://doi.org/10.1145/1176254.1176277","url":null,"abstract":"(NoC) prototyping is used for adapting NoC parameters to the application running on the chip. This prototyping is currently done using traffic generators which emulate the SoC components (IPs) behavior: processors, hardware accelerators, etc. Traffic generated by processor-like IPs is highly non-regular, it must be decomposed into program phases. We propose an original feature for NoC prototyping, inspired by techniques used in processor architecture performance evaluation: the automatic detection of traffic phases. Integrated in our NoC prototyping environment, this feature permits to have a completely automatic toolchain for the generation of stochastic traffic generators. We show that our traffic generators emulate precisely the behavior of processors and that our environment is a versatile tool for networks-on-chip prototyping. Simulations are performed in a SystemC-based simulation environment with a mesh network-on-chip (DSPIN) and a processor running MP3 decoding applications.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126971782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aseem Gupta, N. Dutt, F. Kurdahi, K. Khouri, M. Abadir
{"title":"Floorplan driven leakage power aware IP-based SoC design space exploration","authors":"Aseem Gupta, N. Dutt, F. Kurdahi, K. Khouri, M. Abadir","doi":"10.1145/1176254.1176284","DOIUrl":"https://doi.org/10.1145/1176254.1176284","url":null,"abstract":"Multi-million gate system-on-chip (SoC) designs increasingly rely on intellectual property (IP) blocks. However, due to technology scaling the leakage power consumption of the IP blocks has risen thus leading to possible thermal runaway. In IP-based design there has been a disconnect between system level design and physical level steps such as floorplanning which can lead to failures in manufactured chips. This necessitates coupling between system level and physical level design steps. The leakage power of an IP-block increases with its temperature which is dependent on the SoC's floorplan due to thermal diffusion. We have observed that different floorplans of the same SoC can have up to 3X difference in leakage power. Hence the system designer needs to be aware of this design space between floorplans and leakage power. We propose a leakage aware exploration (LAX) framework which enables the system designer to create this design space early in the design cycle and provides an opportunity to make changes in the system design. We show the size of the design space generated by applying LAX on ten industrial SoC designs from Freescale Semiconductor Inc. and observe that the leakage power can vary by as much as 190% for 65% difference in the inactive area.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122889108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A bus architecture for crosstalk elimination in high performance processor design","authors":"W. Hsieh, Po-Yuan Chen, TingTing Hwang","doi":"10.1145/1176254.1176314","DOIUrl":"https://doi.org/10.1145/1176254.1176314","url":null,"abstract":"In deep sub-micron technology, the crosstalk effect between adjacent wires has become an important issue, especially between long on-chip buses. This effect leads to the increase in delay, in power consumption, and in worst case, to incorrect result. In this paper, we propose a de-assembler/assembler structure to eliminate undesirable crosstalk effect on bus transmission. By taking advantage of the prefetch process where the instruction/data fetch rate is always higher than instruction/data commit rate in high performance processors, the proposed method would hardly reduce the performance. In addition, the required number of extra bus wires is only 7 as compared with 85 needed in [6] when the bus width is 128 bits.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123449678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decision-theoretic exploration of multiProcessor platforms","authors":"G. Beltrame, Dario Bruschi, D. Sciuto, C. Silvano","doi":"10.1145/1176254.1176305","DOIUrl":"https://doi.org/10.1145/1176254.1176305","url":null,"abstract":"In this paper, we present an efficient technique to perform design space exploration of a multi-processor platform that minimizes the number of simulations needed to identify the power-performance approximate Pareto curve. Instead of using semi-random search algorithms (like simulated annealing, tabu search, genetic algorithms, etc.), we use domain knowledge derived from the platform architecture to set-up exploration as a decision problem. Each action in the decision-theoretic framework corresponds to a change in the platform parameters. Simulation is performed only when information about the probability of action outcomes becomes insufficient for a decision. The algorithm has been tested with two multi-media industrial applications, namely an MPEG4 encoder and an Ogg-Vorbis decoder. Results show that the exploration of the number of processors and two-level cache size and policy, can be performed with less than 15 simulations with 95% accuracy, increasing the exploration speed by one order of magnitude when compared to traditional operation research techniques.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123602050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Hohenauer, Christoph Schumacher, R. Leupers, G. Ascheid, H. Meyr, Hans van Someren
{"title":"Retargetable code optimization with SIMD instructions","authors":"M. Hohenauer, Christoph Schumacher, R. Leupers, G. Ascheid, H. Meyr, Hans van Someren","doi":"10.1145/1176254.1176291","DOIUrl":"https://doi.org/10.1145/1176254.1176291","url":null,"abstract":"Retargetable C compilers are nowadays widely used to quickly obtain compiler support for new embedded processors and to perform early processor architecture exploration. One frequent concern about retargetable compilers, though, is their lack of machine-specific code optimization techniques in order to achieve highest code quality. While this problem is partially inherent to the retargetable compilation approach, it can be circumvented by designing flexible, configurable code optimization techniques that apply to a certain range of target architectures. This paper focuses on target machines with SIMD instruction support which is widespread in embedded processors for multimedia applications. We present an efficient and quickly retargetable SIMD code optimization technique that is integrated into an industrial retargetable C compiler. Experimental results for the Philips Trimedia processor demonstrate that the proposed technique applies to real-life target machines and that it produces code quality improvements close to the theoretical limit.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133995369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Papanikolaou, T. Grabner, M. Corbalan, P. Roussel, F. Catthoor
{"title":"Yield prediction for architecture exploration in nanometer technology nodes:: a model and case study for memory organizations","authors":"A. Papanikolaou, T. Grabner, M. Corbalan, P. Roussel, F. Catthoor","doi":"10.1145/1176254.1176315","DOIUrl":"https://doi.org/10.1145/1176254.1176315","url":null,"abstract":"Process variability has a detrimental impact on the performance of memories and other system components, which can lead to parametric yield loss at the system level due to timing violations. Conventional yield models do not allow to accurately analyze this, at least not at the system level. In this paper we propose a technique to estimate this system level yield loss for a number of alternative memory organization implementations. This can aid the designer into making educated trade-offs at the architecture level between energy consumption and parametric timing yield by using memories from different available libraries with different energy/performance characteristics considering the impact of manufacturing variations. The accuracy of this technique is very high, an average error of less than 1% is reported, which enables an early exploration of the available options.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132861084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cutting across layers of abstraction:: removing obstacles from the advancement of embedded systems","authors":"K. Flautner","doi":"10.1145/1176254.1176318","DOIUrl":"https://doi.org/10.1145/1176254.1176318","url":null,"abstract":"Silicon technology evolution over the last four decades has yielded an exponential increase in integration densities with steady improvements of performance and power consumption at each technology generation. This steady progress has created a sense of entitlement for the riches that future process generations would bring. Today, however, classical process scaling seems to be dead and living up to technology expectations requires continuous innovation at many levels, which comes at steadily progressing implementation and design costs. Solutions to problems need to cut across layers of abstractions and require coordination between software, architecture and circuit features. Heterogeneous multiprocessor clusters are increasingly used to deliver the required compute power for high-end applications. Heterogeneity ensures that the necessary processing power can be delivered at high levels of efficiency at reasonable implementation cost, while the use of processors endow these systems with large degrees of flexibility. One of the key challenges with these systems is system-level programming. Traditional compiler technologies are strong at programming individual cores but leave the task of parallelization to a team of experts. The first part of this talk will describe the coupling of the compiler to the system architecture on a multi-core signal-processing cluster and illustrate how compiler technology can enable the writing of portable parallel programs for it using little more than C. As claimed above, close coupling of abstraction layers can be beneficial. This can also be illustrated at the microarchitecture - circuit boundary. The second part of the talk will describe a prototype microarchitecture which is designed explicitly to deal with issues such as silicon variation and soft errors. These features in return enable system designers to focus on the typical-case performance of their implementations without having to be over-constrained by worst-case conditions.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124322636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongsoo Joo, Yongseok Choi, Chanik Park, S. Chung, Eui-Young Chung, N. Chang
{"title":"Demand paging for OneNANDTM Flash eXecute-in-place","authors":"Yongsoo Joo, Yongseok Choi, Chanik Park, S. Chung, Eui-Young Chung, N. Chang","doi":"10.1145/1176254.1176310","DOIUrl":"https://doi.org/10.1145/1176254.1176310","url":null,"abstract":"NAND flash memory can provide cost-effective secondary storage in mobile embedded systems, but its lack of a random access capability means that code shadowing is generally required, taking up extra RAM space. Demand paging with NAND flash memory has recently been proposed as an alternative which requires less RAM. This scheme is even more attractive for OneNAND flash, which consists of a NAND flash array with SRAM buffers, and supports eXecute-ln-Place (XIP), which allows limited random access to data on the SRAM buffers. We introduce a novel demand paging method for OneNAND flash memory with XIP feature. The proposed on-line demand paging method with XIP adopts finite size sliding window to capture the paging history and thus predict future page demands. We particularly focus on non-critical code accesses which can disturb real-time code. Experimental results show that our method outperforms conventional LRU-based demand paging by 57% in terms of execution time and by 63% in terms of energy consumption. It even beats the optimal solution obtained from MIN, which is a conventional off-line demand paging technique by 30% and 40% respectively.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126197666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bounded arbitration algorithm for QoS-supported on-chip communication","authors":"M. A. Faruque, Gereon Weiss, J. Henkel","doi":"10.1145/1176254.1176275","DOIUrl":"https://doi.org/10.1145/1176254.1176275","url":null,"abstract":"Time-critical multi-processor systems require guaranteed services in terms of throughput, bandwidth etc. in order to comply to hard real-time constraints. However, guaranteed-service schemes suffer from low resource utilization. To the best of our knowledge, we are presenting the first approach for on-chip communication that provides a high resource utilization under a transaction-specific, flexible (i.e. different classifications on data exchange) communication scheme. It does provide tight time-related guarantees. Hence, we are presenting our bounded arbitration scheme considering lower and upper bounds for each type of transaction level. We demonstrate its advantages by means of a complete MPEG4 decoder case study and achieve under these constraints a bandwidth utilization of up to 100%, on an average 97% with a guaranteed (100%) bandwidth.","PeriodicalId":370841,"journal":{"name":"Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129390713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}