P. Burgio, M. Ruggiero, Francesco Esposito, Mauro Marinoni, G. Buttazzo, L. Benini
{"title":"Adaptive TDMA bus allocation and elastic scheduling: A unified approach for enhancing robustness in multi-core RT systems","authors":"P. Burgio, M. Ruggiero, Francesco Esposito, Mauro Marinoni, G. Buttazzo, L. Benini","doi":"10.1109/ICCD.2010.5647792","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647792","url":null,"abstract":"Next-generation real-time systems will be increasingly based on heterogeneous MPSoC design paradigms, where predictability and performance will be key issues to deal with. Such issues can be tackled both at the hardware level, by embedding technologies such as TDMA busses, and at the OS level, where suitable scheduling techniques can improve performance and reduce energy consumption. Among these, elastic scheduling has been proved to provide satisfactory results by dynamically reducing task periods at run-time to ensure the highest utilization possible of the processors. On the other hand, elastic scheduling lowers the degree of predictability and increases the complexity of the analysis at the system level. This reduces the benefits given by the TDMA bus, which relies on the high level task analysis for a robust and efficient slot allocation. Starting from this consideration, we propose a system where the elastic scheduling and the TDMA bus work synergistically. We introduce a QoS-aware adaptive bus service which takes the best of both techniques, mitigating their drawbacks at the same time. We show how the overhead introduced by coordination action is small, and it is however dominated by the benefits of the overall strategy in terms of performance and predictability guarantees.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131815450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Routability-driven flip-flop merging process for clock power reduction","authors":"Zhi-Wei Chen, Jin-Tai Yan","doi":"10.1109/ICCD.2010.5647784","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647784","url":null,"abstract":"The concept of merging some 1-bit flip-flops into a multi-bit flip-flop is applied to reduce dynamic clock power and decrease the total flip-flop area in a synchronous design. To acquire these advantages, the design must be guaranteed to satisfy certain physical constraints in the merging process. In this paper, given a set of 1-bit flip-flops with the input and output timing constraints, the area constraint inside any partitioned bin and the capacity constraint on any bin edge in a placement plane, an efficient routability-driven approach is proposed to merge 1-bit flip-flops into some multi-bit flip-flops for clock power reduction. The experimental results show that our proposed approach reduces 37.4% of the flip-flop area to maintain the synchronous design and saves 24.82% of the clock power for five examples in reasonable CPU time on the average.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133033920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EQUIPE: Parallel equivalence checking with GP-GPUs","authors":"Debapriya Chatterjee, V. Bertacco","doi":"10.1109/ICCD.2010.5647645","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647645","url":null,"abstract":"Combinational equivalence checking (CEC) is a mainstream application in Electronic Design Automation used to determine the equivalence between two combinational netlists. Tools performing CEC are widely deployed in the design flow to determine the correctness of synthesis transformations and optimizations. One of the main limitations of these tools is their scalability, as industrial scale designs demand time-consuming computation. In this work we propose EQUIPE, a novel combinational equivalence checking solution, which leverages the massive parallelism of modern general purpose graphic processing units. EQUIPE reduces the need for hard-to-parallelize engines, such as BDDs and SAT, by taking advantage of algorithms well-suited to concurrent implementation. We found experimentally that EQUIPE outperforms commercial CEC tools by an order of magnitude, on average, and state-of-the-art research CEC solutions by up to a factor of three, on a wide range of industry-strength designs.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115470582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using variable clocking to reduce leakage in synchronous circuits","authors":"Navid Toosizadeh, S. Zaky, Jianwen Zhu","doi":"10.1109/ICCD.2010.5647716","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647716","url":null,"abstract":"There is a growing demand for high-performance, low-power systems, particularly in portable devices. New approaches to design are needed in technologies with feature sizes of 90 nm and below to reduce leakage power and to deal with process variations, which force designers to use increasingly conservative delay estimations. This paper presents a variable clock generator for a conventionally-designed synchronous circuit core. The clock frequency adjusts automatically to inter-and intra-chip process, voltage and temperature variations, making it possible to design the circuit assuming typical rather than worst-case conditions. The resulting circuit uses much fewer high-speed, low-voltage-threshold cells, and consequently has significantly reduced leakage power. Post-layout test results on a 32-bit microprocessor implemented in 90-nm technology showed 10X less leakage and 19% less dynamic power when operating under typical conditions, compared to a conventional, fixed-frequency implementation. The system is functional under all PVT corners.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117207737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient MIMD architectures for high-performance ray tracing","authors":"D. Kopta, J. Spjut, E. Brunvand, A. Davis","doi":"10.1109/ICCD.2010.5647555","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647555","url":null,"abstract":"Ray tracing efficiently models complex illumination effects to improve visual realism in computer graphics. Typical modern GPUs use wide SIMD processing, and have achieved impressive performance for a variety of graphics processing including ray tracing. However, SIMD efficiency can be reduced due to the divergent branching and memory access patterns that are common in ray tracing codes. This paper explores an alternative approach using MIMD processing cores custom-designed for ray tracing. By relaxing the requirement that instruction paths be synchronized as in SIMD, caches and less frequently used area expensive functional units may be more effectively shared. Heavy resource sharing provides significant area savings while still maintaining a high MIMD issue rate from our numerous light-weight cores. This paper explores the design space of this architecture and compares performance to the best reported results for a GPU ray tracer and a parallel ray tracer using general purpose cores. We show an overall performance that is six to ten times higher in a similar die area.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128026408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. D. Guglielmo, F. Fummi, Nicola Orlandi, G. Pravadelli
{"title":"DDPSL: An easy way of defining properties","authors":"L. D. Guglielmo, F. Fummi, Nicola Orlandi, G. Pravadelli","doi":"10.1109/ICCD.2010.5647654","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647654","url":null,"abstract":"The paper proposes DDPSL (Drag and Drop PSL) a template library and a tool which simplifies the definition of PSL (Property Specification Language) formal properties by exploiting PSL-based templates. DDPSL allows users not expert in formal methods to define PSL properties by dragging and dropping logical and temporal operators, and variables from the design under verification (DUV) into predefined templates. Moreover, confident users or experts can extend the set of templates, reducing the effort required for formalizing complex properties. From the methodological point of view, DDPSL combines the advantages of both Open Verification Library (OVL) and PSL. Note that the templates are characterized by a parametric interface that separates the formal definition from its semantics, as provided by OVL. Moreover, the adoption of PSL as reference language guarantees the expressiveness of popular temporal logics such as Linear Temporal Logic (LTL) and Computational Tree Logic (CTL), which, on the contrary, are not fully supported by OVL. DDPSL has been successfully used to define properties for verifying an embedded application running on the microcontroller of an industrial oven.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132018765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skew-aware capacitive load balancing for low-power zero clock skew rotary oscillatory array","authors":"V. Honkote, B. Taskin","doi":"10.1109/ICCD.2010.5647781","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647781","url":null,"abstract":"Rotary clocking is a traveling wave based high-speed resonant clocking technology with low-power and controllable-skew properties. Capacitive load balance and bounded clock skew are identified as the primary requirements to maintain a stable oscillation frequency across the rings and to achieve timing closure, respectively, in the rotary oscillatory array (ROA). Towards this end, two methodologies are proposed to achieve balanced capacitive loads across the rings of the ROA with a bounded skew constraint. Experiments performed on IBM R1–R5 benchmark circuits show a 5.62X improved capacitive balance and a 3.67% improved clock skew to a total skew of 6.55% of the clock period at 1.8GHz. SPICE simulations show that the frequency variation across the rings of the ROA is reduced from 10.14% to 2.12% as well. Power dissipated with the proposed optimization methodologies are within ±1.5% of the conventional design automation techniques for rotary synchronization.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127455809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QoS scheduling for NoCs: Strict Priority Queueing versus Weighted Round Robin","authors":"Yue Qian, Zhonghai Lu, Q. Dou","doi":"10.1109/ICCD.2010.5647577","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647577","url":null,"abstract":"Strict Priority Queueing (SPQ) andWeighted Round Robin (WRR) are two common scheduling techniques to achieve Quality-of-Service (QoS) while using shared resources. Based on network calculus, we build analytical models for traffic flows under SPQ and WRR scheduling in on-chip wormhole networks. With these models, we can derive per-flow end-to-end delay bound. We compare the service behavior and show that WRR is not only more fair but also more flexible for QoS provision. To exhibit the potential and flexibility enabled by WRR, we develop a weight allocation algorithm to automatically assign proper weights for individual flows to satisfy their delay constraints. In particular, the weights are assigned in a way not more than necessary, in other words, to approach flows' delay constraints in order to leave room for other flows. Our experimental results validate our analysis technique and algorithms.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128620171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inter-socket victim cacheing for platform power reduction","authors":"Subhra Mazumdar, D. Tullsen, Justin J. Song","doi":"10.1109/ICCD.2010.5647634","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647634","url":null,"abstract":"On a multi-socket architecture with load below peak, as is often the case in a server installation, it is common to consolidate load onto fewer sockets to save processor power. However, this can increase main memory power consumption due to the decreased total cache space. This paper describes inter-socket victim cacheing, a technique that enables such a system to do both load consolidation and cache aggregation at the same time. It uses the last level cache of an idle processor in a connected socket as a victim cache, holding evicted data from the active processor. This enables expensive main memory accesses to be replaced by cheaper cache hits. This work examines both static and dynamic victim cache management policies. Energy savings is as high as 32.5%, and averages 5.8%.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129186864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automotive embedded driver assistance: A real-time low-power FPGA stereo engine using semi-global matching","authors":"Felix Eberli","doi":"10.1109/ICCD.2010.5647552","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647552","url":null,"abstract":"Today, automotive driver assistant systems are a growing market. After an overview on current driver assistant systems we will focus on vision based systems and their special requirements. As an example project we will describe the development of a next generation stereo vision algorithm.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125080069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}