Pablo Abad Fidalgo, P. Prieto, Valentin Puente, J. Gregorio
{"title":"BIXBAR: A low cost solution to support dynamic link reconfiguration in networks on chip","authors":"Pablo Abad Fidalgo, P. Prieto, Valentin Puente, J. Gregorio","doi":"10.1109/ICCD.2012.6378617","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378617","url":null,"abstract":"Improving link utilization is a key aspect in interconnection network design. Reconfigurable-direction interrouter links optimize network resource utilization, which substantially increases the maximum achievable throughput. In the case of On-chip Networks, the short distance between adjacent routers makes feasible fast link arbitration, which makes dynamic link reconfiguration an attractive solution. In this paper we propose a low-cost router micro-architecture that is able to deal with reconfigurable links with a marginal cost over a conventional router. The key element of the proposal is a bidirectional crossbar, which enables reconfiguration of links, without significantly increasing router area and energy. The results obtained indicate that with this proposal, system performance could be improved, for some selected workloads, by up to 25% while energy-performance tradeoff is reduced by 20%, avoiding the additional costs entailed in other state-of-the-art routers capable of performing dynamic link reconfiguration.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124607825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interface design for synthesized structural hybrid microarchitectural simulators","authors":"Zhuo Ruan, D. Penry","doi":"10.1109/ICCD.2012.6378650","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378650","url":null,"abstract":"Computer designers rely upon near-cycle-accurate microarchitectural simulators to explore the design space of new systems. Hybrid simulators which offload simulation work onto FPGAs overcome the speed limitations of software-only simulators as systems become more complex; however, such simulators must be automatically synthesized or the time to design them becomes prohibitive. The performance of a hybrid simulator is significantly affected by how the interface between software and hardware is constructed. We characterize the design space of interfaces for synthesized structural hybrid microarchitectural simulators, provide implementations for several such interfaces, and determine the tradeoffs involved in choosing an efficient design candidate.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127625548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A stochastic reconfigurable architecture for fault-tolerant computation with sequential logic","authors":"Peng Li, Weikang Qian, D. Lilja","doi":"10.1109/ICCD.2012.6378656","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378656","url":null,"abstract":"Computation performed on stochastic bit streams is less efficient than that based on a binary radix because of its long latency. However, for certain complex arithmetic operations, computation on stochastic bit streams can consume less energy and tolerate more soft errors. In addition, the latency issue could be solved by using a faster clock frequency or in combination with a parallel processing approach. To take advantage of this computing technique, previous work proposed a combinational logic-based reconfigurable architecture to perform complex arithmetic operations on stochastic streams of bits. In this paper, we enhance and extend this reconfigurable architecture using sequential logic. Compared to the previous approach, the proposed reconfigurable architecture takes less hardware area and consumes less energy, while achieving the same performance in terms of processing time and fault-tolerance.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133828246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed thermal-aware task scheduling for 3D Network-on-Chip","authors":"Yingnan Cui, Wei Zhang, Hao Yu","doi":"10.1109/ICCD.2012.6378690","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378690","url":null,"abstract":"The development of 3D integration technology significantly improves the bandwidth of network-on-chip (NoC) systems. However, the high integration density enabled by 3D technology also raises severe concerns about temperature increase, which may impair system reliability and degrade performance. Task scheduling has been regarded as one effective approach to eliminating thermal hotspots without introducing hardware overhead. However, centralized thermal-aware task scheduling algorithms for 3D-NoC have been limited by high computational complexity as the system scale increases. In this paper, we propose a distributed agent-based thermal-aware task scheduling algorithm for 3D-NoC which shows high scheduling efficiency and high scalability. Experimental results have shown that when compared to the centralized algorithms, our algorithm can achieve up to 13 °C reduction in peak temperature of the system without sacrificing performance.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115096810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. J. Filho, A. Aguiar, F. Magalhães, Oliver B. Longhi, Fabiano Hessel
{"title":"Task model suitable for dynamic load balancing of real-time applications in NoC-based MPSoCs","authors":"S. J. Filho, A. Aguiar, F. Magalhães, Oliver B. Longhi, Fabiano Hessel","doi":"10.1109/ICCD.2012.6378616","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378616","url":null,"abstract":"Modern embedded systems implemented through Multiprocessor System-on-Chip (MPSoCs) benefit from resources that were previously available solely in general-purpose computers. Currently, these systems are able to provide more features at the cost of an increased design complexity. In this scenario, the applications' behaviour has changed. In the past, the majority of applications showed a static behaviour throughout their entire lifetime. Applications could be divided into tasks and mapped onto processing elements at design time. Currently, the applications' dynamic nature requires efficient dynamic load balancing techniques with different task mapping strategies, although a fair static mapping still helps increase overall system performance. In this paper we present a task model suitable for dynamic load balancing of real-time applications with special support for Network-on-Chip (NoC)-based MPSoCs that aims to stabilize the system load throughout its lifetime. Results show a reduction in both system stabilization time (mean of 47.62%) and deadline misses (mean of 32.28%) for several benchmarks, compared to classic approaches which employ a centralized migration manager.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115051906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 3D stacked high performance scalable architecture for 3D Fourier Transform","authors":"G. Voicu, M. Enachescu, S. Cotofana","doi":"10.1109/ICCD.2012.6378692","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378692","url":null,"abstract":"This paper proposes and evaluates a novel high-performance systolic architecture for 3D Fourier Transform specially tailored for 3D stacking integration with Through Silicon Vias. Our cuboid-shaped systolic network of orthogonally connected processing elements makes use of the DFT algorithm to compute an N<sub>1</sub>×N<sub>2</sub>×N<sub>3</sub>-point 3D-FT with an asymptotic time complexity of O(N<sub>1</sub>+N<sub>2</sub>+N<sub>3</sub>) multiplications. When compared with state-of-the-art 3D-FFT implementation on the Anton machine, a physical synthesized implementation of our architecture on the same 90nm technology node achieves 7.73× and 5.88× speed improvement when computing 16×16×16 and 32×32×32 FT, respectively.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Damavandpeyma, S. Stuijk, M. Geilen, T. Basten, H. Corporaal
{"title":"Parametric throughput analysis of scenario-aware dataflow graphs","authors":"M. Damavandpeyma, S. Stuijk, M. Geilen, T. Basten, H. Corporaal","doi":"10.1109/ICCD.2012.6378644","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378644","url":null,"abstract":"Scenario-aware dataflow graphs (SADFs) efficiently model dynamic applications. The throughput of an application is an important metric to determine the performance of the system. For example, the number of frames per second output by a video decoder should always stay above a threshold that determines the quality of the system. During design-space exploration (DSE) or run-time management (RTM), numerous throughput calculations have to be performed, and they have to be performed as fast as possible. For synchronous dataflow graphs (SDFs), a technique exists that extracts throughput expressions from a parameterized SDF in which the execution time of the tasks (actors) is a function of some parameters. Evaluation of these expressions can be done in a negligible amount of time and provides the throughput for a specific set of parameter values. This technique is not applicable to SADFs. In this paper, we present a technique, based on Max-Plus automata, that finds throughput expressions for a parameterized SADF. Experimental evaluation shows that our technique can be applied to realistic applications. These results also show that our technique is more scalable and faster than the available parametric throughput analysis technique for SDFs.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132885535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing performance and fault detection for GPGPU workloads","authors":"J. Backer, R. Karri","doi":"10.1109/ICCD.2012.6378702","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378702","url":null,"abstract":"GPUs are increasingly being used for processing highly parallel scientific and high performance workloads. Such applications require correctness and accuracy of the computation. GPUs lack adequate support for detecting hardware faults that may lead to computation errors. We present a tunable fault detection scheme that allows one to balance GPU performance and fault checking by configuring the amount of resources to allocate for detection and the frequency of checking for faults.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117013620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A spectral transform approach to stochastic circuits","authors":"Armin Alaghi, J. Hayes","doi":"10.1109/ICCD.2012.6378658","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378658","url":null,"abstract":"Stochastic computing (SC) processes data in the form of long pseudo-random bit-streams denoting probabilities. Its key advantages are simple computational elements and high soft-error tolerance. Recent technology developments have revealed important new SC applications such as image processing and LDPC decoding. Despite its long history, SC still lacks a comprehensive design methodology; existing methods tend to be ad hoc and limited to a few arithmetic functions. We demonstrate a fundamental relation between stochastic circuits and spectral transforms. Based on this, we propose a transform approach to the analysis and synthesis of SC circuits. We illustrate the approach for a variety of basic combinational SC design problems, and show that the area cost associated with stochastic number generation can be significantly reduced.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117041405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Isaac Liu, J. Reineke, David Broman, Michael Zimmer, Edward A. Lee
{"title":"A PRET microarchitecture implementation with repeatable timing and competitive performance","authors":"Isaac Liu, J. Reineke, David Broman, Michael Zimmer, Edward A. Lee","doi":"10.1109/ICCD.2012.6378622","DOIUrl":"https://doi.org/10.1109/ICCD.2012.6378622","url":null,"abstract":"We contend that repeatability of execution times is crucial to the validity of testing of real-time systems. However, computer architecture designs fail to deliver repeatable timing, a consequence of aggressive techniques that improve average-case performance. This paper introduces the Precision-Timed ARM (PTARM), a precision-timed (PRET) microarchitecture implementation that exhibits repeatable execution times without sacrificing performance. The PTARM employs a repeatable thread-interleaved pipeline with an exposed memory hierarchy, including a repeatable DRAM controller. Our benchmarks show an improved throughput compared to a single-threaded in-order five-stage pipeline, given sufficient parallelism in the software.","PeriodicalId":313428,"journal":{"name":"2012 IEEE 30th International Conference on Computer Design (ICCD)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115329683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}