{"title":"MINLP Based Power Optimization for Pipelined ADC","authors":"A. Purushothaman","doi":"10.1109/ISVLSI.2016.64","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.64","url":null,"abstract":"This paper proposes a Mixed Integer Non-linear Programming (MINLP) based optimization algorithm for designing power-optimized pipelined ADCs. For a given specification, the proposed algorithm gives the stage resolution and the sampling capacitor per stage that minimize the total power consumption. Closed-form expressions for the power consumption of each stage are derived and used as the objective function. Pipelined ADCs of various specifications, viz., 10-bit, 12-bit, and 16-bit, were designed and validated using this algorithm.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116536030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Heterogeneity on a Reconfigurable Multicore System","authors":"Rafael Fão de Moura, J. D. Souza, L. Carro, A. C. S. Beck, M. B. Rutzig","doi":"10.1109/ISVLSI.2016.67","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.67","url":null,"abstract":"Modern embedded systems must efficiently exploit parallelism at thread- and instruction-level to achieve the best performance with the lowest energy consumption possible. While Multiprocessor Systems-on-Chip (MPSoCs) are a commonly used solution, they do not provide an effective environment for software production, as each processing element implements a different Instruction Set Architecture (ISA). On the other hand, processors such as the ARM big.LITTLE comprise multicores with different organizations and the same ISA. However, such cores are power-consuming superscalar microarchitectures. Dynamic Reconfigurable Architectures (DRAs) emerge as a solution to fill this gap. By taking advantage of their regular fabric, it is possible to develop a low-energy heterogeneous system by coupling to the cores DRAs that have different processing capabilities and implement the same ISA. In this work, we evaluate such a system, varying both the size of the DRAs and the memory system involved. We show that, by tuning the latter, one can reach energy savings of up to 36%, while a fully heterogeneous system yields energy savings of 28% with a performance loss of 7% compared to its homogeneous counterpart.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122884070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Adder with Hybrid Prediction and Error Compensation Technique","authors":"Xinghua Yang, Yue Xing, F. Qiao, Qi Wei, Huazhong Yang","doi":"10.1109/ISVLSI.2016.16","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.16","url":null,"abstract":"This paper proposes an approximate adder to accelerate computation and reduce energy consumption for error-resilient applications with moderate output quality loss. The computation acceleration comes from the prediction scheme for the adder circuit, where the critical path is divided into multiple short fragments and a parallel addition process is enabled. The energy consumption is reduced as a result of trimming the registers from the lower predictors of the design. Furthermore, a simple module for error compensation is inserted into the approximate part of the circuit to decrease the relative error with very little hardware cost. Simulated in a 65nm CMOS process, the design achieves a 2.82X speedup and a 57.8% energy-efficiency improvement compared with traditional adders. Compared with current high-performance approximate adders, the proposed adder shows 6.9% energy savings with a reduction of 2 orders of magnitude in relative error using random test data. Finally, the proposed approximate adder is adopted in DCT processing, where a PSNR increase of more than 10dB can be achieved compared with current counterpart designs.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115804958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Gracefully Degrading and Energy-Efficient Fault Tolerant NoC Using Spare Core","authors":"B. N. K. Reddy, M. H. Vasantha, Kumar Y. B. Nithin","doi":"10.1109/ISVLSI.2016.80","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.80","url":null,"abstract":"Reliability is a significant concern for modern multi-core embedded systems. On-chip communication systems are vulnerable to permanent network faults and transient faults, which might significantly affect the performance of the system. Targeting a fault-tolerance solution for faulty cores in a Network on Chip (NoC), this paper proposes an energy-efficient fault-tolerant NoC architecture using a spare core. The proposed strategy comprises finding the smallest rectangular region in which to place the given application using a heuristic technique, mapping vertices within the selected region, and selecting the region that yields maximum overall performance and minimum communication energy. A spare core is placed within the region and connected to the vertices. Many application core graphs are used to evaluate the proposed technique. The simulation results of many fault injection tests indicate that the proposed technique improves performance while also saving communication energy.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121882208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mod (2P-1) Shuffle Memory-Access Instructions for FFTs on Vector SIMD DSPs","authors":"Sheng Liu, Haiyan Chen, Jianghua Wan, Yaohua Wang","doi":"10.1109/ISVLSI.2016.71","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.71","url":null,"abstract":"The Binary Exchange Algorithm (BEA) always introduces excessive shuffle operations when mapping FFTs onto vector SIMD DSPs, which can greatly restrict overall performance. We propose a novel mod (2P-1) shuffle function and the Mod-BEA algorithm (MBEA), which can halve the shuffle operation count and unify the shuffle mode. This unified shuffle mode inspires us to propose a set of novel mod (2P-1) shuffle memory-access instructions, which can totally eliminate the shuffle operations. Experimental results show that the combination of MBEA and the proposed instructions can bring 17.2%-31.4% performance improvements at reasonable hardware cost, and compress the code size by about 30%.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129959085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Per-Warp Reconvergence Stack for Efficient Control Flow Handling in GPUs","authors":"Yaohua Wang, Xiaowen Chen, Dong Wang, Sheng Liu","doi":"10.1109/ISVLSI.2016.35","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.35","url":null,"abstract":"GPGPUs usually experience performance degradation when the control flow of threads diverges in a warp. The reconvergence-stack-based control flow handling scheme is widely adopted in GPU architectures. The depth of such a stack is always set to a large number, so that there are enough entries for warps experiencing nested branches. However, for warps experiencing simple branches or even no branches, those deep reconvergence stacks stay idle, causing a serious waste of hardware resources. Moreover, as GPU architectures develop and more and more warps are deployed on a GPU stream processor core, this problem could become even more serious. To solve this problem, this paper proposes a dynamic reconvergence stack structure, in which a stack pool is shared by all the warps, and dynamic stacks of different warps can be constructed according to the run-time requirement. This can satisfy the stack requirement while eliminating unnecessary waste of hardware resources. Our experiments show that the dynamic reconvergence stack can reduce the stack cost by 50% while the performance of the conventional scheme is well maintained.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131493188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design Optimization of Register File Throughput and Energy Using a Virtual Prototyping (ViPro) Tool","authors":"Ningxi Liu, B. Calhoun","doi":"10.1109/ISVLSI.2016.50","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.50","url":null,"abstract":"Register files (RFs) consume significant power in low-power processors, and their specifications vary substantially for different applications. Challenges exist in identifying the appropriate RF design and optimizing RFs for different specifications. This paper not only explores methodologies for designing low-power, high-performance RFs but also extends a virtual prototyping (ViPro) tool to support fast and efficient estimation of the impact of different design knobs on overall multi-port RF macros. To enable aggressive exploration of RF design, three bitline (BL) sensing schemes are incorporated into ViPro along with parasitic parameters extracted from layout. ViPro results are accurate to within 15% compared to full RF schematic SPICE simulation, while ViPro simulation is 5-10 times faster. An example reveals how ViPro can optimize RF design based on various specifications in a 45nm CMOS technology. Improvements of data throughput for 1R/1W port RFs are 31% and 72% at 0.5KB and 512KB, respectively, with proper BL sensing techniques. Results also show that the optimal BL sensing scheme changes with memory capacity. At 0.5KB, the lowest energy per operation decreases by 7.5% with a single-ended BL, while the energy reduction is 45% with a hierarchical BL for 512KB.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122371977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Accurate All CMOS Temperature Sensor for IoT Applications","authors":"Sunil Kumar Maddikatla, S. Jandhyala","doi":"10.1109/ISVLSI.2016.113","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.113","url":null,"abstract":"In this manuscript, an area-efficient, linear, robust CMOS integrated temperature sensor circuit is proposed in multiple technology nodes using the UMC RF process for IoT and low-cost SoC applications. In the UMC 180nm node, the proposed temperature sensor has an accuracy of ±0.4°C over 3σ variation in process and ±10% variation in supply, in the temperature range -55°C to 125°C. In the 65nm node, it has an accuracy of ±0.6°C over 3σ variation in process and ±10% variation in supply, in the temperature range -55°C to 125°C. The proposed design achieves a highly linear, proportional to absolute temperature (PTAT) voltage with reduced process corner dependence, using a process-invariant circuit in conjunction with a supply-independent biasing circuit. The supply sensitivity of the output voltage is 1100 ppm/V and the spread with process is limited to ±0.6°C at UMC 180nm and ±1.5°C at 65nm technology. The proposed sensor in UMC 180nm technology occupies an area of 0.002 mm² and consumes 108μW of power. The output voltage is 136mV at room temperature (27°C) in the typical corner, with a slope of 0.650mV/°C. The temperature sensor is included in a micro gyroscope application and the effect of temperature on the angular frequency at zero bias is presented.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121528339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Workload-Aware Power Gating Design and Run-Time Management for Massively Parallel GPGPUs","authors":"K. Dev, S. Reda, Indrani Paul, Wei Huang, W. Burleson","doi":"10.1109/ISVLSI.2016.60","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.60","url":null,"abstract":"Power gating (PG) is an effective power efficiency improvement technique. Future general-purpose graphics processing units (GPGPUs) will likely feature hundreds of compute units (CUs) and be power constrained, which leads to serious challenges to existing PG methodologies. In this paper, we propose novel design-time and run-time techniques to effectively implement power gating in future GPGPUs. Based on industrial models/measurement facilities, we show that designers must consider run-time parallelism within potential applications while implementing power gating designs to avoid incurring unnecessary design overheads. By scaling measurements from a real 28nm GPGPU to a hypothetical future 10nm node, we show that a PG granularity of 16 CU/cluster achieves 99% peak run-time performance without the excessive 53% design-time area overhead of per-CU PG. We also demonstrate that a run-time power management algorithm that is aware of the PG granularity leads to up to 18% additional performance through frequency-boosting under thermal-design power (TDP) constraints.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121398504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SoC, NoC and Hierarchical Bus Implementations of Applications on FPGAs Using the FCUDA Flow","authors":"T. Nguyen, Yao Chen, K. Rupnow, S. Gurumani, Deming Chen","doi":"10.1109/ISVLSI.2016.131","DOIUrl":"https://doi.org/10.1109/ISVLSI.2016.131","url":null,"abstract":"The FCUDA project aims to improve programmability of FPGAs and expression of application parallelism in High Level Synthesis (HLS) through the use of the CUDA language. The CUDA language is a popular single-instruction multiple-data (SIMD) style programming language with wide adoption, thus offering significant opportunity to bring experienced programmers to FPGA computing. The FCUDA project has now open-sourced the core CUDA to RTL transformation as well as the infrastructure for design space exploration, bus-based and NoC-based on-chip communications, and platform integration with Xilinx's SoC systems. In this paper, we present FCUDA's design space exploration, interconnect and platform integration, and offer guidelines for selecting the system-level infrastructure that gives the best implementation of an application.","PeriodicalId":140647,"journal":{"name":"2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129127215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}