{"title":"A power-efficient FPGA accelerator: Systolic array with cache-coherent interface for pair-HMM algorithm","authors":"Megumi Ito, Moriyoshi Ohara","doi":"10.1109/CoolChips.2016.7503681","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503681","url":null,"abstract":"A systolic array is known as a parallel hardware architecture applicable to a wide range of applications. Naive implementations, however, can lead to inefficient resource usage and low power performance. In this paper, we discuss two techniques for improving the hardware resource usage: flexible multi-threading and dummy data padding. The design was implemented to accelerate a pair-HMM algorithm on an FPGA with the IBM POWER8 CAPI (Coherent Accelerator Processor Interface) feature. The CAPI feature simplifies the software design for driving the FPGA accelerator. Our experimental result indicates that the implemented FPGA accelerator executing the pair-HMM algorithm achieves 33x higher power performance than a POWER8 processor chip executing the same algorithm.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132850040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MuCCRA4-BB: A fine-grained body biasing capable DRP","authors":"Johannes Maximilian Kühn, Akram Ben Ahmed, Hayate Okuhara, H. Amano, O. Bringmann, W. Rosenstiel","doi":"10.1109/CoolChips.2016.7503676","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503676","url":null,"abstract":"The partitioning, implementation and in-silicon leakage evaluation of MuCCRA4-BB proved the feasibility and validity of fine-grained BB. Furthermore, it demonstrated the superiority over coarse- and chip-grained BB, minimizing FBB leakage penalty and allowing far more RBB usage in all applications and scenarios. As leakage exacerbates in smaller geometries, fine-grained BB might be an answer with sensible overhead.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127681474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASIC design of MUL-RED Radix-2 Pipeline FFT circuit for 802.11ah system","authors":"T. Tran, Soichiro Kanagawa, D. Nguyen, Y. Nakashima","doi":"10.1109/CoolChips.2016.7503678","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503678","url":null,"abstract":"In this paper, we propose a multiplier-reduction (MUL-RED) Radix-2 Pipeline FFT processor for 802.11ah IoT sensors. Utilizing the symmetry of Twiddle factors we show two ideas to reduce the hardware cost and power consumption. Firstly, the complex-multipliers in the last 4 layers are replaced by gain blocks. Secondly, in the remained layers we store only one fourth of the required Twiddle factors to reduce the ROM amount by 4 times. Based on the proposed architecture, we implement the 16-point FFT and 64-point FFT circuits in ASIC CMOS 0.18 μm technology. Area and dynamic power of the proposed 16-point FFT are respectively reduced by 28% and 10% as compared to those of the conventional Radix-2 Pipeline one. Whereas, the proposed 64-point FFT costs 0.34mm2 area and consumes 21.43 mW power at 50 MHz, which are much smaller than the other works. In addition, to find out the optimal number of fractional bit width of FFT's data flow, we conduct BER/PER simulation and show the results in the paper.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"45 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128732839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Panel discussions “computing and communication evolution for IoT innovations”","authors":"H. Nishi","doi":"10.1109/CoolChips.2016.7503673","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503673","url":null,"abstract":"“Internet of Things” connects everything to provide new services. These new services confront severe requirements represented by the fact that systems handle the real world. Moreover, ubiquitous IoT devices generate a tremendous amount of data. This data concentration also makes it difficult to accomplish the requirements. Thus, future computing and communication systems have to face this difficulty. Now, computing and communication evolution is truly desired. This panel session discusses what kinds of computing and communication evolution are desired and expected and how to achieve them.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130999183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Powering-off DRAM with aggressive page-out to storage-class memory in low power virtual memory system","authors":"Yusuke Shirota, S. Yoshimura, S. Shirai, Tatsunori Kanai","doi":"10.1109/CoolChips.2016.7503675","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503675","url":null,"abstract":"With the rapidly growing demands for large capacity main memory in server systems and embedded systems, current DRAM-only approach is hitting the limit due to DRAM's capacity scaling issue and significant background power. With the emergence of new non-volatile memories, or storage-class memories (SCMs), we can now explore low power, high capacity memory subsystem by redesigning virtual memory system to be SCM-aware. Most research on virtual memory system design has focused on minimizing page fault frequency due to slow data transfers using storage such as HDD/SSD as virtual memory swap device. However with an SCM-based swap device, its near-DRAM access latency has potential for reducing requisite DRAM size by aggressively evicting pages from DRAM to SCM without sacrificing performance, and thus reducing background power by powering off the freed DRAM space for low power. To select an optimal SCM from among the many candidate SCM technologies, the impact of SCM characteristics was evaluated using full-system simulation. Results show that utilizing SCM with low access latency and low write energy can lead to significant potential reduction of memory subsystem energy by up to 83%, while maintaining performance degradation within acceptable range.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126969147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 1.1 mW 32-thread artificial intelligence processor with 3-level transposition table and on-chip PVT compensation for autonomous mobile robots","authors":"Youchang Kim, Dongjoo Shin, Jinsu Lee, H. Yoo","doi":"10.1109/CoolChips.2016.7503671","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503671","url":null,"abstract":"An ultra-low-power multi-threaded artificial intelligence processor (AIP) is proposed for real-time autonomous navigation of mobile robots. To achieve real-time operation under low power consumption, the proposed AIP adopts 3 key features: 1) an 8-thread tree search processor (TSP) for real-time path planning, 2) a 3-level transposition table cache (TT$) for the reduction of duplicated computations, and 3) an on-chip PVT compensation circuit (PVTC) for energy-efficient operation at near-threshold supply voltage. As a result, it achieves 470,000 state/s search speed and 79 nJ/search energy consumption which are 9.4× and 11× better than the general-purpose CPUs currently used in recent mobile robots. In addition, the AIP is successfully applied to the robots for autonomous navigation without any collision in dynamic environments.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123018878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An energy-efficient parallel multi-core ADAS processor with robust visual attention and workload-prediction DVFS for real-time HD stereo stream","authors":"K. Lee, Kyeongryeol Bong, Changhyeon Kim, Junyoung Park, H. Yoo","doi":"10.1109/CoolChips.2016.7503672","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503672","url":null,"abstract":"A heterogeneous multicore processor is proposed to accelerate advanced driver assistance system (ADAS). To enable a real-time operation of ADAS functions with 720p stereo video stream, multiple granularity parallel SIMD/MIMD architecture is proposed with precise visual attention and high throughput network-on-chip to reduce computation cost and network congestion, respectively. In addition, it employs a data resource management processor to control workload-prediction dynamic voltage and frequency scaling to reduce power consumption. As a result, the proposed SoC achieves 862GOPS/W energy efficiency and 31.4GOPS/mm2 area efficiency, which are 53% and 75% improvement over the state-of-the-art ADAS processor, respectively.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133506537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How SIMD width affects energy efficiency: A case study on sorting","authors":"H. Inoue","doi":"10.1109/CoolChips.2016.7503679","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503679","url":null,"abstract":"This paper studies the performance and energy efficiency of in-memory sorting algorithms. We put emphasis on the SIMD (single instruction multiple data) mergesort implemented with different SIMD widths. By evaluating the performance, power, and energy with various hardware configurations (achieved by changing the memory bandwidth, number of cores, and processor frequency), our results show that SIMD can reduce power in addition to enhancing the performance, especially when the memory bandwidth is not sufficient to fully drive the cores. We also show that balancing the computation power and the memory bandwidth is important to minimize the total energy consumption.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127431291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cache partitioning mechanism to protect shared data for CMPs","authors":"Masayuki Sato, Shin Nishimura, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.1109/CoolChips.2016.7503674","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503674","url":null,"abstract":"The last-level cache (LLC) of a modern chip-multiprocessor (CMP) keeps two kinds of data: shared data accessed by multiple cores and private data accessed by only one core. Although the former are likely to have a larger performance impact than the latter, the LLC manages both of those data in the same fashion. To realize a highly efficient execution on a CMP, this paper proposes a cache partitioning mechanism to protect shared data from excessive eviction. The evaluation results show that the proposed mechanism improves the performance by up to 76% and by 8% on average at a cost of less than 2% of the LLC hardware.","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114985586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"193 MOPS/mW @ 162 MOPS, 0.32V to 1.15V voltage range multi-core accelerator for energy efficient parallel and sequential digital processing","authors":"D. Rossi, A. Pullini, Igor Loi, Michael Gautschi, Frank K. Gürkaynak, A. Teman, J. Constantin, A. Burg, I. Panades, E. Beigné, F. Clermidy, F. Abouzeid, P. Flatresse, L. Benini","doi":"10.1109/CoolChips.2016.7503670","DOIUrl":"https://doi.org/10.1109/CoolChips.2016.7503670","url":null,"abstract":"Low power (mW) and high performance (GOPS) are strong requirements for compute-intensive signal processing in E-health, Internet-of-Things, and wearable applications. This work presents a building block for programmable Ultra-Low Power accelerators, namely a tightly-coupled computing cluster that supports parallel and sequential execution at high energy efficiency over a wide range of workload requirements. The cluster, implemented in 28nm UTBB FD-SOI technology, achieves peak energy efficiency in the near-threshold (NVT) operating region: 193 MOPS/mW at 162 MOPS for parallel workloads, and 90 MOPS/mW at 68 MOPS for sequential workloads at 0.46V and 0.5V, respectively. The energy efficient operating range is wide (0.32V to 1.15V), also meeting the design goal of 1 GOPS within a 10 mW power envelope (at 0.66V).","PeriodicalId":273992,"journal":{"name":"2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129962729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}