{"title":"A Low-power Neural 3D Rendering Processor with Bio-inspired Visual Perception Core and Hybrid DNN Acceleration","authors":"Donghyeon Han, Junha Ryu, Sangyeob Kim, Sangjin Kim, Jongjun Park, H. Yoo","doi":"10.1109/COOLCHIPS57690.2023.10122036","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10122036","url":null,"abstract":"This paper presents a low-power neural 3D rendering processor which can support both inference (INF) and training of the deep neural network (DNN). The processor is realized with four key features: 1) bio-inspired visual perception core (VPC), 2) neural engines using hybrid sparsity exploitation, 3) dynamic neural network allocation (DNNA) core with centrifugal-sampling (CS), and 4) hierarchical weight memory (HWM) with input-channel (iCh) pre-fetcher. Thanks to the VPC and the proposed DNN acceleration architecture, it can improve throughput by 4174x and demonstrates >30 FPS rendering while consuming 133 mW power.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123127139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote and Invited Speakers Biography","authors":"","doi":"10.1109/coolchips57690.2023.10122034","DOIUrl":"https://doi.org/10.1109/coolchips57690.2023.10122034","url":null,"abstract":"","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126721901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lookup Table Modular Reduction: A Low-Latency Modular Reduction for Fast ECC Processor","authors":"Anawin Opasatian, M. Ikeda","doi":"10.1109/COOLCHIPS57690.2023.10122002","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10122002","url":null,"abstract":"Modular multiplication is used extensively in many cryptosystems, such as in Elliptic Curve Cryptography (ECC). This is why the speed of the modular multiplication has a high impact on the overall speed of the cryptography computation. Recent works utilizing a lookup table for inferring value have shown a promising way for fast computation of modular reduction, which can be used to construct a much faster modular multiplier than the conventional methods on FPGA. In this work, we explore an alternative way to implement the said technique, which we will call Lookup Table Modular Reduction (LUTMR). We show that in this technique, the modulo value used for generating the modular reduction circuit has a high impact on the generated circuit efficiency. With the LUTMR technique, three modular multipliers for curve Secp256k1, NIST-P384, and BLS12-381 are implemented on FPGA and are shown to be the fastest compared to recent works. The NIST-P384 ECC processor is also implemented with the designed modular multiplier. It can compute the scalar multiplication in $75.08\\ \\mu\\mathrm{s}$, the fastest and lowest in Time-Area criteria among recent works.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129294872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Emulation of Through-Silicon-Via (TSV) Dataflow Network for 3D Standard Chip Stacking System","authors":"Takeshi Ohkawa, M. Aoyagi","doi":"10.1109/COOLCHIPS57690.2023.10122025","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10122025","url":null,"abstract":"Through-Silicon-Via (TSV) is expected to realize high-performance, low-power, and low-cost 3D-LSI (Large Scale Integration) systems. It is realized by integrating pre-manufactured chips with a 3D Standard Chip Stacking System (3D-SCSS) through a standard bus TSV connection. However, it is difficult to define a standard chip connection mechanism. This paper proposes an FPGA emulation of the TSV dataflow network for evaluating the performance of 3D-SCSS. To emulate 3D-SCSS, multiple-clock domains are assumed to overcome the problem of jitter in the global clock, which is a separated clock domain model. Simple dataflow experiments are done where processes are deployed to different chips and communicate among the chips in the 3D-SCSS. The evaluation shows that the emulation method is suitable to measure the latency performance of the proposed TSV dataflow network. (Keywords: 3D-LSI, TSV, FPGA, Emulation, Dataflow, 3D-SCSS)","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125642859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexibly Controllable Dynamic Cooling Methods for Solid-State Annealing Processors to Improve Combinatorial Optimization Performance","authors":"Genta Inoue, Daiki Okonogi, Thiem Van Chu, Jaehoon Yu, Masato Motomura, Kazushi Kawamura","doi":"10.1109/COOLCHIPS57690.2023.10121990","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10121990","url":null,"abstract":"A recently proposed dynamic cooling method enables automatic pseudo-temperature control in the computing process on solid-state annealing processors. Though it may be a practical approach to improve the optimization performance, its effectiveness has been verified only on one annealing policy. On the other hand, another work has claimed that annealing computation can speed up by adaptively utilizing multiple policies. In this paper, we propose a flexibly controllable dynamic cooling method effective for various policies, followed by a method to reduce the sampling frequency on an annealing system. Simulation results have demonstrated that our approach works well for several policies and can be introduced into annealing processors efficiently.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126569013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low power implementation of Geometric High-order Decorrelation-based Source Separation on an FPGA board","authors":"Ziquan Qin, Kaijie Wei, H. Amano, K. Nakadai","doi":"10.1109/COOLCHIPS57690.2023.10121954","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10121954","url":null,"abstract":"Open source software for robot audition called HARK aims to be the “OpenCV” of audio signal processing, providing comprehensive functions from multichannel audio input to sound localization, sound source separation, and automatic speech recognition. Since each of these HARK modules takes considerable energy when executed on a PC, we propose to implement each module on an FPGA board called M-KUBOS connected. Here, we focus on the most computationally expensive function of HARK, the sound source separation, and implement it on a Zynq Ultrascale+ board. A performance improvement of more than 2x was achieved by using sound-frequency-level parallelization in the HLS description compared to the software execution on the Ryzen 3990X 64-core server. Power evaluation of the real board showed that the energy consumption is only 1/23.4 of that of the server.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132644594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Real-Time Keyword Spotting System Based on an End-To-End Binary Convolutional Neural Network in FPGA","authors":"Jinsung Yoon, Dong-Hwi Lee, Neungyun Kim, Su-Jung Lee, Gil-Ho Kwak, Tae-Hwan Kim","doi":"10.1109/COOLCHIPS57690.2023.10121981","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10121981","url":null,"abstract":"This paper presents a real-time keyword spotting (KWS) system in an FPGA. The proposed system performs the entire KWS task based on a binary convolutional neural network (BCNN) without involving any other complicated processing. The BCNN inference is efficiently carried out by skipping redundant operations. With all the essential components integrated, the proposed system has been implemented with only 8475 look-up tables in an FPGA. The proposed system processes a one-second frame in 19.8 ms, exhibiting a spotting accuracy of 91.64%.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133463519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MazeCov-Q: An Efficient Maze-Based Reinforcement Learning Accelerator for Coverage","authors":"Infall Syafalni, Mohamad Imam Firdaus, A. M. R. Ilmy, N. Sutisna, T. Adiono","doi":"10.1109/COOLCHIPS57690.2023.10122120","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10122120","url":null,"abstract":"Reinforcement learning (RL) is an unsupervised machine learning that does not require pre-assigned labeled data to learn. It is implemented in many areas such as robotics, games, finances, health, transportation, and energy applications. In this paper, we present an application of reinforcement learning accelerator for finding coverage area and its implementation in a mobile robot called MazeCov-Q (Maze-Based Coverage Q-Learning). We define a novel state that is divided into two conditions. The conditions are directions and visit counters for the Q-value calculation. The experimental results show that our MazeCov-Q achieves more than 74% path efficiency on average. Moreover, our coverage-based Q-learning accelerator (MazeCov-Q) achieves 48.3 Mps and 169.05 Mps for 50 MHz Pynq Z1 and 175 MHz ZCU104 boards, respectively. This research is useful for surveillance, resource allocation, environmental monitoring, and autonomous navigation.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126025493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 2.41-μW/MHz, 437-PE/mm2 CGRA in 22 nm FD-SOI With RISC-Like Code Generation","authors":"Tobias Kaiser, F. Gerfers","doi":"10.1109/COOLCHIPS57690.2023.10121985","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10121985","url":null,"abstract":"While coarse-grained reconfigurable arrays (CGRAs) have the potential to improve energy efficiency in general-purpose computing beyond the limitations of von Neumann architectures, they suffer from challenges in code generation. Pasithea-1 is a CGRA architecture that aims to combine high energy efficiency with RISC-like programmability. This paper presents its first silicon prototype and a C compiler that uses conventional CPU compiler techniques. Compared to code generation for traditional CGRAs, which require expensive place and route steps, this method of code generation reduces compile times and compiler complexity significantly. Performance and power were measured for a set of benchmark programs written in C. On average, energy efficiency of 195.1 int32 MIPS/mW and active power of 2.41 μW/MHz were achieved. Peak energy efficiency of 558.2 MIPS/mW and peak performance of 97.5 MIPS were measured. Load/store instructions and instruction transfers are identified as critical factors for energy efficiency in Pasithea. In comparison to an MCU with state-of-the-art energy efficiency, Pasithea achieves higher energy efficiency in four of the benchmarked programs. Switched capacitance per benchmark run was reduced by a factor of approximately 1.4, on average. Its 0.75 mm² core area and fabric density of 437 PEs/mm² enable use in cost-sensitive applications and permit further upscaling.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114449203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cachet: A High-Performance Joint-Subtree Integrity Verification for Secure Non-Volatile Memory","authors":"Tatsuya Kubo, Shinya Takamaeda-Yamazaki","doi":"10.1109/COOLCHIPS57690.2023.10122117","DOIUrl":"https://doi.org/10.1109/COOLCHIPS57690.2023.10122117","url":null,"abstract":"Data confidentiality, integrity, and persistence are essential in secure non-volatile memory (NVM) systems. However, the cost of persisting all affected security metadata is high and leads to non-negligible overheads, including performance degradation, memory lifetime reduction, and high energy consumption. This is because integrity trees, which are typically used for data authentication of NVMs, require additional cryptographic calculations and memory accesses to persist the metadata for the recovery. In this paper, we propose Cachet, a novel integrity verification scheme that leverages set hash functions to achieve high performance and crash consistency. Specifically, Cachet maintains two set hash values representing the metadata cache state to enable the lazy update of the integrity tree in a joint-subtree manner with minimal overheads. The observation that underlies Cachet is that regarding the metadata cache, the integrity of each cached node is never verified individually, and the recovery process requires just the digest of the cached metadata. Our evaluation results show that Cachet reduces the application execution time by 21%, NVM writes by 30%, and hash calculations by 36% compared to the state-of-the-art solutions.","PeriodicalId":387793,"journal":{"name":"2023 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125552997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}