{"title":"Toward Low-Bit Neural Network Training Accelerator by Dynamic Group Accumulation","authors":"Yixiong Yang, Ruoyang Liu, Wenyu Sun, Jinshan Yue, Huazhong Yang, Yongpan Liu","doi":"10.1109/ASP-DAC52403.2022.9712505","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712505","url":null,"abstract":"Low-bit quantization is a big challenge for neural network training. Conventional training hardware adopts FP32 to accumulate the partial-sum result, which seriously degrades energy efficiency. In this paper, a technology called dynamic group accumulation (DGA) is proposed to reduce the accumulation error. First, we model the proposed group accumulation method and give the optimal DGA algorithm. Second, we design a training architecture and implement a hardware-efficient DGA unit. Third, we make a comprehensive analysis of the DGA algorithm and training architecture. The proposed method is evaluated on CIFAR and ImageNet datasets, and results show that DGA can reduce accumulation bit-width by 6 bits while achieving the same precision as the static group method. With the FP12 DGA, the CNN algorithm only loses 0.11% accuracy in ImageNet training, and our architecture saves 32% of power consumption compared to the FP32 baseline.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116674743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Reconfigurable Inference Processor for Recurrent Neural Networks Based on Programmable Data Format in a Resource-Limited FPGA","authors":"Jiho Kim, Kwoanyoung Park, Tae-Hwan Kim","doi":"10.1109/ASP-DAC52403.2022.9712510","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712510","url":null,"abstract":"An efficient inference processor for recurrent neural networks is designed and implemented in an FPGA. The proposed processor is designed to be reconfigurable for various models and perform every vector operation consistently utilizing a single array of multiply-accumulate units with the aim of achieving a high resource efficiency. The data format is programmable per operand. The resource and energy efficiency are 1.89MOP/LUT and 263.95GOP/J, respectively, in Intel Cyclone-V FPGA. The functionality has been verified successfully under a fully-integrated inference system.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122101053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 76–81 GHz FMCW 2TX/3RX Radar Transceiver with Integrated Mixed-Mode PLL and Series-Fed Patch Antenna Array","authors":"Taikun Ma, W. Deng, Haikun Jia, Yejun He, B. Chi","doi":"10.1109/ASP-DAC52403.2022.9712506","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712506","url":null,"abstract":"This paper presented a 76–81 GHz FMCW MIMO Radar transceiver with mixed-mode PLL. Utilizing series-fed patch antenna array, a prototype system is developed based on the proposed transceiver. On-chip Measurements show that reconfigurable sawtooth chirps could be generated with a bandwidth up to 4 GHz and a period as short as 30 ${mu s}$. Real-time experiments demonstrate that the prototype MIMO radar has the ability of target detection and achieves an angular resolution of 9°","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"440 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125777763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuit and System Technologies for Energy-Efficient Edge Robotics: (Invited Paper)","authors":"Zishen Wan, A. Lele, A. Raychowdhury","doi":"10.1109/asp-dac52403.2022.9712531","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712531","url":null,"abstract":"As we march towards the age of ubiquitous intelligence, we note that AI and intelligence are progressively moving from the cloud to the edge. The success of Edge-AI is pivoted on innovative circuits and hardware that can enable inference and limited learning in resource-constrained edge autonomous systems. This paper introduces a series of ultra-low-power accelerator and system designs on enabling the intelligence in edge robotic platforms, including reinforcement learning neuro-morphic control, swarm intelligence, and simultaneous mapping and localization. We put an emphasis on the impact of the mixed-signal circuit, neuro-inspired computing system, benchmarking and software infrastructure, as well as algorithm-hardware co-design to realize the most energy-efficient Edge-AI ASICs for the next-generation intelligent and autonomous systems.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124751815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HybridGP: Global Placement for Hybrid-Row-Height Designs*","authors":"Kuan-Yu Chen, Hsiu-Chu Hsu, Wai-Kei Mak, Ting-Chi Wang","doi":"10.1109/ASP-DAC52403.2022.9712565","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712565","url":null,"abstract":"Conventional global placement algorithms typically assume that all cell rows in a design have the same height. Nevertheless, a design making use of standard cells with short-row height, tall-row height, and double-row (short plus tall) height can provide a better sweet spot for performance and area co-optimization in advanced nodes. In this paper, we assume for a hybrid-row-height design, its placement region is composed of both tall rows and short rows, and a cell library containing multiple versions of each cell in the design is provided. We present a new analytical global placer, HybridGP, for such hybrid-row-height designs. Furthermore, we assume that a subset of cells with sufficient timing slacks is given so that we may change their versions without overall timing degradation if desired. Our approach considers the usage of short-row and tall-row resources and exploits the flexibility of cell version change to facilitate the subsequent legalization stage. Augmented with an identical legalizer for final placement legalization, we compared HybridGP with a conventional global placer. The experimental results show that legalized placement solutions of much better quality can be obtained in less run time with HybridGP.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125138760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generative-Adversarial-Network-Guided Well-Aware Placement for Analog Circuits","authors":"Keren Zhu, Hao Chen, Mingjie Liu, Xiyuan Tang, Wei Shi, Nan Sun, D. Pan","doi":"10.1109/asp-dac52403.2022.9712592","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712592","url":null,"abstract":"Generating wells for transistors is an essential challenge in analog circuit layout synthesis. While it is closely related to analog placement, very little research has explicitly considered well generation within the placement process. In this work, we propose a new analytical well-aware analog placer. It uses a generative adversarial network (GAN) for generating wells and guides the placement process. A global placement algorithm spreads the modules given the GAN guidance and optimizes for area and wirelength. Well-aware legalization techniques then legalize the global placement results and produce the final placement solutions. By allowing well sharing between transistors and explicitly considering wells in placement, the proposed framework achieves more than 74% improvement in the area and more than 26% reduction in half-perimeter wirelength over existing placement methodologies in experimental results.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127814798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Memory Architecture Accommodating Processing-in-Memory on SoC for AIoT Applications","authors":"Kangyi Qiu, Yaojun Zhang, Bonan Yan, Ru Huang","doi":"10.1109/asp-dac52403.2022.9712544","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712544","url":null,"abstract":"Processing-In-Memory (PIM) technologies is one of most promising candidates for AIoT applications due to its attractive characteristics, such as low computation latency, large throughput and high power efficiency. However, how to efficiently utilize PIM with System-on-Chip (SoC) architecture has been scarcely discussed. In this paper, we demonstrate a series of solution from hardware architecture to algorithm to maximize the benefits of PIM design. First, we propose a Heterogeneous Memory Architecture (HMA) that facilitates the existing SoC with PIM via high-throughput on-chip buses. Then, based on given HMA structure, we also propose an HMA tensor mapping approach to partition tensors and deploy general matrix multiplication operations on PIM structures. Both HMA hardware and HMA tensor mapping approach harnesses the programmability of the mature embedded CPU solution stack and maximize the high efficiency of PIM technology. The whole HMA system can save 416 x power as well as 44.6% design area compare with the latest accelerator solutions. The evaluation also shows that our design can reduce the operation latency by 430 × and 11 × for TinyML applications, compare with state-of-art baseline and PIM without optimization, respectively.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124334420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supply-Variation-Tolerant Transimpedance Amplifier Using Non-Inverting Amplifier in 180-nm CMOS","authors":"Tomofumi Tsuchida, A. Tsuchiya, Toshiyuki Inoue, K. Kishine","doi":"10.1109/asp-dac52403.2022.9712503","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712503","url":null,"abstract":"This paper presents a supply-variation-tolerant transimpedance amplifier (TIA). For parallel integration of optical transceivers, supply voltage variation is one of the serious problems. We propose a TIA using a non-inverting stage to cancel the supply variation. As a proof of concept, we fabricated the proposed TIA in a 180-nm CMOS. We measured the eye-diagrams with various supply voltages. Measurement results show that the voltage swing and the eye-opening voltage are improved by 105% and 180%, respectively.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120940233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PUMP: Profiling-free Unified Memory Prefetcher for Large DNN Model Support","authors":"Chung-Hsiang Lin, Shaoyu Lin, Yi-Jung Chen, En-Yu Jenp, Chia-Lin Yang","doi":"10.1109/asp-dac52403.2022.9712507","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712507","url":null,"abstract":"Modern DNNs are going deeper and wider to achieve higher accuracy. However, existing deep learning frameworks require the whole DNN model to fit into the GPU memory when training with GPUs, which puts an unwanted limitation on training large models. Utilizing NVIDIA Unified Memory (UM) could inherently support training DNN models beyond GPU memory capacity. However, naively adopting UM would suffer a significant performance penalty due to the delay of data transfer. In this paper, we propose PUMP, a Profiling-free Unified Memory Prefetcher. PUMP exploits GPU asynchronous execution for prefetch; that is, there exists a delay between the time that CPU launches a kernel and the time the kernel executes in GPU. PUMP extracts memory blocks accessed by the kernel when launching and swaps these blocks into GPU memory. Experimental results show PUMP achieves about 2x speedup on the average compared to the baseline that naively enables UM.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115935890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Critical Paths Search Algorithm using Mergeable Heap","authors":"Kexing Zhou, Zizheng Guo, Tsung-Wei Huang, Yibo Lin","doi":"10.1109/ASP-DAC52403.2022.9712566","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712566","url":null,"abstract":"Path searching is a central step in static timing analysis (STA). State-of-the-art algorithms need to generate path deviations for hundreds of thousands of paths, which becomes the runtime bottleneck of STA. Accelerating path searching is a challenging task due to the complex and iterative path generating process. In this work, we propose a novel path searching algorithm that has asymptotically lower runtime complexity than the state-of-the-art. We precompute the path deviations using mergeable heap and apply a group of deviations to a path in near-constant time. We prove our algorithm has a runtime complexity of $O(nlog n+klog k)$ which is asymptotically smaller than the state-of-the-art $O(nk)$. Experimental results show that our algorithm is up to $60times$ faster compared to OpenTimer and $1.8times$ compared to the leading path search algorithm based on suffix forest.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131191140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}