{"title":"Solving Least-Squares Fitting in $O(1)$ Using RRAM-based Computing-in-Memory Technique","authors":"Xiaoming Chen, Yinhe Han","doi":"10.1109/asp-dac52403.2022.9712568","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712568","url":null,"abstract":"Least-squares fitting (LSF) is a fundamental statistical method that is widely used in linear regression problems, such as modeling, data fitting, predictive analysis, etc. For large-scale data sets, LSF is computationally complex and poorly scaled due to the $O(N^{2})-O(N^{3})$ computational complexity. The computing-in-memory technique has potential to improve the performance and scalability of LSF. In this paper, we propose a computing-in-memory accelerator based on resistive random-access memory (RRAM) devices. We not only utilize the conventional idea of accelerating matrix-vector multiplications by RRAM-based crossbar arrays, but also elaborate the hardware and the mapping strategy. Our approach has a unique feature that it can finish a complete LSF problem in $O(1)$ time complexity. We also propose a scalable and configurable architecture such that the problem scale that can be solved is not restricted by the crossbar array size. Experimental results have demonstrated the superior performance and energy efficiency of our accelerator.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131837250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HACScale: Hardware-Aware Compound Scaling for Resource-Efficient DNNs","authors":"Hao Kong, Di Liu, Xiangzhong Luo, Weichen Liu, Ravi Subramaniam","doi":"10.1109/ASP-DAC52403.2022.9712593","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712593","url":null,"abstract":"Model scaling is an effective way to improve the accuracy of deep neural networks (DNNs) by increasing the model capacity. However, existing approaches seldom consider the underlying hardware, causing inefficient utilization of hardware resources and consequently high inference latency. In this paper, we propose HACScale, a hardware-aware model scaling strategy to fully exploit hardware resources for higher accuracy. In HACScale, different dimensions of DNNs are jointly scaled with consideration of their contributions to hardware utilization and accuracy. To improve the efficiency of width scaling, we introduce importance-aware width scaling in HACScale, which computes the importance of each layer to the accuracy and scales each layer accordingly to optimize the trade-off between accuracy and model parameters. Experiments show that HACScale improves the hardware utilization by 1.92× on ImageNet, as a result, it achieves 2.41% accuracy improvement with a negligible latency increase of 0.6%. On CIFAR-10, HACScale improves the accuracy by 2.23% with only 6.5% latency growth.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133955665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal Data Allocation for Graph Processing in Processing-in-Memory Systems","authors":"Zerun Li, Xiaoming Chen, Yinhe Han","doi":"10.1109/asp-dac52403.2022.9712587","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712587","url":null,"abstract":"Graph processing involves lots of irregular memory accesses and increases demands on high memory bandwidth, making it difficult to execute efficiently on compute-centric architectures. Dedicated graph processing accelerators based on the processing-in-memory (PIM) technique have recently been proposed. Although they achieve higher performance and energy efficiency than conventional architectures, the data allocation problem for communication minimization in PIM systems (e.g., hybrid memory cubes (HMCs)) has still not been well solved. In this paper, we demonstrate that the conventional “graph data allocation = graph partitioning” assumption is not true, and the memory access patterns of graph algorithms should also be taken into account when partitioning graph data for communication minimization. For this purpose, we classify graph algorithms into two representative classes from a memory access pattern point of view and propose different graph data partitioning strategies for them. We then propose two algorithms to optimize the partition-to-HMC mapping to minimize the inter-HMC communication. Evaluations have proved the superiority of our data allocation framework and the data movement energy efficiency is improved by 4.2-5× on average over the state-of-the-art GraphP approach.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115101995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 40nm CMOS SoC for Real-Time Dysarthric Voice Conversion of Stroke Patients","authors":"Tay-Jyi Lin, Chen-Zong Liao, You-Jia Hu, Wei-Cheng Hsu, Zheng-Xian Wu, Shao-Yu Wang, Chun-Ming Huang, Ying-Hui Lai, C. Yeh, Jinn-Shyan Wang","doi":"10.1109/ASP-DAC52403.2022.9712584","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712584","url":null,"abstract":"This paper presents the first dysarthric voice conversion SoC, which can translate stroke patients' voice into more intelligible and clearer speech in real time. The SoC is composed of a RISC-V MPU and a compact DNN engine with a single 16-bit multiply-accumulator, which improves 12x performance and > 100x energy efficiency, and has been implemented in 40nm CMOS. The silicon area is 0.68×0.79mm2, and the measured power is 18.4mW for converting 3-sec dysarthric voice within 0.5 sec (at 200MHz and 0.8V) and 4.8mW for conversion < 1 sec (at 100MHz and 0.6V).","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123855118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pearl: Towards Optimization of DNN-accelerators Via Closed-Form Analytical Representation","authors":"Arko Dutt, Suprojit Nandy, Mays Sabry","doi":"10.1109/ASP-DAC52403.2022.9712598","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712598","url":null,"abstract":"Hardware accelerators for deep learning are proliferating, owing to their high-speed and energy-efficient execution of deep neural network (DNN) workloads. Ensuring an efficient DNN accelerator design requires a vast design-space exploration of a large number of parameters. However, current exploration frameworks are limited by slow architectural simulations, which limit the number of design points to be examined. To address this challenge, in this paper we introduce Pearl, an analytical representation of executing the DNN inference, mapped to an accelerator. Pearl provides immediate estimates of performance and energy of DNN accelerators, where we incorporate new parameters to capture dataflow mapping schemes beneficial for DNN systems. We model equations that represent utilization rates of the compute fabric for different dataflow mappings. We validate the accuracy of our equations against a state-of-the-art cycle-accurate DNN hardware simulator. Results show that Pearl achieves $< 1.0\\%$ and $< 1.3\\%$ average error in performance and energy estimates, respectively, while achieving $> 1.2\\cdot 10^{7}\\times$ simulation speedup. Pearl shows higher average accuracy than existing analytical models that support the simulator. We also leverage Pearl to explore and optimize area-constrained DNN accelerators targeting inference on full HD resolution.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116163664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"This is SPATEM! A Spatial-Temporal Optimization Framework for Efficient Inference on ReRAM-based CNN Accelerator","authors":"Yen-Ting Tsou, Kuan-Hsun Chen, Chia-Lin Yang, Hsiang-Yun Cheng, Jian-Jia Chen, Der-Yu Tsai","doi":"10.1109/ASP-DAC52403.2022.9712536","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712536","url":null,"abstract":"Resistive memory-based computing-in-memory (CIM) has been considered as a promising solution to accelerate convolutional neural networks (CNN) inference, which stores the weights in crossbar memory arrays and performs in-situ matrix-vector multiplications (MVMs) in an analog manner. Several techniques assume that a whole crossbar can operate concurrently and discuss how to efficiently map the weights onto crossbar arrays. However, in practice, the accumulated effect of per-cell current deviation and Analog-to-Digital-Converter overhead may greatly degrade inference accuracy, which motivates the concept of Operation Unit (OU), by which an operation per cycle in a crossbar only involves a limited number of wordlines and bitlines to preserve satisfactory inference accuracy. With OU-based operations, the mapping of weights and scheduling strategy for parallelizing CNN convolution operations should take the cost of communication overhead and resource utilization into consideration to optimize the inference acceleration. In this work, we propose the first optimization framework named SPATEM, that efficiently executes MVMs with OU-based operations on ReRAM-based CIM accelerators. It decouples the design space into tractable steps, models the expected inference latency, and derives an optimized spatial-temporal-aware scheduling strategy. Compared with state-of-the-art approaches, experimental results show that the derived scheduling strategy of SPATEM achieves on average 29.24% inference latency reduction with 31.28% less communication overhead by exploiting more originally unused crossbar cells.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116603534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thermal-Aware Layout Optimization and Mapping Methods for Resistive Neuromorphic Engines","authors":"Chengrui Zhang, Yu Ma, Pingqiang Zhou","doi":"10.1109/asp-dac52403.2022.9712596","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712596","url":null,"abstract":"Resistive neuromorphic engines can accelerate spiking neural network tasks with memristor crossbars. However, the stored weight is influenced by the temperature, which leads to accuracy and endurance degradation. The higher the temperature is, the larger the influence is. In this work, we propose a cross-array mapping method and a layout optimization method to reduce the thermal effect with the consideration of input distribution, weight value and layout of memristor crossbars. Experimental results show that our method reduces the peak temperature up to 10.4K and improves the endurance up to 1.72×.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124885121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPRoute 2.0: A detailed-routability-driven deterministic parallel global router with soft capacity","authors":"Jiayuan He, U. Agarwal, Yihang Yang, R. Manohar, K. Pingali","doi":"10.1109/ASP-DAC52403.2022.9712557","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712557","url":null,"abstract":"Global routing has become more challenging due to advancements in the technology node and the ever-increasing size of chips. Global routing needs to generate routing guides such that (1) routability of detailed routing is considered and (2) the routing is deterministic and fast. In this paper, we first introduce soft capacity which reserves routing space for detailed routing based on the pin density and Rectangular Uniform wire Density (RUDY). Second, we propose a deterministic parallelization approach that partitions the netlist into batches and then bulk-synchronously maze-routes a single batch of nets. The advantage of this approach is that it guarantees determinacy without requiring the nets running in parallel to be disjoint, thus guaranteeing scalability. We then design a scheduler that mitigates the load imbalance and livelock issues in this bulk synchronous execution model. We implement SPRoute 2.0 with the proposed methodology. The experimental results show that SPRoute 2.0 generates good quality of results with 43% fewer shorts, 14% fewer DRCs and a 7.4X speedup over a state-of-the-art global router on the ICCAD2019 contest benchmarks.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124713471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerate SAT-based ATPG via Preprocessing and New Conflict Management Heuristics","authors":"Jun Huang, Hui-Ling Zhen, Naixing Wang, Mingxuan Yuan, Hui Mao, Yu Huang, Jiping Tao","doi":"10.1109/ASP-DAC52403.2022.9712573","DOIUrl":"https://doi.org/10.1109/ASP-DAC52403.2022.9712573","url":null,"abstract":"Due to the continuous advancement of semiconductor technologies, there are more defects than ever widely distributed in manufactured chips. In order to meet the high product quality and low defective-parts-per-million (DPPM) goals, Boolean Satisfiability (SAT) technique has been shown to be a robust alternative to conventional ATPG techniques, especially for hard-to-detect faults. However, the SAT-based ATPG still confronts two challenges. The first one is to reduce extra computational overhead of SAT modeling, i.e. to transform a circuit testing problem to a Conjunctive Normal Form (CNF) which is the foundation of modern SAT solvers. The second one lies in the SAT solver's efficiency which is brought by the loss of structural information during CNF transformation. In this work, we propose a new SAT-based ATPG approach to address the two challenges mentioned above: (1) To reduce CNF transformation overhead, we utilize a simulation-driven pre-processing for narrowing down the fault propagation and activation logic cones, leading to an improvement in CNF transformation and reduction in runtime. (2) To further improve the solving efficiency, we propose new ranking-based heuristics to build a more effective conflict database, enabling direct solving for small-scale instances and a look-ahead method for large-scale ones. Extensive experimental results on industrial circuits demonstrate that on average the proposed approach could cover 89.67% of the faults failed by a commercial ATPG tool with a comparable runtime.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121826888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Large Scale Finite Element Computing on to Wafer-Scale Engines","authors":"Yishuang Lin, Rongjian Liang, Yaguang Li, Hailiang Hu, Jiang Hu","doi":"10.1109/asp-dac52403.2022.9712538","DOIUrl":"https://doi.org/10.1109/asp-dac52403.2022.9712538","url":null,"abstract":"The finite element method has wide applications and often presents a computing challenge due to huge problem sizes and slow convergence rate. A leading-edge computing acceleration approach is to leverage wafer-scale engine, which contains more than 800K processing elements. The effectiveness of this approach heavily depends on how to map a finite element computing task onto such enormous hardware space. A mapping method is introduced to partition an object space into computing kernels, which are further placed onto processing elements. This method achieves the best overall result in terms of computing accuracy and communication cost among all the ISPD 2021 contest participants.","PeriodicalId":239260,"journal":{"name":"2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115820489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}