Title: An Agile Precision-Tunable CNN Accelerator based on ReRAM
Authors: Yintao He, Ying Wang, Yongchen Wang, Huawei Li, Xiaowei Li
DOI: 10.1109/iccad45719.2019.8942163
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: Precision tuning is a popular approximate-computing approach that trades excess computation exactness for power and efficiency gains. In particular, it has proven useful for reducing the computation and memory overhead of deep neural networks in embedded and IoT settings. However, the hardware switching overhead of precision tuning severely limits its applicability and its effectiveness at saving energy by reacting quickly to changes in environment, user constraints, or input quality. This work is the first to investigate the feasibility of agile, cost-free precision tuning for neural network accelerators that benefit from approximate computing. The proposed processing-in-memory (PIM) CNN accelerators fully exploit the normally-off characteristics of memristor crossbars to achieve instant network precision tuning without a model-reloading penalty. With the proposed neural parameter mapping policy and a novel mixed-model training method, the ReRAM-based accelerator incurs negligible precision-switching latency and power consumption compared with traditional variable-precision accelerators. The mixed-model training unifies neural models of different precisions in a single ReRAM array without compromising accuracy, and the accelerator saves 58.3%-62.47% of area compared with conventional designs that must program multiple independent models into ReRAM arrays for precision tuning.
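The abstract does not spell out the parameter mapping policy. As a generic illustration of the underlying idea (a single stored high-precision model whose most significant bit-slices double as the low-precision model), here is a minimal Python sketch; the bit-slice scheme is a hypothetical stand-in, not the paper's actual policy:

```python
import numpy as np

def quantize(w, bits):
    """Uniformly quantize weights in [-1, 1) to signed `bits`-bit integers."""
    scale = 2 ** (bits - 1)
    return np.clip(np.round(w * scale), -scale, scale - 1).astype(int)

def bit_slices(q, bits):
    """Split offset-coded weights into per-bit 0/1 slices, MSB first, as they
    might be mapped onto separate crossbar cells (hypothetical mapping)."""
    offset = q + 2 ** (bits - 1)              # offset-binary: all values >= 0
    return [(offset >> (bits - 1 - i)) & 1 for i in range(bits)]

def reconstruct(slices, keep, bits):
    """Rebuild a weight from only the `keep` most significant slices."""
    acc = sum(s << (bits - 1 - i) for i, s in enumerate(slices[:keep]))
    return acc - 2 ** (bits - 1)              # back to signed range

w = np.array([0.5, -0.25, 0.8125])
q = quantize(w, 8)
slices = bit_slices(q, 8)
full = reconstruct(slices, 8, 8)   # all slices: exact quantized weights
low = reconstruct(slices, 4, 8)    # top 4 slices only: coarser approximation
```

Reading fewer slices trades accuracy for energy without reprogramming the array, which is the flavor of agility the abstract claims.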
Title: Embedding Binary Perceptrons in FPGA to improve Area, Power and Performance
Authors: Ankit Wagle, E. Azari, S. Vrudhula
DOI: 10.1109/iccad45719.2019.8942071
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: For the flexibility of implementing any given Boolean function, FPGAs use reconfigurable building blocks called LUTs. The price of this reconfigurability is the large number of registers and multiplexers required to construct the FPGA. While researchers have worked on complex LUT structures to reduce area and power for several years, most of these implementations come at the cost of a performance penalty. This paper demonstrates simultaneous improvements in area, power, and performance in an FPGA by using special logic cells called Threshold Logic Cells (TLCs), also known as binary perceptrons. A TLC can implement a complex threshold function that, if built from conventional gates, would require several levels of logic. TLCs require only 7 SRAM cells and are significantly faster than conventional LUTs. The proposed FPGA architecture was implemented with 28nm FDSOI standard cells and evaluated on ISCAS-85, ISCAS-89, and several large industrial designs. Experiments show average reductions of 18.1% in configuration registers, 18.1% in multiplexer count, 12.3% in Basic Logic Element (BLE) area, and 16.3% in BLE power, along with a 5.9% improvement in operating frequency and slight reductions in track count, routing area, and routing power. The improvements are also demonstrated on a physically designed version of the architecture.
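A threshold function of the kind a TLC implements is just a weighted-sum comparison. A functional sketch in Python (behavioral only, not the 7-SRAM-cell circuit):

```python
def threshold_gate(inputs, weights, T):
    """Binary perceptron: output 1 iff the weighted sum of 0/1 inputs reaches T."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= T)

# 3-input majority, a classic threshold function: with unit weights and
# threshold 2 it fires when at least two inputs are 1. Built from conventional
# gates, the same function needs two levels of AND/OR logic.
def maj(a, b, c):
    return threshold_gate([a, b, c], [1, 1, 1], 2)
```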
Title: elfPlace: Electrostatics-based Placement for Large-Scale Heterogeneous FPGAs
Authors: Wuxi Li, Yibo Lin, D. Pan
DOI: 10.1109/iccad45719.2019.8942075
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: elfPlace is a flat nonlinear placement algorithm for large-scale heterogeneous field-programmable gate arrays (FPGAs). We adopt the analogy between placement and electrostatic systems first proposed by ePlace and extend it to handle the heterogeneous blocks in FPGA designs. To achieve satisfactory solution quality with fast, robust numerical convergence, we propose an augmented Lagrangian formulation together with a preconditioning technique and a normalized subgradient-based multiplier updating scheme. Beyond pure wirelength minimization, we also propose a unified instance-area adjustment scheme that simultaneously optimizes routability, pin density, and downstream clustering compatibility. Experiments on the ISPD 2016 benchmark suite show that elfPlace outperforms four state-of-the-art FPGA placers, UTPlaceF, RippleFPGA, GPlace3.0, and UTPlaceF-DL, by 13.6%, 11.3%, 8.9%, and 7.1% in routed wirelength, respectively, with competitive runtime.
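The augmented Lagrangian machinery can be illustrated on a toy one-dimensional problem. This generic sketch (not elfPlace's density formulation, whose constraints, preconditioner, and normalized subgradient scheme are far richer) shows the inner minimization and outer multiplier-update loop:

```python
def solve_aug_lagrangian(f_grad, c, c_grad, x0, lam=0.0, mu=1.0,
                         inner_steps=200, outer_steps=20, lr=0.05):
    """Minimize f(x) subject to c(x) = 0 via the augmented Lagrangian
    L(x) = f(x) + lam * c(x) + (mu / 2) * c(x)**2.
    The multiplier `lam` is updated from the constraint residual, loosely
    analogous to a density-penalty multiplier update (illustrative only)."""
    x = float(x0)
    for _ in range(outer_steps):
        for _ in range(inner_steps):          # inner gradient descent on L
            g = f_grad(x) + (lam + mu * c(x)) * c_grad(x)
            x -= lr * g
        lam += mu * c(x)                      # multiplier update
    return x, lam

# Toy problem: minimize x^2 subject to x - 1 = 0  ->  optimum at x = 1.
x_opt, _ = solve_aug_lagrangian(lambda x: 2 * x, lambda x: x - 1,
                                lambda x: 1.0, x0=0.0)
```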
Title: GeniusRoute: A New Analog Routing Paradigm Using Generative Neural Network Guidance
Authors: Keren Zhu, Mingjie Liu, Yibo Lin, Biying Xu, Shaolan Li, Xiyuan Tang, Nan Sun, D. Pan
DOI: 10.1109/iccad45719.2019.8942164
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: Due to sensitive layout-dependent effects and varied performance metrics, analog routing automation for performance-driven layout synthesis is difficult to generalize. Existing research has proposed a number of heuristic layout constraints targeting specific performance metrics, but previous frameworks fail to automatically combine routing with human intelligence. This paper proposes a novel, fully automated analog routing paradigm that leverages machine learning to provide routing guidance, mimicking sophisticated manual layout approaches. Experiments show that the proposed methodology obtains significant improvements over existing techniques and achieves performance competitive with manual layouts, while generalizing to circuits of different functionality.
Title: Making the Fault-Tolerance of Emerging Neural Network Accelerators Scalable
Authors: Tao Liu, Wujie Wen
DOI: 10.1109/iccad45719.2019.8942073
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: Deep neural network (DNN) accelerators built upon emerging technologies such as the memristor are gaining increasing research attention because of the impressive computing efficiency brought by processing-in-memory. One critical challenge faced by these promising accelerators, however, is their poor reliability: each weight, stored as the memristance or resistance value of a device, suffers large uncertainty from unique device physical limitations (e.g., stochastic programming and resistance drift), which translates into prominent test-accuracy degradation. Non-trivial retraining, weight remapping, and redundant cell fixing are popular approaches to this issue, but they have limited scalability, amounting to tedious patch-adding or bug-fixing after identifying each accelerator-dependent defect map. Scalable solutions, on the other hand, are highly desirable in the envisioned scenario of a neural network trained once in the cloud and deployed to many edge devices, each equipped with an emerging accelerator. In this paper, we discuss the challenges and requirements of fault tolerance in these new accelerators. We then show how to address the problem through a scalable algorithm-hardware co-design method, focused on unleashing the algorithmic error resilience of DNN classifiers so as to eliminate expensive defect-map-specific calibration or training from scratch.
Title: LSOracle: a Logic Synthesis Framework Driven by Artificial Intelligence: Invited Paper
Authors: Walter Lau Neto, Max Austin, Scott Temple, L. Amarù, Xifan Tang, P. Gaillardon
DOI: 10.1109/iccad45719.2019.8942145
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: The increasing complexity of modern integrated circuits (ICs) leads to systems composed of many different intellectual property (IP) blocks, known as systems-on-chip (SoCs). Such complexity demands strong expertise from engineers, who rely on expensive commercial EDA tools. To overcome this limitation, an automated open-source logic synthesis flow is required. In this context, this work proposes LSOracle, a novel automated mixed logic synthesis framework. LSOracle is the first to exploit state-of-the-art And-Inverter Graph (AIG) and Majority-Inverter Graph (MIG) logic optimizers, relying on a Deep Neural Network (DNN) to automatically decide which optimizer should handle different portions of the circuit. To do so, LSOracle applies k-way partitioning to split a DAG into multiple partitions and uses the DNN to choose the best-fit optimizer for each. Post-tech-mapping ASIC results targeting the 7nm ASAP standard cell library, for a set of mixed-logic circuits, show an average improvement in area-delay product of 6.87% (up to 10.26%) and 2.70% (up to 6.27%) over AIG and MIG, respectively. In addition, we show that for the considered circuits, LSOracle achieves an area close to that of AIGs (which delivered smaller circuits) with performance similar to that of MIGs (which delivered faster circuits).
Title: ACG-Engine: An Inference Accelerator for Content Generative Neural Networks
Authors: Haobo Xu, Ying Wang, Yujie Wang, Jiajun Li, Bosheng Liu, Yinhe Han
DOI: 10.1109/iccad45719.2019.8942169
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: The technological breakthrough in Generative Adversarial Networks (GANs) has propelled content-generative applications such as AI-based painting, style transfer, and music composition. In contrast to earlier deep learning models for prediction and categorization, however, generative networks generally rely on the instance normalization (IN) layer for better feature distribution; it performs significantly better than batch normalization (BN) in image style transfer, image-to-image translation, and similar tasks. Unlike batch or group normalization, which can be fused into convolutional layers and ignored during inference, an instance normalization layer induces intensive computation and memory access. Yet prior deep learning accelerators for traditional neural networks and GANs mostly focus on accelerating convolution and deconvolution layers and lack support for IN operations, which can become a performance bottleneck on edge devices with insufficient computational power. To address this problem, we propose an inference accelerator for content generation (ACG-Engine) that supports the fundamental operations of generative networks: convolution layers, deconvolution layers, and specifically the instance normalization layer. We perform a hardware-aware mathematical transformation of the IN operation for lower computational complexity and memory friendliness, so that it maps efficiently onto the classic 2D processing-element array. Owing to these optimization techniques, ACG-Engine achieves a 4.56X speedup and up to 29X better power efficiency than a prior baseline scheme for generative network acceleration. In addition, ACG-Engine achieves performance comparable to classic CNN-specific accelerators with negligible power and area overhead.
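The reason IN cannot simply be folded away, as the abstract notes for BN, is that its statistics depend on the current input. A minimal numpy sketch contrasting the two (illustrative only, not ACG-Engine's transformed operator):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) feature map with its own statistics,
    which must be computed at inference time for every input."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_inference(x, running_mean, running_var, eps=1e-5):
    """BN at inference uses fixed running statistics, so the normalization
    can be folded into the preceding convolution's weights offline."""
    mu = running_mean[None, :, None, None]
    v = running_var[None, :, None, None]
    return (x - mu) / np.sqrt(v + eps)

x = np.random.default_rng(0).normal(size=(2, 3, 8, 8))  # N, C, H, W
y = instance_norm(x)
```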
Title: A Spectral Convolutional Net for Co-Optimization of Integrated Voltage Regulators and Embedded Inductors
Authors: H. Torun, Huan Yu, N. Dasari, Venkata Chaitanya Krishna Chekuri, Arvind Singh, Jinwoo Kim, S. Lim, S. Mukhopadhyay, M. Swaminathan
DOI: 10.1109/iccad45719.2019.8942109
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: Integrated voltage regulators (IVRs) with embedded inductors are an emerging technology that provides point-of-load voltage regulation for high-performance systems. Conventional two-step approaches to IVR design can yield suboptimal results because the optimal inductor depends on the characteristics of the buck converter (BC); inductor-level trade-offs such as AC and DC resistance, inductance, and area cannot be determined independently of the BC. This co-dependency of the BC and the inductor creates a highly non-linear response surface, which necessitates co-optimization involving multiple time-consuming electromagnetics (EM) simulations. In this paper, we propose a machine-learning-based optimization methodology that eliminates EM simulations from the optimization loop to significantly reduce optimization complexity. A novel technique named the Spectral Transposed Convolutional Neural Network (S-TCNN) is presented to derive an accurate predictive model of the inductor frequency response from a small amount of training data. The derived S-TCNN is then used, along with a time-domain model of the BC, to perform multi-objective optimization that approximates the Pareto front for five objectives: inductor area, BC settling time, voltage conversion efficiency, droop, and ripple. The resulting methodology provides multiple Pareto-optimal inductors in an efficient, fully automated fashion, thereby allowing rapid determination of the optimal trade-offs among possibly conflicting design objectives. We demonstrate the proposed framework on the co-optimization of a solenoidal inductor with a magnetic core and a BC integrated on a silicon interposer.
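Approximating a Pareto front reduces to keeping the non-dominated points. A generic sketch assuming all objectives are minimized (a maximized objective such as conversion efficiency would be negated first); this illustrates the concept, not the paper's optimizer:

```python
def dominates(q, p):
    """q dominates p if q is no worse in every objective and differs from p."""
    return all(qi <= pi for qi, pi in zip(q, p)) and q != p

def pareto_front(points):
    """Keep the non-dominated points among tuples of objective values."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy 2-objective example, e.g. (inductor area, settling time):
front = pareto_front([(1, 4), (2, 2), (4, 1), (3, 3)])  # (3, 3) is dominated
```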
Title: IcySAT: Improved SAT-based Attacks on Cyclic Locked Circuits
Authors: Kaveh Shamsi, D. Pan, Yier Jin
DOI: 10.1109/iccad45719.2019.8942049
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: "Cyclic" circuit locking/camouflaging is a recently proposed direction in logic obfuscation for thwarting foundry and end-user reverse engineering. As opposed to traditional schemes, these techniques create cycles in the obfuscated circuit in a way that confuses the attacker without disrupting the combinational nature of the circuit. While such schemes can thwart the baseline SAT-based attack, the CycSAT attack was recently proposed to break them through a preprocessing step that builds a Boolean condition to avoid cyclic solutions/keys during the attack. Follow-up work, however, has suggested that extracting these conditions requires enumerating all cycles in the circuit, or that, instead of relying on these conditions preemptively, cyclic solutions must be banned individually on the fly. In this paper we present new algorithms for performing SAT-based attacks on cyclic circuits. We first propose an algorithm that produces non-cyclic conditions in polynomial time with respect to the size of the circuit, avoiding the potentially exponential runtime of explicit key banning or cycle enumeration. We then take a deeper look at the problem, discuss some fundamental limitations of extracting precise non-cyclic conditions, and propose a more complex but complete procedure for cyclic deobfuscation. We evaluate our attacks on densely cyclic obfuscated benchmark circuits.
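The crux of a non-cyclic condition is that any key the attacker accepts must induce an acyclic (combinational) circuit. A toy model in Python, with key-controlled fanin selection standing in for obfuscation muxes (a hypothetical circuit representation, not IcySAT's SAT encoding):

```python
def is_combinational(gates, key):
    """Return True iff the circuit induced by `key` is acyclic.
    `gates` maps a gate name to a function key -> list of active fanin gates,
    modeling key-controlled multiplexers (hypothetical circuit model)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {g: WHITE for g in gates}

    def dfs(g):
        color[g] = GRAY
        for fanin in gates[g](key):
            if color[fanin] == GRAY:          # back edge: a cycle through fanin
                return False
            if color[fanin] == WHITE and not dfs(fanin):
                return False
        color[g] = BLACK
        return True

    return all(dfs(g) for g in gates if color[g] == WHITE)

# The key bit selects gate b's fanin: key=1 feeds b back from a (cyclic),
# key=0 feeds b from a primary input (acyclic, hence a valid combinational key).
gates = {"a": lambda k: ["b"], "b": lambda k: ["a"] if k else []}
```

The paper's contribution is expressing this acyclicity requirement as a polynomial-size Boolean constraint for the SAT solver, rather than checking or banning keys one at a time as this sketch does.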
Title: eSRCNN: A Framework for Optimizing Super-Resolution Tasks on Diverse Embedded CNN Accelerators
Authors: Youngbeom Jung, Yeongjae Choi, Jaehyeong Sim, L. Kim
DOI: 10.1109/iccad45719.2019.8942086
Venue: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2019
Abstract: CNN-based super-resolution (SR), one of the most representative low-level vision tasks, is a promising way to improve users' QoS on IoT devices that suffer from limited network bandwidth and storage capacity, by effectively enhancing image/video resolution. Although prior embedded CNN accelerators show tremendous performance and energy efficiency, they are not well suited to SR tasks because of off-chip memory accesses. In this work, we present eSRCNN, a framework that enables energy-efficient SR on diverse embedded CNN accelerators by decreasing off-chip memory accesses. The framework consists of three steps: network reformation using cross-layer weight scaling, precision minimization with priority-based quantization, and activation-map compression exploiting data locality. As a result, the energy consumption of off-chip memory accesses is reduced by up to 71.89% with less than 3.52% area overhead.