{"title":"DRAMA","authors":"Duy-Thanh Nguyen, Changhong Min, Nhut-Minh Ho, I. Chang","doi":"10.1145/3400302.3415637","DOIUrl":"https://doi.org/10.1145/3400302.3415637","url":null,"abstract":"As the density of DRAM becomes larger, the refresh overhead becomes more significant. This becomes more problematic in the systems that require large DRAM capacity, such as the training of deep neural networks (DNNs). To solve this problem, we present DRAMA, a novel architecture which employs the approximate characteristic of DNNs. We make that non-critical bits are not refreshed while critical bits are normally refreshed. The refresh time of the critical bits are concealed by employing per-bank refreshes, significantly improving the training system performance. Furthermore, the potential racing hazard of bank-refresh technique is simply prevented by our novel command scheduler in DRAM controllers. Our experiments on various recent DNNs show that DRAMA can improve the training system performance by 10.4% and save 23.77% DRAM energy compared to the conventional architecture.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116353933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dali","authors":"Yihang Yang, Jiayuan He, R. Manohar","doi":"10.1145/3400302.3415689","DOIUrl":"https://doi.org/10.1145/3400302.3415689","url":null,"abstract":"Asynchronous Very-Large-Scale-Integration (VLSI) has several potential benefits over its synchronous counterparts, such as reduced power consumption, elastic pipelining, and robustness to variations. However, the lack of electronic design automation (EDA) support for asynchronous circuits, especially physical layout automation tools, largely limits their adoption. To tackle this challenge, we propose a gridded cell layout methodology for asynchronous circuits, in which the cell height and cell width can be any integer multiple of two grid values. The gridded cell approach combines the shape regularity of standard cells with the size flexibility of custom design, and thus achieves a better space utilization ratio and lower wire-length for asynchronous designs. We present the algorithms and our implementation of Dali, a gridded cell placer, that consists of an analytical global placer, a forward-backward legalizer, an N/P-well legalizer, and a power grid router. We show that the gridded cell placement approach reduces area by 15% without impacting the routability of the design. We have also used Dali to tape out a chip in a 65nm process technology, demonstrating that our placer generates design-rule clean placement.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124182083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RIMI","authors":"Haeyoung Kim, Jinjae Lee, Derry Pratama, A. Awaludin, Howon Kim, D. Kwon","doi":"10.1145/3400302.3415727","DOIUrl":"https://doi.org/10.1145/3400302.3415727","url":null,"abstract":"With the advent of the Internet of Things, embedded systems have become widely used in various fields. Concurrently, the security of these systems has become a concern for many. However, security features that are already available for high-end systems have not been provided in low-end embedded systems due to its negative impact on cost and power consumption. Thus, to increase security with low overhead, many studies to implement the memory isolation approach to these systems have been conducted. However, existing techniques for this approach have suffered from problems in terms of scalability or performance. To mitigate such problems, we present RIMI, a new instruction extension to provide memory isolation in embedded systems. Thanks to instructions in RIMI, we can implement an instruction-level memory isolation where the access permission is bound to each memory and control transfer instructions. We implemented the RIMI prototype on a RISC-V architecture, which is a prominent open-source instruction set architecture (ISA). Our evaluation results show that existing security solutions, i.e., shadow stacks and in-process isolation, can be efficiently implemented with RIMI.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122040930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GridNet","authors":"Han Zhou, Wentian Jin, S. Tan","doi":"10.1145/3400302.3415714","DOIUrl":"https://doi.org/10.1145/3400302.3415714","url":null,"abstract":"Electromigration (EM) is a major failure effect for on-chip power grid networks of deep submicron VLSI circuits. EM degradation of metal grid lines can lead to excessive voltage drops (IR drops) before the target lifetime. In this paper, we propose a fast data-driven EM-induced IR drop analysis framework for power grid networks, named GridNet, based on the conditional generative adversarial networks (CGAN). It aims to accelerate the incremental full-chip EM-induced IR drop analysis, as well as IR drop violation fixing during the power grid design and optimization. More importantly, GridNet can naturally leverage the differentiable feature of deep neural networks (DNN) to obtain the sensitivity information of node voltage with respect to the wire resistance (or width) with marginal cost. Grid-Net treats continuous time and the given electrical features as input conditions, and the EM-induced time-varying voltage of power grid networks as the conditional outputs, which are represented as data series images. We show that GridNet is able to learn the temporal dynamics of the aging process in continuous time domain. Besides, we can take advantage of the sensitivity information provided by GridNet to perform efficient localized IR drop violation fixing in the late stage design and optimization. Numerical results on 36000 synthesized power grid network samples demonstrate that the new method can lead to 105× speedup over the recently proposed full-chip coupled EM and IR drop analysis tool. We further show that localized IR drop violation fix for the same set of power grid networks can be performed remarkably efficiently using the cheap sensitivity computation from GridNet.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125663507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NEST","authors":"Wenqin Huangfu, Krishna T. Malladi, Shuangchen Li, P. Gu, Yuan Xie","doi":"10.1145/3400302.3415724","DOIUrl":"https://doi.org/10.1145/3400302.3415724","url":null,"abstract":"With the ability to help wildlife conservation, precise medical care, and disease understanding, genomics analysis is becoming more and moe important. Recently, with the development and wide adoption of the Next-Generation Sequencing (NGS) technology, bio-data grows exponentially, putting forward great challenges for k-mer counting - a widely used application in genomics analysis. Many hardware approaches have been explored to accelerate k-mer counting. Most of those approaches are compute-centric, i.e., based on CPU/GPU/FPGA. However, the space for performance improvement is limited for compute-centric accelerators, because k-mer counting is a memory-bound application. By integrating memory and computation close together and embracing higher memory bandwidth, Near-Data-Processing (NDP) is a good candidate to accelerate k-mer counting. Unfortunately, due to challenges of communication, bandwidth utilization, workload balance, and redundant memory accesses, previous NDP accelerators for k-mer counting cannot fully unleash the power of NDP. To build a practical, scalable, high-performance, and energy-efficient NDP accelerator for k-mer counting, we perform hardware/software co-design and propose the DIMM based Near-Data-Processing Accelerator for k-mer counting (NEST). To fully unleash the potential of NEST architecture, we modify the k-mer counting algorithm and propose a dedicated workflow to support efficient parallelism. Moreover, the proposed algorithm and workflow are able to reduce unnecessary inter-DIMM communication. To improve memory bandwidth utilization, we propose a novel address mapping scheme. The challenge of workload balance is addressed with the proposed task scheduling technique. In addition, scattered memory access and task switching are proposed to eliminate redundant memory access. Experimental results show that NEST provides 677.33x/27.24x/6.02x performance improvement and 1076.14x/62.26x/4.30x energy reduction, compared with a 48-thread CPU, a CPU/GPU hybrid approach, and a state-of-the-art NDP accelerator, respectively.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"322 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116433662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SuSy","authors":"Yi-Hsiang Lai, Hongbo Rong, Size Zheng, Weihao Zhang, Xiuping Cui, Yunshan Jia, Jie Wang, Brendan Sullivan, Zhiru Zhang, Yuning. Liang, Youhui Zhang, J. Cong, N. George, J. Alvarez, Ciaran Hughes, P. Dubey","doi":"10.1145/3400302.3415644","DOIUrl":"https://doi.org/10.1145/3400302.3415644","url":null,"abstract":". We define stable supercurves and stable supermaps, and based on these notions we develop a theory of Nori motives for the category of stable supermaps of SUSY curves with punctures. This will require several preliminary constructions, including the development of a basic theory of supercycles.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121865172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JKQ","authors":"Robert-Jan Wille, S. Hillmich, Lukas Burgholzer","doi":"10.1145/3400302.3415746","DOIUrl":"https://doi.org/10.1145/3400302.3415746","url":null,"abstract":"With quantum computers on the brink of practical applicability, there is a lively community that develops toolkits for the design of corresponding quantum circuits. Many of the problems to be tackled here are similar to design problems from the classical realm for which sophisticated design automation tools have been developed in the previous decades. In this paper, we present JKQ-a set of tools for quantum computing developed at the Johannes Kepler University (JKU) Linz which utilizes this design automation expertise. By this, we offer complementary approaches for many design problems in quantum computing such as simulation, compilation, or verification. In the following, we provide an introduction of the tools for potential users who would like to work with them as well as potential developers aiming to extend them.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127572272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlowTune","authors":"Cunxi Yu","doi":"10.1145/3400302.3415615","DOIUrl":"https://doi.org/10.1145/3400302.3415615","url":null,"abstract":"Recent years have seen increasing employment of decision intelligence in electronic design automation (EDA), which aims to reduce the manual efforts and boost the design closure process in modern toolflows. However, existing approaches either require a large number of labeled data for training or are limited in practical EDA toolflow integration due to computation overhead. This paper presents a generic end-to-end and high-performance domain-specific, multi-stage multi-armed bandit framework for Boolean logic optimization. This framework addresses optimization problems on a) And-Inv-Graphs (# nodes), b) Conjunction Normal Form (CNF) minimization (# clauses) for Boolean Satisfiability, c) post static timing analysis (STA) delay and area optimization for standard-cell technology mapping, and d) FPGA technology mapping for 6-in LUT architectures. Moreover, the proposed framework has been integrated with ABC [1], Yosys [2], VTR [3], and industrial tools. The experimental results demonstrate that our framework outperforms both hand-crafted flows [1] and ML explored flows [4], [5] in quality of results, and is orders of magnitude faster compared to ML-based approaches [4], [5].","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122619834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HitM","authors":"Bing Li, Ying Wang, Yiran Chen","doi":"10.1145/3400302.3415663","DOIUrl":"https://doi.org/10.1145/3400302.3415663","url":null,"abstract":"With the rapid progress of artificial intelligence (AI) algorithms, multi-modal deep neural networks (DNNs) have been applied to some challenging tasks, e.g., image and video description to process multi-modal information from vision and language. Resistive-memory-based processing-in-memory (ReRAM-based PIM) has been extensively studied to accelerate either convolutional neural network (CNN) or recurrent neural network (RNN). According to the requirements of their core layers, i.e. convolutional layers and linear layers, the existing ReRAM-based PIMs adopt different optimization schemes for them. Directly deploying multi-modal DNNs on the existing ReRAM-based PIMs, however, is inefficient because multi-modal DNNs have combined CNN and RNN where the primary layers differ depending on the specific tasks. Therefore, a high-efficiency ReRAM-based PIM design for multi-modal DNNs necessitates an adaptive optimization to the given network. In this work, we propose HitM, a high-throughput ReRAM-based PIM for multi-modal DNNs with a two-stage workflow, which consists of a static analysis and an adaptive optimization. The static analysis generates the layer-wise resource and computation information with the input multi-modal DNN description and the adaptive optimization produces a high-throughput ReRAM-based PIM design through the dynamic algorithm based on hardware resources and the information from the static analysis. We evaluated HitM using several popular multi-modal DNNs with different parameters and structures and compared it with a naive ReRAM-based PIM design and an optimal-throughput ReRAM-based PIM design that assumes no hardware resource limitations. The experimental results show that HitM averagely achieves 78.01% of the optimal throughput while consumes 64.52% of the total hardware resources.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123111513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"THRIFTY","authors":"Saransh Gupta, Justin Morris, M. Imani, R. Ramkumar, Jeffrey Yu, Aniket Tiwari, Baris Aksanli, T. Rosing","doi":"10.1145/3400302.3415723","DOIUrl":"https://doi.org/10.1145/3400302.3415723","url":null,"abstract":"Hyperdimensional computing (HDC) is a brain-inspired computing paradigm that works with high-dimensional vectors, hypervectors, instead of numbers. HDC replaces several complex learning computations with bitwise and simpler arithmetic operations, resulting in a faster and more energy-efficient learning algorithm. However, it comes at the cost of an increased amount of data to process due to mapping the data into high-dimensional space. While some datasets may nearly fit in the memory, the resulting hypervectors more often than not can't be stored in memory, resulting in long data transfers from storage. In this paper, we propose THRIFTY, an in-storage computing (ISC) solution that performs HDC encoding and training across the flash hierarchy. To hide the latency of training and enable efficient computation, we introduce the concept of batching in HDC. It allows us to split HDC training into sub-components and process them independently. We also present, for the first time, on-chip acceleration for HDC which uses simple low-power digital circuits to implement HDC encoding in Flash planes. This enables us to explore high internal parallelism provided by the flash hierarchy and encode multiple data points in parallel with negligible latency overhead. THRIFTY also implements a single top-level FPGA accelerator, which further processes the data obtained from the chips. We exploit the state-of-the-art INSIDER ISC infrastructure to implement the top-level accelerator and provide software support to THRIFTY. THRIFTY runs HDC training completely in storage while almost entirely hiding the latency of computation. Our evaluation over five popular classification datasets shows that THRIFTY is on average 1612× faster than a CPU-server and 14.4× faster than the state-of-the-art ISC solution, INSIDER for HDC encoding and training.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122818405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}