{"title":"NAIS: Neural Architecture and Implementation Search and its Applications in Autonomous Driving","authors":"Cong Hao, Yao Chen, Xinheng Liu, A. Sarwari, Daryl Sew, Ashutosh Dhar, Bryan Wu, Dongdong Fu, Jinjun Xiong, Wen-mei W. Hwu, Junli Gu, Deming Chen","doi":"10.1109/iccad45719.2019.8942055","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942055","url":null,"abstract":"The rapidly growing demands for powerful AI algorithms in many application domains have motivated massive investment in both high-quality deep neural network (DNN) models and high-efficiency implementations. In this position paper, we argue that a simultaneous DNN/implementation co-design methodology, named Neural Architecture and Implementation Search (NAIS), deserves more research attention to boost the development productivity and efficiency of both DNN models and implementation optimization. We propose a stylized design methodology that can drastically cut down the search cost while preserving the quality of the end solution. As an illustration, we discuss this DNN/implementation methodology in the context of both FPGAs and GPUs. We take autonomous driving as a key use case as it is one of the most demanding areas for high quality AI algorithms and accelerators. We discuss how such a co-design methodology can impact the autonomous driving industry significantly. We identify several research opportunities in this exciting domain.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125506887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Verifying Conformance of Neural Network Models: Invited Paper","authors":"M. Narasimhamurthy, Taisa Kushner, Souradeep Dutta, S. Sankaranarayanan","doi":"10.1109/iccad45719.2019.8942151","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942151","url":null,"abstract":"Neural networks are increasingly used as data-driven models for a wide variety of physical systems such as ground vehicles, airplanes, human physiology and automobile engines. These models are in-turn used for designing and verifying autonomous systems. The advantages of using neural networks include the ability to capture characteristics of particular systems using the available data. This is particularly advantageous for medical systems, wherein the data collected from individuals can be used to design devices that are well-adapted to a particular individual's unique physiological characteristics. At the same time, neural network models remain opaque: their structure makes them hard to understand and interpret by human developers. One key challenge lies in checking that neural network models of processes are “conformant” to the well established scientific (physical, chemical and biological) laws that underlie these models. In this paper, we will show how conformance often fails in models that are otherwise accurate and trained using the best practices in machine learning, with potentially serious consequences. We motivate the need for learning and verifying key conformance properties in data-driven models of the human insulin-glucose system and data-driven automobile models. We survey verification approaches for neural networks that can hold the key to learning and verifying conformance.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124699217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReDRAM: A Reconfigurable Processing-in-DRAM Platform for Accelerating Bulk Bit-Wise Operations","authors":"Shaahin Angizi, Deliang Fan","doi":"10.1109/iccad45719.2019.8942101","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942101","url":null,"abstract":"In this paper, we propose ReDRAM, as a reconfigurable DRAM-based processing-in-memory (PIM) accelerator, which transforms current DRAM architecture to massively parallel computational units exploiting the high internal bandwidth of modern memory chips. ReDRAM uses the analog operation of DRAM sub-arrays and elevates it to implement a full set of 1- and 2-input bulk bit-wise operations (NOT, (N)AND, (N)OR, and even X(N)OR) between operands stored in the same bit-line, based on a new dual-row activation mechanism with a modest change to peripheral circuits such sense amplifiers. ReDRAM can be leveraged to greatly reduce energy consumption and latency of complex in-DRAM logic computations relying on state-of-the-art mechanisms based on triple-row activation, dual-contact cells, row initialization, NOR style, etc. The extensive circuit-architecture simulations show that ReDRAM achieves on average 54× and 7.1× higher throughput for performing bulk bit-wise operations compared with CPU and GPU, respectively. Besides, ReDRAM outperforms recent processing-in-DRAM platforms with up to 3.7× better performance.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114411423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Instantaneous Sanitization through Disturbance-induced Errors and Recycling Programming over 3D Flash Memory","authors":"Wei-Chen Wang, P. Lin, Yung-Chun Li, Chien-Chung Ho, Yu-Ming Chang, Yuan-Hao Chang","doi":"10.1109/iccad45719.2019.8942084","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942084","url":null,"abstract":"As data security has become one of the most crucial issues in modern storage system/application designs, the data sanitization techniques are regarded as the promising solution on 3D NAND flash-memory-based devices. Many excellent works had been proposed to exploit the in-place reprogramming, erasure and encryption techniques to achieve and implement the sanitization functionalities. However, existing sanitization approaches could lead to performance, disturbance overheads or even deciphered issues. Different from existing works, this work aims at exploring an instantaneous data sanitization scheme by taking advantage of programming disturbance properties. Our proposed design can not only achieve the instantaneous data sanitization by exploiting programming disturbance and error correction code properly, but also enhance the performance with the recycling programming design. The feasibility and capability of our proposed design are evaluated by a series of experiments on 3D NAND flash memory chips, for which we have very encouraging results. The experiment results show that the proposed design could achieve the instantaneous data sanitization with low overhead; besides, it improves the average response time and reduces the number of block erase count by up to 86.8% and 88.8%, respectively.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132536750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Endurance Enhancement of Multi-Level Cell Phase Change Memory","authors":"Cheong-Yeop Lee, Youngsoo Song, Youngsoo Shin","doi":"10.1109/iccad45719.2019.8942175","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942175","url":null,"abstract":"Phase change memory (PCM) is a promising device for its good scalability and negligible standby power consumption. Multi-level cell (MLC) PCM allows higher memory density, but it suffers from reduced endurance due to frequent RESET operations during writing. Inter-state direct write (ISDW) method is proposed, in which intermediate states ‘01’ and ‘10’ are reached without RESET initialization. A new MLC PCM model is presented, which takes account of phase configuration of each MLC PCM state; the feasibility of ISDW is assessed using the model. Compression-based RESET removal encoding (CRE) is also proposed to further reduce the number of RESET operations. Experiments demonstrate that the proposed methods achieve 38.4× enhancement of cell endurance; the writing energy dissipation is reduced to 31% on average of test cases.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133767221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Compilation Effort in Commercial FPGA Emulation Systems Using Machine Learning","authors":"Anthony Agnesina, E. Lepercq, J. Escobedo, S. Lim","doi":"10.1109/iccad45719.2019.8942091","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942091","url":null,"abstract":"This paper presents a machine learning (ML) framework to improve the use of computing resources in the FPGA compilation step of a commercial FPGA-based logic emulation flow. Our ML models enable highly accurate predictability of the final P&R design qualities, runtime, and optimal mapping parameters. We identify key compilation features that may require aggressive compilation efforts using our ML models. Experiments based on our large-scale database from an industry's emulation system show that our ML models help reduce the total number of jobs required for a given netlist by 33%. Moreover, our job scheduling algorithm based on our ML model reduces the overall time to completion of concurrent compilation runs by 24%. In addition, we propose a new method to compute “recommendations” from our ML model, in order to perform repartitioning of difficult partitions. Tested on a large-scale industry SoC design, our recommendation flow provides additional 15% compile time savings for the entire SoC.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124635516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-Acceleration","authors":"Qijing Huang, Christopher Yarp, S. Karandikar, Nathan Pemberton, Benjamin Brock, Liang Ma, Guohao Dai, Robert Quitt, K. Asanović, J. Wawrzynek","doi":"10.1109/iccad45719.2019.8942048","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942048","url":null,"abstract":"To overcome the end of traditional scaling, modern SoC systems consist of general-purpose compute augmented with large numbers of specialized accelerators. However, building and evaluating these systems is extremely expensive and time-consuming, even in early stages of development. While high-level modeling and back-of-the-envelope calculations can provide early insights into a new system, there are key effects that only manifest at the full-system level. However, full-system design has traditionally required writing RTL or developing complex software models for the entire design. In this paper, we describe a methodology and implement an open-source flow (“Centrifuge”) that can rapidly generate and evaluate heterogeneous SoCs by combining an HLS toolchain with the open-source FireSim FPGA-accelerated simulation platform. Our system can quickly produce complete SoC systems with many integrated HLS-generated accelerators as specified by the user, simulate them quickly and cycle-accurately on FPGAs, and run complete software stacks on top, including booting Linux and running full application frameworks. Our system allows users to easily explore a variety of accelerator integration techniques, by automatically integrating accelerators in several ways—as tightly coupled RoCC accelerators, as accelerators that communicate over the standard on-chip network, and lastly as “disaggregated” accelerators that are directly attached to an Ethernet network between SoCs. By integrating these tools, our methodology allows users to rapidly generate an entire hardware/software stack for a customized SoC that can be fabricated as an ASIC and evaluate its end-to-end performance using cycle-exact FPGA simulation, allowing for agile design-space exploration of novel accelerator-based systems.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124943731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DEEPEYE: A Deeply Tensor-Compressed Neural Network Hardware Accelerator: Invited Paper","authors":"Yuan Cheng, Guangya Li, Ngai Wong, Hai-Bao Chen, Hao Yu","doi":"10.1109/iccad45719.2019.8942052","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942052","url":null,"abstract":"Video detection and classification constantly involve high dimensional data that requires a deep neural network (DNN) with huge number of parameters. It is thereby quite challenging to develop a DNN video comprehension at terminal devices. In this paper, we introduce a deeply tensor compressed video comprehension neural network called DEEPEYE for inference at terminal devices. Instead of building a Long Short-Term Memory (LSTM) network directly from raw video data, we build a LSTM-based spatio-temporal model from tensorized time-series features for object detection and action recognition. Moreover, a deep compression is achieved by tensor decomposition and trained quantization of the time-series feature-based spatio-temporal model. We have implemented DEEPEYE on an ARM-core based IOT board with only 2.4W power consumption. Using the video datasets MOMENTS and UCF11 as benchmarks, DEEPEYE achieves a 228.1× model compression with only 0.47% mAP deduction; as well as 15k× parameter reduction yet 16.27% accuracy improvement.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124945154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Internet of Microfluidic Things: Perspectives on System Architecture and Design Challenges: Invited Paper","authors":"Mohamed Ibrahim, M. Gorlatova, K. Chakrabarty","doi":"10.1109/iccad45719.2019.8942080","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942080","url":null,"abstract":"The integration of microfluidics and biosensor technology is transforming microbiology research by providing new capabilities for clinical diagnostics, cancer research, and pharmacology studies. This integration enables new approaches for biochemistry automation and cyber-physical adaptation. Similarly, recent years have witnessed the rapid growth of the Internet of Things (IoT) paradigm, where different types of real-world elements such as wearable sensors are connected and allowed to autonomously interact with each other. Combining the advances of both cyber-physical microfluidics and IoT domains can generate new opportunities for knowledge fusion by transforming distributed local microfluidic elements into a global network of coordinated microfluidic systems. This paper aims to streamline this transformation and it presents a research vision for enabling the Internet of Microfluidic Things (IoMT). To leverage advances in connected Microfluidic Things, we highlight new perspectives on system architecture, and describe technical challenges related to design automation, temporal flexibility, security, and service assignment. This vision is supported by case studies from cancer research and pharmacology studies to explain the significance of the proposed framework.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114505528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cloud Columba: Accessible Design Automation Platform for Production and Inspiration: Invited Paper","authors":"Tsun-Ming Tseng, Mengchu Li, Yushen Zhang, Tsung-Yi Ho, Ulf Schlichtmann","doi":"10.1109/iccad45719.2019.8942104","DOIUrl":"https://doi.org/10.1109/iccad45719.2019.8942104","url":null,"abstract":"Design automation for continuous-flow microfluidic large-scale integration (mLSI) biochips has made remarkable progress over the past few years. Nowadays a biochip containing up to hundreds of components can be automatically synthesized within a few minutes. However, the current advanced design automation tools are mostly developed for research use, which focus essentially on the algorithmic performance but overlook the accessibility. Therefore, we have started the Cloud Columba project since 2017 to provide users from different backgrounds with easy access to the state-of-the-art design automation approaches. Without being limited by the computing power of their end devices, users just need to formulate their design requests in a high abstraction level, based on which the cloud server will automatically synthesize a customized manufacturing-ready biochip design, which can be viewed and stored using simply a web browser. With the computer-synthesized designs, Cloud Columba supports application developers to explore a wider range of possibilities, and algorithm developers to validate and improve their ideas based on a practical foundation.","PeriodicalId":363364,"journal":{"name":"2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)","volume":"157 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122466260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}