{"title":"Black Box Search Space Profiling for Accelerator-Aware Neural Architecture Search","authors":"Shulin Zeng, Hanbo Sun, Yu Xing, Xuefei Ning, Yi Shan, Xiaoming Chen, Yu Wang, Huazhong Yang","doi":"10.1109/ASP-DAC47756.2020.9045179","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045179","url":null,"abstract":"Neural Architecture Search (NAS) is a promising approach to discover good neural network architectures for given applications. Among the three basic components in a NAS system (search space, search strategy, and evaluation), prior work mainly focused on the development of different search strategies and evaluation methods. As most of the previous hardware-aware search space designs aimed at CPUs and GPUs, it still remains a challenge to design a suitable search space for Deep Neural Network (DNN) accelerators. Besides, the architectures and compilers of DNN accelerators vary greatly, so it is quite difficult to get a unified and accurate evaluation of the latency of DNN across different platforms. To address these issues, we propose a black box profiling-based search space tuning method and further improve the latency evaluation by introducing a layer adaptive latency correction method. Used as the first stage in our general accelerator-aware NAS pipeline, our proposed methods could provide a smaller and dynamic search space with a controllable trade-off between accuracy and latency for DNN accelerators. Experimental results on CIFAR-10 and ImageNet demonstrate our search space is effective with up to 12.7% improvement in accuracy and 2.2x reduction of latency, and also efficient by reducing the search time and GPU memory up to 4.35x and 6.25x, respectively.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132634059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tolerating Retention Failures in Neuromorphic Fabric based on Emerging Resistive Memories","authors":"Christopher Münch, R. Bishnoi, M. Tahoori","doi":"10.1109/ASP-DAC47756.2020.9045339","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045339","url":null,"abstract":"In recent years, computation is shifting from conventional high performance servers to Internet of Things (IoT) edge devices, most of which require the processing of cognitive tasks. Hence, a great effort is put in the realization of neural network (NN) edge devices and their efficiency in inferring a pretrained Neural Network. In this paper, we evaluate the retention issues of emerging resistive memories used as non-volatile weight storage for embedded NN. We exploit the asymmetric retention behavior of Spintronic based Magnetic Tunneling Junctions (MTJs), which is also present in other resistive memories like Phase-Change memory (PCM) and ReRAM, to optimize the retention of the NN accuracy over time. We propose mixed retention cell arrays and an adapted training scheme to achieve a trade-off between array size and the reliable long-term accuracy of NNs. The results of our proposed method save up to 24% of inference accuracy of an MNIST trained Multi-Layer-Perceptron on MTJ-based crossbars.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134244032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable Power Grid Network Design Framework Considering EM Immortalities for Multi-Segment Wires","authors":"Han Zhou, Shuyuan Yu, Zeyu Sun, S. Tan","doi":"10.1109/ASP-DAC47756.2020.9045673","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045673","url":null,"abstract":"This paper presents a new power grid network design and optimization technique that considers the new EM immortality constraint due to EM void saturation volume for multi-segment interconnects. Void may grow to its saturation volume without changing the wire resistance significantly. However, this phenomenon was ignored in existing EM-aware optimization methods. By considering this new effect, we can remove more conservativeness in the EM-aware on-chip power grid design. Along with recently proposed nucleation phase immortality constraint for multi-segment wires, we show that both EM immortality constraints can be naturally integrated into the existing programming based power grid optimization framework. To further mitigate the overly conservative problem of existing immortality-constrained optimization methods, we further explore two strategies: first we size up failed wires to meet one of the immorality conditions subject to design rules; second, we consider the EM-induced aging effects on power supply networks for a target lifetime, which allows some short-lifetime wires to fail and optimizes the rest of the wires. Numerical results on a number of IBM-format power grid networks demonstrate that the new method can reduce more power grid area compared to the existing EM-immortality constrained optimizations. Furthermore, the new method can optimize power grids with nucleated wires, which would not be possible with the existing methods.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133321167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Read-Intensive Key-Value Stores with Tidal Structure Based on LSM-Tree","authors":"Yi Wang, Shangyu Wu, Rui Mao","doi":"10.1109/ASP-DAC47756.2020.9045617","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045617","url":null,"abstract":"Key-value store has played a critical role in many large-scale data storage applications. The log-structured merge-tree (LSM-tree) based key-value store achieves excellent performance on write-intensive workloads which is mainly benefited from the mechanism of converting a batch of random writes into sequential writes. However, LSM-tree doesn’t improve a lot in read-intensive workloads which takes a higher latency. The main reason lies in the hierarchical search mechanism in LSM-tree structure. The key challenge is how to propose new strategies based on the existing LSM-tree structure to improve read efficiency and reduce read amplifications.This paper proposes Tidal-tree, a novel data structure where data flows inside LSM-tree like Tidal waves. Tidal-tree targets at improving read efficiency in read-intensive workloads. Tidal-tree allows frequently accessed files at the bottom of LSM-tree to move to higher positions, thereby reducing read latency. Tidal-tree also makes LSM-tree into a variable shape to cater for different characteristic workloads. To evaluate the performance of Tidal-tree, we conduct a series of experiments using standard benchmarks from YCSB. The experimental results show that Tidal-tree can significantly improve read efficiency and reduce read amplifications compared with representative schemes.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114758528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Search-free Accelerator for Sparse Convolutional Neural Networks","authors":"Bosheng Liu, Xiaoming Chen, Yinhe Han, Ying Wang, Jiajun Li, Haobo Xu, Xiaowei Li","doi":"10.1109/ASP-DAC47756.2020.9045580","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045580","url":null,"abstract":"Sparsification is an efficient solution to reduce the demand of on-chip memory space for deep convolutional neural networks (CNNs). Most of state-of-the-art CNN accelerators can deliver high throughput for sparse CNNs by searching pairs of nonzero weights and activations, and then sending them to processing elements (PEs) for multiplication-accumulation (MAC) operations. However, their PE scales are difficult to be increased for superior and efficient computing because of the significant internal interconnect and memory bandwidth consumption. To deal with this dilemma, we propose a sparsity-aware architecture, called Swan, which frees the search process for sparse CNNs under limited interconnect and bandwidth resources. The architecture comprises two parts: a MAC unit that can free the search operation for the sparsity-aware MAC calculation, and a systolic compressive dataflow that well suits the MAC architecture and greatly reuses inputs for interconnect and bandwidth saving. With the proposed architecture, only one column of the PEs needs to load/store data while all PEs can operate in full scale. Evaluation results based on a place-and-route process show that the proposed design, in a compact factor of 4096 PEs, 4.9TOP/s peak performance, and 2.97W power running at 600MHz, achieves 1.5-2.1× speedup and 6.0-9.1× higher energy efficiency than state-of-the-art CNN accelerators with the same PE scale.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127826894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning Based Online Full-Chip Heatmap Estimation","authors":"Sheriff Sadiqbatcha, Yue Zhao, Jinwei Zhang, H. Amrouch, J. Henkel, S. Tan","doi":"10.1109/ASP-DAC47756.2020.9045204","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045204","url":null,"abstract":"Runtime power and thermal control is crucial in any modern processor. However, these control schemes require accurate real-time temperature information, ideally of the entire die area, in order to be effective. On-chip temperature sensors alone cannot provide the full-chip temperature information since the number of sensors that are typically available is very limited due to their high area and power overheads. Furthermore, as we will demonstrate, the peak locations within hot-spots are not stationary and are very workload dependent, making it difficult to rely on fixed temperature sensors alone. Therefore, we propose a novel approach to real-time estimation of full-chip transient heatmaps for commercial processors based on machine learning. The model derived in this work supplements the temperature data sensed from the existing on-chip sensors, allowing for the development of more robust runtime power and thermal control schemes that can take advantage of the additional thermal information that is otherwise not available. The new approach involves offline acquisition of accurate spatial and temporal heatmaps using an infrared thermal imaging setup while nominal working conditions are maintained on the chip. To build the dynamic thermal model, we apply Long-Short-Term-Memory (LSTM) neutral networks with system-level variables such as chip frequency, instruction counts, and other performance metrics as inputs. To reduce the dimensionality of the model, 2D spatial discrete cosine transformation (DCT) is first performed on the heatmaps so that they can be expressed with just their dominant DCT frequencies. Our study shows that only 6×6 DCT coefficients are required to maintain sufficient accuracy across a variety of workloads. Experimental results show that the proposed approach can estimate the full-chip heatmaps with less than $1.4^{o}$C root-mean-square-error and take only $sim$19ms for each inference which suits well for real-time use.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127301058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting the Profitability of NVRAM-based Storage Devices via the Concept of Dual-Chunking Data Deduplication","authors":"Shuo-Han Chen, Yu-Pei Liang, Yuan-Hao Chang, H. Wei, W. Shih","doi":"10.1109/ASP-DAC47756.2020.9045622","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045622","url":null,"abstract":"With the latest advance in the non-volatile random-access memory (NVRAM), NVRAM is widely considered as the mainstream for the next-generation storage mediums. NVRAM has numerous attractive features, which include byte addressability, limited idle energy consumption, and great read/write access speed. However, owing to the high manufacturing cost of NVRAM, the incentive of deploying NVRAM in consumer electronics is lowered due to the consideration of profitability. To resolve the profitability issue and bring the benefits of NVRAM into the design of consumer electronics, avoiding storing duplicate data on NVRAM becomes a crucial task for lowering the demand and deployment cost of NVRAM. Such observation motivates us to propose a data deduplication extended file system design (DeEXT) to boost the profitability of NVRAM via the concept of dual-chunking data deduplication while considering the characteristics of NVRAM and duplicate data content. The proposed DeEXT was then evaluated by real-world data deduplication traces with encouraging results.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121904789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Generalization of Wafer Defect Detection by Data Discrepancy-aware Preprocessing and Contrast-varied Augmentation","authors":"Chaofei Yang, H. Li, Yiran Chen, Jiang Hu","doi":"10.1109/ASP-DAC47756.2020.9045391","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045391","url":null,"abstract":"Wafer inspection locates defects at early fabrication stages and traditionally focuses on pixel-level defects. However, there are very few solutions that can effectively detect largescale defects. In this work, we leverage Convolutional Neural Networks (CNNs) to automate the wafer inspection process and propose several techniques to preprocess and augment wafer images for enhancing our model’s generalization on unseen wafers (e.g., from other fabs). Cross-fab experimental results of both wafer-level and pixel-level detections show that the F1 score increases from 0.09 to 0.77 and the Precision-Recall area under curve (PR AUC) increases from 0.03 to 0.62 using our proposed method.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121944169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Co-Exploring Neural Architecture and Network-on-Chip Design for Real-Time Artificial Intelligence","authors":"Lei Yang, Weiwen Jiang, Weichen Liu, E. Sha, Yiyu Shi, J. Hu","doi":"10.1109/ASP-DAC47756.2020.9045595","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045595","url":null,"abstract":"Hardware-aware Neural Architecture Search (NAS), which automatically finds an architecture that works best on a given hardware design, has prevailed in response to the ever-growing demand for real-time Artificial Intelligence (AI). However, in many situations, the underlying hardware is not pre-determined. We argue that simply assuming an arbitrary yet fixed hardware design will lead to inferior solutions, and it is best to co-explore neural architecture space and hardware design space for the best pair of neural architecture and hardware design. To demonstrate this, we employ Network-on-Chip (NoC) as the infrastructure and propose a novel framework, namely NANDS, to co-explore NAS space and NoC Design Search (NDS) space with the objective to maximize accuracy and throughput. Since two metrics are tightly coupled, we develop a multi-phase manager to guide NANDS to gradually converge to solutions with the best accuracy-throughput tradeoff. On top of it, we propose techniques to detect and alleviate timing performance bottleneck, which allows better and more efficient exploration of NDS space. Experimental results on common datasets, CIFAR10, CIFAR-100 and STL-10, show that compared with state-of-the-art hardware-aware NAS, NANDS can achieve 42.99% higher throughput along with 1.58% accuracy improvement. There are cases where hardware-aware NAS cannot find any feasible solutions while NANDS can.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123203624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}