{"title":"AxHLS","authors":"Jorge Castro-Godínez, Julián Mateus-Vargas, M. Shafique, Jörg Henkel","doi":"10.1145/3400302.3415732","DOIUrl":"https://doi.org/10.1145/3400302.3415732","url":null,"abstract":"With the emergence of approximate computing as a design paradigm, many approximate functional units have been proposed, particularly approximate adders and multipliers. These circuits trade the accuracy of their results, within a tolerable limit, for reduced computational effort and energy consumption. However, given the growing number of such approximate circuits reported in the literature, selecting those that minimize the resources required to design and generate an approximate accelerator from a high-level specification, while satisfying a defined accuracy constraint, is a joint high-level synthesis (HLS) and design space exploration (DSE) challenge. In this paper, we propose a novel automated framework for HLS of approximate accelerators using a given library of approximate functional units. Because repetitive circuit synthesis and gate-level simulation require significant time, we enable our framework with AxME, a set of analytical models for estimating the computational resources required when approximate adders and multipliers are used in a design. We propose DSEwam, a DSE methodology for error-tolerant applications in which analytical models such as AxME estimate the resources and accuracy of approximate designs. Furthermore, we integrate DSEwam into an HLS tool to automatically generate Pareto-optimal, or near Pareto-optimal, approximate accelerators from C language descriptions for a given error threshold and minimization goal.
We release our DSE framework as an open-source contribution, which will significantly boost the research and development in the field of automatic generation of approximate accelerators.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126557070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
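The joint selection problem that AxHLS addresses can be illustrated with a toy exploration loop: given a characterized library of approximate units, enumerate configurations and keep the cheapest one that still meets the error budget. This is a minimal sketch only; the unit names, area costs, error estimates, and the naively additive error model are illustrative assumptions, not values or models from the paper.

```python
# Toy library-based DSE in the spirit of AxHLS: for each operator, pick an
# approximate unit from a characterized library so total area is minimized
# while the summed error estimate stays under a threshold.
from itertools import product

# Hypothetical characterized library: (name, area_cost, error_estimate).
ADDERS = [("exact_add", 100, 0.0), ("ax_add_a", 70, 0.02), ("ax_add_b", 45, 0.08)]
MULTS = [("exact_mul", 400, 0.0), ("ax_mul_a", 260, 0.03), ("ax_mul_b", 180, 0.10)]

def explore(error_threshold):
    """Enumerate all configurations; return the minimum-area one whose
    (naively additive) error estimate meets the threshold."""
    best = None
    for add, mul in product(ADDERS, MULTS):
        error = add[2] + mul[2]   # simple additive error model (assumption)
        area = add[1] + mul[1]
        if error <= error_threshold and (best is None or area < best[1]):
            best = ((add[0], mul[0]), area, error)
    return best

config, area, err = explore(0.06)
```

A real flow replaces the additive error model with analytical estimates like AxME and prunes the (exponential) configuration space rather than enumerating it.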
{"title":"fuseGNN","authors":"Zhaodong Chen, Mingyu Yan, Maohua Zhu, Lei Deng, Guoqi Li, Shuangchen Li, Yuan Xie","doi":"10.1145/3400302.3415610","DOIUrl":"https://doi.org/10.1145/3400302.3415610","url":null,"abstract":"Graph convolutional neural networks (GNNs) have achieved state-of-the-art performance on tasks such as node classification and have become a new family of data-center workloads. GNNs operate on irregular graph-structured data in three distinct phases: Combination, Graph Processing, and Aggregation. While the Combination phase is well supported by the sgemm kernels in cuBLAS, the other two phases remain inefficient on GPGPUs due to the lack of optimized CUDA kernels. In particular, the Aggregation phase introduces a large DRAM storage footprint and heavy data movement, and both the Aggregation and Graph Processing phases suffer from high kernel launch time. These inefficiencies not only decrease training throughput but also prevent users from training GNNs on larger graphs on GPGPUs. Although these problems have been partially alleviated by recent studies, the existing optimizations are still insufficient. In this paper, we propose fuseGNN, an extension of PyTorch that provides highly optimized APIs and CUDA kernels for GNNs. First, two different programming abstractions for the Aggregation phase handle graphs with different average degrees. Second, dedicated GPGPU kernels are developed for Aggregation and Graph Processing in both the forward and backward passes, in which kernel fusion and other optimization strategies reduce kernel launch time and latency and exploit data-reuse opportunities.
Evaluation on multiple benchmarks shows that fuseGNN achieves up to 5.3× end-to-end speedup over state-of-the-art frameworks, and the DRAM storage footprint is reduced by several orders of magnitude on large datasets.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122693794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
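The memory saving that fuseGNN attributes to kernel fusion can be seen in miniature: an unfused aggregation first materializes one message per edge (an O(|E|·F) buffer) and then reduces, while a fused version accumulates each message into its destination row as it is produced, so the per-edge buffer never exists. Plain Python stands in for the CUDA kernels described in the paper; the graph and feature values are illustrative.

```python
# Unfused vs. fused neighbor aggregation (sum over incoming edges).

def aggregate_unfused(edges, features):
    """Two passes: build all edge messages, then sum per destination node."""
    messages = [(dst, features[src]) for src, dst in edges]  # O(|E|) buffer
    out = {v: [0.0] * len(next(iter(features.values()))) for v in features}
    for dst, msg in messages:
        out[dst] = [a + b for a, b in zip(out[dst], msg)]
    return out

def aggregate_fused(edges, features):
    """One pass: accumulate directly; no intermediate message buffer."""
    out = {v: [0.0] * len(next(iter(features.values()))) for v in features}
    for src, dst in edges:
        out[dst] = [a + b for a, b in zip(out[dst], features[src])]
    return out

feats = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
edge_list = [(0, 2), (1, 2), (2, 0)]
```

On a GPU the fused variant additionally saves the kernel-launch and DRAM round-trip between the two passes, which is the effect the paper quantifies.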
{"title":"PathDriver","authors":"Xing Huang, Youlin Pan, Grace Li Zhang, Bing Li, Wenzhong Guo, Tsung-Yi Ho, Ulf Schlichtmann","doi":"10.1145/3400302.3415725","DOIUrl":"https://doi.org/10.1145/3400302.3415725","url":null,"abstract":"Continuous-flow microfluidic biochips have attracted considerable research interest in recent years. Inside such a chip, fluid samples of milliliter volumes are efficiently transported between devices (e.g., mixers) to automatically perform various laboratory procedures in biology and biochemistry. Each transportation task, however, requires an exclusive flow path composed of multiple contiguous microchannels during its execution period. Excess and waste fluids, meanwhile, must be discarded through independent flow paths connected to waste ports. All these paths are etched into a very small chip area using multilayer soft lithography and driven by flow ports connected to external pressure sources, forming a highly integrated chip architecture that dominates biochip performance. In this paper, we propose a practical synthesis flow called PathDriver for the design automation of microfluidic biochips. It integrates actual fluid manipulations into both high-level synthesis and physical design, which no prior work has considered. Given the protocols of biochemical applications, PathDriver generates highly efficient chip architectures with a flow-path network that supports actual fluid transportation and removal. Additionally, fluid volume management between devices and flow-path minimization are realized for the first time, ensuring the correctness of assay outcomes while reducing the complexity of the chip architecture.
Experimental results on multiple benchmarks demonstrate the effectiveness of the proposed synthesis flow.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116675314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
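The exclusivity constraint at the heart of PathDriver's flow-path network can be pictured as a graph search: a transportation task needs a contiguous chain of free microchannels from a source to a destination, and channels held by concurrently executing tasks are unavailable. The sketch below models the chip as an undirected channel graph and finds such a path with BFS; the port names and graph are illustrative assumptions, not the paper's formulation.

```python
# Find an exclusive flow path over free microchannels with BFS.
from collections import deque

def find_flow_path(channels, occupied, src, dst):
    """BFS over channels not held by other tasks; returns a node list or None."""
    adj = {}
    for u, v in channels:
        if (u, v) in occupied or (v, u) in occupied:
            continue                      # channel reserved by another task
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    queue, prev = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:                   # reconstruct path back to the source
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                           # no free contiguous path exists

channel_list = [("in", "mix"), ("mix", "out"), ("in", "out")]
```

The synthesis problem in the paper is much harder than one such query: paths for all tasks, waste removal, and volume management must be planned jointly under the shared-channel constraint.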
{"title":"SWIPE","authors":"Sujan Kumar Gonugondla, Ameya D. Patil, Naresh R. Shanbhag","doi":"10.1145/3400302.3415642","DOIUrl":"https://doi.org/10.1145/3400302.3415642","url":null,"abstract":"Crossbar-based in-memory architectures have emerged as an attractive platform for energy-efficient realization of deep neural networks (DNNs). A key challenge in such architectures is achieving accurate and efficient writes in the presence of bitcell conductance variations. In this paper, we propose the Single-Write In-memory Program-vErify (SWIPE) method, which achieves high-accuracy writes for crossbar-based in-memory architectures at 5×-to-10× lower cost than standard program-verify methods. SWIPE leverages the bit-sliced attribute of crossbar-based in-memory architectures and the statistics of conductance variations to compensate for device non-idealities. Using SWIPE to write into a ReRAM crossbar allows a 2× (CIFAR-10) and 3× (MNIST) increase in storage density with < 1% loss in DNN accuracy. In particular, SWIPE compensates for 4.8×-to-7.7× higher conductance variations. Furthermore, SWIPE can be augmented with injection-based training methods to achieve even greater robustness.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127421350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
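The contrast between SWIPE's single write and a classic program-verify loop can be sketched numerically: the loop repeatedly writes, reads back, and corrects, while a statistics-aware write pre-compensates the programming target once. The linear device model below and all its parameters are illustrative assumptions for the sketch, not the paper's device model.

```python
# Iterative program-verify vs. a single statistically pre-compensated write.

def device_write(target, gain=1.05, offset=0.3, noise=0.0):
    """Toy non-ideal bitcell: actual conductance = gain*target + offset + noise."""
    return gain * target + offset + noise

def program_verify(target, tol=0.01, max_iters=50):
    """Classic loop: write, read back, correct, until within tolerance."""
    guess, writes = target, 0
    actual = None
    while writes < max_iters:
        actual = device_write(guess)
        writes += 1
        if abs(actual - target) <= tol:
            break
        guess -= (actual - target)        # simple proportional correction
    return actual, writes

def single_write(target, gain=1.05, offset=0.3):
    """SWIPE-style idea: pre-compensate once using known variation statistics."""
    return device_write((target - offset) / gain), 1

sw_val, sw_writes = single_write(2.0)
pv_val, pv_writes = program_verify(2.0)
```

With a deterministic model the single write lands exactly; with real stochastic variations, SWIPE's point is that statistical compensation gets close enough for bit-sliced DNN weights without the verify iterations.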
{"title":"ReTransformer","authors":"Xiaoxuan Yang, Bonan Yan, Hai Li, Yiran Chen","doi":"10.1145/3400302.3415640","DOIUrl":"https://doi.org/10.1145/3400302.3415640","url":null,"abstract":"Transformer has emerged as a popular deep neural network (DNN) model for Natural Language Processing (NLP) applications and has demonstrated excellent performance in neural machine translation, entity recognition, etc. However, its scaled dot-product attention mechanism in the auto-regressive decoder creates a performance bottleneck during inference. Transformer is also computationally and memory intensive and demands a hardware acceleration solution. Although researchers have successfully applied ReRAM-based Processing-in-Memory (PIM) to accelerate convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computation process of the scaled dot-product attention in Transformer makes it difficult to directly apply these designs. In addition, how to handle intermediate results in matrix-matrix multiplication (MatMul) and how to design a pipeline at a finer granularity of Transformer remain unsolved. In this work, we propose ReTransformer, a ReRAM-based PIM architecture for Transformer acceleration. ReTransformer not only accelerates the scaled dot-product attention of Transformer using ReRAM-based PIM but also eliminates some data dependencies by avoiding writing out intermediate results, using the proposed matrix decomposition technique. Moreover, we propose a new sub-matrix pipeline design for multi-head self-attention. Experimental results show that compared to GPU and Pipelayer, ReTransformer improves computing efficiency by 23.21× and 3.25×, respectively.
The corresponding overall power is reduced by 1086× and 2.82×, respectively.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"2673 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114666314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
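The intermediate-result problem ReTransformer targets can be seen in a small sketch: computed naively, attention writes out the full Q·Kᵀ score matrix before the softmax; processed one query sub-block at a time, only a slice of that matrix ever exists. Plain Python row vectors stand in for the ReRAM crossbar operations, and this per-row streaming is an illustration of the general idea, not the paper's matrix decomposition technique.

```python
# Scaled dot-product attention, streamed one query row at a time so the
# full score matrix is never materialized.
import math

def softmax(xs):
    m = max(xs)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_streamed(Q, K, V):
    """For each query row: scores against all keys, softmax, weighted sum of V.
    Only one row of the Q*K^T intermediate exists at any time."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in K])
        out.append([sum(w * v[j] for w, v in zip(scores, V))
                    for j in range(len(V[0]))])
    return out
```

In a PIM setting, avoiding the write-out matters doubly: ReRAM writes are slow and energy-hungry, so removing the intermediate both shortens the pipeline and cuts power.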
{"title":"COALA","authors":"Yun-Jhe Jiang, Shao-Yun Fang","doi":"10.1145/3400302.3415721","DOIUrl":"https://doi.org/10.1145/3400302.3415721","url":null,"abstract":"Two-dimensional (2D) global routing followed by layer assignment is a common and popular strategy to obtain a good trade-off between runtime and routing performance. Yet, the large gap between 2D routing patterns and the final 3D routing paths often results in unavoidable overflow after layer assignment. State-of-the-art studies on layer assignment usually adopt dynamic programming-based approaches that sequentially find an optimal solution for each net in terms of overflow and/or the number of vias. However, a fixed assignment ordering severely restricts the solution space, and the resulting distributed overflows can hardly be resolved by any existing refinement approach. This paper proposes a novel layer assignment framework that concurrently considers all the wire segments of nets and iteratively assigns them from the lowest available layer to the highest one. The concurrent scheme maximizes the utilization of routing resources on each layer, enabling an effective re-routing procedure that greatly reduces overflow.
Experimental results show that, compared to sequential layer assignment solutions that are also refined by the same re-routing procedure, the proposed framework reduces the maximum overflow in a tile by 32% on average and the number of overflowed tiles by 28%, with much less runtime, demonstrating the significant advantage of concurrent layer assignment over sequential methods.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128092321","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
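The bottom-up concurrent idea in COALA can be reduced to a toy: rather than fixing a net ordering, sweep the layers from lowest to highest and, on each layer, admit as many still-unassigned wire segments as the per-tile capacity allows. The flat per-tile capacity model and single-tile segments below are illustrative simplifications, not the paper's cost model.

```python
# Concurrent bottom-up layer assignment over a per-tile capacity model.

def assign_layers(segments, num_layers, capacity_per_layer):
    """segments: list of (seg_id, tile). Returns ({seg_id: layer}, overflow),
    where overflow lists segments that fit on no layer."""
    assignment, leftover = {}, list(segments)
    for layer in range(num_layers):
        used = {}                               # tile -> wires on this layer
        still = []
        for seg_id, tile in leftover:
            if used.get(tile, 0) < capacity_per_layer:
                used[tile] = used.get(tile, 0) + 1
                assignment[seg_id] = layer
            else:
                still.append((seg_id, tile))    # try a higher layer next sweep
        leftover = still
    return assignment, leftover
```

Because every leftover segment competes again on the next layer, no net is permanently penalized by an early ordering decision, which is the property the paper's concurrent scheme exploits.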
{"title":"Problem C: GPU accelerated logic re-simulation","authors":"Yanqing Zhang, Haoxing Ren, Ben Keller, Brucek Khailany","doi":"10.1145/3400302.3415740","DOIUrl":"https://doi.org/10.1145/3400302.3415740","url":null,"abstract":"Logic \"re\"-simulation can be defined as gate-level simulation in which the input waveforms at every primary input and pseudo-primary input (such as register/RAM outputs) are known. Such waveforms could come from the unit's RTL simulation trace or Automatic Test Pattern Generation (ATPG) vectors. This type of simulation is useful for functional verification of gate-level netlists and for power analysis: we can take the known trace on all primary and pseudo-primary inputs, re-simulate it by propagating signals through timing-aware gate-level combinational logic, and verify that the results at the primary and pseudo-primary outputs match the reference RTL simulation results. However, gate-level simulation is usually much slower than RTL simulation, which motivates faster solutions. In this contest, we ask contestants to use Graphics Processing Units (GPUs) to speed up the re-simulation task.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131477380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
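The structure that makes re-simulation GPU-friendly is worth making concrete: with all primary and pseudo-primary input waveforms known, the combinational netlist can be evaluated level by level in topological order, and all gates within a level are independent, so each level is a natural parallel kernel. The sketch below is a zero-delay serial reference; the netlist tuple format is an illustrative assumption, not the contest's input format, and the real problem is timing-aware.

```python
# Levelized zero-delay gate re-simulation over known input waveforms.

GATE_FUNCS = {"AND": lambda a, b: a & b,
              "OR": lambda a, b: a | b,
              "XOR": lambda a, b: a ^ b}

def resimulate(levels, input_waveform):
    """levels: list of lists of (out_net, gate_type, in_a, in_b), already
    topologically levelized. input_waveform: {net: [value per cycle]}.
    Returns the waveform of every net."""
    nets = {n: list(vs) for n, vs in input_waveform.items()}
    cycles = len(next(iter(input_waveform.values())))
    for level in levels:              # levels must run in order...
        for out, g, a, b in level:    # ...but gates within a level are independent
            nets[out] = [GATE_FUNCS[g](nets[a][t], nets[b][t])
                         for t in range(cycles)]
    return nets
```

A GPU solution maps the inner loop to threads (one per gate, or per gate-cycle pair) and launches one kernel, or one fused kernel stage, per level.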
{"title":"iTPlace","authors":"Tai-Cheng Lee, Chenghan Yang, Yih-Lang Li","doi":"10.1145/3400302.3415613","DOIUrl":"https://doi.org/10.1145/3400302.3415613","url":null,"abstract":"Cell layout synthesis is a critical stage in modern digital IC design. Previous automatic synthesis solutions consider only cell area and routability. This is the first work to propose delay-aware transistor placement for cell library synthesis at the sign-off level: we account for both the delay and the area of a cell during the transistor placement stage. Our methodology consists of three major steps. First, a search tree finds the candidate placements with the smallest area in a large search space. Then, a neural network filters out the unroutable candidates. Finally, a comparative convolutional neural network (CNN) model, trained on sign-off-level data, ranks the candidates by delay during the early placement stage. The experimental results show that the proposed CNN-based routability classifier achieves up to 98% accuracy, and the proposed CNN-based delay ranker achieves up to 94.6% accuracy. The work obtains a 1.77% average sequential-component delay improvement over the traditional cell synthesis method. Our method also delivers 0.97% better delay than the human-level design.","PeriodicalId":367868,"journal":{"name":"Proceedings of the 39th International Conference on Computer-Aided Design","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114513104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
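The ranking step in a flow like this one is easy to sketch: a comparative model that, given two candidate placements, predicts which has smaller delay can serve directly as a sort comparator, ordering candidates without ever predicting absolute delays. Below, a trivial hand-written heuristic stands in for the trained CNN, and the candidate features (`wirelength`, `stack_depth`) are illustrative assumptions, not the paper's feature set.

```python
# Rank candidate placements using only pairwise "which is faster?" comparisons.
from functools import cmp_to_key

def compare_delay(cand_a, cand_b):
    """Stand-in for the trained comparative model: a toy heuristic score on
    illustrative features. Negative means cand_a is predicted faster."""
    score_a = cand_a["wirelength"] + 3 * cand_a["stack_depth"]
    score_b = cand_b["wirelength"] + 3 * cand_b["stack_depth"]
    return -1 if score_a < score_b else (1 if score_a > score_b else 0)

def rank_candidates(candidates):
    """Order candidate placements best-first via the pairwise comparator."""
    return sorted(candidates, key=cmp_to_key(compare_delay))
```

Learning a pairwise comparator instead of a delay regressor sidesteps absolute-delay calibration: the model only has to be right about orderings, which is all the placement flow consumes.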