{"title":"HL-Pow: A Learning-Based Power Modeling Framework for High-Level Synthesis","authors":"Zhe Lin, Jieru Zhao, Sharad Sinha, Wei Zhang","doi":"10.1109/ASP-DAC47756.2020.9045442","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045442","url":null,"abstract":"High-level synthesis (HLS) enables designers to customize hardware designs efficiently. However, it is still challenging to foresee the correlation between power consumption and HLS-based applications at an early design stage. To overcome this problem, we introduce HL-Pow, a power modeling framework for FPGA HLS based on state-of-the-art machine learning techniques. HL-Pow incorporates an automated feature construction flow to efficiently identify and extract features that exert a major influence on power consumption, simply based upon HLS results, and a modeling flow that can build an accurate and generic power model applicable to a variety of designs with HLS. By using HL-Pow, the power evaluation process for FPGA designs can be significantly expedited because the power inference of HL-Pow is established on HLS instead of the time-consuming register-transfer level (RTL) implementation flow. Experimental results demonstrate that HL-Pow can achieve accurate power modeling that is only 4.67% (24.02 mW) away from onboard power measurement. To further facilitate power-oriented optimizations, we describe a novel design space exploration (DSE) algorithm built on top of HL-Pow to trade off between latency and power consumption. 
This algorithm can reach a close approximation of the real Pareto frontier while only requiring running HLS flow for 20% of design points in the entire design space.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123480309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EFFORT: Enhancing Energy Efficiency and Error Resilience of a Near-Threshold Tensor Processing Unit","authors":"N. D. Gundi, Tahmoures Shabanian, Prabal Basu, Pramesh Pandey, Sanghamitra Roy, Koushik Chakraborty, Zhen Zhang","doi":"10.1109/ASP-DAC47756.2020.9045479","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045479","url":null,"abstract":"Modern deep neural network (DNN) applications demand a remarkable processing throughput usually unmet by traditional Von Neumann architectures. Consequently, hardware accelerators, comprising a sea of multiplier and accumulate (MAC) units, have recently gained prominence in accelerating DNN inference engine. For example, Tensor Processing Units (TPU) account for a lion’s share of Google’s datacenter inference operations. The proliferation of real-time DNN predictions is accompanied with a tremendous energy budget. In quest of trimming the energy footprint of DNN accelerators, we propose EFFORT—an energy optimized, yet high performance TPU architecture, operating at the Near-Threshold Computing (NTC) region. EFFORT promotes a better-than-worst-case design by operating the NTC TPU at a substantially high frequency while keeping the voltage at the NTC nominal value. In order to tackle the timing errors due to such aggressive operation, we employ an opportunistic error mitigation strategy. Additionally, we implement an in-situ clock gating architecture, drastically reducing the MACs’ dynamic power consumption. 
Compared to a cutting-edge error mitigation technique for TPUs, EFFORT enables up to 2.5× better performance at NTC with only 2% average accuracy drop across 3 out of 4 DNN datasets.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132662363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Contention Minimized Bypassing in SMART NoC","authors":"Peng Chen, Weichen Liu, Mengquan Li, Lei Yang, Nan Guan","doi":"10.1109/ASP-DAC47756.2020.9045103","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045103","url":null,"abstract":"SMART, a recently proposed dynamically reconfigurable NoC, enables single-cycle long-distance communication by building single-bypass paths. However, such a single-cycle single-bypass path will be broken when contention occurs. Thus, lower-priority packets will be buffered at intermediate routers with blocking latency from higher-priority packets, and extra router-stage latency to rebuild remaining path, reducing the bypassing benefits that SMART offers. In this paper, we for the first time propose an effective routing strategy to achieve nearly contention-free bypassing in SMART NoC. Specifically, we identify two different routes for communication pairs: direct route, with which data can reach the destination in a single bypass; and indirect route, with which data can reach the destination in two bypasses via an intermediate router. If a direct route is not found, we would alternatively resort to an indirect route in advance to eliminate the blocking latency, at the cost of only one router-stage latency. Compared with the current routing, our new approach can effectively isolate conflicting communication pairs, greatly balance the traffic loads and fully utilize bypass paths. 
Experiments show that our approach makes 22.6% performance improvement on average in terms of communication latency.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134599021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Automatic Hardware Synthesis from Formal Specification to Implementation","authors":"Fritjof Bornebusch, Christoph Lüth, R. Wille, R. Drechsler","doi":"10.1109/ASP-DAC47756.2020.9045406","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045406","url":null,"abstract":"In this work, we sketch an automated design flow for hardware synthesis based on a formal specification. Verification results are propagated from the FSL level through the proposed flow to generate an ESL model as well as an RTL implementation automatically. In contrast, the established design flow relies on manual implementations at the ESL and RTL level. The proposed design flow combines proof assistants with functional hardware description languages. This combination decreases the implementation effort significantly and the generation of test benches is no longer needed. We illustrate our design flow by specifying and synthesizing a set of benchmarks that contain sequential and combinational hardware designs. We compare them with implementations required by the established hardware design flow.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"2017 17","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132679907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent Monitoring of Operational Health in Neural Networks Through Balanced Output Partitions","authors":"Elbruz Ozen, A. Orailoglu","doi":"10.1109/ASP-DAC47756.2020.9045662","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045662","url":null,"abstract":"The abundant usage of deep neural networks in safety-critical domains such as autonomous driving raises concerns regarding the impact of hardware-level faults on deep neural network computations. As a failure can prove to be disastrous, low-cost safety mechanisms are needed to check the integrity of the deep neural network computations. We embed safety checksums into deep neural networks by introducing a custom regularization term in the network training. We partition the outputs of each network layer into two groups and guide the network to balance the summation of these groups through an additional penalty term in the cost function. The proposed approach delivers twin benefits. While the embedded checksums deliver low-cost detection of computation errors upon violations of the trained equilibrium during network inference, the regularization term enables the network to generalize better during training by preventing overfitting, thus leading to significantly higher network accuracy.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130964142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of Fluid Loading on Programmable Microfluidic Devices for Bio-protocol Execution","authors":"Satoru Maruyama, Debraj Kundu, S. Yamashita, Sudip Roy","doi":"10.1109/ASP-DAC47756.2020.9045675","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045675","url":null,"abstract":"Recently, Programmable Microfluidic Device (PMD) has got an attention of the design automation communities as a new type of microfluidic biochips. For the design of PMD chips, one of the important tasks is to minimize the number of flows for loading the reactant fluids into specific cells (by creating some flows of the fluids) before the bio-protocol is executed. Nevertheless of the importance of the problem, there has been almost no work to study this problem. Thus, in this paper, we intensively study this fluid loading problem in PMD chips. First, we successfully formulate the problem as a constraint satisfaction problem (CSP) to solve the problem optimally for the first time. Then, we also propose an efficient heuristic called Determining Flows from the Last (DFL) method for larger problem instances. DFL is based on a novel idea that it is better to determine the flows from the last flow unlike the state-of-the-art method Fluid Loading Algorithm for PMD (FLAP) [Gupta et al., TODAES, 2019]. 
Simulation results confirm that the exact method can find the optimal solutions for practical test cases, whereas our heuristic can find near-optimal solutions, which are better than those obtained by FLAP.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"59 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127609836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelism in Deep Learning Accelerators","authors":"Linghao Song, Fan Chen, Yiran Chen, H. Li","doi":"10.1109/ASP-DAC47756.2020.9045206","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045206","url":null,"abstract":"Deep learning is the core of artificial intelligence and it achieves state-of-the-art in a wide range of applications. The intensity of computation and data in deep learning processing poses significant challenges to the conventional computing platforms. Thus, specialized accelerator architectures are proposed for the acceleration of deep learning. In this paper, we classify the design space of current deep learning accelerators into three levels, (1) processing engine, (2) memory and (3) accelerator, and present a constructive view from a perspective of parallelism in the three levels.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132824656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MindReading: An Ultra-Low-Power Photonic Accelerator for EEG-based Human Intention Recognition","authors":"Qian Lou, Wenyang Liu, Weichen Liu, Feng Guo, Lei Jiang","doi":"10.1109/ASP-DAC47756.2020.9045333","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045333","url":null,"abstract":"A scalp-recording electroencephalography (EEG)-based brain-computer interface (BCI) system can greatly improve the quality of life for people who suffer from motor disabilities. Deep neural networks consisting of multiple convolutional, LSTM and fully-connected layers are created to decode EEG signals to maximize the human intention recognition accuracy. However, prior FPGA, ASIC, ReRAM and photonic accelerators cannot maintain sufficient battery lifetime when processing realtime intention recognition. In this paper, we propose an ultra-low-power photonic accelerator, MindReading, for human intention recognition by only low bit-width addition and shift operations. Compared to prior neural network accelerators, to maintain the real-time processing throughput, MindReading reduces the power consumption by 62.7% and improves the throughput per Watt by 168%.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121456723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrency in DD-based Quantum Circuit Simulation","authors":"S. Hillmich, Alwin Zulehner, R. Wille","doi":"10.1109/ASP-DAC47756.2020.9045711","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045711","url":null,"abstract":"Despite recent progress in physical implementations of quantum computers, a significant amount of research still depends on simulating quantum computations on classical computers. Here, most state-of-the-art simulators rely on array-based approaches which are perfectly suited for acceleration through concurrency using multi- or many-core processors. However, those methods have exponential memory complexities and, hence, become infeasible if the considered quantum circuits are too large. To address this drawback, complementary approaches based on decision diagrams (called DD-based simulation) have been proposed which provide more compact representations in many cases. While this allows to simulate quantum circuits that could not be simulated before, it is unclear whether DD-based simulation also allows for similar acceleration through concurrency as array-based approaches. In this work, we investigate this issue. The resulting findings provide a better understanding about when DD-based simulation can be accelerated through concurrent executions of sub-tasks and when not.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122646633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA based Network Interface Card with Query Filter for Storage Nodes of Big Data Systems","authors":"Ying Li, Jinyu Zhan, Wei Jiang, Junting Wu, Jianping Zhu","doi":"10.1109/ASP-DAC47756.2020.9045372","DOIUrl":"https://doi.org/10.1109/ASP-DAC47756.2020.9045372","url":null,"abstract":"In this paper, we are interested in improving the data processing of storage and computing separated Big Data systems. We propose an Field Programmable Gate Array (FPGA) based Network Interface Card with Query Filter (NIC-QF) to accelerate the data query efficiency of storage nodes and reduce the workloads of computing nodes and the communication overheads between them. NIC-QF designed with PCIe core, query filter and NIC communication can filter the original data on storage nodes as an implicit coprocessor and directly send the filtered data to computing nodes of Big Data systems. Filter units in query filter can perform multiple SQL tasks in parallel, and each filter unit is internally pipelined, which can further speed up the data processing. Filter units can be designed to support general SQL queries on different data formats and we implement two schemes for TextFile and RCFile separately. 
Based on TPC-H benchmark and Tencent data set, we conduct extensive experiments to evaluate our design, which can achieve averagely up to 46.91% faster than the traditional approach.","PeriodicalId":125112,"journal":{"name":"2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115830713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}