Panu Sjövall, Ari Lemmetti, Jarno Vanne, Sakari Lahti, Timo D. Hämäläinen
{"title":"High-Level Synthesis Implementation of an Embedded Real-Time HEVC Intra Encoder on FPGA for Media Applications","authors":"Panu Sjövall, Ari Lemmetti, Jarno Vanne, Sakari Lahti, Timo D. Hämäläinen","doi":"10.1145/3491215","DOIUrl":"https://doi.org/10.1145/3491215","url":null,"abstract":"High Efficiency Video Coding (HEVC) is the key enabling technology for numerous modern media applications. Overcoming its computational complexity and customizing its rich features for real-time HEVC encoder implementations, calls for automated design methodologies. This article introduces the first complete High-Level Synthesis (HLS) implementation for HEVC intra encoder on FPGA. The C source code of our open-source Kvazaar HEVC encoder is used as a design entry point for HLS that is applied throughout the whole encoder design process, from data-intensive coding tools like intra prediction and discrete transforms to more control-oriented tools such as context-adaptive binary arithmetic coding (CABAC). Our prototype is run on Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 PCIe FPGA accelerator cards with 40 Gigabit Ethernet. This proof-of-concept system is designed for hardware-accelerated HEVC encoding and it achieves real-time 4K coding speed up to 120 fps. The coding performance can be easily scaled up by adding practically any number of network-connected FPGA cards to the system. These results indicate that our HLS proposal not only boosts development time, but also provides previously unseen design scalability with competitive performance over the existing FPGA and ASIC encoder implementations.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"154 8 1","pages":"35:1-35:34"},"PeriodicalIF":0.0,"publicationDate":"2022-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83185883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ding Han, Guohui Li, Quan Zhou, Jianjun Li, Yong Yang, Xiaofei Hu
{"title":"An Efficient Execution Framework of Two-Part Execution Scenario Analysis","authors":"Ding Han, Guohui Li, Quan Zhou, Jianjun Li, Yong Yang, Xiaofei Hu","doi":"10.1145/3465474","DOIUrl":"https://doi.org/10.1145/3465474","url":null,"abstract":"\u0000 Response Time Analysis\u0000 (\u0000 RTA\u0000 ) is an important and promising technique for analyzing the schedulability of real-time tasks under both\u0000 Global Fixed-Priority\u0000 (\u0000 G-FP\u0000 ) scheduling and\u0000 Global Earliest Deadline First\u0000 (\u0000 G-EDF\u0000 ) scheduling. Most existing RTA methods for tasks under global scheduling are dominated by partitioned scheduling, due to the pessimism of the\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 -based interference calculation where\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 is the number of processors. Two-part execution scenario is an effective technique that addresses this pessimism at the cost of efficiency. The major idea of two-part execution scenario is to calculate a more accurate upper bound of the interference by dividing the execution of the target job into two parts and calculating the interference on the target job in each part. This article proposes a novel RTA execution framework that improves two-part execution scenario by reducing some unnecessary calculation, without sacrificing accuracy of the schedulability test. The key observation is that, after the division of the execution of the target job, two-part execution scenario enumerates all possible execution time of the target job in the first part for calculating the final\u0000 Worst-Case Response Time\u0000 (\u0000 WCRT\u0000 ). However, only some special execution time can cause the final result. A set of experiments is conducted to test the performance of the proposed execution framework and the result shows that the proposed execution framework can improve the efficiency of two-part execution scenario analysis by up to\u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 \u0000 in terms of the execution time.\u0000","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"58 1","pages":"3:1-3:24"},"PeriodicalIF":0.0,"publicationDate":"2022-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83390279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. A. Maleki, Alireza Nabipour-Meybodi, M. Kamal, A. Afzali-Kusha, M. Pedram
{"title":"An Energy-Efficient Inference Method in Convolutional Neural Networks Based on Dynamic Adjustment of the Pruning Level","authors":"M. A. Maleki, Alireza Nabipour-Meybodi, M. Kamal, A. Afzali-Kusha, M. Pedram","doi":"10.1145/3460972","DOIUrl":"https://doi.org/10.1145/3460972","url":null,"abstract":"In this article, we present a low-energy inference method for convolutional neural networks in image classification applications. The lower energy consumption is achieved by using a highly pruned (lower-energy) network if the resulting network can provide a correct output. More specifically, the proposed inference method makes use of two pruned neural networks (NNs), namely mildly and aggressively pruned networks, which are both designed offline. In the system, a third NN makes use of the input data for the online selection of the appropriate pruned network. The third network, for its feature extraction, employs the same convolutional layers as those of the aggressively pruned NN, thereby reducing the overhead of the online management. There is some accuracy loss induced by the proposed method where, for a given level of accuracy, the energy gain of the proposed method is considerably larger than the case of employing any one pruning level. The proposed method is independent of both the pruning method and the network architecture. The efficacy of the proposed inference method is assessed on Eyeriss hardware accelerator platform for some of the state-of-the-art NN architectures. Our studies show that this method may provide, on average, 70% energy reduction compared to the original NN at the cost of about 3% accuracy loss on the CIFAR-10 dataset.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"43 1","pages":"49:1-49:20"},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86972178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning for Electronic Design Automation: A Survey","authors":"Guyue Huang, Jingbo Hu, Yifan He, Jialong Liu, Mingyuan Ma, Zhaoyang Shen, Juejian Wu, Yuanfan Xu, Hengrui Zhang, Kai Zhong, Xuefei Ning, Yuzhe Ma, Haoyu Yang, Bei Yu, Huazhong Yang, Yu Wang","doi":"10.1145/3451179","DOIUrl":"https://doi.org/10.1145/3451179","url":null,"abstract":"With the down-scaling of CMOS technology, the design complexity of very large-scale integrated is increasing. Although the application of machine learning (ML) techniques in electronic design automation (EDA) can trace its history back to the 1990s, the recent breakthrough of ML and the increasing complexity of EDA tasks have aroused more interest in incorporating ML to solve EDA tasks. In this article, we present a comprehensive review of existing ML for EDA studies, organized following the EDA hierarchy.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"1 1","pages":"40:1-40:46"},"PeriodicalIF":0.0,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79002517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saranyu Chattopadhyay, Pranesh Santikellur, R. Chakraborty, J. Mathew, M. Ottavi
{"title":"A Conditionally Chaotic Physically Unclonable Function Design Framework with High Reliability","authors":"Saranyu Chattopadhyay, Pranesh Santikellur, R. Chakraborty, J. Mathew, M. Ottavi","doi":"10.1145/3460004","DOIUrl":"https://doi.org/10.1145/3460004","url":null,"abstract":"Physically Unclonable Function (PUF) circuits are promising low-overhead hardware security primitives, but are often gravely susceptible to machine learning–based modeling attacks. Recently, chaotic PUF circuits have been proposed that show greater robustness to modeling attacks. However, they often suffer from unacceptable overhead, and their analog components are susceptible to low reliability. In this article, we propose the concept of a conditionally chaotic PUF that enhances the reliability of the analog components of a chaotic PUF circuit to a level at par with their digital counterparts. A conditionally chaotic PUF has two modes of operation: bistable and chaotic , and switching between these two modes is conveniently achieved by setting a mode-control bit (at a secret position) in an applied input challenge. We exemplify our PUF design framework for two different PUF variants—the CMOS Arbiter PUF and a previously proposed hybrid CMOS-memristor PUF, combined with a hardware realization of the Lorenz system as the chaotic component. Through detailed circuit simulation and modeling attack experiments, we demonstrate that the proposed PUF circuits are highly robust to modeling and cryptanalytic attacks, without degrading the reliability of the original PUF that was combined with the chaotic circuit, and incurs acceptable hardware footprint.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"26 1","pages":"41:1-41:24"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74931457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xi Li, S. Shahsavani, Xuan Zhou, Massoud Pedram, P. Beerel
{"title":"A Variation-aware Hold Time Fixing Methodology for Single Flux Quantum Logic Circuits","authors":"Xi Li, S. Shahsavani, Xuan Zhou, Massoud Pedram, P. Beerel","doi":"10.1145/3460289","DOIUrl":"https://doi.org/10.1145/3460289","url":null,"abstract":"Single flux quantum (SFQ) logic is a promising technology to replace complementary metal-oxide-semiconductor logic for future exa-scale supercomputing but requires the development of reliable EDA tools that are tailored to the unique characteristics of SFQ circuits, including the need for active splitters to support fanout and clocked logic gates. This article is the first work to present a physical design methodology for inserting hold buffers in SFQ circuits. Our approach is variation-aware, uses common path pessimism removal and incremental placement to minimize the overhead of timing fixes, and can trade off layout area and timing yield. Compared to a previously proposed approach using fixed hold time margins, Monte Carlo simulations show that, averaging across 10 ISCAS’85 benchmark circuits, our proposed method can reduce the number of inserted hold buffers by 8.4% with a 6.2% improvement in timing yield and by 21.9% with a 1.7% improvement in timing yield.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"13 1","pages":"47:1-47:17"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73082304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataflow Model-based Software Synthesis Framework for Parallel and Distributed Embedded Systems","authors":"Eunji Jeong, Dowhan Jeong, S. Ha","doi":"10.1145/3447680","DOIUrl":"https://doi.org/10.1145/3447680","url":null,"abstract":"Existing software development methodologies mostly assume that an application runs on a single device without concern about the non-functional requirements of an embedded system such as latency and resource consumption. Besides, embedded software is usually developed after the hardware platform is determined, since a non-negligible portion of the code depends on the hardware platform. In this article, we present a novel model-based software synthesis framework for parallel and distributed embedded systems. An application is specified as a set of tasks with the given rules for execution and communication. Having such rules enables us to perform static analysis to check some software errors at compile-time to reduce the verification difficulty. Platform-specific programs are synthesized automatically after the mapping of tasks onto processing elements is determined. The proposed framework is expandable to support new hardware platforms easily. The proposed communication code synthesis method is extensible and flexible to support various communication methods between devices. In addition, the fault-tolerant feature can be added by modifying the task graph automatically according to the selected fault-tolerance configurations by the user. The viability of the proposed software development methodology is evaluated with a real-life surveillance application that runs on six processing elements.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"3 1","pages":"35:1-35:38"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79515219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastCFI: Real-time Control-Flow Integrity Using FPGA without Code Instrumentation","authors":"Lang Feng, Jeff Huang, Jiang Hu, A. Reddy","doi":"10.1145/3458471","DOIUrl":"https://doi.org/10.1145/3458471","url":null,"abstract":"Control-Flow Integrity (CFI) is an effective defense technique against a variety of memory-based cyber attacks. CFI is usually enforced through software methods, which entail considerable performance overhead. Hardware-based CFI techniques can largely avoid performance overhead, but typically rely on code instrumentation, forming a non-trivial hurdle to the application of CFI. Taking advantage of the tradeoff between computing efficiency and flexibility of FPGA, we develop FastCFI, an FPGA-based CFI system that can perform fine-grained and stateful checking without code instrumentation. We also propose an automated Verilog generation technique that facilitates fast deployment of FastCFI, and a compression algorithm for reducing the hardware expense. Experiments on popular benchmarks confirm that FastCFI can detect fine-grained CFI violations over unmodified binaries. When using FastCFI on prevalent benchmarks, we demonstrate its capability to detect fine-grained CFI violations in unmodified binaries, while incurring an average of 0.36% overhead and a maximum of 2.93% overhead.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"3 1","pages":"39:1-39:39"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83627063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FTT-NAS: Discovering Fault-tolerant Convolutional Neural Architecture","authors":"Xuefei Ning, Guangjun Ge, Wenshuo Li, Zhenhua Zhu, Yin Zheng, Xiaoming Chen, Zhen Gao, Yu Wang, Huazhong Yang","doi":"10.1145/3460288","DOIUrl":"https://doi.org/10.1145/3460288","url":null,"abstract":"XUEFEI NING, GUANGJUN GE, WENSHUO LI, and ZHENHUA ZHU, Department of Electronic Engineering, Tsinghua University, China YIN ZHENG, Weixin Group, Tencent, China XIAOMING CHEN, State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China ZHEN GAO, School of Electrical and Information Engineering, Tianjin University, China YU WANG and HUAZHONG YANG, Department of Electronic Engineering, Tsinghua University, China","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"44 1","pages":"44:1-44:24"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77893828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chin-Hsien Wu, Hao-Wei Zhang, Chia-Wei Liu, T. Yu, Chi-Yen Yang
{"title":"A Dynamic Huffman Coding Method for Reliable TLC NAND Flash Memory","authors":"Chin-Hsien Wu, Hao-Wei Zhang, Chia-Wei Liu, T. Yu, Chi-Yen Yang","doi":"10.1145/3446771","DOIUrl":"https://doi.org/10.1145/3446771","url":null,"abstract":"With the progress of the manufacturing process, NAND flash memory has evolved from the single-level cell and multi-level cell into the triple-level cell (TLC). NAND flash memory has physical problems such as the characteristic of erase-before-write and the limitation of program/erase cycles. Moreover, TLC NAND flash memory has low reliability and short lifetime. Thus, we propose a dynamic Huffman coding method that can apply to the write operations of NAND flash memory. The proposed method exploits observations from a Huffman tree and machine learning from data patterns to dynamically select a suitable Huffman coding. According to the experimental results, the proposed method can improve the reliability of TLC NAND flash memory and also consider the compression performance for those applications that require the Huffman coding.","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"1079 1","pages":"34:1-34:25"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76699586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}