{"title":"The ZuSE-KI-Mobil AI Accelerator SoC: Overview and a Functional Safety Perspective","authors":"F. Kempf, Julian Hoefer, T. Harbaum, Juergen Becker, Nael Fasfous, Alexander Frickenstein, Hans-Jörg Vögel, Simon Friedrich, R. Wittig, E. Matús, G. Fettweis, Matthias Lüders, Holger Blume, Jens Benndorf, Darius Grantz, Martin Zeller, Dietmar Engelke, K. Eickel","doi":"10.23919/DATE56975.2023.10137257","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137257","url":null,"abstract":"ZuSE-KI-Mobil (ZuKIMo) is a nationally funded research project, currently in its intermediate stage. The goal of the ZuKIMo project is to develop a new System-on-Chip (SoC) platform and corresponding ecosystem to enable efficient Artificial Intelligence (AI) applications with specific requirements. With ZuKIMo, we specifically target applications from the mobility domain, i.e., autonomous vehicles and drones. The initial ecosystem is built by a consortium consisting of seven partners from German academia and industry. We develop the SoC platform and its ecosystem around a novel AI accelerator design. The customizable accelerator is conceived from scratch to fulfill the functional and non-functional requirements derived from the ambitious use cases. A tape-out in 22 nm FDX technology is planned in 2023. Apart from the System-on-Chip hardware design itself, the ZuKIMo ecosystem has the objective of providing software tooling for easy deployment of new use cases and hardware-CNN co-design. Furthermore, AI accelerators in safety-critical applications like our mobility use cases necessitate the fulfillment of safety requirements. 
Therefore, we investigate new design methodologies for fault analysis of Deep Neural Networks (DNNs) and introduce our new redundancy mechanism for AI accelerators.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115117374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Jumping Shift: A Logarithmic Quantization Method for Low-Power CNN Acceleration","authors":"Longxing Jiang, David Aledo, R. V. Leuken","doi":"10.23919/DATE56975.2023.10137169","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137169","url":null,"abstract":"Logarithmic quantization for Convolutional Neural Networks (CNNs): a) fits typical weight and activation distributions well, and b) allows the replacement of the multiplication operation by a shift operation that can be implemented with fewer hardware resources. We propose a new quantization method named Jumping Log Quantization (JLQ). The key idea of JLQ is to extend the quantization range by adding a coefficient parameter “s” in the power-of-two exponents $(2^{sx+i})$. This quantization strategy skips some values from the standard logarithmic quantization. In addition, we also develop a small hardware-friendly optimization called weight de-zero. Zero-valued weights, which cannot be produced by a single shift operation, are all replaced with logarithmic weights to reduce hardware resources with almost no accuracy loss. To implement the Multiply-And-Accumulate (MAC) operation (needed to compute convolutions) when the weights are JLQ-ed and de-zeroed, a new Processing Element (PE) has been developed. This new PE uses a modified barrel shifter that can efficiently avoid the skipped values. Resource utilization, area, and power consumption of the new PE standing alone are reported. 
We have found that JLQ performs better than other state-of-the-art logarithmic quantization methods when the bit width of the operands becomes very small.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115444120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
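The jumping exponent idea can be sketched in a few NumPy lines. This is an illustrative reading of the $(2^{sx+i})$ scheme, not the paper's implementation; the stride `s=2`, offset `i=0`, code count, and the choice of negative exponents are hypothetical parameters:

```python
import numpy as np

def jlq_quantize(w, s=2, i=0, n_codes=8):
    """Snap weights to the nearest magnitude of the form 2^-(s*x + i).

    Hypothetical sketch of Jumping Log Quantization: the stride s skips
    exponents relative to plain log2 quantization, so the same number of
    codes covers a wider dynamic range.
    """
    exps = -(s * np.arange(n_codes) + i).astype(float)
    codebook = 2.0 ** exps                      # allowed magnitudes
    idx = np.argmin(np.abs(np.abs(w)[..., None] - codebook), axis=-1)
    return np.sign(w) * codebook[idx]
```

In hardware, multiplying by such a weight reduces to a shift; the modified barrel shifter in the paper exploits the fact that only every s-th exponent can occur.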
{"title":"FPGA Acceleration of GCN in Light of the Symmetry of Graph Adjacency Matrix","authors":"Gopikrishnan Raveendran Nair, Han-Sok Suh, M. Halappanavar, Frank Liu, J.-s. Seo, Yu Cao","doi":"10.23919/DATE56975.2023.10137076","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137076","url":null,"abstract":"Graph Convolutional Neural Networks (GCNs) are widely used to process large-scale graph data. Different from deep neural networks (DNNs), GCNs are sparse, irregular, and unstructured, posing unique challenges to hardware acceleration with regular processing elements (PEs). In particular, the adjacency matrix of a GCN is extremely sparse, leading to frequent but irregular memory access, low spatial/temporal data locality, and poor data reuse. Furthermore, a realistic graph usually consists of unstructured data (e.g., unbalanced distributions), creating significantly different processing times and imbalanced workloads for each node in GCN acceleration. To overcome these challenges, we propose an end-to-end hardware-software co-design to accelerate GCNs on resource-constrained FPGAs with the following features: (1) A custom dataflow that leverages symmetry along the diagonal of the adjacency matrix to accelerate feature aggregation for undirected graphs. We utilize either the upper or the lower triangular matrix of the adjacency matrix to perform aggregation in GCN to improve data reuse. (2) Unified compute cores for both aggregation and transform phases, with full support for the symmetry-based dataflow. These cores can be dynamically reconfigured to the systolic mode for transformation or as individual accumulators for aggregation in GCN processing. (3) Preprocessing of the graph in software to rearrange the edges and features to match the custom dataflow. This step improves the regularity in memory access and data reuse in the aggregation phase. 
Moreover, we quantize the GCN precision from FP32 to INT8 to reduce the memory footprint without losing inference accuracy. We implement our accelerator design on an Intel Stratix 10 MX FPGA board with HBM2, and demonstrate 1.3×-110.5× improvement in end-to-end GCN latency compared to state-of-the-art FPGA implementations, on the graph datasets of Cora, Pubmed, Citeseer and Reddit.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115485486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
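The symmetry-based dataflow rests on a simple identity: for an undirected graph the adjacency matrix satisfies A = U + Uᵀ − diag(U), where U is its upper triangle, so feature aggregation needs only half the matrix. A minimal dense NumPy sketch of that identity (the accelerator itself, of course, operates on sparse, preprocessed data):

```python
import numpy as np

def aggregate_from_upper(U, X):
    # U is the upper triangle (incl. diagonal) of a symmetric adjacency A.
    # Since A = U + U^T - diag(U), feature aggregation A @ X becomes:
    return U @ X + U.T @ X - np.diag(np.diag(U)) @ X

# Usage: recover the full aggregation while storing only half the matrix.
A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])        # symmetric adjacency
X = np.arange(6.).reshape(3, 2)    # node features
U = np.triu(A)
```

Each stored nonzero of U contributes to two rows of the result, which is the data-reuse gain the custom dataflow exploits.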
{"title":"A Practical Remote Power Attack on Machine Learning Accelerators in Cloud FPGAs","authors":"Shanquan Tian, Shayan Moini, Daniel E. Holcomb, R. Tessier, Jakub Szefer","doi":"10.23919/DATE56975.2023.10136956","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10136956","url":null,"abstract":"The security and performance of FPGA-based accelerators play vital roles in today's cloud services. In addition to supporting convenient access to high-end FPGAs, cloud vendors and third-party developers now provide numerous FPGA accelerators for machine learning models. However, the security of accelerators developed for state-of-the-art Cloud FPGA environments has not been fully explored, since most remote accelerator attacks have been prototyped on local FPGA boards in lab settings, rather than in Cloud FPGA environments. To address existing research gaps, this work analyzes three existing machine learning accelerators developed in Xilinx Vitis to assess the potential threats of power attacks on accelerators in Amazon Web Services (AWS) F1 Cloud FPGA platforms, in a multi-tenant setting. The experiments show that malicious co-tenants in a multi-tenant environment can instantiate voltage sensing circuits as register-transfer level (RTL) kernels within the Vitis design environment to spy on co-tenant modules. A methodology for launching a practical remote power attack on Cloud FPGAs is also presented, which uses an enhanced time-to-digital converter (TDC) based voltage sensor and an auto-trigger mechanism. The TDC is used to capture power signatures, which are then used to identify power consumption spikes and observe activity patterns involving the FPGA shell, DRAM on the FPGA board, or a co-tenant victim's accelerators. 
Voltage change patterns related to shell use and accelerators are then used to create an auto-triggered attack that can automatically detect when to capture voltage traces without the need for a hard-wired synchronization signal between victim and attacker. To address the novel threats presented in this work, this paper also discusses defenses that could be leveraged to secure multi-tenant Cloud FPGAs from power-based attacks.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126066366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TAM: A Computing in Memory based on Tandem Array within STT-MRAM for Energy-Efficient Analog MAC Operation","authors":"Jinkai Wang, Zhengkun Gu, Hongyu Wang, Zuolei Hao, Bojun Zhang, Weisheng Zhao, Yue Zhang","doi":"10.23919/DATE56975.2023.10137323","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137323","url":null,"abstract":"Computing in memory (CIM) has been demonstrated to be promising for energy-efficient computing. However, the dramatic growth of the data scale in neural network processors has created a demand for CIM architectures of higher bit density, for which spin transfer torque magnetic RAM (STT-MRAM), with its high bit density and performance, arises as a promising candidate solution. In this work, we propose an analog CIM scheme based on a tandem array within STT-MRAM (TAM) to further improve energy efficiency while achieving high bit density. First, the resistance-summation-based analog MAC operation minimizes the effect of low tunnel magnetoresistance (TMR) through the serial magnetic tunnel junction (MTJ) structure in the proposed tandem array, with smaller area overhead. Moreover, a resistive-to-binary read scheme is designed to obtain the MAC results accurately and reliably. Besides, the data-dependent error caused by MTJs in series is eliminated with a proposed dynamic selection circuit. 
Simulation results of a 2Kb TAM architecture show 113.2 TOPS/W and 63.7 TOPS/W for 4-bit and 8-bit input/weight precision, respectively, and a 39.3% reduction in bit-cell area compared with an existing array of MTJs in series.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125488006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastRW: A Dataflow-Efficient and Memory-Aware Accelerator for Graph Random Walk on FPGAs","authors":"Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li, Xuehai Zhou","doi":"10.23919/DATE56975.2023.10137297","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137297","url":null,"abstract":"Graph random walk (GRW) sampling is becoming increasingly important with the widespread popularity of graph applications. It involves some walkers that wander through the graph to capture the desirable properties and reduce the size of the original graph. However, previous research suffers from long sampling latency and severe memory access bottlenecks due to intrinsic data dependency and irregular vertex distribution. This paper proposes FastRW, a dedicated accelerator for GRW on FPGAs. FastRW first schedules walkers' execution to address data dependency and mask long sampling latency. Then, FastRW leverages pipeline specialization and bit-level optimization to customize a processing engine with five modules and achieve a pipelined dataflow. Finally, to alleviate the differential accesses caused by irregular vertex distribution, FastRW implements a hybrid memory architecture that provides parallel access ports according to vertex degree. We evaluate FastRW with two classic GRW algorithms on a wide range of real-world graph datasets. The experimental results show that FastRW achieves a speedup of 14.13× on average over the system running on two 8-core Intel CPUs. 
FastRW also achieves 3.28×-198.24× higher energy efficiency than the architecture implemented on a V100 GPU.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126924448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
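The GRW sampling that FastRW accelerates can be sketched in plain Python. This toy version (dict adjacency, unweighted uniform sampling) only illustrates the step-to-step data dependency the paper targets, not the accelerator's dataflow:

```python
import random

def random_walks(adj, starts, length, seed=0):
    """Sample a fixed-length random walk from each start vertex.

    Step t+1 cannot be issued until the neighbor list of the vertex
    sampled at step t has been fetched; hiding that serial dependency
    is what FastRW's walker scheduling addresses. adj maps each vertex
    to its neighbor list.
    """
    rng = random.Random(seed)
    walks = []
    for s in starts:
        walk = [s]
        for _ in range(length):
            nbrs = adj.get(walk[-1], [])
            if not nbrs:          # dead end: terminate this walk early
                break
            walk.append(rng.choice(nbrs))
        walks.append(walk)
    return walks
```

Because each walker is independent, many walks can proceed concurrently, which is what makes the workload amenable to pipelined hardware despite the per-walk dependency.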
{"title":"A Novel Fault-Tolerant Architecture for Tiled Matrix Multiplication","authors":"Sandeep Bal, Chandra sekhar Mummidi, V. C. Ferreira, S. Srinivasan, S. Kundu","doi":"10.23919/DATE56975.2023.10136985","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10136985","url":null,"abstract":"General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced an Advanced Matrix Extensions (AMX®) instruction set for GEMM, which is supported by 1 kB AMX tile registers and a Tile Matrix Multiply unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly in hardware. Though the exact implementation of the Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. 
This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122698209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
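The classic ABFT scheme that the paper moves into the TMUL can be sketched in software: append a column-checksum row to A and a row-checksum column to B, multiply, and verify that the product's checksums match its body. A NumPy sketch of the principle (not the proposed hardware design):

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-protected GEMM (classic software ABFT sketch).

    If the multiply is fault-free, the last row of the augmented product
    equals the column sums of its body and the last column equals the row
    sums; any mismatch signals a silent data corruption.
    """
    Ac = np.vstack([A, A.sum(axis=0)])                 # (m+1) x k
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # k x (n+1)
    C = Ac @ Br
    body = C[:-1, :-1]
    ok = (np.allclose(C[-1, :-1], body.sum(axis=0)) and
          np.allclose(C[:-1, -1], body.sum(axis=1)))
    return body, ok
```

Applying the same check per tile, as the paper proposes in hardware, localizes an error to a single tile instead of surfacing it only after the full matrix product.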
{"title":"Region-based Flash Caching with Joint Latency and Lifetime Optimization in Hybrid SMR Storage Systems","authors":"Zhengang Chen, Guohui Wang, Zhiping Shi, Yong-Yuan Guan, Tianyu Wang","doi":"10.23919/DATE56975.2023.10137148","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137148","url":null,"abstract":"The frequent Read-Modify-Write operations (RMWs) in Shingled Magnetic Recording (SMR) disks severely degrade the random write performance of the system. Although the adoption of persistent cache (PC) and built-in NAND flash cache alleviates some of the RMWs, when the cache is full, the triggered write-back operations still prolong I/O response time, and the erasure of NAND flash also sacrifices its lifetime. In this paper, we propose a Region-based Co-optimized strategy named Multi-Regional Collaborative Management (MCM) that optimizes the average response time by separately managing sequential/random and hot/cold data, and extends the NAND flash lifetime with a region-aware wear-leveling strategy. The experimental results show that our MCM reduces the average response time by 71% and RMWs by 96% on average compared with Skylight (the baseline). Compared with the state-of-the-art flash-based cache (FC) approach, we still reduce the average response time and flash erase operations by 17.2% and 33.32%, respectively.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121977806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ADEE-LID: Automated Design of Energy-Efficient Hardware Accelerators for Levodopa-Induced Dyskinesia Classifiers","authors":"Martin Hurta, Vojtěch Mrázek, Michaela Drahosova, L. Sekanina","doi":"10.23919/DATE56975.2023.10137079","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137079","url":null,"abstract":"Levodopa, a drug used to treat symptoms of Parkinson's disease, is connected to side effects known as Levodopa-induced dyskinesia (LID). LID is difficult to classify during a physician's visit. A wearable device allowing long-term and continuous classification would significantly help with dosage adjustments. This paper deals with an automated design of energy-efficient hardware accelerators for such LID classifiers. The proposed accelerator consists of a feature extractor and a classifier co-designed using genetic programming. Improvements are achieved by introducing a variable bit width for arithmetic operators, eliminating redundant registers, and using precise energy consumption estimation for Pareto front creation. Evolved solutions reduce energy consumption while maintaining classification accuracy comparable to the state of the art.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128510157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expanding In-Cone Obfuscated Tree for Anti SAT Attack","authors":"RuiJie Wang, Li-Nung Hsu, Yung-Chih Chen, TingTing Hwang","doi":"10.23919/DATE56975.2023.10137091","DOIUrl":"https://doi.org/10.23919/DATE56975.2023.10137091","url":null,"abstract":"Logic locking is a hardware security technology to protect circuit designs from overuse, piracy, and reverse engineering. It protects a circuit by inserting key gates to hide the circuit functionality, so that the circuit is functional only when a correct key is applied. In recent years, encrypting the point function, e.g., AND-tree, in a circuit has been shown to be promising to resist SAT attack. However, the encryption technique may suffer from two problems: First, the tree size may not be large enough to achieve desired security. Second, SAT attack could break the encryption in one iteration when it finds a specific input pattern, called remove-all DIP. Thus, in this paper, we present a new method for constructing the obfuscated tree. We first apply the sum-of-product transformation to find the largest AND-tree in a circuit, and then insert extra variables with the proposed split-compensate operation to further enlarge the AND-tree and mitigate the remove-all DIP issue. The experimental results show that the proposed obfuscated tree can effectively resist SAT attack.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129665426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
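A toy model shows why point-function encryption resists SAT attack: with a wrong key the circuit misbehaves on exactly one input pattern, so each SAT-attack iteration rules out only one key candidate. The sketch below is a generic SARLock-style flip circuit with a hypothetical SECRET key and example function f, not the paper's split-compensate construction:

```python
SECRET = 0b1011  # hypothetical correct 4-bit key

def f(x):
    # arbitrary 4-input example function standing in for the protected logic
    return (x & 1) ^ ((x >> 2) & 1)

def locked_f(x, key):
    # Flip the output only when the input equals a wrong key: every wrong
    # key misbehaves on exactly one input pattern, forcing a SAT attack to
    # eliminate keys one distinguishing input at a time.
    return f(x) ^ (1 if (x == key and key != SECRET) else 0)
```

Enlarging the protected AND-tree, as the paper does, increases the number of such one-off patterns an attacker must rule out, while the split-compensate operation targets the remove-all DIP that would otherwise collapse the search.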