{"title":"A Lane Detection Hardware Algorithm Based on Helmholtz Principle and Its Application to Unmanned Mobile Vehicles","authors":"Katsuaki Kamimae, Shintaro Matsui, Yasutoshi Araki, Takehiro Miura, Keigo Motoyoshi, Keizo Yamashita, Haruto Ikehara, Takuho Kawazu, Huang Yuwei, Masahiro Nishimura, Shuto Abe, Kenyu Okino, Yuta Hashiguchi, Koki Fukuda, Kengo Yanagihara, Taito Manabe, Yuichiro Shibata","doi":"10.1109/ICFPT56656.2022.9974208","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974208","url":null,"abstract":"We are developing an SoC FPGA-based unmanned mobile vehicle for the FPGA design competition. For the vehicle to follow roads successfully, it must be able to detect not only straight lines but also curved lines accurately. Therefore, we implemented a lane detection algorithm that is robust not only against straight lines but also against curves to improve driving performance. We implemented an autonomous driving system employing this algorithm on Digilent Zybo Z7-20. We evaluated the lane detection algorithm based on simulations and showed that this algorithm can reduce false detection of lane features compared to the classical Canny filter.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133366372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Markovian Approach for Detecting Failures in the Xilinx SEM core","authors":"T. Rajkumar, Johnny Öberg","doi":"10.1109/ICFPT56656.2022.9974240","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974240","url":null,"abstract":"The soft error mitigation (SEM) core is an internal scrubber used to detect and correct single event upsets in the configuration memory. Although the core can mitigate errors with a high accuracy, recent studies have found it to be vulnerable to radiation errors owing to its implementation in the FPGA fabric. As the reliability of the system depends on the correctness of the scrubber, undetected SEM failure is hazardous in critical applications. In this study, we investigate the effectiveness of Markov chains in detecting such failures. In order to minimise the effects of single event upsets, the detection scheme is implemented external to the FPGA and leverages log analysis to monitor the SEM health. We evaluated our approach on the Xilinx ZCU104 Ultrascale+ board using fault injection. The results show that the SEM failures caused by single and double bit errors could be detected with an $F_{1}$ score of 0.90 and 0.99 respectively. To the best of our knowledge, this is the first custom approach for failure detection in the SEM core.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116120532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using integer linear programming for correctly rounded multipartite architectures","authors":"Orégane Desrentes, F. D. Dinechin","doi":"10.1109/ICFPT56656.2022.9974486","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974486","url":null,"abstract":"This article introduces several improvements to the multipartite method, a generic technique for the hardware implementation of numerical functions. A multipartite architecture replaces a table of value with several tables and an adder tree. Here, the optimization of multipartite tables is formalized using Integer Linear Programming so that generic ILP solvers can be used. This improves the quality of faithfully rounded architectures compared to the state of the art. The proposed approach also enables correctly rounded multipartite architectures, providing errorless table compression. This improves the area by a factor 5 without any performance penalty compared with the state of the art in errorless compression. Another improvement of the proposed work is a cost function that attempts to predict the total cost of an architecture in FPGA architectural LUTs, where most of the previous works only count the size of the tables, thus ignoring the cost of the adder tree.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129170010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bandwidth Efficient Homomorphic Encrypted Matrix Vector Multiplication Accelerator on FPGA","authors":"Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna","doi":"10.1109/ICFPT56656.2022.9974369","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974369","url":null,"abstract":"Homomorphic Encryption (HE) is a promising solution to the increasing concerns of privacy in Machine Learning (ML) as it enables computations directly on encrypted data. However, it imposes significant overhead on the compute system and remains impractically slow. Prior works have proposed efficient FPGA implementations of basic HE primitives such as number theoretic transform (NTT), key switching, etc. Composing the primitives together to realize higher level ML computation is still a challenge due to the large data transfer overhead. In this work, we propose an efficient FPGA implementation of HE Matrix Vector Multiplication $(mathbf{M}times mathbf{V})$, a key kernel in HE-based Machine Learning applications. By analyzing the data reuse characteristics and the encryption overhead of HE $mathbf{M}times mathbf{V}$, we show that simply using the principles of unencrypted $mathbf{M}times mathbf{V}$ to design accelerators for HE $mathbf{M}times mathbf{V}$ can lead to a significant amount of DRAM data transfers. We tackle the computation and data transfer challenges by proposing a bandwidth efficient dataflow that is specially optimized for HE $mathbf{M}times mathbf{V}$. We identify highly reused data entities in HE $mathbf{M}times mathbf{V}$ and efficiently utilize the on-chip SRAM to reduce the DRAM data transfers. To speed up the computation of HE $mathbf{M}times mathbf{V}$, we exploit three types of parallelism: partial sum parallelism, residual polynomial parallelism and coefficient parallelism. Leveraging these innovations, we demonstrate the first FPGA accelerator for HE matrix vector multiplication. Evaluation on 7 HE $mathbf{M}times mathbf{V}$ benchmarks shows that our FPGA accelerator is up to $3.8times$ (GeoMean $2.8times$) faster compared to the 64-thread CPU implementation.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116711267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Generation and Orchestration of Stream Processing Pipelines on FPGAs","authors":"Kaspar Mätas, Kristiyan Manev, Joseph Powell, Dirk Koch","doi":"10.1109/ICFPT56656.2022.9974596","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974596","url":null,"abstract":"FPGAs have demonstrated substantial performance and energy efficiency advantages for workloads that fit a stream processing model with direct module-to-module communication. However, when the dataflow processing system is required to adapt to runtime conditions, current static acceleration solutions are limited. To better use FPGAs in dynamic scenarios, this paper proposes using partial reconfiguration to stitch together different physically implemented operator modules on-the-fly. Rather than using designated module slots, our system places all modules and routing wires into a shared region with more placement options to minimize fragmentation. Furthermore, we use a module library that provides different resource and performance trade-offs for faster execution while considering the configuration cost. Our system finds the optimal set of modules while scheduling multiple acceleration requests and managing all constraints transparently to the end-user. We demonstrate that the middleware is fast enough to compose accelerator pipelines at runtime with end-to- end execution times equal to hand-crafted static systems when processing small datasets. For large datasets, we found up to 7.2 x faster execution over static systems when using our runtime methods. We exemplified our approach for database acceleration, where the whole dynamic FPGA acceleration is inferred by directly executing SQL queries.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124670766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Masked Pure-Hardware Implementation of Kyber Cryptographic Algorithm","authors":"T. Kamucheka, Alexander Nelson, David Andrews, Miaoqing Huang","doi":"10.1109/ICFPT56656.2022.9974404","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974404","url":null,"abstract":"Quantum computing-specifically Shor's algorithm [1]-presents an existential threat to some standard cryptographic algorithms. In preparation, post-quantum cryptography (PQC) algorithms have been in development and are nearing mathematical and cryptanalytic maturity. Standardization efforts through the National Institute of Standards and Technology (NIST) PQC standardization process have chosen one PKE/KEM algorithm (i.e., CRYSTALS-Kyber) and three digital signature algorithms (i.e., CRYSTALS-Dilithium, Falcon, and SPHINCS+). CRYSTALS-Kyber is a lattice-based, IND-CCA2-secure, key-encapsulation mechanism (KEM) based on the learning-with-errors problem over module lattices. This paper presents a masked hardware implementation of Kyber that is demonstrably secure against side-channel power analysis methods.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121549717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design","authors":"Nobuho Hashimoto, Shinya Takamaeda-Yamazaki","doi":"10.1109/ICFPT56656.2022.9974565","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974565","url":null,"abstract":"3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of cal-culations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments. This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, which is a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCUI04 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation. Code available: https://github.com/casys-utokyo/fadec/","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115277219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning","authors":"Jenny Yang, Jaeuk Kim, Joo-Young Kim","doi":"10.1109/ICFPT56656.2022.9974543","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974543","url":null,"abstract":"Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems in various applications such as multi-robot control and self-driving cars. Unlike supervised model or single-agent rein-forcement learning, which actively exploits network pruning, it is obscure that how pruning will work in multi-agent reinforcement learning with its cooperative and interactive characteristics. In this paper, we present a real-time sparse training accel-eration system named LearningGroup, which adopts network pruning on the training of MARL for the first time with an algorithm/architecture co-design approach. We create spar-sity using a weight grouping algorithm and propose on-chip sparse data encoding loop (OSEL) that enables fast encoding with efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and computation workload allocation to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, LearningGroup system minimizes the cycle time and memory footprint for sparse data generation up to 5.72x and 6.81x. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency for various conditions in MARL, which are 7.13x higher and 12.43x more energy efficient than Nvidia Titan RTX GPU, thanks to the fully on-chip training and highly optimized dataflow/data format provided by FPGA. Most importantly, the accelerator shows speedup up to 12.52 x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125391092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}