{"title":"A Lane Detection Hardware Algorithm Based on Helmholtz Principle and Its Application to Unmanned Mobile Vehicles","authors":"Katsuaki Kamimae, Shintaro Matsui, Yasutoshi Araki, Takehiro Miura, Keigo Motoyoshi, Keizo Yamashita, Haruto Ikehara, Takuho Kawazu, Huang Yuwei, Masahiro Nishimura, Shuto Abe, Kenyu Okino, Yuta Hashiguchi, Koki Fukuda, Kengo Yanagihara, Taito Manabe, Yuichiro Shibata","doi":"10.1109/ICFPT56656.2022.9974208","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974208","url":null,"abstract":"We are developing an SoC FPGA-based unmanned mobile vehicle for the FPGA design competition. For the vehicle to follow roads successfully, it must be able to detect not only straight lines but also curved lines accurately. Therefore, we implemented a lane detection algorithm that is robust not only against straight lines but also against curves to improve driving performance. We implemented an autonomous driving system employing this algorithm on Digilent Zybo Z7-20. We evaluated the lane detection algorithm based on simulations and showed that this algorithm can reduce false detection of lane features compared to the classical Canny filter.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133366372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Markovian Approach for Detecting Failures in the Xilinx SEM core","authors":"T. Rajkumar, Johnny Öberg","doi":"10.1109/ICFPT56656.2022.9974240","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974240","url":null,"abstract":"The soft error mitigation (SEM) core is an internal scrubber used to detect and correct single event upsets in the configuration memory. Although the core can mitigate errors with a high accuracy, recent studies have found it to be vulnerable to radiation errors owing to its implementation in the FPGA fabric. As the reliability of the system depends on the correctness of the scrubber, undetected SEM failure is hazardous in critical applications. In this study, we investigate the effectiveness of Markov chains in detecting such failures. In order to minimise the effects of single event upsets, the detection scheme is implemented external to the FPGA and leverages log analysis to monitor the SEM health. We evaluated our approach on the Xilinx ZCU104 Ultrascale+ board using fault injection. The results show that the SEM failures caused by single and double bit errors could be detected with an $F_{1}$ score of 0.90 and 0.99 respectively. To the best of our knowledge, this is the first custom approach for failure detection in the SEM core.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116120532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using integer linear programming for correctly rounded multipartite architectures","authors":"Orégane Desrentes, F. D. Dinechin","doi":"10.1109/ICFPT56656.2022.9974486","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974486","url":null,"abstract":"This article introduces several improvements to the multipartite method, a generic technique for the hardware implementation of numerical functions. A multipartite architecture replaces a table of value with several tables and an adder tree. Here, the optimization of multipartite tables is formalized using Integer Linear Programming so that generic ILP solvers can be used. This improves the quality of faithfully rounded architectures compared to the state of the art. The proposed approach also enables correctly rounded multipartite architectures, providing errorless table compression. This improves the area by a factor 5 without any performance penalty compared with the state of the art in errorless compression. Another improvement of the proposed work is a cost function that attempts to predict the total cost of an architecture in FPGA architectural LUTs, where most of the previous works only count the size of the tables, thus ignoring the cost of the adder tree.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129170010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bandwidth Efficient Homomorphic Encrypted Matrix Vector Multiplication Accelerator on FPGA","authors":"Yang Yang, S. Kuppannagari, R. Kannan, V. Prasanna","doi":"10.1109/ICFPT56656.2022.9974369","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974369","url":null,"abstract":"Homomorphic Encryption (HE) is a promising solution to the increasing concerns of privacy in Machine Learning (ML) as it enables computations directly on encrypted data. However, it imposes significant overhead on the compute system and remains impractically slow. Prior works have proposed efficient FPGA implementations of basic HE primitives such as number theoretic transform (NTT), key switching, etc. Composing the primitives together to realize higher level ML computation is still a challenge due to the large data transfer overhead. In this work, we propose an efficient FPGA implementation of HE Matrix Vector Multiplication $(mathbf{M}times mathbf{V})$, a key kernel in HE-based Machine Learning applications. By analyzing the data reuse characteristics and the encryption overhead of HE $mathbf{M}times mathbf{V}$, we show that simply using the principles of unencrypted $mathbf{M}times mathbf{V}$ to design accelerators for HE $mathbf{M}times mathbf{V}$ can lead to a significant amount of DRAM data transfers. We tackle the computation and data transfer challenges by proposing a bandwidth efficient dataflow that is specially optimized for HE $mathbf{M}times mathbf{V}$. We identify highly reused data entities in HE $mathbf{M}times mathbf{V}$ and efficiently utilize the on-chip SRAM to reduce the DRAM data transfers. To speed up the computation of HE $mathbf{M}times mathbf{V}$, we exploit three types of parallelism: partial sum parallelism, residual polynomial parallelism and coefficient parallelism. Leveraging these innovations, we demonstrate the first FPGA accelerator for HE matrix vector multiplication. Evaluation on 7 HE $mathbf{M}times mathbf{V}$ benchmarks shows that our FPGA accelerator is up to $3.8times$ (GeoMean $2.8times$) faster compared to the 64-thread CPU implementation.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116711267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Generation and Orchestration of Stream Processing Pipelines on FPGAs","authors":"Kaspar Mätas, Kristiyan Manev, Joseph Powell, Dirk Koch","doi":"10.1109/ICFPT56656.2022.9974596","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974596","url":null,"abstract":"FPGAs have demonstrated substantial performance and energy efficiency advantages for workloads that fit a stream processing model with direct module-to-module communication. However, when the dataflow processing system is required to adapt to runtime conditions, current static acceleration solutions are limited. To better use FPGAs in dynamic scenarios, this paper proposes using partial reconfiguration to stitch together different physically implemented operator modules on-the-fly. Rather than using designated module slots, our system places all modules and routing wires into a shared region with more placement options to minimize fragmentation. Furthermore, we use a module library that provides different resource and performance trade-offs for faster execution while considering the configuration cost. Our system finds the optimal set of modules while scheduling multiple acceleration requests and managing all constraints transparently to the end-user. We demonstrate that the middleware is fast enough to compose accelerator pipelines at runtime with end-to- end execution times equal to hand-crafted static systems when processing small datasets. For large datasets, we found up to 7.2 x faster execution over static systems when using our runtime methods. We exemplified our approach for database acceleration, where the whole dynamic FPGA acceleration is inferred by directly executing SQL queries.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124670766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Masked Pure-Hardware Implementation of Kyber Cryptographic Algorithm","authors":"T. Kamucheka, Alexander Nelson, David Andrews, Miaoqing Huang","doi":"10.1109/ICFPT56656.2022.9974404","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974404","url":null,"abstract":"Quantum computing-specifically Shor's algorithm [1]-presents an existential threat to some standard cryptographic algorithms. In preparation, post-quantum cryptography (PQC) algorithms have been in development and are nearing mathematical and cryptanalytic maturity. Standardization efforts through the National Institute of Standards and Technology (NIST) PQC standardization process have chosen one PKE/KEM algorithm (i.e., CRYSTALS-Kyber) and three digital signature algorithms (i.e., CRYSTALS-Dilithium, Falcon, and SPHINCS+). CRYSTALS-Kyber is a lattice-based, IND-CCA2-secure, key-encapsulation mechanism (KEM) based on the learning-with-errors problem over module lattices. This paper presents a masked hardware implementation of Kyber that is demonstrably secure against side-channel power analysis methods.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121549717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FADEC: FPGA-based Acceleration of Video Depth Estimation by HW/SW Co-design","authors":"Nobuho Hashimoto, Shinya Takamaeda-Yamazaki","doi":"10.1109/ICFPT56656.2022.9974565","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974565","url":null,"abstract":"3D reconstruction from videos has become increasingly popular for various applications, including navigation for autonomous driving of robots and drones, augmented reality (AR), and 3D modeling. This task often combines traditional image/video processing algorithms and deep neural networks (DNNs). Although recent developments in deep learning have improved the accuracy of the task, the large number of cal-culations involved results in low computation speed and high power consumption. Although there are various domain-specific hardware accelerators for DNNs, it is not easy to accelerate the entire process of applications that alternate between traditional image/video processing algorithms and DNNs. Thus, FPGA-based end-to-end acceleration is required for such complicated applications in low-power embedded environments. This paper proposes a novel FPGA-based accelerator for DeepVideoMVS, which is a DNN-based depth estimation method for 3D reconstruction. We employ HW/SW co-design to appropriately utilize heterogeneous components in modern SoC FPGAs, such as programmable logic (PL) and CPU, according to the inherent characteristics of the method. As some operations are unsuitable for hardware implementation, we determine the operations to be implemented in software through analyzing the number of times each operation is performed and its memory access pattern, and then considering comprehensive aspects: the ease of hardware implementation and degree of expected acceleration by hardware. The hardware and software implementations are executed in parallel on the PL and CPU to hide their execution latencies. The proposed accelerator was developed on a Xilinx ZCUI04 board by using NNgen, an open-source high-level synthesis (HLS) tool. Experiments showed that the proposed accelerator operates 60.2 times faster than the software-only implementation on the same FPGA board with minimal accuracy degradation. Code available: https://github.com/casys-utokyo/fadec/","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115277219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning","authors":"Jenny Yang, Jaeuk Kim, Joo-Young Kim","doi":"10.1109/ICFPT56656.2022.9974543","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974543","url":null,"abstract":"Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems in various applications such as multi-robot control and self-driving cars. Unlike supervised model or single-agent rein-forcement learning, which actively exploits network pruning, it is obscure that how pruning will work in multi-agent reinforcement learning with its cooperative and interactive characteristics. In this paper, we present a real-time sparse training accel-eration system named LearningGroup, which adopts network pruning on the training of MARL for the first time with an algorithm/architecture co-design approach. We create spar-sity using a weight grouping algorithm and propose on-chip sparse data encoding loop (OSEL) that enables fast encoding with efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and computation workload allocation to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, LearningGroup system minimizes the cycle time and memory footprint for sparse data generation up to 5.72x and 6.81x. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency for various conditions in MARL, which are 7.13x higher and 12.43x more energy efficient than Nvidia Titan RTX GPU, thanks to the fully on-chip training and highly optimized dataflow/data format provided by FPGA. Most importantly, the accelerator shows speedup up to 12.52 x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125391092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}