{"title":"Assessing the Effectiveness of Active Fences Against SCAs for Multi-Tenant FPGAs","authors":"Christos Diktopoulos, Konstantinos Georgopoulos, A. Brokalakis, Georgios Christou, Grigorios Chrysos, Ioannis Morianos, S. Ioannidis","doi":"10.1109/FPL57034.2022.00065","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00065","url":null,"abstract":"The rising use of FPGAs in the context of cloud computing has created security concerns. Previous works have shown that malicious users can implement voltage-fluctuation sensors and mount successful power analysis attacks against cryptographic algorithms that share the same Power Distribution Network (PDN). So far, masking and hiding schemes are the two main mitigation strategies against such attacks, and previous work has shown that an active fence of Ring Oscillators (ROs) placed between two adversary users holds the potential to constitute an effective hiding countermeasure. Nevertheless, developing an effective defence against remote Side-Channel Attacks (SCAs) remains an open research topic. This work maps an intra-FPGA adversary scenario onto a Xilinx UltraScale+ MPSoC to assess the effectiveness of the Ring Oscillator active fence countermeasure. We compare different active fence configurations, with a varying number of Ring Oscillators, while using a new, resource-efficient activation method aimed at achieving noise-injection hiding. The results show that with our active fence scheme, which exhibits lower area overhead and lower power consumption than the algorithm under attack, the side-channel leakage is reduced to such a degree that the number of traces that must be collected for a successful attack is more than ten times higher than with no fence present. Moreover, this work presents qualitative results that FPGA cloud providers can consider in order to assess the benefits gained through deploying active fence mechanisms within their platforms for multi-tenant services.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127595001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
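The hiding effect the abstract describes can be illustrated with a toy model: an active fence injects noise into the shared PDN, lowering the correlation an attacker can exploit, and (by the usual rule of thumb that the required trace count grows as 1/rho²) multiplying the number of traces needed. All numbers and the leakage model below are illustrative sketches, not taken from the paper.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def attack_correlation(fence_noise_std, n_traces=5000, seed=1):
    """Correlation between the secret-dependent leakage and the measured power."""
    rng = random.Random(seed)
    leaks, measured = [], []
    for _ in range(n_traces):
        leak = rng.gauss(0.0, 1.0)               # data-dependent component
        baseline = rng.gauss(0.0, 0.5)           # intrinsic measurement noise
        fence = rng.gauss(0.0, fence_noise_std)  # RO fence activity (hiding)
        leaks.append(leak)
        measured.append(leak + baseline + fence)
    return pearson(leaks, measured)

rho_off = attack_correlation(fence_noise_std=0.0)  # no fence
rho_on = attack_correlation(fence_noise_std=3.0)   # fence active
```

With the fence active, the exploitable correlation drops sharply, which is exactly the effect the paper quantifies as a more-than-tenfold increase in required traces.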
{"title":"FPL Demo: SERVE: Agile Hardware Development Platform with Cloud IDE and Cloud FPGAs","authors":"Zelin Wang, Ke Zhang, Yisong Chang, Yanlong Yin, Yuxiao Chen, Ran Zhao, Songyue Wang, Mingyu Chen, Yungang Bao","doi":"10.1109/FPL57034.2022.00087","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00087","url":null,"abstract":"We introduce SERVE, a cloud platform for agile hardware-software co-design that integrates a cloud IDE and cloud FPGAs. SERVE enables users to focus on logic design without the hassle of setting up FPGA tools and a development environment. Users write and simulate hardware logic in the cloud IDE and then generate bitstream files through a Continuous Integration (CI) pipeline. Finally, the bitstream files are deployed on an FPGA board. A large number of testbenches is executed to ensure the correctness of the hardware logic. We will demo a workflow of modifying a RISC-V processor and getting the design change quickly evaluated using SERVE.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132607811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform","authors":"Yuan Meng, R. Kannan, V. Prasanna","doi":"10.1109/FPL57034.2022.00037","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00037","url":null,"abstract":"Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of MCTS data structure and computation onto CPU and FPGA to reduce communication and coordination. High scalability of our system is achieved by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. We show that by using our accelerator, we obtain up to a 35× speedup for in-tree operations and 3× higher overall system throughput. Our CPU-FPGA system also achieves better scalability with respect to the number of parallel workers than state-of-the-art parallel MCTS implementations on CPUs.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128291327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
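The "in-tree operations" this paper offloads to the FPGA are the selection and backpropagation steps of MCTS. A minimal software sketch of UCT-based selection over a tree of visit/value statistics (class and function names, and the exploration constant, are illustrative, not the paper's implementation):

```python
import math

class Node:
    def __init__(self):
        self.children = {}    # action label -> Node
        self.visits = 0
        self.value_sum = 0.0

    def uct_score(self, child, c=1.4):
        """Upper Confidence bound for Trees: exploitation + exploration."""
        if child.visits == 0:
            return float("inf")
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore

def select(root):
    """Walk from the root to a leaf, greedily following the UCT score."""
    node, path = root, [root]
    while node.children:
        parent = node
        node = max(parent.children.values(), key=parent.uct_score)
        path.append(node)
    return path

def backpropagate(path, reward):
    """Update visit counts and value sums along the selected path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward

# Tiny usage example: after one win through `a` and one loss through `b`,
# selection prefers `a`.
root, a, b = Node(), Node(), Node()
root.children = {"a": a, "b": b}
backpropagate([root, a], 1.0)
backpropagate([root, b], 0.0)
leaf = select(root)[-1]
```

The pointer-chasing and fine-grained updates in `select` and `backpropagate` are what make these steps memory-latency-bound on CPUs, and hence good candidates for an SRAM-based accelerator.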
{"title":"A Unified Approach for Managing Heterogeneous Processing Elements on FPGAs","authors":"S. Denholm, W. Luk","doi":"10.1109/FPL57034.2022.00048","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00048","url":null,"abstract":"FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper we introduce a novel management architecture to unify heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129589130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
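The pool abstraction in this paper (E processing elements serving D parallel calls via a rotating scheduler) can be sketched in software as a round-robin arbiter over interchangeable function implementations. This is an illustrative model under assumed names, not the paper's hardware design:

```python
import math
from collections import deque

class ComputePool:
    """E interchangeable processing elements served round-robin to callers."""

    def __init__(self, elements):
        self.elements = elements   # list of callables, one per PE (len == E)
        self.next_pe = 0           # rotating scheduler pointer
        self.pending = deque()     # in-flight (PE, argument) calls

    def call(self, arg):
        """Issue a call: the rotating scheduler picks the next PE."""
        pe = self.elements[self.next_pe]
        self.next_pe = (self.next_pe + 1) % len(self.elements)
        self.pending.append((pe, arg))

    def respond(self):
        """Retire the oldest in-flight call and return its result."""
        pe, arg = self.pending.popleft()
        return pe(arg)

# E = 4 identical tanh PEs serving D = 3 parallel calls.
pool = ComputePool([math.tanh] * 4)
for x in (0.0, 1.0, -1.0):
    pool.call(x)
results = [pool.respond() for _ in range(3)]
```

Because callers only see `call`/`respond`, individual PEs can differ in implementation and latency, which is the point of the call-and-response approach; in hardware the same decoupling is what keeps routing simple and scaling linear in D.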
{"title":"FPL Demo: 400G FPGA Packet Capture Based on Network Development Kit","authors":"Jakub Cabal, Jiri Sikora, Stepán Friedl, Martin Spinler, J. Korenek","doi":"10.1109/FPL57034.2022.00090","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00090","url":null,"abstract":"CESNET, the Czech NREN (National Research and Education Network), has a long research history in the area of high-speed network monitoring using FPGA-accelerated cards. Now, we are ready to present our open-source Network Development Kit for FPGAs (https://github.com/CESNET/ndk-app-minimal/), which is ready for 400 Gbps data transfers via Ethernet and PCI Express. The demo aims to show the possibilities of the NDK, which allows users to quickly and easily develop new network applications for FPGA-based acceleration cards. Even the high-speed DMA module, fully supported in the NDK, is available free of charge for academic purposes. It can thus significantly contribute to the spread of 400G technology in the academic community and among other users. The accelerator card, equipped with an Intel Agilex I-Series FPGA, will transmit and receive back 400G Ethernet (400GBASE) traffic via external loopback. The received packets will be forwarded via very fast packet DMA transfers directly to the RAM of the host computer.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130035036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Mappings for Symmetric Range-Limited Molecular Force Calculations on FPGAs","authors":"Chunshu Wu, Sahan Bandara, Tong Geng, Anqi Guo, Pouya Haghi, Vipin Sachdeva, W. Sherman, Martin C. Herbordt","doi":"10.1109/FPL57034.2022.00026","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00026","url":null,"abstract":"In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing the pair-wise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can result in greatly reduced data movement and computation. Despite FPGAs having a distinct compute model (BRAMs/network/pipelines) from CPUs and ASICs, mappings on FPGAs have not previously been studied in depth: it was thought that the half-shell method was preferred. In this work, we find that the Manhattan method is surprisingly compatible with FPGA hardware. With the cache-overlapping technique proposed in this paper, the ultra-fine-grained data access demanded by the Manhattan method can be satisfied, despite the fact that the memory blocks on FPGAs appear to be insufficiently fine-grained. We further demonstrate that, compared to the traditional baseline half-shell method, approximately half of the filters (preprocessors) can be removed without performance degradation. For communication, the amount of data transferred can be reduced by 40% to 75% in the most common multi-FPGA scenarios. Moreover, data transfers are almost perfectly balanced along all directions, and the optimization requires only minimal hardware resources. The practical consequence is that nearly 2× to 4× the workload can be handled without upgrading the network connections between FPGAs. This is a critical finding given the relatively limited bandwidth available in many common accelerator boards and the strong-scaling applications to which FPGA clusters are being applied.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130991896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
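The two constraints the abstract names, a cut-off radius and force symmetry, can be shown in a few lines: each pair inside the cut-off is evaluated once, and Newton's third law lets the one result update both particles. This 1D sketch with an illustrative inverse-square pair force is only a conceptual model, not the paper's Manhattan-method mapping:

```python
def pair_force(xi, xj):
    """Signed inverse-square pair force (illustrative, not a real potential)."""
    r = xj - xi
    return 1.0 / (r * abs(r))

def range_limited_forces(positions, cutoff):
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):                    # each pair visited once
            if abs(positions[j] - positions[i]) <= cutoff:
                f = pair_force(positions[i], positions[j])
                forces[i] += f                       # action ...
                forces[j] -= f                       # ... equal and opposite reaction
    return forces

# Only the (0, 1) pair lies within the cut-off; pair (0, 2) and (1, 2) are skipped.
forces = range_limited_forces([0.0, 1.0, 5.0], cutoff=2.0)
```

Exploiting the symmetry halves the force evaluations, but when particles live in different memory banks or on different FPGAs it creates exactly the data-movement problem whose mapping the paper optimizes.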
{"title":"Unleashing Parallelism in Elastic Circuits with Faster Token Delivery","authors":"Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, P. Ienne","doi":"10.1109/FPL57034.2022.00046","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00046","url":null,"abstract":"High-level synthesis (HLS) is the process of automatically generating circuits out of high-level language descriptions. Previous research has shown that dynamically scheduled HLS through elastic circuit generation is successful at exploiting parallelism in some important use-cases. Nevertheless, the literal conversion of a standard compiler's control-data flow graph into elastic circuits often produces circuits with notable resource demands and inferior performance. In this work, we present a methodology for generating more area- and timing-efficient elastic circuits. We show that our strategy results in significant area and timing improvements compared to previous circuit generation strategies.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114695847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the FPL PhD Forum and Demo Night Chairs","authors":"","doi":"10.1109/fpl57034.2022.00007","DOIUrl":"https://doi.org/10.1109/fpl57034.2022.00007","url":null,"abstract":"","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117206410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine","authors":"Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, D. Xu, Hong Wang, Rongzhang Zheng, Satyaprakash Pareek, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan","doi":"10.1109/FPL57034.2022.00041","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00041","url":null,"abstract":"Convolutional neural networks (CNNs) are widely used in computer vision applications nowadays. However, the trends towards higher accuracy and higher resolution produce larger networks, making computation and I/O bandwidth the key bottlenecks for performance. Xilinx's latest 7nm Versal ACAP platform with AI Engine (AIE) cores can deliver up to 8× the silicon compute density at 50% of the power consumption of traditional FPGA solutions. In this paper, we propose XVDPU: an AIE-based int8-precision CNN accelerator on Versal chips, scaling from 16 AIE cores (C16B1) to 320 AIE cores (C64B5, peak: 109.2 TOPs) to meet computation requirements. To resolve the I/O bottleneck, we adopt several techniques, such as multi-batch (MB), shared-weights (SHRWGT), feature-map-stationary (FMS) and long-load-weights (LLW), to improve data reuse and reduce I/O requirements. We further propose an Arithmetic Logic Unit (ALU) design for the accelerator that performs non-convolution layers, such as depthwise convolution, pooling and non-linear function layers, using the same logic resources, which better balances resource utilization, new-feature support and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core (C32B3, peak: 32.76 TOPs) implementation achieves 1653 FPS for ResNet50 on the VCK190, which is 9.8× faster than the design on the ZCU102 running at 168.5 FPS with a peak of 3.6 TOPs. The 256-AIE-core (C32B8, peak: 87.36 TOPs) implementation further achieves 4050 FPS, better leveraging the computing power of Versal AIE devices. The powerful XVDPU will help enable many embedded-system applications, such as low-latency data centers, high-level ADAS and complex robotics.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116320835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
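XVDPU computes in int8 precision. A common way to get float tensors into int8 is symmetric per-tensor quantization with a single scale; the abstract does not specify XVDPU's exact scheme, so the round trip below is only a generic sketch of the idea:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to [-128, 127] via one scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.0, 0.25])
approx = dequantize(q, s)   # close to the inputs, within one quantization step
```

The accelerator's int8 datapath trades this small quantization error for much higher compute density, which is what lets the AIE cores reach the TOPs figures quoted above.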
{"title":"TRAM: An Open-Source Template-based Reconfigurable Architecture Modeling Framework","authors":"Yunhui Qiu, Yuhang Cao, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1109/FPL57034.2022.00021","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00021","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) is a promising accelerator design choice due to its high performance and power efficiency in computation- or data-intensive application domains, such as security, multimedia, digital signal processing, machine learning, and high-performance computing. A CGRA consists of coarse-grained processing elements (PEs) and interconnects, which determine the architecture's flexibility to support different applications and also significantly affect performance and power efficiency. Although multiple types of interconnects have been proposed, a parameterized unified model is still lacking. In this paper, we propose a flexible and scalable CGRA template with a novel interconnect model that unifies the typical neighbor-to-neighbor, switch-based, and FPGA-like interconnects. Furthermore, we present TRAM, an open-source template-based reconfigurable architecture modeling framework that integrates Chisel-based CGRA modeling, architecture intermediate representation (IR) and Verilog generation, dataflow graph (DFG) mapping, simulation, and evaluation. The mapping flow comprises graph-based placement and routing, critical-path-driven data synchronization, and simulated-annealing-based optimization. We evaluate the impact of the rich design parameters, which demonstrates the significance of such a flexible template for facilitating architecture optimization. Compared with related work, TRAM achieves a 4.1× smaller DFG latency and a faster mapping speed for both the 8×8 and 16×16 CGRAs. Moreover, TRAM attains a very high average PE utilization of 94.4% through architecture tuning.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127254677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
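TRAM's mapping flow includes simulated-annealing-based optimization. The core loop of such an optimizer, here placing DFG nodes onto a PE grid to minimize Manhattan wire length, looks as follows; the cost model, cooling schedule and node names are illustrative, not TRAM's actual implementation:

```python
import math
import random

def wirelength(placement, edges):
    """Total Manhattan distance over all DFG edges."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)

def anneal(nodes, edges, grid, seed=0, steps=2000):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(slots)
    placement = dict(zip(nodes, slots))          # random initial placement
    cost, temp = wirelength(placement, edges), 2.0
    for _ in range(steps):
        a, b = rng.sample(nodes, 2)              # propose: swap two nodes
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                      # accept (sometimes uphill)
        else:
            placement[a], placement[b] = placement[b], placement[a]  # revert
        temp *= 0.995                            # cool down
    return placement, cost

# A 4-node DFG chain placed on a 2x2 PE grid.
nodes = ["ld", "add", "mul", "sub"]
edges = [("ld", "add"), ("add", "mul"), ("mul", "sub")]
placement, cost = anneal(nodes, edges, grid=2)
```

Accepting occasional uphill swaps at high temperature is what lets the annealer escape local minima; the same principle applies at CGRA scale, just with TRAM's richer routing-aware cost function.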