{"title":"Assessing the Effectiveness of Active Fences Against SCAs for Multi-Tenant FPGAs","authors":"Christos Diktopoulos, Konstantinos Georgopoulos, A. Brokalakis, Georgios Christou, Grigorios Chrysos, Ioannis Morianos, S. Ioannidis","doi":"10.1109/FPL57034.2022.00065","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00065","url":null,"abstract":"The rising use of FPGAs in the context of cloud computing has created security concerns. Previous works have shown that malicious users can implement voltage-fluctuation sensors and mount successful power analysis attacks against cryptographic algorithms that share the same Power Distribution Network (PDN). So far, masking and hiding schemes are the two main mitigation strategies against such attacks, and previous work has shown that an active fence of Ring Oscillators (ROs) placed between two adversary users holds the potential to constitute an effective hiding countermeasure. Nevertheless, developing an effective defence against remote Side-Channel Attacks (SCAs) remains an open research topic. This work maps an intra-FPGA adversary scenario onto a Xilinx UltraScale+ MPSoC to assess the effectiveness of the Ring Oscillator active fence countermeasure. We compare different active fence configurations, with a varying number of Ring Oscillators, while using a new, resource-efficient activation method aimed at achieving noise-injection hiding. The results show that with our active fence scheme, which exhibits lower area overhead and lower power consumption than the algorithm under attack, the side-channel leakage is reduced to such a degree that the number of traces that must be collected for a successful attack is more than ten times higher than with no fence present. Moreover, this work presents qualitative results that FPGA cloud providers can consider in order to assess the benefits gained through deploying active fence mechanisms within their platforms for multi-tenant services.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127595001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
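The hiding effect the abstract describes can be illustrated with a toy model: an active fence injects noise into the shared PDN, lowering the correlation an attacker can exploit, and (by the usual rule of thumb that the required trace count grows as 1/rho²) multiplying the number of traces needed. All numbers and the leakage model below are illustrative sketches, not taken from the paper.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def attack_correlation(fence_noise_std, n_traces=5000, seed=1):
    """Correlation between the secret-dependent leakage and the measured power."""
    rng = random.Random(seed)
    leaks, measured = [], []
    for _ in range(n_traces):
        leak = rng.gauss(0.0, 1.0)               # data-dependent component
        baseline = rng.gauss(0.0, 0.5)           # intrinsic measurement noise
        fence = rng.gauss(0.0, fence_noise_std)  # RO fence activity (hiding)
        leaks.append(leak)
        measured.append(leak + baseline + fence)
    return pearson(leaks, measured)

rho_off = attack_correlation(fence_noise_std=0.0)  # no fence
rho_on = attack_correlation(fence_noise_std=3.0)   # fence active
```

With the fence active, the exploitable correlation drops sharply, which is exactly the effect the paper quantifies as a more-than-tenfold increase in required traces.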
{"title":"FPL Demo: SERVE: Agile Hardware Development Platform with Cloud IDE and Cloud FPGAs","authors":"Zelin Wang, Ke Zhang, Yisong Chang, Yanlong Yin, Yuxiao Chen, Ran Zhao, Songyue Wang, Mingyu Chen, Yungang Bao","doi":"10.1109/FPL57034.2022.00087","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00087","url":null,"abstract":"We introduce SERVE, a cloud platform for agile hardware-software co-design that integrates a cloud IDE and cloud FPGAs. SERVE enables users to focus on logic design without the hassle of setting up FPGA tools and a development environment. Users write and simulate hardware logic in the cloud IDE and then generate bitstream files through a Continuous Integration (CI) pipeline. Finally, the bitstream files are deployed on an FPGA board. A large number of testbenches is executed to ensure the correctness of the hardware logic. We will demo a workflow of modifying a RISC-V processor and getting the design change quickly evaluated using SERVE.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132607811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Monte-Carlo Tree Search on CPU-FPGA Heterogeneous Platform","authors":"Yuan Meng, R. Kannan, V. Prasanna","doi":"10.1109/FPL57034.2022.00037","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00037","url":null,"abstract":"Monte Carlo Tree Search (MCTS) methods have achieved great success in many Artificial Intelligence (AI) benchmarks. The in-tree operations become a critical performance bottleneck in realizing parallel MCTS on CPUs. In this work, we develop a scalable CPU-FPGA system for Tree-Parallel MCTS. We propose a novel decomposition and mapping of MCTS data structure and computation onto CPU and FPGA to reduce communication and coordination. High scalability of our system is achieved by encapsulating in-tree operations in an SRAM-based FPGA accelerator. To lower the high data access latency and inter-worker synchronization overheads, we develop several hardware optimizations. We show that by using our accelerator, we obtain up to a 35× speedup for in-tree operations and 3× higher overall system throughput. Our CPU-FPGA system also achieves better scalability with respect to the number of parallel workers than state-of-the-art parallel MCTS implementations on CPUs.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128291327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
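The "in-tree operations" this paper offloads to the FPGA are the selection and backpropagation steps of MCTS. A minimal software sketch of UCT-based selection over a tree of visit/value statistics (class and function names, and the exploration constant, are illustrative, not the paper's implementation):

```python
import math

class Node:
    def __init__(self):
        self.children = {}    # action label -> Node
        self.visits = 0
        self.value_sum = 0.0

    def uct_score(self, child, c=1.4):
        """Upper Confidence bound for Trees: exploitation + exploration."""
        if child.visits == 0:
            return float("inf")
        exploit = child.value_sum / child.visits
        explore = c * math.sqrt(math.log(self.visits) / child.visits)
        return exploit + explore

def select(root):
    """Walk from the root to a leaf, greedily following the UCT score."""
    node, path = root, [root]
    while node.children:
        parent = node
        node = max(parent.children.values(), key=parent.uct_score)
        path.append(node)
    return path

def backpropagate(path, reward):
    """Update visit counts and value sums along the selected path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward

# Tiny usage example: after one win through `a` and one loss through `b`,
# selection prefers `a`.
root, a, b = Node(), Node(), Node()
root.children = {"a": a, "b": b}
backpropagate([root, a], 1.0)
backpropagate([root, b], 0.0)
leaf = select(root)[-1]
```

The pointer-chasing and fine-grained updates in `select` and `backpropagate` are what make these steps memory-latency-bound on CPUs, and hence good candidates for an SRAM-based accelerator.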
{"title":"A Unified Approach for Managing Heterogeneous Processing Elements on FPGAs","authors":"S. Denholm, W. Luk","doi":"10.1109/FPL57034.2022.00048","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00048","url":null,"abstract":"FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper we introduce a novel management architecture to unify heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129589130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
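The pool abstraction in this paper (E processing elements serving D parallel calls via a rotating scheduler) can be sketched in software as a round-robin arbiter over interchangeable function implementations. This is an illustrative model under assumed names, not the paper's hardware design:

```python
import math
from collections import deque

class ComputePool:
    """E interchangeable processing elements served round-robin to callers."""

    def __init__(self, elements):
        self.elements = elements   # list of callables, one per PE (len == E)
        self.next_pe = 0           # rotating scheduler pointer
        self.pending = deque()     # in-flight (PE, argument) calls

    def call(self, arg):
        """Issue a call: the rotating scheduler picks the next PE."""
        pe = self.elements[self.next_pe]
        self.next_pe = (self.next_pe + 1) % len(self.elements)
        self.pending.append((pe, arg))

    def respond(self):
        """Retire the oldest in-flight call and return its result."""
        pe, arg = self.pending.popleft()
        return pe(arg)

# E = 4 identical tanh PEs serving D = 3 parallel calls.
pool = ComputePool([math.tanh] * 4)
for x in (0.0, 1.0, -1.0):
    pool.call(x)
results = [pool.respond() for _ in range(3)]
```

Because callers only see `call`/`respond`, individual PEs can differ in implementation and latency, which is the point of the call-and-response approach; in hardware the same decoupling is what keeps routing simple and scaling linear in D.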
{"title":"FPL Demo: 400G FPGA Packet Capture Based on Network Development Kit","authors":"Jakub Cabal, Jiri Sikora, Stepán Friedl, Martin Spinler, J. Korenek","doi":"10.1109/FPL57034.2022.00090","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00090","url":null,"abstract":"CESNET, the Czech NREN (National Research and Education Network), has a long research history in the area of high-speed network monitoring using FPGA-accelerated cards. Now, we are ready to present our open-source Network Development Kit for FPGAs (https://github.com/CESNET/ndk-app-minimal/), which is ready for 400 Gbps data transfers via Ethernet and PCI Express. The demo aims to show the possibilities of the NDK, which allows users to quickly and easily develop new network applications for FPGA-based acceleration cards. Even the high-speed DMA module, fully supported in the NDK, is available free of charge for academic purposes. It can thus significantly contribute to the spread of 400G technology in the academic community and among other users. The accelerator card, equipped with an Intel Agilex I-Series FPGA, will transmit and receive back 400G Ethernet (400GBASE) traffic via external loopback. The received packets will be forwarded via very fast packet DMA transfers directly to the RAM of the host computer.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130035036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Mappings for Symmetric Range-Limited Molecular Force Calculations on FPGAs","authors":"Chunshu Wu, Sahan Bandara, Tong Geng, Anqi Guo, Pouya Haghi, Vipin Sachdeva, W. Sherman, Martin C. Herbordt","doi":"10.1109/FPL57034.2022.00026","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00026","url":null,"abstract":"In N-body applications, the efficient evaluation of range-limited forces depends on applying certain constraints, including a cut-off radius and force symmetry (Newton's Third Law). When computing the pair-wise forces in parallel, finding the optimal mapping of particles and computations to memories and processors is surprisingly challenging, but can result in greatly reduced data movement and computation. Despite FPGAs having a distinct compute model (BRAMs/network/pipelines) from CPUs and ASICs, mappings on FPGAs have not previously been studied in depth: it was thought that the half-shell method was preferred. In this work, we find that the Manhattan method is surprisingly compatible with FPGA hardware. With the cache-overlapping technique proposed in this paper, the ultra-fine-grained data access demanded by the Manhattan method can be satisfied, despite the fact that the memory blocks on FPGAs appear to be insufficiently fine-grained. We further demonstrate that, compared to the traditional baseline half-shell method, approximately half of the filters (preprocessors) can be removed without performance degradation. For communication, the amount of data transferred can be reduced by 40% to 75% in the most common multi-FPGA scenarios. Moreover, data transfers are almost perfectly balanced along all directions, and the optimization requires only minimal hardware resources. The practical consequence is that nearly 2× to 4× the workload can be handled without upgrading the network connections between FPGAs. This is a critical finding given the relatively limited bandwidth available in many common accelerator boards and the strong-scaling applications to which FPGA clusters are being applied.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130991896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
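The two constraints the abstract names, a cut-off radius and force symmetry, can be shown in a few lines: each pair inside the cut-off is evaluated once, and Newton's third law lets the one result update both particles. This 1D sketch with an illustrative inverse-square pair force is only a conceptual model, not the paper's Manhattan-method mapping:

```python
def pair_force(xi, xj):
    """Signed inverse-square pair force (illustrative, not a real potential)."""
    r = xj - xi
    return 1.0 / (r * abs(r))

def range_limited_forces(positions, cutoff):
    n = len(positions)
    forces = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):                    # each pair visited once
            if abs(positions[j] - positions[i]) <= cutoff:
                f = pair_force(positions[i], positions[j])
                forces[i] += f                       # action ...
                forces[j] -= f                       # ... equal and opposite reaction
    return forces

# Only the (0, 1) pair lies within the cut-off; pair (0, 2) and (1, 2) are skipped.
forces = range_limited_forces([0.0, 1.0, 5.0], cutoff=2.0)
```

Exploiting the symmetry halves the force evaluations, but when particles live in different memory banks or on different FPGAs it creates exactly the data-movement problem whose mapping the paper optimizes.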
{"title":"Unleashing Parallelism in Elastic Circuits with Faster Token Delivery","authors":"Ayatallah Elakhras, Andrea Guerrieri, Lana Josipović, P. Ienne","doi":"10.1109/FPL57034.2022.00046","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00046","url":null,"abstract":"High-level synthesis (HLS) is the process of automatically generating circuits out of high-level language descriptions. Previous research has shown that dynamically scheduled HLS through elastic circuit generation is successful at exploiting parallelism in some important use-cases. Nevertheless, the literal conversion of a standard compiler's control-data flow graph into elastic circuits often produces circuits with notable resource demands and inferior performance. In this work, we present a methodology for generating more area- and timing-efficient elastic circuits. We show that our strategy results in significant area and timing improvements compared to previous circuit generation strategies.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114695847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the FPL PhD Forum and Demo Night Chairs","authors":"","doi":"10.1109/fpl57034.2022.00007","DOIUrl":"https://doi.org/10.1109/fpl57034.2022.00007","url":null,"abstract":"","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117206410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine","authors":"Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, D. Xu, Hong Wang, Rongzhang Zheng, Satyaprakash Pareek, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan","doi":"10.1109/FPL57034.2022.00041","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00041","url":null,"abstract":"Convolutional neural networks (CNNs) are widely used in computer vision applications nowadays. However, the trends towards higher accuracy and higher resolution produce larger networks, making computation and I/O bandwidth the key bottlenecks for performance. Xilinx's latest 7nm Versal ACAP platform with AI Engine (AIE) cores can deliver up to 8× the silicon compute density at 50% of the power consumption of traditional FPGA solutions. In this paper, we propose XVDPU: an AIE-based int8-precision CNN accelerator on Versal chips, scaling from 16 AIE cores (C16B1) to 320 AIE cores (C64B5, peak: 109.2 TOPs) to meet computation requirements. To resolve the I/O bottleneck, we adopt several techniques, such as multi-batch (MB), shared-weights (SHRWGT), feature-map-stationary (FMS) and long-load-weights (LLW), to improve data reuse and reduce I/O requirements. We further propose an Arithmetic Logic Unit (ALU) design for the accelerator that performs non-convolution layers, such as depthwise convolution, pooling and non-linear function layers, using the same logic resources, which better balances resource utilization, new-feature support and efficiency of the whole system. We have successfully deployed more than 100 CNN models with our accelerator. Our experimental results show that the 96-AIE-core (C32B3, peak: 32.76 TOPs) implementation achieves 1653 FPS for ResNet50 on the VCK190, which is 9.8× faster than the design on the ZCU102 running at 168.5 FPS with a peak of 3.6 TOPs. The 256-AIE-core (C32B8, peak: 87.36 TOPs) implementation further achieves 4050 FPS, better leveraging the computing power of Versal AIE devices. The powerful XVDPU will help enable many embedded-system applications, such as low-latency data centers, high-level ADAS and complex robotics.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116320835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
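XVDPU computes in int8 precision. A common way to get float tensors into int8 is symmetric per-tensor quantization with a single scale; the abstract does not specify XVDPU's exact scheme, so the round trip below is only a generic sketch of the idea:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to [-128, 127] via one scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.0, 0.25])
approx = dequantize(q, s)   # close to the inputs, within one quantization step
```

The accelerator's int8 datapath trades this small quantization error for much higher compute density, which is what lets the AIE cores reach the TOPs figures quoted above.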
{"title":"TRAM: An Open-Source Template-based Reconfigurable Architecture Modeling Framework","authors":"Yunhui Qiu, Yuhang Cao, Yuan Dai, Wenbo Yin, Lingli Wang","doi":"10.1109/FPL57034.2022.00021","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00021","url":null,"abstract":"Coarse-grained reconfigurable architecture (CGRA) is a promising accelerator design choice due to its high performance and power efficiency in computation- or data-intensive application domains, such as security, multimedia, digital signal processing, machine learning, and high-performance computing. A CGRA consists of coarse-grained processing elements (PEs) and interconnects, which determine the architecture's flexibility to support different applications and also significantly affect performance and power efficiency. Although multiple types of interconnects have been proposed, a parameterized unified model is still lacking. In this paper, we propose a flexible and scalable CGRA template with a novel interconnect model that unifies the typical neighbor-to-neighbor, switch-based, and FPGA-like interconnects. Furthermore, we present TRAM, an open-source template-based reconfigurable architecture modeling framework that integrates Chisel-based CGRA modeling, architecture intermediate representation (IR) and Verilog generation, dataflow graph (DFG) mapping, simulation, and evaluation. The mapping flow comprises graph-based placement and routing, critical-path-driven data synchronization, and simulated-annealing-based optimization. We evaluate the impact of the rich design parameters, which demonstrates the significance of such a flexible template for facilitating architecture optimization. Compared with related work, TRAM achieves a 4.1× smaller DFG latency and a faster mapping speed for both the 8×8 and 16×16 CGRAs. Moreover, TRAM attains a very high average PE utilization of 94.4% through architecture tuning.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127254677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
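TRAM's mapping flow includes simulated-annealing-based optimization. The core loop of such an optimizer, here placing DFG nodes onto a PE grid to minimize Manhattan wire length, looks as follows; the cost model, cooling schedule and node names are illustrative, not TRAM's actual implementation:

```python
import math
import random

def wirelength(placement, edges):
    """Total Manhattan distance over all DFG edges."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in edges)

def anneal(nodes, edges, grid, seed=0, steps=2000):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(slots)
    placement = dict(zip(nodes, slots))          # random initial placement
    cost, temp = wirelength(placement, edges), 2.0
    for _ in range(steps):
        a, b = rng.sample(nodes, 2)              # propose: swap two nodes
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, edges)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                      # accept (sometimes uphill)
        else:
            placement[a], placement[b] = placement[b], placement[a]  # revert
        temp *= 0.995                            # cool down
    return placement, cost

# A 4-node DFG chain placed on a 2x2 PE grid.
nodes = ["ld", "add", "mul", "sub"]
edges = [("ld", "add"), ("add", "mul"), ("mul", "sub")]
placement, cost = anneal(nodes, edges, grid=2)
```

Accepting occasional uphill swaps at high temperature is what lets the annealer escape local minima; the same principle applies at CGRA scale, just with TRAM's richer routing-aware cost function.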