{"title":"The Design Method of Logic Circuits based on the Voltage-Input Enhanced Scouting Logic Gates","authors":"Fan Liu, S. Zhang, Xiaole Cui","doi":"10.1109/FPL57034.2022.00031","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00031","url":null,"abstract":"The Enhanced Scouting Logic (ESL) is a memristive logic gate family with low sensitivity to resistance variation and high device endurance. This work studies the design methods of logic circuits based on the Voltage-Input Enhanced Scouting Logic (VIESL) gates. Both the single-array and dual-array synthesis methods are proposed. The read/write separation technique of VIESL gates facilitates the pipelined logic operations. The synthesis results on the benchmarks show that the circuit generated by the proposed single-array synthesis method has the best performance compared with that of its counterparts, and the dual-array synthesis method reduces the cell counts effectively.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127284428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct Device-to-Device Physical Page Migrations in Multi-FPGA Shared Virtual Memory Systems","authors":"Torben Kalkhof, A. Koch","doi":"10.1109/FPL57034.2022.00043","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00043","url":null,"abstract":"Shared Virtual Memory (SVM) is a proven approach to simplify the programming of heterogeneous computing systems. It enables a single virtual address space across all computing devices, even for systems having Non-Uniform Memory Accesses (NUMA) across devices. Access time spikes due to NUMA can be reduced, though, by performing physical page migrations in SVM. These migrations ensure high data locality by moving the underlying memory pages close to the computing device currently working on the contained data, and allow the devices to fault-in pages from remote to local memories autonomously. The main contribution of this work is the implementation of an open-source framework enabling scalable SVM for multi-FPGA architectures, and providing efficient device-to-device page migrations. We compare the runtime of on-demand and user-managed migrations, and examine three different communication mechanisms for the actual board-to-board data transfers. 
Our framework supports both low-latency and high-throughput operations, requiring, e.g., only 11.6 μs to migrate a single 4 kB page between physical memories on different boards, and 760 μs to migrate an entire 4 MB range of memory.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121074435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FSHMEM: Supporting Partitioned Global Address Space on FPGAs for Large-Scale Hardware Acceleration Infrastructure","authors":"Yashael F. Arthanto, David Ojika, Joo-Young Kim","doi":"10.1109/FPL57034.2022.00042","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00042","url":null,"abstract":"By providing highly efficient one-sided communication with globally shared memory space, Partitioned Global Address Space (PGAS) has become one of the most promising parallel computing models in high-performance computing (HPC). Meanwhile, FPGA is getting attention as an alternative compute platform for HPC systems with the benefit of custom computing and design flexibility. However, the exploration of PGAS has not been conducted on FPGAs, unlike the traditional message passing interface. This paper proposes FSHMEM, a software/hardware framework that enables the PGAS programming model on FPGAs. We implement the core functions of GASNet specification on FPGA for native PGAS integration in hardware, while its programming interface is designed to be highly compatible with legacy software. Our experiments show that FSHMEM achieves the peak bandwidth of 3813 MB/s, which is more than 95% of the theoretical maximum, outperforming the prior works by 9.5×. It records 0.35us and 0.59us latency for remote write and read operations, respectively. Finally, we conduct a case study on the two Intel D5005 FPGA nodes integrating Intel's deep learning accelerator. 
The two-node system programmed by FSHMEM achieves 1.94× and 1.98× speedup for matrix multiplication and convolution operation, respectively, showing its scalability potential for HPC infrastructure.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129157516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture","authors":"Chengming Zhang, Tong Geng, Anqi Guo, Jiannan Tian, Martin C. Herbordt, Ang Li, Dingwen Tao","doi":"10.1109/FPL57034.2022.00040","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00040","url":null,"abstract":"Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly-defined as having unstructured data, especially graphs. Compared with other Machine Learning (ML) modalities, the acceleration of Graph Neural Networks (GNNs) is more challenging due to the irregularity and heterogeneity derived from graph typologies. Existing efforts, however, have focused mainly on handling graphs' irregularity and have not studied their heterogeneity. To this end we propose H-GCN, a PL (Programmable Logic) and AIE (AI Engine) based hybrid accelerator that leverages the emerging heterogeneity of Xilinx Versal Adaptive Compute Acceleration Platforms (ACAPs) to achieve high-performance GNN inference. In particular, H-GCN partitions each graph into three subgraphs based on its inherent heterogeneity, and processes them using PL and AIE, respectively. To further improve performance, we explore the sparsity support of AIE and develop an efficient density-aware method to automatically map tiles of sparse matrix-matrix multiplication (SpMM) onto the systolic tensor array. 
Compared with state-of-the-art GCN accelerators, H-GCN achieves, on average, speedups of 1.1~2.3x.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122231722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EmuNoC: Hybrid Emulation for Fast and Flexible Network-on-Chip Prototyping on FPGAs","authors":"Y. Y. Tan, Felix Staudigl, Lukas Jünger, Anna Drewes, R. Leupers, J. Joseph","doi":"10.1109/FPL57034.2022.00058","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00058","url":null,"abstract":"Networks-on-Chips (NoCs) recently became widely used, from multi-core CPUs to edge-AI accelerators. Emulation on FPGAs promises to accelerate their RTL modeling compared to slow simulations. However, realistic test stimuli are challenging to generate in hardware for diverse applications. In other words, both a fast and flexible design framework is required. The most promising solution is hybrid emulation, in which parts of the design are simulated in software, and the other parts are emulated in hardware. This paper proposes a novel hybrid emulation framework called EmuNoC. We introduce a clock-synchronization method and software-only packet generation that improves the emulation speed by 36.3 × to 79.3 × over state-of-the-art frameworks while retaining the flexibility of a pure-software interface for stimuli simulation. We also increased the area efficiency to model up to an NoC with 169 routers on a single FPGA, while previous frameworks only achieved 64 routers.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126181188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Waveform Matching with a Digitizer at 10 GS/s","authors":"Jens Trautmann, Nikolaos Patsiatzis, Andreas Becher, J. Teich, S. Wildermann","doi":"10.1109/FPL57034.2022.00025","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00025","url":null,"abstract":"Side-Channel Analysis (SCA) requires the detection of the specific time frame within which Cryptographic Operations (COs) take place in the side-channel signal. In laboratory conditions with full control over the Device under Test (DuT), dedicated trigger signals can be implemented to indicate the start and end of COs. For real-world scenarios, waveform-matching techniques have been established which compare the side-channel signal with a template of the CO's pattern in real time to detect the CO in the side channel. State-of-the-art approaches are implemented on Field-Programmable Gate Arrays (FPGAs). However, current waveform-matching designs process the samples from Analog-to-Digital Converters (ADCs) sequentially and can only work with low sampling rates due to the limited clock speed of FPGAs. This makes it increasingly difficult to apply existing techniques on modern DuTs that operate with clock speeds in the GHz range. In this paper, we present a parallel waveform-matching architecture that is capable of performing waveform matching at the speed of fast ADCs. We implement the proposed architecture in a high-end FPGA-based digitizer and deploy it to detect AES COs from the side channel of a single-board computer operating at 1 GHz. 
Our implementation allows for waveform matching at 10 GS/s with high accuracy, thus offering a speedup of 50× compared to the fastest state-of-the-art implementation known to us.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124706207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraphScale: Scalable Bandwidth-Efficient Graph Processing on FPGAs","authors":"Jonas Dann, Daniel Ritter, H. Fröning","doi":"10.1109/FPL57034.2022.00016","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00016","url":null,"abstract":"Recent advances in graph processing on FPGAs promise to alleviate performance bottlenecks with irregular memory access patterns. Such bottlenecks challenge performance for a growing number of important application areas like machine learning and data analytics. While FPGAs denote a promising solution through flexible memory hierarchies and massive parallelism, we argue that current graph processing accelerators either use the off-chip memory bandwidth inefficiently or do not scale well across memory channels. In this work, we propose GraphScale, a scalable graph processing framework for FPGAs. For the first time, GraphScale combines multi-channel memory with asynchronous graph processing (i.e., for fast convergence on results) and a compressed graph representation (i.e., for efficient usage of memory bandwidth and reduced memory footprint). GraphScale solves common graph problems like breadth-first search, PageRank, and weakly-connected components through modular user-defined functions, a novel two-dimensional partitioning scheme, and a high-performance two-level crossbar design.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121310002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Half Title Page","authors":"","doi":"10.1109/fpl57034.2022.00001","DOIUrl":"https://doi.org/10.1109/fpl57034.2022.00001","url":null,"abstract":"","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130615196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSP-Packing: Squeezing Low-precision Arithmetic into FPGA DSP Blocks","authors":"J. Sommer, Akif Özkan, Oliver Keszocze, Jürgen Teich","doi":"10.1109/FPL57034.2022.00035","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00035","url":null,"abstract":"The number of Digital Signal Processor (DSP) resources available in Field Programmable Gate Arrays (FPGAs) is often quite limited. Therefore, full utilization of available DSP resources for the computationally intensive parts of an algorithm is paramount for optimizing the non-functional properties of an implementation (i.e., performance, power, and area). The DSPs available in Xilinx devices implement large bit width operators (i.e., a 48-bit accumulator or an 18 × 27 multiplier). However, using such a DSP for low-precision quantized data (as is common in image processing or machine learning applications) leaves the DSP resources underutilized. As a remedy, a method has been proposed to pack and compute four 4-bit multiplications on a single DSP in a single clock cycle. This paper presents a generalization of this scheme to arbitrary bit widths and number of multiplications. We also demonstrate that the previously proposed approach leads to errors (Mean Absolute Error (MAE) = 0.37). Furthermore, we explain where these errors come from and how they can be corrected. On top, we introduce a novel approximate method called “Overpacking” which allows to squeeze even more multiplications into a single DSP at the cost of small errors (MAE = 0.47). Overpacking allows to squeeze six 4-bit multiplications into a single DSP compared to just four in the literature. 
Finally, we introduce an alternative method for packing multiple small-bit width additions into a single 48-bit accumulator for use in applications such as Spiking Neural Networks.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127807322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAMO: Optimised Mapping of Convolutional Neural Networks to Streaming Architectures","authors":"Alexander Montgomerie-Corcoran, Zhewen Yu, C. Bouganis","doi":"10.1109/FPL57034.2022.00069","DOIUrl":"https://doi.org/10.1109/FPL57034.2022.00069","url":null,"abstract":"Significant effort has been placed on the development of toolflows that map Convolutional Neural Network (CNN) models to Field Programmable Gate Arrays (FPGAs) with the aim of automating the production of high performance designs for a diverse set of applications. However, within these toolflows, the problem of finding an optimal mapping is often overlooked, with the expectation that the end user will tune their generated hardware for their desired platform. This is particularly prominent within Streaming Architecture toolflows, where there is a large design space to be explored. In this work, we establish the framework SAMO: a Streaming Architecture Mapping Optimiser. SAMO exploits the structure of CNN models and the common features that exist in Streaming Architectures, and casts the mapping optimisation problem under a unified methodology. Furthermore, SAMO explicitly explores the re-configurability property of FPGAs, allowing the methodology to overcome mapping limitations imposed by certain toolflows under resource-constrained scenarios, as well as improve on the achievable throughput. Three optimisation methods - Brute-Force, Simulated Annealing and Rule-Based - have been developed in order to generate valid, high performance designs for a range of target platforms and CNN models. Results show that SAMO-optimised designs can achieve 4x-20x better performance compared to existing hand-tuned designs. 
The SAMO framework is open-source: https://github.com/AlexMontgomerie/samo.","PeriodicalId":380116,"journal":{"name":"2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125084949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}