{"title":"A streaming hardware architecture for real-time SIFT feature extraction","authors":"Hector A. Li Sanchez, A. George","doi":"10.1109/ICFPT52863.2021.9609932","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609932","url":null,"abstract":"The Scale-Invariant Feature Transform (SIFT) is a feature extractor that serves as a key step in many computer-vision pipelines. Real-time operation based on a software-only approach is often infeasible, but FPGAs can be employed to parallelize execution and accelerate the application to meet latency requirements. In this study, we present a stream-based hardware acceleration architecture for SIFT feature extraction. Using a novel strategy to store pixels required for descriptor computation, the execution time needed to generate SIFT descriptors is greatly improved relative to previous designs. This strategy also enables further reduction of the execution time by introducing multiple processing elements (PEs) for computation of several SIFT descriptors in parallel. Additionally, the proposed architecture supports keypoint detection at an arbitrary number of octaves and allows for runtime configuration of various parameters. An FPGA implementation targeting the Xilinx Zynq-7045 system-on-chip (SoC) device is deployed to demonstrate the efficiency of the proposed architecture. 
In the target hardware, the resulting system is capable of processing images with a resolution of 1280 × 720 pixels at up to 150 FPS while maintaining modest resource utilization.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114938703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastCGRA: A Modeling, Evaluation, and Exploration Platform for Large-Scale Coarse-Grained Reconfigurable Arrays","authors":"Su Zheng, Kaisen Zhang, Yaoguang Tian, Wenbo Yin, Lingli Wang, Xuegong Zhou","doi":"10.1109/ICFPT52863.2021.9609928","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609928","url":null,"abstract":"Coarse-Grained Reconfigurable Arrays (CGRAs) provide sufficient flexibility in domain-specific applications with high hardware efficiency, which make CGRAs suitable for fast-evolving fields such as neural network acceleration and edge computing. To meet the requirement of the fast evolution, we propose FastCGRA, the modeling, mapping, and exploration platform for large-scale CGRAs. FastCGRA supports hierarchical architecture description and automatic switch module generation. Connectivity-aware packing and graph partition algorithms are designed to reduce the complexity of placement and routing. The graph homomorphism placement algorithm in FastCGRA enables efficient placement on large-scale CGRAs. The packing and placement algorithms cooperate with a negotiation-based routing algorithm to form an integral mapping procedure. FastCGRA can support the modeling and mapping of large-scale CGRAs with significantly higher placement and routing efficiency than existing platforms. The automatic switch module generation method can reduce the complexity of CGRA interconnection design. 
With these features, FastCGRA can boost the exploration of large-scale CGRAs.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115757638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In-Storage Computation of Histograms with differential privacy","authors":"Andrei Tosa, A. Hangan, G. Sebestyen, Z. István","doi":"10.1109/ICFPT52863.2021.9609899","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609899","url":null,"abstract":"Network-attached Smart Storage is becoming increasingly common in data analytics applications. It relies on processing elements, such as FPGAs, close to the storage medium to offload compute-intensive operations, reducing data movement across distributed nodes in the system. As a result, it can offer outstanding performance and energy efficiency. Modern data analytics systems are not only becoming more distributed, they are also increasingly focusing on privacy policy compliance. This means that, in the future, Smart Storage will have to offload more and more privacy-related processing. In this work, we explore how the computation of differentially private (DP) histograms, a basic building block of privacy-preserving analytics, can be offloaded to FPGAs. By performing DP aggregation on the storage side, untrusted clients can be allowed to query the data in aggregate form without risking the leakage of personally identifiable information. We prototype our idea by extending an FPGA-based distributed key-value store with three new components. First, a histogram module that processes values at 100Gbps line rate. Second, a random noise generator that adds noise to the final histogram according to the rules dictated by DP. 
Third, a mechanism to limit the rate at which key-value pairs can be used in histograms, to stay within the DP privacy budget.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131347501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Flexible High-Performance In-Network Processing of Hash Joins in Distributed Databases","authors":"John M. Wirth, Jaco A. Hofmann, Lasse Thostrup, Carsten Binnig, Andreas Koch","doi":"10.1109/ICFPT52863.2021.9609804","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609804","url":null,"abstract":"Programmable switches allow to offload specific processing tasks into the network and promise multi-Tbit/s throughput. One major goal when moving computation to the network is typically to reduce the volume of network traffic, and thus improve the overall performance. In this manner, programmable switches are increasingly used, both in research as well as in industry applications, for various scenarios, including statistics gathering, in-network consensus protocols, and more. However, the currently available programmable switches suffer from several practical limitations. One important restriction is the limited amount of available memory, making them unsuitable for stateful operations such as Hash Joins in distributed databases. In previous work, an FPGA-based In-Network Hash Join accelerator was presented, initially using DDR-DRAM to hold the state. In a later iteration, the hash table was moved to on-chip HBM-DRAM to improve the performance even further. However, while very fast, the size of the joins in this setup was limited by the relatively small amount of available HBM. In this work, we heterogeneously combine DDR-DRAM and HBM memories to support both larger joins and benefit from the far faster and more parallel HBM accesses. In this manner, we are able to improve the performance by a factor of 3x compared to the previous HBM-based work. 
We also introduce additional configuration parameters, supporting a more flexible adaptation of the underlying hardware architecture to the different join operations required by a concrete use-case.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130240373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Physical Page Migrations in Shared Virtual Memory Reconfigurable Computing Systems","authors":"Torben Kalkhof, Andreas Koch","doi":"10.1109/ICFPT52863.2021.9609831","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609831","url":null,"abstract":"Shared Virtual Memory (SVM) can considerably simplify the application development for FPGA-accelerated computers, as it allows the seamless passing of virtually addressed pointers across the hardware/software boundary. Especially applications operating on complex pointer-based data structures can profit from this approach, as SVM can often avoid having to copy the entire data to FPGA memory, while performing pointer relocations in the process. Many FPGA-accelerated computers, especially in a data center setting, employ PCIe-attached boards that have FPGA-local memory in the form of on-chip HBM or on-board DRAM. Accesses to this local memory are much faster than going to the host memory via PCIe. Thus, even in the presence of SVM, it is desirable to be able to move the physical memory pages holding frequently accessed data closest to the compute unit that is operating on them. This capability is called physical page migration. The main contribution of this work is an open-source framework which provides SVM with physical page migration capabilities to PCIe-attached FPGA cards. 
We benchmark both fully automatic on-demand and user-managed explicit migration modes, and show that for suitable use-cases, the performance of migrations cannot just match that of conventional DMA copy-based accelerator operations, but may even exceed it by overlapping computations and migrations.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133087760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource-saving FPGA Implementation of the Satisfiability Problem Solver: AmoebaSATslim","authors":"Yi Yan, H. Amano, M. Aono, Kaori Ohkoda, Shingo Fukuda, Kenta Saito, S. Kasai","doi":"10.1109/ICFPT52863.2021.9609882","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609882","url":null,"abstract":"The Boolean satisfiability problem (SAT) is an NP-complete combinatorial optimization problem, where fast SAT solvers are useful for various smart society applications. Since these edge-oriented applications require time-critical control, a high-speed SAT solver on FPGA is a promising approach. Here the authors propose a novel FPGA implementation of a bio-inspired stochastic local search algorithm called ‘AmoebaSAT’ on a Zynq board. Previous studies on FPGA-AmoebaSATs tackled relatively smaller-sized 3-SAT instances with a few hundred variables and found the solutions in several milliseconds. These implementations, however, adopted an instance-specific approach, which requires synthesis of the FPGA configuration every time the targeted instance is altered. In this paper, a slimmed version of AmoebaSAT named ‘AmoebaSATslim,’ which omits the most resource-consuming part of interactions among variables, is proposed. The FPGA-AmoebaSATslim makes it possible to tackle significantly larger-sized 3-SAT instances, accepting 30,000 variables with 130,800 clauses. 
It achieves up to approximately 24 times faster execution speed than the software-AmoebaSATslim implemented on a CPU of the x86 server.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130786852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Profiling-Based Control-Flow Reduction in High-Level Synthesis","authors":"Austin Liolli, Omar Ragheb, J. Anderson","doi":"10.1109/ICFPT52863.2021.9609816","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609816","url":null,"abstract":"Control flow in a program can be represented in a directed graph, called the control flow graph (CFG). Nodes in the graph represent straight-line segments of code, basic blocks, and directed edges between nodes correspond to transfers of control. We present a methodology to selectively reduce control flow by collapsing basic blocks into their parent blocks, revealing increased instruction-level parallelism to a high-level synthesis (HLS) scheduler, thereby raising circuit performance. We evaluate our approach within an HLS tool that allows a C-language software program to be automatically synthesized into a hardware circuit, using the CHStone benchmark suite [1], targeting an Intel Cyclone V FPGA. For individual benchmark circuits we observe cycle count reductions up to 20.7% and wall-clock time reductions up to 22.6%, and 6% on average.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116609884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An area-efficient multiply-accumulation architecture and implementations for time-domain neural processing","authors":"Ichiro Kawashima, Yuichi Katori, T. Morie, H. Tamukoh","doi":"10.1109/ICFPT52863.2021.9609809","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609809","url":null,"abstract":"In our work, a new area-efficient multiply-accumulation scheme for time-domain neural processing named differential multiply-accumulation is proposed. Our new scheme reduces hardware resources utilization of multiply-accumulation with suppressing the increasing computational time resulting from the time-multiplexing. As a result, 2,048 neurons of fully connected CBM and RC-CBM were synthesized for a single field-programmable gate array (FPGA).","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133764589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing Memory Efficiency of Hash-Based Pattern Matching for High-Speed Networks","authors":"Tomás Fukac, J. Matoušek, J. Korenek, Lukás Kekely","doi":"10.1109/ICFPT52863.2021.9609859","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609859","url":null,"abstract":"Increasing speed of network links continuously pushes up requirements on the performance of network security and monitoring systems, including their typical representative and its core function: an intrusion detection system (IDS) and pattern matching. To allow the operation of IDS applications like Snort and Suricata in networks supporting throughput of 100Gbps or even more, a recently proposed pre-filtering architecture approximates exact pattern matching using hash-based matching of short strings that represent a given set of patterns. This architecture can scale supported throughput by adjusting the number of parallel hash functions and on-chip memory blocks utilized in the implementation of a hash table. Since each hash function can address every memory block, scaling throughput also increases the total capacity of the hash table. Nevertheless, the original architecture utilizes the available capacity of the hash table inefficiently. We therefore propose three optimization techniques that either reduce the amount of information stored in the hash table or increase its achievable occupancy. Moreover, we also design modifications of the architecture that enable resource-efficient utilization of all three optimization techniques together in synergy. Compared to the original pre-filtering architecture, combined use of the proposed optimizations in the 100Gbps scenario increases the achievable capacity for short strings by three orders of magnitude. 
It also reduces the utilization of FPGA logic resources to only a third.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121990730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A unified accelerator design for LiDAR SLAM algorithms for low-end FPGAs","authors":"K. Sugiura, Hiroki Matsutani","doi":"10.1109/ICFPT52863.2021.9609886","DOIUrl":"https://doi.org/10.1109/ICFPT52863.2021.9609886","url":null,"abstract":"A fast and reliable LiDAR (Light Detection and Ranging) SLAM (Simultaneous Localization and Mapping) system is the growing need for autonomous mobile robots, which are used for a variety of tasks such as indoor cleaning, navigation, and transportation. To bridge the gap between the limited processing power on such robots and the high computational requirement of the SLAM system, in this paper we propose a unified accelerator design for 2D SLAM algorithms on resource-limited FPGA devices. As scan matching is the heart of these algorithms, the proposed FPGA-based accelerator utilizes scan matching cores on the programmable logic part and users can switch the SLAM algorithms to adapt to performance requirements and environments without modifying and re-synthesizing the logic part. We integrate the accelerator into two representative SLAM algorithms, namely particle filter-based and graph-based SLAM. They are evaluated in terms of resource utilization, processing speed, and quality of output results with various real-world datasets, highlighting their algorithmic characteristics. 
Experiment results on a Pynq-Z2 board demonstrate that scan matching is accelerated by 13.67–14.84x, improving the overall performance of particle filter-based and graph-based SLAM by 4.03–4.67x and 3.09–4.00x respectively, while maintaining the accuracy comparable to their software counterparts and even state-of-the-art methods.","PeriodicalId":376220,"journal":{"name":"2021 International Conference on Field-Programmable Technology (ICFPT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127075435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}