{"title":"Application Specific Instruction-Set Processors for Machine Learning Applications","authors":"Muhammad Ali, D. Göhringer","doi":"10.1109/ICFPT56656.2022.9974187","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974187","url":null,"abstract":"Machine learning algorithms are becoming more complicated over time in order to solve complex problems. This is creating a gap for embedded system solutions, e.g., General-Purpose Processors (GPPs), Graphics Processing Units (GPUs), and hardware accelerators, for machine learning algorithms. To bridge the gap between the available solutions, Application Specific Instruction-set Processors (ASIPs) are a promising solution. ASIPs are processor designs with a tailored architecture for a specific application. This allows a better efficiency (performance-to-power ratio) for the application execution. Furthermore, it adds more flexibility to the system as compared with hardware accelerators. The scope of this Ph.D. work is to develop a RISC-V-based ASIP for machine learning applications and explore the design space of the optimizations. RISC-V is an open-source Instruction-Set Architecture (ISA) and allows the addition of custom application-specific instructions to the ISA. Within the scope of this work, three main design-space optimizations of ASIPs will be explored: a specialized application-specific ISA, vector processing (for data-level parallelism), and a multi-core architecture (for task-level parallelism). The 32-bit RISC-V architecture is used as the base platform. For vector processing, the RISC-V V-extension is utilized for a SIMD-based architecture called the Vector Processing Unit (VPU), which is coupled with a 32-bit RISC-V host CPU. A modular memory system is implemented to support a shared (bus-based) and a distributed (NoC-based) multi-core system. The memory system increases the flexibility and scalability of the system. Other known machine learning platforms are also explored and used as a comparison case.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116648805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kristiyan Manev, Joseph Powell, Kaspar Matas, Dirk Koch
{"title":"byteman: A Bitstream Manipulation Framework","authors":"Kristiyan Manev, Joseph Powell, Kaspar Matas, Dirk Koch","doi":"10.1109/ICFPT56656.2022.9974549","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974549","url":null,"abstract":"From better resource pooling for FPGA cloud providers to building dynamic execution pipelines at runtime, the capabilities of partial reconfiguration (PR) are waiting to be fully explored. However, the community still fails to materialize PR at scale, and FPGAs are only used as updatable ASICs, hence omitting the opportunities offered by dynamically reconfiguring FPGAs at runtime. This work proposes a resourceful FPGA bitstream manipulation framework. The proposed tool provides means for parsing, modification, and generation of bitstream files, and it has been open-sourced and demonstrated in a working system. As a distinguishing feature, it supports multi-die FPGAs (among the 106 Xilinx 7 Series, UltraScale, and UltraScale+ devices), and enables datacenter FPGAs to be used for relocatable PR. The versatile tool's built-in (dis)assembler allows for manual bitstream manipulations. Bundled with an efficient bitstream manipulation core, the efficacy is demonstrated by two case studies where we observe 58-377x higher bitstream merging throughput than a current state-of-the-art tool.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116158363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shikha Goel, Rajesh Kedia, Rijurekha Sen, M. Balakrishnan
{"title":"EXPRESS: CNN EXecution Time PREdiction for DPU DeSign Space Exploration","authors":"Shikha Goel, Rajesh Kedia, Rijurekha Sen, M. Balakrishnan","doi":"10.1109/ICFPT56656.2022.9974299","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974299","url":null,"abstract":"Deep learning Processor Units (DPUs) from Xilinx are design-time configurable CNN accelerators for FPGAs. We propose EXPRESS, which predicts the execution time of any given CNN on a DPU. EXPRESS incorporates the effect of bus connections into prediction. As a DPU is invoked by a host CPU to process a CNN layer by layer, EXPRESS considers both the CPU and the DPU execution time when predicting the end-to-end processing time. EXPRESS has an average prediction error of 2.2% and significantly outperforms the state of the art.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133970121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sultan S. Alqahtani, Yiqun Zhu, Qizhi Shi, Xiaolin Meng, Xinhua Wang
{"title":"A Highly Customizable and Efficient Hardware Implementation for Parallel Matrix Inversion","authors":"Sultan S. Alqahtani, Yiqun Zhu, Qizhi Shi, Xiaolin Meng, Xinhua Wang","doi":"10.1109/ICFPT56656.2022.9974569","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974569","url":null,"abstract":"This paper introduces an efficient and customizable FPGA-based architecture for parallel matrix inversion. The proposed customizable architecture adapts to different matrix sizes with low latency and effective resource utilization. The hardware resource usage is optimized by re-using the same multiplication units for different calculations. The architecture uses multiple multiplication units in parallel to perform the normalization step and then re-uses them for the elimination step. The performance of the proposed architecture is enhanced by maximizing parallelism and minimizing the sequential execution time of the division unit. Compared with other related works, the implementation results show that the proposed architecture is sufficiently flexible to support different matrix sizes with high parallel computing power. Additionally, the number of clock cycles and multiplication units of the proposed architecture is reduced proportionally to the increase in matrix size. The proposed architecture has been optimized for a Zynq xc7z045 FPGA and implemented using both single- and double-precision floating-point representations.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133473726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongdong Tang, Xuan Sun, Nan Guan, Tei-Wei Kuo, C. Xue
{"title":"$p$LPAQ: Accelerating LPAQ Compression on FPGA","authors":"Dongdong Tang, Xuan Sun, Nan Guan, Tei-Wei Kuo, C. Xue","doi":"10.1109/ICFPT56656.2022.9974593","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974593","url":null,"abstract":"In recent years, the demand for data storage space has increased dramatically due to the exponential growth of data volume. Data compression is of great significance since it saves data storage space and reduces data transfer demand. Compression algorithms based on statistical models have a much higher compression ratio than dictionary-based methods, but the high computational time cost of statistical modeling limits their wider application. In this paper, we introduce pLPAQ, an FPGA-based design of LPAQ, a powerful compression algorithm based on statistical models. A novel hardware accelerator is proposed to speed up LPAQ by fully utilizing the parallelism of the FPGA. Experimental results show that the proposed design can achieve a throughput of 12 MB/s on a Xilinx Virtex UltraScale+ XCVU9P card, 25x faster than executing on an AMD Ryzen R7 4800U at 2.8 GHz and 80x faster than the naive FPGA implementation on average.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133130597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware SAT Solver-based Area-efficient Accelerator for Autonomous Driving","authors":"Yusuke Inuma, Yuko Hara-Azumi","doi":"10.1109/ICFPT56656.2022.9974200","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974200","url":null,"abstract":"Today's embedded system applications, which consist of a variety of tasks, are becoming larger and more complex. Hence, when multiple tasks need to be accelerated, designing a dedicated accelerator for each task would be difficult on small devices due to the large area overhead. In this study, we propose an efficient accelerator for autonomous driving, which is the theme of a design competition held at the International Conference on Field-Programmable Technology. Focusing on two key tasks (path planning and object detection), we formulate each of them as a satisfiability (SAT) problem and use a hardware SAT solver as a common accelerator for these tasks. We present efficient problem formulation methods for solving these tasks on a small FPGA. Experimental results show the effectiveness of our work for these tasks.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127573517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CAPI-Precis: Towards a Compute-Centric Interface for Coherent Shared Memory Accelerators","authors":"A. Mughrabi, G. Byrd","doi":"10.1109/ICFPT56656.2022.9974504","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974504","url":null,"abstract":"Emerging shared memory accelerator interfaces promote a tighter coupling between traditional general-purpose processing cores and accelerator units through cache-coherence and shared virtual address space capabilities. However, different interface standards solving similar problems often require custom designs and optimizations depending on the adopted interface. This work introduces CAPI-Precis, an abstract layer between CAPI, a cache-coherent interface standard proposed by IBM, and the Accelerator Functional Unit (AFU). CAPI-Precis provides a Compute-Centric FIFO-based paradigm with the shared memory accelerator interface, hiding CAPI complexities and latency requirements in an abstract layer focusing on optimized, efficient, and scalable AFUs. Such a layer adapts to other shared memory interfaces, such as CCIX or CXL, with minimal overhead in area and performance while preserving the algorithm logic design.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122758132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Fast Sample Entropy Towards Biomedical Applications on FPGAs","authors":"Chao Chen, B. Silva, Jianqing Li, Chengyu Liu","doi":"10.1109/ICFPT56656.2022.9974323","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974323","url":null,"abstract":"Sample Entropy (SampEn) is an information entropy algorithm widely used for complexity analysis and chaos estimation in many applications. In particular, SampEn measures the complexity of time series by the conditional probability of the inner pattern. Unfortunately, the straightforward implementation of SampEn has quadratic time complexity, restricting its real-time analysis ability for health applications and long-term data analysis. Although researchers have proposed fast versions of SampEn to avoid unnecessary comparisons, they have not been accelerated yet due to their performance bottleneck in the complex similarity pair process. In this paper, we evaluate fast SampEn algorithms by employing multi-source biomedical signals on a Field-Programmable Gate Array (FPGA). Since fast SampEn algorithms based on a pre-sorting stage promise to outperform other SampEn algorithms, Lightweight SampEn based on Merge Sort is here implemented and optimized. Different types of optimizations, which can be generalized to similar Lightweight-based SampEn algorithms, are used to reduce the overall latency while the data throughput is increased. A load balancing strategy for multiple similarity pair modules is also proposed to solve the unbalanced loads, a bottleneck when increasing the execution parallelism of this type of algorithm. As a result, the proposed SampEn architecture runs 10 times faster than the fastest SampEn implementation on a modern CPU.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"47 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132604682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Gorgin, M. Gholamrezaei, D. Javaheri, Jeong-A. Lee
{"title":"An Energy-Efficient K-means Clustering FPGA Accelerator via Most-Significant Digit First Arithmetic","authors":"S. Gorgin, M. Gholamrezaei, D. Javaheri, Jeong-A. Lee","doi":"10.1109/ICFPT56656.2022.9974222","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974222","url":null,"abstract":"K-means clustering is the most well-known unsupervised learning method that partitions the input dataset into $K$ clusters based on the similarity between the data samples. In this paper, to achieve an energy-efficient implementation without sacrificing performance, we take advantage of massive parallelism and digit-level pipelining via FPGA and the most-significant digit first arithmetic. Having the result of the most-significant digits in advance provides the possibility of early termination for unnecessary computations and fetching just the required most-significant part of data points from memory. This early termination technique significantly increases performance and decreases energy consumption. Our experimental results from various datasets and comparisons with the state-of-the-art FPGA accelerators indicate that our proposed design has effectively reduced energy consumption without any performance loss.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127056831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quality & Generality: A Flexible FPGA Re-Clustering Technique to Improve Packing and Placement","authors":"Mohamed A. Elgammal, Vaughn Betz","doi":"10.1109/ICFPT56656.2022.9974325","DOIUrl":"https://doi.org/10.1109/ICFPT56656.2022.9974325","url":null,"abstract":"The Packing and Placement stages are two major steps in the FPGA backend flow which greatly affect the Quality-of-Results (QoR) of design implementation. While these problems have been extensively studied in the literature, most approaches have either sacrificed generality by targeting specific and simplified FPGAs with few “block packing” legality constraints, or sacrificed quality by making irreversible packing decisions early in the flow and hence constraining the optimizations available to the subsequent placement stage. In this paper, we propose a new (re-clustering API) that can be used to update the packed netlist at different points throughout the packing and placement stages. This API can be used in our proposed flow to improve the QoR while preserving the generality and flexibility of the flow and ensuring the legality of the solution for any proposed FPGA architecture.","PeriodicalId":239314,"journal":{"name":"2022 International Conference on Field-Programmable Technology (ICFPT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127518501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}