ACM Transactions on Reconfigurable Technology and Systems最新文献_第9页

BLOOP: Boolean Satisifiability-based Optimized Loop Pipelining blop:基于布尔满意度的优化循环流水线

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-30 DOI: https://dl.acm.org/doi/10.1145/3599972

Nicolai Fiege, Peter Zipf

{"title":"BLOOP: Boolean Satisifiability-based Optimized Loop Pipelining","authors":"Nicolai Fiege, Peter Zipf","doi":"https://dl.acm.org/doi/10.1145/3599972","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3599972","url":null,"abstract":"Modulo scheduling is the premier technique for throughput maximization of loops in high-level synthesis by interleaving consecutive loop iterations. The number of clock cycles between data insertions is called initiation interval (II). For throughput maximization, this value should be as low as possible; therefore its minimization is the main optimization goal. Despite its long historical existence, modulo scheduling always remained a relevant research topic over the last years with many exact and heuristic algorithms available in literature. Nevertheless, we are able to leverage the scalability of modern Boolean Satisfiability (SAT) solvers to outperform state-of-the-art ILP-based algorithms for latency-optimal modulo scheduling for both integer and rational IIs. Our algorithm is able to compute valid modulo schedules for the whole CHStone and MachSuite benchmark suites, with 99% of the solutions being proven to be throughput-optimal for a timeout of only 10 min per candidate II. For various time limits, not a single tested scheduler from the state-of-the-art is able to compute more verified optimal solutions or even a single schedule with a higher throughput than our proposed approach. Using an HLS toolflow we show that our algorithm can be effectively used to generate Pareto-optimal FPGA implementations regarding throughput and resource usage.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"25 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138541643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Increasing the Robustness of TERO-TRNGs against Process Variation 提高tero - trng对工艺变化的鲁棒性

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-23 DOI: https://dl.acm.org/doi/10.1145/3597418

Christian Skubich, Peter Reichel, Marc Reichenbach

引用次数: 0

Increasing the Robustness of TERO-TRNGs Against Process Variation 提高tero - trng对工艺变化的鲁棒性

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-23 DOI: 10.1145/3597418

Christian Skubich, Peter Reichel, M. Reichenbach

引用次数: 0

An FPGA Accelerator for Genome Variant Calling 基因组变异调用的FPGA加速器

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-22 DOI: https://dl.acm.org/doi/10.1145/3595297

Tiancheng Xu, Scott Rixner, Alan L. Cox

引用次数: 0

Exploring FPGA Switch-Blocks without Explicit Pattern Listing 探索FPGA开关块没有明确的模式列表

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-17 DOI: 10.1145/3597417

Stefan Nikolic, P. Ienne

{"title":"Exploring FPGA Switch-Blocks without Explicit Pattern Listing","authors":"Stefan Nikolic, P. Ienne","doi":"10.1145/3597417","DOIUrl":"https://doi.org/10.1145/3597417","url":null,"abstract":"Increased lower metal resistance makes physical aspects of Field-Programmable Gate Array (FPGA) switch-blocks more relevant than before. The need to navigate a design space where each individual switch can have significant impact on the FPGA’s performance in turn makes automated switch-pattern exploration techniques increasingly appealing. However, most existing exploration techniques have a fundamental limitation—they use the CAD tools as a black box to evaluate the performance of explicitly listed switch-patterns. Given the time needed to route a modern circuit on a single architecture, the number of switch-patterns that can be explicitly tested quickly becomes negligible compared to the size of the design space. This paper presents a technique that removes this fundamental limitation by making the entire design space visible to the router and letting it choose the switches to be added to the pattern, based on the requirements of the circuits being routed. The key to preventing the router from selecting arbitrary switches that would render the final pattern excessively large is to apply the same negotiation principle used by the router to remove congestion, just in the opposite direction, to make the signals reach a consensus on which switches are worthy of being included in the final switch-pattern.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45928886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Empirical Approach to Enhance Performance for Scalable CORDIC-Based Deep Neural Networks 一种提高基于可扩展CORDIC的深度神经网络性能的经验方法

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-05-08 DOI: 10.1145/3596220

Gopal R. Raut, Saurabh Karkun, S. Vishvakarma

{"title":"An Empirical Approach to Enhance Performance for Scalable CORDIC-Based Deep Neural Networks","authors":"Gopal R. Raut, Saurabh Karkun, S. Vishvakarma","doi":"10.1145/3596220","DOIUrl":"https://doi.org/10.1145/3596220","url":null,"abstract":"Practical implementation of deep neural networks (DNNs) demands significant hardware resources, necessitating high computational power and memory bandwidth. While existing field-programmable gate array (FPGA)–based DNN accelerators are primarily optimized for fast single-task performance, cost, energy efficiency, and overall throughput are crucial considerations for their practical use in various applications. This article proposes a performance-centric pipeline Coordinate Rotation Digital Computer (CORDIC)–based MAC unit and implements a scalable CORDIC-based DNN architecture that is area- and power-efficient and has high throughput. The CORDIC-based neuron engine uses bit-rounding to maintain input-output precision and minimal hardware resource overhead. The results demonstrate the versatility of the proposed pipelined MAC, which operates at 460 MHz and allows for higher network throughput. A software-based implementation platform evaluates the proposed MAC operation’s accuracy for more extensive neural networks and complex datasets. The DNN accelerator with parameterized and modular layer-multiplexed architecture is designed. Empirical evaluation through Pareto analysis is used to improve the efficiency of DNN implementations by fixing the arithmetic precision and optimal pipeline stages. The proposed architecture utilizes layer-multiplexing, a technique that effectively reuses a single DNN layer to enhance efficiency while maintaining modularity and adaptability for integrating various network configurations. The proposed CORDIC MAC-based DNN architecture is scalable for any bit-precision network size, and the DNN accelerator is prototyped using the Xilinx Virtex-7 VC707 FPGA board, operating at 66 MHz. The proposed design does not use any Xilinx macros, making it easily adaptable for ASIC implementation. Compared with state-of-the-art designs, the proposed design reduces resource use by 45% and power consumption by 4× without sacrificing performance. The accelerator is validated using the MNIST dataset, achieving 95.06% accuracy, only 0.35% less than other cutting-edge implementations.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":"1 - 32"},"PeriodicalIF":2.3,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48649357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration RapidStream 2.0:通过部分重构实现延迟不敏感FPGA设计的自动并行实现

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-04-26 DOI: 10.1145/3593025

Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong

{"title":"RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration","authors":"Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong","doi":"10.1145/3593025","DOIUrl":"https://doi.org/10.1145/3593025","url":null,"abstract":"FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7 × reduction in compile time and up to 1.3 × increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on Xilinx U280 FPGA, we observed 5-7 × compile time reduction and 1.3 × frequency increase.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48513829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

NAPOLY: A Non-deterministic Automata Processor OverLaY 非确定性自动机处理器叠加

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-04-24 DOI: 10.1145/3593586

Rasha Karakchi, J. Bakos

{"title":"NAPOLY: A Non-deterministic Automata Processor OverLaY","authors":"Rasha Karakchi, J. Bakos","doi":"10.1145/3593586","DOIUrl":"https://doi.org/10.1145/3593586","url":null,"abstract":"Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the core of many big data applications. Recent efforts to develop Domain-Specific Architectures (DSAs) for DFA/NFA have taken divergent approaches, but achieving consistent throughput for arbitrarily-large pattern sets, state activation rates, and pattern match rates remains a challenge. In this article, we present NAPOLY (Non-Deterministic Automata Processor OverLaY), an FPGA overlay and associated compiler. A common limitation of prior efforts is a limit on NFA size for achieving the advertised throughput. NAPOLY is optimized for fast re-programming to permit practical time-division multiplexing of the hardware and permit high asymptotic throughput for NFAs of unlimited size, unlimited state activation rate, and high pattern reporting rate. NAPOLY also allows for offline generation of configurations having tradeoffs between state capacity and transition capacity. In this article, we (1) evaluate NAPOLY using benchmarks packaged in the ANMLZoo benchmark suite, (2) evaluate the use of an SAT solver for allocating physical resources, and (3) compare NAPOLY’s performance against existing solutions. NAPOLY performs most favorably on larger benchmarks, benchmarks with higher state activation frequency, and benchmarks with higher reporting frequency. NAPOLY outperforms the fastest of the CPU and GPU implementations in 10 out of 12 benchmarks.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 25"},"PeriodicalIF":2.3,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41485080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Reconfigurable Architecture for Real-time Event-based Multi-Object Tracking 基于事件的实时多目标跟踪的可重构体系结构

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-04-21 DOI: 10.1145/3593587

Yizhao Gao, Song Wang, Hayden Kwok-Hay So

{"title":"A Reconfigurable Architecture for Real-time Event-based Multi-Object Tracking","authors":"Yizhao Gao, Song Wang, Hayden Kwok-Hay So","doi":"10.1145/3593587","DOIUrl":"https://doi.org/10.1145/3593587","url":null,"abstract":"Although advances in event-based machine vision algorithms have demonstrated unparalleled capabilities in performing some of the most demanding tasks, their implementations under stringent real-time and power constraints in edge systems remain a major challenge. In this work, a reconfigurable hardware-software architecture called REMOT, which performs real-time event-based multi-object tracking on FPGAs, is presented. REMOT performs vision tasks by defining a set of actions over attention units (AUs). These actions allow AUs to track an object candidate autonomously by adjusting its region of attention, and allow information gathered by each AU to be used for making algorithmic-level decisions. Taking advantage of this modular structure, algorithm-architecture codesign can be performed by implementing different parts of the algorithm in either hardware or software for different tradeoffs. Results show that REMOT can process 0.43–2.91 million events per second at 1.75–5.45 watts. Compared with the software baseline, our implementation achieves up to 44 times higher throughput and 35.4 times higher power efficiency. Migrating the Merge operation to hardware further reduces the worst-case latency to be 95 times shorter than the software baseline. By varying the AU configuration and operation, a reduction of 0.59–0.77mW per AU on the programmable logic has also been demonstrated.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":2.3,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44050130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Stream Aggregation with Compressed Sliding Windows 流聚合与压缩滑动窗口

IF 2.3 4区计算机科学

ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-04-05 DOI: 10.1145/3590774

Prajith Ramakrishnan Geethakumari, I. Sourdis

{"title":"Stream Aggregation with Compressed Sliding Windows","authors":"Prajith Ramakrishnan Geethakumari, I. Sourdis","doi":"10.1145/3590774","DOIUrl":"https://doi.org/10.1145/3590774","url":null,"abstract":"High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window during processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating-point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":"16 1","pages":"1 - 28"},"PeriodicalIF":2.3,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47102285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0