ACM Transactions on Reconfigurable Technology and Systems (TRETS)最新文献

筛选
英文 中文
Automata Processing in Reconfigurable Architectures 可重构结构中的自动机处理
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-05-17 DOI: 10.1145/3314576
Chunkun Bo, V. Dang, Ted Xie, J. Wadden, M. Stan, K. Skadron
{"title":"Automata Processing in Reconfigurable Architectures","authors":"Chunkun Bo, V. Dang, Ted Xie, J. Wadden, M. Stan, K. Skadron","doi":"10.1145/3314576","DOIUrl":"https://doi.org/10.1145/3314576","url":null,"abstract":"We present a general automata processing framework on FPGAs, which generates an RTL kernel for automata processing together with an AXI and PCIe based I/O circuitry. We implement the framework on both local nodes and cloud platforms (Amazon AWS and Nimbix) with novel features. A full performance comparison of the proposed framework is conducted against state-of-the-art automata processing engines on CPUs, GPUs, and Micron’s Automata Processor using the ANMLZoo benchmark suite and some real-world datasets. Results show that FPGAs enable extremely high-throughput automata processing compared to von Neumann architectures. We also collect the resource utilization and power consumption on the two cloud platforms, and find that the I/O circuitry consumes most of the hardware resources and power. Furthermore, we propose a fast, symbol-only reconfiguration mechanism based on the framework for large pattern sets that cannot fit on a single device and need to be partitioned. The proposed method supports multiple passes of the input stream and reduces the re-compilation cost from hours to seconds.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131572537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Exact and Practical Modulo Scheduling for High-Level Synthesis 精确实用的高阶合成模调度
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-05-06 DOI: 10.1145/3317670
J. Oppermann, Melanie Reuter-Oppermann, Lukas Sommer, A. Koch, O. Sinnen
{"title":"Exact and Practical Modulo Scheduling for High-Level Synthesis","authors":"J. Oppermann, Melanie Reuter-Oppermann, Lukas Sommer, A. Koch, O. Sinnen","doi":"10.1145/3317670","DOIUrl":"https://doi.org/10.1145/3317670","url":null,"abstract":"Loop pipelining is an essential technique in high-level synthesis to increase the throughput and resource utilisation of field-programmable gate array--based accelerators. It relies on modulo schedulers to compute an operator schedule that allows subsequent loop iterations to overlap partially when executed while still honouring all precedence and resource constraints. Modulo schedulers face a bi-criteria problem: minimise the initiation interval (II; i.e., the number of timesteps after which new iterations are started) and minimise the schedule length. We present Moovac, a novel exact formulation that models all aspects (including the II minimisation) of the modulo scheduling problem as a single integer linear program, and discuss simple measures to prevent excessive runtimes, to challenge the old preconception that exact modulo scheduling is impractical. We substantiate this claim by conducting an experimental study covering 188 loops from two established high-level synthesis benchmark suites, four different time limits, and three bounds for the schedule length, to compare our approach against a highly tuned exact formulation and a state-of-the-art heuristic algorithm. In the fastest configuration, an accumulated runtime of under 16 minutes is spent on scheduling all loops, and proven optimal IIs are found for 179 test instances.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"444 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132081399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 10
[DL] A Survey of FPGA-based Neural Network Inference Accelerators [DL]基于fpga的神经网络推理加速器综述
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-03-28 DOI: 10.1145/3289185
Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, Huazhong Yang
{"title":"[DL] A Survey of FPGA-based Neural Network Inference Accelerators","authors":"Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, Huazhong Yang","doi":"10.1145/3289185","DOIUrl":"https://doi.org/10.1145/3289185","url":null,"abstract":"Recent research on neural networks has shown a significant advantage in machine learning over traditional algorithms based on handcrafted features and models. Neural networks are now widely adopted in regions like image, speech, and video recognition. But the high computation and storage complexity of neural network inference poses great difficulty on its application. It is difficult for CPU platforms to offer enough computation capacity. GPU platforms are the first choice for neural network processes because of its high computation capacity and easy-to-use development frameworks. However, FPGA-based neural network inference accelerator is becoming a research topic. With specifically designed hardware, FPGA is the next possible solution to surpass GPU in speed and energy efficiency. Various FPGA-based accelerator designs have been proposed with software and hardware optimization techniques to achieve high speed and energy efficiency. In this article, we give an overview of previous work on neural network inference accelerators based on FPGA and summarize the main techniques used. An investigation from software to hardware, from circuit level to system level is carried out to complete analysis of FPGA-based neural network inference accelerator design and serves as a guide to future work.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"929 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123060897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 149
FeatherNet FeatherNet
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-03-28 DOI: 10.1145/3306202
Raghid Morcel, Hazem M. Hajj, M. Saghir, Haitham Akkary, H. Artail, R. Khanna, A. Keshavamurthy
{"title":"FeatherNet","authors":"Raghid Morcel, Hazem M. Hajj, M. Saghir, Haitham Akkary, H. Artail, R. Khanna, A. Keshavamurthy","doi":"10.1145/3306202","DOIUrl":"https://doi.org/10.1145/3306202","url":null,"abstract":"Convolutional Neural Network (ConvNet or CNN) algorithms are characterized by a large number of model parameters and high computational complexity. These two requirements have made it challenging for implementations on resource-limited FPGAs. The challenges are magnified when considering designs for low-end FPGAs. While previous work has demonstrated successful ConvNet implementations with high-end FPGAs, this article presents a ConvNet accelerator design that enables the implementation of complex deep ConvNet architectures on resource-constrained FPGA platforms aimed at the IoT market. We call the design “FeatherNet” for its light resource utilization. The implementations are VHDL-based providing flexibility in design optimizations. As part of the design process, new methods are introduced to address several design challenges. The first method is a novel stride-aware graph-based method targeted at ConvNets that aims at achieving efficient signal processing with reduced resource utilization. The second method addresses the challenge of determining the minimal precision arithmetic needed while preserving high accuracy. For this challenge, we propose variable-width dynamic fixed-point representations combined with a layer-by-layer design-space pruning heuristic across the different layers of the deep ConvNet model. The third method aims at achieving a modular design that can support different types of ConvNet layers while ensuring low resource utilization. For this challenge, we propose the modules to be relatively small and composed of computational filters that can be interconnected to build an entire accelerator design. These model elements can be easily configured through HDL parameters (e.g., layer type, mask size, stride, etc.) to meet the needs of specific ConvNet implementations and thus they can be reused to implement a wide variety of ConvNet architectures. The fourth method addresses the challenge of design portability between two different FPGA vendor platforms, namely, Intel/Altera and Xilinx. For this challenge, we propose to instantiate the device-specific hardware blocks needed in each computational filter, rather than relying on the synthesis tools to infer these blocks, while keeping track of the similarities and differences between the two platforms. We believe that the solutions to these design challenges further advance knowledge as they can benefit designers and other researchers using similar devices or facing similar challenges. Our results demonstrated the success of addressing the design challenges and achieving low (30%) resource utilization for the low-end FPGA platforms: Zedboard and Cyclone V. The design overcame the limitation of designs targeted for high-end platforms and that cannot fit on low-end IoT platforms. Furthermore, our design showed superior performance results (measured in terms of [Frame/s/W] per Dollar) compared to high-end optimized designs.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131306158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
HopliteBuf
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-02-20 DOI: 10.1145/3375899
Tushar Garg, Saud Wasly, R. Pellizzoni, Nachiket Kapre
{"title":"HopliteBuf","authors":"Tushar Garg, Saud Wasly, R. Pellizzoni, Nachiket Kapre","doi":"10.1145/3375899","DOIUrl":"https://doi.org/10.1145/3375899","url":null,"abstract":"HopliteBuf is a deflection-free, low-cost, and high-speed FPGA overlay Network-on-chip (NoC) with stall-free buffers. It is an FPGA-friendly 2D unidirectional torus topology built on top of HopliteRT overlay NoC. The stall-free buffers in HopliteBuf are supported by static analysis tools based on network calculus that help determine worst-case FIFO occupancy bounds for a prescribed workload. We implement these FIFOs using cheap LUT SRAMs (Xilinx SRL32s and Intel MLABs) to reduce cost. HopliteBuf is a hybrid microarchitecture that combines the performance benefits of conventional buffered NoCs by using stall-free buffers with the cost advantages of deflection-routed NoCs by retaining the lightweight unidirectional torus topology structure. We present two design variants of the HopliteBuf NoC: (1) single corner-turn FIFO (W → S) and (2) dual corner-turn FIFO (W → S+N). The single corner-turn (W → S) design is simpler and only introduces a buffering requirement for packets changing dimension from the X ring to the downhill Y ring (or West to South). The dual corner-turn variant requires two FIFOs for turning packets going downhill (W → S) as well as uphill (W → N). The dual corner-turn design overcomes the mathematical analysis challenges associated with single corner-turn designs for communication workloads with cyclic dependencies between flow traversal paths at the expense of a small increase in resource cost. Our static analysis delivers bounds that are not only better (in latency) than HopliteRT but also tighter by 2−3×. Across 100 randomly generated flowsets mapped to a 5×5 system size, HopliteBuf is able to route a larger fraction of these flowsets with <128-deep FIFOs, boost worst-case routing latency by ≈ 2× for mutually feasible flowsets, and support a 10% higher injection rate than HopliteRT. At 20% injection rates, HopliteRT is only able to route 1--2% of the flowsets, while HopliteBuf can deliver 40--50% sustainability. When compared to the W → Sbkp backpressure-based router, we observe that our HopliteBuf solution offers 25--30% better feasibility at 30--40% lower LUT cost.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126559886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
FlexSaaS
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-02-17 DOI: 10.1145/3301409
Shijie Cao, Lanshun Nie, D. Zhan, Wenqiang Wang, Ningyi Xu, Ramashis Das, Ming Wu, Lintao Zhang, Derek Chiou
{"title":"FlexSaaS","authors":"Shijie Cao, Lanshun Nie, D. Zhan, Wenqiang Wang, Ningyi Xu, Ramashis Das, Ming Wu, Lintao Zhang, Derek Chiou","doi":"10.1145/3301409","DOIUrl":"https://doi.org/10.1145/3301409","url":null,"abstract":"Web search engines deploy large-scale selection services on CPUs to identify a set of web pages that match user queries. An FPGA-based accelerator can exploit various levels of parallelism and provide a lower latency, higher throughput, more energy-efficient solution than commodity CPUs. However, maintaining such a customized accelerator in a commercial search engine is challenging because selection services are changed often. This article presents our design for FlexSaaS (Flexible Selection as a Service), an FPGA-based accelerator for web search selection. To address efficiency and flexibility challenges, FlexSaaS abstracts computing models and separates memory access from computation. Specifically, FlexSaaS (i) contains a reconfigurable number of matching processors that can handle various possible query plans, (ii) decouples index stream reading from matching computation to fetch and decode index files, and (iii) includes a universal memory accessor that hides the complex memory hierarchy and reduces host data access latency. Evaluated on FPGAs in the selection service of a commercial web search--the Bing web search engine—FlexSaaS can be evolved quickly to adapt to new updates. Compared to the software baseline, FlexSaaS on Arria 10 reduces average latency by 30% and increases throughput by 1.5×.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129004163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms 现代异构CPU-FPGA平台的微架构深入分析
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-02-17 DOI: 10.1145/3294054
Young-kyu Choi, J. Cong, Zhenman Fang, Y. Hao, Glenn D. Reinman, Peng Wei
{"title":"In-Depth Analysis on Microarchitectures of Modern Heterogeneous CPU-FPGA Platforms","authors":"Young-kyu Choi, J. Cong, Zhenman Fang, Y. Hao, Glenn D. Reinman, Peng Wei","doi":"10.1145/3294054","DOIUrl":"https://doi.org/10.1145/3294054","url":null,"abstract":"Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs that can be reconfigured to accelerate a broad class of applications with orders-of-magnitude performance/watt gains, are attracting increased attention from both academia and industry. As a consequence, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This article aims to address this challenge by determining which microarchitectural characteristics affect performance, and in what ways. Specifically, we conduct a quantitative comparison and an in-depth analysis on five state-of-the-art CPU-FPGA acceleration platforms: (1) the Alpha Data board and (2) the Amazon F1 instance that represent the traditional PCIe-based platform with private device memory; (3) the IBM CAPI that represents the PCIe-based system with coherent shared memory; (4) the first generation of the Intel Xeon+FPGA Accelerator Platform that represents the QPI-based system with coherent shared memory; and (5) the second generation of the Intel Xeon+FPGA Accelerator Platform that represents a hybrid PCIe-based (non-coherent) and QPI-based (coherent) system with shared memory. Based on the analysis of their CPU-FPGA communication latency and bandwidth characteristics, we provide a series of insights for both application developers and platform designers. Furthermore, we conduct two case studies to demonstrate how these insights can be leveraged to optimize accelerator designs. The microbenchmarks used for evaluation have been released for public use.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"511 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124651826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 27
An Efficient Memory Partitioning Approach for Multi-Pattern Data Access via Data Reuse 基于数据重用的多模式数据访问的高效内存划分方法
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-02-05 DOI: 10.1145/3301296
Wensong Li, Fan Yang, Hengliang Zhu, Xuan Zeng, Dian Zhou
{"title":"An Efficient Memory Partitioning Approach for Multi-Pattern Data Access via Data Reuse","authors":"Wensong Li, Fan Yang, Hengliang Zhu, Xuan Zeng, Dian Zhou","doi":"10.1145/3301296","DOIUrl":"https://doi.org/10.1145/3301296","url":null,"abstract":"Memory bandwidth has become a bottleneck that impedes performance improvement during the parallelism optimization of the datapath. Memory partitioning is a practical approach to reduce bank-level conflicts and increase the bandwidth on a field-programmable gate array. In this work, we propose a memory partitioning approach for multi-pattern data access. First, we propose to combine multiple patterns into a single pattern to reduce the complexity of multi-pattern. Then, we propose to perform data reuse analysis on the combined pattern to find data reuse opportunities and the non-reusable data pattern. Finally, an efficient bank mapping algorithm with low complexity and low overhead is proposed to find the optimal memory partitioning solution. Experimental results demonstrated that compared to the state-of-the-art method, our proposed approach can reduce the number of block RAMS by 58.9% on average, with 79.6% reduction in SLICEs, 85.3% reduction in LUTs, 67.9% in reduction Flip-Flops, 54.6% reduction in DSP48Es, 83.9% reduction in SRLs, 50.0% reduction in storage overhead, 95.0% reduction in execution time, and 77.3% reduction in dynamic power consumption on average. Meanwhile, the performance can be improved by 14.0% on average.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114658918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
COFFE 2
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-01-30 DOI: 10.1145/3301298
S. Yazdanshenas, Vaughn Betz
{"title":"COFFE 2","authors":"S. Yazdanshenas, Vaughn Betz","doi":"10.1145/3301298","DOIUrl":"https://doi.org/10.1145/3301298","url":null,"abstract":"FPGAs are becoming more heteregeneous to better adapt to different markets, motivating rapid exploration of different blocks/tiles for FPGAs. To evaluate a new FPGA architectural idea, one should be able to accurately obtain the area, delay, and energy consumption of the block of interest. However, current FPGA circuit design tools can only model simple, homogeneous FPGA architectures with basic logic blocks and also lack DSP and other heterogeneous block support. Modern FPGAs are instead composed of many different tiles, some of which are designed in a full custom style and some of which mix standard cell and full custom styles. To fill this modelling gap, we introduce COFFE 2, an open-source FPGA design toolset for automatic FPGA circuit design. COFFE 2 uses a mix of full custom and standard cell flows and supports not only complex logic blocks with fracturable lookup tables and hard arithmetic but also arbitrary heterogeneous blocks. To validate COFFE 2 and demonstrate its features, we design and evaluate a multi-mode Stratix III-like DSP block and several logic tiles with fracturable LUTs and hard arithmetic. We also demonstrate how COFFE 2’s interface to VTR allows full evaluation of block-routing interfaces and various fracturable 6-LUT architectures.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133019993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Loop Unrolling for Energy Efficiency in Low-Cost Field-Programmable Gate Arrays 低成本现场可编程门阵列能源效率的环展开
ACM Transactions on Reconfigurable Technology and Systems (TRETS) Pub Date : 2019-01-21 DOI: 10.1145/3289186
Naveen Kumar Dumpala, S. B. Patil, Daniel E. Holcomb, R. Tessier
{"title":"Loop Unrolling for Energy Efficiency in Low-Cost Field-Programmable Gate Arrays","authors":"Naveen Kumar Dumpala, S. B. Patil, Daniel E. Holcomb, R. Tessier","doi":"10.1145/3289186","DOIUrl":"https://doi.org/10.1145/3289186","url":null,"abstract":"Field-programmable gate arrays (FPGAs) are used for a wide variety of computations in low-cost embedded systems. Although these systems often have modest performance constraints, their energy consumption must typically be limited. Many FPGA applications employ repetitive loops that cannot be straightforwardly split into parallel computations. Performing a loop sequentially generally requires high-speed clocks that consume considerable clock power and sometimes require clock generation using a phase-locked loop (PLL). Loop unrolling addresses the high-speed clock issue, but its use often leads to significant combinational glitch power. In this work, a computer-aided design (CAD) approach that unrolls loops for designs targeted to low-cost FPGAs is described. Our approach considers latency constraints in an effort to minimize energy consumption for loop-based computation. To reduce glitch power, a glitch-filtering approach is introduced that provides a balance between glitch reduction and design performance. Glitch-filter enable signals are generated and routed to the filters using resources best suited to the target FPGA. Our approach automatically inserts glitch filters and associated control logic into a design prior to processing with FPGA synthesis, place, and route tools. Our energy-saving loop-unrolling approach has been evaluated using five benchmarks often used in low-cost FPGAs. The energy-saving capabilities of the approach have been evaluated for an Intel Cyclone IV and a Xilinx Artix-7 FPGA using board-level power measurement. The use of unrolling and glitch filtering is shown to reduce energy by at least 65% for an Artix-7 device and 50% for a Cyclone IV device while meeting design latency constraints.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114387640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信