2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA
Zhe Lin, Sharad Sinha, Wei Zhang
{"title":"Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA","authors":"Zhe Lin, Sharad Sinha, Wei Zhang","doi":"10.1109/FCCM.2019.00032","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00032","url":null,"abstract":"Decision trees are machine learning models commonly used in various application scenarios. In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets due to their stringent data storage requirement. Online decision tree learning algorithms have been devised to tackle this problem by concurrently training with incoming samples and providing inference results. However, even the most up-to-date online tree learning algorithms still suffer from either high memory usage or high computational intensity with dependency and long latency, making them challenging to implement in hardware. To overcome these difficulties, we introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models. The proposed algorithm is light-weight in terms of both memory and computational demand, while still maintaining high generalization ability. A series of optimization techniques dedicated to the proposed algorithm have been investigated from the hardware perspective, including coarse-grained and fine-grained parallelism, dynamic and memory-based resource sharing, pipelining with data forwarding. We further present a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques. Experimental results show that our proposed algorithm outperforms the state-of-the-art Hoeffding tree learning method, leading to 0.05% to 12.3% improvement in inference accuracy. Real implementation of the complete learning system on the FPGA demonstrates a 384x to 1581x speedup in execution time over the state-of-the-art design.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130283946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
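The Hoeffding tree that this work builds on grows a tree incrementally, splitting a leaf only once the observed information-gain gap between the two best attributes exceeds the Hoeffding bound. The abstract does not spell out the authors' quantile-based variant, so the sketch below only illustrates that underlying split test; the function names and default parameters are illustrative.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    samples lies within epsilon of the true mean for a statistic whose
    range is value_range."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best: float, gain_second: float, n_samples: int,
                 delta: float = 1e-7, gain_range: float = 1.0,
                 tie_threshold: float = 0.05) -> bool:
    """Split a leaf once the best attribute's gain beats the runner-up by
    more than the Hoeffding bound, or once the bound is small enough to
    treat the two attributes as tied."""
    eps = hoeffding_bound(gain_range, delta, n_samples)
    return (gain_best - gain_second > eps) or (eps < tie_threshold)

# Example: after 1000 samples the best attribute's gain leads by 0.12.
print(should_split(gain_best=0.31, gain_second=0.19, n_samples=1000))
```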
KPynq: A Work-Efficient Triangle-Inequality Based K-Means on FPGA
Yuke Wang, Zhaorui Zeng, Boyuan Feng, Lei Deng, Yufei Ding
{"title":"KPynq: A Work-Efficient Triangle-Inequality Based K-Means on FPGA","authors":"Yuke Wang, Zhaorui Zeng, Boyuan Feng, Lei Deng, Yufei Ding","doi":"10.1109/FCCM.2019.00061","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00061","url":null,"abstract":"K-means is a popular but computation-intensive algorithm for unsupervised learning. To address this issue, we present KPynq, a work-efficient triangle-inequality based K-means on FPGA for handling large-size, high-dimension datasets. KPynq leverages an algorithm-level optimization to balance the performance and computation irregularity, and a hardware architecture design to fully exploit the pipeline and parallel processing capability of various FPGAs. In the experiment, KPynq consistently outperforms the CPU-based standard K-means in terms of its speedup (up to 4.2x) and significant energy efficiency (up to 218x).","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"346 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121468274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
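The work efficiency claimed for KPynq comes from triangle-inequality pruning: when a point is already close to its current center relative to the distance between centers, other centers cannot be nearer, and their distance computations can be skipped. Below is a minimal software sketch of that pruning rule, assuming a simple NumPy data layout rather than the authors' FPGA datapath.

```python
import numpy as np

def assign_with_triangle_pruning(points, centers):
    """Assign each point to its nearest center, skipping distance
    computations ruled out by the triangle inequality: if
    d(x, c_best) <= 0.5 * d(c_best, c_other), then c_other cannot be
    closer than c_best."""
    # Pairwise center-to-center distances, computed once per iteration.
    center_dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.zeros(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best = 0
        best_dist = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # Triangle-inequality test: prune centers that cannot win.
            if best_dist <= 0.5 * center_dists[best, j]:
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_dist:
                best, best_dist = j, d
        labels[i] = best
    return labels, skipped

rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 16))
ctrs = rng.normal(size=(8, 16))
labels, skipped = assign_with_triangle_pruning(pts, ctrs)
print(f"skipped {skipped} of {len(pts) * (len(ctrs) - 1)} candidate distance computations")
```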
SimBNN: A Similarity-Aware Binarized Neural Network Acceleration Framework
Cheng Fu, Shilin Zhu, Huili Chen, F. Koushanfar, Hao Su, Jishen Zhao
{"title":"SimBNN: A Similarity-Aware Binarized Neural Network Acceleration Framework","authors":"Cheng Fu, Shilin Zhu, Huili Chen, F. Koushanfar, Hao Su, Jishen Zhao","doi":"10.1109/FCCM.2019.00060","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00060","url":null,"abstract":"Binarized Neural Networks (BNNs) eliminate bitwidth redundancy in Convolutional Neural Networks (CNNs) by using a single bit (-1/+1) for network parameters and intermediate representations. This greatly reduces off-chip data transfer and storage overhead. However, considerable computation redundancy remains in BNN inference. To tackle this problem, we investigate the similarity property in input data and kernel weights. We identify an average of 79% input similarity and 61% kernel similarity measured by our proposed metric across common network architectures. Motivated by this observation, we propose SimBNN, a fast and energy-efficient acceleration framework for BNN inference that leverages similarity properties. SimBNN consists of a set of similarity-aware accelerators, a weight reuse optimization algorithm, and a similarity selection mechanism. SimBNN incorporates two types of BNN accelerators, which exploit the input similarity and kernel similarity, respectively. More specifically, the result from the previous stage is reused if similarity is identified, thus significantly reducing BNN computation overhead. Furthermore, we propose a weight reuse optimization algorithm, which increases the weight similarity by off-line re-ordering weight kernels. Finally, our framework provides a systematic method to determine the optimal strategy between input data and kernel weights reuse, based on the similarity characteristics of input data and pre-trained BNNs.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122991661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
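A BNN dot product reduces to XNOR plus popcount over bit-packed vectors, so when consecutive inputs (or kernels) differ in only a few bits, the previous result can be corrected instead of recomputed, which is the kind of reuse SimBNN exploits. The sketch below shows that incremental correction for a single bit-packed row; the bit encoding and function names are illustrative, not the authors' accelerator design.

```python
def bnn_dot(x: int, w: int, n_bits: int) -> int:
    """Binarized dot product of two n_bits-wide bit-packed vectors
    (bit 1 encodes +1, bit 0 encodes -1): matches minus mismatches."""
    mismatches = bin((x ^ w) & ((1 << n_bits) - 1)).count("1")
    return n_bits - 2 * mismatches

def bnn_dot_reuse(prev_dot: int, x_prev: int, x_new: int, w: int, n_bits: int) -> int:
    """Update a previously computed dot product when only a few input bits
    changed: each flipped bit turns a match into a mismatch or vice versa,
    shifting the result by -2 or +2."""
    mask = (1 << n_bits) - 1
    flipped = (x_prev ^ x_new) & mask
    was_mismatch = flipped & (x_prev ^ w)          # these become matches: +2 each
    was_match = flipped & ~(x_prev ^ w) & mask     # these become mismatches: -2 each
    return prev_dot + 2 * bin(was_mismatch).count("1") - 2 * bin(was_match).count("1")

# 64-bit example: the incremental update matches a full recomputation.
n = 64
w = 0x0F0F_0F0F_0F0F_0F0F
x0 = 0x1234_5678_9ABC_DEF0
x1 = x0 ^ 0b1011            # new input differs from x0 in three bit positions
full = bnn_dot(x1, w, n)
incremental = bnn_dot_reuse(bnn_dot(x0, w, n), x0, x1, w, n)
assert full == incremental
print(full, incremental)
```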
Welcome Message from the General and Program Chairs
{"title":"Welcome Message from the General and Program Chairs","authors":"","doi":"10.1109/fccm.2019.00005","DOIUrl":"https://doi.org/10.1109/fccm.2019.00005","url":null,"abstract":"","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124013570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Safe Task Interruption for FPGAs
Sameh Attia, Vaughn Betz
{"title":"Safe Task Interruption for FPGAs","authors":"Sameh Attia, Vaughn Betz","doi":"10.1109/FCCM.2019.00070","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00070","url":null,"abstract":"Saving and restoring the state of an FPGA task in an orderly manner is essential for enabling hardware checkpointing and context switching. However, it requires task interruption, and stopping a task at an arbitrary time can cause several hazards including deadlock and data loss. In this work, we build a context switching simulator to simulate and identify these hazards. In addition, we introduce design rules that should be followed to achieve safe task interruption, and propose task wrappers that can be placed around an FPGA task to implement these rules.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115498163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Impact of FPGA Architecture on Area and Performance of CGRA Overlays
Ian Taras, J. Anderson
{"title":"Impact of FPGA Architecture on Area and Performance of CGRA Overlays","authors":"Ian Taras, J. Anderson","doi":"10.1109/FCCM.2019.00022","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00022","url":null,"abstract":"Coarse-grained reconfigurable arrays (CGRAs) are programmable logic devices with ALU-style processing elements and datapath interconnect. CGRAs can be realized as custom ASICs or implemented on FPGAs as overlays . A key element of CGRAs is that they are typically software programmable with rapid compile times – an advantage arising from their coarse-grained characteristics, simplifying CAD mapping tasks. We implement two previously published CGRAs as overlays on two commercial FPGAs (Intel and Xilinx), and consider the impact of the underlying FPGA architecture on the CGRA area and performance. We present optimizations for the overlays to take advantage of the FPGA architectural features and show a peak performance improvement of 1.93x, as well as maximum area savings of 31.1% and 48.5% for Intel and Xilinx, respectively, relative to a naive first-cut implementation. We also present a novel technique for a configurable multiplexer implementation, which embeds the select signals into SRAM configuration, saving 35.7% in area. The research is conducted using the open-source CGRA-ME (modeling and exploration) framework [1].","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125050455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Raparo: Resource-Level Angle-Based Parallel Routing for FPGAs
Minghua Shen, Nong Xiao
{"title":"Raparo: Resource-Level Angle-Based Parallel Routing for FPGAs","authors":"Minghua Shen, Nong Xiao","doi":"10.1109/FCCM.2019.00053","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00053","url":null,"abstract":"Routing is a time-consuming step in the FPGA compilation flow. The parallelization of routing has the potential to reduce the time but imposes the dependent problem as the inherent order of nets. In this paper, we present Raparo, a resource-level angle-based parallel router. Raparo exploits angle-based region partitioning to drive the assignment of the nets for efficient parallel routing on the multi-core processor systems. Raparo parallelizes the routing at resource level rather than region level for the similar convergence as the serial router. Results show that Raparo can scale to 32 processor cores to provide about 16x speedup on average with acceptable impacts on the quality of results, comparing to the serial router.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129921478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
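The abstract does not detail how the angle-based partitioning is computed; one plausible reading, sketched below purely as an assumption with illustrative names and data layout, is to bucket nets into angular sectors around the chip center so that each worker routes one sector independently.

```python
import math
from collections import defaultdict

def partition_nets_by_angle(nets, chip_center, num_workers):
    """Group nets into angular sectors around the chip center so each worker
    routes one sector. 'nets' maps a net name to the (x, y) centroid of its
    terminals (an illustrative data model)."""
    sectors = defaultdict(list)
    sector_width = 2 * math.pi / num_workers
    for name, (x, y) in nets.items():
        angle = math.atan2(y - chip_center[1], x - chip_center[0]) % (2 * math.pi)
        sectors[int(angle // sector_width)].append(name)
    return sectors

nets = {"n0": (10, 3), "n1": (-4, 8), "n2": (-6, -7), "n3": (9, -2)}
print(partition_nets_by_angle(nets, chip_center=(0, 0), num_workers=4))
```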
An FPGA-Based Computing Infrastructure Tailored to Efficiently Scaffold Genome Sequences
Alberto Zeni, M. Crespi, Lorenzo Di Tucci, M. Santambrogio
{"title":"An FPGA-Based Computing Infrastructure Tailored to Efficiently Scaffold Genome Sequences","authors":"Alberto Zeni, M. Crespi, Lorenzo Di Tucci, M. Santambrogio","doi":"10.1109/FCCM.2019.00074","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00074","url":null,"abstract":"In the current years broad access to genomic data is leading to improve the understanding and prevention of human diseases as never before. De-novo genome assembly, represents a main obstacle to perform the analysis on a large scale, as it is one of the most time-consuming phases of the genome analysis. In this paper, we present a scalable, high performance and energy efficient architecture for the alignment step of SSPACE, a state of the art tool used to perform scaffolding also in case of de-novo assembly. The final architecture is able to achieve up to 9.83x speedup in performance when compared to the software version of Bowtie, a state of the art tool used by SSPACE to perform the alignment.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127082565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Maverick: A Stand-Alone CAD Flow for Partially Reconfigurable FPGA Modules
D. Glick, Jesse Grigg, B. Nelson, M. Wirthlin
{"title":"Maverick: A Stand-Alone CAD Flow for Partially Reconfigurable FPGA Modules","authors":"D. Glick, Jesse Grigg, B. Nelson, M. Wirthlin","doi":"10.1109/FCCM.2019.00012","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00012","url":null,"abstract":"This paper presents Maverick, a proof-of-concept computer-aided design (CAD) flow for generating reconfigurable modules (RMs) which target partial reconfiguration (PR) regions in field-programmable gate array (FPGA) designs. After an initial static design and PR region are created with Xilinx's Vivado PR flow, the Maverick flow can then compile and configure RMs onto that PR region—without the use of vendor tools. Maverick builds upon existing open source tools (Yosys, RapidSmith2, and Project X-Ray) to form an end-to-end compilation flow. This paper describes the Maverick flow and shows the results of it running on a PYNQ-Z1's ARM processor to compile a set of HDL designs to partial bitstreams. The resulting bitstreams were configured onto the PYNQ-Z1's FPGA fabric, demonstrating the feasibility of a single-chip embedded system which can both compile HDL designs to bitstreams and then configure them onto its own programmable fabric.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128127421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
High Precision, High Performance FPGA Adders
M. Langhammer, B. Pasca, Gregg Baeckler
{"title":"High Precision, High Performance FPGA Adders","authors":"M. Langhammer, B. Pasca, Gregg Baeckler","doi":"10.1109/FCCM.2019.00047","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00047","url":null,"abstract":"FPGAs are now being commonly used in the datacenter as smart Network Interface Cards (NICs), with cryptography as one of the strategic application areas. Public key cryptography algorithms in particular require arithmetic with thousands of bits of precision. Even an operation as simple as addition can be difficult for the FPGA when dealing with large integers, because of the high resource count and high latency needed to achieve usable performance levels with known methods. This paper examines the architecture and implementation of high-performance integer adders on FPGAs for widths ranging from 1024 to 8192 bits, in both single-instance and many-core chip-filling configurations. For chip-filling designs the routing impact of these wide busses are assessed, as they often have an impact outside the immediate locality of the structures. The architectures presented in this work show 1 to 2 orders magnitude reduction in the area-latency product over commonly used approaches. Routing congestion is managed, with near 100% logic efficiency (packing) for the adder function. Performance for these largely automatically placed designs are approximately the same as for carefully floor-planned non-arithmetic applications. In one example design, we show a 2048 bit adder in 5021 ALMs, with a latency of 6 clock cycles, at 628 MHz in a Stratix 10 E-2 device.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
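The central difficulty described here, carry propagation across thousands of bits, is commonly addressed by splitting the operands into narrower chunks and pipelining the carry from one chunk to the next so that each stage only resolves a narrow addition. The sketch below models that decomposition in software; the 256-bit chunk width and function names are illustrative and not taken from the paper.

```python
def wide_add_chunked(a: int, b: int, width: int = 2048, chunk: int = 256):
    """Add two width-bit integers chunk by chunk, propagating a single carry
    bit between chunks, mimicking how a deeply pipelined FPGA adder resolves
    one narrow segment per pipeline stage."""
    mask = (1 << chunk) - 1
    carry, result = 0, 0
    for stage in range(width // chunk):
        a_part = (a >> (stage * chunk)) & mask
        b_part = (b >> (stage * chunk)) & mask
        s = a_part + b_part + carry
        result |= (s & mask) << (stage * chunk)
        carry = s >> chunk          # at most 1; feeds the next stage
    return result, carry

# The chunked result matches a full-width addition modulo 2**2048.
a = (1 << 2048) - 12345
b = 98765
total, carry_out = wide_add_chunked(a, b)
assert total == (a + b) % (1 << 2048) and carry_out == ((a + b) >> 2048)
print(carry_out)
```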