Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Articles

FPGA Acceleration for Simultaneous Image Reconstruction and Segmentation based on the Mumford-Shah Regularization (Abstract Only)
Wentai Zhang, Li Shen, Thomas Page, Guojie Luo, Peng Li, P. Maass, M. Jiang, J. Cong
{"title":"FPGA Acceleration for Simultaneous Image Reconstruction and Segmentation based on the Mumford-Shah Regularization (Abstract Only)","authors":"Wentai Zhang, Li Shen, Thomas Page, Guojie Luo, Peng Li, P. Maass, M. Jiang, J. Cong","doi":"10.1145/2684746.2689097","DOIUrl":"https://doi.org/10.1145/2684746.2689097","url":null,"abstract":"X-ray computed tomography is an important technique for clinical diagnosis and nondestructive testing. In many applications a number of image processing steps are needed before the image information becomes useful. Image segmentation is one such processing step and has important applications. The conventional flow is to first reconstruct the image and then obtain the image segmentation afterwards. In contrast, an iterative method for simultaneous reconstruction and segmentation (SRS) with the Mumford-Shah model has been proposed, which not only regularizes the ill-posedness of the tomographic reconstruction problem, but also produces the image segmentation at the same time. The Mumford-Shah model is both mathematically and computationally difficult. In this paper, we propose a data-decomposed algorithm of the SRS method and accelerate it using FPGA devices. The proposed algorithm has a structure that invokes a single kernel many times without involving other computational tasks. Though this structure seems best suited to GPU-like devices, experimental results show that speedups of 73X, 11X, and 1.4X can be achieved by the FPGA acceleration over the CPU implementations of the original SRS algorithm and the ray-parallel SRS algorithm, and the GPU implementation of the ray-parallel SRS, respectively.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117082957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
The BEEcube Story: Lessons Learned from Running a FPGA Startup for the Past 7 Years
Chen Chang
{"title":"The BEEcube Story: Lessons Learned from Running a FPGA Startup for the Past 7 Years","authors":"Chen Chang","doi":"10.1145/2684746.2721405","DOIUrl":"https://doi.org/10.1145/2684746.2721405","url":null,"abstract":"After running BEEcube Inc for the past 7 years, I learned many lessons the hard way as an entrepreneur fresh out of engineering school. Behind the glory of being the #9 fastest growing private company in Silicon Valley in 2013, there were many untold stories about our FPGA technology based startup company. A startup company is where smart people start their dreams, and also where harsh reality squashes them. This is not one of those \"unicorn\" billion-dollar-valuation-in-18-month stories, but rather the story of a bootstrapped startup managing to find its own pot of gold under the rainbow.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121296876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fine-Grained Interconnect Synthesis
A. Rodionov, David Biancolin, Jonathan Rose
{"title":"Fine-Grained Interconnect Synthesis","authors":"A. Rodionov, David Biancolin, Jonathan Rose","doi":"10.1145/2684746.2689061","DOIUrl":"https://doi.org/10.1145/2684746.2689061","url":null,"abstract":"One of the key challenges for the FPGA industry going forward is to make the task of designing hardware easier. A significant portion of that design task is the creation of the interconnect pathways between functional structures. We present a synthesis tool that automates this process and focuses on the interconnect needs in the fine-grained (sub-IP-block) design space. Here there are several issues that prior research and tools do not address well: the need to have fixed, deterministic latency between communicating units (to enable high-performance local communication without the area overheads of latency-insensitivity), and the ability to avoid generating unnecessary arbitration hardware when the application design can avoid it. Using a design example, our tool generates interconnect that requires 72% fewer lines of specification code than a hand-written Verilog implementation, which is a 33% overall reduction for the entire application. The resulting system, while requiring 4% more total functional and interconnect area, achieves the same performance. We also show quantitative and qualitative advantages over an existing commercial interconnect synthesis tool, achieving a 25% performance advantage and 17%/57% logic/memory area savings.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130244845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An FPGA-Based Accelerator for the 2D Implicit FDM and Its Application to Heat Conduction Simulations (Abstract Only)
Yutaro Ishigaki, Ning Li, Yoichi Tomioka, A. Miyazaki, H. Kitazawa
{"title":"An FPGA-Based Accelerator for the 2D Implicit FDM and Its Application to Heat Conduction Simulations (Abstract Only)","authors":"Yutaro Ishigaki, Ning Li, Yoichi Tomioka, A. Miyazaki, H. Kitazawa","doi":"10.1145/2684746.2689112","DOIUrl":"https://doi.org/10.1145/2684746.2689112","url":null,"abstract":"Field-programmable gate arrays (FPGAs) are extremely advanced with regard to high performance; they are becoming one of the primary device choices to realize high-performance computing (HPC). In this work, we propose an FPGA-based accelerator for the two-dimensional (2D) finite difference method (FDM) with the implicit scheme and implement a 2D unsteady-state heat conduction simulation using red/black successive over-relaxation (SOR). The accelerator consists of a 2D single-instruction multiple-data (SIMD) array processor, which has pipelined processing elements (PEs) including 32-bit floating point calculation units. This processor can avoid the memory-access bottleneck and perform with high operating efficiency and low waiting time for data transfer by applying the proposed control method with synchronous shift data transfer. We demonstrate that the experimental hardware implemented on an Altera Stratix V FPGA (5SGSMD5K2F40C2N) reaches a 99.83% operating rate of the calculation units for the computation of red/black SOR. In addition, it is approximately six times faster than GPU computing on an NVIDIA GeForce GTX 770 for a 32-bit floating-point calculation of a printed circuit board (PCB) heat conduction simulation, and it is about eight times faster than an NVIDIA Tesla C2075 for the same calculation.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"42 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130734585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
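The red/black SOR iteration named in the abstract above can be illustrated in software. The following is only a minimal sketch, assuming a backward-Euler discretization of the 2D heat equation on a uniform grid with fixed (Dirichlet) boundary values; it is not the authors' hardware design, and the function name `red_black_sor_step` and the parameters `r`, `omega`, `sweeps`, and `tol` are hypothetical names chosen for illustration.

```python
def red_black_sor_step(u_old, r, omega=1.2, sweeps=200, tol=1e-10):
    """Advance one implicit (backward-Euler) time step of 2D heat conduction.

    Solves, for every interior cell,
        (1 + 4r) * u[i][j] - r * (sum of the 4 neighbours) = u_old[i][j]
    where r = alpha * dt / dx**2 (assumed notation). Boundary cells are
    held fixed. All names here are illustrative assumptions.
    """
    n_rows, n_cols = len(u_old), len(u_old[0])
    u = [row[:] for row in u_old]  # initial guess: the previous time level
    for _ in range(sweeps):
        max_change = 0.0
        # Colour 0 ("red") cells depend only on colour 1 ("black") cells
        # and vice versa, so each half-sweep could run fully in parallel.
        for colour in (0, 1):
            for i in range(1, n_rows - 1):
                for j in range(1, n_cols - 1):
                    if (i + j) % 2 != colour:
                        continue
                    neighbours = (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                    gauss_seidel = (u_old[i][j] + r * neighbours) / (1.0 + 4.0 * r)
                    new_value = (1.0 - omega) * u[i][j] + omega * gauss_seidel
                    max_change = max(max_change, abs(new_value - u[i][j]))
                    u[i][j] = new_value
        if max_change < tol:  # stop once a full sweep barely changes anything
            break
    return u
```

The two-colour ordering is what makes the method attractive for a SIMD array: within one colour, no cell reads a value written in the same half-sweep.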
Wavefront Skipping using BRAMs for Conditional Algorithms on Vector Processors
Aaron Severance, Joe Edwards, G. Lemieux
{"title":"Wavefront Skipping using BRAMs for Conditional Algorithms on Vector Processors","authors":"Aaron Severance, Joe Edwards, G. Lemieux","doi":"10.1145/2684746.2689072","DOIUrl":"https://doi.org/10.1145/2684746.2689072","url":null,"abstract":"Soft vector processors can accelerate data parallel algorithms on FPGAs while retaining software programmability. To handle divergent control flow, vector processors typically use mask registers and predicated instructions. These work by executing all branches and finally selecting the correct one. Our work improves FPGA based vector processors by adding wavefront skipping, where wavefronts that are completely masked off are skipped. This accelerates conditional algorithms, particularly useful where elements terminate early if simple tests fail but require extensive processing in the worst case. The difference in logic speed and RAM area for FPGA based circuits versus ASICs led us to a different implementation than used in fixed vector processors, storing wavefront offsets in on-chip BRAM rather than computing wavefronts skipped dynamically. Additionally, we allow for partitioning the wavefronts so that partial wavefronts can skip independently of one another. We show that <5% extra area can give up to 3.2× better performance on conditional algorithms. Partial wavefront skipping may not be generally useful enough to be added to a fixed vector processor; it provides up to 65% more performance for up to 27% more area. In an FPGA, however, the designer can use it to make application specific tradeoffs between area and performance.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127644565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
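The idea of wavefront skipping described in the abstract above can be modelled with a few lines of software. This is a simplified sketch under stated assumptions, not the paper's implementation: it assumes one cycle per issued wavefront, treats the mask as a flat list of booleans, and uses hypothetical function names (`active_wavefronts`, `cycles_without_skipping`, `cycles_with_skipping`). The list returned by `active_wavefronts` only loosely plays the role of the wavefront offsets that the abstract says are stored in BRAM.

```python
def active_wavefronts(mask, lanes):
    """Indices of wavefronts that contain at least one enabled element."""
    return [start // lanes
            for start in range(0, len(mask), lanes)
            if any(mask[start:start + lanes])]


def cycles_without_skipping(mask, lanes):
    """Plain predication: every wavefront is issued, masked or not."""
    return -(-len(mask) // lanes)  # ceiling division


def cycles_with_skipping(mask, lanes, partitions=1):
    """Cycle count when fully masked (partial) wavefronts are skipped.

    With partitions > 1 the lanes are split into equal groups and each
    group skips independently; the slowest group sets the total time.
    """
    if lanes % partitions != 0:
        raise ValueError("lanes must be divisible by partitions")
    width = lanes // partitions
    worst = 0
    for p in range(partitions):
        active = 0
        for start in range(0, len(mask), lanes):
            lo = start + p * width
            if any(mask[lo:lo + width]):  # this slice still has work to do
                active += 1
        worst = max(worst, active)
    return worst
```

For example, with 16 elements, 4 lanes, and only elements 0 and 6 enabled, plain predication issues 4 wavefronts, whole-wavefront skipping issues 2, and splitting the lanes into 2 partitions brings the modelled count down to 1.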
A Mixed-Grained Reconfigurable Computing Platform for Multiple-Standard Video Decoding (Abstract Only)
Leibo Liu, Victor Y. Chen, Dong Wang, Min Zhu, S. Yin, Shaojun Wei
{"title":"A Mixed-Grained Reconfigurable Computing Platform for Multiple-Standard Video Decoding (Abstract Only)","authors":"Leibo Liu, Victor Y. Chen, Dong Wang, Min Zhu, S. Yin, Shaojun Wei","doi":"10.1145/2684746.2689116","DOIUrl":"https://doi.org/10.1145/2684746.2689116","url":null,"abstract":"A mixed-grained reconfigurable computing platform targeting multiple-standard video decoding is proposed in this paper. The platform integrates eight coarse-grained Reconfigurable Processing Units (RPUs), each of which consists of 16×16 multi-functional Processing Elements (PEs) and is implemented in TSMC 65 nm technology, and two Altera Stratix IV EP4SE820 FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including H.264, MPEG-2, AVS and HEVC. Two types of platform configuration are tested in this work. One configuration utilizes two RPUs and targets multiple-standard high-definition (HD) video decoding, while the other utilizes only one RPU, which runs at a lower frequency and targets standard-definition (SD) decoding. The HD configuration can decode 1920×1080 H.264 video streams at 30 frames per second (fps) under 200 MHz and 1920×1080 HEVC video streams at 30 fps under 236 MHz. It achieves a 25% performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 3.85× performance boost over the Intel i5 general-purpose CPU for HEVC decoding.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134108406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Session details: Technical Session 2: Configuration and Processing
S. Trimberger
{"title":"Session details: Technical Session 2: Configuration and Processing","authors":"S. Trimberger","doi":"10.1145/3251651","DOIUrl":"https://doi.org/10.1145/3251651","url":null,"abstract":"","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134540187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
REPROC: A Dynamically Reconfigurable Architecture for Symmetric Cryptography (Abstract Only)
Bo D. Wang, Leibo Liu
{"title":"REPROC: A Dynamically Reconfigurable Architecture for Symmetric Cryptography (Abstract Only)","authors":"Bo D. Wang, Leibo Liu","doi":"10.1145/2684746.2689121","DOIUrl":"https://doi.org/10.1145/2684746.2689121","url":null,"abstract":"The paper presents a VLSI architecture of a reconfigurable processor. The proposed architecture can efficiently implement symmetric ciphers, while maintaining flexibility through reconfiguration. A series of optimization methods are introduced during this process. The InterConnection Tree between Rows (ICTR) decreases the area overhead through reducing the complexity of interconnection. The use of the Hierarchical Context Organization (HCO) scheme reduces the total size of contexts and increases the speed of dynamic configuration. The proposed architecture can implement most symmetric ciphers, such as AES, DES, SHACAL-1, SMS4 and ZUC. The performance, area efficiency (throughput/area) and energy efficiency (throughput/power) of the proposed architecture show clear advantages over state-of-the-art architectures in the literature.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114579699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Physical Design Space Exploration
Ephrem Wu, Inkeun Cho
{"title":"Physical Design Space Exploration","authors":"Ephrem Wu, Inkeun Cho","doi":"10.1145/2684746.2689080","DOIUrl":"https://doi.org/10.1145/2684746.2689080","url":null,"abstract":"A polynomial accelerator implemented with a custom high-dynamic-range number representation operates up to 534MHz in the slowest speed grade on a 28nm FPGA, a clock rate that a typical FPGA tool flow cannot achieve. This design tutorial shows how to achieve a physically scalable and high-speed numerical design by partitioning it into a cascade of identical stages, and balancing the LUT-to-DSP ratio within each stage to match the available resources on the FPGA.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116449928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access
Xinyu Niu, W. Luk, Yu Wang
{"title":"EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access","authors":"Xinyu Niu, W. Luk, Yu Wang","doi":"10.1145/2684746.2689076","DOIUrl":"https://doi.org/10.1145/2684746.2689076","url":null,"abstract":"This paper describes Effective Utilities for Run-timE Configuration Adaptation (EURECA), a novel memory architecture for supporting effective dynamic data access in reconfigurable devices. EURECA exploits on-chip configuration generation to reconfigure active connections in such devices cycle by cycle. When integrated into a baseline architecture based on the Virtex-6 SX475T, the EURECA memory architecture introduces small area, delay and power overhead. Three benchmark applications are developed with the proposed architecture targeting social networking (Memcached), scientific computing (sparse matrix-vector multiplication), and in-memory database (large-scale sorting). Compared with conventional static designs, up to 14.9 times reduction in area, 2.2 times reduction in critical-path delay, and 32.1 times reduction in area-delay product are achieved.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122025898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17