Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Papers, Page 9

FPGA Acceleration for Simultaneous Image Reconstruction and Segmentation based on the Mumford-Shah Regularization (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689097

Wentai Zhang, Li Shen, Thomas Page, Guojie Luo, Peng Li, P. Maass, M. Jiang, J. Cong

X-ray computed tomography is an important technique for clinical diagnosis and nondestructive testing. In many applications, a number of image processing steps are needed before the image information becomes useful. Image segmentation is one such processing step and has important applications. The conventional flow is to first reconstruct the image and then segment it. In contrast, an iterative method for simultaneous reconstruction and segmentation (SRS) with the Mumford-Shah model has been proposed, which not only regularizes the ill-posedness of the tomographic reconstruction problem but also produces the image segmentation at the same time. The Mumford-Shah model is both mathematically and computationally difficult. In this paper, we propose a data-decomposed algorithm for the SRS method and accelerate it using FPGA devices. The proposed algorithm has a structure that invokes a single kernel many times without involving other computational tasks. Though this structure seems best suited to GPU-like devices, experimental results show that the FPGA acceleration achieves speedups of 73×, 11×, and 1.4× over the CPU implementation of the original SRS algorithm, the CPU implementation of the ray-parallel SRS algorithm, and the GPU implementation of the ray-parallel SRS algorithm, respectively.
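For orientation, the classical Mumford-Shah functional, with its data-fidelity term written for a tomographic forward operator, has the generic form below. This is a textbook-style sketch and not necessarily the exact formulation used in the paper; the symbols $R$ (forward projection operator), $g$ (measured projection data), $\Omega$ (image domain), $K$ (edge set), and the weights $\alpha, \beta > 0$ are notation assumed here.

```latex
\min_{f,\,K}\;
  \| R f - g \|_{L^2}^2
  \;+\; \alpha \int_{\Omega \setminus K} |\nabla f|^2 \, dx
  \;+\; \beta \, \mathcal{H}^{1}(K)
```

The first term fits the reconstruction $f$ to the measured data, the second enforces smoothness away from edges (which is what regularizes the ill-posed reconstruction), and the third penalizes the total edge length $\mathcal{H}^{1}(K)$. The edge set $K$ of a minimizer is the segmentation, which is why reconstruction and segmentation are obtained together.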

Citations: 3

The BEEcube Story: Lessons Learned from Running a FPGA Startup for the Past 7 Years

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2721405

Chen Chang

Citations: 0

Fine-Grained Interconnect Synthesis

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689061

A. Rodionov, David Biancolin, Jonathan Rose

One of the key challenges for the FPGA industry going forward is to make the task of designing hardware easier. A significant portion of that design task is the creation of the interconnect pathways between functional structures. We present a synthesis tool that automates this process and focuses on the interconnect needs in the fine-grained (sub-IP-block) design space. Here there are several issues that prior research and tools do not address well: the need to have fixed, deterministic latency between communicating units (to enable high-performance local communication without the area overheads of latency-insensitivity), and the ability to avoid generating unnecessary arbitration hardware when the application design can avoid it. Using a design example, our tool generates interconnect that requires 72% fewer lines of specification code than a hand-written Verilog implementation, which is a 33% overall reduction for the entire application. The resulting system, while requiring 4% more total functional and interconnect area, achieves the same performance. We also show quantitative and qualitative advantages over an existing commercial interconnect synthesis tool, against which we achieve a 25% performance advantage and 17%/57% logic/memory area savings.

Citations: 1

An FPGA-Based Accelerator for the 2D Implicit FDM and Its Application to Heat Conduction Simulations (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689112

Yutaro Ishigaki, Ning Li, Yoichi Tomioka, A. Miyazaki, H. Kitazawa

Field-programmable gate arrays (FPGAs) have advanced greatly in performance and are becoming one of the primary device choices for high-performance computing (HPC). In this work, we propose an FPGA-based accelerator for the two-dimensional (2D) finite difference method (FDM) with the implicit scheme and implement a 2D unsteady-state heat conduction simulation using red/black successive over-relaxation (SOR). The accelerator consists of a 2D single-instruction multiple-data (SIMD) array processor, which has pipelined processing elements (PEs) including 32-bit floating-point calculation units. By applying the proposed control method with synchronous shift data transfer, this processor avoids the memory-access bottleneck and operates with high efficiency and low waiting time for data transfer. We demonstrate that the experimental hardware implemented on an Altera Stratix V FPGA (5SGSMD5K2F40C2N) reaches a 99.83% operating rate of the calculation units for the computation of red/black SOR. In addition, it is approximately six times faster than GPU computing on an NVIDIA GeForce GTX 770 for a 32-bit floating-point calculation of a printed circuit board (PCB) heat conduction simulation, and it is about eight times faster than an NVIDIA Tesla C2075 for the same calculation.
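The red/black SOR scheme named in the abstract can be sketched in software as follows. This is a minimal pure-Python illustration, assuming a backward-Euler discretization on a uniform grid with fixed (Dirichlet) boundary values; the function name `implicit_heat_step`, the parameter `r` (diffusivity × time step / grid spacing²), and the default relaxation factor are assumptions made here, not details taken from the paper or its hardware design.

```python
def implicit_heat_step(u_old, r, omega=1.5, sweeps=200, tol=1e-10):
    """Advance one backward-Euler step of the 2D heat equation.

    Solves (1 + 4r) * u[i][j] - r * (sum of 4 neighbours) = u_old[i][j]
    on interior cells with red/black SOR; boundary cells stay fixed.
    """
    n_rows, n_cols = len(u_old), len(u_old[0])
    u = [row[:] for row in u_old]  # initial guess: the previous time step
    for _ in range(sweeps):
        max_change = 0.0
        for colour in (0, 1):  # 0 = "red" cells, 1 = "black" cells
            for i in range(1, n_rows - 1):
                for j in range(1, n_cols - 1):
                    if (i + j) % 2 != colour:
                        continue
                    neighbours = (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
                    # Plain Gauss-Seidel value for this cell
                    gauss_seidel = (u_old[i][j] + r * neighbours) / (1.0 + 4.0 * r)
                    # Over-relax towards the Gauss-Seidel value
                    new_value = (1.0 - omega) * u[i][j] + omega * gauss_seidel
                    max_change = max(max_change, abs(new_value - u[i][j]))
                    u[i][j] = new_value
        if max_change < tol:
            break
    return u


# Usage: a single hot cell in the middle of a cold 5x5 plate.
plate = [[0.0] * 5 for _ in range(5)]
plate[2][2] = 1.0
next_plate = implicit_heat_step(plate, r=0.5)
```

Within one colour, every cell depends only on cells of the other colour, so all red cells (and then all black cells) can be updated in parallel. That independence is what makes the scheme a natural fit for a SIMD array of processing elements.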

Citations: 0

Wavefront Skipping using BRAMs for Conditional Algorithms on Vector Processors

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689072

Aaron Severance, Joe Edwards, G. Lemieux

Soft vector processors can accelerate data-parallel algorithms on FPGAs while retaining software programmability. To handle divergent control flow, vector processors typically use mask registers and predicated instructions. These work by executing all branches and finally selecting the correct one. Our work improves FPGA-based vector processors by adding wavefront skipping, where wavefronts that are completely masked off are skipped. This accelerates conditional algorithms and is particularly useful where elements terminate early if simple tests fail but require extensive processing in the worst case. The difference in logic speed and RAM area for FPGA-based circuits versus ASICs led us to a different implementation than that used in fixed vector processors: storing wavefront offsets in on-chip BRAM rather than dynamically computing which wavefronts to skip. Additionally, we allow for partitioning the wavefronts so that partial wavefronts can skip independently of one another. We show that <5% extra area can give up to 3.2× better performance on conditional algorithms. Partial wavefront skipping may not be generally useful enough to be added to a fixed vector processor; it provides up to 65% more performance for up to 27% more area. In an FPGA, however, the designer can use it to make application-specific tradeoffs between area and performance.
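The idea of skipping fully masked wavefronts, and of precomputing the jumps between active ones, can be modelled in a few lines. This is a simplified software sketch, assuming one cycle per executed wavefront; the function names, the offset-table layout, and the partition model are illustrative assumptions and do not describe the authors' hardware.

```python
def active_wavefronts(mask, lanes):
    """Indices of wavefronts (groups of `lanes` elements) with any mask bit set."""
    count = (len(mask) + lanes - 1) // lanes
    return [w for w in range(count) if any(mask[w * lanes:(w + 1) * lanes])]


def skip_offsets(mask, lanes):
    """Jumps between consecutive active wavefronts, computed once per mask.

    The first entry is the index of the first active wavefront. A table
    like this could be stored in memory instead of scanning the mask on
    every instruction.
    """
    offsets, previous = [], 0
    for w in active_wavefronts(mask, lanes):
        offsets.append(w - previous)
        previous = w
    return offsets


def cycles_predicated(mask, lanes):
    """Plain predication: every wavefront executes, masked or not."""
    return (len(mask) + lanes - 1) // lanes


def cycles_skipping(mask, lanes, partitions=1):
    """Wavefront skipping; with partitions > 1 each lane group skips alone.

    The instruction takes as long as its busiest partition.
    """
    assert lanes % partitions == 0, "lanes must divide evenly into partitions"
    width = lanes // partitions
    busiest = 0
    for p in range(partitions):
        busy = sum(
            1
            for start in range(0, len(mask), lanes)
            if any(mask[start + p * width:start + (p + 1) * width])
        )
        busiest = max(busiest, busy)
    return busiest


# Usage: 16 elements, 4 lanes; only wavefronts 1 and 3 hold active elements.
mask = [0, 0, 0, 0,  1, 0, 0, 0,  0, 0, 0, 0,  0, 0, 1, 1]
print(cycles_predicated(mask, 4))            # 4
print(cycles_skipping(mask, 4))              # 2
print(cycles_skipping(mask, 4, partitions=2))  # 1
print(skip_offsets(mask, 4))                 # [1, 2]
```

In this model, full-wavefront skipping halves the cycle count for the example mask, and splitting each wavefront into two independently skipping halves halves it again, because each half has only one active partial wavefront.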

Citations: 5

A Mixed-Grained Reconfigurable Computing Platform for Multiple-Standard Video Decoding (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689116

Leibo Liu, Victor Y. Chen, Dong Wang, Min Zhu, S. Yin, Shaojun Wei

A mixed-grained reconfigurable computing platform targeting multiple-standard video decoding is proposed in this paper. The platform integrates eight coarse-grained Reconfigurable Processing Units (RPUs), each consisting of 16×16 multi-functional Processing Elements (PEs) and implemented in TSMC 65 nm technology, and two Altera Stratix IV EP4SE820 FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including H.264, MPEG-2, AVS and HEVC. Two types of platform configuration are tested in this work. One configuration utilizes two RPUs and targets multiple-standard high-definition (HD) video decoding, while the other utilizes only one RPU, which runs at a lower frequency and targets standard-definition (SD) decoding. The HD configuration can decode 1920×1080 H.264 video streams at 30 frames per second (fps) at 200 MHz and 1920×1080 HEVC video streams at 30 fps at 236 MHz. It achieves a 25% performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 3.85× performance boost over the Intel i5 general-purpose CPU for HEVC decoding.
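To put the quoted HD figures in perspective, the short calculation below converts clock frequency and frame rate into an average cycle budget per pixel. It is only back-of-the-envelope arithmetic on the numbers in the abstract; the helper name `cycles_per_pixel` is introduced here for illustration, and the budget counts luma pixels only, ignoring chroma samples and any parallelism across the PE array.

```python
def cycles_per_pixel(clock_hz, width, height, fps):
    """Average clock cycles available per output pixel at a given frame rate."""
    pixels_per_second = width * height * fps
    return clock_hz / pixels_per_second


# Figures quoted for the HD configuration.
h264_budget = cycles_per_pixel(200e6, 1920, 1080, 30)
hevc_budget = cycles_per_pixel(236e6, 1920, 1080, 30)

print(f"H.264 at 200 MHz: {h264_budget:.2f} cycles per pixel")
print(f"HEVC  at 236 MHz: {hevc_budget:.2f} cycles per pixel")
```

Both budgets come out at fewer than four clock cycles per pixel, which suggests why real-time decoding at these frequencies relies on many processing elements working in parallel.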

Citations: 1

Session details: Technical Session 2: Configuration and Processing

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/3251651

S. Trimberger

Citations: 0

REPROC: A Dynamically Reconfigurable Architecture for Symmetric Cryptography (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689121

Bo D. Wang, Leibo Liu

Citations: 1

Physical Design Space Exploration

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689080

Ephrem Wu, Inkeun Cho

Citations: 0

EURECA: On-Chip Configuration Generation for Effective Dynamic Data Access

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2015-02-22 DOI: 10.1145/2684746.2689076

Xinyu Niu, W. Luk, Yu Wang

Citations: 17