Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays最新文献

Accelerating parameter estimation for multivariate self-exciting point processes 多元自激点过程的加速参数估计

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554765

Ce Guo, W. Luk

引用次数: 9

Hardware acceleration of database operations 数据库操作的硬件加速

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554787

J. Casper, K. Olukotun

{"title":"Hardware acceleration of database operations","authors":"J. Casper, K. Olukotun","doi":"10.1145/2554688.2554787","DOIUrl":"https://doi.org/10.1145/2554688.2554787","url":null,"abstract":"As the amount of memory in database systems grows, entire database tables, or even databases, are able to fit in the system's memory, making in-memory database operations more prevalent. This shift from disk-based to in-memory database systems has contributed to a move from row-wise to columnar data storage. Furthermore, common database workloads have grown beyond online transaction processing (OLTP) to include online analytical processing and data mining. These workloads analyze huge datasets that are often irregular and not indexed, making traditional database operations like joins much more expensive. In this paper we explore using dedicated hardware to accelerate in-memory database operations. We present hardware to accelerate the selection process of compacting a single column into a linear column of selected data, joining two sorted columns via merging, and sorting a column. Finally, we put these primitives together to accelerate an entire join operation. We implement a prototype of this system using FPGAs and show substantial improvements in both absolute throughput and utilization of memory bandwidth. Using the prototype as a guide, we explore how the hardware resources required by our design change with the desired throughput.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126936163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 173

Revisiting and-inverter cones 重访和逆变锥

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554791

Grace Zgheib, Liqun Yang, Zhihong Huang, D. Novo, H. Parandeh-Afshar, Haigang Yang, P. Ienne

{"title":"Revisiting and-inverter cones","authors":"Grace Zgheib, Liqun Yang, Zhihong Huang, D. Novo, H. Parandeh-Afshar, Haigang Yang, P. Ienne","doi":"10.1145/2554688.2554791","DOIUrl":"https://doi.org/10.1145/2554688.2554791","url":null,"abstract":"And-Invert Cones (AICs) have been suggested as an alternative to the ubiquitous Look-Up Tables (LUTs) used in commercial FPGAs. The original article suggesting the new architecture made some untested assumptions on the circuitry needed to implement AIC architectures and did not develop completely the toolset necessary to assess comprehensively the idea. In this paper, we pick up the architecture that some of us proposed in the original AIC paper and try to implement it as thoroughly as we can afford. We build all components for the logic cluster at transistor level in a 40~nm technology as well as a LUT-based architecture inspired by Altera's Stratix~IV. We first determine that the characteristics of our LUT-based architecture are reasonably similar to those of the commercial counterpart. Then, we compare the AIC architecture to the baseline on a number of benchmarks, and we find a few difficulties that had been overlooked before. We thus explore other design possibilities around the original design point and show their detailed impact. Finally, we discuss how the very structure of current logic clusters seems not perfectly appropriate for getting the best out of AICs and conclude that, even though they are not confirmed as an immediate blessing today, AICs still offer rich research opportunities.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114396523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

Methodology to generate multi-dimensional systolic arrays for FPGAs using openCL (abstract only) 使用openCL为fpga生成多维收缩数组的方法(仅摘要)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554750

Nick Ni

引用次数: 0

Session details: Physical design 会话细节:物理设计

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/3260936

Jonathan Rose

引用次数: 0

1K manycore FPGA shared memory architecture for SOC (abstract only) 用于SOC的1K多核FPGA共享内存架构(仅抽象)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554699

Y. Ben-Asher, Jacob Gendel, Gadi Haber, Oren Segal, Yousef Shajrawi

{"title":"1K manycore FPGA shared memory architecture for SOC (abstract only)","authors":"Y. Ben-Asher, Jacob Gendel, Gadi Haber, Oren Segal, Yousef Shajrawi","doi":"10.1145/2554688.2554699","DOIUrl":"https://doi.org/10.1145/2554688.2554699","url":null,"abstract":"Manycore shared memory architectures hold a significant premise to speed up and simplify SOCs. Using many homogeneous small-cores will allow replacing the hardware accelerators of SOCs by parallel algorithms communicating through shared memory. Currently shared memory is realized by maintaining cache-consistency across the cores, caching all the connected cores to one main memory module. This approach, though used today, is not likely to be scalable enough to support the high number of cores needed for highly parallel SOCs. Therefore we consider a theoretical scheme for shared memory wherein: the shared address space is divided between a set of memory modules; and a communication network allows each core to access every such module in parallel. Load-balancing between the memory modules is obtained by rehashing the memory address-space. We have designed a simple generic shared memory architecture, synthesized it to 2,4,8,,..1024-cores for FPGA virtex-7 and evaluated it on several parallel programs. The synthesis results and the execution measurements show that, for the FPGA, all problematic aspects of this construction can be resolved. For example, unlike ASICs, the growing complexity of the communication network is absorbed by the FPGA's routing grid and by its routing mechanism. This makes this type of architectures particularly suitable for FPGAs. We used 32-bits modified PACOBLAZE cores and tested different parameters of this architecture verifying its ability to achieve high speedups. The results suggest that re-hashing is not essential and one hash-function suffice (compared to the family of universal hash functions that is needed by the theoretical construction).","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127099999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Combining computation and communication optimizations in system synthesis for streaming applications 流应用系统综合中计算与通信优化的结合

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554771

J. Cong, Muhuan Huang, Peng Zhang

{"title":"Combining computation and communication optimizations in system synthesis for streaming applications","authors":"J. Cong, Muhuan Huang, Peng Zhang","doi":"10.1145/2554688.2554771","DOIUrl":"https://doi.org/10.1145/2554688.2554771","url":null,"abstract":"Data streaming is a widely-used technique to exploit task-level parallelism in many application domains such as video processing, signal processing and wireless communication. In this paper we propose an efficient system-level synthesis flow to map streaming applications onto FPGAs with consideration of simultaneous computation and communication optimizations. The throughput of a streaming system is significantly impacted by not only the performance and number of replicas of the computation kernels, but also the buffer size allocated for the communications between kernels. In general, module selection/replication and buffer size optimization were addressed separately in previous work. Our approach combines these optimizations together in system scheduling which minimizes the area cost for both logic and memory under the required throughput constraint. We first propose an integer linear program (ILP) based solution to the combined problem which has the optimal quality of results. Then we propose an iterative algorithm which can achieve the near-optimal quality of results but has a significant improvement on the algorithm scalability for large and complex designs. The key contribution is that we have a polynomial-time algorithm for an exact schedulability checking problem and a polynomial-time algorithm to improve the system performance with better module implementation and buffer size optimization. Experimental results show that compared to the separate scheme of module select/replication and buffer size optimization, the combined optimization scheme can gain 62% area saving on average under the same performance requirements. Moreover, our heuristic can achieve 2 to 3 orders of magnitude of speed-up in runtime, with less than 10% area overhead compared to the optimal solution by ILP.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127520149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 34

Using DSP blocks to compute CRC hash in FPGA (abstract only) 在FPGA中使用DSP块计算CRC哈希(仅抽象)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554689

V. Pus, Lukás Kekely, Tomás Závodník

{"title":"Using DSP blocks to compute CRC hash in FPGA (abstract only)","authors":"V. Pus, Lukás Kekely, Tomás Závodník","doi":"10.1145/2554688.2554689","DOIUrl":"https://doi.org/10.1145/2554688.2554689","url":null,"abstract":"Hash table and its variations are common ways to implement lookup operations in FPGA. The process of adding to, deleting from, and searching in the hash table uses one or more hash functions to compute the address to the table. A suitable hash function must meet statistical properties such as uniform distribution, use of all input bits, large change of output based on small change of input. Other desirable parameters are high throughput and low FPGA resources usage. We propose a novel approach to the CRC hash computation in FPGA. The method is suitable for applications such as hash tables, which use parallel inputs of fixed size and require high throughput. We employ DSP blocks present in modern FPGAs to perform all the necessary XOR operations, therefore our solution does not use any LUTs. We propose a Monte Carlo based heuristic to reduce the number of DSP blocks required. Our experimental results show that one DSP block capable of 48 XOR operations can replace around eleven 6-input LUTs. Our results further show that our solution performs less XOR operations than the solution with LUTs optimized by the synthesizer.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127360374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Transformations for throughput optimization in high-level synthesis (abstract only) 在高级合成中进行吞吐量优化的转换(仅抽象)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554772

Peng Li, L. Pouchet, Deming Chen, J. Cong

{"title":"Transformations for throughput optimization in high-level synthesis (abstract only)","authors":"Peng Li, L. Pouchet, Deming Chen, J. Cong","doi":"10.1145/2554688.2554772","DOIUrl":"https://doi.org/10.1145/2554688.2554772","url":null,"abstract":"Programming productivity of FPGA devices remains a significant challenge, despite the emergence of robust high level synthesis tools to automatically transform codes written in high-level languages into RTL implementations. Focusing on a class of programs with regular loop bounds and array accesses (so-called affine programs), the polyhedral compilation framework provides a convenient environment to automate many of the manual program transformation tasks that are still needed to improve the QoR of the HLS tool. In this work, we demonstrate that tiling-driven affine loop transformations, while mandatory to ensure good data reuse and reduce off-chip communication volumes, are not always enough to achieve the best throughput, determined by the Initiation Interval (II) for loop pipelining. We develop additional techniques to optimize the computation part to be executed on the FPGA, using Index-Set Splitting (ISS) to split loops into sub-loops with different properties (sequential/parallel, different memory port conflicts features). This is motivated by the presence of non-uniform data dependences in some affine benchmarks, which are not effectively handled by the affine transformation system for tiling implemented in the PolyOpt/HLS software. We develop a customized affine+ISS optimization algorithm that aims at reducing the II of pipelined inner loops to reduce the program latency. We report experimental results on numerous affine computations.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128927381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Producing high-quality real-time HDR video system with FPGA (abstract only) 用FPGA制作高质量实时HDR视频系统(仅抽象)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554738

T. Ai, Mir Adnan Ali, J. Steffan, Kalin Ovtcharov, Sarmad Zulfiqar, Steve Mann

{"title":"Producing high-quality real-time HDR video system with FPGA (abstract only)","authors":"T. Ai, Mir Adnan Ali, J. Steffan, Kalin Ovtcharov, Sarmad Zulfiqar, Steve Mann","doi":"10.1145/2554688.2554738","DOIUrl":"https://doi.org/10.1145/2554688.2554738","url":null,"abstract":"Video cameras can only take photographs with limited dynamic range. One method to overcome this is to combine differently exposed images of the same subject matter (i.e. a Wyckoff Set), producing a High Dynamic Range (HDR) result. HDR digital photography started almost 20 years ago. Now, it is possible to produce HDR video in real-time, on both high-power CPU/GPU systems, as well as low-power FPGA boards. However, other FPGA implementations have relied upon methods that are less accurate than current CPU and GPU-based methods. Namely, the earlier FPGA approaches used weighted sum for image compositing. In this paper we provide a novel method for real-time HDR com-positing. As an essential part of an upgraded HDR video production system, the resulting system combines differently exposed video stream (of the same subject matter) in Full HD (1080p at 60fps) on a Kintex-7 FPGA. The proposed work flow, implemented with software written in C, estimates the camera response function according to its quadtree representation and generates the compositing circuit in Verilog HDL from a Wyckoff Set. This circuit consists of parts that perform addressing using multiplexer networks and estimation with bilinear interpolation. It is parameterizable by user-specified error constraints, allowing us to explore the trade-offs in resource usage and precision of the implementation. Here is an MD5 hash function sum generated for the rest of the paper: 07897e61027d15dc3600fadbccfbd67d, citation date: December 18, 2013.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130562839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2