Gerardo Soria García, Adrian Pedroza de la Cruz, S. Ortega Cisneros, J. J. Raygoza Panduro, Eduardo Bayro Corrochano
{"title":"A Hardware Implementation of a Unit for Geometric Algebra Operations With Parallel Memory Arrays (Abstract Only)","authors":"Gerardo Soria García, Adrian Pedroza de la Cruz, S. Ortega Cisneros, J. J. Raygoza Panduro, Eduardo Bayro Corrochano","doi":"10.1145/2684746.2689132","DOIUrl":"https://doi.org/10.1145/2684746.2689132","url":null,"abstract":"Geometric algebra (GA) is a powerful and versatile mathematical tool which helps to intuitively express and manipulate complex geometric relationships. It has recently been used in engineering problems such as computer graphics, machine vision, and robotics, among others. The problem with GA in its numeric version is that it requires many arithmetic operations, and the length of the input vectors is unknown until runtime in a generic architecture operating over homogeneous elements. Few hardware architectures for GA have been developed to improve the performance of GA applications. In this work, a hardware architecture of a unit for GA operations (geometric product) for FPGA is presented. The main contribution of this work is the use of parallel memory arrays with access conflict avoidance to deal with the unknown length of input/output vectors; the intention is to reduce the memory wasted when storing the input and output vectors. In this first stage of the project, we have implemented only a single access function (fixed-length) in the memory array in order to test the core of the geometric product. In future work we will implement a full set of access functions with different lengths and shapes. 
In this work, only the simulations are presented; in the future, we will also present the experimental results.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123406236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Youngsoo Kim, William Harding, C. Gloster, W. Alexander
{"title":"Acceleration of Synthetic Aperture Radar (SAR) Algorithms using Field Programmable Gate Arrays (FPGAs) (Abstract Only)","authors":"Youngsoo Kim, William Harding, C. Gloster, W. Alexander","doi":"10.1145/2684746.2689125","DOIUrl":"https://doi.org/10.1145/2684746.2689125","url":null,"abstract":"Algorithms for radar signal processing, such as Synthetic Aperture Radar (SAR), are computationally intensive and require considerable execution time on a general purpose processor. Reconfigurable logic can be used to off-load the primary computational kernel onto a custom computing machine in order to reduce execution time by an order of magnitude compared to kernel execution on a general purpose processor. Specifically, Field Programmable Gate Arrays (FPGAs) can be used to house hardware-based custom implementations of these kernels to speed up these applications. In this paper, we demonstrate a methodology for algorithm acceleration. We used SAR as a case study to illustrate the tremendous potential for algorithm acceleration offered by FPGAs. Initially, we profiled the SAR algorithm and implemented a homomorphic filter using a hardware implementation of the natural logarithm. 
Experimental results show an average speed-up of 188 when using the FPGA-based hardware accelerator as opposed to using a software implementation running on a typical general purpose processor.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124450545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Implementation of LUT with Large Numbers of Inputs (Abstract Only)","authors":"M. Fujita","doi":"10.1145/2684746.2689107","DOIUrl":"https://doi.org/10.1145/2684746.2689107","url":null,"abstract":"A LUT is implemented with a set of flip-flops connected to a series of multiplexers, or alternatively with a small memory, and needs exponentially many storage elements with respect to the number of inputs. For this reason, FPGAs use LUTs with around 6 inputs, but LUTs with larger numbers of inputs may be better from various performance viewpoints, as well as for applications to flexible logic debugging and Engineering Change Order (ECO), as there are fewer interconnects among LUTs. Such LUTs may accommodate changes of designs including logic debugging and ECO. We discuss implementations for LUTs having relatively large numbers of inputs, such as 12 inputs. If we implement a single LUT with 12 inputs, we need 2^12 = 4,096 storage elements. On the other hand, we can construct 12-input subcircuits of fixed topologies using only sets of LUTs having small numbers of inputs, such as 4 inputs. Although such subcircuits can realize only very small subsets of all possible logic functions with 12 inputs, if they can realize most of the logic functions we need for actual designs by only reprogramming the sets of 4-input LUTs, they are worthwhile in practice. We present several such fixed-topology subcircuits as well as automatic compilation methods from given logic functions. Experimental results show that almost all functions (more than 99%) which appear in benchmark circuits with partially disjoint decomposability can be implemented by the proposed topologies. 
Sophisticated circuit partitioning methods can always generate networks of subcircuits with partially disjoint decomposability.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121295306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
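The storage argument in the LUT abstract above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the author's construction: the tree topology (three 4-input leaf LUTs feeding one 4-input root LUT whose fourth input is tied to 0) and all function names are assumptions chosen for illustration.

```python
def lut_storage_bits(num_inputs: int) -> int:
    """A k-input LUT stores one bit per input combination, i.e. 2**k bits."""
    return 2 ** num_inputs


def eval_lut(table: int, inputs: list[int]) -> int:
    """Read one bit from a LUT whose contents are packed into an integer."""
    index = sum(bit << position for position, bit in enumerate(inputs))
    return (table >> index) & 1


def eval_tree(leaf_tables: list[int], root_table: int, inputs: list[int]) -> int:
    """Hypothetical fixed topology: three 4-input leaves feed a 4-input root.

    The root's fourth input is unused and tied to 0 (an assumption).
    """
    leaf_outputs = [
        eval_lut(leaf_tables[i], inputs[4 * i : 4 * i + 4]) for i in range(3)
    ]
    return eval_lut(root_table, leaf_outputs + [0])


single_lut_bits = lut_storage_bits(12)  # one monolithic 12-input LUT
tree_bits = 4 * lut_storage_bits(4)     # four 4-input LUTs in the tree

# Program the tree as a 12-input AND: each leaf is a 4-input AND (only
# entry 15 is 1) and the root is a 3-input AND (only entry 7 is 1).
and_leaf = 1 << 15
and_root = 1 << 7
all_ones = eval_tree([and_leaf] * 3, and_root, [1] * 12)
one_zero = eval_tree([and_leaf] * 3, and_root, [1] * 11 + [0])
```

The tree uses far fewer storage bits than the monolithic LUT, at the cost of realizing only those 12-input functions that decompose along the chosen topology, which is the trade-off the abstract describes.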
{"title":"Application of Specific Delay Window Routing for Timing Optimization in FPGA Designs","authors":"Evan Wegley, Qinhai Zhang","doi":"10.1145/2684746.2689059","DOIUrl":"https://doi.org/10.1145/2684746.2689059","url":null,"abstract":"In addition to optimizing for timing performance and routability, commercial FPGA routing engines must also support various timing constraints enabling the designer to fine tune aspects of their design. The many intricacies of commercial FPGA architectures add difficulty to the problem of supporting such constraints. In this paper, we show how the method of specific delay window routing can be applied to optimize for these various timing constraints constituting both long- and short-path requirements. Additionally, we enhance existing methods of routing according to specified delay by using dual wave expansion instead of single wave expansion with target delay estimation in order to improve accuracy and support sparser, more varied interconnect structures. Our results show that specific delay window routing is well-suited for optimization targeting a variety of timing constraints, and that using dual wave expansion to eliminate the estimation part of the router's delay cost function enables the router to support tighter timing constraints. For a suite of designs with known hold timing violations, we found that the dual wave approach can correct all such violations, whereas the single wave approach failed to correct the hold timing violations for several designs. 
Furthermore, for a suite of designs with maximum skew constraints of 250 ps on certain nets and buses, the dual wave approach met the constraints for all designs, whereas the single wave approach failed to meet the constraints for a majority of the designs.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126619724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhilei Chai, Jin Yu, Zhibin Wang, Jie Zhang, Haojie Zhou
{"title":"An Embedded FPGA Operating System Optimized for Vision Computing (Abstract Only)","authors":"Zhilei Chai, Jin Yu, Zhibin Wang, Jie Zhang, Haojie Zhou","doi":"10.1145/2684746.2689127","DOIUrl":"https://doi.org/10.1145/2684746.2689127","url":null,"abstract":"Although FPGA's power and performance advantages were recognized widely, designing applications on FPGA-based systems is traditionally a task undertaken by hardware experts. It is significant to allow application-level programmers with less system-level but more algorithm knowledge to realize their applications conveniently on FPGAs. In this paper, an embedded FPGA operating system is proposed to facilitate application-level programmers to use FPGAs. Firstly, it builds specific I/Os and optimizes bus interconnection among I/Os, DDR memory, user IPs etc within the FPGA for vision computing. Secondly, it manages resources of the FPGA such as I/Os, DDR memory, communication etc, frees users from low-level details. Thirdly, it schedules tasks (IPs) executed on the FPGA dynamically in runtime, which makes the FPGA multiplexed when necessary. After porting the FPGA operating system to different FPGA platforms and implementing vision algorithms based on that, it shows the FPGA operating system is able to simplify algorithm development on FPGA platforms and improve portability of user applications. Furthermore, implementation results of several popular vision algorithms show the FPGA operating system is efficient and effective for vision computing. 
Finally, experimental results show that for multiple algorithms requiring more FPGA resources, runtime task scheduling of multiple IPs is more efficient than a fixed IP when the SoC of FPGA is considered.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131450063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-based BLOB Detection Using Dual-pipelining (Abstract Only)","authors":"Naoto Nojiri, Lin Meng, K. Yamazaki","doi":"10.1145/2684746.2689118","DOIUrl":"https://doi.org/10.1145/2684746.2689118","url":null,"abstract":"Binary Large OBject (BLOB) detection is utilized in various fields such as car cameras, traffic sign recognition and surveillance systems. Although labeling is an important component in BLOB detection, it is difficult to parallelize using a look-up table (LUT) because of data dependency. Since BLOB detection takes a long time, recognition speed and accuracy need to be improved. This research aims to detect BLOBs as fast as possible by using dual-pipelining image processing on the FPGA. Dual-pipelining divides an original image into upper and lower portions and performs pipeline processing on the two portions in parallel. We have to consider the timing of each module around the borderline because of the data dependency in label generation. The image processing consists of Gaussian filtering, binarization, labeling, and BLOB analysis. Generally, labeling uses a LUT to combine multiple numbers for one object into the smallest number of temporary labels. In order to simplify the labeling, the connected components of each BLOB are stored and revised just in the LUT. In our approach, a BLOB can be detected when multiple temporary labels are stored in the same entry of the LUT, thus enabling us to detect BLOBs by dual-pipelining. Although our labeling method does not revise temporary labels into a unified label, BLOBs can be detected and their numbers, areas, and centroids are correctly computed. We compared our approach with a related work, which consists of three steps: identifying the connected pixels in each row, labeling the counted pixels in different rows, and computing the area and centroid. 
Experimental results show that the dual-pipelining system using FPGA can detect BLOBs in 0.06 ms, which is 3.92 times faster than the related work and 1.83 times faster than a single-pipelining system. The dual-pipelining system utilized 1.5% of Registers, 8.4% of LUT, 24.3% of LUT-FF pairs, 91.9% of BRAM in Virtex V. The dual-pipelining system is about twice as large as the single-pipelining system. Our approach can be applied to other areas such as traffic sign recognition and vehicle detection.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122864996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
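The role of the label look-up table in the BLOB abstract above can be illustrated in software. The sketch below is a classical sequential labeling pass with an equivalence table, not the authors' dual-pipelined hardware; the 4-connectivity choice, the function name `detect_blobs`, and the test image are assumptions. As in the abstract, pixels keep their temporary labels, and areas and centroids are accumulated per equivalence class through the table.

```python
def detect_blobs(image):
    """Return {root_label: (area, (centroid_row, centroid_col))}.

    `image` is a list of rows of 0/1 pixels; 4-connectivity is assumed.
    """
    parent = [0]  # equivalence table (the "LUT"); index 0 is background

    def find(label):
        # Follow the table to the representative label, compressing the path.
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    labels = [[0] * len(row) for row in image]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if not pixel:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # allocate a new temporary label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = up or left
                if up and left:
                    # Two temporary labels meet: record the equivalence.
                    a, b = find(up), find(left)
                    parent[max(a, b)] = min(a, b)

    # Accumulate area and coordinate sums per equivalence class.
    sums = {}
    for r, row in enumerate(labels):
        for c, label in enumerate(row):
            if label:
                root = find(label)
                area, sum_r, sum_c = sums.get(root, (0, 0, 0))
                sums[root] = (area + 1, sum_r + r, sum_c + c)
    return {
        root: (area, (sum_r / area, sum_c / area))
        for root, (area, sum_r, sum_c) in sums.items()
    }


example_image = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
blobs = detect_blobs(example_image)
```

In the example image the U-shaped object first receives two temporary labels that are merged only in the table, which is the data dependency that makes labeling hard to parallelize.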
{"title":"MedianPipes: An FPGA based Highly Pipelined and Scalable Technique for Median Filtering (Abstract Only)","authors":"Umer I. Cheema, G. Nash, R. Ansari, A. Khokhar","doi":"10.1145/2684746.2689142","DOIUrl":"https://doi.org/10.1145/2684746.2689142","url":null,"abstract":"We propose MedianPipes, a novel, FPGA based, highly pipelined and scalable architecture for median filtering. Median filters and their variants are widely used for noise suppression in image processing. All variants of the median filter depend on the computation of median values. MedianPipe is a highly pipelined architecture and hence an ideal fit for FPGAs. It does not assume that the image fits in on-chip memory. Instead, the image is assumed to be streamed in as image slices. Multiple MedianPipe modules are used depending on the size of the image slice, and hence the overall hardware complexity of the proposed technique scales linearly with image-slice size. The architecture for MedianPipe is based on the principle of merge sort and uses a median window of size 3 x 3. It consists of a two-step sorting process: The first step is to sort the pixels within each row of the median window to get sorted rows. This sorting is done using a single comparator over multiple clock cycles. The sorted rows are saved in block memory based First-In-First-Out (FIFO) memory and reused to calculate the medians corresponding to three median windows. The second step is to merge these sorted rows to find the median using a merger block. The merger block consists of three comparators and reads out a single value every cycle once the pipeline is filled. Without loss of generality, the pixels of an image slice are assumed to be read in a column major format. All the median values within the column of the image slice can be computed in parallel using multiple MedianPipes. The computation of median values in the following column is delayed by a clock cycle. 
Hardware resources scale linearly by varying the pixel sizes and number of MedianPipes. The pixel rate achieved for various pixel sizes is well above 124 MHz which is the standard for 1080p High-Definition.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128710781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
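The two-step process described in the MedianPipes abstract above (sort each row of the 3 x 3 window, then merge the sorted rows and read out the median) can be sketched in software as follows. This is only a functional model under stated assumptions: the function name `median_3x3` is hypothetical, and the sketch does not model the single-comparator row sorter, the FIFO reuse of sorted rows, or the cycle-level pipelining.

```python
import heapq
from itertools import islice


def median_3x3(window):
    """Median of a 3 x 3 window via row sorting followed by a three-way merge."""
    # Step 1: sort the pixels within each row of the window.
    sorted_rows = [sorted(row) for row in window]
    # Step 2: merge the sorted rows; the median of nine values is the fifth
    # value produced by the merge, so the merge can stop there.
    merged = heapq.merge(*sorted_rows)
    return next(islice(merged, 4, None))


example_window = [
    [9, 1, 5],
    [3, 7, 2],
    [8, 4, 6],
]
example_median = median_3x3(example_window)
```

Because each sorted row belongs to three vertically adjacent windows, a hardware implementation can reuse it rather than re-sorting, which is the reuse the abstract attributes to the FIFO.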
Junbin Wang, Leibo Liu, Jianfeng Zhu, S. Yin, Shaojun Wei
{"title":"A Novel Composite Method to Accelerate Control Flow on Reconfigurable Architecture (Abstract Only)","authors":"Junbin Wang, Leibo Liu, Jianfeng Zhu, S. Yin, Shaojun Wei","doi":"10.1145/2684746.2689124","DOIUrl":"https://doi.org/10.1145/2684746.2689124","url":null,"abstract":"Reconfigurable architecture provides a promising solution for embedded systems that need high performance, low power and flexibility. Control dependence and control divergence are critical problems that impact performance. Many methods have been proposed to handle control flows efficiently, such as predicated execution and speculative execution. However, they perform differently for different types of control flows, so composite methods are required to provide overall optimal performance. In this paper, a novel architecture is proposed which combines triggered instructions and parallel condition. It is designed on the basis of the triggered instruction architecture (TIA), while each PE incorporates multiple arithmetic logic units with fast mutual control, as in the technique of parallel condition. It can remove branch instructions as well as parallelize control and compute instructions without a reconciliation operation, so it exploits branch-level parallelism while avoiding the over-serialized execution of program-counter-based PEs. 
The experiment was conducted on a model in C language and the result shows that the proposed architecture can achieve 80.0% higher performance on average than TIA.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125916182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Data Forwarding in Deeply Pipelined Soft Processors","authors":"Hui Yan Cheah, Suhaib A. Fahmy, Nachiket Kapre","doi":"10.1145/2684746.2689067","DOIUrl":"https://doi.org/10.1145/2684746.2689067","url":null,"abstract":"We can design high-frequency soft-processors on FPGAs that exploit deep pipelining of DSP primitives, supported by selective data forwarding, to deliver up to 25% performance improvements across a range of benchmarks. Pipelined, in-order, scalar processors can be small and lightweight but suffer from a large number of idle cycles due to dependency chains in the instruction sequence. Data forwarding allows us to more deeply pipeline the processor stages while avoiding an associated increase in the NOP cycles between dependent instructions. Full forwarding can be prohibitively complex for a lean soft processor, so we explore two approaches: an external forwarding path around the DSP block execution unit in FPGA logic and using the intrinsic loopback path within the DSP block primitive. We show that internal loopback improves performance by 5% compared to external forwarding, and up to 25% over no data forwarding. The result is a processor that runs at a frequency close to the fabric limit of 500 MHz, but without the significant dependency overheads typical of such processors.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117272729","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
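The effect of data forwarding on idle cycles, as described in the soft-processor abstract above, can be illustrated with a toy issue model. This is a rough sketch under assumptions that are not in the abstract: a single-issue in-order pipeline, instructions given as hypothetical `(dest, sources)` tuples, and a `result_latency` parameter standing for the number of cycles after issue before a result can be consumed (shorter with a forwarding or loopback path). The latency values used below are illustrative, not measurements from the paper.

```python
def count_nops(instructions, result_latency):
    """Count NOP cycles an in-order, single-issue pipeline must insert.

    instructions: list of (dest_register, [source_registers]) tuples.
    result_latency: cycles after issue before the result is usable.
    """
    ready_at = {}  # register name -> first cycle its value can be consumed
    cycle = 0
    nops = 0
    for dest, sources in instructions:
        earliest = max((ready_at.get(src, 0) for src in sources), default=0)
        if earliest > cycle:
            nops += earliest - cycle  # stall until every operand is ready
            cycle = earliest
        ready_at[dest] = cycle + result_latency
        cycle += 1
    return nops


# A short dependency chain followed by one independent instruction.
program = [
    ("r1", []),
    ("r2", ["r1"]),
    ("r3", ["r2"]),
    ("r4", ["r0"]),
]
nops_without_forwarding = count_nops(program, result_latency=4)
nops_with_forwarding = count_nops(program, result_latency=1)
```

With the longer latency every dependent instruction waits for the full pipeline depth, whereas a one-cycle forwarding path lets the chain issue back to back; this is the dependency overhead that deeper pipelining would otherwise increase.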
J. Lockwood, Michael Adler, Dan Mansur, Derek Chiou, M. Strickland, J. Cong, S. Teig
{"title":"Growing a Healthy FPGA Ecosystem","authors":"J. Lockwood, Michael Adler, Dan Mansur, Derek Chiou, M. Strickland, J. Cong, S. Teig","doi":"10.1145/2684746.2721404","DOIUrl":"https://doi.org/10.1145/2684746.2721404","url":null,"abstract":"The personal computer market grew exponentially in the 1980's for vendors such as Apple, Microsoft, and Intel when there was a healthy mix of software, tools, and microprocessor devices. At the time, killer applications that drove the market were spreadsheets, compilers, and games that ran on the personal computer. Thirty years later, we now have a similar opportunity to grow a healthy ecosystem as developers and vendors bring killer applications, tools, and programmable logic devices to the market to accelerate datacenters for cloud computing.","PeriodicalId":388546,"journal":{"name":"Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129255485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}