Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Articles

High Performance Linkage Disequilibrium: FPGAs Hold the Key

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847271

Nikolaos S. Alachiotis, G. Weisz

Abstract: DNA sequencing technologies allow the rapid sequencing of full genomes in a cost-effective way, leading to ever-growing genomic datasets that comprise thousands of genomes and millions of genetic variants. In population genomics and genome-wide association studies, widely used statistics such as linkage disequilibrium become computationally demanding when thousands of whole genomes are investigated. Long analysis times and excessive memory requirements usually prevent researchers from conducting exhaustive analyses, sacrificing the ability to detect distant genetic associations. In this work, we describe a generic algorithmic approach for organizing arbitrarily distant computations on full genomes and for offloading operations from the host processor to accelerators. We explore FPGAs as accelerators for linkage disequilibrium because the bulk of required operations are discrete, making them a good fit for reconfigurable fabric. We describe a versatile and trivially expandable architecture, and develop automatic RTL generation software to search the design space. We find that, when thousands of genomes from complex species such as humans are analyzed, current FPGAs can achieve up to 50X faster processing than state-of-the-art software running on multi-core workstations.
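The "discrete" operations mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each variant site is stored as a bit-vector with one bit per sampled genome (1 = derived allele); the function names `popcount` and `ld_r2` and the integer encoding are hypothetical choices for illustration, not the authors' architecture. Under that assumption, the widely used r² measure of linkage disequilibrium reduces to bitwise AND plus population counts:

```python
def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def ld_r2(snp_a: int, snp_b: int, n_samples: int) -> float:
    """Squared correlation (r^2) between two biallelic sites.

    snp_a and snp_b are bit-vectors (assumed encoding): bit i is 1 when
    sample i carries the derived allele at that site.
    """
    p_a = popcount(snp_a) / n_samples           # allele frequency at site A
    p_b = popcount(snp_b) / n_samples           # allele frequency at site B
    p_ab = popcount(snp_a & snp_b) / n_samples  # frequency of both alleles together
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        # r^2 is undefined for a monomorphic site; 0.0 is returned by convention here
        return 0.0
    d = p_ab - p_a * p_b                        # disequilibrium coefficient D
    return d * d / denom
```

For example, `ld_r2(0b1100, 0b1100, 4)` compares two identical sites over four samples, while `ld_r2(0b1100, 0b1010, 4)` compares two statistically independent ones. Because every pair of sites needs only an AND and three population counts, the work is integer-only and independent across pairs.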

Citations: 10

GPU-Accelerated High-Level Synthesis for Bitwidth Optimization of FPGA Datapaths

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847266

Nachiket Kapre, Deheng Ye

Abstract: Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the fewest number of bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization on the GPU by performing a fast brute-force search over a carefully constrained search space. We develop a high-level synthesis methodology suitable for rapid prototyping of bitwidth-annotated RTL code generation using gcc's GIMPLE backend. For range analysis, we perform parallel evaluation of sub-intervals to provide tighter bounds compared to ordinary interval arithmetic. For bitwidth allocation, we enumerate the different bitwidth combinations in parallel by assigning each combination to a GPU thread. We demonstrate up to 10-1000x speedups for range analysis and 50-200x speedups for bitwidth allocation when comparing an NVIDIA K20 GPU implementation to an Intel Core i5-4570 CPU while maintaining identical solution quality across various benchmarks. This allows us to generate tailor-made RTL with minimum bitwidths in hundreds of milliseconds instead of hundreds of minutes when starting from high-level C descriptions of dataflow computations.
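The claim that evaluating sub-intervals gives tighter bounds than ordinary interval arithmetic can be seen in a small sketch. The example function f(x) = x(1 - x), the helper names, and the sequential loop are all assumptions chosen for illustration; the abstract describes evaluating the sub-intervals in parallel on a GPU, which this sketch does not attempt.

```python
def interval_mul(a, b):
    """Product of two closed intervals given as (lo, hi) tuples."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def interval_sub(a, b):
    """Difference a - b of two closed intervals."""
    return (a[0] - b[1], a[1] - b[0])


def f_range(x):
    """Interval evaluation of the example function f(x) = x * (1 - x)."""
    return interval_mul(x, interval_sub((1.0, 1.0), x))


def split_range(lo, hi, parts):
    """Evaluate f on `parts` equal sub-intervals and return the hull of the results.

    Each sub-interval is independent of the others, so the evaluations
    could run in parallel; here they run in a plain loop.
    """
    width = (hi - lo) / parts
    results = [f_range((lo + i * width, lo + (i + 1) * width)) for i in range(parts)]
    return (min(r[0] for r in results), max(r[1] for r in results))
```

On the input range [0, 1], `f_range((0.0, 1.0))` returns the loose bound `(0.0, 1.0)`, whereas `split_range(0.0, 1.0, 4)` returns `(0.0, 0.375)`, closer to the true range [0, 0.25]. A tighter upper bound means fewer integer bits are needed to represent the variable.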

Citations: 4

An FPGA-SOC Based Accelerating Solution for N-body Simulations in MOND (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847307

Bo Peng, Tianqi Wang, Xi Jin, Chuanjun Wang

Abstract: Modified Newtonian dynamics (MOND) has shown great success as a modified-potential theory of gravity. In this paper, we present a highly integrated accelerating solution for N-body MOND simulations. By using the FPGA-SoC, which integrates both an FPGA and an SoC (system on chip) in one chip, our solution exhibits potential for better performance, higher integration, and lower power consumption. To handle the calculation bottleneck of potential summation, on the one hand, we develop a strategy to simplify the pipeline, in which the square calculation task is conducted by the DSP48E1 of Xilinx 7 series FPGAs, so as to reduce the logic resource consumption of each pipeline; on the other hand, we exploit the advantages of the particle-mesh scheme to overcome the bandwidth bottleneck. Our experimental results show that 2 more pipelines can be integrated in the Zynq-7020 FPGA-SoC with the simplified pipeline, and the bandwidth requirement is reduced significantly. Furthermore, our accelerating solution has a full range of advantages over different processors. Compared with GPU, our work is about better in both performance per Watt and performance per cost.

Citations: 0

LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847283

Hsin-Jung Yang, Kermin Fleming, Michael Adler, F. Winterstein, J. Emer

Citations: 7

Enhanced TERO-PUF Implementations and Characterization on FPGAs (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847298

Cédric Marchand, L. Bossuet, A. Cherkaoui

Abstract: Physical unclonable functions (PUF) are a promising approach in design for trust and security. A PUF derives a unique identifier using physical characteristics of different dies containing an identical circuit, so it can be used to authenticate chips and for identification. The transient effect ring oscillator (TERO) PUF is based on the extraction of entropy due to process variations by comparing TERO cell characteristics. The TERO cell is designed and implemented with a symmetric structure that requires special selection of the gates used and the delays of all connections inside the cell. Implementing this cell in FPGAs is challenging because the structure of FPGAs does not automatically allow designers to choose connections between elements. However, by manually specifying constraints and using specific features of the target FPGA family, the symmetry of the TERO cell can be established and reproduced in larger designs. In this work, the design of the TERO cell is described for two different FPGA technologies (45nm Xilinx Spartan 6 and 28nm Altera Cyclone V). The statistical characterization of the TERO-PUF with the two targeted FPGAs has resulted in a uniqueness of 48.46% with Spartan 6 and 47.62% with Cyclone V. The result for the steadiness is 2.63% with Spartan 6 and 1.8% with Cyclone V. These results are close to the results obtained by several works that use ring oscillator PUFs (RO-PUF), which are considered the best candidates for PUF implementation on FPGAs. However, TERO-PUF is less sensitive to electromagnetic analysis than RO-PUF. Additionally, unlike RO-PUF, TERO-PUF is able to generate multiple bits per challenge (from one to three) and we have shown during the statistical characterization that the TERO-PUF provides from 0.85 to 1 bits of entropy per response bit. In conclusion, our work clearly shows that TERO-PUF is a serious alternative to RO-PUF for PUF implementation on FPGAs, with strong statistical characteristics and more security than RO-PUF.
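The abstract reports uniqueness and steadiness percentages without defining them. The sketch below assumes the definitions commonly used in PUF characterization: uniqueness as the mean pairwise Hamming distance between responses of different devices (ideal value 50%), and steadiness as the mean Hamming distance between repeated responses of one device and a reference response (ideal value 0%). The function names and the bit-string representation are hypothetical, and the authors' exact metrics may differ.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of differing positions between two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("responses must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def uniqueness(responses):
    """Mean pairwise inter-device Hamming distance, in percent."""
    pairs = list(combinations(responses, 2))
    return 100 * sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def steadiness(reference, repeats):
    """Mean intra-device Hamming distance to a reference response, in percent."""
    return 100 * sum(hamming_fraction(reference, r) for r in repeats) / len(repeats)
```

With these definitions, two devices answering `"0011"` and `"0101"` give a uniqueness of 50%, and a device whose reference response `"0000"` is re-measured as `"0000"` and `"0001"` gives a steadiness of 12.5%.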

Citations: 2

Re-targeting Optimization Sequences from Scalar Processors to FPGAs in HLS compilers (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847315

Ronak Kogta, Suresh Purini, Ajit Mathew

Abstract: A high-level synthesis compiler translates a source program written in a high-level programming language such as C or SystemC into an equivalent circuit. The performance of the generated circuit in terms of metrics such as area, frequency and clock cycles depends on the compiler optimizations enabled and their order of application. Finding an optimal sequence for a given program is a hard combinatorial optimization problem. In this paper, we propose a practical and search-time-efficient technique for finding a near-optimal sequence for a given program. The main idea is to strike a balance between the search for a universally good sequence (like that of O3) which works for all programs vis-a-vis finding a good sequence on a per-program basis. Towards that, we construct a rich downsampled sequence set, which caters to different program classes, from the unbounded optimization sequence space by applying heuristic search algorithms on a set of Microkernel benchmark programs. The optimization metric that we use while constructing the downsampled sequence set is the execution time on a scalar processor. Given a new program, we try all the sequences from the downsampled sequence set and pick the best. Applying this technique in the LegUp high-level synthesis compiler, we are able to obtain 23% and 40% improvement on CHStone and Machsuite benchmark programs respectively. We also propose techniques to further reduce the size of the downsampled sequence set to improve the sequence search time.

Citations: 0

FGPU: An SIMT-Architecture for FPGAs

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847273

Muhammed Al Kadi, Benedikt Janßen, M. Hübner

Citations: 33

Increasing the Utility of Self-Calibration Methods in High-Precision Time Measurement Systems (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847311

Matthias Hinkfoth, R. Salomon

Abstract: Asynchronously operating systems, such as tapped delay lines, are the designer's favorite if high resolution and precision in time are required. Their drawback, however, is that they require extensive calibration, which prohibits, among other things, sporadic recalibration during the mode of operation. Recent research has shown that the tight coupling of two selective high-precision systems inside a single FPGA substantially reduces the required calibration time: it was reduced from several hours to about 30 minutes. But even this method has not solved the problem that human intervention is required for selecting suitable calibration points. The research presented in this poster suggests that a hybrid approach is able to solve this problem: rather than tightly coupling two systems, the present approach employs hybrid elements, called X-BOUNCE, that seamlessly incorporate an X-ORCA element into a BOUNCE element. In the practical experiments, X-BOUNCE has reduced the required calibration time from 30 minutes to one second and has eliminated any human intervention. Furthermore, the proposed X-BOUNCE element can be realized by just one FPGA-LUT, which allows for easy scalability. The results were produced on a Cyclone II FPGA implementing 200 X-BOUNCE elements. Unfortunately, some elements exhibit a calibration inaccuracy that can be as large as 300 ps.

Citations: 0

Automated Verification Code Generation in HLS Using Software Execution Traces (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847313

Liwei Yang, S. Gurumani, Suhaib A. Fahmy, Deming Chen, K. Rupnow

Abstract: Improved quality of results from high level synthesis (HLS) tools has led to their increased adoption. Despite the automated translation from high level descriptions to register-transfer level (RTL) implementations, functional verification remains a major challenge. Verification can take significantly more time than the design process; if there is a functional mismatch, developers must back-trace thousands of signals and cycles to determine the underlying cause. The challenge is further exacerbated with HLS-produced RTL, which is often not human readable. To overcome these challenges, we present a verification technique that uses software-execution traces and automated insertion of verification code into the HLS-generated RTL to assist in debugging. The verification code helps pinpoint the earliest instance of RTL simulation mismatch, either caused by HLS engine bugs or design bugs, and related instructions. We also integrate a watchdog timer to examine the execution of control-flow and perform source-to-source transformation on benchmarks to take advantage of our proposed instrumentation. We also create a framework to insert various types of bugs, e.g. data-flow, control-flow and operational bugs, to evaluate our technique. We use the CHStone benchmark suite and demonstrate that our verification detects over 90% of the inserted bugs, with over 70% of them detected within 10 cycles. In addition, the proposed flow can detect real-life bugs existing in previously released versions of the CHStone suite as well.
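The idea of pinpointing the earliest mismatch between a software execution trace and an RTL simulation can be sketched as a simple trace comparison. This is only an offline illustration under stated assumptions: both traces are taken to be ordered lists of `(name, value)` records, and `first_mismatch` is a hypothetical helper. The work described above instead inserts the checking code into the generated RTL itself.

```python
def first_mismatch(sw_trace, rtl_trace):
    """Return (index, expected, observed) for the earliest divergence, or None.

    sw_trace is the golden software trace and rtl_trace the simulated one;
    both are sequences of comparable records (assumed format).
    """
    for i, (expected, observed) in enumerate(zip(sw_trace, rtl_trace)):
        if expected != observed:
            return (i, expected, observed)
    if len(sw_trace) != len(rtl_trace):
        # One trace ended early: report the first position present in only one of them.
        i = min(len(sw_trace), len(rtl_trace))
        expected = sw_trace[i] if i < len(sw_trace) else None
        observed = rtl_trace[i] if i < len(rtl_trace) else None
        return (i, expected, observed)
    return None
```

For instance, comparing `[("a", 1), ("b", 2), ("c", 3)]` against `[("a", 1), ("b", 5), ("c", 3)]` reports index 1, so debugging can start at the first diverging record rather than at the final wrong output.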

Citations: 3

An Improved Global Stereo-Matching on FPGA for Real-Time Applications (Abstract Only)

Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2016-02-21 DOI: 10.1145/2847263.2847292

Daolu Zha, Xi Jin, Tian Xiang

Citations: 0