{"title":"High Performance Linkage Disequilibrium: FPGAs Hold the Key","authors":"Nikolaos S. Alachiotis, G. Weisz","doi":"10.1145/2847263.2847271","DOIUrl":"https://doi.org/10.1145/2847263.2847271","url":null,"abstract":"DNA sequencing technologies allow the rapid sequencing of full genomes in a cost-effective way, leading to ever-growing genomic datasets that comprise thousands of genomes and millions of genetic variants. In population genomics and genome-wide association studies, widely used statistics such as linkage disequilibrium become computationally demanding when thousands of whole genomes are investigated. Long analysis times and excessive memory requirements usually prevent researchers from conducting exhaustive analyses, sacrificing the ability to detect distant genetic associations. In this work, we describe a generic algorithmic approach for organizing arbitrarily distant computations on full genomes, and to offload operations from the host processor to accelerators. We explore FPGAs as accelerators for linkage disequilibrium because the bulk of required operations are discrete, making them a good fit for reconfigurable fabric. We describe a versatile and trivially expandable architecture, and develop an automatic RTL generation software to search the design space. 
We find that, when thousands of genomes from complex species such as humans, are analyzed, current FPGAs can achieve up to 50X faster processing than state-of-the-art software running on multi-core workstations.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"792 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123283023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-Accelerated High-Level Synthesis for Bitwidth Optimization of FPGA Datapaths","authors":"Nachiket Kapre, Deheng Ye","doi":"10.1145/2847263.2847266","DOIUrl":"https://doi.org/10.1145/2847263.2847266","url":null,"abstract":"Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the fewest number of bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization on the GPU by performing a fast brute-force search over a carefully constrained search space. We develop a high-level synthesis methodology suitable for rapid prototyping of bitwidth-annotated RTL code generation using gcc's GIMPLE backend. For range analysis, we perform parallel evaluation of sub-intervals to provide tighter bounds compared to ordinary interval arithmetic. For bitwidth allocation, we enumerate the different bitwidth combinations in parallel by assigning each combination to a GPU thread. We demonstrate up to 10–1000x speedups for range analysis and 50–200x speedups for bitwidth allocation when comparing an NVIDIA K20 GPU implementation to an Intel Core i5-4570 CPU while maintaining identical solution quality across various benchmarks. 
This allows us to generate tailor-made RTL with minimum bitwidths in hundreds of milliseconds instead of hundreds of minutes when starting from high-level C descriptions of dataflow computations.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131607410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-SOC Based Accelerating Solution for N-body Simulations in MOND (Abstract Only)","authors":"Bo Peng, Tianqi Wang, Xi Jin, Chuanjun Wang","doi":"10.1145/2847263.2847307","DOIUrl":"https://doi.org/10.1145/2847263.2847307","url":null,"abstract":"Modified Newtonian dynamics (MOND) has shown great success as a modified-potential theory of gravity. In this paper, we present a highly integrated accelerating solution for N-body MOND simulations. By using the FPGA-SoC, which integrates both an FPGA and an SoC (system on chip) in one chip, our solution exhibits potential for better performance, higher integration, and lower power consumption. To handle the calculation bottleneck of potential summation, on one hand, we develop a strategy to simplify the pipeline, in which the square calculation task is conducted by the DSP48E1 of Xilinx 7 series FPGAs, so as to reduce the logic resource consumption of each pipeline; on the other hand, the advantages of the particle-mesh scheme are exploited to overcome the bandwidth bottleneck. Our experimental results show that 2 more pipelines can be integrated in the Zynq-7020 FPGA-SoC with the simplified pipeline, and the bandwidth requirement is reduced significantly. Furthermore, our accelerating solution has a full range of advantages over different processors. 
Compared with GPU, our work is about better in both performance per Watt and performance per cost.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126494571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LMC: Automatic Resource-Aware Program-Optimized Memory Partitioning","authors":"Hsin-Jung Yang, Kermin Fleming, Michael Adler, F. Winterstein, J. Emer","doi":"10.1145/2847263.2847283","DOIUrl":"https://doi.org/10.1145/2847263.2847283","url":null,"abstract":"As FPGAs have grown in size and capacity, FPGA memory systems have become both richer and more diverse in order to support the increased computational capacity of FPGA fabrics. Using these resources, and using them well, has become commensurately more difficult, especially in the context of legacy designs ported from smaller, simpler FPGA systems. This growing complexity necessitates resource-aware compilers that can make good use of memory resources on behalf of the programmer. In this work, we introduce the LEAP Memory Compiler (LMC), which can synthesize application-optimized cache networks for systems with multiple memory resources, enabling user programs to automatically take advantage of the expanded memory capabilities of modern FPGA systems. In our experiments, the optimized cache network achieves up to 49% performance gains for throughput-oriented applications and 15% performance gains for latency-oriented applications, while increasing design area by less than 6% of the total chip area.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125635217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced TERO-PUF Implementations and Characterization on FPGAs (Abstract Only)","authors":"Cédric Marchand, L. Bossuet, A. Cherkaoui","doi":"10.1145/2847263.2847298","DOIUrl":"https://doi.org/10.1145/2847263.2847298","url":null,"abstract":"Physical unclonable functions (PUF) are a promising approach in design for trust and security. A PUF derives a unique identifier using physical characteristics of different dies containing an identical circuit, so it can be used to authenticate chips and for identification. The transient effect ring oscillator (TERO) PUF is based on the extraction of entropy due to process variations by comparing TERO cell characteristics. The TERO cell is designed and implemented with a symmetric structure that requires special selection of the gates used and the delays of all connections inside the cell. Implementing this cell in FPGAs is challenging because the structure of FPGAs does not automatically allow designers to choose connections between elements. However, by manually specifying constraints and using specific features of the target FPGA family, the symmetry of the TERO cell can be established and reproduced in larger designs. In this work, the design of the TERO cell is described for two different FPGA technologies (45nm Xilinx Spartan 6 and 28nm Altera Cyclone V). The statistical characterization of the TERO-PUF with the two targeted FPGAs has resulted in a uniqueness of 48.46% with Spartan 6 and 47.62% with Cyclone V. The result for the steadiness is 2.63% with Spartan 6 and 1.8% with Cyclone V. These results are close to the results obtained by several works that use the ring oscillator PUF (RO-PUF), which is considered the best candidate for PUF implementation on FPGAs. However, TERO-PUF is less sensitive to electromagnetic analysis than RO-PUF. 
Additionally, unlike RO-PUF, TERO-PUF is able to generate multiple bits per challenge (from one to three) and we have shown during the statistical characterization that the TERO-PUF provides from 0.85 to 1 bits of entropy per response bit. In conclusion, our work clearly shows that TERO-PUF is a serious alternative to RO-PUF for PUF implementation on FPGAs with strong statistical characteristics and more security than RO-PUF.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130293292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-targeting Optimization Sequences from Scalar Processors to FPGAs in HLS compilers (Abstract Only)","authors":"Ronak Kogta, Suresh Purini, Ajit Mathew","doi":"10.1145/2847263.2847315","DOIUrl":"https://doi.org/10.1145/2847263.2847315","url":null,"abstract":"A high-level synthesis compiler translates a source program written in a high level programming language such as C or SystemC into an equivalent circuit. The performance of the generated circuit in terms of metrics such as area, frequency and clock cycles depends on the compiler optimizations enabled and their order of application. Finding an optimal sequence for a given program is a hard combinatorial optimization problem. In this paper, we propose a practical and search time efficient technique for finding a near-optimal sequence for a given program. The main idea is to strike a balance between the search for a universally good sequence (like that of O3) which works for all programs vis-a-vis finding a good sequence on a per-program basis. Towards that, we construct a rich downsampled sequence set, which caters to different program classes, from the unbounded optimization sequence space by applying heuristic search algorithms on a set of Microkernel benchmark programs. The optimization metric that we use while constructing the downsampled sequence set is the execution time on a scalar processor. Given a new program, we try all the sequences from the downsampled sequence set and pick the best. Applying this technique in the LegUp high-level synthesis compiler, we are able to obtain 23% and 40% improvement on CHStone and Machsuite benchmark programs respectively. 
We also propose techniques to further reduce the size of the downsampled sequence set to improve the sequence search time.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126924828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FGPU: An SIMT-Architecture for FPGAs","authors":"Muhammed Al Kadi, Benedikt Janßen, M. Hübner","doi":"10.1145/2847263.2847273","DOIUrl":"https://doi.org/10.1145/2847263.2847273","url":null,"abstract":"Driven by its high flexibility, good performance and energy efficiency, GPGPU has taken on an increasingly important role in embedded systems. In this paper, we present the basic core of FGPU: a GPU-like, scalable and portable integer soft SIMT-processor implemented in RTL and optimized for FPGA synthesis with a single-level cache system. Compared to a performance-optimized MicroBlaze implementation on the same FPGA, the biggest implemented core of FGPU achieves average wall clock speedups of 49x and a measured power saving of 3.7x with an area overhead of 17.7x. Compared to an ARM CPU with a NEON vector processor, we measured an average speedup of 3.5x over the used benchmark. FGPU is highly parametrizable and it does not contain any manufacturer-specific IP-cores or primitives.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129181360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Increasing the Utility of Self-Calibration Methods in High-Precision Time Measurement Systems (Abstract Only)","authors":"Matthias Hinkfoth, R. Salomon","doi":"10.1145/2847263.2847311","DOIUrl":"https://doi.org/10.1145/2847263.2847311","url":null,"abstract":"Asynchronously operating systems, such as tapped delay lines, are the designer's favorite, if high resolution and precision in time are required. Their drawback, however, is that they require extensive calibration, which prohibits, among other things, sporadic recalibration during the mode of operation. Recent research has shown that the tight coupling of two selective high-precision systems inside a single FPGA substantially reduces the required calibration time: it was reduced from several hours to about 30 minutes. But even this method has not solved the problem that human intervention is required for selecting suitable calibration points. The research presented in this poster suggests that a hybrid approach is able to solve this problem: rather than tightly coupling two systems, the present approach employs hybrid elements, called X-BOUNCE, that seamlessly incorporate an X-ORCA element into a BOUNCE element. In the practical experiments, X-BOUNCE has reduced the required calibration time from 30 minutes to one second and has abandoned any human intervention. Furthermore, the proposed X-BOUNCE element can be realized by just one FPGA-LUT, which allows for easy scalability. The results were produced on a Cyclone II FPGA that has implemented 200 X-BOUNCE elements. 
Unfortunately, some elements exhibit a calibration inaccuracy that can be as large as 300 ps.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123960965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated Verification Code Generation in HLS Using Software Execution Traces (Abstract Only)","authors":"Liwei Yang, S. Gurumani, Suhaib A. Fahmy, Deming Chen, K. Rupnow","doi":"10.1145/2847263.2847313","DOIUrl":"https://doi.org/10.1145/2847263.2847313","url":null,"abstract":"Improved quality of results from high level synthesis (HLS) tools has led to their increased adoption. Despite the automated translation from high level descriptions to register-transfer level (RTL) implementations, functional verification remains a major challenge. Verification can take significantly more time than the design process; if there is a functional mismatch, developers must back-trace thousands of signals and cycles to determine the underlying cause. The challenge is further exacerbated with HLS-produced RTL, which is often not human readable. To overcome these challenges, we present a verification technique that uses software-execution traces and automated insertion of verification code into the HLS-generated RTL to assist in debugging. The verification code helps pinpoint the earliest instance of RTL simulation mismatch, either caused by HLS engine bugs or design bugs, and related instructions. We also integrate a watchdog timer to examine the execution of control-flow and perform source-to-source transformation on benchmarks to take advantage of our proposed instrumentation. We also create a framework to insert various types of bugs, e.g. data-flow, control-flow and operational bugs, to evaluate our technique. We use the CHStone benchmark suite and demonstrate that our verification detects over 90% of the inserted bugs, with over 70% of them detected within 10 cycles. 
In addition, the proposed flow can detect real-life bugs existing in previously released versions of CHStone suite as well.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115542103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Improved Global Stereo-Matching on FPGA for Real-Time Applications (Abstract Only)","authors":"Daolu Zha, Xi Jin, Tian Xiang","doi":"10.1145/2847263.2847292","DOIUrl":"https://doi.org/10.1145/2847263.2847292","url":null,"abstract":"A real-time global stereo matching algorithm is implemented on FPGA. Stereo matching is frequently used in stereo vision systems, e.g. for applications such as object detection and autonomous vehicles. Global algorithms perform significantly better than local algorithms, but global algorithms are not implemented on FPGA because they rely on high-end hardware resources. In this implementation, the stereo pairs are divided into blocks, and the hardware resources are reduced by processing one block at a time. The hardware implementation is based on a Xilinx® Kintex 7 FPGA. Experimental results show that the implementation achieves significant performance, reaching 30 fps@1920x1680.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124283889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}