Shih-Hao Hung, Min-yu Tsai, B. Huang, Chia-Heng Tu
{"title":"A Platform-Oblivious Approach for Heterogeneous Computing: A Case Study with Monte Carlo-based Simulation for Medical Applications","authors":"Shih-Hao Hung, Min-yu Tsai, B. Huang, Chia-Heng Tu","doi":"10.1145/2847263.2847335","DOIUrl":"https://doi.org/10.1145/2847263.2847335","url":null,"abstract":"Light is important and helpful in many medical applications, such as cancer treatment. Computer modeling and simulation of light transport are often adopted to improve the quality of medical treatments. In particular, Monte Carlo-based simulations are considered to deliver accurate results, but require intensive computational resources. While several attempts to accelerate the Monte Carlo-based methods for the simulation of photon transport with platform-specific programming schemes, such as CUDA on GPU and HDL on FPGA, have been proposed, the approach has limited portability and prolongs software updates. In this paper, we parallelize the Monte Carlo modeling of light transport in multi-layered tissues (MCML) program with OpenCL, an open standard supported by a wide range of platforms. We characterize the performance of the parallelized MCML kernel program running on CPU, GPU and FPGA. 
Compared to platform-specific programming schemes, our platform-oblivious approach provides a unified, highly portable code and delivers competitive performance and power efficiency.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124405680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Matai, D. Richmond, Dajung Lee, Z. Blair, Qiongzhi Wu, Amin Abazari, R. Kastner
{"title":"Resolve: Generation of High-Performance Sorting Architectures from High-Level Synthesis","authors":"J. Matai, D. Richmond, Dajung Lee, Z. Blair, Qiongzhi Wu, Amin Abazari, R. Kastner","doi":"10.1145/2847263.2847268","DOIUrl":"https://doi.org/10.1145/2847263.2847268","url":null,"abstract":"Field Programmable Gate Array (FPGA) implementations of sorting algorithms have proven to be efficient, but existing implementations lack portability and maintainability because they are written in low-level hardware description languages that require substantial domain expertise to develop and maintain. To address this problem, we develop a framework that generates sorting architectures for different requirements (speed, area, power, etc.). Our framework provides ten highly optimized basic sorting architectures, easily composes basic architectures to generate hybrid sorting architectures, enables non-hardware experts to quickly design efficient hardware sorters, and facilitates the development of customized heterogeneous FPGA/CPU sorting systems. Experimental results show that our framework generates architectures that perform at least as well as existing RTL implementations for arrays smaller than 16K elements, and are comparable to RTL implementations for sorting larger arrays. 
We demonstrate a prototype of an end-to-end system using our sorting architectures for large arrays (16K-130K) on a heterogeneous FPGA/CPU system.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123675113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vinod Kathail, James Hwang, Welson Sun, Yogesh Chobe, T. Shui, Jorge E. Carrillo
{"title":"SDSoC: A Higher-level Programming Environment for Zynq SoC and Ultrascale+ MPSoC","authors":"Vinod Kathail, James Hwang, Welson Sun, Yogesh Chobe, T. Shui, Jorge E. Carrillo","doi":"10.1145/2847263.2847284","DOIUrl":"https://doi.org/10.1145/2847263.2847284","url":null,"abstract":"Zynq-7000 All Programmable SoC and the new Zynq Ultrascale+ MPSoC provide proven alternatives to traditional domain-specific application SoCs and enable extensive system-level differentiation, integration and flexibility through hardware, software and I/O programmability. The SDSoC Development Environment is a heterogeneous design environment for implementing embedded systems using the Zynq SoC and MPSoC. It enables the broader community of embedded software developers to leverage the power of hardware and software programmable devices, entirely from a higher level of abstraction. The SDSoC environment provides a greatly simplified embedded C/C++ application programming experience including an easy-to-use Eclipse IDE and a comprehensive development platform. SDSoC includes a full-system optimizing C/C++ compiler, system-level profiling and hardware/software event tracing, automated software acceleration in programmable logic, automated generation of SW-HW connectivity, and integration with libraries to speed programming. The SDSoC compiler transforms programs into complete hardware/software systems based on user-specified target platform and functions within the program to compile into programmable hardware logic. Hardware accelerators communicate with the CPU and external memory through an automatically-generated, application-specific data motion network comprised of DMAs, interconnects and other standard IP blocks. The SDSoC Environment also provides flows for customer and 3rd party developers to enable their platforms and integrate RTL IPs as C-callable libraries. 
It builds upon customer-proven design tools from Xilinx including Vivado Design Suite, Vivado High-level Synthesis and SDK. In this presentation, we will introduce the motivation and basic concepts behind SDSoC, describe its capabilities and the user-flow, and provide a brief demonstration of the tool using an example.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129777382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 6: System-level Tools","authors":"Mingjie Lin","doi":"10.1145/3250865","DOIUrl":"https://doi.org/10.1145/3250865","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128260998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Ibraheem, S. Z. Ahmed, K. Hachicha, S. Hochberg, P. Garda
{"title":"A Low DDR Bandwidth 100FPS 1080p Video 2D Discrete Wavelet Transform Implementation on FPGA (Abstract Only)","authors":"M. Ibraheem, S. Z. Ahmed, K. Hachicha, S. Hochberg, P. Garda","doi":"10.1145/2847263.2847321","DOIUrl":"https://doi.org/10.1145/2847263.2847321","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124224335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraphOps: A Dataflow Library for Graph Analytics Acceleration","authors":"Tayo Oguntebi, K. Olukotun","doi":"10.1145/2847263.2847337","DOIUrl":"https://doi.org/10.1145/2847263.2847337","url":null,"abstract":"Analytics and knowledge extraction on graph data structures have become areas of great interest. For frequently executed algorithms, dedicated hardware accelerators are an energy-efficient avenue to high performance. Unfortunately, they are notoriously labor-intensive to design and verify while meeting stringent time-to-market goals. In this paper, we present GraphOps, a modular hardware library for quickly and easily constructing energy-efficient accelerators for graph analytics algorithms. GraphOps provide a hardware designer with a set of composable graph-specific building blocks, broad enough to target a wide array of graph analytics algorithms. The system is built upon a dataflow execution platform and targets FPGAs, allowing a vendor to use the same hardware to accelerate different types of analytics computation. Low-level hardware implementation details such as flow control, input buffering, rate throttling, and host/interrupt interaction are automatically handled and built into the design of the GraphOps, greatly reducing design time. As an enabling contribution, we also present a novel locality-optimized graph data structure that improves spatial locality and memory efficiency when accessing the graph in main memory. Using the GraphOps system, we construct six different hardware accelerators. 
Results show that the GraphOps-based accelerators are able to operate close to the bandwidth limit of the hardware platform, the limiting constraint in graph analytics computation.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134376642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA Power Estimation Using Automatic Feature Selection (Abstract Only)","authors":"Yunxuan Yu, Lei He","doi":"10.1145/2847263.2847327","DOIUrl":"https://doi.org/10.1145/2847263.2847327","url":null,"abstract":"Because the layout stage consumes the lion's share of FPGA synthesis runtime, pre-layout power estimation can be viewed as an early stage estimation and is needed for power minimization at the early design stage. Consisting of two phases, feature selection and model training, data mining is effective for data-based modeling, yet it has not been applied in a rigid fashion for FPGA power estimation as the existing algorithms can be viewed as model training using features selected manually. In this paper, we apply machine learning with automatic feature selection to pre- and post- logic synthesis estimations, named pre-synthesis and post-synthesis estimation. Experiments using the Lattice Diamond MachXO2 family show that compared to the post-layout power simulation, post-synthesis estimation is 20x faster with 8.62% average error, while pre-synthesis estimation is 600x faster with considerably larger error that still needs further improvement. Furthermore, compared to existing algorithms using manually selected features, our post-synthesis estimation using automatic feature selection reduces error by 2-3 times. 
Finally, the ranking of features is able to provide insights for power minimization.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124540888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boolean Satisfiability-Based Routing and Its Application to Xilinx UltraScale Clock Network","authors":"H. Fraisse, A. Joshi, D. Gaitonde, A. Kaviani","doi":"10.1145/2847263.2847342","DOIUrl":"https://doi.org/10.1145/2847263.2847342","url":null,"abstract":"Boolean Satisfiability (SAT)-based routing offers a unique advantage over conventional routing algorithms by providing an exhaustive approach to find a solution. Despite that advantage, commercial FPGA CAD tools rarely use SAT-based routers due to scalability issues. In this paper, we revisit SAT-based routing and propose two SAT formulations independent of routing architecture. We then demonstrate that SAT-based routing using either formulation dramatically outperforms conventional routing algorithms in both runtime and robustness for the clock routing of Xilinx UltraScale devices. Finally, we experimentally show that one of the proposed SAT formulations leads to a routing 18x faster and produces formulas 20x more compact than the other. This framework has been implemented into Vivado and is now currently used in production.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117148146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination","authors":"D. Boland","doi":"10.1145/2847263.2847281","DOIUrl":"https://doi.org/10.1145/2847263.2847281","url":null,"abstract":"Gaussian elimination is a well-known technique to compute the solution to a system of linear equations and boosting its performance is highly desirable. While straightforward parallel techniques are limited either by I/O or on-chip memory bandwidth, block-based algorithms offer the potential to bridge this gap by interleaving I/O with computation. However, these algorithms require the amount of on-chip memory to be at least the square of the number of processing elements available. Using the latest generation Altera FPGAs with hardened floating-point units, this is no longer the case. It follows that the amount of on-chip memory limits performance, a problem that is only likely to increase unless on-chip memory dominates FPGA architecture. In addition to this limitation, existing FPGA implementations of block-based Gaussian elimination either sacrifice numerical stability or efficiency. The former limits the usefulness of these implementations to a small class of matrices, the latter limits its performance. This paper presents a high-performance and numerically stable method to perform Gaussian elimination on an FPGA. This modified algorithm makes use of a deep pipeline to store the matrix and ensures that the peak performance is once again limited by the number of floating-point units that can fit on the FPGA. When applied to large matrices, this technique can obtain a sustained performance of up to 256 GFLOPs on an Arria 10, beginning to tap into the full potential of these devices. This performance is comparable to the peak that could be achieved using a simple block-based algorithm, with the performance on a Stratix 10 predicted to be superior. 
This is in spite of the fact that the underlying algorithm for the implementation in this paper, Gaussian elimination with pairwise pivoting, is more complex and applicable to a wider range of practical problems.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129492546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA Compiler","authors":"T. Nguyen, S. Gurumani, K. Rupnow, Deming Chen","doi":"10.1145/2847263.2847344","DOIUrl":"https://doi.org/10.1145/2847263.2847344","url":null,"abstract":"Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select efficient single core implementations, and easy scaling of parallelism through multiple core instantiations. However, study of high level synthesis for parallel languages has concentrated on optimization of core and on-chip communications, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDA-to-RTL throughput oriented HLS, and we open source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS mode. 
We study implementation optimization for platform integration, compare to an embedded GPU (Tegra TK1) and verify designs on a Zedboard Zynq 7020 (bare-metal) and Omnitek Zynq 7045 (OS).","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128528043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}