{"title":"Machine-Learning driven Auto-Tuning of High-Level Synthesis for FPGAs (Abstract Only)","authors":"Li Ting, Harri Wijaya, Nachiket Kapre","doi":"10.1145/2847263.2847297","DOIUrl":"https://doi.org/10.1145/2847263.2847297","url":null,"abstract":"Modern High-Level Synthesis (HLS) tools allow C descriptions of computation to be compiled to optimized low-level RTL, but expose a range of manual optimization options, compiler directives and tweaks to the developer. In many instances, this results in a tedious iterative development flow to meet resource, timing and power constraints which defeats the purpose of adopting the high-level abstraction in the first place. In this paper, we show how to use Machine Learning routines to predict the impact of HLS compiler optimization on final FPGA utilization metrics. We compile multiple variations of the high-level C code across a range of compiler optimizations and pragmas to generate a large design space of candidate solutions. On the Machsuite benchmarks, we are able to train a linear regression model to predict resources, latency and frequency metrics with high accuracy (R² > 0.75). 
We expect such developer-assistance tools to (1) offer insight to drive manual selection of suitable directive combinations, and (2) automate the process of selecting directives in the complex design space of modern HLS design.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124186727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 5: Architecture and Tools","authors":"Jonathan Rose","doi":"10.1145/3250864","DOIUrl":"https://doi.org/10.1145/3250864","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120949687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Technical Session 7: High-level Synthesis and Tools","authors":"David Biancolin","doi":"10.1145/3250866","DOIUrl":"https://doi.org/10.1145/3250866","url":null,"abstract":"","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123410909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks","authors":"Naveen Suda, V. Chandra, Ganesh S. Dasika, Abinash Mohanty, Yufei Ma, S. Vrudhula, Jae-sun Seo, Yu Cao","doi":"10.1145/2847263.2847276","DOIUrl":"https://doi.org/10.1145/2847263.2847276","url":null,"abstract":"Convolutional Neural Networks (CNNs) have gained popularity in many computer vision applications such as image classification, face detection, and video analysis, because of their ability to train and classify with high accuracy. Due to multiple convolution and fully-connected layers that are compute-/memory-intensive, it is difficult to perform real-time classification with low power consumption on today's computing systems. FPGAs have been widely explored as hardware accelerators for CNNs because of their reconfigurability and energy efficiency, as well as fast turn-around-time, especially with high-level synthesis methodologies. Previous FPGA-based CNN accelerators, however, typically implemented generic accelerators agnostic to the CNN configuration, where the reconfigurable capabilities of FPGAs are not fully leveraged to maximize the overall system throughput. In this work, we present a systematic design space exploration methodology to maximize the throughput of an OpenCL-based FPGA accelerator for a given CNN model, considering the FPGA resource constraints such as on-chip memory, registers, computational resources and external memory bandwidth. The proposed methodology is demonstrated by optimizing two representative large-scale CNNs, AlexNet and VGG, on two Altera Stratix-V FPGA platforms, DE5-Net and P395-D8 boards, which have different hardware resources. 
We achieve a peak performance of 136.5 GOPS for convolution operation, and 117.8 GOPS for the entire VGG network that performs ImageNet classification on P395-D8 board.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131282883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Just In Time Assembly of Accelerators","authors":"Sen Ma, Zeyad Aklah, D. Andrews","doi":"10.1145/2847263.2847341","DOIUrl":"https://doi.org/10.1145/2847263.2847341","url":null,"abstract":"Despite the significant advancements that have been made in High Level Synthesis, the reconfigurable computing community has failed at getting programmers to use Field Programmable Gate Arrays (FPGAs). Existing barriers that prevent programmers from using FPGAs include the need to work within vendor specific CAD tools, knowledge of hardware programming models, and the requirement to pass each design through synthesis, place and route. In this paper we present a new approach that takes these barriers out of the design flows for programmers. Synthesis is eliminated from the application programmers path by becoming part of the initial coding process when creating the programming patterns that define a Domain Specific Language. Programmers see no difference between creating software or hardware functionality when using the DSL. A run time interpreter is introduced that assembles hardware accelerators within a configurable tile array of partially reconfigurable slots at run time. Initial results show the approach allows hardware accelerators to be compiled 100x faster compared to the time required to synthesize the same functionality. 
Initial performance results further show a compilation/interpretation approach can achieve approximately equivalent performance for matrix operations and filtering compared to synthesizing a custom accelerator.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123093327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Stochastic Computing to Reduce the Hardware Requirements for a Restricted Boltzmann Machine Classifier","authors":"Bingzhe Li, M. Najafi, D. Lilja","doi":"10.1145/2847263.2847340","DOIUrl":"https://doi.org/10.1145/2847263.2847340","url":null,"abstract":"Artificial neural networks are powerful computational systems with interconnected neurons. Generally, these networks have a very large number of computation nodes which forces the designer to use software-based implementations. However, the software based implementations are offline and not suitable for portable or real-time applications. Experiments show that compared with the software based implementations, FPGA-based systems can greatly speed up the computation time, making them suitable for real-time situations and portable applications. However, the FPGA implementation of neural networks with a large number of nodes is still a challenging task. In this paper, we exploit stochastic bit streams in the Restricted Boltzmann Machine (RBM) to implement the classification of the RBM handwritten digit recognition application completely on an FPGA. We use finite state machine-based (FSM) stochastic circuits to implement the required sigmoid function and use the novel stochastic computing approach to perform all large matrix multiplications. Experimental results show that the proposed stochastic architecture has much more potential for tolerating faults while requiring much less hardware compared to the currently un-implementable deterministic binary approach when the RBM consists of a large number of neurons. 
Exploiting the features of stochastic circuits, our implementation achieves much better performance than a software-based approach.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128920432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stratix™ 10 High Performance Routable Clock Networks","authors":"C. Ebeling, D. How, D. Lewis, H. Schmit","doi":"10.1145/2847263.2847279","DOIUrl":"https://doi.org/10.1145/2847263.2847279","url":null,"abstract":"We present the clock architecture of the Stratix™ 10 FPGA, which uses a routable clock network rather than the fixed clock networks of previous generations. We describe the flexibility provided by this routable clock network and how arbitrarily sized clock trees can be synthesized and placed anywhere on the FPGA. We show how this capability to generate customized clock trees can provide better performance through reduced clock loss while maintaining the ability to handle the large number of clock domains that modern systems require. We experimentally demonstrate how a routable clock tree reduces the clock loss of the user design implementation by up to 6% of clock insertion delay.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129260582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intel Acquires Altera: How Will the World of FPGAs be Affected?","authors":"Derek Chiou","doi":"10.1145/2847263.2857658","DOIUrl":"https://doi.org/10.1145/2847263.2857658","url":null,"abstract":"Intel's purchase of Altera is very likely to be the biggest single event in FPGA history and, therefore, have a profound impact on the FPGA world. This panel intends to explore the business and research opportunities that are potentially enabled and potentially squashed by the acquisition. Questions that will be explored by the panel include: What will be the impact on FPGA applications? Clearly, there is the potential of much tighter integration of CPU and FPGA, but what applications and usage models does that really enable? What will be the impact on FPGA business? What will be the impact on the FPGA research community?","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"250 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122068931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 1 GSa/s, Reconfigurable Soft-core FPGA ADC (Abstract Only)","authors":"Stefan Visser, H. Homulle, E. Charbon","doi":"10.1145/2847263.2847310","DOIUrl":"https://doi.org/10.1145/2847263.2847310","url":null,"abstract":"There exist many applications where analog interfacing is abundant, e.g. sensor networks, automotive, industrial control, (quantum) physics etc. In those fields the use of FPGAs is continuously growing, however a direct link between the analog world and the digital FPGA is still missing (except for the newest generation of FPGAs, where analog-to-digital conversion is present, but limited in performance). External analog-to-digital converters (ADCs) are combined together with the FPGA to form a complete, application-specific system. This system is thus limited in compactness, flexibility, and reconfigurability. To address those issues we propose an ADC architecture, implemented in an FPGA, that is fully reconfigurable and easy to calibrate. This allows the design to be altered according to the system requirements. Therefore it can be used in a wide range of operating conditions and adjusted to changes in supply voltage and FPGA temperature. This architecture employs time-to-digital converters (TDCs) and phase interpolation techniques to reach a sampling rate higher than the clock frequency (400 MHz) of up to 1.2 GSa/s. The resulting FPGA ADC can achieve a 6 bit resolution over a 0.6 to 1.9 V input range. The system non-linearities (INL, DNL) are less than 0.45 LSB. 
The main advantages of this architecture are its scalability and reconfigurability, enabling applications with changing demands, on one single platform.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133719977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PRFloor: An Automatic Floorplanner for Partially Reconfigurable FPGA Systems","authors":"T. D. A. Nguyen, Akash Kumar","doi":"10.1145/2847263.2847270","DOIUrl":"https://doi.org/10.1145/2847263.2847270","url":null,"abstract":"Partial reconfiguration (PR) is gaining more attention from the research community because of its flexibility in dynamically changing some parts of the system at runtime. However, the current PR tools need the designer's involvement in manually specifying the shapes and locations for the PR regions (PRRs). It requires not only deep knowledge of the FPGA device, the system architecture, but also many trial-and-error attempts to find the best-possible floorplan. Therefore, many research works have been conducted to propose automatic floorplanners for PR systems. However, one of the most significant limitations of those works is that they only consider the PRRs and ignore all other static modules. In this paper, we propose a novel PR floorplanner called PRFloor. It takes into account all components in the system. The main ideas behind PRFloor are the unique recursive pseudo-bipartitioning heuristic using a new, simple, yet effective Nonlinear Integer Programming-based bipartitioner. The PRFloor performs very well in the experiments with various synthetic PR system setups with up to 130 modules, 24 PRRs and 85% of the FPGA resource. 
The average maximum clock frequency obtained for the actual PR systems implemented using PRFloor is even 3% higher than the similar systems without PR capability.","PeriodicalId":438572,"journal":{"name":"Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123269508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}