{"title":"GridGAS: An I/O-Efficient Heterogeneous FPGA+CPU Computing Platform for Very Large-Scale Graph Analytics","authors":"Yu Zou, Mingjie Lin","doi":"10.1109/FPT.2018.00045","DOIUrl":"https://doi.org/10.1109/FPT.2018.00045","url":null,"abstract":"In this paper, we develop a highly scalable approach to constructing an efficient heterogeneous graph processing engine in order to handle extremely large graph size beyond its on-board memory capacity. Our FPGA-based computing engine not only surpasses cutting-edge GPU-based engines in terms of computing performance and energy efficiency, but also proves to be highly versatile and thus can be applied to many types of low-latency and high-throughput graph analytic tasks central to the next-generation graph-based machine learning. We analyze in detail the difference between GPU's and FPGA's architectures and provide several fundamental reasons why, for irregular computations, FPGA may surpass GPU in computing latency and energy efficiency, and discuss some \"golden rules\" for designing an efficient FPGA+CPU heterogeneous platform and GPU's inefficiency when handling extremely large-scale graph datasets. To validate our approach, we implement our FPGA-based GridGAS computing engine with a KC705 Xilinx FPGA board and a baseline implementation using a Quadro K420 GPU following the same approach and test with large-scale graph datasets. 
Using PCIe 2.0 x8 only, our architecture achieves up to 170.4 MTEPS and 14.8 times speedup over the GPU baseline for datasets exceeding 1.4 GB in size.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128353475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed-Memory Based FPGA Debug: Design Timing Impact","authors":"R. Hale, B. Hutchings","doi":"10.1109/FPT.2018.00071","DOIUrl":"https://doi.org/10.1109/FPT.2018.00071","url":null,"abstract":"In FPGAs, debug observability is often achieved by attaching memory-based recording circuitry to user signals. Block-RAM (BRAM)-based embedded logic analyzers are often inserted into user circuits to observe circuit behavior. In contrast with BRAM-based approaches, distributed memory: 1) is almost always available (user circuits may consume all BRAMs but even highly utilized circuits contain unused LUTs), and 2) can usually be physically located very near to user signals (LUTs are spread across the entire device while BRAMs are located only in specific columns). Previous work has shown basic feasibility and demonstrated that distributed memories can provide debug observability for highly utilized circuits. This paper focuses on timing impacts and describes the quantitative tradeoff between FPGA device utilization, debug probe count, and clock frequency. For example, a design with 70% of LUTs utilized, with no debug logic, can operate at a minimum clock period of 5ns. Instrumenting 300 debug probes increases this period to 7ns, and 1500 probes to 8ns. 
Placing trace buffers with a simulated annealing algorithm improved success rates from 20% to 50% depending on the design and probe count.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133371409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of an Autonomous Driving System for FPT2018 FPGA Design Competition Using the Zynqberry Processing Board","authors":"Yohei Shimmyo, Maiko Arakawa, Shunsuke Mie, Hiroaki Saito, Y. Okuyama, Hiroki Yomogita","doi":"10.1109/FPT.2018.00086","DOIUrl":"https://doi.org/10.1109/FPT.2018.00086","url":null,"abstract":"We propose an autonomous driving system with vision-based algorithms on FPGA. Our car is composed of a Zynqberry processing board, a general USB Web camera, and a gearbox and motors that drive two facing wheels. We built a customized Ubuntu Linux, including libraries for the Web camera, Wi-Fi, and OpenCV, on the ARM cores of the Zynq processor. We implement the motor control module as hardware logic on the FPGA part and connect it to Linux on the processing system via the AXI bus. On the processing system, software self-localization and path planning modules run with the help of Linux. We tested all of the functions of our driving system. However, further software tuning is needed to control the vehicle accurately.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128849192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lattice-Based Scheduling for Multi-FPGA Systems","authors":"Teng Yu, Bo Feng, Mark Stillwell, Liucheng Guo, Yuchun Ma, John Thomson","doi":"10.1109/FPT.2018.00063","DOIUrl":"https://doi.org/10.1109/FPT.2018.00063","url":null,"abstract":"Accelerators are becoming increasingly prevalent in distributed computation. FPGAs have been shown to be fast and power efficient for particular tasks, yet scheduling on FPGA-based multi-accelerator systems is challenging when workloads vary significantly in granularity in terms of task size and/or number of computational units required. We present a novel approach for dynamically scheduling tasks on networked multi-FPGA systems which maintains high performance, even in the presence of irregular tasks. Our topological ranking-based scheduling allows realistic irregular workloads to be processed while maintaining a significantly higher level of performance than existing schedulers.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115185909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Face-off Between the CAESAR Lightweight Finalists: ACORN vs. Ascon","authors":"William Diehl, Farnoud Farahmand, Abubakr Abdulgadir, J. Kaps, K. Gaj","doi":"10.1109/FPT.2018.00066","DOIUrl":"https://doi.org/10.1109/FPT.2018.00066","url":null,"abstract":"Authenticated ciphers potentially provide resource savings and security improvements over the joint use of secret-key ciphers and message authentication codes. The CAESAR competition aims to choose the most suitable authenticated ciphers for several categories of applications, including a lightweight use case, for which the primary criteria are performance in resource-constrained devices, and ease of protection against side channel attacks (SCA). In March 2018, two of the candidates from this category, ACORN and Ascon, were selected as CAESAR contest finalists. In this research, we compare two SCA-resistant FPGA implementations of ACORN and Ascon, where one set of implementations has area consumption nearly equivalent to the de facto standard AES-GCM, and the other set has throughput (TP) close to that of AES-GCM. The results show that protected implementations of ACORN and Ascon, with area consumption less than but close to AES-GCM, have 23.3 and 2.5 times, respectively, the TP of AES-GCM. 
Likewise, implementations of ACORN and Ascon with TP greater than but close to AES-GCM, consume 18% and 74% of the area, respectively, of AES-GCM.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114761229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Digital Transformation of Automobile and Mobility Service","authors":"Hiroshi Miyata","doi":"10.1109/FPT.2018.00012","DOIUrl":"https://doi.org/10.1109/FPT.2018.00012","url":null,"abstract":"The traffic system for automobiles has not changed its physical, industrial, and social structures in more than 100 years since its introduction to society. It has been deployed at a large scale and plays an important role in mobility. The system elements, which are the driver, automobile, and road, physically contact each other, and the system is managed only by humans. Advancements in electric and electronic technologies for over 30 years have improved the performance of automobiles, but they have not improved the performance of drivers and roads. However, drivers, automobiles, and roads have begun to be connected to each other through digital data, and the traffic system is now starting to be managed not only by humans but also by information technology such as artificial intelligence. This situation is assumed to change the system value, size, range, and role dramatically. This is the digital transformation of automobile and mobility service. New trends of CASE, i.e., connected car, automated driving, sharing car, mobility as a service, and electrification, have made large-scale innovation in not only automobile and service but also automobile traffic system, automotive industry, and society as a whole. This paper outlines these new trends of system and service. Then, the latest needs of a data cycle of digital transformation for improving the systems and services are described because they change every year, with the involvement of people. 
Moreover, the paper discusses why the automobile digital transformation requires scalability, flexibility, security, traceability, safety, and reliability, and describes the expectation for field programmable technology as a candidate for the requirement.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125669927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction of MNSTbot","authors":"Kyosuke Mori, Y. Saitoh, N. Nakasato","doi":"10.1109/FPT.2018.00083","DOIUrl":"https://doi.org/10.1109/FPT.2018.00083","url":null,"abstract":"We are developing MNSTbot as our entry to FPT2018 FPGA Design Competition. MNSTbot is a modified robot kit called TurtleBot3 Burger. We add a system-on-chip FPGA on the robot kit to automatically drive the robot for the competition. In this paper, we present hardware and software design of MNSTbot and describe our auto-driving algorithms.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131741297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling Up Loop Pipelining for High-Level Synthesis: A Non-iterative Approach","authors":"L. Rosa, Vanderlei Bonato, C. Bouganis","doi":"10.1109/FPT.2018.00020","DOIUrl":"https://doi.org/10.1109/FPT.2018.00020","url":null,"abstract":"High-level synthesis is a powerful tool for increasing productivity in digital hardware design. However, as digital systems become larger and more complex, designers have to consider an increased number of optimizations and directives offered by high-level synthesis tools to control the hardware generation process, resulting in a large design space to be explored. One of the most impactful optimizations is loop pipelining due to its large improvement in the hardware throughput. Nevertheless, the modulo scheduling algorithms that are used for loop pipelining are computationally expensive, and their application to the whole design space can make its exploration inviable, leading to sub-optimum solutions. Current state-of-the-art tools for modulo scheduling follow an iterative approach, which solves O(n^2) optimization problems, where n is the loop code size. To address this problem, this work proposes a novel data-flow-based approach that solves exactly 2 optimization problems, independently of the loop code size. Results show orders-of-magnitude savings in the computation time, leading to significant design space exploration time savings when compared with the state-of-the-art. 
As such, the proposed method produces hardware designs of higher performance than the ones produced by the current state of the art for large and complex loops, maintaining a similar resource utilization.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128373304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAT Based Place-And-Route for High-Speed Designs on 2.5D FPGAs","authors":"C. Ravishankar, H. Fraisse, D. Gaitonde","doi":"10.1109/FPT.2018.00027","DOIUrl":"https://doi.org/10.1109/FPT.2018.00027","url":null,"abstract":"2.5D stacking technology allows us to build high performance and high capacity FPGA devices at reasonable costs. The communication between multiple dies happens on a passive silicon interposer at high speed, which poses several interesting challenges. Due to clock skew characteristics across multiple dies and increase in the min-max spread of delays, place-and-route tools need to address inter-die hold violations and optimize for performance. We implement a tractable SAT based methodology to achieve this by minimally detouring data paths to meet all hold requirements while optimizing performance. We also confine the solution to a small window around each inter-die (Laguna) channel to reduce routing resource utilization and congestion, and to scale the methodology to any Laguna channel utilization. We improve performance across the interface by 11% compared to a state-of-the-art commercial flow and meet a 500MHz spec on Xilinx(R) UltraScale+(TM) devices in 2E speedgrade. We address the scalability concerns of SAT and show how we can use this in practice with negligible runtimes in implementation tools. 
Our solution paves the way for FPGA-as-a-service platforms where fast inter-die communication, that does not interfere with user specific logic, is pivotal to their success.","PeriodicalId":434541,"journal":{"name":"2018 International Conference on Field-Programmable Technology (FPT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130665133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}