M. Leeser, Mehmet Güngör, Kai Huang, Stratis Ioannidis
{"title":"Accelerating Large Garbled Circuits on an FPGA-enabled Cloud","authors":"M. Leeser, Mehmet Güngör, Kai Huang, Stratis Ioannidis","doi":"10.1109/H2RC49586.2019.00008","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00008","url":null,"abstract":"Garbled Circuits (GC) is a technique for ensuring the privacy of inputs from users and is particularly well suited for FPGA implementations in the cloud where data analytics is frequently run. Secure Function Evaluation, such as that enabled by GC, is orders of magnitude slower than processing in the clear. We present our best implementation of GC on Amazon Web Services (AWS) that implements garbling on Amazon’s FPGA enabled F1 instances. In this paper we present the largest problems garbled to date on FPGA instances, which includes problems that are represented by over four million gates. Our implementation speeds up garbling 20 times over software over a range of different circuit sizes.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115029035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joao Carlos Bittencourt, João Souza, Adhvan Furtado, E. Nascimento, Wagner Oliveira, A. Nascimento, L. Fialho, J. Oliveira, R. Tutu, Georgina Rojas, L. Jesus, André Lima
{"title":"Performance and Energy Efficiency Analysis of Reverse Time Migration on a FPGA Platform","authors":"Joao Carlos Bittencourt, João Souza, Adhvan Furtado, E. Nascimento, Wagner Oliveira, A. Nascimento, L. Fialho, J. Oliveira, R. Tutu, Georgina Rojas, L. Jesus, André Lima","doi":"10.1109/H2RC49586.2019.00012","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00012","url":null,"abstract":"Reverse time migration (RTM) modeling is a computationally intensive component in the seismic processing workflow of oil and gas exploration, often demanding the manipulation of terabytes of data. Therefore, the computational kernels of the RTM algorithms need to access a large range of memory locations. However, most of these accesses result in cache misses, degrading the overall system performance. GPGPUs and FPGAs are the two endpoints in the spectrum of acceleration platforms, since both can achieve better performance in comparison to CPUs on several high-performance applications. Recent literature highlights the better energy efficiency of FPGAs when compared to GPGPUs. The present work proposes an FPGA-accelerated platform prototype targeting the computation of the RTM algorithm in an HPC environment. Experimental results highlight that speedups of 112x can be achieved when compared to a sequential execution on a CPU. When compared to a GPU, the power consumption has been reduced by up to 55%.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126630678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"It's All About Data Movement: Optimising FPGA Data Access to Boost Performance","authors":"Nick Brown, D. Dolman","doi":"10.1109/H2RC49586.2019.00006","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00006","url":null,"abstract":"The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance the entire algorithm must be rethought and recast in a dataflow style. Reducing the cost of data movement for all computing devices is critically important, and in this paper we explore the most appropriate techniques for FPGAs. We do this by describing the optimisation of an existing FPGA implementation of an atmospheric model's advection scheme. By taking an FPGA code that was over four times slower than running on the CPU, mainly due to data movement overhead, we describe the profiling and optimisation strategies adopted to significantly reduce the runtime and bring the performance of our FPGA kernels to a much more practical level for real-world use. The result of this work is a set of techniques, steps, and lessons learnt that we have found significantly improves the performance of FPGA-based HPC codes and that others can adopt in their own codes to achieve similar results.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122188277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fabien Chaix, Georgios Ailamakis, Theocharis Vavouris, A. Damianakis, M. Katevenis, I. Mavroidis, Aggelos D. Ioannou, Nikolaos Kossifidis, Nikolaos Dimou, Giorgos Ieronymakis, M. Marazakis, Vassilis D. Papaefstathiou, Vassilis Flouris, Mihailis Ligerakis
{"title":"Implementation and Impact of an Ultra-Compact Multi-FPGA Board for Large System Prototyping","authors":"Fabien Chaix, Georgios Ailamakis, Theocharis Vavouris, A. Damianakis, M. Katevenis, I. Mavroidis, Aggelos D. Ioannou, Nikolaos Kossifidis, Nikolaos Dimou, Giorgos Ieronymakis, M. Marazakis, Vassilis D. Papaefstathiou, Vassilis Flouris, Mihailis Ligerakis","doi":"10.1109/H2RC49586.2019.00010","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00010","url":null,"abstract":"Efficient prototyping of a large complex system can be significantly facilitated by the use of a flexible and versatile physical platform where both new hardware and software components can readily be implemented and tightly integrated in a timely manner. Towards this end, we have developed the 120 × 130 mm QFDB board and associated firmware, including the system software environment. We developed a large system based on this advanced dense and modular building block. The QFDB features 4 interconnected Xilinx Zynq Ultrascale+ devices, each one consisting of an ARM-based subsystem tightly coupled with reconfigurable logic. Each Zynq Ultrascale+ is connected to 16 GB of DDR4 memory. In addition, one Zynq provides storage through an M.2 Solid State Disk (SSD). In this paper, we present the design and the implementation of this board, as well as the software environment for board operation. Moreover, we describe a 10 Gb Ethernet communication infrastructure for interconnecting multiple boards together. Finally, we highlight the impact of this board on a number of ongoing research activities that leverage the QFDB versatility, both as a large-scale prototyping system for HPC solutions, and as a host for the development of FPGA integration techniques.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131582200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Micha Ober, Jaco A. Hofmann, Lukas Sommer, Lukas Weber, A. Koch
{"title":"High-Throughput Multi-Threaded Sum-Product Network Inference in the Reconfigurable Cloud","authors":"Micha Ober, Jaco A. Hofmann, Lukas Sommer, Lukas Weber, A. Koch","doi":"10.1109/H2RC49586.2019.00009","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00009","url":null,"abstract":"Large cloud providers have started to make powerful FPGAs available as part of their public cloud offers. One promising application area for this kind of instance is the acceleration of machine learning tasks. This work presents an accelerator architecture that uses multiple accelerator cores for the inference in so-called Sum-Product Networks and complements it with a host software interface that overlaps data transfer and actual computation. The evaluation shows that the proposed architecture deployed to Amazon AWS F1 instances is able to outperform a 12-core Xeon processor by a factor of up to 1.9x and an Nvidia Tesla V100 GPU by a factor of up to 6.6x.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133875056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Perfect Shuffle and Bitonic Networks for Efficient Quantum Sorting","authors":"Naveed Mahmud, Bailey Srimoungchanh, Bennett Haase-Divine, Nolan Blankenau, Annika Kuhnke, E. El-Araby","doi":"10.1109/H2RC49586.2019.00011","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00011","url":null,"abstract":"The emergence of quantum computers in the last decade has generated research interest in applications such as quantum sorting. Quantum sorting plays a critical role in creating ordered sets of data that can be better utilized, e.g., quantum ordered search or quantum network switching. In this paper, we propose a quantum sorting algorithm that combines highly parallelizable bitonic merge networks with perfect shuffle permutations (PSP), for sorting data represented in the quantum domain. The combination of bitonic networks with PSP improves the temporal complexity of bitonic merge sorting which is critical for reducing decoherence effects for quantum processing. We present space-efficient quantum circuits that can be used for quantum bit comparison and permutation. We also present a reconfigurable hardware quantum emulator for prototyping the proposed quantum algorithm. The emulator has a fully-pipelined architecture and supports double-precision floating-point computations, resulting in high throughput and accuracy. The proposed hardware architectures are implemented on a high-performance reconfigurable computer (HPRC). In our experiments, we emulated quantum sorting circuits of up to 31 fully-entangled quantum bits on a single FPGA node of the HPRC platform. To the best of our knowledge, our effort is the first to investigate a reconfigurable hardware emulation of quantum sorting using bitonic networks and perfect shuffle.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130794219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface","authors":"H. Zohouri, S. Matsuoka","doi":"10.1109/H2RC49586.2019.00007","DOIUrl":"https://doi.org/10.1109/H2RC49586.2019.00007","url":null,"abstract":"Supported by their high power efficiency and recent advancements in High Level Synthesis (HLS), FPGAs are quickly finding their way into HPC and cloud systems. Large amounts of work have been done so far on loop and area optimizations for different applications on FPGAs using HLS. However, a comprehensive analysis of the behavior and efficiency of the memory controller of FPGAs is missing in the literature, which becomes even more crucial when the limited memory bandwidth of modern FPGAs compared to their GPU counterparts is taken into account. In this work, we will analyze the memory interface generated by Intel FPGA SDK for OpenCL with different configurations for input/output arrays, vector size, interleaving, kernel programming model, on-chip channels, operating frequency, padding, and multiple types of overlapped blocking. Our results point to multiple shortcomings in the memory controller of Intel FPGAs, especially with respect to memory access alignment, that can hinder the programmer’s ability to maximize memory performance in their design. For some of these cases, we will provide work-arounds to improve memory bandwidth efficiency; however, a general solution will require major changes in the memory controller itself.","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122478972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. F. Tinder, S. Yanushkevich, C. Hamacher, Z. Vranesic, S. Zaky, J. Raymond
{"title":"Organization","authors":"R. F. Tinder, S. Yanushkevich, C. Hamacher, Z. Vranesic, S. Zaky, J. Raymond","doi":"10.1201/9781315220659-8","DOIUrl":"https://doi.org/10.1201/9781315220659-8","url":null,"abstract":"","PeriodicalId":413478,"journal":{"name":"2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134181401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}