Qingqing Xiong, Rushi Patel, Chen Yang, Tong Geng, A. Skjellum, M. Herbordt
{"title":"GhostSZ: A Transparent FPGA-Accelerated Lossy Compression Framework","authors":"Qingqing Xiong, Rushi Patel, Chen Yang, Tong Geng, A. Skjellum, M. Herbordt","doi":"10.1109/FCCM.2019.00042","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00042","url":null,"abstract":"High-performance computing (HPC) applications often generate enormous amounts of data that must be transferred for check-pointing, in situ processing, or post-execution analysis. To reduce the related network traffic and storage consumption, lossy compression schemes that target scientific data are often used. SZ compression emerged three years ago and has gained much attention because of its high compression ratio. However, performing SZ compression can take half a day per terabyte of data; this could be a drawback to adoption. We propose GhostSZ, an FPGA framework for accelerating tasks in SZ at line rate, and so transparently. The critical problem to be overcome is the tight data dependence central to SZ. GhostSZ solves this with a data transfer path having novel staged hardware. We test our implementation with both synthetic and real HPC application data and show 9.5×-80× core versus pipeline speedup over the optimized production version running on a state-of-the-art CPU and 8.2× per chip. Much of the variance in performance is due to the FPGA already running at line rate and so benefiting less from optimizations applicable to the CPU only on the most favorable data sets. The significance of this work is the possibility of a major reduction in required networking and storage in HPC installations. For example, using GhostSZ, fewer than 10 FPGAs would be sufficient to handle the entire I/O bandwidth of the top entry on the latest IO-500 list.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116073975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sergiu Mosanu, Xinfei Guo, Mohamed El-Hadedy, L. Anghel, M. Stan
{"title":"Flexi-AES: A Highly-Parameterizable Cipher for a Wide Range of Design Constraints","authors":"Sergiu Mosanu, Xinfei Guo, Mohamed El-Hadedy, L. Anghel, M. Stan","doi":"10.1109/FCCM.2019.00079","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00079","url":null,"abstract":"Interconnected devices communicate efficiently and securely over untrusted networks via security protocols that employ various encryption algorithms, often as hardware modules. State-of-the-art hardware implementations typically focus on optimizing a single metric and are tedious to adapt to a wider set of design constraints. In this work, we develop an open-source, flexible and parameterizable hardware implementation of the Advanced Encryption Standard (AES). We present a feature-rich implementation in Chisel that is simple to apply to any architecture and to fine-tune to specific design requirements. Despite the larger design space, we use 50% fewer lines of code than existing Verilog versions, thus enabling a higher level of development productivity.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133699614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large-Scale and High-Throughput QR Decomposition on an FPGA","authors":"Dajung Lee, A. Hagiescu, Dan Pritsker","doi":"10.1109/FCCM.2019.00078","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00078","url":null,"abstract":"The QR decomposition, also called the QR factorization, is one of the core matrix operations used to solve a linear inverse/solver problem. We develop a high-throughput QR decomposition for large-scale matrices on an FPGA. We refine the modified Gram-Schmidt QRD algorithm into a hardware-friendly algorithmic flow and describe the core architecture in C. We synthesize our code using an Intel FPGA SDK for OpenCL targeting an Intel Arria 10 FPGA device.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126375249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. R. Babu, Farah Naz Taher, Anjana Balachandran, Benjamin Carrión Schäfer
{"title":"Efficient Hardware Acceleration for Design Diversity Calculation to Mitigate Common Mode Failures","authors":"M. R. Babu, Farah Naz Taher, Anjana Balachandran, Benjamin Carrión Schäfer","doi":"10.1109/FCCM.2019.00043","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00043","url":null,"abstract":"This paper presents an FPGA-based hardware acceleration of the design diversity calculation to build redundant hardware systems that are robust against common mode failures. We exploit the benefits of C-based VLSI design to generate a design pool of micro-architectures with unique characteristics from the same behavioral description. To identify the most diverse design pairs from this massive design pool, a computationally-intensive fault-injection based process is needed. Thus, in this work, we leverage FPGAs to accelerate the design diversity calculation. Experimental results show an average of 2x speedup compared to a traditional software implementation. We also show that much higher speedups can be achieved when using larger FPGAs that can host a larger pool of designs.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129204612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jakub Cabal, Pavel Benácek, Jana Foltova, J. Holub
{"title":"Scalable P4 Deparser for Speeds Over 100 Gbps","authors":"Jakub Cabal, Pavel Benácek, Jana Foltova, J. Holub","doi":"10.1109/FCCM.2019.00064","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00064","url":null,"abstract":"P4 is a language for describing packet processing inside a network device. A typical P4 device consists of three main building blocks: Parser, Match+Action Tables and Deparser. The Deparser is the most challenging block because its main task is to assemble the output packet based on changes made in the Match+Action Tables. This operation can be quite complicated in the case of high-speed networks. In this work, we present a scalable architecture (in terms of throughput) of a deparsing circuit that is suitable for implementation in FPGAs.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125698550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Templatised Soft Floating-Point for High-Level Synthesis","authors":"David B. Thomas","doi":"10.1109/FCCM.2019.00038","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00038","url":null,"abstract":"High-level Synthesis (HLS) tools have greatly increased the productivity of FPGA application development, making it possible to easily create highly-parallel application-accelerators. However, while FPGAs are known for the ability to customise the number representation of data-paths, most HLS work only uses custom-precision for fixed-point representations, and for floating-point relies on the 64-, 32-, and 16-bit formats provided by vendors. This paper presents a solution for parametrised floating-point in HLS via C++ templates, allowing for compile-time selection of exponent and fraction widths, including the use of mixed precisions for input arguments and result types. By using arbitrary width integers and compile-time logic the resulting operators describe the same data-path as an external floating-point IP generator, while still allowing the HLS tool to perform detailed optimisation and scheduling of the internal components. We show that the resulting custom-width HLS cores provide similar area and performance to platform-native vendor IP blocks, while adding full support for heterogeneous precision floating-point data-paths to HLS tools.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131276615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francesco Peverelli, Marco Rabozzi, Salvatore Cardamone, Emanuele Del Sozzo, A. Thom, M. Santambrogio, Lorenzo Di Tucci
{"title":"Automated Acceleration of Dataflow-Oriented C Applications on FPGA-Based Systems","authors":"Francesco Peverelli, Marco Rabozzi, Salvatore Cardamone, Emanuele Del Sozzo, A. Thom, M. Santambrogio, Lorenzo Di Tucci","doi":"10.1109/FCCM.2019.00054","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00054","url":null,"abstract":"The acceleration of compute-intensive applications on FPGA-based systems has become an increasingly common trend thanks to their availability as cloud commodities. This trend has also been accompanied by wider support of High-Level Synthesis tools. Although these solutions reduce the learning curve for hardware development, the programmer still requires specific expertise in order to achieve efficient implementations. In this paper, we propose an automated approach for the acceleration of C applications into dataflow kernels on FPGAs.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133393496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kota Yoshida, Takaya Kubota, M. Shiozaki, T. Fujino
{"title":"Model-Extraction Attack Against FPGA-DNN Accelerator Utilizing Correlation Electromagnetic Analysis","authors":"Kota Yoshida, Takaya Kubota, M. Shiozaki, T. Fujino","doi":"10.1109/FCCM.2019.00059","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00059","url":null,"abstract":"This work presents a model-extraction attack on a DNN accelerator, which is implemented on FPGA. An adversary can get DNN model parameters by exploiting electromagnetic leakage from the accelerator during operation. Our experimental results show that the adversary can extract trained model parameters from a DNN accelerator even if the DNN model parameters are protected with data encryption. This suggests that countermeasures against side-channel leaks are important when implementing a DNN accelerator on FPGA or ASIC.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134479429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Murad Qasaimeh, Joseph Zambreno, Phillip H. Jones, K. Denolf, Jack Lo, K. Vissers
{"title":"Analyzing the Energy-Efficiency of Vision Kernels on Embedded CPU, GPU and FPGA Platforms","authors":"Murad Qasaimeh, Joseph Zambreno, Phillip H. Jones, K. Denolf, Jack Lo, K. Vissers","doi":"10.1109/FCCM.2019.00077","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00077","url":null,"abstract":"This paper presents a benchmark of the energy efficiency of a wide range of vision kernels on three commonly used hardware accelerators for embedded vision applications: ARM57 CPU, Jetson TX2 GPU and ZCU102 FPGA, using their vendor optimized vision libraries: OpenCV, VisionWorks and xfOpenCV. Our results show that the GPU achieves an energy/frame reduction ratio of 1.1-3.2x compared to CPU and FPGA for simple kernels. For more complicated kernels, however, the FPGA outperforms the others with energy/frame reduction ratios of 1.2-22.3x. It is also observed that the FPGA performs increasingly better as a vision kernel's complexity grows.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115705100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nils Voss, Pablo Quintana, O. Mencer, W. Luk, G. Gaydadjiev
{"title":"Memory Mapping for Multi-die FPGAs","authors":"Nils Voss, Pablo Quintana, O. Mencer, W. Luk, G. Gaydadjiev","doi":"10.1109/FCCM.2019.00021","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00021","url":null,"abstract":"This paper proposes an algorithm for mapping logical to physical memory resources on FPGAs. Our greedy algorithm is specifically designed to facilitate timing closure on modern multi-die FPGAs for static-dataflow accelerators utilising most of the on-chip resources. The main objective of the proposed algorithm is to ensure that specific sub-parts of the design under consideration can fully reside within a single die to limit inter-die communication. This is achieved by performing the memory mapping for each sub-part of the design separately while keeping allocation of the available physical resources balanced. As a result, the number of inter-die connections is reduced on average by 50% compared to an algorithm targeting minimal area usage for real, complex applications using most of the on-chip resources. Additionally, our algorithm is the only one of the four evaluated approaches that successfully produces place and route results for all 33 applications and benchmarks.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"57 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116432298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}