{"title":"A High Throughput Hardware Accelerator for FFTW Codelets: A First Look","authors":"L. Pileggi, Siyuan Chen, Keshav Harisrikanth, Guanglin Xu, K. Mai, F. Franchetti","doi":"10.1109/HPEC55821.2022.9926333","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926333","url":null,"abstract":"The Fast Fourier Transform (FFT) is a critical computation for numerous applications in science and engineering. Its implementation has been widely studied and optimized on various computing platforms, with the FFTW library becoming the standard interface in HPC. In this work, we propose hardware acceleration of the FFTW library by putting a software codelet into hardware. The hardware is exposed to the user through an FFTW-compatible software library while actual computation takes place behind the scenes on a custom accelerator. To demonstrate a first look at this idea, we design a high throughput accelerator for FFTW twiddle codelets. The FFT hardware is automatically generated using SPIRAL and a test chip is fabricated in a TSMC 28nm process. We provide measured results of the test chip and discuss many opportunities for future work.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121597992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU-Accelerated High-Bandwidth Radar Centroiding","authors":"D. Brigada, Maximilian Merfeld, Kara Warner","doi":"10.1109/HPEC55821.2022.9926364","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926364","url":null,"abstract":"Radar signal processing is a computationally intensive task, especially for high-bandwidth systems. Traditionally, such systems have relied on the interleaving of processing on multiple nodes of large compute clusters to achieve the necessary throughput. Development in general-purpose GPU computing has led to a massive increase in the computational power available to highly parallel tasks. Most parts of the radar signal processing pipeline are well suited for such a task. This paper describes an algorithm for centroiding, a key part of the search radar pipeline that has not yet been demonstrated on a GPU. With this centroiding algorithm, the entire high-data-rate portion of the processing pipeline can be run on the GPU, yielding a speedup factor of approximately 40. The primary benefit of this approach is a massive reduction in data copying from the GPU to the CPU (a factor of over 1200 in this case), alleviating the main barrier to GPU-based radar processing systems.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114797378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes","authors":"Chenxu Niu, Wei Zhang, S. Byna, Yong Chen","doi":"10.1109/HPEC55821.2022.9926389","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926389","url":null,"abstract":"Distributed representation methods for words have been developed for years, and numerous methods exist, such as word2vec, GloVe, and fastText. However, they are not designed for key-value pairs, which are an important data pattern widely used in many scenarios. For example, metadata attributes of scientific files consist of a collection of key-value pairs. In this research, we propose kv2vec, a method that captures relationships between keys and values and represents key-value pairs in dense vectors. The fundamental idea of the kv2vec method is utilizing recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units to convert each key-value pair to a distributed vector representation. This new method overcomes the weaknesses of existing embedding models for representing key-value pairs as vectors. Moreover, it can be integrated into dataset search solutions through querying metadata attributes for self-describing file formats that are widely used in HPC systems. We evaluate the kv2vec method with multiple real-world datasets, and the results show that kv2vec outperforms existing models.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132463817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource-Constrained Optimizations For Synthetic Aperture Radar On-Board Image Processing","authors":"Maron Schlemon, M. Schulz, R. Scheiber","doi":"10.1109/HPEC55821.2022.9926327","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926327","url":null,"abstract":"Synthetic Aperture Radar (SAR) can be used to create realistic and high-resolution 2D or 3D reconstructions of landscapes. The data capture is typically deployed using radar instruments in specially equipped, low-flying planes, resulting in a large amount of raw data, which needs to be processed for image reconstruction. However, due to limited on-board processing capacities on the plane (power, size, weight, cooling, communication bandwidth to ground stations, etc.) and the need to capture many images during a single flight, the raw data must be processed on-board and then sent to the ground station efficiently as image products. In this paper we describe the processing architecture of the digital beamforming SAR (DBFSAR) of the German Aerospace Center (DLR) and the special steps that had to be taken to enable the on-board processing. We explain the required software optimizations and under which conditions their integration in the SAR imaging process leads to (near) real-time capability. We further describe the lessons learned in our work and discuss how they can be applied to other processing scenarios with limited resource availability.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132725135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Designs Using Several Types of Memories on Modern FPGAs","authors":"Mehmet Gungor, Kai Huang, Stratis Ioannidis, M. Leeser","doi":"10.1109/HPEC55821.2022.9926306","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926306","url":null,"abstract":"Modern FPGAs targeting data centers are designed to accelerate problems with large data. They offer many different types of memory including on-chip and on-board memories. A recent addition is High Bandwidth Memory (HBM), whose advantages have been demonstrated by others. However, there is little research that looks at how interactions among different memory types impact application performance. We investigate how a combination of HBM and on-chip memory (BRAM or URAM) impacts clock rate and overall application latency. In these designs, the on-chip memory is used as an on-chip cache for the larger amounts of data stored in HBM. Our experiments show that as the size of data stored in BRAM or URAM increases, the achievable clock speed is reduced. This in turn may result in degraded performance. We examine Garbled Circuits, an implementation of Secure Function Evaluation (SFE) with high memory demands and out-of-order data access, and examine how different choices of BRAM, URAM and HBM usage alter its performance.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132971122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI and ML Accelerator Survey and Trends","authors":"A. Reuther, P. Michaleas, Michael Jones, V. Gadepally, S. Samsi, J. Kepner","doi":"10.1109/HPEC55821.2022.9926331","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926331","url":null,"abstract":"This paper updates the survey of AI accelerators and processors from the past three years. This paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Two new trends plots based on accelerator release dates are included in this year's paper, along with the additional trends of some neuromorphic, photonic, and memristor-based inference accelerators.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133041657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HuGraph: Acceleration of GCN Training on Heterogeneous FPGA Clusters with Quantization","authors":"Letian Zhao, Qizhe Wu, Xiaotian Wang, Teng Tian, Wei Wu, Xi Jin","doi":"10.1109/HPEC55821.2022.9926312","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926312","url":null,"abstract":"Graph convolutional networks (GCNs) have succeeded significantly in numerous fields, but the need for higher performance and energy efficiency when training GCNs on larger graphs continues unabated. At the same time, since reconfigurable accelerators can customize computing modules and data movement at fine granularity, FPGAs can solve problems such as irregular memory access for GCN computing. Furthermore, to scale GCN computation, the use of heterogeneous FPGAs is inevitable due to the constant iteration of new FPGAs. In this paper, we propose a novel framework, HuGraph, which automatically maps GCN training on heterogeneous FPGA clusters. With HuGraph, FPGAs work in synchronous data parallelism using a simple ring 1D topology that is suitable for most off-the-shelf FPGA clusters. HuGraph uses three approaches to advance performance and energy efficiency. First, HuGraph applies full-process quantization for neighbor-sampling-based data parallel training, thereby reducing computation and memory consumption. Second, a novel balanced sampler is used to balance workloads among heterogeneous FPGAs so that FPGAs with fewer resources do not become bottlenecks in the cluster. Third, HuGraph schedules the execution order of GCN training to minimize time overhead. We implement a prototype on a single FPGA and evaluate cluster-level performance with a cycle-accurate simulator. Experiments show that HuGraph achieves up to 102.3×, 4.62×, and 11.1× speedup compared with the state-of-the-art works on CPU, GPU, and FPGA platforms, respectively, with negligible accuracy loss.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123602757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Kalman Filter Driven Estimation of Community Structure in Time Varying Graphs","authors":"L. Durbeck, P. Athanas","doi":"10.1109/HPEC55821.2022.9926358","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926358","url":null,"abstract":"Community detection is an NP-hard graph problem that has been the subject of decades of research. Moreover, efficient methods are needed for time-varying graphs. In this paper we propose and evaluate a method of approximating the latent block structure within a time-varying graph using a Kalman filter. The method described breaks a stream of graph updates into samples of sufficient size, each one forming a graph $G_{t}$, and has the desirable feature that it accurately updates its representation of the latent block structure using a relatively small amount of information: the prior $t-1$ predicted block structure and the current datastream sample $G_{t}$. This paper details the underlying system of linear equations, used here to represent community detection, that achieves 97% accuracy estimating the latent block representation as the community structure changes. This is demonstrated for synthetic graphs generated by a hybrid mixed-model stochastic block model from the DARPA/MIT Graph Challenge with time-varying block structure.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116732791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Graph Algorithms for Superpixel Segmentation","authors":"D. Floros, Tiancheng Liu, N. Pitsianis, Xiaobai Sun","doi":"10.1109/HPEC55821.2022.9926359","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926359","url":null,"abstract":"We introduce the novel graph-based algorithm SLAM (simultaneous local assortative mixing) for fast and high-quality superpixel segmentation of any large color image. Superpixels are compact semantic image elements; superpixel segmentation is fundamental to a broad range of vision tasks in existing and emerging applications, especially to safety-critical and time-critical applications. SLAM leverages a graph representation of the image, which encodes the pixel features and similarities, for its rich potential in implicit feature transformation and extra means for feature differentiation and association at multiple resolution scales. We demonstrate, with our experimental results on 500 benchmark images, that SLAM outperforms the state-of-the-art algorithms in superpixel quality, by multiple measures, within the same time frame. The contributions are at least two-fold: SLAM breaks down the long-standing speed barriers in graph-based algorithms for superpixel segmentation; it lifts the fundamental limitations in the feature-point-based algorithms.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116863963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep Gaussian process with multitask and transfer learning for performance optimization","authors":"Wissam M. Sid-Lakhdar, M. Aznaveh, P. Luszczek, J. Dongarra","doi":"10.1109/HPEC55821.2022.9926396","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926396","url":null,"abstract":"We combine Deep Gaussian Processes with multitask and transfer learning for the performance modeling and optimization of HPC applications. Deep Gaussian processes merge the uncertainty quantification advantage of Gaussian Processes with the predictive power of deep learning. Multitask and transfer learning allow for improved learning efficiency when several similar tasks are to be learned simultaneously and when previously learned models are sought to help in the learning of new tasks, respectively. A comparison with state-of-the-art autotuners shows the advantage of our approach on two application problems.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117137552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}