{"title":"An SSD-Based Accelerator for Singular Value Decomposition Recommendation Algorithm on Edge","authors":"Wei Wu, Letian Zhao, Qizhe Wu, Xiaotian Wang, Teng Tian, Xi Jin","doi":"10.1109/HPEC55821.2022.9926379","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926379","url":null,"abstract":"Recommender systems (RSs) are widely used in social networks, computational advertising, video platforms, and many other Internet applications. Most RSs are based on a cloud-to-edge framework: recommended item lists are computed in the cloud server and then transmitted to the edge device. Network bandwidth and latency between the cloud server and the edge can delay recommendations. Edge computing can capture a user's real-time preferences and thus improve recommendation quality. However, the increasing complexity of recommendation algorithms and growing data scales make real-time recommendation on the edge challenging. To address these problems, this paper focuses on the Jacobi-based singular value decomposition (SVD) algorithm because of its high potential for parallel processing and the cost-effectiveness of NVM storage. We propose an SSD-based accelerator for the one-sided Jacobi transformation algorithm and implement a hardware prototype on a real Xilinx FPGA development board. Experimental results show that the proposed SVD engine achieves a 3.4x to 5.8x speedup over software SVD solvers such as MATLAB running on a high-performance CPU.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128379009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
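The one-sided Jacobi kernel this accelerator targets can be sketched as a scalar reference model: column pairs of a working copy of A are rotated until mutually orthogonal, after which the singular values are simply the column norms. This is a minimal NumPy sketch of the textbook algorithm, not the paper's hardware design.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Return the singular values of A by one-sided Jacobi rotations.

    Each sweep visits every column pair (p, q) and applies a plane
    rotation that makes the two columns orthogonal; sweeps repeat
    until all pairwise inner products vanish."""
    U = np.array(A, dtype=float)
    n = U.shape[1]
    for _ in range(max_sweeps):
        max_off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                max_off = max(max_off, abs(gamma))
                if abs(gamma) < tol:
                    continue
                zeta = (beta - alpha) / (2.0 * gamma)
                # smaller-magnitude root of t^2 + 2*zeta*t - 1 = 0
                if zeta >= 0.0:
                    t = 1.0 / (zeta + np.sqrt(1.0 + zeta * zeta))
                else:
                    t = -1.0 / (-zeta + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                U[:, [p, q]] = U[:, [p, q]] @ np.array([[c, s], [-s, c]])
        if max_off < tol:
            break
    return np.linalg.norm(U, axis=0)  # singular values = column norms
```

The independence of the column-pair rotations within a sweep is what gives the algorithm the parallelism the paper exploits.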
{"title":"Hardware Design and Implementation of Post-Quantum Cryptography Kyber","authors":"Qingru Zeng, Quanxin Li, Baoze Zhao, Han Jiao, Yihua Huang","doi":"10.1109/HPEC55821.2022.9926344","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926344","url":null,"abstract":"In order to resist quantum attacks, post-quantum cryptographic algorithms have become the focus of cryptography research. As a lattice-based key encapsulation algorithm, Kyber has strong advantages among the post-quantum candidates. This paper proposes an efficient hardware design for Kyber512, whose security level is L1. We first design a general hash module that reuses computing cores to improve resource utilization. A ping-pong RAM and a pipeline structure are then used to design a general-purpose NTT processor that supports all operations in polynomial multiplication. Finally, inter-module cooperation and data scheduling are compactly designed to shorten the working cycle. The top-level key generation, public key encryption, and private key decryption modules are implemented on an Artix-7 FPGA at 204MHz. The times of the corresponding modules are 11.5s, 17.3s, and 23.5s, respectively. Compared with the leading hardware implementation, our design reduces the area-delay product by 10.2%, achieving an effective balance between area and delay.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123809086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
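The number-theoretic transform (NTT) that the paper's processor accelerates is a DFT over a finite field; pointwise multiplication in the transform domain gives polynomial (cyclic) convolution. A toy sketch with illustrative parameters (n=8, q=17, root w=2), not Kyber's actual n=256, q=3329 negacyclic variant:

```python
# Toy NTT over Z_17: w=2 is a primitive 8th root of unity mod 17
# (2^8 = 256 = 15*17 + 1). Naive O(n^2) form for clarity; hardware
# NTT processors use the O(n log n) butterfly structure instead.
Q, N, W = 17, 8, 2

def ntt(a):
    """Forward transform: ahat[i] = sum_j a[j] * w^(i*j) mod q."""
    return [sum(a[j] * pow(W, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def intt(ahat):
    """Inverse transform, scaled by n^-1 mod q."""
    w_inv = pow(W, -1, Q)   # modular inverse (Python 3.8+)
    n_inv = pow(N, -1, Q)
    return [n_inv * sum(ahat[j] * pow(w_inv, i * j, Q) for j in range(N)) % Q
            for i in range(N)]
```

Multiplying two degree-<4 polynomials then reduces to `intt` of the elementwise product of their transforms; Kyber performs the same trick with a negacyclic twist so products are reduced modulo x^n + 1.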
{"title":"Performance Estimation for Efficient Image Segmentation Training of Weather Radar Algorithms","authors":"Joseph McDonald, J. Kurdzo, P. Stepanian, M. Veillette, David Bestor, Michael Jones, V. Gadepally, S. Samsi","doi":"10.1109/HPEC55821.2022.9926400","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926400","url":null,"abstract":"Deep learning's demand for compute resources is increasing dramatically, and with it the energy required to develop, explore, and test model architectures. Parameter tuning customarily involves training multiple models over a grid of parameter choices, either randomly or exhaustively, and strategies that apply complex search methods to identify candidate architectures require significant computation for each architecture sampled. Extensively training many individual models just to choose a single best performer for future inference is wasteful at a time when energy efficiency and minimizing computing's environmental impact are increasingly important, so techniques that reduce the computational budget needed to identify and train accurate deep networks are greatly needed. This work considers one recently proposed approach, Training Speed Estimation, alongside deep learning approaches to a common hydrometeor classification problem: hail prediction through semantic image segmentation. We apply the method to the training of a variety of segmentation models and evaluate its effectiveness as a performance-tracking approach for energy-aware neural network applications. Together with early stopping, it offers a straightforward strategy for minimizing energy expenditure. By measuring consumption and estimating the level of energy savings, we characterize this strategy as a practical method for minimizing deep learning's energy and carbon impact.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125771954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
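The Training Speed Estimation idea referenced above can be sketched in a few lines: models whose training loss falls fastest early on are kept, and the rest are stopped to save energy. This is a hedged toy illustration with synthetic loss curves, not the paper's experimental pipeline; the function names are mine.

```python
# Sketch of Training Speed Estimation (TSE) as a model-ranking signal:
# sum each candidate's per-step training losses over an early window
# and keep only the lowest scorers, stopping the rest early.

def tse_score(loss_history, burn_in=0):
    """Sum of training losses after an optional burn-in; lower is
    better. Models that reduce loss faster tend to generalize better,
    which is the heuristic TSE exploits."""
    return sum(loss_history[burn_in:])

def select_candidates(histories, keep=1):
    """histories: dict name -> list of per-step losses.
    Returns the `keep` best-ranked model names."""
    ranked = sorted(histories.items(), key=lambda kv: tse_score(kv[1]))
    return [name for name, _ in ranked[:keep]]
```

The energy saving comes from truncating the full training runs of every candidate that `select_candidates` discards.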
{"title":"Hardware Software Codesign of Applications on the Edge: Accelerating Digital PreDistortion for Wireless Communications","authors":"Zhaoyang Han, Yiyue Jiang, Rahul Mushini, J. Dooley, M. Leeser","doi":"10.1109/HPEC55821.2022.9926314","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926314","url":null,"abstract":"We present a real-time adaptive Digital PreDistortion (DPD) system developed on a System-on-Chip (SoC) platform with an integrated RF front end, the AMD/Xilinx RFSoC. The design exploits the heterogeneity of the RFSoC and is carefully partitioned: the control logic and training algorithm run on the embedded ARM processor, while the predistorter module is placed on the FPGA fabric. To better coordinate the hardware and software implementations, the training algorithm has been optimized for a shorter training time, yielding a system that adapts to current environmental conditions with lower latency. Specifically, the number of signal samples used in training is reduced by exploiting the probability distribution of the input signal, which shortens training while retaining the important data samples. Results show that this reduced training set maintains the accuracy of the full data set. The implemented design balances processing between the ARM processor and the FPGA fabric, resulting in a computationally efficient solution that makes good use of the different resources available. It has been experimentally validated on an AMD/Xilinx Gen3 RFSoC board with an external GaN Power Amplifier (PA).","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132024370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
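The core DPD idea can be illustrated with a deliberately simplified model: fit a polynomial inverse of a power amplifier by least squares (indirect learning) and pre-apply it to the signal. The toy PA nonlinearity, memoryless odd-order basis, and function names below are my assumptions for illustration; the paper's RFSoC design uses its own model terms and splits training (ARM) from the predistorter datapath (FPGA fabric).

```python
import numpy as np

def pa(x):
    """Toy power amplifier: unit gain with cubic compression."""
    return x - 0.1 * x * np.abs(x) ** 2

def fit_dpd(y, x, order=5):
    """Indirect learning: regress the PA *output* y onto odd-order
    basis terms so the fit maps y back to the input x. The resulting
    post-inverse is then reused as a pre-inverse."""
    basis = np.stack([y * np.abs(y) ** (2 * k)
                      for k in range((order + 1) // 2)], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return coeffs

def predistort(x, coeffs):
    """Apply the fitted inverse polynomial ahead of the PA."""
    basis = np.stack([x * np.abs(x) ** (2 * k)
                      for k in range(len(coeffs))], axis=1)
    return basis @ coeffs
```

Cascading `pa(predistort(x, coeffs))` should track the ideal linear response far more closely than `pa(x)` alone, which is the linearization DPD is after.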
{"title":"Efficient Calculation of Triangle Centrality in Big Data Networks","authors":"Wali Mohammad Abdullah, David Awosoga, S. Hossain","doi":"10.1109/HPEC55821.2022.9926324","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926324","url":null,"abstract":"The notion of “centrality” in graph analytics has led to well-known metrics such as Google's PageRank [1], an extension of eigenvector centrality [2]. Triangle centrality is a related metric [3] that uses the presence of triangles, which play an important role in network analysis, to quantify the relative “importance” of a node in a network. Efficiently counting and enumerating these triangles is a major backbone of understanding network characteristics, and linear algebraic methods have exploited the correspondence between sparse adjacency matrices and graphs to perform such calculations, with sparse matrix-matrix multiplication as the main computational kernel. In this paper, we use an intersection representation of graph data implemented as a sparse matrix and engineer an algorithm to compute the triangle centrality of each vertex in a graph. The main computational task of calculating the sparse matrix-vector products is carefully crafted by employing compressed vectors as accumulators. As with other state-of-the-art algorithms [4], our method avoids redundant work by counting and enumerating each triangle exactly once. We present results from extensive computational experiments on large-scale real-world and synthetic graph instances that demonstrate good scalability of our method. We also present a shared-memory parallel implementation of our algorithm.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123617811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
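The linear-algebraic kernel the abstract describes, per-vertex triangle counts from the adjacency matrix, can be sketched densely in NumPy. This shows only the counting step: the full triangle-centrality formula of the paper's reference [3] additionally aggregates each vertex's neighbours' counts, and the paper's implementation works on sparse matrices.

```python
import numpy as np

def triangle_counts(A):
    """Per-vertex triangle counts for a symmetric 0/1 adjacency
    matrix A with zero diagonal.

    Entry (i, j) of A @ A counts 2-paths i -> k -> j; masking
    elementwise with A keeps only wedges closed by an edge (i, j).
    Each triangle at vertex v then contributes 2 to row v's sum."""
    wedges = (A @ A) * A
    return wedges.sum(axis=1) // 2
```

A sparse version replaces the dense product with sparse matrix-matrix multiplication, the main computational kernel the paper engineers around.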
{"title":"Generating Permutations Using Hash Tables","authors":"Oded Green, Corey J. Nolet, Joe Eaton","doi":"10.1109/HPEC55821.2022.9926387","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926387","url":null,"abstract":"Given a set of $N$ distinct values, shuffling those elements into a random order is used in a wide range of applications, including (but not limited to) statistical analysis, machine learning, games, and bootstrapping. Shuffling is equivalent to generating a random permutation and applying it. For example, randomly permuting an input allows splitting it into two or more subsets without bias, an operation repeated in machine learning whenever both a train and a test data set are needed. In this paper we describe a new method for creating random permutations that is scalable, efficient, and simple. We show that generating a random permutation shares traits with building a hash table. Our method uses a fairly new hash table, called HashGraph, to generate the permutation; HashGraph's unique data structure ensures easy generation and retrieval of the permutation. HashGraph is one of the fastest known hash tables for the GPU and also outperforms many leading CPU hash tables. We show the performance of our new permutation generation scheme using both Python and CUDA versions of HashGraph. Our CUDA implementation is roughly 10% faster than our Python implementation. Our new permutation generation algorithm is 2.6x and 1.73x faster than the shuffle operations in NVIDIA's Thrust and CuPy frameworks, respectively, and up to 150x faster than NumPy.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"69 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128669178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
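The hash-table-to-permutation connection can be illustrated with a small sketch: scattering indices into random buckets (as a hash-table build does) and then fixing up the order within each bucket yields a uniform random permutation, equivalent to sorting by random keys with random tie-breaking. This is my own illustration of the idea, not HashGraph's GPU data structure or algorithm.

```python
import random

def hash_permutation(n, num_buckets=16, seed=0):
    """Permutation of 0..n-1 via bucket scattering.

    Each index is assigned a random bucket (a stand-in for a hash of
    a random key); buckets are concatenated in order, with a small
    in-bucket shuffle breaking ties uniformly."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_buckets)]
    for i in range(n):
        buckets[rng.randrange(num_buckets)].append(i)
    perm = []
    for b in buckets:
        rng.shuffle(b)
        perm.extend(b)
    return perm
```

On a GPU the scatter phase parallelizes naturally, which is the trait that makes hash-table construction a good fit for permutation generation.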
{"title":"Exploring the Impacts of Software Cache Configuration for In-line Compressed Arrays","authors":"Sansriti Ranjan, Dakota Fulp, Jon C. Calhoun","doi":"10.1109/HPEC55821.2022.9926289","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926289","url":null,"abstract":"To compute on or analyze large data sets, applications need access to large amounts of memory. Increasing the amount of physical memory requires costly hardware upgrades; compressing large arrays stored in an application's memory does not, while giving the appearance of more physical memory. In-line compressed arrays compress and decompress data needed by the application as it moves in and out of its working set in main memory. Naive compressed arrays require a compression or decompression operation for each store or load, respectively, which significantly hurts performance. Caching decompressed values in a software-managed cache limits the number of compression/decompression operations, improving performance, and the structure of the software cache affects application performance. In this paper, we build and use a compression cache simulator to analyze and simulate various cache configurations for an application. Our simulator is able to leverage and model the multidimensional nature of high-performance computing (HPC) data and compressors. We evaluate both direct-mapped and set-associative caches on five HPC kernels. Finally, we construct a performance model to explore the runtime impact of cache configurations. Results show that tuning cache policy by increasing the block size, associativity, and cache size significantly improves the hit rate for all applications. Incorporating dimensionality further improves locality and hit rate, speeding up applications by up to 28.25%.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125147965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
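The kind of simulation the abstract describes reduces, in its simplest form, to replaying a trace of block addresses through a cache model and measuring the hit rate. A minimal direct-mapped sketch (the paper's simulator additionally models set associativity and the data's multidimensional layout):

```python
def hit_rate(trace, num_blocks):
    """Replay a trace of decompressed-block addresses through a
    direct-mapped software cache and return the fraction of hits.

    A hit means the block is already decompressed in the cache; a
    miss stands in for a decompress-and-fill operation."""
    cache = [None] * num_blocks          # slot = address % num_blocks
    hits = 0
    for addr in trace:
        slot = addr % num_blocks
        if cache[slot] == addr:
            hits += 1
        else:
            cache[slot] = addr
    return hits / len(trace)
```

Sweeping `num_blocks` (and, in a fuller model, block size and associativity) over a real access trace is exactly the configuration-tuning experiment whose results the paper reports.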
{"title":"Challenges Designing for FPGAs Using High-Level Synthesis","authors":"Clayton J. Faber, S. Harris, Zhili Xiao, R. Chamberlain, A. Cabrera","doi":"10.1109/HPEC55821.2022.9926398","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926398","url":null,"abstract":"High-Level Synthesis (HLS) tools aim to enable performant FPGA designs authored in a high-level language. While commercial HLS tools are available today, there is still a substantial performance gap between most designs developed via HLS and those built with traditional, labor-intensive approaches. We report on several cases where an anticipated performance improvement was either not realized or resulted in decreased performance. These include: programming-paradigm choices between data-parallel and pipelined designs; dataflow implementations; configuration parameter choices; and handling odd data set sizes. The results point to a number of improvements needed in HLS tool flows, including a strong need for performance modeling that can reliably guide the compilation optimization process.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114680771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal GPU Frequency Selection using Multi-Objective Approaches for HPC Systems","authors":"Ghazanfar Ali, Sridutt Bhalachandra, N. Wright, Mert Side, Yong Chen","doi":"10.1109/HPEC55821.2022.9926317","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926317","url":null,"abstract":"Power consumption poses a significant challenge in current and emerging GPU-enabled high-performance computing (HPC) systems. Modern GPUs provide controls such as dynamic voltage and frequency scaling (DVFS) to regulate power consumption. Because computational intensities vary and a wide range of frequency settings is available, selecting the optimal frequency configuration for a given GPU workload is non-trivial, and applying a power control with the single objective of reducing power may degrade performance, leading to more energy consumption. In this study, we characterize and identify GPU utilization metrics that influence both the power and the execution time of a given workload. Analytical models for power and execution time are then proposed using the characterized feature set. Multi-objective functions, the energy-delay product (EDP) and ED2P, are used to select an optimal GPU DVFS configuration for a workload such that power consumption is reduced with no or negligible performance degradation. The evaluation was conducted using the SPEC ACCEL benchmarks on an NVIDIA GV100 GPU. The proposed power and performance analytical models demonstrated prediction accuracies of up to 99.2% and 98.8%, respectively. On average, the benchmarks showed 28.6% and 25.2% energy savings using the EDP and ED2P approaches, respectively, without performance degradation. Furthermore, the proposed models require metric collection at only the maximum frequency rather than at all supported DVFS configurations.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125625605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
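The multi-objective selection step reduces to a one-liner once power P(f) and execution time T(f) are available per candidate frequency: with energy E = P·T, the energy-delay product is EDP = E·T = P·T² and ED2P = E·T² = P·T³, and the chosen frequency minimizes the product. A hedged sketch using made-up sample values (the paper predicts P and T from utilization metrics rather than measuring every frequency):

```python
def best_frequency(samples, exponent=2):
    """Pick the frequency minimizing P * T**exponent.

    samples: {freq_mhz: (power_watts, time_seconds)}
    exponent=2 gives EDP (= energy * delay), exponent=3 gives ED2P,
    which weights performance more heavily than EDP does."""
    return min(samples, key=lambda f: samples[f][0] * samples[f][1] ** exponent)
```

Because ED2P penalizes delay more strongly, it can prefer a higher frequency than EDP for the same workload, trading some energy for performance.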