Title: Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework
Authors: Rushi Patel, Pouya Haghi, Shweta Jain, A. Kot, V. Krishnan, Mayank Varia, Martin C. Herbordt
Venue: 2022 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC55821.2022.9926388

Abstract: Performance of distributed data center applications can be improved through the use of FPGA-based SmartNICs, which provide additional functionality while enabling higher-bandwidth, lower-latency communication. Until recently, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography, secure Multi-Party Computation (MPC), that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that the COPA framework yields significant improvements in both computation and communication compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA sustains more than 17 Gb/s of communication bandwidth while using only 3% of Stratix 10 resources. We show that the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100 Gb/s link, delivering higher performance than traditional NICs.
Title: GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic
Authors: Michael Jones, J. Kepner, Daniel Andersen, A. Buluç, C. Byun, K. Claffy, Tim Davis, W. Arcand, Jonathan Bernays, David Bestor, William Bergeron, V. Gadepally, Micheal Houle, M. Hubbell, Hayden Jananthan, Anna Klein, C. Meiners, Lauren Milechin, J. Mullen, Sandeep Pisharody, Andrew Prout, A. Reuther, Antonio Rosa, S. Samsi, Jon Sreekanth, Douglas Stetson, Charles Yee, P. Michaleas
Venue: 2022 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC55821.2022.9926332

Abstract: Long range detection is a cornerstone of defense in many operating domains (land, sea, undersea, air, space, ...). In the cyber domain, long range detection requires the analysis of significant network traffic from a variety of observatories and outposts. Constructing anonymized hypersparse traffic matrices on edge network devices can be a key enabler, providing significant data compression in a rapidly analyzable format that protects privacy. GraphBLAS is ideally suited for both constructing and analyzing anonymized hypersparse traffic matrices. The performance of GraphBLAS on an Accolade Technologies edge network device is demonstrated on a near worst-case traffic scenario using a continuous stream of CAIDA Telescope darknet packets. Performance is explored for varying numbers of traffic buffers, threads, and processor cores. Anonymized hypersparse traffic matrices can be constructed at a rate of over 50,000,000 packets per second, exceeding a typical 400 Gigabit network link. This performance demonstrates that anonymized hypersparse traffic matrices are readily computable on edge network devices with minimal compute resources and can be a viable data product for such devices.
Title: Benchmarking Resource Usage for Efficient Distributed Deep Learning
Authors: Nathan C Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, V. Gadepally, S. Samsi
Venue: 2022 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC55821.2022.9926375

Abstract: Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. It is therefore essential to understand how different deep neural networks (DNNs) and training regimes leverage increasing compute and energy resources, especially for specialized, computationally intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains and tasks (natural language processing, computer vision, and chemistry) on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock-rate limits to capture and illustrate the trade-offs and scaling behaviors each representative model exhibits under various resource- and energy-constrained regimes. We fit power-law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help high-performance computing providers optimize resource utilization by selectively reducing energy consumption for different deep learning tasks and workflows with minimal impact on training.
{"title":"A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory","authors":"M. S. Islam, Qiqi Wang","doi":"10.1109/HPEC55821.2022.9926410","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926410","url":null,"abstract":"This paper presents an algorithm to accelerate the Jacobi iteration for solving linear systems of equations arising from structured problems on graphics processing units (GPUs). Acceleration is achieved by utilization of on-chip GPU shared memory via a domain decomposition procedure. In particular, the problem domain is partitioned into subdomains whose data is copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory while avoiding expensive global memory accesses every iteration, resulting in a hierarchical algorithm (which takes advantage of the GPU memory hierarchy). We investigate the algorithm performance on the linear systems arising from the discretization of Poisson's equation in 1D and 2D, and observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in 2D compared to a conventional GPU implementation of Jacobi iteration which only relies on global memory.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117191476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}