{"title":"Bandwidth Allocation in Silicon-Photonic Networks Using Application Instrumentation","authors":"A. Narayan, A. Joshi, A. Coskun","doi":"10.1109/HPEC43674.2020.9286151","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286151","url":null,"abstract":"Photonic network-on-chips, despite their low-latency and high-bandwidth-density advantages in large manycore systems, suffer from high power overhead. This overhead is further exacerbated by the high bandwidth demands of data-centric applications. Prior works utilize bandwidth allocation policies at system-level to minimize photonic power and provide required bandwidth for applications. We present an approach to minimize the bandwidth requirements by instrumenting an application at the software level. This instrumented information is used to assist bandwidth allocation at system-level, thereby reducing the photonic power. We instrument PageRank application and demonstrate 35% lower power using instrumentation-assisted bandwidth allocation on PageRank running real-world graphs compared to bandwidth allocation on uninstrumented PageRank.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123033583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Efficient Analysis of Synchrophasor Data using the NVIDIA Jetson Nano","authors":"Suzanne J. Matthews, A. S. Leger","doi":"10.1109/HPEC43674.2020.9286226","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286226","url":null,"abstract":"Smart Grid Technology is an important part of increasing resilience and reliability of power grids. Applying Phasor Measurement Units (PMUs) to obtain synchronized phasor measurements, or synchrophasors, provides more detailed, higher fidelity data that can enhance situational awareness by rapidly detecting anomalous conditions. However, sample rates of PMUs are up to three orders of magnitude faster than traditional telemetry, resulting in large datasets that require novel computing methods to process the data quickly and efficiently. This work aims to improve calculation speed and energy efficiency of anomaly detection by leveraging manycore computing on a NVIDIA Jetson Nano. This work translates an existing PMU anomaly detection scheme into a novel GPU-compute algorithm and compares the computational performance and energy efficiency of the GPU approach to serial and multicore CPU methods. The GPU algorithm was benchmarked on a real dataset of 11.3 million measurements derived from 8 PMUs from a 1:1000 scale emulation of a power grid, and two additional datasets derived from the original dataset. Results show that the GPU detection scheme is up to 51.91 times faster than the serial method, and over 13 times faster than the multicore method. Additionally, the GPU approach exhibits up to 92.3% run-time energy reduction compared to serial method and 78.4% reduction compared to the multicore approach.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128063507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure","authors":"Sayan Ghosh, M. Halappanavar","doi":"10.1109/HPEC43674.2020.9286167","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286167","url":null,"abstract":"Graph analytics has emerged as an important tool in the analysis of large scale data from diverse application domains such as social networks, cyber security and bioinformatics. Counting the number of triangles in a graph is a fundamental kernel with several applications such as detecting the community structure of a graph or in identifying important vertices in a graph. The ubiquity of massive datasets is driving the need to scale graph analytics on parallel systems. However, numerous challenges exist in efficiently parallelizing graph algorithms, especially on distributed-memory systems. Irregular memory accesses and communication patterns, low computation to communication ratios, and the need for frequent synchronization are some of the leading challenges. In this paper, we present TriC, our distributed-memory implementation of triangle counting in graphs using the Message Passing Interface (MPI), as a submission to the 2020 Graph Challenge competition. Using a set of synthetic and real-world inputs from the challenge, we demonstrate a speedup of up to 90 x relative to previous work on 32 processor-cores of a NERSC Cori node. We also provide details from distributed runs with up to 8192 processes along with strong scaling results. The observations presented in this work provide an understanding of the system-level bottlenecks at scale that specifically impact sparse-irregular workloads and will therefore benefit other efforts to parallelize graph algorithms.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133799184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Execution of Complete Molecular Dynamics Simulations on Multiple FPGAs","authors":"C. Pascoe, Lawrence C. Stewart, B. W. Sherman, Vipin Sachdeva, Martin C. Herbordt","doi":"10.1109/HPEC43674.2020.9286155","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286155","url":null,"abstract":"We have modified the open source molecular dynamics (MD) simulation code OpenMM [1] to add support for running complete MD timesteps on a cluster of FPGAs. The overall structure of the application is shown in Figure 1. MD proceeds by calculating forces on individual particles and integrating those forces to update velocities/positions on a per timestep basis. A variety of forces apply to each particle and we subdivide them into three categories based on the computation requirements: range limited (RL), long range (LR), and bonded. RL interactions comprise Lennard Jones and electrostatic forces between all particle pairs within a radial cutoff. LR interactions comprise electrostatic forces beyond the RL cutoff, where pairwise computation would be too costly. We calculate LR forces using the Smooth Particle Mesh Ewald (PME) method, which uses 3D Fast Fourier Transforms (FFTs) to accelerate computation. Bonded interactions are the focus of future work. Kernels are coded in OpenCL for ease of hardware development and application integration. The design uses a mix of fixedpoint and single-/double-precision floating-point arithmetic where needed to maintain the same level of accuracy as CPU and GPU implementations. The ultimate goal of this project is to perform MD simulation of biologically-relevant systems within the context of drug discovery (i.e., periodic systems of 50,000–100,000 particles with approximate density of 1 atom per 10 cubic Å) with strong scaling performance greater than possible with other technologies such as GPUs.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123419110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human balance models optimized using a large-scale, parallel architecture with applications to mild traumatic brain injury","authors":"G. Ciccarelli, Michael Nolan, H. Rao, Tanya Talkar, A. O'Brien, G. Vergara-Diaz, R. Zafonte, T. Quatieri, R. McKindles, P. Bonato, A. Lammert","doi":"10.1109/HPEC43674.2020.9286217","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286217","url":null,"abstract":"Static and dynamic balance are frequently disrupted through brain injuries. The impairment can be complex and for mild traumatic brain injury (mTBI) can be undetectable by standard clinical tests. Therefore, neurologically relevant modeling approaches are needed for detection and inference of mechanisms of injury. The current work presents models of static and dynamic balance that have a high degree of correspondence. Emphasizing structural similarity between the domains facilitates development of both. Furthermore, particular attention is paid to components of sensory feedback and sensory integration to ground mechanisms in neurobiology. Models are adapted to fit experimentally collected data from 10 healthy control volunteers and 11 mild traumatic brain injury volunteers. Through an analysis by synthesis approach whose implementation was made possible by a state-of-the-art high performance computing system, we derived an interpretable, model based feature set that could classify mTBI and controls in a static balance task with an ROC AUC of 0.72.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128466328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Distributed Framework for Multi-Agent Reinforcement Learning Research","authors":"Yutai Zhou, Shawn Manuel, Peter Morales, Sheng Li, Jaime Peña, R. Allen","doi":"10.1109/HPEC43674.2020.9286212","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286212","url":null,"abstract":"Some of the most important publications in deep reinforcement learning over the last few years have been fueled by access to massive amounts of computation through large scale distributed systems. The success of these approaches in achieving human-expert level performance on several complex video-game environments has motivated further exploration into the limits of these approaches as computation increases. In this paper, we present a distributed RL training framework designed for super computing infrastructures such as the MIT SuperCloud. We review a collection of challenging learning environments-such as Google Research Football, StarCraft II, and Multi-Agent Mujoco- which are at the frontier of reinforcement learning research. We provide results on these environments that illustrate the current state of the field on these problems. Finally, we also quantify and discuss the computational requirements needed for performing RL research by enumerating all experiments performed on these environments.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129633666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incremental Streaming Graph Partitioning","authors":"L. Durbeck, P. Athanas","doi":"10.1109/HPEC43674.2020.9286181","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286181","url":null,"abstract":"Graph partitioning is an NP-hard problem whose efficient approximation has long been a subject of interest. The I/O bounds of contemporary computing environments favor incremental or streaming graph partitioning methods. Methods have sought a balance between latency, simplicity, accuracy, and memory size. In this paper, we apply an incremental approach to streaming partitioning that tracks changes with a lightweight proxy to trigger partitioning as the clustering error increases. We evaluate its performance on the DARPA/MIT Graph Challenge streaming stochastic block partition dataset, and find that it can dramatically reduce the invocation of partitioning, which can provide an order of magnitude speedup.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"432 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122801525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dynamically Configurable Network for Software-Defined Hardware","authors":"William Butera","doi":"10.1109/HPEC43674.2020.9286148","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286148","url":null,"abstract":"This paper describes an on-die network architecture targeted for Software-Defined Hardware (SDH). Key performance goals are near ASIC-level performance over a wide range of communication patterns, dynamically configured for operation on tile arrays with O(104) tiles and defect densities in excess of 10%. We describe a network architecture based on two recent Intel circuit studies, and present simulator results that demonstrate extremes for configurability, scale-invariant place & route and resilience to defect","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129529395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Denial of Service in CPU-GPU Heterogeneous Architectures","authors":"Hao Wen, W. Zhang","doi":"10.1109/HPEC43674.2020.9286228","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286228","url":null,"abstract":"Unlike the traditional CPU-GPU heterogeneous architecture where CPU and GPU have separate DRAM and memory address space, current heterogeneous CPU-GPU architectures integrate CPU and GPU in the same die and share the same last level cache (LLC), on-chip network and memory. In this paper, we demonstrate that both CPU and GPU applications can maliciously or unintentionally monopolize the shared resource such as LLC and on-chip interconnection, resulting in significant performance loss to each other.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127791065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OpenCL Performance on the Intel Heterogeneous Architecture Research Platform","authors":"Steven Harris, R. Chamberlain, Christopher D. Gill","doi":"10.1109/HPEC43674.2020.9286213","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286213","url":null,"abstract":"The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems. Frameworks such as OpenCL enable computation orchestration on existing systems, and its availability using the Intel High Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle for exploration, we investigate the utility of several traditional matrix multiplication optimizations to better understand the performance portability of OpenCL and the implications for such optimizations on cache coherent heterogeneous architectures. Our results give targeted insights into the applicability of best practices that were designed for existing architectures when used on emerging heterogeneous systems.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115771233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}