{"title":"Towards an Objective Metric for the Performance of Exact Triangle Count","authors":"Mark P. Blanco, Scott McMillan, Tze Meng Low","doi":"10.1109/HPEC43674.2020.9286188","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286188","url":null,"abstract":"The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114462600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Graphlet Spectrograms for Temporal Pattern Analysis of Virus-Research Collaboration Networks","authors":"D. Floros, Tiancheng Liu, N. Pitsianis, Xiaobai Sun","doi":"10.1109/HPEC43674.2020.9286161","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286161","url":null,"abstract":"We introduce a new method for temporal pattern analysis of scientific collaboration networks. We investigate in particular virus research activities through five epidemic or pandemic outbreaks in the recent two decades and in the ongoing pandemic with COVID-19. Our method embodies two innovative components. The first is a simple model of temporal collaboration networks with time segmented in publication time and convolved in citation history, to effectively capture and accommodate collaboration activities at mixed time scales. The second component is the novel use of graphlets to encode topological structures and to detect change and persistence in collaboration activities over time. We discover in particular two unique and universal roles of bi-fork graphlet in (1) identifying bridges among triangle clusters and (2) quantifying grassroots as the backbone of every collaboration network. We present a number of intriguing patterns and findings about the virus-research activities.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126390204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Survey of Machine Learning Accelerators","authors":"A. Reuther, P. Michaleas, Michael Jones, V. Gadepally, S. Samsi, J. Kepner","doi":"10.1109/HPEC43674.2020.9286149","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286149","url":null,"abstract":"New machine learning accelerators are being announced and released each month for a variety of applications from speech recognition, video object detection, assisted driving, and many data center applications. This paper updates the survey of of AI accelerators and processors from last year's IEEE-HPEC paper. This paper collects and summarizes the current accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. This year, there are many more announced accelerators that are implemented with many more architectures and technologies from vector engines, dataflow engines, neuromorphic designs, flash-based analog memory processing, and photonic-based processing.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128428367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Homomorphic Encryption for Quantum Annealing with Spin Reversal Transformations","authors":"D. O’Malley, John K. Golden","doi":"10.1109/HPEC43674.2020.9286176","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286176","url":null,"abstract":"Homomorphic encryption has been an area of study in classical computing for decades. The fundamental goal of homomorphic encryption is to enable (untrusted) Oscar to perform a computation for Alice without Oscar knowing the input to the computation or the output from the computation. Alice encrypts the input before sending it to Oscar, and Oscar performs the computation directly on the encrypted data, producing an encrypted result. Oscar then sends the encrypted result of the computation back to Alice, who can decrypt it. We describe an approach to homomorphic encryption for quantum annealing based on spin reversal transformations and show that it comes with little or no performance penalty. This is in contrast to approaches to homomorphic encryption for classical computing, which incur a significant additional computational cost. This implies that the performance gap between quantum annealing and classical computing is reduced when both paradigms use homomorphic encryption. Further, homomorphic encryption is critical for quantum annealing because quantum annealers are native to the cloud - a third party (such as untrusted Oscar) performs the computation. If sensitive information, such as health-related data subject to the Health Insurance Portability and Accountability Act, is to be processed with quantum annealers, such a technique could be useful.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125882551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimising AI Training Deployments using Graph Compilers and Containers","authors":"Nina Mujkanovic, K. Sivalingam, A. Lazzaro","doi":"10.1109/HPEC43674.2020.9286153","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286153","url":null,"abstract":"Artificial Intelligence (AI) applications based on Deep Neural Networks (DNN) or Deep Learning (DL) have become popular due to their success in solving problems like image analysis and speech recognition. Training a DNN is computationally intensive and High Performance Computing (HPC) has been a key driver in AI growth. Virtualisation and container technology have led to the convergence of cloud and HPC infrastructure. These infrastructures with diverse hardware increase the complexity of deploying and optimising AI training workloads. AI training deployments in HPC or cloud can be optimised with target-specific libraries, graph compilers, and by improving data movement or IO. Graph compilers aim to optimise the execution of a DNN graph by generating an optimised code for a target hardware/backend. As part of SODALITE (a Horizon 2020 project), MODAK tool is developed to optimise application deployment in software defined infrastructures. Using input from the data scientist and performance modelling, MODAK maps optimal application parameters to a target infrastructure and builds an optimised container. In this paper, we introduce MODAK and review container technologies and graph compilers for AI. We illustrate optimisation of AI training deployments using graph compilers and Singularity containers. Evaluation using MNIST-CNN and ResNet50 training workloads shows that custom built optimised containers outperform the official images from DockerHub. We also found that the performance of graph compilers depends on the target hardware and the complexity of the neural network.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115601837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accuracy and Performance Comparison of Video Action Recognition Approaches","authors":"Matthew Hutchinson, S. Samsi, W. Arcand, David Bestor, Bill Bergeron, C. Byun, Micheal Houle, M. Hubbell, Michael J. Jones, J. Kepner, Andrew Kirby, P. Michaleas, Lauren Milechin, J. Mullen, Andrew Prout, Antonio Rosa, A. Reuther, Charles Yee, V. Gadepally","doi":"10.1109/HPEC43674.2020.9286249","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286249","url":null,"abstract":"Over the past few years, there has been significant interest in video action recognition systems and models. However, direct comparison of accuracy and computational performance results remain clouded by differing training environments, hardware specifications, hyperparameters, pipelines, and inference methods. This article provides a direct comparison between fourteen “off-the-shelf” and state-of-the-art models by ensuring consistency in these training characteristics in order to provide readers with a meaningful comparison across different types of video action recognition algorithms. Accuracy of the models is evaluated using standard Top-1 and Top-5 accuracy metrics in addition to a proposed new accuracy metric. Additionally, we compare computational performance of distributed training from two to sixty-four GPUs on a state-of-the-art HPC system.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115729528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compute, Time and Energy Characterization of Encoder-Decoder Networks with Automatic Mixed Precision Training","authors":"S. Samsi, Michael Jones, M. Veillette","doi":"10.1109/HPEC43674.2020.9286241","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286241","url":null,"abstract":"Deep neural networks have shown great success in many diverse fields. The training of these networks can take significant amounts of time, compute and energy. As datasets get larger and models become more complex, the exploration of model architectures becomes prohibitive. In this paper we examine the compute, energy and time costs of training a U-Net based deep neural network for the problem of predicting short term weather forecasts (called precipitation Nowcasting). By leveraging a combination of data distributed and mixed-precision training, we explore the design space for this problem. We also show that larger models with better performance come at a potentially incremental cost if appropriate optimizations are used. We show that it is possible to achieve a significant improvement in training time by leveraging mixed-precision training without sacrificing model performance. Additionally, we find that a 1549% increase in the number of trainable parameters for a network comes at a relatively smaller 63.22% increase in energy usage for a UNet with 4 encoding layers.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116247425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking network fabrics for data distributed training of deep neural networks","authors":"S. Samsi, Andrew Prout, Michael Jones, Andrew Kirby, Bill Arcand, Bill Bergeron, David Bestor, C. Byun, V. Gadepally, Michael Houle, M. Hubbell, Anna Klein, P. Michaleas, Lauren Milechin, J. Mullen, Antonio Rosa, Charles Yee, A. Reuther, J. Kepner","doi":"10.1109/HPEC43674.2020.9286232","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286232","url":null,"abstract":"Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114779405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Best of Both Worlds: High Performance Interactive and Batch Launching","authors":"C. Byun, J. Kepner, W. Arcand, David Bestor, Bill Bergeron, V. Gadepally, Michael Houle, M. Hubbell, Michael Jones, Andrew Kirby, Anna Klein, P. Michaleas, Lauren Milechin, J. Mullen, Andrew Prout, Antonio Rosa, S. Samsi, Charles Yee, A. Reuther","doi":"10.1109/HPEC43674.2020.9286142","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286142","url":null,"abstract":"Rapid launch of thousands of jobs is essential for effective interactive supercomputing, big data analysis, and AI algorithm development. Achieving thousands of launches per second has required hardware to be available to receive these jobs. This paper presents a novel preemptive approach to implement “spot” jobs on MIT SuperCloud systems allowing the resources to be fully utilized for both long running batch jobs while still providing fast launch for interactive jobs. The new approach separates the job preemption and scheduling operations and can achieve 100 times faster performance in the scheduling of a job with preemption when compared to using the standard scheduler-provided automatic preemption-based capability. The results demonstrate that the new approach can schedule interactive jobs preemptively at a performance comparable to when the required computing resources are idle and available. The spot job capability can be deployed without disrupting the interactive user experience while increasing the overall system utilization.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128201540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Non-Negative Tensor Train Decomposition","authors":"Manish Bhattarai, Gopinath Chennupati, E. Skau, Raviteja Vangara, Hirsto Djidjev, B. Alexandrov","doi":"10.1109/HPEC43674.2020.9286234","DOIUrl":"https://doi.org/10.1109/HPEC43674.2020.9286234","url":null,"abstract":"The era of exascale computing opens new venues for innovations and discoveries in many scientific, engineering, and commercial fields. However, with the exaflops also come the extra-large high-dimensional data generated by highperformance computing. High-dimensional data is presented as multidimensional arrays, aka tensors. The presence of latent (not directly observable) structures in the tensor allows a unique representation and compression of the data by classical tensor factorization techniques. However, the classical tensor methods are not always stable or they can be exponential in their memory requirements, which makes them not suitable for high-dimensional tensors. Tensor train (TT) is a state-of-the-art tensor network introduced for factorization of high-dimensional tensors. TT transforms the initial high-dimensional tensor in a network of three-dimensional tensors that requires only a linear storage. Many real-world data, such as, density, temperature, population, probability, etc., are non-negative and for an easy interpretation, the algorithms preserving non-negativity are preferred. Here, we introduce a distributed non-negative tensor-train and demonstrate its scalability and the compression on synthetic and realworld big datasets.","PeriodicalId":168544,"journal":{"name":"2020 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128888563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}