S. M. Ghazimirsaeed, Qinghua Zhou, Amit Ruhela, Mohammadreza Bayatpour
{"title":"A Hierarchical and Load-Aware Design for Large Message Neighborhood Collectives","authors":"S. M. Ghazimirsaeed, Qinghua Zhou, Amit Ruhela, Mohammadreza Bayatpour","doi":"10.1109/SC41405.2020.00038","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00038","url":null,"abstract":"The MPI-3.0 standard introduced neighborhood collective to support sparse communication patterns used in many applications. In this paper, we propose a hierarchical and distributed graph topology that considers the physical topology of the system and the virtual communication pattern of processes to improve the performance of large message neighborhood collectives. Moreover, we propose two design alternatives on top of the hierarchical design: 1. LAG-H: assumes the same communication load for all processes, 2. LAW-H: considers the communication load of processes for fair distribution of load between them. We propose a mathematical model to determine the communication capacity of each process. Then, we use the derived capacity to fairly distribute the load between processes. Our experimental results on up to 28,672 processes show up to 9x speedup for various process topologies. We also observe up to 8.2% performance gain and 34x speedup for NAS-DT and SpMM, respectively.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134507856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sneha D. Goenka, Yatish Turakhia, B. Paten, M. Horowitz
{"title":"SegAlign: A Scalable GPU-Based Whole Genome Aligner","authors":"Sneha D. Goenka, Yatish Turakhia, B. Paten, M. Horowitz","doi":"10.1109/SC41405.2020.00043","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00043","url":null,"abstract":"Pairwise Whole Genome Alignment (WGA) is a crucial first step to understanding evolution at the DNA sequence-level. Pairwise WGA of thousands of currently available species genomes could help make biological discoveries, however, computing them for even a fraction of the millions of possible pairs is prohibitive – WGA of a single pair of vertebrate genomes (human-mouse) takes 11 hours on a 96-core Amazon Web Services (AWS) instance (c5.24xlarge). This paper presents SegAlign – a scalable, GPU-accelerated system for computing pairwise WGA. SegAlign is based on the standard seed-filter-extend heuristic, in which the filtering stage dominates the runtime (e.g. 98% for human-mouse WGA), and is accelerated using GPU(s). Using three vertebrate genome pairs, we show that SegAlign provides a speedup of up to $14 times $ on an 8-GPU, 64-core AWS instance (p3.16xlarge) for WGA and nearly $2.3 times $ reduction in dollar cost. SegAlign also allows parallelization over multiple GPU nodes and scales efficiently.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132274172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arnab Das, Ian Briggs, G. Gopalakrishnan, S. Krishnamoorthy, P. Panchekha
{"title":"Scalable yet Rigorous Floating-Point Error Analysis","authors":"Arnab Das, Ian Briggs, G. Gopalakrishnan, S. Krishnamoorthy, P. Panchekha","doi":"10.1109/SC41405.2020.00055","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00055","url":null,"abstract":"Automated techniques for rigorous floating-point round-off error analysis are a prerequisite to placing important activities in HPC such as precision allocation, verification, and code optimization on a formal footing. Yet existing techniques cannot provide tight bounds for expressions beyond a few dozen operators–barely enough for HPC. In this work, we offer an approach embedded in a new tool called SATIHE that scales error analysis by four orders of magnitude compared to today’s best-of-class tools. We explain how three key ideas underlying SATIHE helps it attain such scale: path strength reduction, bound optimization, and abstraction. SATIHE provides tight bounds and rigorous guarantees on significantly larger expressions with well over a hundred thousand operators, covering important examples including FFT, matrix multiplication, and PDE stencils.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114714853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minhui Xie, Kai Ren, Youyou Lu, Guangxu Yang, Qingxing Xu, Bihai Wu, Jiazhen Lin, H. Ao, Wanhong Xu, J. Shu
{"title":"Kraken: Memory-Efficient Continual Learning for Large-Scale Real-Time Recommendations","authors":"Minhui Xie, Kai Ren, Youyou Lu, Guangxu Yang, Qingxing Xu, Bihai Wu, Jiazhen Lin, H. Ao, Wanhong Xu, J. Shu","doi":"10.1109/SC41405.2020.00025","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00025","url":null,"abstract":"Modern recommendation systems in industry often use deep learning (DL) models that achieve better model accuracy with more data and model parameters. However, current opensource DL frameworks, such as TensorFlow and PyTorch, show relatively low scalability on training recommendation models with terabytes of parameters. To efficiently learn large-scale recommendation models from data streams that generate hundreds of terabytes training data daily, we introduce a continual learning system called Kraken. Kraken contains a special parameter server implementation that dynamically adapts to the rapidly changing set of sparse features for the continual training and serving of recommendation models. Kraken provides a sparsity-aware training system that uses different learning optimizers for dense and sparse parameters to reduce memory overhead. Extensive experiments using real-world datasets confirm the effectiveness and scalability of Kraken. Kraken can benefit the accuracy of recommendation tasks with the same memory resources, or trisect the memory usage while keeping model performance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129846628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Srinivas Eswar, Koby Hayashi, Grey Ballard, R. Kannan, R. Vuduc, Haesun Park
{"title":"Distributed-Memory Parallel Symmetric Nonnegative Matrix Factorization","authors":"Srinivas Eswar, Koby Hayashi, Grey Ballard, R. Kannan, R. Vuduc, Haesun Park","doi":"10.1109/SC41405.2020.00078","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00078","url":null,"abstract":"We develop the first distributed-memory parallel implementation of Symmetric Nonnegative Matrix Factorization (SymNMF), a key data analytics kernel for clustering and dimensionality reduction. Our implementation includes two different algorithms for SymNMF, which give comparable results in terms of time and accuracy. The first algorithm is a parallelization of an existing sequential approach that uses solvers for non symmetric NMF. The second algorithm is a novel approach based on the Gauss-Newton method. It exploits second-order information without incurring large computational and memory costs. We evaluate the scalability of our algorithms on the Summit system at Oak Ridge National Laboratory, scaling up to 128 nodes (4,096 cores) with 70% efficiency. Additionally, we demonstrate our software on an image segmentation task.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122927359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PLINER: Isolating Lines of Floating-Point Code for Compiler-Induced Variability","authors":"Hui Guo, I. Laguna, Cindy Rubio-González","doi":"10.1109/SC41405.2020.00053","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00053","url":null,"abstract":"Scientific applications are often impacted by numerical inconsistencies when using different compilers or when a compiler is used with different optimization levels; such inconsistencies hinder reproducibility and can be hard to diagnose. We present PLINER, a tool to automatically pinpoint code lines that trigger compiler-induced variability. PLINER uses a novel approach to enhance floating-point precision at different levels of code granularity, and performs a guided search to identify locations affected by numerical inconsistencies. We demonstrate PLINER on a real-world numerical inconsistency that required weeks to diagnose, which PLINER isolates in minutes. We also evaluate PLiNER on 100 synthetic programs, and the NAS Parallel Benchmarks (NPB). On the synthetic programs, PLiNER detects the affected lines of code 87% of the time while the stateof-the-art approach only detects the affected lines 6% of the time. Furthermore, PLINER successfully isolates all numerical inconsistencies found in the NPB.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128671661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Di Zhang, Dong Dai, Youbiao He, F. S. Bao, Bing Xie
{"title":"RLScheduler: An Automated HPC Batch Job Scheduler Using Reinforcement Learning","authors":"Di Zhang, Dong Dai, Youbiao He, F. S. Bao, Bing Xie","doi":"10.1109/SC41405.2020.00035","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00035","url":null,"abstract":"Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But, once configured and deployed by the experts, such priority functions can hardly adapt to the changes of job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual interventions or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies towards various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124430479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jinsung Kim, Ajay Panyala, B. Peng, K. Kowalski, P. Sadayappan, S. Krishnamoorthy
{"title":"Scalable Heterogeneous Execution of a Coupled-Cluster Model with Perturbative Triples","authors":"Jinsung Kim, Ajay Panyala, B. Peng, K. Kowalski, P. Sadayappan, S. Krishnamoorthy","doi":"10.1109/SC41405.2020.00083","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00083","url":null,"abstract":"The CCSD(T) coupled-cluster model with perturbative triples is considered a gold standard for computational modeling of the correlated behavior of electrons in molecular systems. A fundamental constraint is the relatively small global-memory capacity in GPUs compared to the main-memory capacity on host nodes, necessitating relatively smaller tile sizes for high-dimensional tensor contractions in NWChem’s GPU-accelerated implementation of the CCSD(T) method. A coordinated redesign is described to address this limitation and associated data movement overheads, including a novel fused GPU kernel for a set of tensor contractions, along with inter-node communication optimization and data caching. The new implementation of GPU-accelerated CCSD(T) improves overall performance by $3.4 times$. Finally, we discuss the trade-offs in using this fused algorithm on current and future supercomputing platforms.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122041195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arpan Jain, A. Awan, Asmaa Aljuhani, J. Hashmi, Quentin G. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani
{"title":"GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training","authors":"Arpan Jain, A. Awan, Asmaa Aljuhani, J. Hashmi, Quentin G. Anthony, H. Subramoni, D. Panda, R. Machiraju, A. Parwani","doi":"10.1109/SC41405.2020.00049","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00049","url":null,"abstract":"Data-parallelism has become an established paradigm to train DNNs that fit inside GPU memory on large-scale HPC systems. However, model-parallelism is required to train out-of-core DNNs. In this paper, we deal with emerging requirements brought forward by very large DNNs being trained using high-resolution images common in digital pathology. To address these, we propose, design, and implement GEMS; a GPU-Enabled Memory-Aware Model-Parallelism System. We present several design schemes like GEMS-MAST, GEMS-MASTER, and GEMS-Hybrid that offer excellent speedups over state-of-the-art systems like Mesh-TensorFlow and FlexFlow. Furthermore, we combine model-parallelism and data-parallelism to train a 1000-1ayer ResNet-lk model using 1,024 Volta V100 GPUs with 97.32% scaling-efficiency. For the real-world histopathology whole-slide-image (WSI) of 100,000 x 100,000 pixels, we train custom ResNet-110-v2 on image tiles of size 1024 x 1024 and reduce the training time from seven hours to 28 minutes.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131719451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pascal Grosset, C. Biwer, Jesus Pulido, A. Mohan, Ayan Biswas, J. Patchett, Terece L. Turton, D. Rogers, D. Livescu, J. Ahrens
{"title":"Foresight: Analysis That Matters for Data Reduction","authors":"Pascal Grosset, C. Biwer, Jesus Pulido, A. Mohan, Ayan Biswas, J. Patchett, Terece L. Turton, D. Rogers, D. Livescu, J. Ahrens","doi":"10.1109/SC41405.2020.00087","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00087","url":null,"abstract":"As the computation power of supercomputers increases, so does simulation size, which in turn produces orders-of-magnitude more data. Because generated data often exceed the simulation’s disk quota, many simulations would stand to benefit from data-reduction techniques to reduce storage requirements. Such techniques include autoencoders, data compression algorithms, and sampling. Lossy compression techniques can significantly reduce data size, but such techniques come at the expense of losing information that could result in incorrect post hoc analysis results. To help scientists determine the best compression they can get while keeping their analyses accurate, we have developed Foresight, an analysis framework that enables users to evaluate how different data-reduction techniques will impact their analyses. We use particle data from a cosmology simulation, turbulence data from Direct Numerical Simulation, and asteroid impact data from xRage to demonstrate how Foresight can help scientists determine the best data-reduction technique for their simulations.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130418725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}