Title: OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training
Authors: A. Awan, Ching-Hsiang Chu, H. Subramoni, Xiaoyi Lu, D. Panda
DOI: https://doi.org/10.1109/HiPC.2018.00024
Abstract: Existing frameworks cannot train large DNNs that do not fit in GPU memory without explicit memory management schemes. In this paper, we propose OC-DNN, a novel out-of-core DNN training framework that exploits new Unified Memory features along with new hardware mechanisms in Pascal and Volta GPUs. OC-DNN has two major design components: 1) OC-Caffe, an enhanced version of Caffe that exploits innovative UM features such as asynchronous prefetching, managed page migration, GPU-based page faults, and the cudaMemAdvise interface to enable efficient out-of-core training of very large DNNs, and 2) an interception library to transparently leverage these cutting-edge features in other frameworks. We provide a comprehensive performance characterization of our designs. OC-Caffe provides performance comparable to Caffe for regular DNNs. OC-Caffe-Opt is up to 1.9X faster than OC-Caffe-Naive and up to 5X faster than optimized CPU-based training for out-of-core workloads. OC-Caffe also allows scale-up (DGX-1) and scale-out on multi-GPU clusters.
Title: Probabilistic Sequential Consistency in Social Networks
Authors: Priyanka Singla, Shubhankar Suman Singh, Krishnamoorthy Gopinath, S. Sarangi
DOI: https://doi.org/10.1109/HiPC.2018.00020
Abstract: Researchers have proposed numerous consistency models in distributed systems that offer higher performance than classical sequential consistency (SC). Even though these models do not guarantee sequential consistency, they either behave like an SC model under certain restrictive scenarios or ensure SC behavior for a part of the system. We propose a different line of thinking: we accurately estimate the number of SC violations and then adapt the system to optimally trade off performance, resource usage, and the number of SC violations. In this paper, we propose a generic theoretical model for analyzing systems composed of multiple sub-domains, each of which is sequentially consistent. The model is validated with real-world measurements. Next, we use this model to propose a new form of consistency called social consistency, in which socially connected users perceive an SC execution while the remaining users need not. We build a prototype social network application and implement it on the Cassandra key-value store. We show that our system achieves 2.4× more throughput than Cassandra and provides 37% better quality of experience.
{"title":"Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs","authors":"Shaleen Garg, Kishore Kothapalli, Suresh Purini","doi":"10.1109/HiPC.2018.00041","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00041","url":null,"abstract":"Time-sharing, which allows for multiple users to use a shared resource, is an important and fundamental aspect of modern computing systems. However, accelerators such as GPUs, that come without a native operating system do not support time sharing. The inability of accelerators to support time-sharing limits their applicability especially as they get deployed in Platform-as-a-Service and Resource-as-a-Service environmen ts. In the former, elastic demands may require preemption where as in the latter, fine-grained economic models of service cost can be supported with time sharing. In this paper, we extend the concept of time sharing to the GPGPU computational space using cooperative multitasking approach. Our technique is applicable to any GPGPU program written in Compute Unified Device Architecture (CUDA) API provided for C/C++ programming languages. With minimal support from the programmer, our framework incorporates process scheduling, light-weight memory management, and multi-GPU support. Our framework provides an abstraction where, in a round-robin manner, every workload can use a GPU(s) over a time quantum exclusively. We demonstrate the applicability of our scheduling framework, by running many workloads concurrently in a time sharing manner.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124855211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Data-Parallel Training of Generative Adversarial Networks on HPC Systems for HEP Simulations
Authors: S. Vallecorsa, Diana Moise, F. Carminati, G. Khattak
DOI: https://doi.org/10.1109/HiPC.2018.00026
Abstract: In the field of High Energy Physics (HEP), simulating the interaction of particles with detector materials is a compute-intensive task that currently uses 50% of the computing resources globally available as part of the Worldwide LHC Computing Grid (WLCG). Since some level of approximation is acceptable, it is possible to implement simplified fast-simulation models that have the advantage of being less computationally intensive. In this work, we present a fast simulation approach based on Generative Adversarial Networks (GANs). The model consists of a conditional generative network that describes the detector response and a discriminative network; both networks are trained in an adversarial manner. The adversarial training process is computationally intensive, and applying a distributed approach is not straightforward. We rely on the MPI-based Cray Machine Learning Plugin to efficiently train the GAN over multiple nodes and GPGPUs. We report preliminary results on the accuracy of the generated samples and on the scaling of the time to solution. We demonstrate how HPC systems can be used to optimize such models, thanks to their large computational power and highly efficient interconnects.
Title: Improving Provisioned Power Efficiency in HPC Systems with GPU-CAPP
Authors: K. Straube, Jason Lowe-Power, C. Nitta, M. Farrens, V. Akella
DOI: https://doi.org/10.1109/HiPC.2018.00021
Abstract: In this paper we propose a microarchitectural technique called GPU Constant Average Power Processing (GPU-CAPP), which improves the power utilization of power-provisioning-limited systems by using as much of the provisioned power as possible to accelerate computation on parallel workloads. GPU-CAPP uses flexible, decentralized control to ensure fast response times and the scalability required for increasingly parallel GPU designs. We use GPGPU-Sim and GPUWattch to simulate GPU-CAPP and evaluate its capabilities on a subset of the Rodinia benchmark suite. Overall, GPU-CAPP achieves average speedups of 26% and 12% over equivalent fixed-frequency systems at two power targets.
Title: A Shared-Memory Parallel Algorithm for Updating Single-Source Shortest Paths in Large Dynamic Networks
Authors: S. Srinivasan, Sara Riazi, B. Norris, Sajal K. Das, S. Bhowmick
DOI: https://doi.org/10.1109/HiPC.2018.00035
Abstract: Computing the single-source shortest path (SSSP) is one of the fundamental graph algorithms and is used in many applications. Here, we focus on computing SSSP on large dynamic graphs, i.e., graphs whose structure evolves with time. We posit that instead of recomputing SSSP for each set of changes to a dynamic graph, it is more efficient to update the results based only on the region of change. To this end, we present a novel two-step shared-memory algorithm for updating SSSP on weighted large-scale graphs. The key idea of our algorithm is to identify the changes, such as vertex/edge additions and deletions, that affect the shortest path computations, and to update only the parts of the graph affected by the change. We provide a proof of correctness of the proposed algorithm. Our experiments on real and synthetic networks demonstrate that our algorithm is as much as 4X faster than computing SSSP with Galois, a state-of-the-art parallel graph analysis framework for shared-memory architectures. We also demonstrate how increasing the asynchrony can lead to even faster updates. To the best of our knowledge, this is one of the first practical parallel algorithms for updating networks on shared-memory systems that is also scalable to large networks.
{"title":"Accelerating TensorFlow with Adaptive RDMA-Based gRPC","authors":"Rajarshi Biswas, Xiaoyi Lu, D. Panda","doi":"10.1109/HiPC.2018.00010","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00010","url":null,"abstract":"Google's TensorFlow is one of the most popular Deep Learning frameworks nowadays. Distributed TensorFlow supports various channels to efficiently transfer tensors, such as gRPC over TCP/IP, gRPC+Verbs, and gRPC+MPI. At present, the community lacks a thorough characterization of distributed TensorFlow communication channels. This is critical because high-performance Deep Learning with TensorFlow needs an efficient communication runtime. Thus, we conduct a thorough analysis of the communication characteristics of distributed TensorFlow. Our studies show that none of the existing channels in TensorFlow can support adaptive and efficient communication for Deep Learning workloads with different message sizes. Moreover, the community needs to maintain these different channels while the users are also expected to tune these channels to get the desired performance. Therefore, this paper proposes a unified approach to have a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, zero-copy transmission etc. to make our runtime be adaptive to different message sizes for Deep Learning workloads. Our performance evaluations show that AR-gRPC can significantly speedup gRPC performance by up to 4.1x and 2.3x compared to the default gRPC design on IPoIB and another RDMA-based gRPC design in the community. Comet supercomputer shows that AR-gRPC design can reduce the Point-to-Point latency by up to 75% compared to the default gRPC design. By integrating our AR-gRPC with TensorFlow, we can achieve up to 3x distributed training speedup over default gRPC-IPoIB based TensorFlow.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128581881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability","authors":"Omer Subasi, R. Tipireddy, S. Krishnamoorthy","doi":"10.1109/HiPC.2018.00029","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00029","url":null,"abstract":"Checkpointing is the most widely used technique in high-performance computing (HPC) to ensure the application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems to quantify their reliability and availability. We perform trade-off analysis with respect to resource costs and reliability. Then, we explore the optimal checkpoint placement for checkpointing systems to maximize system availability. Finally, in a rigorous manner, we comparatively analyze the behavior of redundant systems where replication and repair mechanisms are employed. We postulate that the proposed models can aid system designers, who can instantiate our models to assess and quantify the availability and reliability of systems of interest.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"158 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122051908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiPC 2018 Committees","authors":"","doi":"10.1109/hipc.2018.00008","DOIUrl":"https://doi.org/10.1109/hipc.2018.00008","url":null,"abstract":"","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117121236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Parallel Read Partitioning for Concurrent Assembly of Metagenomic Data
Authors: Vasudevan Rengasamy, M. Kandemir, P. Medvedev, Kamesh Madduri
DOI: https://doi.org/10.1109/HiPC.2018.00044
Abstract: We present MetaPartMin and MetaPart, two new lightweight parallel metagenomic read partitioning strategies. Metagenomic data partitioning can aid the concurrent de novo assembly of partitions. Prior read partitioning methods tend to create a giant component of reads. We avoid this problem with new heuristics amenable to statically load-balanced parallelization. Our strategies require enumerating and sorting k-mers and minimizers from the input read sequences, and traversing an implicit graph to identify components. MetaPartMin uses minimizers to significantly lower aggregate main memory use, thereby enabling the processing of massive datasets on a modest number of compute nodes. All steps in our strategies exploit hybrid multicore and distributed-memory parallelism. We demonstrate scaling and efficiency on a collection of large-scale datasets. MetaPartMin can process a 1.25-terabase soil metagenome in 6 minutes on just 32 Intel Skylake nodes (48 cores each) of the Stampede2 supercomputer, and a 252-gigabase soil metagenome in 54 seconds on 16 Stampede2 Skylake nodes. The source code is available at https://github.com/vasupsu/MetaPart.