Title: OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training
Authors: A. Awan, Ching-Hsiang Chu, H. Subramoni, Xiaoyi Lu, D. Panda
DOI: https://doi.org/10.1109/HiPC.2018.00024
Abstract: Existing frameworks cannot train large DNNs that do not fit in GPU memory without explicit memory management schemes. In this paper, we propose OC-DNN, a novel out-of-core DNN training framework that exploits new Unified Memory features along with new hardware mechanisms in Pascal and Volta GPUs. OC-DNN has two major design components: 1) OC-Caffe, an enhanced version of Caffe that exploits innovative UM features such as asynchronous prefetching, managed page migration, GPU-based page faults, and the cudaMemAdvise interface to enable efficient out-of-core training of very large DNNs, and 2) an interception library to transparently leverage these cutting-edge features in other frameworks. We provide a comprehensive performance characterization of our designs. OC-Caffe provides performance comparable to Caffe for regular DNNs. OC-Caffe-Opt is up to 1.9X faster than OC-Caffe-Naive and up to 5X faster than optimized CPU-based training for out-of-core workloads. OC-Caffe also allows scale-up (DGX-1) and scale-out on multi-GPU clusters.
Title: Probabilistic Sequential Consistency in Social Networks
Authors: Priyanka Singla, Shubhankar Suman Singh, Krishnamoorthy Gopinath, S. Sarangi
DOI: https://doi.org/10.1109/HiPC.2018.00020
Abstract: Researchers have proposed numerous consistency models in distributed systems that offer higher performance than classical sequential consistency (SC). Even though these models do not guarantee sequential consistency, they either behave like an SC model under certain restrictive scenarios or ensure SC behavior for a part of the system. We propose a different line of thinking: we accurately estimate the number of SC violations and then adapt the system to optimally trade off performance, resource usage, and the number of SC violations. In this paper, we propose a generic theoretical model for analyzing systems composed of multiple sub-domains, each of which is sequentially consistent. The model is validated with real-world measurements. Next, we use this model to propose a new form of consistency called social consistency, in which socially connected users perceive an SC execution while the remaining users need not. We build a prototype social network application and implement it on the Cassandra key-value store. We show that our system achieves 2.4× more throughput than Cassandra and provides 37% better quality of experience.
{"title":"Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs","authors":"Shaleen Garg, Kishore Kothapalli, Suresh Purini","doi":"10.1109/HiPC.2018.00041","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00041","url":null,"abstract":"Time-sharing, which allows for multiple users to use a shared resource, is an important and fundamental aspect of modern computing systems. However, accelerators such as GPUs, that come without a native operating system do not support time sharing. The inability of accelerators to support time-sharing limits their applicability especially as they get deployed in Platform-as-a-Service and Resource-as-a-Service environmen ts. In the former, elastic demands may require preemption where as in the latter, fine-grained economic models of service cost can be supported with time sharing. In this paper, we extend the concept of time sharing to the GPGPU computational space using cooperative multitasking approach. Our technique is applicable to any GPGPU program written in Compute Unified Device Architecture (CUDA) API provided for C/C++ programming languages. With minimal support from the programmer, our framework incorporates process scheduling, light-weight memory management, and multi-GPU support. Our framework provides an abstraction where, in a round-robin manner, every workload can use a GPU(s) over a time quantum exclusively. We demonstrate the applicability of our scheduling framework, by running many workloads concurrently in a time sharing manner.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124855211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Data-Parallel Training of Generative Adversarial Networks on HPC Systems for HEP Simulations
Authors: S. Vallecorsa, Diana Moise, F. Carminati, G. Khattak
DOI: https://doi.org/10.1109/HiPC.2018.00026
Abstract: In the field of High Energy Physics (HEP), simulating the interaction of particles with detector materials is a compute-intensive task that currently uses 50% of the computing resources globally available as part of the Worldwide LHC Computing Grid (WLCG). Since some level of approximation is acceptable, it is possible to implement simplified fast-simulation models that have the advantage of being less computationally intensive. In this work, we present a fast simulation approach based on Generative Adversarial Networks (GANs). The model consists of a conditional generative network that describes the detector response and a discriminative network; both networks are trained in an adversarial manner. The adversarial training process is computationally intensive, and applying a distributed approach is not straightforward. We rely on the MPI-based Cray Machine Learning Plugin to efficiently train the GAN over multiple nodes and GPGPUs. We report preliminary results on the accuracy of the generated samples and on the scaling of the time to solution. We demonstrate how HPC systems can be used to optimize such models, thanks to their large computational power and highly efficient interconnects.
Title: Improving Provisioned Power Efficiency in HPC Systems with GPU-CAPP
Authors: K. Straube, Jason Lowe-Power, C. Nitta, M. Farrens, V. Akella
DOI: https://doi.org/10.1109/HiPC.2018.00021
Abstract: In this paper we propose a microarchitectural technique called GPU Constant Average Power Processing (GPU-CAPP), which improves the power utilization of power-provisioning-limited systems by using as much of the provisioned power as possible to accelerate computation on parallel workloads. GPU-CAPP uses flexible, decentralized control to ensure fast response times and the scalability required for increasingly parallel GPU designs. We use GPGPU-Sim and GPUWattch to simulate GPU-CAPP and evaluate its capabilities on a subset of the Rodinia benchmark suite. Overall, GPU-CAPP achieves average speedups of 26% and 12% over equivalent fixed-frequency systems at two power targets.
Title: A Shared-Memory Parallel Algorithm for Updating Single-Source Shortest Paths in Large Dynamic Networks
Authors: S. Srinivasan, Sara Riazi, B. Norris, Sajal K. Das, S. Bhowmick
DOI: https://doi.org/10.1109/HiPC.2018.00035
Abstract: Computing the single-source shortest path (SSSP) is one of the fundamental graph algorithms and is used in many applications. Here, we focus on computing SSSP on large dynamic graphs, i.e., graphs whose structure evolves with time. We posit that instead of recomputing SSSP for each set of changes to a dynamic graph, it is more efficient to update the results based only on the region of change. To this end, we present a novel two-step shared-memory algorithm for updating SSSP on weighted large-scale graphs. The key idea of our algorithm is to identify the changes, such as vertex/edge additions and deletions, that affect the shortest path computations, and to update only the parts of the graph affected by the change. We provide a proof of correctness of the proposed algorithm. Our experiments on real and synthetic networks demonstrate that our algorithm is as much as 4X faster than computing SSSP with Galois, a state-of-the-art parallel graph analysis framework for shared-memory architectures. We also demonstrate how increasing the asynchrony can lead to even faster updates. To the best of our knowledge, this is one of the first practical parallel algorithms for updating networks on shared-memory systems that is also scalable to large networks.
{"title":"Accelerating TensorFlow with Adaptive RDMA-Based gRPC","authors":"Rajarshi Biswas, Xiaoyi Lu, D. Panda","doi":"10.1109/HiPC.2018.00010","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00010","url":null,"abstract":"Google's TensorFlow is one of the most popular Deep Learning frameworks nowadays. Distributed TensorFlow supports various channels to efficiently transfer tensors, such as gRPC over TCP/IP, gRPC+Verbs, and gRPC+MPI. At present, the community lacks a thorough characterization of distributed TensorFlow communication channels. This is critical because high-performance Deep Learning with TensorFlow needs an efficient communication runtime. Thus, we conduct a thorough analysis of the communication characteristics of distributed TensorFlow. Our studies show that none of the existing channels in TensorFlow can support adaptive and efficient communication for Deep Learning workloads with different message sizes. Moreover, the community needs to maintain these different channels while the users are also expected to tune these channels to get the desired performance. Therefore, this paper proposes a unified approach to have a single gRPC runtime (i.e., AR-gRPC) in TensorFlow with Adaptive and efficient RDMA protocols. In AR-gRPC, we propose designs such as hybrid communication protocols, message pipelining and coalescing, zero-copy transmission etc. to make our runtime be adaptive to different message sizes for Deep Learning workloads. Our performance evaluations show that AR-gRPC can significantly speedup gRPC performance by up to 4.1x and 2.3x compared to the default gRPC design on IPoIB and another RDMA-based gRPC design in the community. Comet supercomputer shows that AR-gRPC design can reduce the Point-to-Point latency by up to 75% compared to the default gRPC design. By integrating our AR-gRPC with TensorFlow, we can achieve up to 3x distributed training speedup over default gRPC-IPoIB based TensorFlow.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128581881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantification, Trade-off Analysis, and Optimal Checkpoint Placement for Reliability and Availability","authors":"Omer Subasi, R. Tipireddy, S. Krishnamoorthy","doi":"10.1109/HiPC.2018.00029","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00029","url":null,"abstract":"Checkpointing is the most widely used technique in high-performance computing (HPC) to ensure the application progress in the presence of failures. In this paper, we present mathematical models of checkpointing systems to quantify their reliability and availability. We perform trade-off analysis with respect to resource costs and reliability. Then, we explore the optimal checkpoint placement for checkpointing systems to maximize system availability. Finally, in a rigorous manner, we comparatively analyze the behavior of redundant systems where replication and repair mechanisms are employed. We postulate that the proposed models can aid system designers, who can instantiate our models to assess and quantify the availability and reliability of systems of interest.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"158 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122051908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiPC 2018 Committees","authors":"","doi":"10.1109/hipc.2018.00008","DOIUrl":"https://doi.org/10.1109/hipc.2018.00008","url":null,"abstract":"","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117121236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Parallel Read Partitioning for Concurrent Assembly of Metagenomic Data
Authors: Vasudevan Rengasamy, M. Kandemir, P. Medvedev, Kamesh Madduri
DOI: https://doi.org/10.1109/HiPC.2018.00044
Abstract: We present MetaPartMin and MetaPart, two new lightweight parallel metagenomic read partitioning strategies. Metagenomic data partitioning can aid the concurrent de novo assembly of partitions. Prior read partitioning methods tend to create a giant component of reads. We avoid this problem with new heuristics amenable to statically load-balanced parallelization. Our strategies require enumerating and sorting k-mers and minimizers from the input read sequences, and traversing an implicit graph to identify components. MetaPartMin uses minimizers to significantly lower aggregate main memory use, thereby enabling the processing of massive datasets on a modest number of compute nodes. All steps in our strategies exploit hybrid multicore and distributed-memory parallelism. We demonstrate scaling and efficiency on a collection of large-scale datasets. MetaPartMin can process a 1.25-terabase soil metagenome in 6 minutes on just 32 Intel Skylake nodes (48 cores each) of the Stampede2 supercomputer, and a 252-gigabase soil metagenome in 54 seconds on 16 Stampede2 Skylake nodes. The source code is available at https://github.com/vasupsu/MetaPart.