2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献_第8页

Network Size Estimation in Small-World Networks Under Byzantine Faults 拜占庭故障下小世界网络的网络大小估计

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00094

Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson

{"title":"Network Size Estimation in Small-World Networks Under Byzantine Faults","authors":"Soumyottam Chatterjee, Gopal Pandurangan, Peter Robinson","doi":"10.1109/IPDPS.2019.00094","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00094","url":null,"abstract":"We study the fundamental problem of counting the number of nodes in a sparse network (of unknown size) under the presence of a large number of Byzantine nodes. We assume the full information model where the Byzantine nodes have complete knowledge about the entire state of the network at every round (including random choices made by all the nodes), have unbounded computational power, and can deviate arbitrarily from the protocol. Essentially all known algorithms for fundamental Byzantine problems (e.g., agreement, leader election, sampling) studied in the literature assume the knowledge (or at least an estimate) of the size of the network. It is non-trivial to design algorithms for Byzantine problems that work without knowledge of the network size, especially in bounded-degree (expander) networks where the local views of all nodes are (essentially) the same and limited, and Byzantine nodes can quite easily fake the presence/absence of non-existing nodes. To design truly local algorithms that do not rely on any global knowledge (including network size), estimating the size of the network under Byzantine nodes is an important first step. Our main contribution is a randomized distributed algorithm that estimates the size of a network under the presence of a large number of Byzantine nodes. In particular, our algorithm estimates the size of a sparse, \"small-world\", expander network with up to O(n^1-Δ) Byzantine nodes, where n is the (unknown) network size and Δ > 0 can be be any arbitrarily small (but fixed) constant. Our algorithm outputs a (fixed) constant factor estimate of log(n) with high probability; the correct estimate of the network size will be known to a large fraction (1 - ε)-fraction, for any fixed positive constant ε) of the honest nodes. Our algorithm is fully distributed, lightweight, and simple to implement, runs in O(log^3 n) rounds, and requires nodes to send and receive messages of only small-sized messages per round; any node's local computation cost per round is also small.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114633849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Dynamic Memory Management for GPU-Based Training of Deep Neural Networks 基于gpu的深度神经网络训练动态内存管理

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00030

B. ShriramS, Anshuj Garg, Purushottam Kulkarni

{"title":"Dynamic Memory Management for GPU-Based Training of Deep Neural Networks","authors":"B. ShriramS, Anshuj Garg, Purushottam Kulkarni","doi":"10.1109/IPDPS.2019.00030","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00030","url":null,"abstract":"Deep learning has been widely adopted for different applications of artificial intelligence - speech recognition, natural language processing, computer vision etc. The growing size of Deep Neural Networks (DNNs) has compelled the researchers to design memory efficient and performance optimal algorithms. Apart from algorithmic improvements, specialized hardware like Graphics Processing Units (GPUs) are being widely employed to accelerate the training and inference phases of deep networks. However, the limited GPU memory capacity limits the upper bound on the size of networks that can be offloaded to and trained using GPUs. vDNN addresses the GPU memory bottleneck issue and provides a solution which enables training of deep networks that are larger than GPU memory. In our work, we characterize and identify multiple bottlenecks with vDNN like delayed computation start, high pinned memory requirements and GPU memory fragmentation. We present vDNN++ which extends vDNN and resolves the identified issues. Our results show that the performance of vDNN++ is comparable or better (up to 60% relative improvement) than vDNN. We propose different heuristics and order for memory allocation, and empirically evaluate the extent of memory fragmentation with them. We are also able to reduce the pinned memory requirement by up to 60%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122060192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 32

QoS-Driven Coordinated Management of Resources to Save Energy in Multi-core Systems 基于qos驱动的多核系统资源协同节能管理

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00040

M. Nejat, M. Pericàs, P. Stenström

引用次数: 9

LLC-Guided Data Migration in Hybrid Memory Systems 混合存储系统中llc引导的数据迁移

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00101

E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis

{"title":"LLC-Guided Data Migration in Hybrid Memory Systems","authors":"E. Vasilakis, Vassilis D. Papaefstathiou, P. Trancoso, I. Sourdis","doi":"10.1109/IPDPS.2019.00101","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00101","url":null,"abstract":"Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is to use flat address space, hybrid memory systems of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads, which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes into account the above overheads and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments, which are (partly) present in the Last-Level Cache (LLC) introduces lower migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and select memory segments for migration. Thereby, the segments are migrated when present (at least partly) in the LLC incurring lower cost. Our experiments confirm that our approach outperforms current state-of-the art migration designs improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130977101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Exploring MPI Communication Models for Graph Applications Using Graph Matching as a Case Study 使用图匹配作为案例研究探索图应用程序的MPI通信模型

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00085

Sayan Ghosh, M. Halappanavar, A. Kalyanaraman, Arif M. Khan, A. Gebremedhin

{"title":"Exploring MPI Communication Models for Graph Applications Using Graph Matching as a Case Study","authors":"Sayan Ghosh, M. Halappanavar, A. Kalyanaraman, Arif M. Khan, A. Gebremedhin","doi":"10.1109/IPDPS.2019.00085","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00085","url":null,"abstract":"Traditional implementations of parallel graph operations on distributed memory platforms are written using Message Passing Interface (MPI) point-to-point communication primitives such as Send-Recv (blocking and nonblocking). Apart from this classical model, the MPI community has over the years added other communication models; however, their suitability for handling the irregular traffic workloads typical of graph operations remain comparatively less explored. Our aim in this paper is to study these relatively underutilized communication models of MPI for graph applications. More specifically, we evaluate MPI's one-sided programming, or Remote Memory Access (RMA), and nearest neighborhood collectives using a process graph topology. There are features in these newer models that are intended to better map to irregular communication patterns, as exemplified in graph algorithms. As a concrete application for our case study, we use distributed memory implementations of an approximate weighted graph matching algorithm to investigate performances of MPI-3 RMA and neighborhood collective operations compared to nonblocking Send-Recv. A matching in a graph is a subset of edges such that no two matched edges are incident on the same vertex. A maximum weight matching is a matching of maximum weight computed as the sum of the weights of matched edges. Execution of graph matching is dominated by high volume of irregular memory accesses, making it an ideal candidate for studying the effects of various MPI communication models on graph applications at scale. Our neighborhood collectives and RMA implementations yield up to 6x speedup over traditional nonblocking Send-Recv implementations on thousands of cores of the NERSC Cori supercomputer. We believe the lessons learned from this study can be adopted to benefit a wider range of graph applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123305950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Tight & Simple Load Balancing 紧凑和简单的负载平衡

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00080

P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling

引用次数: 12

A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning 使用DAG分区的同构处理器的可伸缩的基于集群的任务调度程序

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00026

M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek

{"title":"A Scalable Clustering-Based Task Scheduler for Homogeneous Processors Using DAG Partitioning","authors":"M. Özkaya, A. Benoit, B. Uçar, J. Herrmann, Ümit V. Çatalyürek","doi":"10.1109/IPDPS.2019.00026","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00026","url":null,"abstract":"When scheduling a directed acyclic graph (DAG) of tasks with communication costs on computational platforms, a good trade-off between load balance and data locality is necessary. List-based scheduling techniques are commonly-used greedy approaches for this problem. The downside of list-scheduling heuristics is that they are incapable of making short-term sacrifices for the global efficiency of the schedule. In this work, we describe new list-based scheduling heuristics based on clustering for homogeneous platforms, under the realistic duplex single-port communication model. Our approach uses an acyclic partitioner for DAGs for clustering. The clustering enhances the data locality of the scheduler with a global view of the graph. Furthermore, since the partition is acyclic, we can schedule each part completely once its input tasks are ready to be executed. We present an extensive experimental evaluation showing the trade-offs between the granularity of clustering and the parallelism, and how this affects the scheduling. Furthermore, we compare our heuristics to the best state-of-the-art list-scheduling and clustering heuristics, and obtain more than three times better makespan in cases with many communications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125680807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud BigSpa:云端高效的过程间静态分析引擎

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00086

Zhiqiang Zuo, Rong Gu, Xi Jiang, Zhaokang Wang, Yihua Huang, Linzhang Wang, Xuandong Li

引用次数: 9

SAC Goes Cluster: Fully Implicit Distributed Computing SAC Goes集群:完全隐式分布式计算

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00107

Thomas Macht, C. Grelck

{"title":"SAC Goes Cluster: Fully Implicit Distributed Computing","authors":"Thomas Macht, C. Grelck","doi":"10.1109/IPDPS.2019.00107","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00107","url":null,"abstract":"SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Notwithstanding, SAC as of today only supports shared-memory architectures, graphics accelerators and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect the functional nature of SAC with its write-once/read-only arrays provides a strategic advantage that we thoroughly exploit. Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results and discuss the pros and cons of our approach.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125173176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

The Path to Delivering Programable Exascale Systems 提供可编程百亿亿级系统的途径

2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2019-05-01 DOI: 10.1109/IPDPS.2019.00081

L. DeRose

{"title":"The Path to Delivering Programable Exascale Systems","authors":"L. DeRose","doi":"10.1109/IPDPS.2019.00081","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00081","url":null,"abstract":"The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems creates a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not losing sight of performance portability. In this talk, I will discuss the current trends in computer architecture and their implications in application development and will present Cray’s high level parallel programming environment for performance and programmability on current and future supercomputers. I will also discuss some of the challenges and open research problems that need to be addressed in order to build a software developer environment for extreme-scale systems that helps users solve multi-disciplinary and multi-scale problems with high levels of performance, programmability, and scalability.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122391308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1