2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL 多核存储系统的能力模型:Xeon Phi KNL的案例研究

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-07-03 DOI: 10.1109/IPDPS.2017.30

Sabela Ramos, T. Hoefler

{"title":"Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL","authors":"Sabela Ramos, T. Hoefler","doi":"10.1109/IPDPS.2017.30","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.30","url":null,"abstract":"Increasingly complex memory systems and onchip interconnects are developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users to navigate the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions yielding improvements 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high bandwidthMCDRAM does not improve the bitonic sort performance over DRAM.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128193529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 45

Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework 生产硬件过度配置:使用可扩展的电源感知资源管理框架的实际性能优化

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-06-30 DOI: 10.1109/IPDPS.2017.107

Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, M. Ueda, Tapasya Patki, D. Ellsworth, B. Rountree, M. Schulz

{"title":"Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework","authors":"Ryuichi Sakamoto, Thang Cao, Masaaki Kondo, Koji Inoue, M. Ueda, Tapasya Patki, D. Ellsworth, B. Rountree, M. Schulz","doi":"10.1109/IPDPS.2017.107","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.107","url":null,"abstract":"Limited power budgets will be one of the biggest challenges for deploying future exascale supercomputers. One of the promising ways to deal with this challenge is hardware overprovisioning, that is, installingmore hardware resources than can be fully powered under a given power limit coupled with software mechanisms to steer the limited power to where it is needed most. Prior research has demonstrated the viability of this approach, but could only rely on small-scale simulations of the software stack. While such research is useful to understand the boundaries of performance benefits that can be achieved, it does not cover any deployment or operational concerns of using overprovisioning on production systems. This paper is the first to present an extensible power-aware resource management framework for production-sized overprovisioned systems based on the widely established SLURM resource manager. Our framework provides flexible plugin interfaces and APIs for power management that can be easily extended to implement site-specific strategies and for comparison of different power management techniques. We demonstrate our framework on a 965-node HA8000 production system at Kyushu University. Our results indicate that it is indeed possible to safely overprovision hardware in production. We also find that the power consumption of idle nodes, which depends on the degree of overprovisioning, can become a bottleneck. Using real-world data, we then draw conclusions about the impact of the total number of nodes provided in an overprovisioned environment.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"174 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132530364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 30

Toucan — A Translator for Communication Tolerant MPI Applications 巨嘴鸟-一个用于通信容忍MPI应用程序的翻译器

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-06-30 DOI: 10.1109/IPDPS.2017.44

Sergio M. Martin, M. Berger, S. Baden

引用次数: 7

Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs 多核和gpu上基于任务的运行时系统快速高效列表调度算法的逼近证明

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-29 DOI: 10.1109/IPDPS.2017.71

Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar

{"title":"Approximation Proofs of a Fast and Efficient List Scheduling Algorithm for Task-Based Runtime Systems on Multicores and GPUs","authors":"Olivier Beaumont, Lionel Eyraud-Dubois, Suraj Kumar","doi":"10.1109/IPDPS.2017.71","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.71","url":null,"abstract":"In High Performance Computing, heterogeneity is now the normwith specialized accelerators like GPUs providing efficientcomputational power. The added complexity has led to the developmentof task-based runtime systems, which allow complex computations to beexpressed as task graphs, and rely on scheduling algorithms to performload balancing between all resources of the platforms. Developing goodscheduling algorithms, even on a single node, and analyzing them canthus have a very high impact on the performance of current HPCsystems. The special case of two types of resources (namely CPUs andGPUs) is of practical interest. HeteroPrio is such an algorithm whichhas been proposed in the context of fast multipole computations, andthen extended to general task graphs with very interesting results. Inthis paper, we provide a theoretical insight on the performance ofHeteroPrio, by proving approximation bounds compared to the optimalschedule in the case where all tasks are independent and for differentplatform sizes. Interestingly, this shows that spoliation allows toprove approximation ratios for a list scheduling algorithm on twounrelated resources, which is not possible otherwise. We also establishthat almost all our bounds are tight. Additionally, we provide anexperimental evaluation of HeteroPrio on real task graphs from denselinear algebra computation, which highlights the reasons explainingits good practical performance.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121826196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation 双对角化和r -双对角化:并行平铺算法、关键路径和分布式内存实现

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-29 DOI: 10.1109/IPDPS.2017.46

Mathieu Faverge, J. Langou, Y. Robert, J. Dongarra

{"title":"Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation","authors":"Mathieu Faverge, J. Langou, Y. Robert, J. Dongarra","doi":"10.1109/IPDPS.2017.46","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.46","url":null,"abstract":"We study tiled algorithms for going from a \"full\" matrix to a condensed \"band bidiagonal\" form using orthog-onal transformations: (i) the tiled bidiagonalization algorithm BIDIAG, which is a tiled version of the standard scalar bidiago-nalization algorithm; and (ii) the R-bidiagonalization algorithm R-BIDIAG, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R- factor. For both BIDIAG and R-BIDIAG, we use four main types of reduction trees, namely FLATTS, FLATTT, GREEDY, and a newly introduced auto-adaptive tree, AUTO. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BIDIAG has a shorter critical path length than BIDIAG for tall and skinny matrices, and (ii) GREEDY based schemes are much better than earlier proposed algorithms with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared- memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"292 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132194454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Dynamic Memory-Aware Task-Tree Scheduling 动态内存感知任务树调度

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-29 DOI: 10.1109/IPDPS.2017.58

G. Aupy, Clement Brasseur, L. Marchal

{"title":"Dynamic Memory-Aware Task-Tree Scheduling","authors":"G. Aupy, Clement Brasseur, L. Marchal","doi":"10.1109/IPDPS.2017.58","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.58","url":null,"abstract":"Factorizing sparse matrices using direct multifrontal methods generates directed tree-shaped task graphs, where edges represent data dependency between tasks. This paper revisits the execution of tree-shaped task graphs using multiple processors that share a bounded memory. A task can only be executed if all its input and output data can fit into the memory. The key difficulty is to manage the order of the task executions so that we can achieve high parallelism while staying below the memory bound. In particular, because input data of unprocessed tasks must be kept in memory, a bad scheduling strategy might compromise the termination of the algorithm. In the single processor case, solutions that are guaranteed to be below a memory bound are known. The multi-processor case (when one tries to minimize the total completion time) has been shown to be NP-complete. We present in this paper a novel heuristic solution that has a low complexity and is guaranteed to complete the tree within a given memory bound.We compare our algorithm to state of the art strategies, and observe that on both actual execution trees and synthetic trees, we always perform better than these solutions, with average speedups between 1.25 and 1.45 on actual assembly trees. Moreover, we show that the overhead of our algorithm is negligible even on deep trees (10 5), and would allow its runtime execution.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125204083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems 集成CPU-GPU系统中带功率上限的协同运行调度

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.124

Qingnhua Zhu, Bo Wu, Xipeng Shen, Li Shen, Zhiying Wang

引用次数: 27

Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory 高带宽多核处理器上的稀疏张量分解

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.84

Shaden Smith, Jongsoo Park, G. Karypis

{"title":"Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory","authors":"Shaden Smith, Jongsoo Park, G. Karypis","doi":"10.1109/IPDPS.2017.84","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.84","url":null,"abstract":"HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands. To address these challenging demands, HPC systems are turning to many-core architectures that feature a large number of energy-efficient cores backed by high-bandwidth memory. These features are exemplified in Intel's recent Knights Landing many-core processor (KNL), which typically has 68 cores and 16GB of on-package multi-channel DRAM (MCDRAM). This work investigates how the novel architectural features offered by KNL can be used in the context of decomposing sparse, unstructured tensors using the canonical polyadic decomposition (CPD). The CPD is used extensively to analyze large multi-way datasets arising in various areas including precision healthcare, cybersecurity, and e-commerce. Towards this end, we (i) develop problem decompositions for the CPD which are amenable to hundreds of concurrent threads while maintaining load balance and low synchronization costs; and (ii) explore the utilization of architectural features such as MCDRAM. Using one KNL processor, our algorithm achieves up to 1.8x speedup over a dual socket Intel Xeon system with 44 cores.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121281108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 37

Leader Election in a Smartphone Peer-to-Peer Network 智能手机点对点网络中的领导者选举

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.11

Calvin C. Newport

{"title":"Leader Election in a Smartphone Peer-to-Peer Network","authors":"Calvin C. Newport","doi":"10.1109/IPDPS.2017.11","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.11","url":null,"abstract":"In this paper, we study the fundamental problem of leader election in the mobile telephone model: a recently introduced variation of the classical telephone model modified to better describe the local peer-to-peercommunication services implemented in many popular smartphone operating systems. In more detail, the mobile telephone model differs from the classical telephone model in three ways: (1) each devicecan participate in at most one connection per round; (2) the network topology can undergo a parameterizedrate of change; and (3) devices can advertise a parameterized number of bits to their neighbors in each round before connection attempts are initiated. We begin by describing and analyzing a new leader election algorithm in this model that works under the harshest possible parameter assumptions: maximum rate of topology changes and no advertising bits. We then apply this result to resolve an open question from [Ghaffari, 2016] on the efficiency of PUSH-PULL rumor spreading under these conditions. We then turn our attention to the slightly easier case where devices can advertise a single bit in each round. We demonstrate a large gap in time complexity between these zero bit and one bit cases. In more detail, we describe and analyze a new algorithm that solves leader election with a time complexitythat includes the parameter bounding topology changes. For all values of this parameter, this algorithm is faster than the previous result, with a gap that grows quickly as the parameter increases (indicating lower rates of change). We conclude by describing and analyzing a modified version of this algorithmthat does not require the assumptionthat all devices start during the same round. This new version has a similar time complexity (the rounds required differ only by a polylogarithmic factor),but now requires slightly larger advertisement tags.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123270272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Automatic-Signal Monitors with Multi-object Synchronization 具有多对象同步的自动信号监视器

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.57

W. Hung, V. Garg

{"title":"Automatic-Signal Monitors with Multi-object Synchronization","authors":"W. Hung, V. Garg","doi":"10.1109/IPDPS.2017.57","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.57","url":null,"abstract":"Current monitor based systems have some disadvantages for multi-object operations. They require the programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory systems eliminate the need for explicit locks, but do not support conditional synchronization. They also require the ability to rollback transactions. In this paper, we propose new monitor based methods that provide automatic signaling for global conditions that span multiple objects. Our system provides automatic notification for global conditions. Assuming that the global condition is a Boolean expression of local predicates, our method allows efficient monitoring of the conditions without any need for global locks. Furthermore, our system solves the monitor composition problem without requiring global locks. We have implemented our constructs on top of Java and have evaluated their overhead. Our results show that on most of the test cases, not only our code is simpler but also faster than Java's reentrant- lock as well as the Deuce transactional memory system.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125354307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1