2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)最新文献_第8页

O(log N)-Time Complete Visibility for Asynchronous Robots with Lights 带灯异步机器人的O(log N)时间完全可见性

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.51

Gokarna Sharma, R. Vaidyanathan, J. Trahan, C. Busch, S. Rai

引用次数: 25

Reducing Pagerank Communication via Propagation Blocking 通过传播阻塞减少网页排名通信

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.112

S. Beamer, K. Asanović, D. Patterson

{"title":"Reducing Pagerank Communication via Propagation Blocking","authors":"S. Beamer, K. Asanović, D. Patterson","doi":"10.1109/IPDPS.2017.112","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.112","url":null,"abstract":"Reducing communication is an important objective, as it can save energy or improve the performance of a communication-bound application. The graph algorithm PageRank computes the importance of vertices in a graph, and it serves as an important benchmark for graph algorithm performance. If the input graph to PageRank has poor locality, the execution will need to read many cache lines from memory, some of which may not be fully utilized. We present propagation blocking, an optimization to improve spatial locality, and we demonstrate its application to PageRank. In contrast to cache blocking which partitions the graph, we partition the data transfers between vertices (propagations). If the input graph has poor locality, our approach will reduce communication. Our approach reduces communication more than conventional cache blocking if the input graph is sufficiently sparse or if number of vertices is sufficiently large relative to the cache size. To evaluate our approach, we use both simple analytic models to gain insights and precise hardware performance counter measurements to compare implementations on a suite of 8 real-world and synthetic graphs. We demonstrate our parallel implementations substantially outperform prior work in execution time and communication volume. Although we present results for PageRank, propagation blocking could be generalized to SpMV (sparse matrix multiplying dense vector) or other graph programming models.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125550010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 65

Monitoring Properties of Large, Distributed, Dynamic Graphs 大型、分布式、动态图的监控属性

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.123

Gal Yehuda, D. Keren, Islam Akaria

{"title":"Monitoring Properties of Large, Distributed, Dynamic Graphs","authors":"Gal Yehuda, D. Keren, Islam Akaria","doi":"10.1109/IPDPS.2017.123","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.123","url":null,"abstract":"The following is a very common question in numerous theoretical and application-related domains: given a graph G, does it satisfy some given property? For example, is G connected? Is its diameter smaller than a given threshold? Is its average degree larger than a certain threshold? Traditionally, algorithms to quickly answer such questions were developed for static and centralized graphs (i.e. G is stored in a central server and the list of its vertices and edges is static and quickly accessible). Later, as dictated by practical considerations, a great deal of attention was given to on-line algorithms for dynamic graphs (where vertices and edges can be added and deleted); the focus of research was to quickly decide whether the new graph still satisfies the given property. Today, a more difficult version of this problem, referred to as the distributed monitoring problem, is becoming increasingly important: large graphs are not only dynamic, but also distributed, that is, G is partitioned between a few servers, none of which \"sees\" G in its entirety. The question is how to define local conditions, such that as long as they hold on the local graphs, it is guaranteed that the desired property holds for the global G. Such local conditions are crucial for avoiding a huge communication overhead. While defining local conditions for linear properties (e.g. average degree) is relatively easy, they are considerably more difficult to derive for non-linear functions over graphs. We propose a solution and a general definition of solution optimality, and demonstrate how to apply it to two important graph properties – the spectral gap and the number of triangles. We also define an absolute lower bound on the communication overhead for distributed monitoring, and compare our algorithm to it, with excellent results. Last but not least, performance improves as the graph becomes larger and denser – that is, when distributing it is more important.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126540236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Scalable Lock-Free Vector with Combining 具有组合的可伸缩无锁矢量

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.73

Ivan Walulya, P. Tsigas

引用次数: 6

Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging 使用共享内存多核架构加速图形和机器学习工作负载，并辅助支持硬件内显式消息传递

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.116

H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan

{"title":"Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging","authors":"H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan","doi":"10.1109/IPDPS.2017.116","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.116","url":null,"abstract":"Shared Memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged with emerging applications that communicate often and involve large amount of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). A set of auxiliary communication models are proposed that utilize explicit messages to accelerate synchronization primitives, and efficiently move computation towards data. The results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4x and 42% respectively over traditional shared memory.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133880400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21

RCube: A Power Efficient and Highly Available Network for Data Centers RCube:高效、高可用的数据中心网络

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.50

Zhenhua Li, Yuanyuan Yang

{"title":"RCube: A Power Efficient and Highly Available Network for Data Centers","authors":"Zhenhua Li, Yuanyuan Yang","doi":"10.1109/IPDPS.2017.50","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.50","url":null,"abstract":"Designing a cost-effective network for data centers that can deliver sufficient bandwidth and provide high availability has drawn tremendous attentions recently. In this paper, we propose a novel server-centric network structure called RCube, which is energy efficient and can deploy a redundancy scheme to improve the availability of data centers. Moreover, RCube shares many good properties with BCube, a well known server-centric network structure, yet its network size can be adjusted more conveniently. We also present a routing algorithm to find paths in RCube and an algorithm to build multiple parallel paths between any pair of source and destination servers. In addition, we theoretically analyze the power efficiency of the network and availability of RCube under server failure. Our comprehensive simulations demonstrate that RCube provides higher availability and flexibility to make trade-off among many factors, such as power consumption and aggregate throughput, than BCube, while delivering similar performance to BCube in many critical metrics, such as average path length, path distribution and graceful degradation, which makes RCube a very promising empirical structure for an enterprise data center network product.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131880980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Towards Highly scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor 在Intel Knights Landing多核处理器上实现高度可扩展的从头算分子动力学(AIMD)模拟

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.26

M. Jacquelin, W. A. Jong, E. Bylaska

{"title":"Towards Highly scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor","authors":"M. Jacquelin, W. A. Jong, E. Bylaska","doi":"10.1109/IPDPS.2017.26","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.26","url":null,"abstract":"The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrodinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange Multiplier and non-local pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multipliers kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131241173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 14

DEFT-Cache: A Cost-Effective and Highly Reliable SSD Cache for RAID Storage DEFT-Cache:一种高性价比、高可靠性的RAID存储SSD缓存

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.54

Ji-guang Wan, Wei Wu, Ling Zhan, Q. Yang, Xiaoyang Qu, C. Xie

{"title":"DEFT-Cache: A Cost-Effective and Highly Reliable SSD Cache for RAID Storage","authors":"Ji-guang Wan, Wei Wu, Ling Zhan, Q. Yang, Xiaoyang Qu, C. Xie","doi":"10.1109/IPDPS.2017.54","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.54","url":null,"abstract":"This paper proposes a new SSD cache architecture, DEFT-cache, Delayed Erasing and Fast Taping, that maximizes I/O performance and reliability of RAID storage. First of all, DEFT-Cache exploits the inherent physical properties of flash memory SSD by making use of old data that have been overwritten but still in existence in SSD to minimize small write penalty of RAID5/6. As data pages being overwritten in SSD, old data pages are invalidated and become candidates for erasure and garbage collections. Our idea is to selectively delay the erasure of the pages and let these otherwise useless old data in SSD contribute to I/O performance for parity computations upon write I/Os. Secondly, DEFT-Cache provides inexpensive redundancy to the SSD cache by having one physical SSD and one virtual SSD as a mirror cache. The virtual SSD is implemented on HDD but using log-structured data layout, i.e. write data are quickly logged to HDD using sequential write. The dual and redundant caches provide a cost-effective and highly reliable write-back SSD cache. We have implemented DEFT-Cache on Linux system. Extensive experiments have been carried out to evaluate the potential benefits of our new techniques. Experimental results on SPC and Microsoft traces have shown that DEFT-Cache improves I/O performance by 26.81% to 56.26% in terms of average user response time. The virtual SSD mirror cache can absorb write I/Os as fast as physical SSD providing the same reliability as two physical SSD caches without noticeable performance loss.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126625551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem 后缀树的并行构造与全最近邻小值问题

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.62

P. Flick, S. Aluru

{"title":"Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem","authors":"P. Flick, S. Aluru","doi":"10.1109/IPDPS.2017.62","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.62","url":null,"abstract":"A Suffix tree is a fundamental and versatile string data structure that is frequently used in important application areas such as text processing, information retrieval, and computational biology. Sequentially, the construction of suffix trees takes linear time, and optimal parallel algorithms exist only for the PRAM model. Recent works mostly target low core-count shared-memory implementations but achieve suboptimal complexity, and prior distributed-memory parallel algorithms have quadratic worst-case complexity. Suffix trees can be constructed from suffix and longest common prefix (LCP) arrays by solving the All-Nearest-Smaller-Values(ANSV) problem. In this paper, we formulate a more generalized version of the ANSV problem, and present a distributed-memory parallel algorithm for solving it in O(n/p +p) time. Our algorithm minimizes the overall and per-node communication volume. Building on this, we present a parallel algorithm for constructing a distributed representation of suffix trees, yielding both superior theoretical complexity and better practical performance compared to previous distributed-memory algorithms. We demonstrate the construction of the suffix tree for the human genome given its suffix and LCP arrays in under 2 seconds on 1024 Intel Xeon cores.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116226446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors 大规模并行SIMT处理器上高性能消息传递的松弛

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.94

Benjamin Klenk, H. Fröning, H. Eberle, Larry R. Dennison

{"title":"Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors","authors":"Benjamin Klenk, H. Fröning, H. Eberle, Larry R. Dennison","doi":"10.1109/IPDPS.2017.94","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.94","url":null,"abstract":"Accelerators, such as GPUs, have proven to be highly successful in reducing execution time and power consumption of compute-intensive applications. Even though they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers as CPUs are handling all communication tasks. However, we observe that accelerators are recently being augmented with peer-to-peer communication capabilities that allow for autonomous traffic sourcing and sinking. While appropriate hardware support is becoming available, it seems that the right communication semantics are yet to be identified. Maintaining the semantics of existing communication models, such as the Message Passing Interface (MPI), seems problematic as they have been designed for the CPU’s execution model, which inherently differs from such specialized processors. In this paper, we analyze the compatibility of traditional message passing with massively parallel Single Instruction Multiple Thread (SIMT) architectures, as represented by GPUs, and focus on the message matching problem. We begin with a fully MPI-compliant set of guarantees, including tag and source wildcards and message ordering. Based on an analysis of exascale proxy applications, we start relaxing these guarantees to adapt message passing to the GPU’s execution model. We present suitable algorithms for message matching on GPUs that can yield matching rates of 60M and 500M matches/s, depending on the constraints that are being relaxed. We discuss our experiments and create an understanding of the mismatch of current message passing protocols and the architecture and execution model of SIMT processors.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125567458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 21