Gokarna Sharma, R. Vaidyanathan, J. Trahan, C. Busch, S. Rai
{"title":"O(log N)-Time Complete Visibility for Asynchronous Robots with Lights","authors":"Gokarna Sharma, R. Vaidyanathan, J. Trahan, C. Busch, S. Rai","doi":"10.1109/IPDPS.2017.51","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.51","url":null,"abstract":"We consider the distributed setting of N autonomous mobile robots that operate in Look-Compute-Move (LCM) cycles and communicate with other robots using colored lights (the robots with lights model). We study the fundamental problem of repositioning N autonomous robots on a plane sothat each robot is visible to all others (the Complete Visibility problem) on this model; a robot cannot see another robot if a third robot is positioned between them on the straight line connecting them. There exists an O(1) time, O(1) color algorithm for this problem in the semi-synchronous setting. In this paper, we provide the first O(log N) time, O(1) color algorithm for this problem in the asynchronous setting. This is a significant improvement over an O(N)-time translation of the semi-synchronous algorithm to the asynchronous setting. The proposed algorithm is collision-free - robots do not share positions and their paths do not cross.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129731829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Pagerank Communication via Propagation Blocking","authors":"S. Beamer, K. Asanović, D. Patterson","doi":"10.1109/IPDPS.2017.112","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.112","url":null,"abstract":"Reducing communication is an important objective, as it can save energy or improve the performance of a communication-bound application. The graph algorithm PageRank computes the importance of vertices in a graph, and it serves as an important benchmark for graph algorithm performance. If the input graph to PageRank has poor locality, the execution will need to read many cache lines from memory, some of which may not be fully utilized. We present propagation blocking, an optimization to improve spatial locality, and we demonstrate its application to PageRank. In contrast to cache blocking which partitions the graph, we partition the data transfers between vertices (propagations). If the input graph has poor locality, our approach will reduce communication. Our approach reduces communication more than conventional cache blocking if the input graph is sufficiently sparse or if number of vertices is sufficiently large relative to the cache size. To evaluate our approach, we use both simple analytic models to gain insights and precise hardware performance counter measurements to compare implementations on a suite of 8 real-world and synthetic graphs. We demonstrate our parallel implementations substantially outperform prior work in execution time and communication volume. Although we present results for PageRank, propagation blocking could be generalized to SpMV (sparse matrix multiplying dense vector) or other graph programming models.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125550010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monitoring Properties of Large, Distributed, Dynamic Graphs","authors":"Gal Yehuda, D. Keren, Islam Akaria","doi":"10.1109/IPDPS.2017.123","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.123","url":null,"abstract":"The following is a very common question in numerous theoretical and application-related domains: given a graph G, does it satisfy some given property? For example, is G connected? Is its diameter smaller than a given threshold? Is its average degree larger than a certain threshold? Traditionally, algorithms to quickly answer such questions were developed for static and centralized graphs (i.e. G is stored in a central server and the list of its vertices and edges is static and quickly accessible). Later, as dictated by practical considerations, a great deal of attention was given to on-line algorithms for dynamic graphs (where vertices and edges can be added and deleted); the focus of research was to quickly decide whether the new graph still satisfies the given property. Today, a more difficult version of this problem, referred to as the distributed monitoring problem, is becoming increasingly important: large graphs are not only dynamic, but also distributed, that is, G is partitioned between a few servers, none of which \"sees\" G in its entirety. The question is how to define local conditions, such that as long as they hold on the local graphs, it is guaranteed that the desired property holds for the global G. Such local conditions are crucial for avoiding a huge communication overhead. While defining local conditions for linear properties (e.g. average degree) is relatively easy, they are considerably more difficult to derive for non-linear functions over graphs. We propose a solution and a general definition of solution optimality, and demonstrate how to apply it to two important graph properties – the spectral gap and the number of triangles. We also define an absolute lower bound on the communication overhead for distributed monitoring, and compare our algorithm to it, with excellent results. Last but not least, performance improves as the graph becomes larger and denser – that is, when distributing it is more important.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126540236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Lock-Free Vector with Combining","authors":"Ivan Walulya, P. Tsigas","doi":"10.1109/IPDPS.2017.73","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.73","url":null,"abstract":"Dynamic vectors are among the most commonly used data structures in programming. They provide constant time random access and resizable data storage. Additionally, they provide constant time insertion (pushback) and deletion (popback) at the end of the sequence. However, in a multithreaded system, concurrent pushback and popback operations attempt to update the same shared object, creating a synchronization bottleneck. In this paper, we present a lock-free vector design that efficiently addresses the synchronization bottlenecks by utilizing a combining technique on pushback operations. Typical combining techniques come with the price of blocking. Our design introduces combining without sacrificing lock-freedom. We evaluate the performance of our design on a dual socket NUMA Intel server. The results show that our design performs comparably at low loads, and out-performs prior concurrent blocking and non-blocking vector implementations at high contention, by as much as 2.7x.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134517680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan
{"title":"Accelerating Graph and Machine Learning Workloads Using a Shared Memory Multicore Architecture with Auxiliary Support for In-hardware Explicit Messaging","authors":"H. Dogan, Farrukh Hijaz, Masab Ahmad, B. Kahne, Peter Wilson, O. Khan","doi":"10.1109/IPDPS.2017.116","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.116","url":null,"abstract":"Shared Memory stands out as a sine qua non for parallel programming of many commercial and emerging multicore processors. It optimizes patterns of communication that benefit common programming styles. As parallel programming is now mainstream, those common programming styles are challenged with emerging applications that communicate often and involve large amount of data. Such applications include graph analytics and machine learning, and this paper focuses on these domains. We retain the shared memory model and introduce a set of lightweight in-hardware explicit messaging instructions in the instruction set architecture (ISA). A set of auxiliary communication models are proposed that utilize explicit messages to accelerate synchronization primitives, and efficiently move computation towards data. The results on a 256-core simulated multicore demonstrate that the proposed communication models improve performance and dynamic energy by an average of 4x and 42% respectively over traditional shared memory.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133880400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RCube: A Power Efficient and Highly Available Network for Data Centers","authors":"Zhenhua Li, Yuanyuan Yang","doi":"10.1109/IPDPS.2017.50","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.50","url":null,"abstract":"Designing a cost-effective network for data centers that can deliver sufficient bandwidth and provide high availability has drawn tremendous attentions recently. In this paper, we propose a novel server-centric network structure called RCube, which is energy efficient and can deploy a redundancy scheme to improve the availability of data centers. Moreover, RCube shares many good properties with BCube, a well known server-centric network structure, yet its network size can be adjusted more conveniently. We also present a routing algorithm to find paths in RCube and an algorithm to build multiple parallel paths between any pair of source and destination servers. In addition, we theoretically analyze the power efficiency of the network and availability of RCube under server failure. Our comprehensive simulations demonstrate that RCube provides higher availability and flexibility to make trade-off among many factors, such as power consumption and aggregate throughput, than BCube, while delivering similar performance to BCube in many critical metrics, such as average path length, path distribution and graceful degradation, which makes RCube a very promising empirical structure for an enterprise data center network product.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131880980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Highly scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor","authors":"M. Jacquelin, W. A. Jong, E. Bylaska","doi":"10.1109/IPDPS.2017.26","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.26","url":null,"abstract":"The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrodinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange Multiplier and non-local pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multipliers kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131241173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, Hyesoon Kim
{"title":"SimProf: A Sampling Framework for Data Analytic Workloads","authors":"Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, Hyesoon Kim","doi":"10.1109/IPDPS.2017.118","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.118","url":null,"abstract":"Today, there is a steep rise in the amount of data being collected from diverse applications. Consequently, data analytic workloads are gaining popularity to gain insight that can benefit the application, e.g., financial trading, social media analysis. To study the architectural behavior of the workloads, architectural simulation is one of the most common approaches. However, because of the long-running nature of the workloads, it is not trivial to identify which parts of the analysis to simulate. In the current work, we introduce SimProf, a sampling framework for data analytic workloads. Using this tool, we are able to select representative simulation points based on the phase behavior of the analysis at a method level granularity. This provides a better understanding of the simulation point and also reduces the simulation time for different input sets. We present the framework for Apache Hadoop and Apache Spark frameworks, which can be easily extended to other data analytic workloads.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121549719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Altisen, A. Datta, Stéphane Devismes, Anaïs Durand, L. Larmore
{"title":"Leader Election in Asymmetric Labeled Unidirectional Rings","authors":"K. Altisen, A. Datta, Stéphane Devismes, Anaïs Durand, L. Larmore","doi":"10.1109/IPDPS.2017.23","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.23","url":null,"abstract":"We study (deterministic) leader election in unidirectional rings of homonym processes that have no a priori knowledge on the number of processes. In this context, we show that there is no algorithm that solves process-terminating leader election for the class of asymmetric labeled rings. In particular, there is no process-terminating leader election algorithm in rings in which at least one label is unique. However, we show that process-terminating leader election is possible for the subclass of asymmetric rings, where multiplicity is bounded. We confirm this positive results by proposing two algorithms, which achieve the classical trade-off between time and space.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"331 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115876149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Josep M. Pérez, Vicencc Beltran, Jesús Labarta, E. Ayguadé
{"title":"Improving the Integration of Task Nesting and Dependencies in OpenMP","authors":"Josep M. Pérez, Vicencc Beltran, Jesús Labarta, E. Ayguadé","doi":"10.1109/IPDPS.2017.69","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.69","url":null,"abstract":"The tasking model of OpenMP 4.0 supports both nesting and the definition of dependences between sibling tasks. A natural way to parallelize many codes with tasks is to first taskify the high-level functions and then to further refine these tasks with additional subtasks. However, this top-down approach has some drawbacks since combining nesting with dependencies usually requires additional measures to enforce the correct coordination of dependencies across nesting levels. For instance, most non-leaf tasks need to include a taskwait at the end of their code. While these measures enforce the correct order of execution, as a side effect, they also limit the discovery of parallelism. In this paper we extend the OpenMP tasking model to improve the integration of nesting and dependencies. Our proposal builds on both formulas, nesting and dependencies, and benefits from their individual strengths. On one hand, it encourages a top-down approach to parallelizing codes that also enables the parallel instantiation of tasks. On the other hand, it allows the runtime to control dependencies at a fine grain that until now was only possible using a single domain of dependencies. Our proposal is realized through additions to the OpenMP task directive that ensure backward compatibility with current codes. We have implemented a new runtime with these extensions and used it to evaluate the impact on several benchmarks. Our initial findings show that our extensions improve performance in three areas. First, they expose more parallelism. Second, they uncover dependencies across nesting levels, which allows the runtime to make better scheduling decisions. And third, they allow the parallel instantiation of tasks with dependencies between them.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121655390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}