{"title":"Shedding Light on Lithium/Air Batteries Using Millions of Threads on the BG/Q Supercomputer","authors":"V. Weber, C. Bekas, T. Laino, A. Curioni, A. Bertsch, S. Futral","doi":"10.1109/IPDPS.2014.81","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.81","url":null,"abstract":"In this work, we present a novel parallelization scheme for a highly efficient evaluation of the Hartree-Fock exact exchange (HFX) in ab initio molecular dynamics simulations, specifically tailored for condensed phase simulations. Our developments allow one to achieve the necessary accuracy for the evaluation of the HFX in a highly controllable manner. We show here that our solutions can take great advantage of the latest trends in HPC platforms, such as extreme threading, short vector instructions and highly dimensional interconnection networks. Indeed, all these trends are evident in the IBM Blue Gene/Q supercomputer. We demonstrate an unprecedented scalability up to 6,291,456 threads (96 BG/Q racks) with a near perfect parallel efficiency, which represents a more than 20-fold improvement as compared to the current state of the art. In terms of reduction of time to solution, we achieved an improvement that can surpass a 10-fold decrease in runtime with respect to directly comparable approaches. We exploit this development to enhance the accuracy of DFT based molecular dynamics by using the PBE0 hybrid functional. This approach allowed us to investigate the chemical behavior of organic solvents in one of the most challenging research topics in energy storage, lithium/air batteries, and to propose alternative solvents with enhanced stability to ensure an appropriate reversible electrochemical reaction. This step is key for the development of a viable lithium/air storage technology, which would have been a daunting computational task using standard methods. 
Recent research has shown that the electrolyte plays a key role in non-aqueous lithium/air batteries in producing the appropriate reversible electrochemical reduction. In particular, the chemical degradation of propylene carbonate, the typical electrolyte used, by lithium peroxide has been demonstrated by molecular dynamics simulations of highly realistic models. Reaching the necessary high accuracy in these simulations is a daunting computational task using standard methods.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130792369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and Optimization of Memory-Resident MapReduce on HPC Systems","authors":"Yandong Wang, R. Goldstone, Weikuan Yu, Teng Wang","doi":"10.1109/IPDPS.2014.87","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.87","url":null,"abstract":"MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution to process gigantic simulation results. However, conventional high-end computing systems are constructed based on the compute-centric paradigm while big data analytics applications prefer a data-centric paradigm such as MapReduce. This work characterizes the performance impact of key differences between compute- and data-centric paradigms and then provides optimizations to enable a dual-purpose HPC system that can efficiently support conventional HPC applications and new data analytics applications. Using a state-of-the-art MapReduce implementation, Spark, and the Hyperion system at Lawrence Livermore National Laboratory, we have examined the impact of storage architectures, data locality and task scheduling on memory-resident MapReduce jobs. 
Based on our characterization and findings of the performance behaviors, we have introduced two optimization techniques, namely Enhanced Load Balancer and Congestion-Aware Task Dispatching, to improve the performance of Spark applications.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131553743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Anatomy of High-Performance Many-Threaded Matrix Multiplication","authors":"T. Smith, R. Geijn, M. Smelyanskiy, J. Hammond, F. V. Zee","doi":"10.1109/IPDPS.2014.110","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.110","url":null,"abstract":"BLIS is a new framework for rapid instantiation of the BLAS. We describe how BLIS extends the \"GotoBLAS approach\" to implementing matrix multiplication (GEMM). While GEMM was previously implemented as three loops around an inner kernel, BLIS exposes two additional loops within that inner kernel, casting the computation in terms of the BLIS micro-kernel so that porting GEMM becomes a matter of customizing this micro-kernel for a given architecture. We discuss how this facilitates a finer level of parallelism that greatly simplifies the multithreading of GEMM as well as additional opportunities for parallelizing multiple loops. Specifically, we show that with the advent of many-core architectures such as the IBM PowerPC A2 processor (used by Blue Gene/Q) and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability. 
The resulting implementations deliver what we believe to be the best open source performance for these architectures, achieving both impressive performance and excellent scalability.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134081067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Single Source Shortest Path Algorithms for Massively Parallel Systems","authors":"Venkatesan T. Chakaravarthy, Fabio Checconi, F. Petrini, Yogish Sabharwal","doi":"10.1109/IPDPS.2014.96","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.96","url":null,"abstract":"In the single-source shortest path (SSSP) problem, we have to find the shortest paths from a source vertex v to all other vertices in a graph. In this paper, we introduce a novel parallel algorithm, derived from the Bellman-Ford and Delta-stepping algorithms. We employ various pruning techniques, such as edge classification and direction-optimization, to dramatically reduce inter-node communication traffic, and we propose load balancing strategies to handle higher-degree vertices. The extensive performance analysis shows that our algorithms work well on scale-free and real-world graphs. In the largest tested configuration, an R-MAT graph with 2^38 vertices and 2^42 edges on 32,768 Blue Gene/Q nodes, we have achieved a processing rate of three Trillion Edges Per Second (TTEPS), a four orders of magnitude improvement over the best published results.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131459293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect","authors":"K. Ibrahim, Paul H. Hargrove, Costin Iancu, K. Yelick","doi":"10.1109/IPDPS.2014.116","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.116","url":null,"abstract":"The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism, 2) message ordering constraints, 3) per node and per core message concurrency. In addition to using Cray native communication APIs, we use UPC and MPI micro-benchmarks to capture one- and two-sided semantics respectively. Our results indicate that relaxing the message delivery order can improve performance up to 4.6x when compared with strict ordering. When hardware allows it, high-level one-sided programming models can already take advantage of message reordering. Enforcing the ordering semantics of two-sided communication comes with a performance penalty. Furthermore, we argue that exposing out-of-order delivery at the application level is required for the next-generation programming models. 
Any ordering constraints in the language specifications reduce communication performance for small messages and increase the number of active cores required for peak throughput.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131365636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fair Maximal Independent Sets","authors":"Jeremy T. Fineman, Calvin C. Newport, M. Sherr, Tonghe Wang","doi":"10.1109/IPDPS.2014.79","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.79","url":null,"abstract":"Finding a maximal independent set (MIS) is a classic problem in graph theory that has been widely studied in the context of distributed algorithms. Standard distributed solutions to the MIS problem focus on time complexity. In this paper, we also consider fairness. For a given MIS algorithm A and graph G, we define the inequality factor for A on G to be the largest ratio between the probabilities of the nodes joining an MIS in the graph. We say an algorithm is fair with respect to a family of graphs if it achieves a constant inequality factor for all graphs in the family. In this paper, we seek efficient and fair algorithms for common graph families. We begin by describing an algorithm that is fair and runs in O(log* n) time in rooted trees of size n. Moving to unrooted trees, we describe a fair algorithm that runs in O(log n) time. Generalizing further to bipartite graphs, we describe a third fair algorithm that requires O(log^2 n) rounds. We also show a fair algorithm for planar graphs that runs in O(log^2 n) rounds, and describe an algorithm that can be run in any graph, yielding good bounds on inequality in regions that can be efficiently colored with a small number of colors. We conclude our theoretical analysis with a lower bound that identifies a graph where all MIS algorithms achieve an inequality bound in Ω(n), eliminating the possibility of an MIS algorithm that is fair in all graphs. Finally, to motivate the need for provable fairness guarantees, we simulate both our tree algorithm and Luby's MIS algorithm [13] in a variety of different tree topologies, some synthetic and some derived from real world data. 
Whereas our algorithm always yields an inequality factor ≤3.25 in these simulations, Luby's algorithm yields factors as large as 168.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121171701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Coprocessor Sharing-Aware Scheduler for Xeon Phi-Based Compute Clusters","authors":"G. Coviello, S. Cadambi, S. Chakradhar","doi":"10.1109/IPDPS.2014.44","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.44","url":null,"abstract":"We propose a cluster scheduling technique for compute clusters with Xeon Phi coprocessors. Even though the Xeon Phi runs Linux which allows multiprocessing, cluster schedulers generally do not allow jobs to share coprocessors because sharing can cause oversubscription of coprocessor memory and thread resources. It has been shown that memory or thread oversubscription on a many-core processor like the Phi results in job crashes or drastic performance loss. We first show that such an exclusive device allocation policy causes severe coprocessor underutilization: for typical workloads, on average only 38% of the Xeon Phi cores are busy across the cluster. Then, to improve coprocessor utilization, we propose a scheduling technique that enables safe coprocessor sharing without resource oversubscription. Jobs specify their maximum memory and thread requirements, and our scheduler packs as many jobs as possible on each coprocessor in the cluster, subject to resource limits. We solve this problem using a greedy approach at the cluster level combined with a knapsack-based algorithm for each node. Every coprocessor is modeled as a knapsack and jobs are packed into each knapsack with the goal of maximizing job concurrency, i.e., as many jobs as possible executing on each coprocessor. Given a set of jobs, we show that this strategy of packing for high concurrency is a good proxy for (i) reducing makespan, without the need for users to specify job execution times and (ii) reducing coprocessor footprint, or the number of coprocessors required to finish the jobs without increasing makespan. 
We implement the entire system as a seamless add-on to Condor, a popular distributed job scheduler, and show makespan and footprint reductions of more than 50% across a wide range of workloads.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122016194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"It's About Time: On Optimal Virtual Network Embeddings under Temporal Flexibilities","authors":"Matthias Rost, S. Schmid, A. Feldmann","doi":"10.1109/IPDPS.2014.14","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.14","url":null,"abstract":"Distributed applications often require high-performance networks with strict connectivity guarantees. For instance, many cloud applications suffer from today's variations of the intra-cloud bandwidth, which leads to poor and unpredictable application performance. Accordingly, we witness a trend towards virtual networks (VNets) which can provide resource isolation. Interestingly, while the problem of where to embed a VNet is fairly well-understood today, much less is known about when to optimally allocate a VNet. This however is important, as the requirements specified for a VNet do not have to be static, but can vary over time and even include certain temporal flexibilities. This paper initiates the study of the temporal VNet embedding problem (TVNEP). We propose a continuous-time mathematical programming approach to solve the TVNEP, and present and compare different algorithms. Based on these insights, we present the CSM-Model which incorporates both symmetry and state-space reductions to significantly speed up the process of computing exact solutions to the TVNEP. Based on the CSM-Model, we derive a greedy algorithm OGA to compute fast approximate solutions. In an extensive computational evaluation, we show that despite the hardness of the TVNEP, the CSM-Model is sufficiently powerful to solve moderately sized instances to optimality within one hour and under different objective functions (such as maximizing the number of embeddable VNets). We also show that the greedy algorithm exploits flexibilities well and yields good solutions. 
More generally, our results suggest that even small temporal flexibilities can significantly improve overall system performance.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123472117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment","authors":"A. Haidar, Chongxiao Cao, A. YarKhan, P. Luszczek, S. Tomov, K. Kabir, J. Dongarra","doi":"10.1109/IPDPS.2014.58","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.58","url":null,"abstract":"Many of the heterogeneous resources available to modern computers are designed for different workloads. In order to efficiently use GPU resources, the workload must have a greater degree of parallelism than a workload designed for multicore-CPUs. And conceptually, the Intel Xeon Phi coprocessors are capable of handling workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore-CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we are using a lightweight runtime environment to manage the resource-specific workload, and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. 
We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126239849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overcoming the Scalability Challenges of Epidemic Simulations on Blue Waters","authors":"Jae-Seung Yeom, A. Bhatele, K. Bisset, Eric J. Bohm, Abhishek K. Gupta, L. Kalé, M. Marathe, Dimitrios S. Nikolopoulos, M. Schulz, Lukasz Wesolowski","doi":"10.1109/IPDPS.2014.83","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.83","url":null,"abstract":"Modeling dynamical systems represents an important application class covering a wide range of disciplines including but not limited to biology, chemistry, finance, national security, and health care. Such applications typically involve large-scale, irregular graph processing, which makes them difficult to scale due to the evolutionary nature of their workload, irregular communication and load imbalance. EpiSimdemics is such an application simulating epidemic diffusion in extremely large and realistic social contact networks. It implements a graph-based system that captures dynamics among co-evolving entities. This paper presents an implementation of EpiSimdemics in Charm++ that enables future research by social, biological and computational scientists at unprecedented data and system scales. We present new methods for application-specific processing of graph data and demonstrate the effectiveness of these methods on a Cray XE6, specifically NCSA's Blue Waters system.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}