PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs
Rong Chen, Jiaxin Shi, Yanzhe Chen, B. Zang, Haibing Guan, Haibo Chen
ACM Transactions on Parallel Computing, pages 13:1-13:39, January 23, 2019. DOI: https://doi.org/10.1145/3298989

Abstract: Natural graphs with skewed distributions raise unique challenges for distributed graph computation and partitioning. Existing graph-parallel systems usually use a "one-size-fits-all" design that processes all vertices uniformly: they either suffer from notable load imbalance and high contention on high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that the skewed distributions in natural graphs also necessitate differentiated processing of high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communication and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve the cache locality of inter-node graph accesses, PowerLyra also provides a locality-conscious data layout optimization. PowerLyra is implemented on the latest GraphLab and seamlessly supports various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53x (from 1.24x) on real-world graphs and up to 3.26x (from 1.49x) on synthetic graphs, and is much faster than other systems such as GraphX and Giraph while consuming much less memory. A port of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
Lock Contention Management in Multithreaded MPI
A. Amer, Huiwei Lu, P. Balaji, Milind Chabbi, Yanjie Wei, J. Hammond, S. Matsuoka
ACM Transactions on Parallel Computing, pages 12:1-12:21, January 23, 2019. DOI: https://doi.org/10.1145/3275443

Abstract: In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization used to protect communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes lock acquisitions by the work performed inside the critical section: productive vs. unproductive. Waiting for message reception without doing anything else inside a critical section is an example of an unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.
An Autotuning Protocol to Rapidly Build Autotuners
Junhong Liu, Guangming Tan, Yulong Luo, Jiajia Li, Z. Mo, Ninghui Sun
ACM Transactions on Parallel Computing, pages 9:1-9:25, January 23, 2019. DOI: https://doi.org/10.1145/3291527

Abstract: Automatic performance tuning (autotuning) is an increasingly critical technique for achieving high, portable performance in exascale applications. However, constructing an autotuner from scratch remains a challenge, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. To accommodate existing autotuning techniques, we present an autotuning protocol composed of an extractor, a producer, an optimizer, an evaluator, and a learner. To achieve modularity and reusability, we also define programming interfaces for each protocol component as the fundamental infrastructure, which provides a customizable mechanism for deploying knowledge mining in the performance database. PAK's usability is demonstrated on two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Autotuners built on PAK show performance comparable to traditional autotuners with higher productivity, requiring just a few tens of lines of code written against our autotuning protocol.
Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees
Antonio Fernández, D. Kowalski, Miguel A. Mosteiro, Prudence W. H. Wong
ACM Transactions on Parallel Computing, pages 10:1-10:19, December 8, 2018. DOI: https://doi.org/10.1145/3291529

Abstract: We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, and Internet of Things systems. Generically, we model the architecture as client mobile devices and static base stations. Each client "arrives" at the system, uploads data to base stations by radio transmission, and then "leaves." The problem, called Station Assignment, is to assign clients to stations so that every client uploads its data under given restrictions: a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth per station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited only by a maximum rate and burstiness of arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment has been studied under adversarial arrivals and departures.
{"title":"New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code","authors":"Michel Müller, T. Aoki","doi":"10.1145/3291523","DOIUrl":"https://doi.org/10.1145/3291523","url":null,"abstract":"We introduce “Hybrid Fortran,” a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation—an achievement comparable to more invasive GPGPU rewrites of other weather models.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"31 1","pages":"7:1-7:42"},"PeriodicalIF":1.6,"publicationDate":"2018-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79334909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs","authors":"Jiaquan Gao, Yu Wang, Jun Wang, Ronghua Liang","doi":"10.1145/2990849","DOIUrl":"https://doi.org/10.1145/2990849","url":null,"abstract":"The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computations. GPU-accelerated PCG algorithms for large-sized problems have attracted considerable attention recently. However, on a specific multi-GPU platform, producing a highly parallel PCG implementation for any large-sized problem requires significant time because several manual steps are involved in adjusting the related parameters and selecting an appropriate storage format for the matrix block that is assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which mainly involves the following parts: (1) an optimization multi-GPU parallel framework of PCG and (2) the profile-based optimization modeling for each one of the main components of the PCG algorithm, including vector operation, inner product, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel but automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a specific multi-GPU platform by integrating existing storage formats and kernels. We take a vector operation kernel, an inner-product kernel, and five popular SpMV kernels for an example to present the idea of constructing the model. Given that our model is general, independent of the problems, and dependent on the resources of devices, this model is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"26 1","pages":"16:1-16:33"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82616127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations
Matthieu Dorier, Gabriel Antoniu, F. Cappello, M. Snir, R. Sisneros, Orcun Yildiz, Shadi Ibrahim, T. Peterka, Leigh Orf
ACM Transactions on Parallel Computing, pages 15:1-15:43, December 26, 2016. DOI: https://doi.org/10.1145/2987371

Abstract: With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts overall application performance at scale and its predictability over time.

In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamics simulation on four platforms, including NICS's Kraken and NCSA's Blue Waters. Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, making simulation performance predictable; (2) it increases sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, where state-of-the-art approaches fail to scale; and (4) it enables a seamless connection to the VisIt visualization software for in situ analysis and visualization that impacts neither the performance of the simulation nor its variability.

In addition, we extended our implementation of Damaris to also support dedicated nodes and conducted a thorough comparison of the two approaches, dedicated cores and dedicated nodes, for I/O tasks with the aforementioned applications.
{"title":"Transparently Space Sharing a Multicore Among Multiple Processes","authors":"T. Creech, R. Barua","doi":"10.1145/3001910","DOIUrl":"https://doi.org/10.1145/3001910","url":null,"abstract":"As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation.\u0000 In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions.\u0000 We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average.\u0000 Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"28 1","pages":"17:1-17:35"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88132169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Selecting Multiple Order Statistics with a Graphics Processing Unit
Jeffrey D. Blanchard, Erik Opavsky, Emircan Uysaler
ACM Transactions on Parallel Computing, pages 10:1-10:23, August 8, 2016. DOI: https://doi.org/10.1145/2948974

Abstract: Extracting a set of multiple order statistics from a huge data set provides important information about the distribution of the values in the full set of data. This article introduces an algorithm, bucketMultiSelect, for simultaneously selecting multiple order statistics with a graphics processing unit (GPU). Typically, when a large set of order statistics is desired, the vector is sorted. When the sorted version of the vector is not needed, bucketMultiSelect significantly reduces computation time by eliminating a large portion of the unnecessary operations involved in sorting. For large vectors, bucketMultiSelect returns thousands of order statistics in less time than sorting the vector, while typically using less memory. For vectors containing 2^28 values of type double, bucketMultiSelect selects the 101 percentile order statistics in less than 95ms and is more than 8x faster than sorting the vector with a GPU-optimized merge sort.
Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory
Roshan Dathathri, Ravi Teja Mullapudi, Uday Bondhugula
ACM Transactions on Parallel Computing, pages 12:1-12:28, August 8, 2016. DOI: https://doi.org/10.1145/2948975

Abstract: Current de facto parallel programming models such as OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task-parallel approaches that use point-to-point synchronization between dependent tasks, in conjunction with dynamically scheduling dataflow runtimes, are thus becoming attractive. Although these approaches can extract good performance on both shared and distributed memory, there is little compiler support for them.

In this article, we describe the design of compiler-runtime interaction to automatically extract coarse-grained dataflow parallelism from affine loop nests for both shared- and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and to generate components of the runtime that dynamically schedule the generated tasks. The runtime includes a distributed, decentralized scheduler that dynamically schedules tasks on each node. The schedulers on different nodes cooperate through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6x (70.3x to 474.7x) over the sequential version and a geometric mean speedup of 1.64x (1.04x to 2.42x) over a state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and with an emerging runtime (Intel Concurrent Collections) that demands more programmer input and effort for parallelization. To the best of our knowledge, ours is the first automatic scheme that allows dynamic scheduling of affine loop nests on a cluster of multicores.