ACM Transactions on Parallel Computing最新文献_第2页

Non-Clairvoyant Scheduling with Predictions 具有预测的非偷窥调度

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-05-02 DOI: 10.1145/3593969

Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit

引用次数: 0

Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks 对抗和随机匿名动态网络中更快的监督平均一致性

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-04-24 DOI: 10.1145/3593426

Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro

{"title":"Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks","authors":"Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro","doi":"10.1145/3593426","DOIUrl":"https://doi.org/10.1145/3593426","url":null,"abstract":"How do we reach consensus on an average value in a dynamic crowd without revealing identity? In this work, we study the problem of average network consensus in Anonymous Dynamic Networks (ADN). Network dynamicity is specified by the sequence of topology-graph isoperimetric numbers occurring over time, which we call the isoperimetric dynamicity of the network. The consensus variable is the average of values initially held by nodes, which is customary in the network-consensus literature. Given that having an algorithm to compute the average one can compute the network size (i.e., the counting problem) and vice versa, we focus on the latter. We present a deterministic distributed average network consensus algorithm for ADNs that we call isoperimetric Scalable Coordinated Anonymous Local Aggregation, and we analyze its performance for different scenarios, including worst-case (adversarial) and stochastic dynamic topologies. Our solution utilizes supervisor nodes, which have been shown to be necessary for computations in ADNs. The algorithm uses the isoperimetric dynamicity of the network as an input, meaning that only the isoperimetric number parameters (or their lower bound) must be given, but topologies may occur arbitrarily or stochastically as long as they comply with those parameters. Previous work for adversarial ADNs overestimates the running time to deal with worst-case scenarios. For ADNs with given isoperimetric dynamicity, our analysis shows improved performance for some practical dynamic topologies, with cubic time or better for stochastic ADNs, and our experimental evaluation indicates that our theoretical bounds could not be substantially improved for some models of dynamic networks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 35"},"PeriodicalIF":1.6,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48744359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems 求解大型图优化问题的分布式gpu深度强化学习系统

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-03-23 DOI: 10.1145/3589188

Weijian Zheng, Dali Wang, Fengguang Song

{"title":"A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems","authors":"Weijian Zheng, Dali Wang, Fengguang Song","doi":"10.1145/3589188","DOIUrl":"https://doi.org/10.1145/3589188","url":null,"abstract":"Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques—replay buffer on-the-fly graph generation and adaptive multiple-node selection—to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47081538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

POETS: An Event-driven Approach to Dissipative Particle Dynamics POETS:耗散粒子动力学的事件驱动方法

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-20 DOI: 10.1145/3580372

Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming

{"title":"POETS: An Event-driven Approach to Dissipative Particle Dynamics","authors":"Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming","doi":"10.1145/3580372","DOIUrl":"https://doi.org/10.1145/3580372","url":null,"abstract":"HPC clusters have become ever more expensive, both in terms of capital cost and energy consumption; some estimates suggest that competitive installations at the end of the next decade will require their own power station. One way around this looming problem is to design bespoke computing engines, but while the performance benefits are good, the design costs are huge and cannot easily be amortized. Partially Ordered Event Triggered System (POETS)—the focus of this article—seeks to exploit a middle way: The architecture is tuned to a specific algorithmic pattern but, within that constraint, is fully programmable. POETS software is quasi-imperative: The user defines a set of sequential event handlers, defines the topology of a (typically large) concurrent ensemble of these, and lets them interact. The “solution” may be exfiltrated from the emergent behaviour of the ensemble. In this article, we describe (briefly) the architecture, and an example computational chemistry application, dissipative particle dynamics (DPD). The DPD algorithm is traditionally implemented using parallel computational techniques, but we re-cast it as a concurrent compute problem that is then ideally suited to POETS. Our prototype system is realised on a cluster of 48 FPGAs providing 50K concurrent hardware threads, and we report performance speedups of over two orders of magnitude better than a single thread baseline comparator and scaling behaviour that is almost constant. The results are validated against a “conventional” implementation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 32"},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44594257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

MCSH, a Lock with the Standard Interface MCSH，具有标准接口的锁

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-20 DOI: 10.1145/3584696

W. Hesselink, P. Buhr

{"title":"MCSH, a Lock with the Standard Interface","authors":"W. Hesselink, P. Buhr","doi":"10.1145/3584696","DOIUrl":"https://doi.org/10.1145/3584696","url":null,"abstract":"The MCS lock of Mellor-Crummey and Scott (1991), 23 pages. is a very efficient first-come first-served mutual-exclusion algorithm that uses the atomic hardware primitives fetch-and-store and compare-and-swap. However, it has the disadvantage that the calling thread must provide a pointer to an allocated record. This additional parameter violates the standard locking interface, which has only the lock as a parameter. Hence, it is impossible to switch to MCS without editing and recompiling an application that uses locks. This article provides a variation of MCS with the standard interface, which remains FCFS, called MCSH. One key ingredient is to stack allocate the necessary record in the acquire procedure of the lock, so its life-time only spans the delay to enter a critical section. A second key ingredient is communicating the allocated record between the acquire and release procedures through the lock to maintain the standard locking interface. Both of these practices are known to practitioners, but our solution combines them in a unique way. Furthermore, when these practices are used in prior papers, their correctness is often argued informally. The correctness of MCSH is verified rigorously with the proof assistant PVS, and experiments are run to compare its performance with MCS and similar locks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48547326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN 基于GCN的CPU-GPU异构并行计算SpTTM优化方法

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-17 DOI: 10.1145/3584373

Hao Wang, Wangdong Yang, Renqiu Ouyang, Rong Hu, Kenli Li, Keqin Li

引用次数: 1

GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems 异构多gpu系统的节能矩阵分解

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-17 DOI: 10.1145/3583590

Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen

{"title":"GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems","authors":"Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen","doi":"10.1145/3583590","DOIUrl":"https://doi.org/10.1145/3583590","url":null,"abstract":"The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for both CPUs and GPUs based on the algorithmic knowledge and manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages (VsafeMin) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49072865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Investigation and Implementation of Parallelism Resources of Numerical Algorithms 数值算法并行资源的研究与实现

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-15 DOI: 10.1145/3583755

Valentina N. Aleeva, R. Aleev

{"title":"Investigation and Implementation of Parallelism Resources of Numerical Algorithms","authors":"Valentina N. Aleeva, R. Aleev","doi":"10.1145/3583755","DOIUrl":"https://doi.org/10.1145/3583755","url":null,"abstract":"This article is devoted to an approach to solving a problem of the efficiency of parallel computing. The theoretical basis of this approach is the concept of a Q-determinant. Any numerical algorithm has a Q-determinant. The Q-determinant of the algorithm has clear structure and is convenient for implementation. The Q-determinant consists of Q-terms. Their number is equal to the number of output data items. Each Q-term describes all possible ways to compute one of the output data items based on the input data. We also describe a software Q-system for studying the parallelism resources of numerical algorithms. This system enables to compute and compare the parallelism resources of numerical algorithms. The application of the Q-system is shown on the example of numerical algorithms with different structures of Q-determinants. Furthermore, we suggest a method for designing of parallel programs for numerical algorithms. This method is based on a representation of a numerical algorithm in the form of a Q-determinant. As a result, we can obtain the program using the parallelism resource of the algorithm completely. Such programs are called Q-effective. The results of this research can be applied to increase the implementation efficiency of numerical algorithms, methods, as well as algorithmic problems on parallel computing systems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 64"},"PeriodicalIF":1.6,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42505176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition 张量不规则性的性能蕴涵与分布式张量分解的优化

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-02-07 DOI: 10.1145/3580315

Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li

{"title":"Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition","authors":"Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li","doi":"10.1145/3580315","DOIUrl":"https://doi.org/10.1145/3580315","url":null,"abstract":"Tensors are used by a wide variety of applications to represent multi-dimensional data; tensor decompositions are a class of methods for latent data analytics, data compression, and so on. Many of these applications generate large tensors with irregular dimension sizes and nonzero distribution. CANDECOMP/PARAFAC decomposition (Cpd) is a popular low-rank tensor decomposition for discovering latent features. The increasing overhead on memory and execution time of Cpd for large tensors requires distributed memory implementations as the only feasible solution. The sparsity and irregularity of tensors hinder the improvement of performance and scalability of distributed memory implementations. While previous works have been proved successful in Cpd for tensors with relatively regular dimension sizes and nonzero distribution, they either deliver unsatisfactory performance and scalability for irregular tensors or require significant time overhead in preprocessing. In this work, we focus on medium-grained tensor distribution to address their limitation for irregular tensors. We first thoroughly investigate through theoretical and experimental analysis. We disclose that the main cause of poor Cpd performance and scalability is the imbalance of multiple types of computations and communications and their tradeoffs; and sparsity and irregularity make it challenging to achieve their balances and tradeoffs. Irregularity of a sparse tensor is categorized based on two aspects: very different dimension sizes and a non-uniform nonzero distribution. Typically, focusing on optimizing one type of load imbalance causes other ones more severe for irregular tensors. To address such challenges, we propose irregularity-aware distributed Cpd that leverages the sparsity and irregularity information to identify the best tradeoff between different imbalances with low time overhead. We materialize the idea with two optimization methods: the prediction-based grid configuration and matrix-oriented distribution policy, where the former forms the global balance among computations and communications, and the latter further adjusts the balances among computations. The experimental results show that our proposed irregularity-aware distributed Cpd is more scalable and outperforms the medium- and fine-grained distributed implementations by up to 4.4 × and 11.4 × on 1,536 processors, respectively. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and Hierarchical Coordinate (HiCOO), and gain good scalability for all of them.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 27"},"PeriodicalIF":1.6,"publicationDate":"2023-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43336829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Tridigpu: A GPU Library for Block Tridiagonal and Banded Linear Equation Systems 块三对角线和带状线性方程组的GPU库

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-01-31 DOI: 10.1145/3580373

Christopher J. Klein, R. Strzodka

引用次数: 0