ACM Transactions on Parallel Computing: Latest Publications

The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-07-05 DOI: 10.1145/3606342
Sanjoy Baruah, A. Marchetti-Spaccamela
{"title":"The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks","authors":"Sanjoy Baruah, A. Marchetti-Spaccamela","doi":"10.1145/3606342","DOIUrl":"https://doi.org/10.1145/3606342","url":null,"abstract":"The Conditional DAG (CDAG) task model is used for modeling multiprocessor real-time systems containing conditional expressions for which outcomes are not known prior to their evaluation. Feasibility analysis for CDAG tasks upon multiprocessor platforms is shown to be complete for the complexity class pspace; assuming np ≠ pspace, this result rules out the use of Integer Linear Programming solvers for solving this problem efficiently. It is further shown that there can be no pseudo-polynomial time algorithm that solves this problem unless p = pspace.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48543350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
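As a hedged aside (not part of the article, and not the authors' formalism): the sketch below represents a conditional DAG with a made-up Vertex type and enumerates all branch outcomes to compute the worst-case workload. The number of scenarios grows exponentially with the number of conditional vertices, which gives some intuition for why exact feasibility analysis is computationally hard.

```cpp
#include <cstdio>
#include <vector>
#include <functional>
#include <algorithm>

// Hypothetical, simplified conditional-DAG representation (illustration only).
struct Vertex {
    double wcet;              // worst-case execution time of this job
    bool conditional;         // true: exactly one successor branch is taken
    std::vector<int> succ;    // successor vertex indices
};

// Enumerate every combination of branch outcomes and return the largest total
// workload over all scenarios.  The scenario count is exponential in the
// number of conditional vertices.
double worstCaseWork(const std::vector<Vertex>& g, int source) {
    std::vector<int> condIdx;
    for (int v = 0; v < (int)g.size(); ++v)
        if (g[v].conditional && !g[v].succ.empty()) condIdx.push_back(v);

    double worst = 0.0;
    const long long scenarios = 1LL << condIdx.size();   // assumes binary branches
    for (long long s = 0; s < scenarios; ++s) {
        // choice[v] = which successor a conditional vertex takes in scenario s
        std::vector<int> choice(g.size(), 0);
        for (size_t i = 0; i < condIdx.size(); ++i)
            choice[condIdx[i]] = (int)((s >> i) & 1);

        std::vector<char> executed(g.size(), 0);
        std::function<void(int)> run = [&](int v) {
            if (executed[v]) return;
            executed[v] = 1;
            if (g[v].conditional && !g[v].succ.empty()) {
                int pick = std::min(choice[v], (int)g[v].succ.size() - 1);
                run(g[v].succ[pick]);                     // only the chosen branch runs
            } else {
                for (int w : g[v].succ) run(w);           // a regular vertex spawns all
            }
        };
        run(source);

        double work = 0.0;
        for (int v = 0; v < (int)g.size(); ++v)
            if (executed[v]) work += g[v].wcet;
        worst = std::max(worst, work);
    }
    return worst;
}

int main() {
    //      0 (cond) --> 1 --> 3
    //               \-> 2 --> 3
    std::vector<Vertex> g = {
        {1.0, true,  {1, 2}},
        {4.0, false, {3}},
        {2.0, false, {3}},
        {1.0, false, {}},
    };
    std::printf("worst-case workload = %.1f\n", worstCaseWork(g, 0));   // prints 6.0
    return 0;
}
```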
Algorithms for Right-Sizing Heterogeneous Data Centers
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-05-10 DOI: 10.1145/3595286
S. Albers, Jens Quedenfeld
{"title":"Algorithms for Right-Sizing Heterogeneous Data Centers","authors":"S. Albers, Jens Quedenfeld","doi":"10.1145/3595286","DOIUrl":"https://doi.org/10.1145/3595286","url":null,"abstract":"Power consumption is a dominant and still growing cost factor in data centers. In time periods with low load, the energy consumption can be reduced by powering down unused servers. We resort to a model introduced by Lin, Wierman, Andrew and Thereska [23, 24] that considers data centers with identical machines, and generalize it to heterogeneous data centers with d different server types. The operating cost of a server depends on its load and is modeled by an increasing, convex function for each server type. In contrast to earlier work, we consider the discrete setting, where the number of active servers must be integral. Thereby, we seek truly feasible solutions. For homogeneous data centers (d = 1), both the offline and the online problem were solved optimally in [3, 4]. In this paper, we study heterogeneous data centers with general time-dependent operating cost functions. We develop an online algorithm based on a work function approach which achieves a competitive ratio of 2d + 1 + ϵ for any ϵ > 0. For time-independent operating cost functions, the competitive ratio can be reduced to 2d + 1. There is a lower bound of 2d shown in [5], so our algorithm is nearly optimal. For the offline version, we give a graph-based (1 + ϵ)-approximation algorithm. Additionally, our offline algorithm is able to handle time-variable data-center sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44289659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
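As a hedged illustration of the cost model (not the paper's online work-function algorithm or its graph-based approximation): the sketch below solves a tiny offline, homogeneous (d = 1) instance of discrete right-sizing by dynamic programming, with a made-up convex operating-cost function and a power-up cost per newly activated server.

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cmath>
#include <limits>

// Toy offline DP for the homogeneous discrete case: an integral number of
// active servers per step, a convex operating cost, and a switching cost for
// every server powered up.  Data and cost function are invented.
int main() {
    const int T = 6;              // time horizon
    const int M = 4;              // number of (identical) servers available
    const double beta = 3.0;      // cost to power one server up
    const double load[T] = {0.5, 2.3, 3.6, 3.1, 1.2, 0.4};   // load per step

    // Operating cost of x active servers under load l (toy convex choice):
    // idle power plus a term that is convex in the per-server load.
    auto op = [](double l, int x) {
        double perServer = l / x;
        return x * (0.2 + perServer * perServer);
    };

    const double INF = std::numeric_limits<double>::infinity();
    // dp[x] = cheapest cost of the steps handled so far, ending with x active
    std::vector<double> dp(M + 1, INF), next(M + 1, INF);
    dp[0] = 0.0;                                            // start powered down
    for (int t = 0; t < T; ++t) {
        std::fill(next.begin(), next.end(), INF);
        int xmin = std::max(1, (int)std::ceil(load[t]));    // enough capacity
        for (int prev = 0; prev <= M; ++prev) {
            if (dp[prev] == INF) continue;
            for (int x = xmin; x <= M; ++x) {
                double c = dp[prev] + op(load[t], x)
                         + beta * std::max(0, x - prev);    // switching cost
                next[x] = std::min(next[x], c);
            }
        }
        dp.swap(next);
    }
    std::printf("optimal offline cost = %.2f\n",
                *std::min_element(dp.begin(), dp.end()));
    return 0;
}
```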
Non-Clairvoyant Scheduling with Predictions
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-05-02 DOI: 10.1145/3593969
Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit
{"title":"Non-Clairvoyant Scheduling with Predictions","authors":"Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit","doi":"10.1145/3593969","DOIUrl":"https://doi.org/10.1145/3593969","url":null,"abstract":"In the single-machine non-clairvoyant scheduling problem, the goal is to minimize the total completion time of jobs whose processing times are unknown a priori. We revisit this well-studied problem and consider the question of how to effectively use (possibly erroneous) predictions of the processing times. We study this question from ground zero by first asking what constitutes a good prediction; we then propose a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure. Our approach to derive a prediction error measure based on natural desiderata could find applications for other online problems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44598514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
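As a hedged illustration (this is not the paper's algorithm or its prediction-quality measure): the sketch below contrasts the classic non-clairvoyant baseline, round-robin processor sharing, with a schedule that blindly trusts predicted processing times, which shows why prediction quality matters. The job data is invented.

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>
#include <numeric>

// Total completion time under idealized round-robin (equal processor sharing):
// the i-th smallest job finishes after the remaining jobs have shared the
// machine equally up to its processing time.
double roundRobin(std::vector<double> p) {
    std::sort(p.begin(), p.end());
    double t = 0.0, prev = 0.0, total = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
        t += (p.size() - i) * (p[i] - prev);   // all remaining jobs share the machine
        prev = p[i];
        total += t;                            // this job completes at time t
    }
    return total;
}

// Total completion time when jobs are run to completion in increasing order of
// their *predicted* processing time (actual times may differ).
double followPredictions(const std::vector<double>& actual,
                         const std::vector<double>& predicted) {
    std::vector<size_t> order(actual.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return predicted[a] < predicted[b]; });
    double t = 0.0, total = 0.0;
    for (size_t j : order) { t += actual[j]; total += t; }
    return total;
}

int main() {
    std::vector<double> actual   = {1.0, 2.0, 4.0, 8.0};
    std::vector<double> goodPred = {1.1, 1.9, 4.2, 7.5};    // nearly correct
    std::vector<double> badPred  = {8.0, 4.0, 2.0, 1.0};    // completely reversed
    std::printf("round-robin:            %.1f\n", roundRobin(actual));
    std::printf("follow good prediction: %.1f\n", followPredictions(actual, goodPred));
    std::printf("follow bad prediction:  %.1f\n", followPredictions(actual, badPred));
    return 0;
}
```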
Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-04-24 DOI: 10.1145/3593426
Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro
{"title":"Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks","authors":"Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro","doi":"10.1145/3593426","DOIUrl":"https://doi.org/10.1145/3593426","url":null,"abstract":"How do we reach consensus on an average value in a dynamic crowd without revealing identity? In this work, we study the problem of average network consensus in Anonymous Dynamic Networks (ADN). Network dynamicity is specified by the sequence of topology-graph isoperimetric numbers occurring over time, which we call the isoperimetric dynamicity of the network. The consensus variable is the average of values initially held by nodes, which is customary in the network-consensus literature. Given that having an algorithm to compute the average one can compute the network size (i.e., the counting problem) and vice versa, we focus on the latter. We present a deterministic distributed average network consensus algorithm for ADNs that we call isoperimetric Scalable Coordinated Anonymous Local Aggregation, and we analyze its performance for different scenarios, including worst-case (adversarial) and stochastic dynamic topologies. Our solution utilizes supervisor nodes, which have been shown to be necessary for computations in ADNs. The algorithm uses the isoperimetric dynamicity of the network as an input, meaning that only the isoperimetric number parameters (or their lower bound) must be given, but topologies may occur arbitrarily or stochastically as long as they comply with those parameters. Previous work for adversarial ADNs overestimates the running time to deal with worst-case scenarios. For ADNs with given isoperimetric dynamicity, our analysis shows improved performance for some practical dynamic topologies, with cubic time or better for stochastic ADNs, and our experimental evaluation indicates that our theoretical bounds could not be substantially improved for some models of dynamic networks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48744359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
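As a hedged illustration (not the paper's supervised, anonymous algorithm): the sketch below runs a plain synchronous averaging iteration with Metropolis weights over a toy time-varying topology, showing the quantity the nodes converge to, namely the average of the initial values.

```cpp
#include <cstdio>
#include <vector>
#include <utility>
#include <algorithm>

// One synchronous round of neighbour averaging with Metropolis weights.
// The symmetric updates preserve the sum of the values, so the fixed point
// is the average of the initial inputs.
using Edge = std::pair<int, int>;

void averagingRound(std::vector<double>& x, const std::vector<Edge>& edges) {
    const int n = (int)x.size();
    std::vector<int> deg(n, 0);
    for (const Edge& e : edges) { deg[e.first]++; deg[e.second]++; }

    std::vector<double> next = x;
    for (const Edge& e : edges) {
        int i = e.first, j = e.second;
        double w = 1.0 / (1 + std::max(deg[i], deg[j]));   // Metropolis weight
        next[i] += w * (x[j] - x[i]);
        next[j] += w * (x[i] - x[j]);
    }
    x.swap(next);
}

int main() {
    std::vector<double> x = {4.0, 0.0, 8.0, 2.0, 6.0};     // initial inputs, avg = 4
    // A different (connected) topology in every round: a toy dynamic network.
    std::vector<std::vector<Edge>> rounds = {
        {{0,1},{1,2},{2,3},{3,4}},
        {{0,2},{1,3},{2,4},{0,4}},
        {{0,3},{1,4},{1,2},{3,4}},
    };
    for (int r = 0; r < 60; ++r)
        averagingRound(x, rounds[r % rounds.size()]);
    for (double v : x) std::printf("%.4f ", v);            // all close to 4.0
    std::printf("\n");
    return 0;
}
```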
A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-03-23 DOI: 10.1145/3589188
Weijian Zheng, Dali Wang, Fengguang Song
{"title":"A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems","authors":"Weijian Zheng, Dali Wang, Fengguang Song","doi":"10.1145/3589188","DOIUrl":"https://doi.org/10.1145/3589188","url":null,"abstract":"Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques—replay buffer on-the-fly graph generation and adaptive multiple-node selection—to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47081538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
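As a hedged illustration of the spatial-parallelism idea only (not RL4GO's distributed 3D-tensor kernels, its RL environment, or any real multi-GPU code): the sketch below partitions a toy graph's vertices across two simulated "devices" and computes a mean-aggregation GNN layer per partition, where reading a neighbour's feature may cross a partition boundary.

```cpp
#include <cstdio>
#include <vector>

// Single-process toy: vertices (and their feature rows) are partitioned
// across "devices"; each device aggregates neighbour features for the
// vertices it owns.  A remote read here is just an array access, but it
// stands in for cross-GPU communication.
struct Partition {
    std::vector<int> owned;                 // vertex ids owned by this device
};

int main() {
    // Toy graph: adjacency lists for 6 vertices, scalar features.
    std::vector<std::vector<int>> adj = {
        {1, 2}, {0, 2, 3}, {0, 1}, {1, 4, 5}, {3, 5}, {3, 4}};
    std::vector<double> feat = {1, 2, 3, 4, 5, 6};

    // Spatial partition across two "devices".
    std::vector<Partition> devices = {{{0, 1, 2}}, {{3, 4, 5}}};

    std::vector<double> out(feat.size(), 0.0);
    for (size_t d = 0; d < devices.size(); ++d) {
        for (int v : devices[d].owned) {
            double sum = feat[v];                   // include the vertex itself
            for (int u : adj[v]) sum += feat[u];    // may be a "remote" read
            out[v] = sum / (adj[v].size() + 1);     // mean aggregation
        }
    }
    for (size_t v = 0; v < out.size(); ++v)
        std::printf("h[%zu] = %.3f\n", v, out[v]);
    return 0;
}
```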
POETS: An Event-driven Approach to Dissipative Particle Dynamics
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-02-20 DOI: 10.1145/3580372
Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming
{"title":"POETS: An Event-driven Approach to Dissipative Particle Dynamics","authors":"Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming","doi":"10.1145/3580372","DOIUrl":"https://doi.org/10.1145/3580372","url":null,"abstract":"HPC clusters have become ever more expensive, both in terms of capital cost and energy consumption; some estimates suggest that competitive installations at the end of the next decade will require their own power station. One way around this looming problem is to design bespoke computing engines, but while the performance benefits are good, the design costs are huge and cannot easily be amortized. Partially Ordered Event Triggered System (POETS)—the focus of this article—seeks to exploit a middle way: The architecture is tuned to a specific algorithmic pattern but, within that constraint, is fully programmable. POETS software is quasi-imperative: The user defines a set of sequential event handlers, defines the topology of a (typically large) concurrent ensemble of these, and lets them interact. The “solution” may be exfiltrated from the emergent behaviour of the ensemble. In this article, we describe (briefly) the architecture, and an example computational chemistry application, dissipative particle dynamics (DPD). The DPD algorithm is traditionally implemented using parallel computational techniques, but we re-cast it as a concurrent compute problem that is then ideally suited to POETS. Our prototype system is realised on a cluster of 48 FPGAs providing 50K concurrent hardware threads, and we report performance speedups of over two orders of magnitude better than a single thread baseline comparator and scaling behaviour that is almost constant. The results are validated against a “conventional” implementation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44594257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
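As a hedged sketch of the programming pattern the abstract describes, sequential event handlers on a concurrent ensemble of devices, and explicitly not the POETS hardware, its API, or the DPD application: the toy below lets devices gossip until they agree on the global maximum; all type and function names are hypothetical.

```cpp
#include <cstdio>
#include <vector>
#include <queue>
#include <utility>

// Each device has private state and a message handler; the "answer" emerges
// from the interaction of the ensemble rather than from a global loop over
// the data.
struct Message { int from; double value; };

struct Device {
    double state;                      // local value; converges to the global max
    std::vector<int> neighbours;       // static connectivity of the ensemble
};

int main() {
    std::vector<Device> dev = {
        {3.0, {1}}, {7.0, {0, 2}}, {2.0, {1, 3}}, {5.0, {2}}};

    std::queue<std::pair<int, Message>> inflight;   // (destination, message)
    auto broadcast = [&](int src) {
        for (int n : dev[src].neighbours)
            inflight.push({n, {src, dev[src].state}});
    };
    for (int i = 0; i < (int)dev.size(); ++i) broadcast(i);   // initial events

    // Event loop: deliver one message at a time to its handler.
    while (!inflight.empty()) {
        auto [dst, msg] = inflight.front();
        inflight.pop();
        if (msg.value > dev[dst].state) {       // "onReceive" handler body
            dev[dst].state = msg.value;         // state changed ...
            broadcast(dst);                     // ... so notify the neighbours
        }
    }
    for (int i = 0; i < (int)dev.size(); ++i)
        std::printf("device %d settled on %.1f\n", i, dev[i].state);   // all 7.0
    return 0;
}
```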
MCSH, a Lock with the Standard Interface
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-02-20 DOI: 10.1145/3584696
W. Hesselink, P. Buhr
{"title":"MCSH, a Lock with the Standard Interface","authors":"W. Hesselink, P. Buhr","doi":"10.1145/3584696","DOIUrl":"https://doi.org/10.1145/3584696","url":null,"abstract":"The MCS lock of Mellor-Crummey and Scott (1991), 23 pages. is a very efficient first-come first-served mutual-exclusion algorithm that uses the atomic hardware primitives fetch-and-store and compare-and-swap. However, it has the disadvantage that the calling thread must provide a pointer to an allocated record. This additional parameter violates the standard locking interface, which has only the lock as a parameter. Hence, it is impossible to switch to MCS without editing and recompiling an application that uses locks. This article provides a variation of MCS with the standard interface, which remains FCFS, called MCSH. One key ingredient is to stack allocate the necessary record in the acquire procedure of the lock, so its life-time only spans the delay to enter a critical section. A second key ingredient is communicating the allocated record between the acquire and release procedures through the lock to maintain the standard locking interface. Both of these practices are known to practitioners, but our solution combines them in a unique way. Furthermore, when these practices are used in prior papers, their correctness is often argued informally. The correctness of MCSH is verified rigorously with the proof assistant PVS, and experiments are run to compare its performance with MCS and similar locks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48547326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
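As a hedged sketch of the standard-interface idea, and explicitly not the MCSH algorithm itself (MCSH stack-allocates its record inside acquire, which this sketch does not do): the code below uses the simpler, well-known thread-local-node variant of MCS and, as in the abstract's second ingredient, hands the holder's node from acquire to release through the lock, so release takes no extra parameter. It assumes a thread holds at most one such lock at a time.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool>     locked{false};
};

class StdInterfaceMCS {
    std::atomic<MCSNode*> tail{nullptr};
    MCSNode* holder = nullptr;        // written/read only by the current holder
    static thread_local MCSNode me;   // one queue node per thread (not per lock)

public:
    void lock() {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        MCSNode* prev = tail.exchange(&me, std::memory_order_acq_rel);
        if (prev) {                                   // someone is ahead of us
            prev->next.store(&me, std::memory_order_release);
            while (me.locked.load(std::memory_order_acquire))
                std::this_thread::yield();            // spin until handed the lock
        }
        holder = &me;                                 // remember the node for unlock()
    }

    void unlock() {                                   // standard interface: no node argument
        MCSNode* node = holder;
        MCSNode* succ = node->next.load(std::memory_order_acquire);
        if (!succ) {
            MCSNode* expected = node;
            if (tail.compare_exchange_strong(expected, nullptr,
                                             std::memory_order_release,
                                             std::memory_order_relaxed))
                return;                               // no one was waiting
            while (!(succ = node->next.load(std::memory_order_acquire)))
                std::this_thread::yield();            // successor is mid-enqueue
        }
        succ->locked.store(false, std::memory_order_release);
    }
};

thread_local MCSNode StdInterfaceMCS::me;

int main() {
    StdInterfaceMCS lock;
    long counter = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) { lock.lock(); ++counter; lock.unlock(); }
        });
    for (auto& t : ts) t.join();
    std::printf("counter = %ld (expected 400000)\n", counter);
    return 0;
}
```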
A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-02-17 DOI: 10.1145/3584373
Hao Wang, Wangdong Yang, Renqiu Ouyang, Rong Hu, Kenli Li, Keqin Li
{"title":"A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN","authors":"Hao Wang, Wangdong Yang, Renqiu Ouyang, Rong Hu, Kenli Li, Keqin Li","doi":"10.1145/3584373","DOIUrl":"https://doi.org/10.1145/3584373","url":null,"abstract":"Sparse Tensor-Times-Matrix (SpTTM) is the core calculation in tensor analysis. The sparse distributions of different tensors vary greatly, which poses a big challenge to designing efficient and general SpTTM. In this paper, we describe SpTTM on CPU-GPU heterogeneous hybrid systems and give a parallel execution strategy for SpTTM in different sparse formats. We analyze the theoretical computer powers and estimate the number of tasks to achieve the load balancing between the CPU and the GPU of the heterogeneous systems. We discuss a method to describe tensor sparse structure by graph structure and design a new graph neural network SPT-GCN to select a suitable tensor sparse format. Furthermore, we perform extensive experiments using real datasets to demonstrate the advantages and efficiency of our proposed input-aware slice-wise SpTTM. The experimental results show that our input-aware slice-wise SpTTM can achieve an average speedup of 1.310 × compared to ParTI! library on a CPU-GPU heterogeneous system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41725526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
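As a hedged reference illustration of what SpTTM computes (not the paper's input-aware slice-wise kernel or its SPT-GCN format selector): the sketch below performs a mode-3 sparse tensor-times-matrix product on a tiny COO tensor.

```cpp
#include <cstdio>
#include <vector>

// Mode-3 TTM: Y(i, j, r) = sum_k X(i, j, k) * U(k, r), where X is a sparse
// 3rd-order tensor in coordinate (COO) format and U is a dense matrix.
// A real kernel would parallelize over (i, j) slices and pick a data layout
// based on the sparsity pattern.
struct Coord { int i, j, k; double val; };

int main() {
    const int I = 2, J = 2, K = 3, R = 2;
    std::vector<Coord> X = {               // sparse tensor, 4 nonzeros
        {0, 0, 0, 1.0}, {0, 1, 2, 2.0}, {1, 0, 1, 3.0}, {1, 1, 2, 4.0}};
    double U[K][R] = {{1, 2}, {3, 4}, {5, 6}};   // dense factor matrix

    std::vector<double> Y(I * J * R, 0.0);       // result is dense along mode 3
    for (const Coord& nz : X)
        for (int r = 0; r < R; ++r)
            Y[(nz.i * J + nz.j) * R + r] += nz.val * U[nz.k][r];

    for (int i = 0; i < I; ++i)
        for (int j = 0; j < J; ++j)
            std::printf("Y(%d,%d,:) = [%.1f, %.1f]\n",
                        i, j, Y[(i * J + j) * R], Y[(i * J + j) * R + 1]);
    return 0;
}
```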
GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-02-17 DOI: 10.1145/3583590
Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen
{"title":"GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems","authors":"Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen","doi":"10.1145/3583590","DOIUrl":"https://doi.org/10.1145/3583590","url":null,"abstract":"The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for both CPUs and GPUs based on the algorithmic knowledge and manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages (VsafeMin) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49072865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
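As a hedged back-of-the-envelope illustration of the DVFS-on-slack idea (the cubic power model and the numbers are toy assumptions, not GreenMD's performance or energy models, and undervolting is not modeled): the sketch below lowers the relative frequency of the device that finishes its part of an iteration early, so both devices finish together, and estimates the dynamic-energy saving.

```cpp
#include <cstdio>
#include <algorithm>

// Toy slack model: if the CPU phase of an iteration finishes before the GPU
// phase, the CPU can run at a lower relative frequency s = tCpu / tGpu without
// delaying the iteration.  With dynamic power P ~ f^3 (voltage scaling with
// frequency) and runtime ~ 1/f, dynamic energy scales roughly as s^2.
int main() {
    const double tCpu = 4.0;     // predicted CPU time per iteration at f_max (s)
    const double tGpu = 6.0;     // predicted GPU time per iteration at f_max (s)
    const double pCpu = 80.0;    // assumed CPU dynamic power at f_max (W)

    const double s = tCpu / std::max(tCpu, tGpu);        // relative frequency

    const double eBefore = pCpu * tCpu;                  // dynamic energy at f_max
    const double eAfter  = (pCpu * s * s * s)            // power drops ~ s^3 ...
                         * (tCpu / s);                   // ... runtime stretches by 1/s
    std::printf("CPU frequency scale: %.2f\n", s);
    std::printf("CPU dynamic energy: %.0f J -> %.0f J (%.0f%% saved)\n",
                eBefore, eAfter, 100.0 * (1.0 - eAfter / eBefore));
    return 0;
}
```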
Investigation and Implementation of Parallelism Resources of Numerical Algorithms
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-02-15 DOI: 10.1145/3583755
Valentina N. Aleeva, R. Aleev
{"title":"Investigation and Implementation of Parallelism Resources of Numerical Algorithms","authors":"Valentina N. Aleeva, R. Aleev","doi":"10.1145/3583755","DOIUrl":"https://doi.org/10.1145/3583755","url":null,"abstract":"This article is devoted to an approach to solving a problem of the efficiency of parallel computing. The theoretical basis of this approach is the concept of a Q-determinant. Any numerical algorithm has a Q-determinant. The Q-determinant of the algorithm has clear structure and is convenient for implementation. The Q-determinant consists of Q-terms. Their number is equal to the number of output data items. Each Q-term describes all possible ways to compute one of the output data items based on the input data. We also describe a software Q-system for studying the parallelism resources of numerical algorithms. This system enables to compute and compare the parallelism resources of numerical algorithms. The application of the Q-system is shown on the example of numerical algorithms with different structures of Q-determinants. Furthermore, we suggest a method for designing of parallel programs for numerical algorithms. This method is based on a representation of a numerical algorithm in the form of a Q-determinant. As a result, we can obtain the program using the parallelism resource of the algorithm completely. Such programs are called Q-effective. The results of this research can be applied to increase the implementation efficiency of numerical algorithms, methods, as well as algorithmic problems on parallel computing systems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42505176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
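As a hedged worked example of the kind of parallelism resource the article studies (this example is ours, not from the article): for a dot product, the single output's Q-term is a sum of products x1*y1 + ... + xn*yn; evaluated left to right it needs 2n - 1 dependent operations, but evaluated as a balanced reduction tree it needs only 1 + ceil(log2 n) parallel steps. The sketch below prints both depths.

```cpp
#include <cstdio>
#include <cmath>

// Compare the critical-path length of chained evaluation of a dot product
// with that of a balanced reduction tree over the same expression.
int main() {
    for (int n : {4, 64, 1024, 1 << 20}) {
        long sequentialDepth = 2L * n - 1;                           // chained evaluation
        long parallelDepth   = 1 + (long)std::ceil(std::log2((double)n));
        std::printf("n = %7d: sequential depth %7ld, parallel depth %2ld\n",
                    n, sequentialDepth, parallelDepth);
    }
    return 0;
}
```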