{"title":"The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks","authors":"Sanjoy Baruah, A. Marchetti-Spaccamela","doi":"10.1145/3606342","DOIUrl":"https://doi.org/10.1145/3606342","url":null,"abstract":"The Conditional DAG (CDAG) task model is used for modeling multiprocessor real-time systems containing conditional expressions for which outcomes are not known prior to their evaluation. Feasibility analysis for CDAG tasks upon multiprocessor platforms is shown to be complete for the complexity class pspace; assuming np ≠ pspace, this result rules out the use of Integer Linear Programming solvers for solving this problem efficiently. It is further shown that there can be no pseudo-polynomial time algorithm that solves this problem unless p = pspace.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48543350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for Right-Sizing Heterogeneous Data Centers","authors":"S. Albers, Jens Quedenfeld","doi":"10.1145/3595286","DOIUrl":"https://doi.org/10.1145/3595286","url":null,"abstract":"Power consumption is a dominant and still growing cost factor in data centers. In time periods with low load, the energy consumption can be reduced by powering down unused servers. We resort to a model introduced by Lin, Wierman, Andrew and Thereska [23, 24] that considers data centers with identical machines, and generalize it to heterogeneous data centers with d different server types. The operating cost of a server depends on its load and is modeled by an increasing, convex function for each server type. In contrast to earlier work, we consider the discrete setting, where the number of active servers must be integral. Thereby, we seek truly feasible solutions. For homogeneous data centers (d = 1), both the offline and the online problem were solved optimally in [3, 4]. In this paper, we study heterogeneous data centers with general time-dependent operating cost functions. We develop an online algorithm based on a work function approach which achieves a competitive ratio of 2d + 1 + ϵ for any ϵ > 0. For time-independent operating cost functions, the competitive ratio can be reduced to 2d + 1. There is a lower bound of 2d shown in [5], so our algorithm is nearly optimal. For the offline version, we give a graph-based (1 + ϵ)-approximation algorithm. Additionally, our offline algorithm is able to handle time-variable data-center sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44289659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-Clairvoyant Scheduling with Predictions","authors":"Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit","doi":"10.1145/3593969","DOIUrl":"https://doi.org/10.1145/3593969","url":null,"abstract":"In the single-machine non-clairvoyant scheduling problem, the goal is to minimize the total completion time of jobs whose processing times are unknown a priori. We revisit this well-studied problem and consider the question of how to effectively use (possibly erroneous) predictions of the processing times. We study this question from ground zero by first asking what constitutes a good prediction; we then propose a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure. Our approach to derive a prediction error measure based on natural desiderata could find applications for other online problems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44598514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks","authors":"Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro","doi":"10.1145/3593426","DOIUrl":"https://doi.org/10.1145/3593426","url":null,"abstract":"How do we reach consensus on an average value in a dynamic crowd without revealing identity? In this work, we study the problem of average network consensus in Anonymous Dynamic Networks (ADN). Network dynamicity is specified by the sequence of topology-graph isoperimetric numbers occurring over time, which we call the isoperimetric dynamicity of the network. The consensus variable is the average of values initially held by nodes, which is customary in the network-consensus literature. Given that having an algorithm to compute the average one can compute the network size (i.e., the counting problem) and vice versa, we focus on the latter. We present a deterministic distributed average network consensus algorithm for ADNs that we call isoperimetric Scalable Coordinated Anonymous Local Aggregation, and we analyze its performance for different scenarios, including worst-case (adversarial) and stochastic dynamic topologies. Our solution utilizes supervisor nodes, which have been shown to be necessary for computations in ADNs. The algorithm uses the isoperimetric dynamicity of the network as an input, meaning that only the isoperimetric number parameters (or their lower bound) must be given, but topologies may occur arbitrarily or stochastically as long as they comply with those parameters. Previous work for adversarial ADNs overestimates the running time to deal with worst-case scenarios. For ADNs with given isoperimetric dynamicity, our analysis shows improved performance for some practical dynamic topologies, with cubic time or better for stochastic ADNs, and our experimental evaluation indicates that our theoretical bounds could not be substantially improved for some models of dynamic networks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48744359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems","authors":"Weijian Zheng, Dali Wang, Fengguang Song","doi":"10.1145/3589188","DOIUrl":"https://doi.org/10.1145/3589188","url":null,"abstract":"Graph optimization problems (such as minimum vertex cover, maximum cut, traveling salesman problems) appear in many fields including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has compromised the ability of reinforcement learning in solving large-scale graph optimization problems due to lack of parallelization and high scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and policy model are highly computation intensive. Traditional reinforcement learning systems often assume either the RL environment is of low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use the spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of the spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel. Furthermore, we propose two optimization techniques (replay buffer on-the-fly graph generation and adaptive multiple-node selection) to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47081538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"POETS: An Event-driven Approach to Dissipative Particle Dynamics","authors":"Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming","doi":"10.1145/3580372","DOIUrl":"https://doi.org/10.1145/3580372","url":null,"abstract":"HPC clusters have become ever more expensive, both in terms of capital cost and energy consumption; some estimates suggest that competitive installations at the end of the next decade will require their own power station. One way around this looming problem is to design bespoke computing engines, but while the performance benefits are good, the design costs are huge and cannot easily be amortized. Partially Ordered Event Triggered System (POETS), the focus of this article, seeks to exploit a middle way: The architecture is tuned to a specific algorithmic pattern but, within that constraint, is fully programmable. POETS software is quasi-imperative: The user defines a set of sequential event handlers, defines the topology of a (typically large) concurrent ensemble of these, and lets them interact. The “solution” may be exfiltrated from the emergent behaviour of the ensemble. In this article, we describe (briefly) the architecture, and an example computational chemistry application, dissipative particle dynamics (DPD). The DPD algorithm is traditionally implemented using parallel computational techniques, but we re-cast it as a concurrent compute problem that is then ideally suited to POETS. Our prototype system is realised on a cluster of 48 FPGAs providing 50K concurrent hardware threads, and we report performance speedups of over two orders of magnitude better than a single thread baseline comparator and scaling behaviour that is almost constant. The results are validated against a “conventional” implementation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44594257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MCSH, a Lock with the Standard Interface","authors":"W. Hesselink, P. Buhr","doi":"10.1145/3584696","DOIUrl":"https://doi.org/10.1145/3584696","url":null,"abstract":"The MCS lock of Mellor-Crummey and Scott (1991) is a very efficient first-come first-served mutual-exclusion algorithm that uses the atomic hardware primitives fetch-and-store and compare-and-swap. However, it has the disadvantage that the calling thread must provide a pointer to an allocated record. This additional parameter violates the standard locking interface, which has only the lock as a parameter. Hence, it is impossible to switch to MCS without editing and recompiling an application that uses locks. This article provides a variation of MCS with the standard interface, which remains FCFS, called MCSH. One key ingredient is to stack allocate the necessary record in the acquire procedure of the lock, so its life-time only spans the delay to enter a critical section. A second key ingredient is communicating the allocated record between the acquire and release procedures through the lock to maintain the standard locking interface. Both of these practices are known to practitioners, but our solution combines them in a unique way. Furthermore, when these practices are used in prior papers, their correctness is often argued informally. The correctness of MCSH is verified rigorously with the proof assistant PVS, and experiments are run to compare its performance with MCS and similar locks.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48547326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN","authors":"Hao Wang, Wangdong Yang, Renqiu Ouyang, Rong Hu, Kenli Li, Keqin Li","doi":"10.1145/3584373","DOIUrl":"https://doi.org/10.1145/3584373","url":null,"abstract":"Sparse Tensor-Times-Matrix (SpTTM) is the core calculation in tensor analysis. The sparse distributions of different tensors vary greatly, which poses a big challenge to designing efficient and general SpTTM. In this paper, we describe SpTTM on CPU-GPU heterogeneous hybrid systems and give a parallel execution strategy for SpTTM in different sparse formats. We analyze the theoretical computer powers and estimate the number of tasks to achieve the load balancing between the CPU and the GPU of the heterogeneous systems. We discuss a method to describe tensor sparse structure by graph structure and design a new graph neural network SPT-GCN to select a suitable tensor sparse format. Furthermore, we perform extensive experiments using real datasets to demonstrate the advantages and efficiency of our proposed input-aware slice-wise SpTTM. The experimental results show that our input-aware slice-wise SpTTM can achieve an average speedup of 1.310 × compared to ParTI! library on a CPU-GPU heterogeneous system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41725526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems","authors":"Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen","doi":"10.1145/3583590","DOIUrl":"https://doi.org/10.1145/3583590","url":null,"abstract":"The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for heterogeneous systems for LU factorization utilizing multi-GPUs. LU factorization is a crucial kernel from the MAGMA library, which is highly optimized. Our aim is to apply DVFS to this application by leveraging slacks intelligently on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for both CPUs and GPUs based on the algorithmic knowledge and manufacturer’s specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages (VsafeMin) for the CPUs and GPUs utilizing a low overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy about 59%, 21%, and 31%, respectively, while delivering similar performance to the state-of-the-art linear algebra MAGMA library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49072865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation and Implementation of Parallelism Resources of Numerical Algorithms","authors":"Valentina N. Aleeva, R. Aleev","doi":"10.1145/3583755","DOIUrl":"https://doi.org/10.1145/3583755","url":null,"abstract":"This article is devoted to an approach to solving a problem of the efficiency of parallel computing. The theoretical basis of this approach is the concept of a Q-determinant. Any numerical algorithm has a Q-determinant. The Q-determinant of the algorithm has a clear structure and is convenient for implementation. The Q-determinant consists of Q-terms. Their number is equal to the number of output data items. Each Q-term describes all possible ways to compute one of the output data items based on the input data. We also describe a software Q-system for studying the parallelism resources of numerical algorithms. This system makes it possible to compute and compare the parallelism resources of numerical algorithms. The application of the Q-system is shown on the example of numerical algorithms with different structures of Q-determinants. Furthermore, we suggest a method for designing parallel programs for numerical algorithms. This method is based on a representation of a numerical algorithm in the form of a Q-determinant. As a result, we can obtain the program using the parallelism resource of the algorithm completely. Such programs are called Q-effective. The results of this research can be applied to increase the implementation efficiency of numerical algorithms, methods, as well as algorithmic problems on parallel computing systems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42505176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}