{"title":"Introduction to the Special Issue for SPAA’21","authors":"Y. Azar, Julian Shun","doi":"10.1145/3630608","DOIUrl":"https://doi.org/10.1145/3630608","url":null,"abstract":"","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"2001 20","pages":"1 - 1"},"PeriodicalIF":1.6,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139001830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia
{"title":"A Conflict-Resilient Lock-Free Linearizable Calendar Queue","authors":"Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia","doi":"10.1145/3635163","DOIUrl":"https://doi.org/10.1145/3635163","url":null,"abstract":"In the last two decades, great attention has been devoted to the design of non-blocking and linearizable data structures, which enable exploiting the scaled-up degree of parallelism in off-the-shelf shared-memory multi-core machines. In this context, priority queues are highly challenging. Indeed, concurrent attempts to extract the highest-priority item are prone to create detrimental thread conflicts that lead to abort/retry of the operations. In this article, we present the first priority queue that jointly provides: i) lock-freedom and linearizability; ii) conflict resiliency against concurrent extractions; iii) adaptiveness to different contention profiles; and iv) amortized constant-time access for both insertions and extractions. Beyond presenting our solution, we also provide proof of its correctness based on an assertional approach. Also, we present an experimental study on a 64-CPU machine, showing that our proposal provides performance improvements over state-of-the-art non-blocking priority queues.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"89 8","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138595960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters","authors":"Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li","doi":"10.1145/3630051","DOIUrl":"https://doi.org/10.1145/3630051","url":null,"abstract":"Sparse supernodal Cholesky on multi-NUMAs is challenging due to the supernode relaxation and load balancing. In this work, we propose a novel approach to improve the performance of sparse Cholesky by combining deep learning with a relaxation parameter and a hierarchical parallelization strategy with NUMA affinity. Specifically, our relaxed supernodal algorithm utilizes a well-trained GCN model to adaptively adjust relaxation parameters based on the sparse matrix’s structure, achieving a proper balance between task-level parallelism and dense computational granularity. Additionally, the hierarchical parallelization maps supernodal tasks to the local NUMA parallel queue and updates contribution blocks in pipeline mode. Furthermore, the stream scheduling with NUMA affinity can further enhance the efficiency of memory access during the numerical factorization. The experimental results show that HPS Cholesky can outperform state-of-the-art libraries, such as Eigen LL T , CHOLMOD, PaStiX and SuiteSparse on (79.78% ) , (79.60% ) , (82.09% ) and (74.47% ) of 1128 datasets. It achieves an average speedup of 1.41x over the current optimal relaxation algorithm. Moreover, (70.83% ) of matrices have surpassed MKL sparse Cholesky on Xeon Gold 6248.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134907691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models","authors":"Lucas Perotin, Hongyang Sun","doi":"10.1145/3630052","DOIUrl":"https://doi.org/10.1145/3630052","url":null,"abstract":"We consider the online scheduling problem of moldable task graphs on multiprocessor systems for minimizing the overall completion time (or makespan). Moldable job scheduling has been widely studied in the literature, in particular when tasks have dependencies (i.e., task graphs) or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online scheduling algorithm for this problem and derive constant competitive ratios under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). These results improve the ones we have shown in the preliminary version of the paper. We also prove, for each speedup model, a lower bound on the competitiveness of any online list scheduling algorithm that allocates processors to a task based only on the task’s parameters and not on its position in the graph. This lower bound matches exactly the competitive ratio of our algorithm for the roofline, communication and Amdahl’s model, and is close to the ratio for the general model. Finally, we provide a lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134908046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien
{"title":"Checkpointing strategies to tolerate non-memoryless failures on HPC platforms","authors":"Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien","doi":"10.1145/3624560","DOIUrl":"https://doi.org/10.1145/3624560","url":null,"abstract":"This paper studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure IATs obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor 1.9 on average, and up to a factor 4.2 for recently deployed platforms.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136015029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Graph Coloring Made Easy","authors":"Yannic Maus","doi":"10.1145/3605896","DOIUrl":"https://doi.org/10.1145/3605896","url":null,"abstract":"In this paper, we present a deterministic (mathsf {CONGEST} ) algorithm to compute an O(kΔ)-vertex coloring in O(Δ/k) + log *n rounds, where Δ is the maximum degree of the network graph and k ≥ 1 can be freely chosen. The algorithm is extremely simple: Each node locally computes a sequence of colors and then it tries colors from the sequence in batches of size k. Our algorithm subsumes many important results in the history of distributed graph coloring as special cases, including Linial’s color reduction [Linial, FOCS’87], the celebrated locally iterative algorithm from [Barenboim, Elkin, Goldenberg, PODC’18], and various algorithms to compute defective and arbdefective colorings. Our algorithm can smoothly scale between several of these previous results and also simplifies the state of the art (Δ + 1)-coloring algorithm. At the cost of losing some of the algorithm’s simplicity we also provide a O(kΔ)-coloring algorithm in (O(sqrt {Delta /k})+log ^{*} n ) rounds. We also provide improved deterministic algorithms for ruling sets, and, additionally, we provide a tight characterization for 1-round color reduction algorithms.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"1 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42561903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu
{"title":"A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms","authors":"Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu","doi":"10.1145/3606338","DOIUrl":"https://doi.org/10.1145/3606338","url":null,"abstract":"Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 107 cells for around 105 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3 × to 8.5 × faster for aperiodic stencil problems. Code Repository: https://github.com/TEAlab/FFTStencils","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43986447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks","authors":"Sanjoy Baruah, A. Marchetti-Spaccamela","doi":"10.1145/3606342","DOIUrl":"https://doi.org/10.1145/3606342","url":null,"abstract":"The Conditional DAG (CDAG) task model is used for modeling multiprocessor real-time systems containing conditional expressions for which outcomes are not known prior to their evaluation. Feasibility analysis for CDAG tasks upon multiprocessor platforms is shown to be complete for the complexity class pspace; assuming np ≠ pspace, this result rules out the use of Integer Linear Programming solvers for solving this problem efficiently. It is further shown that there can be no pseudo-polynomial time algorithm that solves this problem unless p = pspace.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 22"},"PeriodicalIF":1.6,"publicationDate":"2023-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48543350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for Right-Sizing Heterogeneous Data Centers","authors":"S. Albers, Jens Quedenfeld","doi":"10.1145/3595286","DOIUrl":"https://doi.org/10.1145/3595286","url":null,"abstract":"Power consumption is a dominant and still growing cost factor in data centers. In time periods with low load, the energy consumption can be reduced by powering down unused servers. We resort to a model introduced by Lin, Wierman, Andrew and Thereska [23, 24] that considers data centers with identical machines, and generalize it to heterogeneous data centers with d different server types. The operating cost of a server depends on its load and is modeled by an increasing, convex function for each server type. In contrast to earlier work, we consider the discrete setting, where the number of active servers must be integral. Thereby, we seek truly feasible solutions. For homogeneous data centers (d = 1), both the offline and the online problem were solved optimally in [3, 4]. In this paper, we study heterogeneous data centers with general time-dependent operating cost functions. We develop an online algorithm based on a work function approach which achieves a competitive ratio of 2d + 1 + ϵ for any ϵ > 0. For time-independent operating cost functions, the competitive ratio can be reduced to 2d + 1. There is a lower bound of 2d shown in [5], so our algorithm is nearly optimal. For the offline version, we give a graph-based (1 + ϵ)-approximation algorithm. Additionally, our offline algorithm is able to handle time-variable data-center sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44289659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit
{"title":"Non-Clairvoyant Scheduling with Predictions","authors":"Sungjin Im, Ravi Kumar, Mahshid Montazer Qaem, Manish Purohit","doi":"10.1145/3593969","DOIUrl":"https://doi.org/10.1145/3593969","url":null,"abstract":"In the single-machine non-clairvoyant scheduling problem, the goal is to minimize the total completion time of jobs whose processing times are unknown a priori. We revisit this well-studied problem and consider the question of how to effectively use (possibly erroneous) predictions of the processing times. We study this question from ground zero by first asking what constitutes a good prediction; we then propose a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure. Our approach to derive a prediction error measure based on natural desiderata could find applications for other online problems.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2023-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44598514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}