ACM Transactions on Parallel Computing: Latest Articles

Dynamic Buffer Management in Massively Parallel Systems: The Power of Randomness

IF 0.9

ACM Transactions on Parallel Computing Pub Date : 2025-03-01 Epub Date: 2025-02-11 DOI: 10.1145/3701623

Minh Pham, Yongke Yuan, Hao Li, Chengcheng Mou, Yicheng Tu, Zichen Xu, Jinghan Meng

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11841858/pdf/

Abstract: Massively parallel systems, such as Graphics Processing Units (GPUs), play an increasingly crucial role in today's data-intensive computing. The unique challenges associated with developing system software for massively parallel hardware to support numerous parallel threads efficiently are of paramount importance. One such challenge is the design of a dynamic memory allocator to allocate memory at runtime. Traditionally, memory allocators have relied on maintaining a global data structure, such as a queue of free pages. However, in the context of massively parallel systems, accessing such global data structures can quickly become a bottleneck even with multiple queues in place. This paper presents a novel approach to dynamic memory allocation that eliminates the need for a centralized data structure. Our proposed approach revolves around letting threads employ random search procedures to locate free pages. Through mathematical proofs and extensive experiments, we demonstrate that the basic random search design achieves lower latency than the best-known existing solution in most situations. Furthermore, we develop more advanced techniques and algorithms to tackle the challenge of warp divergence and further enhance performance when free memory is limited. Building upon these advancements, our mathematical proofs and experimental results affirm that these advanced designs can yield an order of magnitude improvement over the basic design and consistently outperform the state-of-the-art by up to two orders of magnitude. To illustrate the practical implications of our work, we integrate our memory management techniques into two GPU algorithms: a hash join and a group-by. Both case studies provide compelling evidence of our approach's pronounced performance gains.
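The basic random-search idea described in the abstract above can be illustrated with a small sketch. This is a toy, single-threaded Python model and not the authors' GPU implementation: the class name `RandomPageAllocator`, the `max_probes` cutoff, and the per-page boolean flag are all assumptions made for illustration, and `try_claim` merely stands in for what would be an atomic compare-and-swap on real parallel hardware.

```python
import random


class RandomPageAllocator:
    """Toy model: find free pages by random probing, with no free list or queue."""

    def __init__(self, num_pages, seed=0):
        # One "in use" flag per page; this is the only shared state.
        self.used = [False] * num_pages
        self.rng = random.Random(seed)

    def try_claim(self, page):
        # Stand-in for an atomic compare-and-swap on the page's flag.
        if self.used[page]:
            return False
        self.used[page] = True
        return True

    def allocate(self, max_probes=10_000):
        """Probe uniformly random pages until one is claimed.

        Returns (page_index, number_of_probes).
        """
        for probes in range(1, max_probes + 1):
            page = self.rng.randrange(len(self.used))
            if self.try_claim(page):
                return page, probes
        raise MemoryError("no free page found within max_probes")

    def free(self, page):
        self.used[page] = False


if __name__ == "__main__":
    alloc = RandomPageAllocator(num_pages=1024)
    page, probes = alloc.allocate()
    print(f"claimed page {page} after {probes} probe(s)")
```

If a fraction f of the pages is free and probes are independent and uniform, the number of probes per allocation is geometrically distributed with mean 1/f. That is cheap while memory is plentiful and grows as free memory runs out, which is consistent with the abstract's remark that more advanced designs are needed when free memory is limited.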

Citations: 0

Introduction to the Special Issue for SPAA’21

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-12-14 DOI: 10.1145/3630608

Y. Azar, Julian Shun

Citations: 0

A Conflict-Resilient Lock-Free Linearizable Calendar Queue

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-12-06 DOI: 10.1145/3635163

Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia

Citations: 0

HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters

ACM Transactions on Parallel Computing Pub Date : 2023-10-26 DOI: 10.1145/3630051

Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li

Abstract: Sparse supernodal Cholesky on multi-NUMAs is challenging due to the supernode relaxation and load balancing. In this work, we propose a novel approach to improve the performance of sparse Cholesky by combining deep learning with a relaxation parameter and a hierarchical parallelization strategy with NUMA affinity. Specifically, our relaxed supernodal algorithm utilizes a well-trained GCN model to adaptively adjust relaxation parameters based on the sparse matrix’s structure, achieving a proper balance between task-level parallelism and dense computational granularity. Additionally, the hierarchical parallelization maps supernodal tasks to the local NUMA parallel queue and updates contribution blocks in pipeline mode. Furthermore, the stream scheduling with NUMA affinity can further enhance the efficiency of memory access during the numerical factorization. The experimental results show that HPS Cholesky can outperform state-of-the-art libraries, such as Eigen LL^T, CHOLMOD, PaStiX and SuiteSparse on 79.78%, 79.60%, 82.09% and 74.47% of 1128 datasets. It achieves an average speedup of 1.41x over the current optimal relaxation algorithm. Moreover, 70.83% of matrices have surpassed MKL sparse Cholesky on Xeon Gold 6248.

Citations: 0

Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models

ACM Transactions on Parallel Computing Pub Date : 2023-10-26 DOI: 10.1145/3630052

Lucas Perotin, Hongyang Sun

Abstract: We consider the online scheduling problem of moldable task graphs on multiprocessor systems for minimizing the overall completion time (or makespan). Moldable job scheduling has been widely studied in the literature, in particular when tasks have dependencies (i.e., task graphs) or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online scheduling algorithm for this problem and derive constant competitive ratios under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). These results improve the ones we have shown in the preliminary version of the paper. We also prove, for each speedup model, a lower bound on the competitiveness of any online list scheduling algorithm that allocates processors to a task based only on the task’s parameters and not on its position in the graph. This lower bound matches exactly the competitive ratio of our algorithm for the roofline, communication and Amdahl’s model, and is close to the ratio for the general model. Finally, we provide a lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.
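The abstract above names four speedup models without defining them. As a hedged illustration, the sketch below writes out the forms in which these models commonly appear in the moldable-scheduling literature; the exact definitions used in the paper may differ. The parameter names (`w` for total work, `p` for allocated processors, `p_max` for the maximum useful parallelism, `d` for sequential overhead, `c` for per-processor communication cost) are assumptions chosen for this example.

```python
def roofline_time(w, p, p_max):
    """Roofline: perfect speedup up to p_max processors, flat afterwards."""
    return w / min(p, p_max)


def communication_time(w, p, c):
    """Communication: perfect speedup plus an overhead growing with p."""
    return w / p + c * (p - 1)


def amdahl_time(w, p, d):
    """Amdahl: parallel work w divides across p; sequential part d does not."""
    return w / p + d


def general_time(w, p, p_max, d, c):
    """General combination of the three effects above."""
    return w / min(p, p_max) + d + c * (p - 1)


if __name__ == "__main__":
    # Execution time of one task with w = 100 on 1..16 processors.
    for p in (1, 2, 4, 8, 16):
        print(p, general_time(100.0, p, p_max=8, d=5.0, c=1.0))
```

Under the communication and general forms, adding processors eventually increases execution time, so an allocation rule has to trade a task's running time against the processors it occupies. The lower bound mentioned in the abstract concerns list schedulers that make this choice from the task's own parameters only.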

Citations: 0

Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

ACM Transactions on Parallel Computing Pub Date : 2023-09-22 DOI: 10.1145/3624560

Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien

Citations: 0

Distributed Graph Coloring Made Easy

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-08-17 DOI: 10.1145/3605896

Yannic Maus

Citations: 0

A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-07-24 DOI: 10.1145/3606338

Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu

Abstract: Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems.

Code Repository: https://github.com/TEAlab/FFTStencils
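To make the phrase "evolving the initial data for many timesteps at once" concrete, here is a minimal sketch of the periodic case only, for a one-dimensional three-point linear stencil. It is not the paper's algorithm or code: it relies on the standard fact that one step of a periodic linear stencil is a circular convolution, so T steps correspond to raising the kernel's transform to the power T. The function names, the weights `left`, `center`, `right`, and the simple recursive radix-2 FFT (grid length must be a power of two) are assumptions for illustration; the aperiodic case, which is the paper's main contribution, is not shown.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.

    The inverse transform is left unnormalized (caller divides by n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def step_direct(u, left, center, right):
    """One timestep of a periodic 3-point linear stencil, Theta(N) work."""
    n = len(u)
    return [left * u[(i - 1) % n] + center * u[i] + right * u[(i + 1) % n]
            for i in range(n)]


def evolve_fft(u, left, center, right, steps):
    """All `steps` timesteps at once via the convolution theorem."""
    n = len(u)
    # Kernel k with (k * u)[i] = sum_m k[m] * u[(i - m) mod n].
    kernel = [0.0] * n
    kernel[0] += center
    kernel[1 % n] += left         # weight applied to u[i - 1]
    kernel[(n - 1) % n] += right  # weight applied to u[i + 1]
    K = fft(kernel)
    U = fft(u)
    V = [U[k] * K[k] ** steps for k in range(n)]
    return [v.real / n for v in fft(V, inverse=True)]


if __name__ == "__main__":
    u0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(evolve_fft(u0, 0.25, 0.5, 0.25, steps=5))
```

The direct loop costs Θ(N) per timestep and therefore Θ(NT) overall, whereas this frequency-domain version needs three transforms of O(N log N) each plus N exponentiations, independent of stepping through the T timesteps one by one.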

Citations: 0

The Computational Complexity of Feasibility Analysis for Conditional DAG Tasks

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-07-05 DOI: 10.1145/3606342

Sanjoy Baruah, A. Marchetti-Spaccamela

Citations: 0

Algorithms for Right-Sizing Heterogeneous Data Centers

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-05-10 DOI: 10.1145/3595286

S. Albers, Jens Quedenfeld

Abstract: Power consumption is a dominant and still growing cost factor in data centers. In time periods with low load, the energy consumption can be reduced by powering down unused servers. We resort to a model introduced by Lin, Wierman, Andrew and Thereska [23, 24] that considers data centers with identical machines, and generalize it to heterogeneous data centers with d different server types. The operating cost of a server depends on its load and is modeled by an increasing, convex function for each server type. In contrast to earlier work, we consider the discrete setting, where the number of active servers must be integral. Thereby, we seek truly feasible solutions. For homogeneous data centers (d = 1), both the offline and the online problem were solved optimally in [3, 4]. In this paper, we study heterogeneous data centers with general time-dependent operating cost functions. We develop an online algorithm based on a work function approach which achieves a competitive ratio of 2d + 1 + ϵ for any ϵ > 0. For time-independent operating cost functions, the competitive ratio can be reduced to 2d + 1. There is a lower bound of 2d shown in [5], so our algorithm is nearly optimal. For the offline version, we give a graph-based (1 + ϵ)-approximation algorithm. Additionally, our offline algorithm is able to handle time-variable data-center sizes.

Citations: 0