ACM Transactions on Parallel Computing最新文献

筛选
英文 中文
Indigo3: A Parallel Graph Analytics Benchmark Suite for Exploring Implementation Styles and Common Bugs Indigo3:用于探索实现方式和常见错误的并行图分析基准套件
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2024-05-15 DOI: 10.1145/3665251
Yiqian Liu, Noushin Azami, Avery VanAusdal, Martin Burtscher
{"title":"Indigo3: A Parallel Graph Analytics Benchmark Suite for Exploring Implementation Styles and Common Bugs","authors":"Yiqian Liu, Noushin Azami, Avery VanAusdal, Martin Burtscher","doi":"10.1145/3665251","DOIUrl":"https://doi.org/10.1145/3665251","url":null,"abstract":"Graph analytics codes are widely used and tend to exhibit input-dependent behavior, making them particularly interesting for software verification and validation. This paper presents Indigo3, a labeled benchmark suite based on 7 graph algorithms that are implemented in different styles, including versions with deliberately planted bugs. We systematically combine 13 sets of implementation styles and 15 common bug types to create the 41,790 CUDA, OpenMP, and parallel C programs in the suite. Each code is labeled with the styles and bugs it incorporates. We used 4 subsets of Indigo3 to test 5 program-verification tools. Our results show that the tools perform quite differently across the bug types and implementation styles, have distinct strengths and weaknesses, and generally struggle with graph codes. We discuss the styles and bugs that tend to be the most challenging as well as the programming patterns that yield false positives.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140973846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
TLPGNN : A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs TLPGNN:在单个和多个 GPU 上进行图神经网络计算的轻量级两级并行范式
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2024-02-09 DOI: 10.1145/3644712
Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang
{"title":"TLPGNN\u0000 : A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs","authors":"Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang","doi":"10.1145/3644712","DOIUrl":"https://doi.org/10.1145/3644712","url":null,"abstract":"\u0000 Graph Neural Networks (GNNs) are an emerging class of deep learning models specifically designed for graph-structured data. They have been effectively employed in a variety of real-world applications, including recommendation systems, drug development, and analysis of social networks. The GNN computation includes regular neural network operations and general graph convolution operations, which take most of the total computation time. Though several recent works have been proposed to accelerate the computation for GNNs, they face the limitations of heavy pre-processing, low efficiency atomic operations, and unnecessary kernel launches. In this paper, we design\u0000 TLPGNN\u0000 , a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand the characteristics of GNN workloads deeply. With the insightful observations, we then divide the GNN computation into two levels, i.e.,\u0000 vertex parallelism\u0000 for the first level and\u0000 feature parallelism\u0000 for the second. Next, we employ a novel hybrid dynamic workload assignment to address the imbalanced workload distribution. Furthermore, we fuse the kernels to reduce the number of kernel launches and cache the frequently accessed data into registers to avoid unnecessary memory traffic. To scale\u0000 TLPGNN\u0000 to multi-GPU environments, we propose an edge-aware row-wise 1-D partition method to ensure a balanced workload distribution across different GPU devices. Experimental results on various benchmark datasets demonstrate the superiority of our approach, achieving substantial performance improvement over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with speedups of 6.1 ×, 7.7 ×, and 3.0 ×, respectively, on average. Evaluations of multiple-GPU\u0000 TLPGNN\u0000 also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.\u0000","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139848851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
TLPGNN : A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs TLPGNN:在单个和多个 GPU 上进行图神经网络计算的轻量级两级并行范式
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2024-02-09 DOI: 10.1145/3644712
Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang
{"title":"TLPGNN\u0000 : A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs","authors":"Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang","doi":"10.1145/3644712","DOIUrl":"https://doi.org/10.1145/3644712","url":null,"abstract":"\u0000 Graph Neural Networks (GNNs) are an emerging class of deep learning models specifically designed for graph-structured data. They have been effectively employed in a variety of real-world applications, including recommendation systems, drug development, and analysis of social networks. The GNN computation includes regular neural network operations and general graph convolution operations, which take most of the total computation time. Though several recent works have been proposed to accelerate the computation for GNNs, they face the limitations of heavy pre-processing, low efficiency atomic operations, and unnecessary kernel launches. In this paper, we design\u0000 TLPGNN\u0000 , a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand the characteristics of GNN workloads deeply. With the insightful observations, we then divide the GNN computation into two levels, i.e.,\u0000 vertex parallelism\u0000 for the first level and\u0000 feature parallelism\u0000 for the second. Next, we employ a novel hybrid dynamic workload assignment to address the imbalanced workload distribution. Furthermore, we fuse the kernels to reduce the number of kernel launches and cache the frequently accessed data into registers to avoid unnecessary memory traffic. To scale\u0000 TLPGNN\u0000 to multi-GPU environments, we propose an edge-aware row-wise 1-D partition method to ensure a balanced workload distribution across different GPU devices. Experimental results on various benchmark datasets demonstrate the superiority of our approach, achieving substantial performance improvement over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with speedups of 6.1 ×, 7.7 ×, and 3.0 ×, respectively, on average. Evaluations of multiple-GPU\u0000 TLPGNN\u0000 also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.\u0000","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139788790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Introduction to the Special Issue for SPAA’21 SPAA'21 特刊简介
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-12-14 DOI: 10.1145/3630608
Y. Azar, Julian Shun
{"title":"Introduction to the Special Issue for SPAA’21","authors":"Y. Azar, Julian Shun","doi":"10.1145/3630608","DOIUrl":"https://doi.org/10.1145/3630608","url":null,"abstract":"","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139001830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Conflict-Resilient Lock-Free Linearizable Calendar Queue 具有冲突恢复能力的无锁线性日历队列
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-12-06 DOI: 10.1145/3635163
Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia
{"title":"A Conflict-Resilient Lock-Free Linearizable Calendar Queue","authors":"Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia","doi":"10.1145/3635163","DOIUrl":"https://doi.org/10.1145/3635163","url":null,"abstract":"In the last two decades, great attention has been devoted to the design of non-blocking and linearizable data structures, which enable exploiting the scaled-up degree of parallelism in off-the-shelf shared-memory multi-core machines. In this context, priority queues are highly challenging. Indeed, concurrent attempts to extract the highest-priority item are prone to create detrimental thread conflicts that lead to abort/retry of the operations. In this article, we present the first priority queue that jointly provides: i) lock-freedom and linearizability; ii) conflict resiliency against concurrent extractions; iii) adaptiveness to different contention profiles; and iv) amortized constant-time access for both insertions and extractions. Beyond presenting our solution, we also provide proof of its correctness based on an assertional approach. Also, we present an experimental study on a 64-CPU machine, showing that our proposal provides performance improvements over state-of-the-art non-blocking priority queues.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138595960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters HPS choolesky:自适应参数的分层并行超节点choolesky
ACM Transactions on Parallel Computing Pub Date : 2023-10-26 DOI: 10.1145/3630051
Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li
{"title":"HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters","authors":"Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li","doi":"10.1145/3630051","DOIUrl":"https://doi.org/10.1145/3630051","url":null,"abstract":"Sparse supernodal Cholesky on multi-NUMAs is challenging due to the supernode relaxation and load balancing. In this work, we propose a novel approach to improve the performance of sparse Cholesky by combining deep learning with a relaxation parameter and a hierarchical parallelization strategy with NUMA affinity. Specifically, our relaxed supernodal algorithm utilizes a well-trained GCN model to adaptively adjust relaxation parameters based on the sparse matrix’s structure, achieving a proper balance between task-level parallelism and dense computational granularity. Additionally, the hierarchical parallelization maps supernodal tasks to the local NUMA parallel queue and updates contribution blocks in pipeline mode. Furthermore, the stream scheduling with NUMA affinity can further enhance the efficiency of memory access during the numerical factorization. The experimental results show that HPS Cholesky can outperform state-of-the-art libraries, such as Eigen LL T , CHOLMOD, PaStiX and SuiteSparse on (79.78% ) , (79.60% ) , (82.09% ) and (74.47% ) of 1128 datasets. It achieves an average speedup of 1.41x over the current optimal relaxation algorithm. Moreover, (70.83% ) of matrices have surpassed MKL sparse Cholesky on Xeon Gold 6248.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134907691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models 常用加速模型下可塑任务图的改进在线调度
ACM Transactions on Parallel Computing Pub Date : 2023-10-26 DOI: 10.1145/3630052
Lucas Perotin, Hongyang Sun
{"title":"Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models","authors":"Lucas Perotin, Hongyang Sun","doi":"10.1145/3630052","DOIUrl":"https://doi.org/10.1145/3630052","url":null,"abstract":"We consider the online scheduling problem of moldable task graphs on multiprocessor systems for minimizing the overall completion time (or makespan). Moldable job scheduling has been widely studied in the literature, in particular when tasks have dependencies (i.e., task graphs) or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online scheduling algorithm for this problem and derive constant competitive ratios under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). These results improve the ones we have shown in the preliminary version of the paper. We also prove, for each speedup model, a lower bound on the competitiveness of any online list scheduling algorithm that allocates processors to a task based only on the task’s parameters and not on its position in the graph. This lower bound matches exactly the competitive ratio of our algorithm for the roofline, communication and Amdahl’s model, and is close to the ratio for the general model. Finally, we provide a lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134908046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Checkpointing strategies to tolerate non-memoryless failures on HPC platforms 在HPC平台上容忍非无内存故障的检查点策略
ACM Transactions on Parallel Computing Pub Date : 2023-09-22 DOI: 10.1145/3624560
Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien
{"title":"Checkpointing strategies to tolerate non-memoryless failures on HPC platforms","authors":"Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien","doi":"10.1145/3624560","DOIUrl":"https://doi.org/10.1145/3624560","url":null,"abstract":"This paper studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure IATs obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with a high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor 1.9 on average, and up to a factor 4.2 for recently deployed platforms.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136015029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Distributed Graph Coloring Made Easy 分布式图形着色变得容易
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-08-17 DOI: 10.1145/3605896
Yannic Maus
{"title":"Distributed Graph Coloring Made Easy","authors":"Yannic Maus","doi":"10.1145/3605896","DOIUrl":"https://doi.org/10.1145/3605896","url":null,"abstract":"In this paper, we present a deterministic (mathsf {CONGEST} ) algorithm to compute an O(kΔ)-vertex coloring in O(Δ/k) + log *n rounds, where Δ is the maximum degree of the network graph and k ≥ 1 can be freely chosen. The algorithm is extremely simple: Each node locally computes a sequence of colors and then it tries colors from the sequence in batches of size k. Our algorithm subsumes many important results in the history of distributed graph coloring as special cases, including Linial’s color reduction [Linial, FOCS’87], the celebrated locally iterative algorithm from [Barenboim, Elkin, Goldenberg, PODC’18], and various algorithms to compute defective and arbdefective colorings. Our algorithm can smoothly scale between several of these previous results and also simplifies the state of the art (Δ + 1)-coloring algorithm. At the cost of losing some of the algorithm’s simplicity we also provide a O(kΔ)-coloring algorithm in (O(sqrt {Delta /k})+log ^{*} n ) rounds. We also provide improved deterministic algorithms for ruling sets, and, additionally, we provide a tight characterization for 1-round color reduction algorithms.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42561903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms 基于快速傅里叶变换的非周期线性模板计算快速算法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2023-07-24 DOI: 10.1145/3606338
Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu
{"title":"A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms","authors":"Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu","doi":"10.1145/3606338","DOIUrl":"https://doi.org/10.1145/3606338","url":null,"abstract":"Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver which is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 107 cells for around 105 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3 × to 8.5 × faster for aperiodic stencil problems. Code Repository: https://github.com/TEAlab/FFTStencils","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43986447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信