ACM Transactions on Parallel Computing最新文献_第5页

Adaptive Erasure Coded Fault Tolerant Linear System Solver 自适应擦除编码容错线性系统求解器

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-12-08 DOI: 10.1145/3490557

X. Kang, D. Gleich, A. Sameh, A. Grama

{"title":"Adaptive Erasure Coded Fault Tolerant Linear System Solver","authors":"X. Kang, D. Gleich, A. Sameh, A. Grama","doi":"10.1145/3490557","DOIUrl":"https://doi.org/10.1145/3490557","url":null,"abstract":"As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 19"},"PeriodicalIF":1.6,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45097716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery 面向层次密集子图发现的二部网络并行剥离

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-10-24 DOI: 10.1145/3583084

Kartik Lakhotia, R. Kannan, V. Prasanna

{"title":"Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery","authors":"Kartik Lakhotia, R. Kannan, V. Prasanna","doi":"10.1145/3583084","DOIUrl":"https://doi.org/10.1145/3583084","url":null,"abstract":"Wing and Tip decomposition are motif-based analytics for bipartite graphs that construct a hierarchy of butterfly (2,2-biclique) dense edge and vertex induced subgraphs, respectively. They have applications in several domains, including e-commerce, recommendation systems, document analysis, and others. Existing decomposition algorithms use a bottom-up approach that constructs the hierarchy in an increasing order of the subgraph density. They iteratively select the edges or vertices with minimum butterfly count peel, i.e., remove them along with their butterflies. The amount of butterflies in real-world bipartite graphs makes bottom-up peeling computationally demanding. Furthermore, the strict order of peeling entities results in a large number of sequentially dependent iterations. Consequently, parallel algorithms based on bottom up peeling incur heavy synchronization and poor scalability. In this article, we propose a novel Parallel Bipartite Network peelinG (PBNG) framework that adopts a two-phased peeling approach to relax the order of peeling, and in turn, dramatically reduce synchronization. The first phase divides the decomposition hierarchy into few partitions and requires little synchronization. The second phase concurrently processes all partitions to generate individual levels of the hierarchy and requires no global synchronization. The two-phased peeling further enables batching optimizations that dramatically improve the computational efficiency of PBNG. We empirically evaluate PBNG using several real-world bipartite graphs and demonstrate radical improvements over the existing approaches. On a shared-memory 36 core server, PBNG achieves up to 19.7× self-relative parallel speedup. Compared to the state-of-the-art parallel framework ParButterfly, PBNG reduces synchronization by up to 15,260× and execution time by up to 295×. Furthermore, it achieves up to 38.5× speedup over state-of-the-art algorithms specifically tuned for wing decomposition. Our source code is made available at https://github.com/kartiklakhotia/RECEIPT.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 35"},"PeriodicalIF":1.6,"publicationDate":"2021-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44283113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Metrics and Design of an Instruction Roofline Model for AMD GPUs AMD GPU指令屋顶线模型的度量与设计

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-10-15 DOI: 10.1145/3505285

M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran

{"title":"Metrics and Design of an Instruction Roofline Model for AMD GPUs","authors":"M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran","doi":"10.1145/3505285","DOIUrl":"https://doi.org/10.1145/3505285","url":null,"abstract":"Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Due to the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case study scientific application, PIConGPU, an open source particle-in-cell simulations application used for plasma and laser-plasma physics on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU we find that although the AMD MI100 GPU achieves a similar, or better, execution time compared to the NVIDIA V100 GPU, profiling tool differences make comparing performance of these two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance out of the three GPUs used in this work.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 14"},"PeriodicalIF":1.6,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41415302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Introduction to the Special Issue for SPAA 2019 SPAA 2019特刊简介

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3477610

P. Berenbrink

引用次数: 0

Study of Fine-grained Nested Parallelism in CDCL SAT Solvers CDCL SAT解算器中细粒度嵌套并行性的研究

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3470639

J. Edwards, U. Vishkin

引用次数: 3

Massively Parallel Computation via Remote Memory Access 基于远程内存访问的大规模并行计算

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3470631

Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Lacki, V. Mirrokni, W. Schudy

{"title":"Massively Parallel Computation via Remote Memory Access","authors":"Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Lacki, V. Mirrokni, W. Schudy","doi":"10.1145/3470631","DOIUrl":"https://doi.org/10.1145/3470631","url":null,"abstract":"We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms [8, 30] using MapReduce and a distributed hash table service [17]. This extension allows us to give new graph algorithms with much lower round complexities compared to the best-known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in O(1) rounds and connectivity/minimum spanning tree in O(log logm/n n rounds both using O(nδ) space per machine for constant δ < 1. In the same memory regime for MPC, the best-known algorithms for these problems require poly log n rounds. Our results imply that the 2-CYCLE conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2021-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45449480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

Efficient Parallel 3D Computation of the Compressible Euler Equations with an Invariant-domain Preserving Second-order Finite-element Scheme 具有保持不变域的二阶有限元格式的可压缩欧拉方程的高效并行三维计算

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3470637

M. Maier, M. Kronbichler

引用次数: 15

External-memory Dictionaries in the Affine and PDAM Models 仿射和PDAM模型中的外部内存字典

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3470635

M. A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, P. Pandey, Donald E. Porter, Jun Yuan, Yang Zhan

{"title":"External-memory Dictionaries in the Affine and PDAM Models","authors":"M. A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, P. Pandey, Donald E. Porter, Jun Yuan, Yang Zhan","doi":"10.1145/3470635","DOIUrl":"https://doi.org/10.1145/3470635","url":null,"abstract":"Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bɛ-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bɛ-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and Bɛ-trees. Furthermore, the models predict that B-trees are highly sensitive to variations in the node size, whereas Bɛ-trees are much less sensitive. These predictions are born out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, Bɛ-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bɛ-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently. We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 20"},"PeriodicalIF":1.6,"publicationDate":"2021-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48435982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Constant-Length Labeling Schemes for Deterministic Radio Broadcast 确定性无线电广播的定长标记方案

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-09-20 DOI: 10.1145/3470633

Faith Ellen, B. Gorain, Avery Miller, A. Pelc

{"title":"Constant-Length Labeling Schemes for Deterministic Radio Broadcast","authors":"Faith Ellen, B. Gorain, Avery Miller, A. Pelc","doi":"10.1145/3470633","DOIUrl":"https://doi.org/10.1145/3470633","url":null,"abstract":"Broadcast is one of the fundamental network communication primitives. One node of a network, called the source, has a message that has to be learned by all other nodes. We consider broadcast in radio networks, modeled as simple undirected connected graphs with a distinguished source. Nodes communicate in synchronous rounds. In each round, a node can either transmit a message to all its neighbours, or stay silent and listen. At the receiving end, a node v hears a message from a neighbour w in a given round if v listens in this round and if w is its only neighbour that transmits in this round. If more than one neighbour of a node v transmits in a given round, we say that a collision occurs at v. We do not assume collision detection: in case of a collision, node v does not hear anything (except the background noise that it also hears when no neighbour transmits). We are interested in the feasibility of deterministic broadcast in radio networks. If nodes of the network do not have any labels, deterministic broadcast is impossible even in the four-cycle. On the other hand, if all nodes have distinct labels, then broadcast can be carried out, e.g., in a round-robin fashion, and hence O(log n)-bit labels are sufficient for this task in n-node networks. In fact, O(log Δ)-bit labels, where Δ is the maximum degree, are enough to broadcast successfully. Hence, it is natural to ask if very short labels are sufficient for broadcast. Our main result is a positive answer to this question. We show that every radio network can be labeled using 2 bits in such a way that broadcast can be accomplished by some universal deterministic algorithm that does not know the network topology nor any bound on its size. Moreover, at the expense of an extra bit in the labels, we can get the following additional strong property of our algorithm: there exists a common round in which all nodes know that broadcast has been completed. Finally, we show that 3-bit labels are also sufficient to solve both versions of broadcast in the case where it is not known a priori which node is the source.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 17"},"PeriodicalIF":1.6,"publicationDate":"2021-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44815158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Bandwidth-Optimal Random Shuffling for GPUs gpu的带宽优化随机洗牌

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2021-06-11 DOI: 10.1145/3505287

Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes

{"title":"Bandwidth-Optimal Random Shuffling for GPUs","authors":"Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes","doi":"10.1145/3505287","DOIUrl":"https://doi.org/10.1145/3505287","url":null,"abstract":"Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed “bijective shuffle” trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and only requires a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 20"},"PeriodicalIF":1.6,"publicationDate":"2021-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46192279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3