ACM Transactions on Parallel Computing最新文献

筛选
英文 中文
Joinable Parallel Balanced Binary Trees 可连接并行平衡二叉树
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512769
G. Blelloch, Daniel Ferizovic, Yihan Sun
{"title":"Joinable Parallel Balanced Binary Trees","authors":"G. Blelloch, Daniel Ferizovic, Yihan Sun","doi":"10.1145/3512769","DOIUrl":"https://doi.org/10.1145/3512769","url":null,"abstract":"In this article, we show how a single function, join, can be used to implement parallel balanced binary search trees (BSTs) simply and efficiently. Based on join , our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework called join-based algorithms. We show that the join function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to as joinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, including insert , delete , union , intersection , difference , split , range , filter , and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operations union , intersection , and difference have work ( O(mlog (frac{n}{m}+1)) ) and polylogarithmic span for input set sizes ( n ) and ( mle n ) . We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73 ( times ) speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 41"},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48619182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms fgSpMSpV: HPC平台上的细粒度并行SpMSpV框架
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512770
Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya
{"title":"fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms","authors":"Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya","doi":"10.1145/3512770","DOIUrl":"https://doi.org/10.1145/3512770","url":null,"abstract":"Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on Sunway TaihuLight supercomputer to alleviate the challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI ( + ) OpenMP ( +X ) parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-/post-processing and main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing, adapt to the parallelism and memory hierarchy of the Sunway system, while still tame redundant and random memory accesses in SpMSpV, including a set of techniques like the fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques with various sparsity of the input. Additionally, fgSpMSpV is implemented on an NVIDIA Tesal P100 GPU and applied to the breath-first-search (BFS) application. fgSpMSpV on a P100 GPU obtains the speedup of up to ( 134.38times ) over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves the speedup of up to ( 21.68times ) over the state-of-the-arts.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 29"},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48217478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Fast Concurrent Data Sketches 快速并发数据草图
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512758
Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky
{"title":"Fast Concurrent Data Sketches","authors":"Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky","doi":"10.1145/3512758","DOIUrl":"https://doi.org/10.1145/3512758","url":null,"abstract":"Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability, we prove our algorithm’s correctness and analyse the error it induces in some specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 35"},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45254851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Chronic giant cranial diploe hematoma in hemophiliac. 血友病患者的慢性巨型头颅二叶血肿。
IF 1
ACM Transactions on Parallel Computing Pub Date : 2022-03-30 DOI: 10.1055/a-1813-0090
Weizhao Gong, Hanshi Wang, Taipeng Jiang, Dahui Zuo
{"title":"Chronic giant cranial diploe hematoma in hemophiliac.","authors":"Weizhao Gong, Hanshi Wang, Taipeng Jiang, Dahui Zuo","doi":"10.1055/a-1813-0090","DOIUrl":"10.1055/a-1813-0090","url":null,"abstract":"<p><p>Cranial diploe hematoma is a hematoma that occurs between the inner and outer layer of the skull and is often in infants and young children. Hemophilia A is an inherited X-linked bleeding disorder caused by a deficiency of coagulation factor VIII (FVIII) . Epidemiological survey results show that the prevalence of hemophilia in 24 provinces and cities in China is 2.73/100,000, while only about 5% of patients are registered . Hemophilia is mainly characterized by bleeding, which can occur anywhere in the pa-tient's body and manifest as intracranial, gastrointestinal, or pharyngeal bleeding, which can be life-threatening in severe cases. This article shares a case of a patient with he-mophilia A complicated by a chronic giant diploe hematoma.</p>","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":""},"PeriodicalIF":1.0,"publicationDate":"2022-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87263626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
BQ: A Lock-Free Queue with Batching 基于批处理的无锁队列
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512757
Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank
{"title":"BQ: A Lock-Free Queue with Batching","authors":"Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank","doi":"10.1145/3512757","DOIUrl":"https://doi.org/10.1145/3512757","url":null,"abstract":"Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 49"},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49253770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations 使用秩结构矩阵计算的高性能三维非结构化网格变形
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512756
Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes
{"title":"High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations","authors":"Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes","doi":"10.1145/3512756","DOIUrl":"https://doi.org/10.1145/3512756","url":null,"abstract":"The Radial Basis Function (RBF) technique is an interpolation method that produces high-quality unstructured adaptive meshes. However, the RBF-based boundary problem necessitates solving a large dense linear system with cubic arithmetic complexity that is computationally expensive and prohibitive in terms of memory footprint. In this article, we accelerate the computations of 3D unstructured mesh deformation based on RBF interpolations by exploiting the rank structured property of the matrix operator. The main idea consists in approximating the matrix off-diagonal tiles up to an application-dependent accuracy threshold. We highlight the robustness of our multiscale solver by assessing its numerical accuracy using realistic 3D geometries. In particular, we model the 3D mesh deformation on a population of the novel coronaviruses. We report and compare performance results on various parallel systems against existing state-of-the-art matrix solvers.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45543029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations 用于有限元计算的局部精细网格上的高效分布式无矩阵多重网格方法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-23 DOI: 10.1145/3580314
Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler
{"title":"Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations","authors":"Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler","doi":"10.1145/3580314","DOIUrl":"https://doi.org/10.1145/3580314","url":null,"abstract":"This work studies three multigrid variants for matrix-free finite-element computations on locally refined meshes: geometric local smoothing, geometric global coarsening (both h-multigrid), and polynomial global coarsening (a variant of p-multigrid). We have integrated the algorithms into the same framework—the open source finite-element library deal.II—, which allows us to make fair comparisons regarding their implementation complexity, computational efficiency, and parallel scalability as well as to compare the measurements with theoretically derived performance metrics. Serial simulations and parallel weak and strong scaling on up to 147,456 CPU cores on 3,072 compute nodes are presented. The results obtained indicate that global-coarsening algorithms show a better parallel behavior for comparable smoothers due to the better load balance, particularly on the expensive fine levels. In the serial case, the costs of applying hanging-node constraints might be significant, leading to advantages of local smoothing, even though the number of solver iterations needed is slightly higher. When using p- and h-multigrid in sequence (hp-multigrid), the results indicate that it makes sense to decrease the degree of the elements first from a performance point of view due to the cheaper transfer.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 38"},"PeriodicalIF":1.6,"publicationDate":"2022-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47033768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods 放大共轭梯度法的性能分析及最优节点感知通信
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-11 DOI: 10.1145/3580003
S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson
{"title":"Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods","authors":"S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3580003","DOIUrl":"https://doi.org/10.1145/3580003","url":null,"abstract":"Krylov methods are a key way of solving large sparse linear systems of equations but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this article, we present a performance study of an enlarged Krylov method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 25"},"PeriodicalIF":1.6,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48394069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Deterministic Constant-Amortized-RMR Abortable Mutex for CC and DSM CC和DSM的确定性常摊销rmr可终止互斥
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-12-09 DOI: 10.1145/3490559
P. Jayanti, S. Jayanti
{"title":"Deterministic Constant-Amortized-RMR Abortable Mutex for CC and DSM","authors":"P. Jayanti, S. Jayanti","doi":"10.1145/3490559","DOIUrl":"https://doi.org/10.1145/3490559","url":null,"abstract":"The abortable mutual exclusion problem, proposed by Scott and Scherer in response to the needs in real-time systems and databases, is a variant of mutual exclusion that allows processes to abort from their attempt to acquire the lock. Worst-case constant remote memory reference algorithms for mutual exclusion using hardware instructions such as Fetch&Add or Fetch&Store have long existed for both cache coherent (CC) and distributed shared memory multiprocessors, but no such algorithms are known for abortable mutual exclusion. Even relaxing the worst-case requirement to amortized, algorithms are only known for the CC model. In this article, we improve this state of the art by designing a deterministic algorithm that uses Fetch&Store to achieve amortized O(1) remote memory reference in both the CC and distributed shared memory models. Our algorithm supports Fast Abort (a process aborts within six steps of receiving the abort signal) and has the following additional desirable properties: it supports an arbitrary number of processes of arbitrary names, requires only O(1) space per process, and satisfies a novel fairness condition that we call Airline FCFS. Our algorithm is short with fewer than a dozen lines of code.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 26"},"PeriodicalIF":1.6,"publicationDate":"2021-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46508215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Adaptive Erasure Coded Fault Tolerant Linear System Solver 自适应擦除编码容错线性系统求解器
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-12-08 DOI: 10.1145/3490557
X. Kang, D. Gleich, A. Sameh, A. Grama
{"title":"Adaptive Erasure Coded Fault Tolerant Linear System Solver","authors":"X. Kang, D. Gleich, A. Sameh, A. Grama","doi":"10.1145/3490557","DOIUrl":"https://doi.org/10.1145/3490557","url":null,"abstract":"As parallel and distributed systems scale, fault tolerance is an increasingly important problem—particularly on systems with limited I/O capacity and bandwidth. Erasure coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and evaluation of performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that only checkpoints needed data structures.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"8 1","pages":"1 - 19"},"PeriodicalIF":1.6,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45097716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信