ACM Transactions on Parallel Computing: Latest Publications

Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-07-09 DOI: 10.1145/3543544
Hung K. Nguyen, Xuan-Tu Tran
{"title":"Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator","authors":"Hung K. Nguyen, Xuan-Tu Tran","doi":"10.1145/3543544","DOIUrl":"https://doi.org/10.1145/3543544","url":null,"abstract":"This article proposes and implements a Coarse-grained dynamically Reconfigurable Architecture, named Reconfigurable Multimedia Accelerator (REMAC). REMAC architecture is driven by the pipelined multi-instruction-multi-data execution model for exploiting multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop can execute concurrently in the pipelining fashion by the temporal overlapping of the configuration fetch, execution, and store processes as much as possible. To address the huge bandwidth required by parallel processing units, REMAC architecture is proposed to efficiently exploit the abundant data locality in the kernel loops to decrease data access bandwidth while increase the efficiency of pipelined execution. In addition, a novel architecture of dedicated hierarchy data memory system is proposed to increase data reuse between iterations and make data always available for parallel operation of RPU. The proposed architecture was modeled at RTL using VHDL language. Several benchmark applications were mapped onto REMAC to validate the high-flexibility and high-performance of the architecture and prove that it is appropriate for a wide set of multimedia applications. The experimental results show that REMAC's performance is better than Xilinx Virtex-II, ADRES, REMUS-II, and TI C64+ DSP.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46137129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
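
The pipelining idea in the abstract above can be pictured in software. Below is a conceptual Python sketch, not REMAC's actual hardware or tooling, of how overlapping the configuration-fetch, execute, and store stages of successive loop iterations shortens the total schedule; the stage names and the three-stage split are illustrative assumptions.

STAGES = ["fetch_config", "execute", "store"]

def pipelined_schedule(num_iterations):
    """List, per time step, the (iteration, stage) pairs that run concurrently
    when the three stages of successive iterations are overlapped."""
    schedule = []
    for t in range(num_iterations + len(STAGES) - 1):
        active = []
        for i in range(num_iterations):
            stage_index = t - i
            if 0 <= stage_index < len(STAGES):
                active.append((i, STAGES[stage_index]))
        schedule.append(active)
    return schedule

if __name__ == "__main__":
    for t, active in enumerate(pipelined_schedule(4)):
        print(f"step {t}: {active}")
    # 4 iterations complete in 6 overlapped steps instead of 12 sequential ones.
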
Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-07-08 DOI: 10.1145/3543543
M. A. Anju, R. Nasre
{"title":"Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies","authors":"M. A. Anju, R. Nasre","doi":"10.1145/3543543","DOIUrl":"https://doi.org/10.1145/3543543","url":null,"abstract":"Locking has been a predominant technique depended upon for achieving thread synchronization and ensuring correctness in multi-threaded applications. It has been established that the concurrent applications working with hierarchical data witness significant benefits due to multi-granularity locking (MGL) techniques compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which uses a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance the locking cost and concurrency and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares subhierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By adding additional intervals for each node, MID helps in reducing the unnecessary lock rejections due to false-positive lock status of sub-hierarchies. Unleashing the hidden opportunities to exploit more concurrency allows the parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that with sufficient number of intervals, MID can avoid all the lock rejections due to false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, Directed Acyclic Graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43227379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
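
To make the interval idea concrete, here is a minimal Python sketch, not the authors' implementation, of how interval-based hierarchical locking decides conflicts, and of how keeping several exact intervals per node, as MID proposes, removes a false positive that a single coarse interval would report. The DAG, its numbering, and the interval values are hypothetical.

def intervals_overlap(a, b):
    """True if two [low, high] intervals intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def conflicts(request_intervals, held_intervals):
    """A lock request conflicts if any of its intervals overlaps any held interval."""
    return any(intervals_overlap(r, h)
               for r in request_intervals for h in held_intervals)

# Hypothetical numbering of a small DAG: node X dominates only the leaves
# numbered 2 and 5, node Y dominates the leaves numbered 3 and 4.
single_interval_X = [(2, 5)]          # one coarse interval: appears to cover 2..5
single_interval_Y = [(3, 4)]
multi_interval_X  = [(2, 2), (5, 5)]  # several exact intervals: covers only 2 and 5
multi_interval_Y  = [(3, 3), (4, 4)]

print(conflicts(single_interval_Y, single_interval_X))  # True: a false-positive rejection
print(conflicts(multi_interval_Y, multi_interval_X))    # False: both locks can proceed
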
Simple Concurrent Connected Components Algorithms
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-07-08 DOI: 10.1145/3543546
S. Liu, R. Tarjan
{"title":"Simple Concurrent Connected Components Algorithms","authors":"S. Liu, R. Tarjan","doi":"10.1145/3543546","DOIUrl":"https://doi.org/10.1145/3543546","url":null,"abstract":"We study a class of simple algorithms for concurrently computing the connected components of an n-vertex, m-edge graph. Our algorithms are easy to implement in either the COMBINING CRCW PRAM or the MPC computing model. For two related algorithms in this class, we obtain Θ (lg n) step and Θ (m lg n) work bounds.1 For two others, we obtain O(lg2 n) step and O(m lg2 n) work bounds, which are tight for one of them. All our algorithms are simpler than related algorithms in the literature. We also point out some gaps and errors in the analysis of previous algorithms. Our results show that even a basic problem like connected components still has secrets to reveal.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41743528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
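
As a point of reference, the following Python sketch shows the classic "hook and shortcut" pattern that simple concurrent connected-components algorithms build on, run sequentially here; it does not reproduce the paper's specific algorithm variants or their PRAM/MPC step and work bounds.

def connected_components(n, edges):
    """Label each of n vertices with the smallest vertex in its component."""
    parent = list(range(n))

    def find(v):
        # Pointer jumping ("shortcut"): walk toward the root, halving path length.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    changed = True
    while changed:                 # each round corresponds to one parallel step
        changed = False
        for u, v in edges:         # in a parallel model, all edges hook at once
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)   # "hook" larger root under smaller
                changed = True
    return [find(v) for v in range(n)]

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
# [0, 0, 0, 3, 3, 5]  ->  three components: {0,1,2}, {3,4}, {5}
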
Joinable Parallel Balanced Binary Trees
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512769
G. Blelloch, Daniel Ferizovic, Yihan Sun
{"title":"Joinable Parallel Balanced Binary Trees","authors":"G. Blelloch, Daniel Ferizovic, Yihan Sun","doi":"10.1145/3512769","DOIUrl":"https://doi.org/10.1145/3512769","url":null,"abstract":"In this article, we show how a single function, join, can be used to implement parallel balanced binary search trees (BSTs) simply and efficiently. Based on join , our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework called join-based algorithms. We show that the join function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to as joinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, including insert , delete , union , intersection , difference , split , range , filter , and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operations union , intersection , and difference have work ( O(mlog (frac{n}{m}+1)) ) and polylogarithmic span for input set sizes ( n ) and ( mle n ) . We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73 ( times ) speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48619182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
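
The following Python sketch illustrates the join-based framework described above: split and union are written purely in terms of join. Balancing is deliberately omitted here (join simply attaches the subtrees), so this shows the algorithmic structure only; the paper's joinable balancing schemes (AVL, red-black, weight-balanced, treap) would replace the trivial join shown below.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def join(left, key, right):
    # For a balanced scheme this is where rebalancing would happen; here we
    # just attach the subtrees, which is enough to show the structure.
    return Node(key, left, right)

def split(t, key):
    """Return (tree with keys < key, key present?, tree with keys > key)."""
    if t is None:
        return None, False, None
    if key < t.key:
        l, found, r = split(t.left, key)
        return l, found, join(r, t.key, t.right)
    if key > t.key:
        l, found, r = split(t.right, key)
        return join(t.left, t.key, l), found, r
    return t.left, True, t.right

def union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    l, _, r = split(b, a.key)
    # In the parallel version, the two recursive unions run in parallel,
    # giving polylogarithmic span for balanced joinable trees.
    return join(union(a.left, l), a.key, union(a.right, r))

def keys(t):
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)

t1 = union(Node(3), union(Node(1), Node(5)))
t2 = union(Node(2), Node(3))
print(keys(union(t1, t2)))   # [1, 2, 3, 5]
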
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512770
Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya
{"title":"fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms","authors":"Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya","doi":"10.1145/3512770","DOIUrl":"https://doi.org/10.1145/3512770","url":null,"abstract":"Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on Sunway TaihuLight supercomputer to alleviate the challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI ( + ) OpenMP ( +X ) parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-/post-processing and main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing, adapt to the parallelism and memory hierarchy of the Sunway system, while still tame redundant and random memory accesses in SpMSpV, including a set of techniques like the fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques with various sparsity of the input. Additionally, fgSpMSpV is implemented on an NVIDIA Tesal P100 GPU and applied to the breath-first-search (BFS) application. fgSpMSpV on a P100 GPU obtains the speedup of up to ( 134.38times ) over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves the speedup of up to ( 21.68times ) over the state-of-the-arts.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48217478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
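
For readers unfamiliar with the kernel itself, here is a minimal Python sketch of SpMSpV with the matrix stored in plain Compressed Sparse Column (CSC) form: only the columns selected by the nonzeros of x are touched, which is what makes the operation both cheap and irregular. The paper's CSCV format, fine-grained partitioning, and the Sunway/MPI+OpenMP machinery are beyond this sketch.

def spmspv_csc(col_ptr, row_idx, vals, x_nonzeros):
    """y = A @ x for a CSC matrix A and a sparse x given as {column: value}."""
    y = {}
    for j, xj in x_nonzeros.items():
        for p in range(col_ptr[j], col_ptr[j + 1]):   # nonzeros of column j
            i = row_idx[p]
            y[i] = y.get(i, 0.0) + vals[p] * xj       # scatter-accumulate into sparse y
    return y

# 3x3 example:  A = [[1, 0, 2],
#                    [0, 3, 0],
#                    [4, 0, 5]],  and x has nonzeros x[0] = 1, x[2] = 2.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 4.0, 3.0, 2.0, 5.0]
print(spmspv_csc(col_ptr, row_idx, vals, {0: 1.0, 2: 2.0}))
# {0: 5.0, 2: 14.0}  ->  y = [5, 0, 14]
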
Fast Concurrent Data Sketches
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-04-11 DOI: 10.1145/3512758
Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky
{"title":"Fast Concurrent Data Sketches","authors":"Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky","doi":"10.1145/3512758","DOIUrl":"https://doi.org/10.1145/3512758","url":null,"abstract":"Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability, we prove our algorithm’s correctness and analyse the error it induces in some specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45254851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
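
The following Python sketch, which is not the paper's algorithm, illustrates the general pattern the abstract describes: threads update cheap thread-local buffers and only occasionally merge them into the shared sketch, so queries see a slightly stale but bounded view instead of serialising every update. The K-minimum-values sketch, the buffer size, and the hash function are illustrative choices.

import hashlib
import threading

def h(item):
    # Map an item to a pseudo-random value in [0, 1).
    return int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 16.0 ** 40

class KMVSketch:
    """K-minimum-values sketch: estimates the number of distinct items seen."""
    def __init__(self, k=64):
        self.k, self.values = k, []

    def update(self, hashed):
        self.values.append(hashed)
        self.values = sorted(set(self.values))[: self.k]

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))
        return (self.k - 1) / self.values[self.k - 1]

class ConcurrentKMV:
    def __init__(self, k=64, buffer_size=128):
        self.shared, self.lock = KMVSketch(k), threading.Lock()
        self.buffer_size = buffer_size
        self.local = threading.local()

    def update(self, item):
        buf = getattr(self.local, "buf", None)
        if buf is None:
            buf = self.local.buf = []
        buf.append(h(item))
        if len(buf) >= self.buffer_size:      # amortise synchronisation over a batch
            with self.lock:
                for v in buf:
                    self.shared.update(v)
            buf.clear()

    def estimate(self):
        with self.lock:                       # may miss items still sitting in buffers
            return self.shared.estimate()

sketch = ConcurrentKMV()
threads = [threading.Thread(target=lambda s=s: [sketch.update(f"user{i}")
                                                for i in range(s, 20000, 4)])
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(sketch.estimate()))               # in the ballpark of 20,000 distinct items
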
BQ: A Lock-Free Queue with Batching
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512757
Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank
{"title":"BQ: A Lock-Free Queue with Batching","authors":"Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank","doi":"10.1145/3512757","DOIUrl":"https://doi.org/10.1145/3512757","url":null,"abstract":"Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49253770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
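
A minimal Python sketch of the batching idea follows. The paper's queue is lock-free and CAS-based; this version uses a plain lock and only illustrates why batching pays off: a whole run of enqueues is linked together privately and then spliced onto the shared queue in a single synchronised step, rather than paying one synchronisation per operation.

import threading

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value):
        self.value, self.next = value, None

class BatchingQueue:
    def __init__(self):
        self.head = self.tail = Node(None)    # dummy node
        self.lock = threading.Lock()

    def enqueue_batch(self, values):
        if not values:
            return
        # Build the chain locally, with no synchronisation at all.
        first = last = Node(values[0])
        for v in values[1:]:
            last.next = Node(v)
            last = last.next
        # One synchronised splice for the whole batch.
        with self.lock:
            self.tail.next = first
            self.tail = last

    def dequeue(self):
        with self.lock:
            node = self.head.next
            if node is None:
                return None
            self.head = node                  # dequeued node becomes the new dummy
            return node.value

q = BatchingQueue()
q.enqueue_batch([1, 2, 3])
q.enqueue_batch([4, 5])
print([q.dequeue() for _ in range(6)])        # [1, 2, 3, 4, 5, None]
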
High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-24 DOI: 10.1145/3512756
Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes
{"title":"High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations","authors":"Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes","doi":"10.1145/3512756","DOIUrl":"https://doi.org/10.1145/3512756","url":null,"abstract":"The Radial Basis Function (RBF) technique is an interpolation method that produces high-quality unstructured adaptive meshes. However, the RBF-based boundary problem necessitates solving a large dense linear system with cubic arithmetic complexity that is computationally expensive and prohibitive in terms of memory footprint. In this article, we accelerate the computations of 3D unstructured mesh deformation based on RBF interpolations by exploiting the rank structured property of the matrix operator. The main idea consists in approximating the matrix off-diagonal tiles up to an application-dependent accuracy threshold. We highlight the robustness of our multiscale solver by assessing its numerical accuracy using realistic 3D geometries. In particular, we model the 3D mesh deformation on a population of the novel coronaviruses. We report and compare performance results on various parallel systems against existing state-of-the-art matrix solvers.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45543029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
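
A minimal dense RBF mesh-deformation sketch in NumPy is shown below; it performs exactly the cubic-cost dense solve that the paper avoids, so it only illustrates the interpolation problem itself. The Wendland C2 basis, the support radius, and the sample points are illustrative choices, not taken from the paper.

import numpy as np

def rbf(r, radius=1.0):
    # Wendland C2 compactly supported basis, a common choice for mesh motion.
    x = np.clip(r / radius, 0.0, 1.0)
    return (1.0 - x) ** 4 * (4.0 * x + 1.0)

def deform(boundary_pts, boundary_disp, volume_pts, radius=2.0):
    """Propagate prescribed boundary displacements to interior mesh nodes."""
    # Dense interpolation matrix over boundary nodes (the expensive part).
    dists = np.linalg.norm(boundary_pts[:, None, :] - boundary_pts[None, :, :], axis=2)
    weights = np.linalg.solve(rbf(dists, radius), boundary_disp)  # one solve per coordinate
    # Evaluate the interpolant at the interior nodes.
    dists_v = np.linalg.norm(volume_pts[:, None, :] - boundary_pts[None, :, :], axis=2)
    return rbf(dists_v, radius) @ weights

boundary = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
disp = np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
interior = np.array([[0.25, 0.25, 0.25]])
print(deform(boundary, disp, interior))   # a smaller displacement, mostly in x, pulled toward node 0
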
Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-23 DOI: 10.1145/3580314
Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler
{"title":"Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations","authors":"Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler","doi":"10.1145/3580314","DOIUrl":"https://doi.org/10.1145/3580314","url":null,"abstract":"This work studies three multigrid variants for matrix-free finite-element computations on locally refined meshes: geometric local smoothing, geometric global coarsening (both h-multigrid), and polynomial global coarsening (a variant of p-multigrid). We have integrated the algorithms into the same framework—the open source finite-element library deal.II—, which allows us to make fair comparisons regarding their implementation complexity, computational efficiency, and parallel scalability as well as to compare the measurements with theoretically derived performance metrics. Serial simulations and parallel weak and strong scaling on up to 147,456 CPU cores on 3,072 compute nodes are presented. The results obtained indicate that global-coarsening algorithms show a better parallel behavior for comparable smoothers due to the better load balance, particularly on the expensive fine levels. In the serial case, the costs of applying hanging-node constraints might be significant, leading to advantages of local smoothing, even though the number of solver iterations needed is slightly higher. When using p- and h-multigrid in sequence (hp-multigrid), the results indicate that it makes sense to decrease the degree of the elements first from a performance point of view due to the cheaper transfer.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47033768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
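
The V-cycle skeleton that all three variants share can be illustrated on a toy problem. Below is a minimal matrix-free two-grid sketch for the 1D Poisson equation in NumPy; the smoother, transfer operators, and grid are simplistic stand-ins, and none of the deal.II local-smoothing or global-coarsening machinery is represented.

import numpy as np

def apply_A(u, h):
    """Matrix-free application of the 1D Laplacian with zero Dirichlet boundaries."""
    Au = np.zeros_like(u)
    Au[1:-1] = (2.0 * u[1:-1] - u[:-2] - u[2:]) / h ** 2
    return Au

def jacobi_smooth(u, f, h, sweeps=3, omega=0.6):
    for _ in range(sweeps):
        r = f - apply_A(u, h)
        u = u + omega * (h ** 2 / 2.0) * r        # damped Jacobi, D = (2 / h^2) I
        u[0] = u[-1] = 0.0
    return u

def two_grid_cycle(u, f, h):
    u = jacobi_smooth(u, f, h)                    # pre-smoothing
    r = f - apply_A(u, h)
    n_c = (len(u) + 1) // 2                       # coarse grid keeps every other point
    r_c = np.zeros(n_c)
    r_c[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]   # full weighting
    # Direct coarse solve; a full multigrid would recurse here instead.
    A_c = (2.0 * np.eye(n_c - 2) - np.eye(n_c - 2, k=1) - np.eye(n_c - 2, k=-1)) / (2 * h) ** 2
    e_c = np.zeros(n_c)
    e_c[1:-1] = np.linalg.solve(A_c, r_c[1:-1])
    e = np.interp(np.arange(len(u)), np.arange(0, len(u), 2), e_c)    # linear prolongation
    return jacobi_smooth(u + e, f, h)             # post-smoothing

n = 65
h = 1.0 / (n - 1)
x = np.linspace(0.0, 1.0, n)
f = np.pi ** 2 * np.sin(np.pi * x)                # exact solution of -u'' = f is sin(pi x)
u = np.zeros(n)
for _ in range(20):
    u = two_grid_cycle(u, f, h)
print(np.max(np.abs(u - np.sin(np.pi * x))))      # error is at the discretisation level, not the solver's
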
Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2022-03-11 DOI: 10.1145/3580003
S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson
{"title":"Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods","authors":"S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3580003","DOIUrl":"https://doi.org/10.1145/3580003","url":null,"abstract":"Krylov methods are a key way of solving large sparse linear systems of equations but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this article, we present a performance study of an enlarged Krylov method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48394069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
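
To show the single operation that distinguishes enlarged Krylov methods from standard CG, here is a small NumPy sketch of splitting the initial residual across subdomains into a block vector; the ECG recurrences themselves and the paper's node-aware communication optimisation are not included, and the partition shown is purely illustrative.

import numpy as np

def split_residual(r, partition, t):
    """Return an n-by-t block vector whose column j holds r restricted to subdomain j."""
    R = np.zeros((len(r), t))
    for i in range(len(r)):
        R[i, partition[i]] = r[i]
    return R

# Tiny example: 8 unknowns assigned to t = 2 subdomains.
r0 = np.arange(1.0, 9.0)                         # stand-in for the initial residual b - A @ x0
partition = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # owner subdomain of each unknown
R0 = split_residual(r0, partition, t=2)
print(R0.sum(axis=1))      # the columns sum back to r0, so no information is lost
print(R0.T @ R0)           # block inner products replace scalar dot products,
                           # reducing the number of global reductions per iteration
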