{"title":"Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator","authors":"Hung K. Nguyen, Xuan-Tu Tran","doi":"10.1145/3543544","DOIUrl":"https://doi.org/10.1145/3543544","url":null,"abstract":"This article proposes and implements a Coarse-grained dynamically Reconfigurable Architecture, named Reconfigurable Multimedia Accelerator (REMAC). REMAC architecture is driven by the pipelined multi-instruction-multi-data execution model for exploiting multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop can execute concurrently in the pipelining fashion by the temporal overlapping of the configuration fetch, execution, and store processes as much as possible. To address the huge bandwidth required by parallel processing units, REMAC architecture is proposed to efficiently exploit the abundant data locality in the kernel loops to decrease data access bandwidth while increase the efficiency of pipelined execution. In addition, a novel architecture of dedicated hierarchy data memory system is proposed to increase data reuse between iterations and make data always available for parallel operation of RPU. The proposed architecture was modeled at RTL using VHDL language. Several benchmark applications were mapped onto REMAC to validate the high-flexibility and high-performance of the architecture and prove that it is appropriate for a wide set of multimedia applications. The experimental results show that REMAC's performance is better than Xilinx Virtex-II, ADRES, REMUS-II, and TI C64+ DSP.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46137129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies","authors":"M. A. Anju, R. Nasre","doi":"10.1145/3543543","DOIUrl":"https://doi.org/10.1145/3543543","url":null,"abstract":"Locking has been a predominant technique depended upon for achieving thread synchronization and ensuring correctness in multi-threaded applications. It has been established that the concurrent applications working with hierarchical data witness significant benefits due to multi-granularity locking (MGL) techniques compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which uses a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance the locking cost and concurrency and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares subhierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By adding additional intervals for each node, MID helps in reducing the unnecessary lock rejections due to false-positive lock status of sub-hierarchies. Unleashing the hidden opportunities to exploit more concurrency allows the parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that with sufficient number of intervals, MID can avoid all the lock rejections due to false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, Directed Acyclic Graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43227379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simple Concurrent Connected Components Algorithms","authors":"S. Liu, R. Tarjan","doi":"10.1145/3543546","DOIUrl":"https://doi.org/10.1145/3543546","url":null,"abstract":"We study a class of simple algorithms for concurrently computing the connected components of an n-vertex, m-edge graph. Our algorithms are easy to implement in either the COMBINING CRCW PRAM or the MPC computing model. For two related algorithms in this class, we obtain Θ (lg n) step and Θ (m lg n) work bounds.1 For two others, we obtain O(lg2 n) step and O(m lg2 n) work bounds, which are tight for one of them. All our algorithms are simpler than related algorithms in the literature. We also point out some gaps and errors in the analysis of previous algorithms. Our results show that even a basic problem like connected components still has secrets to reveal.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41743528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joinable Parallel Balanced Binary Trees","authors":"G. Blelloch, Daniel Ferizovic, Yihan Sun","doi":"10.1145/3512769","DOIUrl":"https://doi.org/10.1145/3512769","url":null,"abstract":"In this article, we show how a single function, join, can be used to implement parallel balanced binary search trees (BSTs) simply and efficiently. Based on join , our approach applies to multiple balanced tree data structures, and a variety of functions for ordered sets and maps. We describe our technique as an algorithmic framework called join-based algorithms. We show that the join function fully captures what is needed for rebalancing trees for a variety of tree algorithms, as long as the balancing scheme satisfies certain properties, which we refer to as joinable trees. We discuss four balancing schemes that are joinable: AVL trees, red-black trees, weight-balanced trees, and treaps. We present a variety of tree algorithms that apply to joinable trees, including insert , delete , union , intersection , difference , split , range , filter , and so on, most of them also parallel. These algorithms are generic across balancing schemes. Many algorithms are optimal in the comparison model, and we provide a general proof to show the efficiency in work for joinable trees. The algorithms are highly parallel, all with polylogarithmic span (parallel dependence). Specifically, the set-set operations union , intersection , and difference have work ( O(mlog (frac{n}{m}+1)) ) and polylogarithmic span for input set sizes ( n ) and ( mle n ) . We implemented and tested our algorithms on the four balancing schemes. In general, all four schemes have quite similar performance, but the weight-balanced tree slightly outperforms the others. They have the same speedup characteristics, getting around 73 ( times ) speedup on 72 cores (144 hyperthreads). Experimental results also show that our implementation outperforms existing parallel implementations, and our sequential version achieves close or much better performance than the sequential merging algorithm in C++ Standard Template Library (STL) on various input sizes.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48619182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms","authors":"Yuedan Chen, Guoqing Xiao, Kenli Li, F. Piccialli, Albert Y. Zomaya","doi":"10.1145/3512770","DOIUrl":"https://doi.org/10.1145/3512770","url":null,"abstract":"Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental and important operations in many high-performance scientific and engineering applications. The inherent irregularity and poor data locality lead to two main challenges to scaling SpMSpV over high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on Sunway TaihuLight supercomputer to alleviate the challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI ( + ) OpenMP ( +X ) parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and accelerate both pre-/post-processing and main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing, adapt to the parallelism and memory hierarchy of the Sunway system, while still tame redundant and random memory accesses in SpMSpV, including a set of techniques like the fine-grained partitioner, re-collection method, and Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV uses several optimization techniques to further utilize the computing resources. fgSpMSpV on the Sunway TaihuLight gains a noticeable performance improvement from the key optimization techniques with various sparsity of the input. Additionally, fgSpMSpV is implemented on an NVIDIA Tesal P100 GPU and applied to the breath-first-search (BFS) application. fgSpMSpV on a P100 GPU obtains the speedup of up to ( 134.38times ) over the state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves the speedup of up to ( 21.68times ) over the state-of-the-arts.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48217478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Concurrent Data Sketches","authors":"Arik Rinberg, A. Spiegelman, Edward Bortnikov, Eshcar Hillel, I. Keidar, Lee Rhodes, Hadar Serviansky","doi":"10.1145/3512758","DOIUrl":"https://doi.org/10.1145/3512758","url":null,"abstract":"Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow parallelism for creating sketches using multiple threads or querying them while they are being built. We present a generic approach to parallelising data sketches efficiently and allowing them to be queried in real time, while bounding the error that such parallelism introduces. Utilising relaxed semantics and the notion of strong linearisability, we prove our algorithm’s correctness and analyse the error it induces in some specific sketches. Our implementation achieves high scalability while keeping the error small. We have contributed one of our concurrent sketches to the open-source data sketches library.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45254851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BQ: A Lock-Free Queue with Batching","authors":"Gal Milman-Sela, Alex Kogan, Yossi Lev, Victor Luchangco, E. Petrank","doi":"10.1145/3512757","DOIUrl":"https://doi.org/10.1145/3512757","url":null,"abstract":"Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49253770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-performance 3D Unstructured Mesh Deformation Using Rank Structured Matrix Computations","authors":"Rabab Alomairy, W. Bader, H. Ltaief, Y. Mesri, D. Keyes","doi":"10.1145/3512756","DOIUrl":"https://doi.org/10.1145/3512756","url":null,"abstract":"The Radial Basis Function (RBF) technique is an interpolation method that produces high-quality unstructured adaptive meshes. However, the RBF-based boundary problem necessitates solving a large dense linear system with cubic arithmetic complexity that is computationally expensive and prohibitive in terms of memory footprint. In this article, we accelerate the computations of 3D unstructured mesh deformation based on RBF interpolations by exploiting the rank structured property of the matrix operator. The main idea consists in approximating the matrix off-diagonal tiles up to an application-dependent accuracy threshold. We highlight the robustness of our multiscale solver by assessing its numerical accuracy using realistic 3D geometries. In particular, we model the 3D mesh deformation on a population of the novel coronaviruses. We report and compare performance results on various parallel systems against existing state-of-the-art matrix solvers.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45543029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Distributed Matrix-free Multigrid Methods on Locally Refined Meshes for FEM Computations","authors":"Peter Munch, T. Heister, Laura Prieto Saavedra, M. Kronbichler","doi":"10.1145/3580314","DOIUrl":"https://doi.org/10.1145/3580314","url":null,"abstract":"This work studies three multigrid variants for matrix-free finite-element computations on locally refined meshes: geometric local smoothing, geometric global coarsening (both h-multigrid), and polynomial global coarsening (a variant of p-multigrid). We have integrated the algorithms into the same framework—the open source finite-element library deal.II—, which allows us to make fair comparisons regarding their implementation complexity, computational efficiency, and parallel scalability as well as to compare the measurements with theoretically derived performance metrics. Serial simulations and parallel weak and strong scaling on up to 147,456 CPU cores on 3,072 compute nodes are presented. The results obtained indicate that global-coarsening algorithms show a better parallel behavior for comparable smoothers due to the better load balance, particularly on the expensive fine levels. In the serial case, the costs of applying hanging-node constraints might be significant, leading to advantages of local smoothing, even though the number of solver iterations needed is slightly higher. When using p- and h-multigrid in sequence (hp-multigrid), the results indicate that it makes sense to decrease the degree of the elements first from a performance point of view due to the cheaper transfer.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47033768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Analysis and Optimal Node-aware Communication for Enlarged Conjugate Gradient Methods","authors":"S. Lockhart, Amanda Bienz, W. Gropp, Luke N. Olson","doi":"10.1145/3580003","DOIUrl":"https://doi.org/10.1145/3580003","url":null,"abstract":"Krylov methods are a key way of solving large sparse linear systems of equations but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this article, we present a performance study of an enlarged Krylov method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2022-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48394069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}