ACM Transactions on Parallel Computing: Latest Articles (Page 3)

Non-overlapping High-accuracy Parallel Closure for Compact Schemes: Application in Multiphysics and Complex Geometry

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-01-17 DOI: 10.1145/3580005

P. Sundaram, A. Sengupta, V. K. Suman, T. Sengupta

Abstract: Compact schemes are often preferred in performing scientific computing for their superior spectral resolution. Error-free parallelization of a compact scheme is a challenging task due to the requirement of additional closures at the inter-processor boundaries. Here, sources of the error due to sub-domain boundary closures for the compact schemes are analyzed with global spectral analysis. A high-accuracy parallel computing strategy devised in “A high-accuracy preserving parallel algorithm for compact schemes for DNS. ACM Trans. Parallel Comput. 7, 4, 1-32 (2020)” systematically eliminates error due to parallelization and does not require overlapping points at the sub-domain boundaries. This closure is applicable for any compact scheme and is termed here as non-overlapping high-accuracy parallel (NOHAP) sub-domain boundary closure. In the present work, the advantages of the NOHAP closure are shown with the model convection equation and by solving the compressible Navier–Stokes equation for three-dimensional Rayleigh–Taylor instability simulations involving multiphysics dynamics and high Reynolds number flow past a natural laminar flow airfoil using a body-conforming curvilinear coordinate system. Linear scalability of the NOHAP closure is shown for the large-scale simulations using up to 19,200 processors.

Vol. 10, No. 1, pp. 1-28

Citations: 4

Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2023-01-03 DOI: 10.1145/3611642

J. Blanuša, K. Atasu, P. Ienne

Abstract: Cycles are one of the fundamental subgraph patterns and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates a near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup compared to the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.

Vol. 10, No. 1, pp. 1-35

Citations: 2
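
To make the hop-constrained cycle enumeration problem from the entry above concrete, here is a minimal sequential sketch. It is a plain depth-first search, not the Johnson or Read-Tarjan algorithm and not the fine-grained parallel method of the paper; the function name `hop_constrained_cycles` and the adjacency-list input format are assumptions made for illustration. Each cycle is reported once, rooted at its smallest vertex.

```python
from typing import Dict, List, Set


def hop_constrained_cycles(graph: Dict[int, List[int]], max_hops: int) -> List[List[int]]:
    """Enumerate simple directed cycles that use at most max_hops edges.

    graph maps each vertex to its list of out-neighbours (no duplicate edges).
    A cycle is returned as its vertex sequence, starting at its smallest vertex.
    """
    cycles: List[List[int]] = []
    if max_hops < 1:
        return cycles

    def extend(start: int, path: List[int], on_path: Set[int]) -> None:
        for nxt in graph.get(path[-1], []):
            if nxt == start:
                # The closing edge completes a cycle with len(path) edges.
                cycles.append(path.copy())
            elif nxt > start and nxt not in on_path and len(path) < max_hops:
                # Only visit larger vertices so each cycle is found exactly once.
                on_path.add(nxt)
                path.append(nxt)
                extend(start, path, on_path)
                path.pop()
                on_path.remove(nxt)

    for start in sorted(graph):
        extend(start, [start], {start})
    return cycles


if __name__ == "__main__":
    example = {0: [1], 1: [2, 0], 2: [0]}
    print(hop_constrained_cycles(example, 3))  # [[0, 1, 2], [0, 1]]
    print(hop_constrained_cycles(example, 2))  # [[0, 1]]
```

A coarse-grained parallel version would hand each `start` vertex to a different worker; the abstract argues that this kind of vertex-level split does not scale, which motivates the fine-grained approach.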

Parallel Minimum Cuts in O(m log² n) Work and Low Depth

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-12-16 DOI: 10.1145/3565557

Daniel Anderson, G. Blelloch

Citations: 3

Optimal Algorithms for Right-sizing Data Centers

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-10-11 DOI: 10.1145/3565513

S. Albers, Jens Quedenfeld

Abstract: Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted, because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We resort to a data-center optimization problem introduced by Lin, Wierman, Andrew, and Thereska [25, 27] that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this article, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. Thereby, we seek truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. [25, 27] to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. Hence, while LCP does not attain an optimal competitiveness in the continuous setting, it does so in the discrete problem examined here. We prove that the lower bound of 3 also holds in a problem variant with more restricted operating cost functions, introduced by Lin et al. [25]. In addition, we develop a randomized online algorithm that is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [7] (a deterministic, 2-competitive algorithm for the continuous setting) and uses randomized rounding to obtain an integral solution. Moreover, we prove that 2 is a lower bound for the competitive ratio of randomized online algorithms, so our algorithm is optimal. We prove that the lower bound still holds for the more restricted model. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. This matches an upper bound by Bansal et al. [7]. A lower bound of 2 was also shown by Antoniadis and Schewior [4]. We develop an independent proof that extends to the scenario with more restricted operating cost.

Vol. 9, No. 1, pp. 1-40

Citations: 0
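
The discrete offline problem described in the entry above can be illustrated with a small dynamic program. This is a straightforward baseline, not the binary search over a layered graph that the abstract mentions. It assumes a switching cost of `beta` per server that is powered up, a data center that starts with zero active servers, and per-slot operating cost functions supplied by the caller; the names `right_size_offline`, `cost_fns`, `max_servers`, and `beta` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def right_size_offline(
    cost_fns: Sequence[Callable[[int], float]],
    max_servers: int,
    beta: float,
) -> Tuple[float, List[int]]:
    """Minimise sum_t f_t(x_t) + beta * max(0, x_t - x_{t-1}) over integer x_t.

    cost_fns[t](x) is the operating cost of running x servers in slot t
    (it may return float("inf") when x cannot serve the demand).
    Runs in O(T * max_servers^2) time.
    """
    inf = float("inf")
    states = range(max_servers + 1)
    # best[x]: cheapest cost of the slots seen so far when x servers end up active.
    best = [0.0 if x == 0 else inf for x in states]
    choices: List[List[int]] = []  # choices[t][x]: best predecessor state for x in slot t

    for f in cost_fns:
        new_best: List[float] = []
        prev_choice: List[int] = []
        for x in states:
            # Powering up (x - p) servers costs beta each; powering down is free.
            cost, p = min((best[p] + beta * max(0, x - p), p) for p in states)
            new_best.append(cost + f(x))
            prev_choice.append(p)
        best = new_best
        choices.append(prev_choice)

    # Backtrack from the cheapest final state to recover the schedule.
    x = min(states, key=lambda s: best[s])
    total = best[x]
    schedule: List[int] = []
    for prev_choice in reversed(choices):
        schedule.append(x)
        x = prev_choice[x]
    schedule.reverse()
    return total, schedule


if __name__ == "__main__":
    def demand_cost(demand: int) -> Callable[[int], float]:
        # One unit of energy per active server; the demand must be covered.
        return lambda x: float("inf") if x < demand else float(x)

    slots = [demand_cost(d) for d in (1, 0, 1)]
    print(right_size_offline(slots, 2, 3.0))  # (6.0, [1, 1, 1])
    print(right_size_offline(slots, 2, 0.5))  # (3.0, [1, 0, 1])
```

With an expensive power-up (`beta = 3.0`) the schedule keeps one server idle through the empty slot, while a cheap power-up (`beta = 0.5`) makes it worthwhile to switch the server off and on again.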

A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-10-04 DOI: 10.1145/3565514

Giorgos Kappes, S. Anastasiadis

Abstract: Producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by the end users. To address this demand, we prioritize achieving both high operation performance and high item transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we have designed and implemented for that purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we apply no order restrictions among the operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudo-code definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We developed prototype implementations of the RCQ algorithms and experimentally compare them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable lock-free member of the RCQ family. We experimentally show that RCQS achieves an advantage ranging from constant factors to orders of magnitude over the state-of-the-art strict or relaxed queue algorithms across several latency and throughput statistics of the queue operations and item transfers.

Vol. 9, No. 1, pp. 1-37

Citations: 0

Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-09-05 DOI: 10.1145/3614444

A. Alvermann, G. Hager, H. Fehske

Abstract: We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model, which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to which extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer the rows of the sparse matrix are distributed across individual processes. In the vertical layer bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases (an exciton and a strongly correlated electron system), which incur either small or large communication overhead.

Vol. 10, No. 1, pp. 1-31

Citations: 1

Checkpointing Workflows à la Young/Daly Is Not Good Enough

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-09-02 DOI: 10.1145/3548607

A. Benoit, Lucas Perotin, Y. Robert, Hongyang Sun

Citations: 2

Improving the Speed and Quality of Parallel Graph Coloring

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-07-11 DOI: 10.1145/3543545

Ghadeer Alabandi, Martin Burtscher

Citations: 0

Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-07-09 DOI: 10.1145/3543544

Hung K. Nguyen, Xuan-Tu Tran

Abstract: This article proposes and implements a coarse-grained dynamically reconfigurable architecture, named Reconfigurable Multimedia Accelerator (REMAC). The REMAC architecture is driven by the pipelined multi-instruction-multi-data execution model to exploit the multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop to execute concurrently in a pipelined fashion by overlapping the configuration fetch, execution, and store processes in time as much as possible. To address the huge bandwidth required by parallel processing units, REMAC efficiently exploits the abundant data locality in the kernel loops to decrease data access bandwidth while increasing the efficiency of pipelined execution. In addition, a novel dedicated hierarchical data memory system is proposed to increase data reuse between iterations and keep data always available for parallel operation of the RPU. The proposed architecture was modeled at RTL in VHDL. Several benchmark applications were mapped onto REMAC to validate the high flexibility and high performance of the architecture and to show that it is appropriate for a wide set of multimedia applications. The experimental results show that REMAC's performance is better than that of Xilinx Virtex-II, ADRES, REMUS-II, and TI C64+ DSP.

pp. 1-23

Citations: 1

Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2022-07-08 DOI: 10.1145/3543543

M. A. Anju, R. Nasre

Abstract: Locking has been a predominant technique for achieving thread synchronization and ensuring correctness in multi-threaded applications. It has been established that concurrent applications working with hierarchical data benefit significantly from multi-granularity locking (MGL) techniques compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which use a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance the locking cost and concurrency and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares subhierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By adding additional intervals for each node, MID helps in reducing the unnecessary lock rejections due to false-positive lock status of sub-hierarchies. Unleashing the hidden opportunities to exploit more concurrency allows the parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that with a sufficient number of intervals, MID can avoid all the lock rejections due to false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, Directed Acyclic Graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.

Vol. 9, No. 1, pp. 1-27

Citations: 0
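
The false-positive effect that the entry above attributes to single-interval numbering can be reproduced with a toy example. The sketch below is a simplified illustration under stated assumptions, not the DomLock or MID protocol from the paper: leaves are numbered in depth-first order, a node is summarised either by one interval spanning all its reachable leaves or by several maximal runs of consecutive leaf numbers, and two lock requests conflict when any of their intervals intersect. All function names and the example hierarchy are hypothetical, and the hierarchy is assumed to be acyclic.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]


def leaf_numbers(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Number the leaves 1, 2, ... in depth-first order; shared leaves are numbered once."""
    numbers: Dict[str, int] = {}

    def visit(node: str) -> None:
        kids = children.get(node, [])
        if not kids:
            numbers.setdefault(node, len(numbers) + 1)
        for kid in kids:
            visit(kid)

    visit(root)
    return numbers


def reachable_leaves(children: Dict[str, List[str]], node: str, numbers: Dict[str, int]) -> List[int]:
    """Sorted leaf numbers reachable from node."""
    kids = children.get(node, [])
    if not kids:
        return [numbers[node]]
    found: Set[int] = set()
    for kid in kids:
        found.update(reachable_leaves(children, kid, numbers))
    return sorted(found)


def single_interval(leaves: List[int]) -> List[Interval]:
    """One interval from the smallest to the largest reachable leaf."""
    return [(leaves[0], leaves[-1])]


def multi_intervals(leaves: List[int]) -> List[Interval]:
    """Maximal runs of consecutive leaf numbers, one interval per run."""
    runs: List[Interval] = []
    for n in leaves:
        if runs and n == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], n)
        else:
            runs.append((n, n))
    return runs


def conflicts(a: List[Interval], b: List[Interval]) -> bool:
    """True when some interval of a intersects some interval of b."""
    return any(lo1 <= hi2 and lo2 <= hi1 for lo1, hi1 in a for lo2, hi2 in b)


if __name__ == "__main__":
    # DAG: leaf "a" is shared by P and X; X and Y share nothing.
    hierarchy = {"root": ["P", "Y", "X"], "P": ["a"], "Y": ["b"], "X": ["a", "c"]}
    nums = leaf_numbers(hierarchy, "root")            # a=1, b=2, c=3
    x = reachable_leaves(hierarchy, "X", nums)        # [1, 3]
    y = reachable_leaves(hierarchy, "Y", nums)        # [2]
    print(conflicts(single_interval(x), single_interval(y)))  # True (false positive)
    print(conflicts(multi_intervals(x), multi_intervals(y)))  # False
```

In this example the single interval (1, 3) for X covers leaf 2, which X cannot reach, so a lock on Y is rejected needlessly; splitting X into (1, 1) and (3, 3) removes the spurious conflict while still detecting the real overlap with P.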