ACM Transactions on Parallel Computing最新文献_第6页

Engineering In-place (Shared-memory) Sorting Algorithms 工程就地(共享内存)排序算法

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2020-09-28 DOI: 10.1145/3505286

Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders

{"title":"Engineering In-place (Shared-memory) Sorting Algorithms","authors":"Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders","doi":"10.1145/3505286","DOIUrl":"https://doi.org/10.1145/3505286","url":null,"abstract":"We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – “best” sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 62"},"PeriodicalIF":1.6,"publicationDate":"2020-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43709384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 16

Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing 异步多gpu编程模型与大规模图形处理的应用

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2020-08-01 DOI: 10.1145/3399730

Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali

{"title":"Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing","authors":"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali","doi":"10.1145/3399730","DOIUrl":"https://doi.org/10.1145/3399730","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. ","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"32 1","pages":"18:1-18:27"},"PeriodicalIF":1.6,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79522401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

FEAST: A Lightweight Lock-free Concurrent Binary Search Tree FEAST:一个轻量级的无锁并发二叉搜索树

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2020-05-31 DOI: 10.1145/3391438

Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal

{"title":"FEAST: A Lightweight Lock-free Concurrent Binary Search Tree","authors":"Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal","doi":"10.1145/3391438","DOIUrl":"https://doi.org/10.1145/3391438","url":null,"abstract":"We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree especially when the contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"30 1","pages":"10:1-10:64"},"PeriodicalIF":1.6,"publicationDate":"2020-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82801233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows CoREC:可扩展和弹性的内存数据暂存原位工作流

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2020-05-16 DOI: 10.1145/3391448

Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar

引用次数: 5

ROC: A Reconfigurable Optical Computer for Simulating Physical Processes 用于模拟物理过程的可重构光学计算机

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2020-04-02 DOI: 10.1145/3380944

Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi

{"title":"ROC: A Reconfigurable Optical Computer for Simulating Physical Processes","authors":"Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi","doi":"10.1145/3380944","DOIUrl":"https://doi.org/10.1145/3380944","url":null,"abstract":"Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both deviceand circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high-performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of 10’s of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"21 1","pages":"8:1-8:29"},"PeriodicalIF":1.6,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78470620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Scheduling Mutual Exclusion Accesses in Equal-Length Jobs 调度等长作业中的互斥访问

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3342562

D. Kagaris, S. Dutta

{"title":"Scheduling Mutual Exclusion Accesses in Equal-Length Jobs","authors":"D. Kagaris, S. Dutta","doi":"10.1145/3342562","DOIUrl":"https://doi.org/10.1145/3342562","url":null,"abstract":"A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number L of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case of L = 2. In the notation commonly used for job-shop scheduling problems, this result means that the problem J |dij=1, nj =2|Cmax is polynomially solvable, adding to the polynomial solutions known for the problems J2 | nj ≤ 2 | Cmax and J2 | dij = 1 | Cmax (whereas other closely related versions such as J2 | nj ≤ 3 | Cmax, J2 | dij ∈ { 1,2} | Cmax, J3 | nj ≤ 2 | Cmax, J3 | dij=1 | Cmax, and J |dij=1, nj ≤ 3| Cmax are all known to be NP-complete). For the general case L > 2 (i.e., for the job-shop problem J |dij=1, nj =L> 2| Cmax), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"8:1-8:26"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86220203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

I/O Scheduling Strategy for Periodic Applications 周期性应用的I/O调度策略

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3338510

G. Aupy, Ana Gainaru, Valentin Le Fèvre

{"title":"I/O Scheduling Strategy for Periodic Applications","authors":"G. Aupy, Ana Gainaru, Valentin Le Fèvre","doi":"10.1145/3338510","DOIUrl":"https://doi.org/10.1145/3338510","url":null,"abstract":"With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in supercomputers. Architectural enhancement such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications.\u0000 In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once during the job scheduling phase a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation up to 16% and the maximum system efficiency up to 18%.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"3 1","pages":"7:1-7:26"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84499785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Modeling Universal Globally Adaptive Load-Balanced Routing 建模通用全局自适应负载均衡路由

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3349620

Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang

引用次数: 3

Scalable Deep Learning via I/O Analysis and Optimization 基于I/O分析和优化的可扩展深度学习

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3331526

S. Pumma, Min Si, W. Feng, P. Balaji

{"title":"Scalable Deep Learning via I/O Analysis and Optimization","authors":"S. Pumma, Min Si, W. Feng, P. Balaji","doi":"10.1145/3331526","DOIUrl":"https://doi.org/10.1145/3331526","url":null,"abstract":"Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"23 1","pages":"6:1-6:34"},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87253170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

A High-Quality and Fast Maximal Independent Set Implementation for GPUs 一种高质量、快速的gpu最大独立集实现

IF 1.6

ACM Transactions on Parallel Computing Pub Date : 2019-01-23 DOI: 10.1145/3291525

Martin Burtscher, Sindhu Devale, S. Azimi, J. Jaiganesh, Evan Powers

{"title":"A High-Quality and Fast Maximal Independent Set Implementation for GPUs","authors":"Martin Burtscher, Sindhu Devale, S. Azimi, J. Jaiganesh, Evan Powers","doi":"10.1145/3291525","DOIUrl":"https://doi.org/10.1145/3291525","url":null,"abstract":"Computing a maximal independent set is an important step in many parallel graph algorithms. This article introduces ECL-MIS, a maximal independent set implementation that works well on GPUs. It includes key optimizations to speed up computation, reduce the memory footprint, and increase the set size. Its CUDA implementation requires fewer than 30 kernel statements, runs asynchronously, and produces a deterministic result. It outperforms the maximal independent set implementations of Pannotia, CUSP, and IrGL on each of the 16 tested graphs of various types and sizes. On a Titan X GPU, ECL-MIS is between 3.9 and 100 times faster (11.5 times, on average). ECL-MIS running on the GPU is also faster than the parallel CPU codes Ligra, Ligra+, and PBBS running on 20 Xeon cores, which it outperforms by 4.1 times, on average. At the same time, ECL-MIS produces maximal independent sets that are up to 52% larger (over 10%, on average) compared to these preexisting CPU and GPU implementations. Whereas these codes produce maximal independent sets that are, on average, about 15% smaller than the largest possible such sets, ECL-MIS sets are less than 6% smaller than the maximum independent sets.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"26 1","pages":"8:1-8:27"},"PeriodicalIF":1.6,"publicationDate":"2019-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80137186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9