ACM Transactions on Parallel Computing最新文献

筛选
英文 中文
Bandwidth-Optimal Random Shuffling for GPUs gpu的带宽优化随机洗牌
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2021-06-11 DOI: 10.1145/3505287
Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes
{"title":"Bandwidth-Optimal Random Shuffling for GPUs","authors":"Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes","doi":"10.1145/3505287","DOIUrl":"https://doi.org/10.1145/3505287","url":null,"abstract":"Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed “bijective shuffle” trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and only requires a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2021-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46192279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Engineering In-place (Shared-memory) Sorting Algorithms 工程就地(共享内存)排序算法
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-09-28 DOI: 10.1145/3505286
Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders
{"title":"Engineering In-place (Shared-memory) Sorting Algorithms","authors":"Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders","doi":"10.1145/3505286","DOIUrl":"https://doi.org/10.1145/3505286","url":null,"abstract":"We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – “best” sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43709384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing 异步多gpu编程模型与大规模图形处理的应用
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-08-01 DOI: 10.1145/3399730
Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali
{"title":"Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing","authors":"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali","doi":"10.1145/3399730","DOIUrl":"https://doi.org/10.1145/3399730","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. ","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79522401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
FEAST: A Lightweight Lock-free Concurrent Binary Search Tree FEAST:一个轻量级的无锁并发二叉搜索树
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-05-31 DOI: 10.1145/3391438
Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal
{"title":"FEAST: A Lightweight Lock-free Concurrent Binary Search Tree","authors":"Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal","doi":"10.1145/3391438","DOIUrl":"https://doi.org/10.1145/3391438","url":null,"abstract":"We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree especially when the contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82801233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows CoREC:可扩展和弹性的内存数据暂存原位工作流
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-05-16 DOI: 10.1145/3391448
Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar
{"title":"CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows","authors":"Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar","doi":"10.1145/3391448","DOIUrl":"https://doi.org/10.1145/3391448","url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/05-ART12 $15.00 https://doi.org/10.1145/3391448 ACM Transactions on Parallel Computing, Vol. 7, No. 2, Article 12. Publication date: May 2020.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78792111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
ROC: A Reconfigurable Optical Computer for Simulating Physical Processes 用于模拟物理过程的可重构光学计算机
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2020-04-02 DOI: 10.1145/3380944
Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi
{"title":"ROC: A Reconfigurable Optical Computer for Simulating Physical Processes","authors":"Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi","doi":"10.1145/3380944","DOIUrl":"https://doi.org/10.1145/3380944","url":null,"abstract":"Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both deviceand circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high-performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of 10’s of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2020-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78470620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Scheduling Mutual Exclusion Accesses in Equal-Length Jobs 调度等长作业中的互斥访问
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3342562
D. Kagaris, S. Dutta
{"title":"Scheduling Mutual Exclusion Accesses in Equal-Length Jobs","authors":"D. Kagaris, S. Dutta","doi":"10.1145/3342562","DOIUrl":"https://doi.org/10.1145/3342562","url":null,"abstract":"A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number <i>L</i> of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case of <i>L</i> = 2. In the notation commonly used for job-shop scheduling problems, this result means that the problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =2|<i>C</i><sub>max</sub> is polynomially solvable, adding to the polynomial solutions known for the problems <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub> and <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> = 1 | <i>C</i><sub>max</sub> (whereas other closely related versions such as <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 3 | <i>C</i><sub>max</sub>, <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> ∈ { 1,2} | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>d</i><sub><i>ij</i></sub>=1 | <i>C</i><sub>max</sub>, and <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> ≤ 3| <i>C</i><sub>max</sub> are all known to be NP-complete). For the general case <i>L</i> > 2 (i.e., for the job-shop problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =<i>L</i>> 2| <i>C</i><sub>max</sub>), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86220203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
I/O Scheduling Strategy for Periodic Applications 周期性应用的I/O调度策略
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3338510
G. Aupy, Ana Gainaru, Valentin Le Fèvre
{"title":"I/O Scheduling Strategy for Periodic Applications","authors":"G. Aupy, Ana Gainaru, Valentin Le Fèvre","doi":"10.1145/3338510","DOIUrl":"https://doi.org/10.1145/3338510","url":null,"abstract":"With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in supercomputers. Architectural enhancement such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications.\u0000 In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once during the job scheduling phase a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation up to 16% and the maximum system efficiency up to 18%.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84499785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Modeling Universal Globally Adaptive Load-Balanced Routing 建模通用全局自适应负载均衡路由
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3349620
Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
{"title":"Modeling Universal Globally Adaptive Load-Balanced Routing","authors":"Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang","doi":"10.1145/3349620","DOIUrl":"https://doi.org/10.1145/3349620","url":null,"abstract":"Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? In addition, what determines the performance of the UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGALbased on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner working of UGAL and further our understanding of this type of routing.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84908923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Scalable Deep Learning via I/O Analysis and Optimization 基于I/O分析和优化的可扩展深度学习
IF 1.6
ACM Transactions on Parallel Computing Pub Date : 2019-09-10 DOI: 10.1145/3331526
S. Pumma, Min Si, W. Feng, P. Balaji
{"title":"Scalable Deep Learning via I/O Analysis and Optimization","authors":"S. Pumma, Min Si, W. Feng, P. Balaji","doi":"10.1145/3331526","DOIUrl":"https://doi.org/10.1145/3331526","url":null,"abstract":"Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87253170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信