Proceedings of the 34th ACM International Conference on Supercomputing: Latest Publications

Identifying and (automatically) remedying performance problems in CPU/GPU applications

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392759

Benjamin Welton, B. Miller

GPU accelerators have become common on today's leadership-class computing platforms. Effective exploitation of the additional parallelism offered by GPUs is fraught with challenges. A key performance challenge faced by developers is how to limit the time consumed by synchronizations between the CPU and GPU. We introduce the extended feed-forward measurement (FFM) performance tool, which automatically detects synchronization problems, identifies whether a synchronization problem is a component of a larger construct that exhibits a problem beyond an individual synchronization operation, identifies remedies that can correct the issue, and in some cases automatically applies remedies to problems exhibited by larger constructs. The extended FFM performance tool identifies three causes of unnecessary synchronizations: a problem caused by a single operation, a problem caused by memory management issues, and a problem caused by a memory transfer. The extended FFM model prescribes remedies for each construct and can automatically apply remedies for problems caused by memory management and memory transfer. We created an implementation of the extended FFM performance tool and employed it to identify and automatically correct problems in three real-world scientific applications, resulting in an automatically obtained reduction in execution time between 9% and 43%.

Citations: 4

Tools for top-down performance analysis of GPU-accelerated applications

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392752

K. Zhou, Mark W. Krentel, J. Mellor-Crummey

Citations: 14

Fast distributed bandits for online recommendation systems

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392748

K. Mahadik, Qingyun Wu, Shuai Li, Amit Sabne

Contextual bandit algorithms are commonly used in recommender systems, where content popularity can change rapidly. These algorithms continuously learn latent mappings between users and items, based on contexts associated with them both. Recent recommendation algorithms that learn clustering or social structures between users have exhibited higher recommendation accuracy. However, as the number of users and items in the environment increases, the time required to generate recommendations grows significantly. As a result, these algorithms cannot be deployed in practice. The state-of-the-art distributed bandit algorithm, DCCB, relies on a peer-to-peer network to share information among distributed workers. However, this approach does not scale well with the increasing number of users. Furthermore, it suffers from slow discovery of clusters, resulting in accuracy degradation. To address the above issues, this paper proposes a novel distributed bandit-based algorithm called DistCLUB. This algorithm lazily creates clusters in a distributed manner, and dramatically reduces the network data sharing requirement, achieving high scalability. Additionally, DistCLUB finds clusters much faster, achieving better accuracy than the state-of-the-art algorithm. Evaluation over both real-world benchmarks and synthetic datasets shows that DistCLUB is on average 8.87x faster than DCCB, and achieves 14.5% higher normalized prediction performance.

Citations: 47

Parallelizing pruned landmark labeling: dealing with dependencies in graph algorithms

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392745

R. Jin, Zhen Peng, W. Wu, F. Dragan, G. Agrawal, Bin Ren

To help compute shortest path distances over large graphs efficiently, 2-hop labeling has emerged as a major tool, with Pruned Landmark Labeling (PLL) as a popular algorithm. This paper demonstrates the first scalable parallel implementation of the PLL algorithm that produces the same results as the sequential algorithm. Based on theoretical analysis, we show how computations on each vertex can be performed in parallel while maintaining correctness, resulting in the Vertex-Centric PLL (VC-PLL) algorithm. We also show a formulation of this algorithm based on linear algebra and argue why the use of a library based on linear algebra operations will not produce an efficient implementation. Next, we introduce a batched VC-PLL (BVC-PLL) algorithm to reduce the computational inefficiency in VC-PLL. We have carried out a parallel implementation of this method for modern clusters, combining shared memory and distributed memory parallelism, that can efficiently execute on graphs with more than a billion edges. We also demonstrate how the BVC-PLL algorithm can be extended to handle directed graphs and weighted graphs and how the version for weighted graphs can benefit from SIMD parallelization.
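For readers unfamiliar with 2-hop labeling, the following is a minimal sketch of the sequential pruned landmark labeling idea that the abstract takes as its baseline. It is not the paper's VC-PLL or BVC-PLL algorithm. It assumes an undirected, unweighted graph given as an adjacency dictionary, and the names `pruned_landmark_labeling` and `query` are illustrative choices, not identifiers from the paper.

```python
from collections import deque


def query(labels, u, v):
    """2-hop distance: minimum of d(u, hub) + d(hub, v) over hubs in both labels."""
    lu, lv = labels[u], labels[v]
    if len(lu) > len(lv):
        lu, lv = lv, lu  # iterate over the smaller label
    best = float("inf")
    for hub, d in lu.items():
        if hub in lv:
            best = min(best, d + lv[hub])
    return best


def pruned_landmark_labeling(adj):
    """Build 2-hop labels with one pruned BFS per vertex.

    adj maps each vertex to a list of neighbours (undirected, unweighted).
    Vertices are processed in order of decreasing degree (a common heuristic).
    """
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    labels = {v: {} for v in adj}
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier hubs already certify a distance <= d for (root, u).
            if query(labels, root, u) <= d:
                continue
            labels[u][root] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels


# Usage: a path graph 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = pruned_landmark_labeling(adj)
print(query(labels, 0, 4))  # 4
```

Each BFS depends on the labels written by all earlier ones, because pruning consults them. That ordering is the dependency the paper's title refers to, and it is what makes a naive parallelization non-trivial.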

Citations: 6

Computing, data and COVID-19

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3401882

K. Yelick

Citations: 0

Mapping and scheduling HPC applications for optimizing I/O

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392764

J. Carretero, E. Jeannot, Guillaume Pallez, D. E. Singh, Nicolas Vidal

In HPC platforms, concurrent applications share the same file system. This can lead to conflicts, especially as applications are more and more data intensive. I/O contention can represent a performance bottleneck. The access to bandwidth can be split into two complementary yet distinct problems: the mapping problem and the scheduling problem. The mapping problem consists of selecting the set of applications that are in competition for the I/O resource. The scheduling problem then consists of determining, given I/O requests on the same resource, the order of these accesses that minimizes the I/O time. In this work we propose to couple a novel bandwidth-aware mapping algorithm with I/O list-scheduling policies to develop a cross-layer optimization solution. We study this solution experimentally using an I/O middleware: CLARISSE. We show that naive policies such as FIFO perform relatively well for scheduling I/O movements, and that the key to reducing congestion lies mostly in the mapping. We evaluate the algorithm that we propose using a simulator that we validated experimentally. This evaluation shows important gains for the simple, bandwidth-aware mapping solution that we provide compared to its non-bandwidth-aware counterpart. The gains are both in terms of machine efficiency (makespan) and application efficiency (stretch). This stresses even more the importance of designing efficient, bandwidth-aware mapping strategies to alleviate the cost of I/O congestion.
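The two metrics named in the abstract can be illustrated with a toy model. The sketch below assumes a single shared I/O resource that serves one request at a time in FIFO order, and it uses the common textbook definition of stretch (time spent in the system divided by the time the request would take alone). The paper's exact model and definitions may differ; `fifo_schedule`, `makespan`, and `max_stretch` are hypothetical names introduced here.

```python
def fifo_schedule(requests):
    """Serve requests one at a time in order of release time.

    requests is a list of (release_time, duration) pairs, where duration is
    the transfer time the request would need with the resource to itself.
    Returns completion times in the same order as the input list.
    """
    order = sorted(range(len(requests)), key=lambda i: requests[i][0])
    clock = 0.0
    completion = [0.0] * len(requests)
    for i in order:
        release, duration = requests[i]
        start = max(clock, release)  # wait for the resource or for the release
        clock = start + duration
        completion[i] = clock
    return completion


def makespan(completion):
    """Machine-level metric: time at which the last request finishes."""
    return max(completion)


def max_stretch(requests, completion):
    """Application-level metric: worst slowdown relative to running alone."""
    return max((c - r) / d for (r, d), c in zip(requests, completion))


# Usage: three requests contending for the same resource
requests = [(0, 4), (1, 2), (2, 1)]
done = fifo_schedule(requests)
print(makespan(done))               # 7.0
print(max_stretch(requests, done))  # 5.0
```

In this toy case the short request released last waits behind two longer ones, so it dominates the stretch even though the makespan is unaffected by the service order.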

Citations: 9

V-Combiner: speeding-up iterative graph processing on a shared-memory platform with vertex merging

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392739

Azin Heidarshenas, Serif Yesil, Dimitrios Skarlatos, Sasa Misailovic, Adam Morrison, J. Torrellas

An iterative graph algorithm applies a vertex update operation to all vertices in a graph in every iteration. For large graphs, this computation is costly. However, in practice, not all the updates contribute equally to the end result and, in fact, an exact result may not be needed. In this work, we leverage these insights to speed up iterative graph algorithms. We propose a mechanism to identify the less important vertices and omit computations for them. Our scheme, called V-Combiner, is a deterministic, fast, and application-transparent technique to construct an approximate graph to enable faster execution. The main idea behind V-Combiner is to merge certain vertices into hubs, which are vertices that have many connections and contribute heavily to the end result of the algorithm. We also propose an inexpensive correction step to recover the contribution of the merged vertices to get higher accuracy. We evaluate V-Combiner on 4 different applications and 5 datasets. For 44-threaded runs, V-Combiner achieves an average end-to-end speedup of 1.25X over the conventional system, with an accuracy of 91.8%. It also shows a better performance-accuracy trade-off than the existing sparsification and k-core techniques.

Citations: 2

Sparse-TPU: adapting systolic arrays for sparse matrices

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392751

Xin He, S. Pal, Aporva Amarnath, Siying Feng, Dong-hyeon Park, A. Rovinski, Haojie Ye, Kuan-Yu Chen, R. Dreslinski, T. Mudge

While systolic arrays are widely used for dense-matrix operations, they are seldom used for sparse-matrix operations. In this paper, we show how a systolic array of Multiply-and-Accumulate (MAC) units, similar to Google's Tensor Processing Unit (TPU), can be adapted to efficiently handle sparse matrices. TPU-like accelerators are built upon a 2D array of MAC units and have demonstrated high throughput and efficiency for dense matrix multiplication, which is a key kernel in machine learning algorithms and is the target of the TPU. In this work, we employ a co-designed approach of first developing a packing technique to condense a sparse matrix and then proposing a systolic array based system, Sparse-TPU, abbreviated to STPU, to accommodate the matrix computations for the packed denser matrix counterparts. To demonstrate the efficacy of our co-designed approach, we evaluate sparse matrix-vector multiplication on a broad set of synthetic and real-world sparse matrices. Experimental results show that STPU delivers 16.08X higher performance while consuming 4.39X and 19.79X lower energy for integer (int8) and floating point (float32) implementations, respectively, over a TPU baseline. Meanwhile, STPU has 12.93% area overhead and an average of 4.14% increase in dynamic energy over the TPU baseline for the float32 implementation.
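The abstract does not spell out the packing technique, so the following is only one plausible illustration of what condensing a sparse matrix can mean: a greedy first-fit grouping of columns whose nonzero rows do not collide, so that each group could be stored as a single denser column. This is an assumption made for illustration, not the paper's algorithm, and `pack_columns` is a hypothetical name.

```python
def pack_columns(nonzero_rows_per_column):
    """Greedy first-fit packing of sparse columns into collision-free groups.

    nonzero_rows_per_column[j] is an iterable of the row indices where
    column j has a nonzero. Two columns may share a group only if no row
    holds a nonzero in both. Returns a list of groups of column indices.
    """
    groups = []  # each entry: (set of occupied rows, list of member columns)
    for col, rows in enumerate(nonzero_rows_per_column):
        rows = set(rows)
        for occupied, members in groups:
            if occupied.isdisjoint(rows):
                occupied.update(rows)  # claim these rows for the group
                members.append(col)
                break
        else:
            # No existing group can take this column without a collision.
            groups.append((rows, [col]))
    return [members for _, members in groups]


# Usage: five sparse columns of a 4-row matrix
columns = [{0, 2}, {1}, {2, 3}, {3}, {0}]
print(pack_columns(columns))  # [[0, 1, 3], [2, 4]]
```

Here five columns fit into two packed columns, which suggests why a denser packed matrix can keep more MAC units of a systolic array busy than the original sparse layout.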

Citations: 47

Graptor

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392753

H. Vandierendonck

Citations: 8

End-to-end performance modeling of distributed GPU applications

Proceedings of the 34th ACM International Conference on Supercomputing Pub Date : 2020-06-29 DOI: 10.1145/3392717.3392737

Jaemin Choi, D. Richards, L. Kalé, A. Bhatele

Citations: 8