BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing

Linnan Wang, Wei Wu, Jianxiong Xiao, Yezhou Yang
{"title":"BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing","authors":"Linnan Wang, Wei Wu, Jianxiong Xiao, Yezhou Yang","doi":"10.1145/2925426.2926256","DOIUrl":null,"url":null,"abstract":"Basic Linear Algebra Subprograms (BLAS) are a set of low level linear algebra kernels widely adopted by applications involved with the deep learning and scientific computing. The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementation of compute-intensive level 3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level 3 BLAS and present that 1) issues, such as the improper load balancing, inefficient communication, insufficient GPU stream level concurrency and data caching, impede current implementations from fully harnessing heterogeneous computing resources; 2) and the inter-GPU Peer-to-Peer(P2P) communication remains unexplored. We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles treating a matrix tile as the basic data unit and operations on tiles as the basic task. Tasks are guided with a dynamic asynchronous runtime, which is cache and locality aware. The communication cost under BLASX becomes trivial as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. It also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache, taking advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations; and the extensive benchmarks demonstrate that BLASX consistently outperforms the related leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, MAGMA.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"49","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2925426.2926256","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 49

Abstract

Basic Linear Algebra Subprograms (BLAS) are a set of low-level linear algebra kernels widely adopted by applications in deep learning and scientific computing. The massive, economical computing power offered by emerging GPU architectures has driven interest in implementing the compute-intensive level-3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level-3 BLAS implementations and show that 1) issues such as improper load balancing, inefficient communication, and insufficient GPU stream-level concurrency and data caching prevent current implementations from fully harnessing heterogeneous computing resources; and 2) inter-GPU Peer-to-Peer (P2P) communication remains unexplored. We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the algorithms-by-tiles approach, treating a matrix tile as the basic data unit and an operation on tiles as the basic task. Tasks are scheduled by a dynamic asynchronous runtime that is cache- and locality-aware. The communication cost under BLASX becomes trivial, as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. BLASX also takes the current tile-cache scheme one step further by proposing an innovative two-level hierarchical tile cache that takes advantage of inter-GPU P2P communication. As a result, linear speedup is observed with BLASX under multi-GPU configurations, and extensive benchmarks demonstrate that BLASX consistently outperforms leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, and MAGMA.
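To make the two core ideas concrete, below is a minimal single-GPU sketch (not BLASX source code) of the tile decomposition and the stream-based communication/computation overlap the abstract describes: C = alpha*A*B + beta*C is split into one task per tile of C, and each task's copy-in, cuBLAS GEMM calls, and copy-out are enqueued on one of several CUDA streams, so transfers for one tile overlap compute for another. The tile size T, the stream count, and the round-robin task-to-stream assignment are illustrative assumptions; BLASX instead schedules tasks dynamically across multiple GPUs and adds the two-level P2P tile cache, which this sketch omits.

```cpp
// Tiled DGEMM sketch: one task per C tile, round-robin over CUDA streams.
// Build with: nvcc -lcublas tiled_gemm_sketch.cu
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

#define T        1024   // assumed tile edge length (BLASX tunes this per GPU)
#define NTILES   4      // matrices are (NTILES*T) x (NTILES*T), column-major
#define NSTREAMS 4      // concurrent copy/compute pipelines

// Enqueue an async copy of one T x T tile; pitches are in elements.
// Error checking is omitted for brevity.
static void memcpyTile(double* dst, size_t dpitch, const double* src,
                       size_t spitch, cudaMemcpyKind kind, cudaStream_t s) {
    cudaMemcpy2DAsync(dst, dpitch * sizeof(double),
                      src, spitch * sizeof(double),
                      T * sizeof(double), T, kind, s);
}

int main() {
    const int n = NTILES * T;
    const double alpha = 1.0, beta = 0.0;
    const size_t matBytes  = (size_t)n * n * sizeof(double);
    const size_t tileBytes = (size_t)T * T * sizeof(double);

    // Pinned host buffers so async copies can truly overlap with kernels.
    double *hA, *hB, *hC;
    cudaMallocHost(&hA, matBytes);
    cudaMallocHost(&hB, matBytes);
    cudaMallocHost(&hC, matBytes);
    for (size_t i = 0; i < (size_t)n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; hC[i] = 0.0; }

    // One stream, cuBLAS handle, and set of tile buffers per pipeline.
    cudaStream_t   stream[NSTREAMS];
    cublasHandle_t handle[NSTREAMS];
    double *dA[NSTREAMS], *dB[NSTREAMS], *dC[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamCreate(&stream[s]);
        cublasCreate(&handle[s]);
        cublasSetStream(handle[s], stream[s]);
        cudaMalloc(&dA[s], tileBytes);
        cudaMalloc(&dB[s], tileBytes);
        cudaMalloc(&dC[s], tileBytes);
    }

    // One task per C tile. Work on a single stream executes in order, so
    // each task's copy-in -> GEMMs -> copy-out is correct even though the
    // buffers are reused; different streams overlap transfers and compute.
    for (int task = 0; task < NTILES * NTILES; ++task) {
        const int i = task % NTILES, j = task / NTILES, s = task % NSTREAMS;
        double* hCtile = hC + (size_t)j * T * n + (size_t)i * T;
        memcpyTile(dC[s], T, hCtile, n, cudaMemcpyHostToDevice, stream[s]);
        for (int k = 0; k < NTILES; ++k) {            // C_ij += A_ik * B_kj
            memcpyTile(dA[s], T, hA + (size_t)k * T * n + (size_t)i * T, n,
                       cudaMemcpyHostToDevice, stream[s]);
            memcpyTile(dB[s], T, hB + (size_t)j * T * n + (size_t)k * T, n,
                       cudaMemcpyHostToDevice, stream[s]);
            const double b = (k == 0) ? beta : 1.0;   // apply beta exactly once
            cublasDgemm(handle[s], CUBLAS_OP_N, CUBLAS_OP_N, T, T, T,
                        &alpha, dA[s], T, dB[s], T, &b, dC[s], T);
        }
        memcpyTile(hCtile, n, dC[s], T, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %d)\n", hC[0], 2 * n);  // each entry sums 1*2 over n
    return 0;
}
```

The inter-GPU half of the design would additionally rely on the standard CUDA P2P path (cudaDeviceCanAccessPeer, cudaDeviceEnablePeerAccess, cudaMemcpyPeerAsync), so that a tile cached on one GPU can be fetched directly by another without a round trip through host memory; that is the mechanism the two-level tile cache exploits.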