Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs

Wei Tan, Liangliang Cao, L. Fong
{"title":"Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs","authors":"Wei Tan, Liangliang Cao, L. Fong","doi":"10.1145/2907294.2907297","DOIUrl":null,"url":null,"abstract":"Matrix factorization (MF) is used by many popular algorithms such as collaborative filtering. GPU with massive cores and high memory bandwidth sheds light on accelerating MF much further when appropriately exploiting its architectural characteristics. This paper presents cuMF, a CUDA-based matrix factorization library that optimizes alternate least square (ALS) method to solve very large-scale MF. CuMF uses a set of techniques to maximize the performance on single and multiple GPUs. These techniques include smart access of sparse data leveraging GPU memory hierarchy, using data parallelism in conjunction with model parallelism, minimizing the communication overhead among GPUs, and a novel topology-aware parallel reduction scheme. With only a single machine with four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, compared with the state-of-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem ever reported in current literature, with impressively good performance.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"53 70 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"52","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2907294.2907297","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 52

Abstract

Matrix factorization (MF) underlies many popular algorithms, such as collaborative filtering. GPUs, with their massive core counts and high memory bandwidth, can accelerate MF much further when their architectural characteristics are appropriately exploited. This paper presents cuMF, a CUDA-based matrix factorization library that optimizes the alternating least squares (ALS) method to solve very large-scale MF problems. CuMF uses a set of techniques to maximize performance on single and multiple GPUs. These techniques include smart access of sparse data that leverages the GPU memory hierarchy, data parallelism used in conjunction with model parallelism, minimized communication overhead among GPUs, and a novel topology-aware parallel reduction scheme. With a single machine holding four Nvidia GPU cards, cuMF can be 6-10 times as fast, and 33-100 times as cost-efficient, as state-of-the-art distributed CPU solutions. Moreover, cuMF can solve the largest matrix factorization problem reported in the literature to date, with impressively good performance.
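For context, ALS alternates between solving for the user factors with the item factors fixed and vice versa; each row's update is an independent, small regularized least-squares solve, which is what makes the method so amenable to the massive parallelism of GPUs. The sketch below is a minimal single-machine NumPy illustration of that alternation, not cuMF's CUDA implementation; the function name `als`, the dense rating matrix `R`, and the 0/1 observation `mask` are assumptions made for readability, whereas cuMF operates on sparse data directly in GPU memory and distributes these independent per-row solves across GPU cores and multiple GPUs.

```python
import numpy as np

def als(R, mask, rank=8, reg=0.1, iters=10, seed=0):
    """Minimal alternating least squares for R ~= X @ Y.T, where
    mask[u, v] == 1 marks an observed rating (illustration only)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    X = 0.1 * rng.standard_normal((m, rank))   # user factors
    Y = 0.1 * rng.standard_normal((n, rank))   # item factors
    I = np.eye(rank)
    for _ in range(iters):
        # Fix Y; each user row is an independent rank x rank regularized solve.
        for u in range(m):
            obs = mask[u] > 0
            Yu = Y[obs]
            X[u] = np.linalg.solve(Yu.T @ Yu + reg * I, Yu.T @ R[u, obs])
        # Fix X; each item column is likewise independent.
        for v in range(n):
            obs = mask[:, v] > 0
            Xv = X[obs]
            Y[v] = np.linalg.solve(Xv.T @ Xv + reg * I, Xv.T @ R[obs, v])
    return X, Y

# Tiny usage example on synthetic data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_X = rng.standard_normal((50, 4))
    true_Y = rng.standard_normal((30, 4))
    R = true_X @ true_Y.T
    mask = (rng.random(R.shape) < 0.3).astype(float)
    X, Y = als(R * mask, mask, rank=4)
    err = np.abs(X @ Y.T - R)[mask > 0].mean()
    print(f"mean absolute error on observed entries: {err:.3f}")
```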