Parallel CCD++ on GPU for Matrix Factorization

Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan
{"title":"Parallel CCD++ on GPU for Matrix Factorization","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan","doi":"10.1145/3038228.3038240","DOIUrl":null,"url":null,"abstract":"Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
基于GPU的并行CCD++矩阵分解
不完全矩阵的矩阵分解在推荐系统等应用中非常有用。针对推荐系统的矩阵分解问题,提出了几种迭代算法,其中包括循环坐标下降算法(CCD)。最近,CCD的一种变体CCD++被开发出来,作为一种在多核处理器上并行实现的有吸引力的算法。在本文中,我们讨论了用于GPU的CCD++并行化。关键的考虑因素是减少从/到GPU全局内存传输的数据量和最小化warp内负载不平衡。从基本实现开始,我们使用环路融合和平铺,利用硬件计数器数据的性能见解,逐步改进CCD++的GPU实现。结果表明,该算法比目前报道的CCD++的最佳多核实现以及矩阵分解的最佳GPU实现(使用ALS,交替最小二乘法)要快。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信