Parallel CCD++ on GPU for Matrix Factorization
Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan
Proceedings of the General Purpose GPUs (GPGPU '17), February 4, 2017
DOI: 10.1145/3038228.3038240
Citations: 21

Abstract: Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization in recommender systems, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of the data volume transferred to and from GPU global memory and the minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++, as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
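To make the algorithm behind the abstract concrete, the following is a minimal NumPy sketch of the rank-one coordinate-descent updates that CCD++ is built on: each latent feature is folded back into the residual, refit with closed-form least-squares coordinate updates over the observed entries, and subtracted out again. This is an illustrative CPU sketch only, not the authors' GPU implementation; the function name, parameters, and regularization constant are assumptions, and the GPU-specific concerns the paper addresses (loop fusion, tiling, warp load balance) are not represented.

```python
import numpy as np

def ccdpp(R, mask, k=8, lam=0.05, outer_iters=10, inner_iters=3, seed=0):
    """Sketch of CCD++-style rank-one coordinate descent.

    R    : m x n matrix of ratings (entries outside `mask` are ignored)
    mask : boolean m x n matrix, True where an entry is observed
    Approximates R ~= W @ H.T on the observed entries, minimizing
    squared error plus L2 regularization with weight `lam`.
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    M = mask.astype(float)            # 0/1 indicator of observed entries
    W = rng.standard_normal((m, k)) * 0.1
    H = rng.standard_normal((n, k)) * 0.1
    # Residual on observed entries only (zero elsewhere)
    E = M * (R - W @ H.T)
    for _ in range(outer_iters):
        for t in range(k):            # one latent feature at a time
            w, h = W[:, t].copy(), H[:, t].copy()
            # "Add back" feature t into the residual (rank-one update)
            E = E + M * np.outer(w, h)
            for _ in range(inner_iters):
                # Closed-form coordinate updates for the rank-one subproblem:
                # w_i = sum_{j in Omega_i} e_ij h_j / (lam + sum h_j^2)
                w = (E @ h) / (lam + M @ (h * h))
                h = (E.T @ w) / (lam + M.T @ (w * w))
            # Subtract the refit feature back out of the residual
            E = E - M * np.outer(w, h)
            W[:, t], H[:, t] = w, h
    return W, H
```

The rank-one structure is what makes CCD++ attractive for parallel hardware: each inner update is a sparse matrix-vector-like operation whose memory traffic and load balance can be tuned, which is exactly where the paper's loop fusion and tiling come in.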