Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan
{"title":"Parallel CCD++ on GPU for Matrix Factorization","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Rakshith Kunchum, P. Sadayappan","doi":"10.1145/3038228.3038240","DOIUrl":null,"url":null,"abstract":"Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21
Abstract
Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization for recommender systems, including Cyclic Coordinate Descent (CCD). Recently a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are the reduction of data volume transferred from/to GPU global memory and minimization of intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ using loop fusion and tiling, using performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).