Fast and Robust Parallel SGD Matrix Factorization

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2015-08-10. DOI: 10.1145/2783258.2783322

Jinoh Oh, Wook-Shin Han, Hwanjo Yu, Xiaoqian Jiang


Citations: 48

Abstract

Matrix factorization is one of the fundamental techniques for analyzing latent relationships between two entities. It is especially widely used for recommendation because of its high accuracy. Efficient parallel SGD matrix factorization algorithms have been developed for large matrices to speed up the convergence of factorization. However, most of them are designed for a shared-memory environment and thus fail to factorize a matrix that is too big to fit in memory, and their performance is also unreliable when the matrix is skewed. This paper proposes a fast and robust parallel SGD matrix factorization algorithm, called MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSDs) as well as in shared memory. MLGF-MF uses a Multi-Level Grid File (MLGF) to partition the matrix and minimizes the cost of scheduling parallel SGD updates on the partitioned regions by exploiting partial match query processing. As a result, MLGF-MF produces reliable results efficiently even on skewed matrices. MLGF-MF is designed with asynchronous I/O integrated throughout the algorithm, so that the CPU keeps executing without waiting for I/O to complete. MLGF-MF thereby overlaps CPU and I/O processing, which offsets the I/O cost and maximizes CPU utilization. Recent flash SSDs support high-performance parallel I/O and are therefore well suited to asynchronous I/O. In our extensive evaluations, MLGF-MF significantly outperforms (that is, converges faster than) the state-of-the-art algorithms in both shared-memory and block-storage environments. In addition, the outputs of MLGF-MF are significantly more robust to skewed matrices. Our implementation of MLGF-MF is available at http://dm.postech.ac.kr/MLGF-MF as executable files.
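
For readers unfamiliar with the setting, the following is a minimal sketch of generic SGD matrix factorization, together with the independence test that grid-partitioned parallel SGD schemes commonly rely on: two blocks can be updated concurrently only if they share no rows and no columns. It is not the MLGF-MF algorithm. The abstract does not describe the MLGF partitioning, the partial match scheduling, or the asynchronous I/O in enough detail to reproduce them, so the function names (`sgd_mf`, `rmse`, `blocks_conflict`), the hyperparameters, and the half-open block representation are all assumptions made for illustration.

```python
import math
import random


def sgd_mf(ratings, n_rows, n_cols, rank=2, lr=0.02, reg=0.01, epochs=300, seed=0):
    """Sequential SGD matrix factorization on (row, col, value) triples.

    Illustrative only: hyperparameter defaults are assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    # Small random initial factors; P is n_rows x rank, Q is n_cols x rank.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)
        for i, j, r in samples:
            err = r - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the L2-regularized squared error.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the observed entries."""
    sq = [(r - sum(p * q for p, q in zip(P[i], Q[j]))) ** 2 for i, j, r in ratings]
    return math.sqrt(sum(sq) / len(sq))


def blocks_conflict(a, b):
    """Return True if two blocks cannot be updated in parallel.

    A block is ((row_lo, row_hi), (col_lo, col_hi)) with half-open ranges
    (a hypothetical representation). Blocks conflict when they share any row
    or any column, because their updates would then touch the same row of P
    or the same row of Q.
    """
    (ar, ac), (br, bc) = a, b
    rows_overlap = ar[0] < br[1] and br[0] < ar[1]
    cols_overlap = ac[0] < bc[1] and bc[0] < ac[1]
    return rows_overlap or cols_overlap
```

A scheduler built on this idea hands a worker only a block that does not conflict with any block currently being processed. On a skewed matrix, equally sized blocks hold very different numbers of entries, which is the kind of imbalance the abstract says MLGF-MF is designed to handle.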



