Two-dimensional data partitioning for non-negative matrix tri-factorization

IF 4.2, Zone 3 (Computer Science), JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Research Pub Date : 2024-06-19 DOI:10.1016/j.bdr.2024.100473

Jiaxing Yan, Hai Liu, Zhiqi Lei, Yanghui Rao, Guan Liu, Haoran Xie, Xiaohui Tao, Fu Lee Wang

Big Data Research, Volume 37, Article 100473. https://www.sciencedirect.com/science/article/pii/S2214579624000492

Citations: 0

Abstract

As a two-sided clustering and dimensionality reduction paradigm, Non-negative Matrix Tri-Factorization (NMTF) has attracted much attention from machine learning and data mining researchers due to its excellent performance and reliable theoretical support. Unlike Non-negative Matrix Factorization (NMF) methods, which apply to one-sided clustering only, NMTF introduces an additional factor matrix and exploits the inherent duality of data so that sample clustering and feature clustering reinforce each other, which gives it great advantages in many scenarios (e.g., text co-clustering). However, existing methods for solving NMTF usually involve intensive matrix multiplication with high time and space complexity; in practice, the multiplicative update rules converge slowly and the memory overhead is high. To address these problems, this paper develops a distributed parallel algorithm for NMTF with a 2-dimensional data partition scheme (i.e., PNMTF-2D). Experiments on multiple text datasets show that PNMTF-2D substantially improves the computational efficiency of NMTF (e.g., the average iteration time is reduced by up to 99.7% on Amazon) while preserving convergence and co-clustering effectiveness.
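For readers unfamiliar with the terms in the abstract, the following is a minimal single-machine sketch, not the paper's PNMTF-2D algorithm (the abstract does not give its update rules or communication scheme). It assumes the plain, unconstrained objective min ||X - F S Gᵀ||² with F, S, G ≥ 0 and the textbook multiplicative updates for it. The second function illustrates, again as an assumption about the general idea only, how a product such as X G can be assembled from a two-dimensional grid of blocks of X. The names `nmtf` and `blockwise_XG` and all parameters are hypothetical.

```python
import numpy as np


def nmtf(X, k, l, n_iter=50, eps=1e-9, seed=0):
    """Approximate X (m x n, non-negative) as F @ S @ G.T.

    F is m x k (row/sample clusters), G is n x l (column/feature
    clusters), S is k x l. Uses standard multiplicative updates for
    the unconstrained Frobenius objective (an assumption, see lead-in).
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k))
    S = rng.random((k, l))
    G = rng.random((n, l))
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled element-wise by a ratio of
        # non-negative terms, so non-negativity is preserved.
        F *= (X @ G @ S.T) / (F @ (S @ (G.T @ G) @ S.T) + eps)
        G *= (X.T @ F @ S) / (G @ (S.T @ (F.T @ F) @ S) + eps)
        S *= (F.T @ X @ G) / ((F.T @ F) @ S @ (G.T @ G) + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors


def blockwise_XG(X, G, p, q):
    """Compute X @ G from a p x q grid of blocks of X.

    Each (row block, column block) pair only needs its own block of X
    and the matching rows of G; partial results for the same row block
    are summed. In a distributed setting each pair could live on a
    different worker (illustrative only).
    """
    row_blocks = np.array_split(np.arange(X.shape[0]), p)
    col_blocks = np.array_split(np.arange(X.shape[1]), q)
    out = np.zeros((X.shape[0], G.shape[1]))
    for r in row_blocks:
        for c in col_blocks:
            out[r] += X[np.ix_(r, c)] @ G[c]
    return out


if __name__ == "__main__":
    X = np.random.default_rng(1).random((60, 40))
    F, S, G, errors = nmtf(X, k=3, l=4)
    print("first / last reconstruction error:", errors[0], errors[-1])
    print("block-wise product matches:",
          np.allclose(blockwise_XG(X, G, p=3, q=2), X @ G))
```

The matrix products in the three update lines are the "intensive matrix multiplication" the abstract refers to; the block-wise helper shows that such a product decomposes over a row-by-column grid, which is the kind of structure a 2-dimensional partition can exploit.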




Source journal

Big Data Research Computer Science-Computer Science Applications

CiteScore

8.40

Self-citation rate

3.00%

Publication count

Journal description: The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific Platforms and Technologies used to deal with big data. To promote Data Science and interdisciplinary collaboration between fields, and to showcase the benefits of data driven research, papers demonstrating applications of big data in domains as diverse as Geoscience, Social Web, Finance, e-Commerce, Health Care, Environment and Climate, Physics and Astronomy, Chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.