广义矩阵因式分解：为大型数据阵列拟合广义线性潜变量模型的高效算法。

IF 5.2 3区计算机科学 Q1 AUTOMATION & CONTROL SYSTEMS

Journal of Machine Learning Research Pub Date : 2022-11-01

Łukasz Kidziński, Francis K C Hui, David I Warton, Trevor Hastie

{"title":"广义矩阵因式分解：为大型数据阵列拟合广义线性潜变量模型的高效算法。","authors":"Łukasz Kidziński, Francis K C Hui, David I Warton, Trevor Hastie","doi":"","DOIUrl":null,"url":null,"abstract":"Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.","PeriodicalId":50161,"journal":{"name":"Journal of Machine Learning Research","volume":"23 ","pages":""},"PeriodicalIF":5.2000,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10129058/pdf/nihms-1843577.pdf","citationCount":"0","resultStr":"{\"title\":\"Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays.\",\"authors\":\"Łukasz Kidziński, Francis K C Hui, David I Warton, Trevor Hastie\",\"doi\":\"\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.\",\"PeriodicalId\":50161,\"journal\":{\"name\":\"Journal of Machine Learning Research\",\"volume\":\"23 \",\"pages\":\"\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2022-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10129058/pdf/nihms-1843577.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Machine Learning Research\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"AUTOMATION & CONTROL SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Machine Learning Research","FirstCategoryId":"94","ListUrlMain":"","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

心理学、生态学和医学等多个领域都在研究多变量测量之间的相关性。对于高斯测量，有一些经典的工具，如因子分析或主成分分析，具有成熟的理论和快速的算法。广义线性潜变量模型（GLLVM）将这些因子模型推广到非高斯响应。然而，目前在 GLLVMs 中估计模型参数的算法需要大量计算，无法扩展到包含数千个观察单元或反应的大型数据集。在本文中，我们提出了一种将 GLLVM 拟合到高维数据集的新方法，该方法基于使用惩罚准似然法逼近模型，然后使用牛顿方法和费雪评分来学习模型参数。在计算上，我们的方法明显更快、更稳定，能对比起以前更大的矩阵进行 GLLVM 拟合。我们在一个包含 48,000 个观测单元的数据集上应用了我们的方法，每个单元中有超过 2,000 个观测物种，结果发现大部分变异性都可以用少数几个因子来解释。我们发布了我们提出的拟合算法的易用实现方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

本刊更多论文

Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays.

Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of Machine Learning Research 工程技术-计算机：人工智能

CiteScore

18.80

自引率

0.00%

发文量

审稿时长

3 months

期刊介绍： The Journal of Machine Learning Research (JMLR) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. JMLR seeks previously unpublished papers on machine learning that contain: new principled algorithms with sound empirical validation, and with justification of theoretical, psychological, or biological nature; experimental and/or theoretical studies yielding new insight into the design and behavior of learning in intelligent systems; accounts of applications of existing techniques that shed light on the strengths and weaknesses of the methods; formalization of new learning tasks (e.g., in the context of new applications) and of methods for assessing performance on those tasks; development of new analytical frameworks that advance theoretical studies of practical learning methods; computational models of data from natural learning systems at the behavioral or neural level; or extremely well-written surveys of existing work.