PLS for Big Data: A unified parallel algorithm for regularised group PLS

Impact factor: 11 · JCR Q1 (Statistics & Probability)

Statistics Surveys · Pub Date: 2019-01-01 · DOI: 10.1214/19-ss125

P. L. D. Micheaux, B. Liquet, Matthew Sutton


Citations: 6

Abstract

Partial Least Squares (PLS) methods have been heavily exploited to analyse the association between two blocks of data. These powerful approaches can be applied to data sets where the number of variables is greater than the number of observations and in the presence of high collinearity between variables. Different sparse versions of PLS have been developed to integrate multiple data sets while simultaneously selecting the contributing variables. Sparse modeling is a key factor in obtaining better estimators and identifying associations between multiple data sets. The cornerstone of the sparse PLS methods is the link between the singular value decomposition (SVD) of a matrix (constructed from deflated versions of the original data) and least squares minimization in linear regression. We review four popular PLS methods for two blocks of data. A unified algorithm is proposed to perform all four types of PLS including their regularised versions. We present various approaches to decrease the computation time and show how the whole procedure can be scalable to big data sets. The bigsgPLS R package implements our unified algorithm and is available at https://github.com/matt-sutton/bigsgPLS. MSC 2010 subject classifications: Primary 62-02, 62J99.
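The link between the SVD and penalised least squares mentioned in the abstract can be illustrated with a small sketch. The code below is not the bigsgPLS implementation or the paper's unified algorithm; it is a minimal pure-Python illustration, assuming column-centred data blocks, of one common way sparse PLS variants use that link: the first pair of weight vectors is obtained by alternating updates on M = XᵀY (power iteration for the leading singular vectors), with soft-thresholding applied to the X weights to induce sparsity. The function names, the penalty parameter `lam`, and the toy data are all hypothetical.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def normalise(x):
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x] if norm > 0 else list(x)

def soft_threshold(x, lam):
    # Shrink each entry towards zero by lam; entries with |a| <= lam become zero.
    return [math.copysign(max(abs(a) - lam, 0.0), a) for a in x]

def sparse_pls_weights(X, Y, lam=0.0, n_iter=100):
    """First pair of PLS weight vectors (u for X, v for Y).

    Assumes the columns of X and Y are centred. With lam = 0 this is a
    power iteration for the leading singular vectors of M = X^T Y;
    with lam > 0 the X weights are soft-thresholded at every step.
    """
    Xt, Yt = transpose(X), transpose(Y)
    # M[j][k] = inner product of column j of X with column k of Y.
    M = [[sum(a * b for a, b in zip(xc, yc)) for yc in Yt] for xc in Xt]
    Mt = transpose(M)
    v = normalise([1.0] * len(Mt))
    u = [0.0] * len(M)
    for _ in range(n_iter):
        u = normalise(soft_threshold(matvec(M, v), lam))
        v = normalise(matvec(Mt, u))
    return u, v

# Toy data (hypothetical): 4 observations, 3 X variables, 2 Y variables.
# The third X variable is only weakly associated with Y.
X = [[1.0, 2.0, 0.1], [-1.0, -2.0, -0.1], [2.0, 1.0, -0.1], [-2.0, -1.0, 0.1]]
Y = [[1.0, 0.5], [-1.0, -0.5], [2.0, 1.0], [-2.0, -1.0]]

u_dense, v_dense = sparse_pls_weights(X, Y, lam=0.0)
u_sparse, v_sparse = sparse_pls_weights(X, Y, lam=1.0)
print("dense X weights: ", u_dense)
print("sparse X weights:", u_sparse)
```

In this toy example the unpenalised weights give the third X variable a small non-zero loading, while the thresholded version sets it to zero, which is the variable-selection behaviour the abstract attributes to sparse PLS. A group-penalised variant would threshold whole blocks of weights together rather than individual entries.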


Source journal

Statistics Surveys (Statistics & Probability)

CiteScore: 11.70

Self-citation rate: 0.00%

About the journal: Statistics Surveys publishes survey articles in theoretical, computational, and applied statistics. The style of articles may range from reviews of recent research to graduate textbook exposition. Articles may be broad or narrow in scope. The essential requirements are a well-specified topic and target audience, together with clear exposition. Statistics Surveys is sponsored by the American Statistical Association, the Bernoulli Society, the Institute of Mathematical Statistics, and the Statistical Society of Canada.