Determining the number of components in PLS regression on incomplete data set

IF 0.9, Zone 4 (Mathematics), JCR Q3 (Mathematics)

Statistical Applications in Genetics and Molecular Biology. Pub Date: 2018-10-18. DOI: 10.1515/sagmb-2018-0059

T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer

Citations: 24

Abstract

Partial least squares (PLS) regression is a multivariate method in which the model parameters are estimated using either the SIMPLS or the NIPALS algorithm. PLS regression has been used extensively in applied research because it is effective at analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide parameter estimates on incomplete data. Selecting the number of components used to build a representative model is a central issue in PLS regression. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches to component selection have been proposed in the literature, including the Q2 criterion and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when it is used to fit a PLS regression under various proportions of missing data and different types of missingness. We compare criteria for selecting the number of components for a PLS regression on incomplete data sets and on data sets imputed with three methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. The criteria were tested with proportions of missing data ranging from 5% to 50% under different missingness assumptions. Q2 leave-one-out component selection methods gave more reliable results than AIC- and BIC-based ones.
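To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of a single-response PLS fit by NIPALS that skips missing cells, followed by a leave-one-out Q2 computation. Several things here are assumptions that the abstract does not state: the function names (`nipals_pls1`, `predict`, `q2_loo`) are hypothetical, the response is taken to be complete, every row and column of X is assumed to have at least one observed value, the response is deflated sequentially, Q2 for component h is taken as 1 - PRESS_h / RSS_(h-1), and the cutoff 0.0975 is a conventional value from the PLS literature rather than one given above.

```python
import math

def _dot_obs(a, b):
    """Sum a_k * b_k over positions where neither value is missing (NaN)."""
    return sum(x * y for x, y in zip(a, b) if not (math.isnan(x) or math.isnan(y)))

def _ssq_where_obs(values, ref):
    """Sum values_k ** 2 over the positions where ref_k is observed."""
    return sum(v * v for v, r in zip(values, ref) if not math.isnan(r))

def _scores(E, w):
    # t_i: regress row i of E on w, using only the observed cells of that row
    return [_dot_obs(row, w) / _ssq_where_obs(w, row) for row in E]

def _deflate(E, t, p):
    # NaN minus anything stays NaN, so missing cells remain missing
    return [[x - ti * pj for x, pj in zip(row, p)] for row, ti in zip(E, t)]

def nipals_pls1(X, y, n_comp):
    """PLS1 fitted by NIPALS; NaN cells of X are skipped in every sum."""
    x_mean = [sum(v for v in col if not math.isnan(v)) /
              sum(1 for v in col if not math.isnan(v)) for col in zip(*X)]
    y_mean = sum(y) / len(y)
    E = [[x - m for x, m in zip(row, x_mean)] for row in X]   # centred X
    f = [v - y_mean for v in y]                               # centred y
    W, P, C = [], [], []
    for _ in range(n_comp):
        cols = list(zip(*E))
        # weights: regress each column on f over that column's observed rows
        w = [_dot_obs(col, f) / _ssq_where_obs(f, col) for col in cols]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        t = _scores(E, w)
        # loadings: regress each column on t over that column's observed rows
        p = [_dot_obs(col, t) / _ssq_where_obs(t, col) for col in cols]
        c = _dot_obs(f, t) / _dot_obs(t, t)
        E = _deflate(E, t, p)
        f = [fi - c * ti for fi, ti in zip(f, t)]             # sequential deflation (assumption)
        W.append(w); P.append(p); C.append(c)
    return {"x_mean": x_mean, "y_mean": y_mean, "W": W, "P": P, "C": C}

def predict(model, X_new, n_comp):
    """Predict y for the rows of X_new using the first n_comp components."""
    E = [[x - m for x, m in zip(row, model["x_mean"])] for row in X_new]
    y_hat = [model["y_mean"]] * len(X_new)
    for w, p, c in list(zip(model["W"], model["P"], model["C"]))[:n_comp]:
        t = _scores(E, w)
        E = _deflate(E, t, p)
        y_hat = [yh + c * ti for yh, ti in zip(y_hat, t)]
    return y_hat

def q2_loo(X, y, max_comp, threshold=0.0975):
    """Return ([Q2_1, ..., Q2_max_comp], number of leading components kept)."""
    press = [0.0] * (max_comp + 1)
    for i in range(len(y)):
        # refit without row i, then predict row i with 1..max_comp components
        model = nipals_pls1(X[:i] + X[i + 1:], y[:i] + y[i + 1:], max_comp)
        for h in range(1, max_comp + 1):
            press[h] += (y[i] - predict(model, [X[i]], h)[0]) ** 2
    full = nipals_pls1(X, y, max_comp)
    rss = [sum((yi - yh) ** 2 for yi, yh in zip(y, predict(full, X, h)))
           for h in range(max_comp + 1)]                      # rss[0] is the total sum of squares
    q2 = [1.0 - press[h] / rss[h - 1] for h in range(1, max_comp + 1)]
    n_sel = 0
    for value in q2:                                          # stop at the first component below the cutoff
        if value < threshold:
            break
        n_sel += 1
    return q2, n_sel

# Toy usage: with a single predictor, one component is simple linear regression.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]
model = nipals_pls1(X, y, 1)
print(predict(model, [[5.0]], 1))
```

The sketch takes X as a list of rows with `float("nan")` marking missing cells, and y as a plain list. It is written with the standard library only so that each sum over observed cells stays visible. With missing data the scores are no longer exactly orthogonal, so an implementation that regresses y jointly on all scores could give slightly different coefficients than the sequential deflation used here.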

Source journal

Statistical Applications in Genetics and Molecular Biology (Biology: Biochemistry and Molecular Biology)

CiteScore: 1.20

Self-citation rate: 11.10%

Articles published:

Review time: 6-12 weeks

About the journal: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. Papers should focus on the relevant statistical issues but should also contain a succinct description of the biological problem being considered. The range of topics is wide and includes linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and database search strategies. Both original research and review articles will be warmly received.