Determining the number of components in PLS regression on incomplete data set

IF 0.9 4区 数学 Q3 Mathematics
T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer
{"title":"Determining the number of components in PLS regression on incomplete data set","authors":"T. Nengsih, F. Bertrand, M. Maumy-Bertrand, Nicolas Meyer","doi":"10.1515/sagmb-2018-0059","DOIUrl":null,"url":null,"abstract":"Abstract Partial least squares regression – or PLS regression – is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide estimates parameters on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion, and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria to select the number of components for a PLS regression on incomplete data set and on imputed data set using three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC and BIC-based ones.","PeriodicalId":49477,"journal":{"name":"Statistical Applications in Genetics and Molecular Biology","volume":"18 1","pages":""},"PeriodicalIF":0.9000,"publicationDate":"2018-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1515/sagmb-2018-0059","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Applications in Genetics and Molecular Biology","FirstCategoryId":"100","ListUrlMain":"https://doi.org/10.1515/sagmb-2018-0059","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}
引用次数: 24

Abstract

Abstract Partial least squares regression – or PLS regression – is a multivariate method in which the model parameters are estimated using either the SIMPLS or NIPALS algorithm. PLS regression has been extensively used in applied research because of its effectiveness in analyzing relationships between an outcome and one or several components. Note that the NIPALS algorithm can provide estimates parameters on incomplete data. The selection of the number of components used to build a representative model in PLS regression is a central issue. However, how to deal with missing data when using PLS regression remains a matter of debate. Several approaches have been proposed in the literature, including the Q2 criterion, and the AIC and BIC criteria. Here we study the behavior of the NIPALS algorithm when used to fit a PLS regression for various proportions of missing data and different types of missingness. We compare criteria to select the number of components for a PLS regression on incomplete data set and on imputed data set using three imputation methods: multiple imputation by chained equations, k-nearest neighbour imputation, and singular value decomposition imputation. We tested various criteria with different proportions of missing data (ranging from 5% to 50%) under different missingness assumptions. Q2-leave-one-out component selection methods gave more reliable results than AIC and BIC-based ones.
不完全数据集PLS回归中分量数量的确定
偏最小二乘回归-或PLS回归-是一种多变量方法,其中模型参数估计使用SIMPLS或NIPALS算法。PLS回归因其在分析结果与一个或多个成分之间的关系方面的有效性而被广泛应用于应用研究。注意,NIPALS算法可以在不完整的数据上提供估计参数。在PLS回归中,用于构建代表性模型的组件数量的选择是一个中心问题。然而,在使用PLS回归时如何处理缺失数据仍然是一个有争议的问题。文献中提出了几种方法,包括Q2标准、AIC和BIC标准。在这里,我们研究NIPALS算法在用于拟合PLS回归时的行为,用于不同比例的缺失数据和不同类型的缺失。我们比较了选择不完整数据集和输入数据集上PLS回归的组件数量的标准,使用三种输入方法:链式方程的多重输入,k近邻输入和奇异值分解输入。在不同的缺失假设下,我们用不同的缺失数据比例(从5%到50%不等)测试了各种标准。Q2-leave-one-out方法比基于AIC和bic的选择结果更可靠。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
1.20
自引率
11.10%
发文量
8
审稿时长
6-12 weeks
期刊介绍: Statistical Applications in Genetics and Molecular Biology seeks to publish significant research on the application of statistical ideas to problems arising from computational biology. The focus of the papers should be on the relevant statistical issues but should contain a succinct description of the relevant biological problem being considered. The range of topics is wide and will include topics such as linkage mapping, association studies, gene finding and sequence alignment, protein structure prediction, design and analysis of microarray data, molecular evolution and phylogenetic trees, DNA topology, and data base search strategies. Both original research and review articles will be warmly received.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信