Xudong Huang, Guangzao Huang, Xiaojing Chen, Zhonghao Xie, Shujat Ali, Xi Chen, Leiming Yuan, Wen Shi
{"title":"An adaptive strategy to improve the partial least squares model via minimum covariance determinant","authors":"Xudong Huang, Guangzao Huang, Xiaojing Chen, Zhonghao Xie, Shujat Ali, Xi Chen, Leiming Yuan, Wen Shi","doi":"10.1016/j.chemolab.2024.105120","DOIUrl":null,"url":null,"abstract":"<div><p>Partial least squares (PLS) regression is a linear regression technique that performs well with high-dimensional regressors. Similar to many other supervised learning techniques, PLS is susceptible to the problem that the prediction and training data are drawn from different distributions, which deteriorates the PLS performance. To address this problem, an adaptive strategy via the minimum covariance determinant (MCD) estimator is proposed to improve the PLS model, which aims to find an appropriate training set for the adaptive construction of an accurate PLS model to fit the prediction data. In this study, an <span><math><mrow><mi>h</mi></mrow></math></span>-subset of the merged set of prediction and training data with the smallest covariance determinant is found via the MCD estimator, and the prediction and training data with Mahalanobis distances to the <span><math><mrow><mi>h</mi></mrow></math></span>-subset less than or equal to a cutoff that is the square root of a quantile of the chi-squared distribution are assumed to have the same distribution, then a PLS model is built on these training data. The proposed method is applied to three real-world datasets and compared with the results of classic PLS, the most significant improvement is obtained for the m5 prediction data in the corn dataset, where the root mean square error of prediction (RMSEP) is reduced from 0.149 to 0.023. For other datasets, our method can also perform better than PLS. The experimental results show the effectiveness of our method.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"249 ","pages":"Article 105120"},"PeriodicalIF":3.7000,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chemometrics and Intelligent Laboratory Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169743924000601","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Partial least squares (PLS) regression is a linear regression technique that performs well with high-dimensional regressors. Similar to many other supervised learning techniques, PLS is susceptible to the problem that the prediction and training data are drawn from different distributions, which deteriorates the PLS performance. To address this problem, an adaptive strategy via the minimum covariance determinant (MCD) estimator is proposed to improve the PLS model, which aims to find an appropriate training set for the adaptive construction of an accurate PLS model to fit the prediction data. In this study, an -subset of the merged set of prediction and training data with the smallest covariance determinant is found via the MCD estimator, and the prediction and training data with Mahalanobis distances to the -subset less than or equal to a cutoff that is the square root of a quantile of the chi-squared distribution are assumed to have the same distribution, then a PLS model is built on these training data. The proposed method is applied to three real-world datasets and compared with the results of classic PLS, the most significant improvement is obtained for the m5 prediction data in the corn dataset, where the root mean square error of prediction (RMSEP) is reduced from 0.149 to 0.023. For other datasets, our method can also perform better than PLS. The experimental results show the effectiveness of our method.
期刊介绍:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with International Committee of Medical Journal Editors'' Uniform requirements for manuscripts.