Mahia V. Solout, Somaye Vali Zade, Hamid Abdollahi, Jahan B. Ghasemi
Chemometrics and Intelligent Laboratory Systems, Volume 263, Article 105416. Published 2025-04-25. DOI: 10.1016/j.chemolab.2025.105416. Impact factor 3.7; JCR Q2 (Automation & Control Systems).
https://www.sciencedirect.com/science/article/pii/S0169743925001017
Enhanced data point importance for subset selection in partial least squares regression: A comparative study with Kennard-Stone method
In multivariate data analysis, the selection of representative subsets of samples is crucial for developing accurate predictive models. This study evaluates the application of the Enhanced Data Point Importance (EDPI) method for subset selection, comparing its performance with the widely used Kennard-Stone algorithm. The EDPI method ranks all the data points using DPI and a layered convex hull approach, producing a sequence of points ordered by their importance in the dataset, with the most important point being the most informative. Both methods were applied to two distinct datasets, and Partial Least Squares Regression (PLSR) models were developed for each subset to assess predictive performance. The EDPI method demonstrated performance comparable to that of the Kennard-Stone method across various sample sizes. The EDPI-PLS models achieved lower Root Mean Square Error of Prediction (RMSEP) values with fewer samples, indicating efficient subset selection, and the method is less inclined to select the influential points in the dataset. Moreover, the running time analysis highlighted the computational efficiency of the EDPI method, especially in high-dimensional datasets. These findings suggest that EDPI is a robust and informative strategy for sample subset selection, offering advantages in predictive accuracy and computational efficiency.
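For readers unfamiliar with the baseline, the following is a minimal sketch of the Kennard-Stone selection rule in its commonly described max-min form, together with the usual RMSEP formula. It is not the authors' code, and it does not attempt to reproduce EDPI, whose details the abstract does not give. The function names `kennard_stone` and `rmsep` are illustrative, and Euclidean distance on the raw variables is an assumption.

```python
import math


def kennard_stone(X, k):
    """Return indices of k rows of X chosen by the Kennard-Stone max-min rule.

    X is a list of equal-length numeric sequences (one per sample).
    """
    n = len(X)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(X)")

    # Step 1: start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: math.dist(X[pair[0]], X[pair[1]]),
    )
    selected = [first, second]

    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = [
        min(math.dist(X[i], X[first]), math.dist(X[i], X[second]))
        for i in range(n)
    ]

    # Step 2: repeatedly add the candidate farthest from the selected set.
    while len(selected) < k:
        remaining = [i for i in range(n) if i not in selected]
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(X[i], X[nxt]))
    return selected


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over paired reference/predicted values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    samples = [[0, 0], [1, 0], [10, 0], [5, 0], [4.9, 0]]
    print(kennard_stone(samples, 3))
    print(rmsep([0.0, 0.0], [1.0, 1.0]))
```

In a comparison like the one described, the selected indices would define the calibration subset for a PLSR model, and `rmsep` would be evaluated on the held-out samples. The initial pairwise search is quadratic in the number of samples, which is one reason running time matters when comparing subset selection methods.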
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.