Leveraging PLS and Lasso in MARS for high-dimensional FTIR data: A hybrid proposed model for antidiabetic activity of schiff base compounds
Authors: Sughra Sarwar, Tahir Mehmood, Muhammad Arfan
Journal: Chemometrics and Intelligent Laboratory Systems, volume 263, Article 105418 (published 2025-05-16)
DOI: 10.1016/j.chemolab.2025.105418
URL: https://www.sciencedirect.com/science/article/pii/S0169743925001030
In this study, we used Fourier Transform Infrared (FTIR) spectral data to build and compare several regression models for predicting the anti-diabetic potential of synthesized Schiff bases. Schiff bases are a broad class of compounds characterized by a double bond between a nitrogen atom and a carbon atom. Their versatility stems from the various strategies by which they can be coupled with multiple alkyl or aryl substituents. The models examined were multivariate adaptive regression splines (MARS), partial least squares (PLS), sparse PLS (SPLS), kernel PLS (KPLS), MARS-SPLS, MARS-Kernel-PLS, and a new method called MARS-PLS-Lasso, which combines the traditional MARS algorithm with partial least squares and Lasso regularization. To assess the proposed method, we used a high-dimensional spectral data set comprising 19 samples and 1627 predictors. To capture nonlinear interactions in the data, MARS-PLS-Lasso extends the conventional MARS approach by creating adaptive basis functions for each predictor. Lasso regularization was then used to select the most relevant basis functions, so that only the most important predictors were retained. Prediction performance was evaluated on the training and test sets using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). MARS-PLS-Lasso achieved the lowest test RMSE (13.00) and MAE (10.55), outperforming standard MARS (RMSE = 30.48, MAE = 23.46) and PLS (RMSE = 14.00, MAE = 11.90). In a simulation study, MARS-PLS-Lasso again performed best among the basis-integrated models on both low- and high-correlation data, with the lowest RMSE (0.4708) and MAE (0.2812) for data with dimensions 20, 50, and RMSE (0.685, 0.4806) and MAE (0.1325, 0.3819) for data with dimensions 20, 5000, respectively. These results indicate that MARS-PLS-Lasso improves predictive accuracy when modelling complicated relationships in high-dimensional data.
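The abstract names three ingredients that can be illustrated compactly: MARS-style basis functions per predictor, Lasso selection among those basis functions, and RMSE/MAE scoring. The following is a minimal sketch under stated assumptions, not the authors' implementation: the PLS step is omitted entirely, each predictor gets a single pair of hinge functions with the knot placed at its median (the abstract does not say how knots are chosen), the Lasso is solved by plain coordinate descent, and the data are synthetic. All function names (`hinge_pair`, `expand_basis`, `lasso_cd`, and so on) are hypothetical.

```python
import math
import random


def hinge_pair(x, knot):
    """MARS-style mirrored hinges: max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)


def expand_basis(X, knots):
    """Replace predictor j in every row by its two hinge features at knots[j]."""
    return [[h for j, v in enumerate(row) for h in hinge_pair(v, knots[j])]
            for row in X]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become zero."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def lasso_cd(B, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent, minimising
    (1 / 2n) * ||y - b0 - B w||^2 + lam * ||w||_1 (intercept b0 unpenalised)."""
    n, p = len(B), len(B[0])
    w = [0.0] * p
    b0 = sum(y) / n
    resid = [yi - b0 for yi in y]
    col_sq = [sum(B[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant-zero column carries no information
            # Covariance of column j with the residual that excludes column j.
            rho = sum(B[i][j] * (resid[i] + B[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= B[i][j] * delta
                w[j] = new_w
        # Re-centre the intercept on the current residual.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, w


def predict(B, b0, w):
    return [b0 + sum(bij * wj for bij, wj in zip(row, w)) for row in B]


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Synthetic illustration only: 19 samples as in the abstract, but 5 predictors
# instead of 1627, and a response that depends on one hinge of predictor 0.
random.seed(0)
n, p = 19, 5
X = [[random.random() for _ in range(p)] for _ in range(n)]
y = [3.0 * max(0.0, row[0] - 0.5) + random.gauss(0.0, 0.05) for row in X]

knots = [sorted(col)[n // 2] for col in zip(*X)]  # assumption: median knots
B = expand_basis(X, knots)
b0, w = lasso_cd(B, y, lam=0.01)
y_hat = predict(B, b0, w)

print("selected basis functions:", [j for j, wj in enumerate(w) if wj != 0.0])
print("train RMSE:", rmse(y, y_hat), "train MAE:", mae(y, y_hat))
```

The sketch scores only the training fit; the abstract reports both train and test errors, so a faithful comparison would hold out samples before fitting. Increasing `lam` drives more basis-function weights to exactly zero, which is the selection behaviour the abstract attributes to the Lasso step.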
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets for testing the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.