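[Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities].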

Se pu = Chinese journal of chromatography. Pub Date: 2025-04-08. DOI: 10.3724/SP.J.1123.2024.07014

Qian-Yi Wang, Yong-le Zhu, Xue-Hua Li

Volume 43, Issue 4, pages 355-362. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11966378/pdf/

Citations: 0

Abstract


[Construction of a machine learning ensemble prediction model for gas chromatographic retention index on stationary phases with different polarities].


Gas chromatography is an analytical technique widely used to separate and identify compounds. The retention index (RI) plays a significant role in gas chromatography because it provides a standardized measure of a compound's retention behavior under specific conditions and is a powerful compound-identification tool, particularly for complex mixtures. Predicting RI values is therefore a meaningful objective, particularly across stationary phases of different polarity, because RI varies significantly from one polar stationary phase to another. To address this issue, we collected 4183 retention-index data points for 2499 compounds on eight types of stationary phase from the literature and databases, and developed a model for predicting gas-chromatographic RIs on stationary phases of varying polarity. The stationary phases were further classified into five categories based on their McReynolds constants: strongly polar, polar, medium polar, weakly polar, and non-polar. This classification allows the model to handle a wide range of polarities, which enhances its versatility and applicability to various analytical scenarios. The predictive model was constructed by combining two types of feature into a composite feature set. The 1D and 2D molecular-structural features of the compounds were first determined; these features capture the chemical and physical properties of the compounds, including their relative molecular masses, functional groups, and topological indices. These descriptors characterize the molecular properties that influence retention behavior. Stationary-phase polarity was then one-hot encoded, which converted the categorical polarity information into a format that machine-learning algorithms can use effectively.
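As a minimal sketch of the feature construction described here, the five polarity classes can be one-hot encoded and appended to a compound's descriptor vector. The class ordering, the function names, and the descriptor values below are illustrative assumptions and are not taken from the paper.

```python
# The five polarity classes named in the abstract; their ordering is an assumption.
POLARITY_CLASSES = [
    "strongly polar",
    "polar",
    "medium polar",
    "weakly polar",
    "non-polar",
]


def one_hot_polarity(label):
    """Return a 5-element 0/1 vector with a single 1 at the class position."""
    if label not in POLARITY_CLASSES:
        raise ValueError(f"unknown polarity class: {label!r}")
    return [1.0 if label == cls else 0.0 for cls in POLARITY_CLASSES]


def build_feature_row(descriptors, polarity):
    """Concatenate molecular descriptors with the one-hot polarity vector."""
    return list(descriptors) + one_hot_polarity(polarity)


# Hypothetical descriptor values (for example a molecular mass and a topological index).
row = build_feature_row([128.17, 2.45], "weakly polar")
print(row)
```

With this layout, the same compound measured on two stationary phases yields two rows that differ only in the one-hot columns, which is what lets a single model learn polarity-dependent retention.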
This encoding technique ensures that the model can distinguish the effects of different polarities on the retention behavior of the compounds. Nine algorithms were used to construct predictive machine-learning models, including linear regression, decision tree, random forest, support vector machine (SVM), k-nearest-neighbor (KNN), gradient-boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM). Voting regression was then used to build the best-performing ensemble learning model from the XGBoost and LightGBM algorithms. This ensemble model, which combines the strengths of its individual base models, achieved a training-set coefficient of determination (R²) of 0.99, a training-set root mean square error (RMSE) of 101.85, a test-set R² of 0.97, and a test-set RMSE of 107.44. Williams plots were used to characterize the applicability domain of the model; over 94% of the data lay within the domain, indicating broad applicability and high predictive confidence. The successful development of this predictive retention-index model represents a significant advance in the gas-chromatography field. By integrating machine-learning techniques with comprehensive chemical- and physical-property data, the model accurately predicts RI values across a wide range of polar stationary phases, and the ensemble is more robust and more predictive than the individual machine-learning models. The model has scientific significance and practical value for improving the efficiency and accuracy of target and non-target gas-chromatographic analyses.
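The voting-regression step and the two reported metrics can be illustrated with a short, dependency-free sketch. It assumes equal weights for the base models (the abstract does not state the weighting) and uses two hypothetical stand-in predictors in place of fitted XGBoost and LightGBM models; the data are made up for illustration only.

```python
import math


def voting_predict(models, X):
    """Average the predictions of all base models (equal weights assumed)."""
    per_model = [[model(x) for x in X] for model in models]
    return [sum(column) / len(models) for column in zip(*per_model)]


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical stand-ins for two fitted regressors; not the paper's models.
def model_a(x):
    return 100.0 * x[0] + 10.0


def model_b(x):
    return 100.0 * x[0] - 10.0


X = [[8.0], [9.0], [10.0]]          # made-up single-feature inputs
y = [800.0, 900.0, 1000.0]          # made-up retention indices
y_hat = voting_predict([model_a, model_b], X)
print(y_hat, rmse(y, y_hat), r2(y, y_hat))
```

In this toy case the two base models err in opposite directions, so their average cancels the errors; this is the intuition behind averaging ensembles. In practice, scikit-learn's `VotingRegressor` provides the same averaging over fitted estimators.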
