Development of an Efficient and Generalized MTSCAM Model to Predict Liquid Chromatography Retention Times of Organic Compounds

IF 11 · Zone 1 (comprehensive journals) · Q1 Multidisciplinary

Research · Pub Date: 2025-02-07 · eCollection Date: 2025-01-01 · DOI: 10.34133/research.0607

Mengdie Fan, Chenhui Sang, Hua Li, Yue Wei, Bin Zhang, Yang Xing, Jing Zhang, Jie Yin, Wei An, Bing Shao


Citations: 0

Abstract



Accurate prediction of liquid chromatographic retention times is becoming increasingly important in nontargeted screening applications. Traditional approaches to determining retention times rely heavily on standard compounds; they are limited by the speed at which standards can be synthesized and manufactured, and they are time-consuming and labor-intensive. Recently, machine learning and artificial intelligence algorithms have been applied to retention time prediction and show unparalleled advantages over traditional experimental methods. However, existing retention time prediction methods usually suffer from a scarcity of comprehensive training datasets, sparsity of valid data, and a lack of classification within datasets, resulting in poor generalization and accuracy. In this study, a dataset of 10,905 compounds with their retention times was constructed. Next, an innovative classification system was implemented that assigns the 10,905 compounds to 141 classes in a 3-tier hierarchy based on functional group weighting. Then, data augmentation was performed within each category using simplified molecular input line entry system (SMILES) enumeration combined with structural similarity expansion. Finally, a novel and universal high-throughput retention time prediction model was established by training the optimal quantitative structure-retention relationship (QSRR) model for each category of compounds and, at prediction time, selecting the best-fitting model via discriminant analysis. The results demonstrate that this model achieves an R² of 0.98 and an average prediction error of 23 s, outperforming currently published models. This study provides a scientific basis for high-throughput, rapid prediction of unknown pollutants, data mining, and nontargeted screening.
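The per-category modelling and the two reported metrics can be illustrated with a minimal sketch. The abstract does not specify the descriptors, the QSRR model family, or how the average prediction error is computed, so everything below is an assumption made for illustration: a single hypothetical numeric descriptor, one least-squares line per class, class labels supplied directly (standing in for the discriminant-analysis selection step, which is not shown), and mean absolute error as the error metric. The toy numbers are invented and are not data from the study.

```python
from statistics import mean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(observed, predicted):
    """Average absolute deviation, in the same unit as the inputs (seconds here).

    Assumption: the abstract's "average prediction error" is treated as MAE.
    """
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def train_per_class(records):
    """Fit one model per class.

    records: iterable of (class_label, descriptor, retention_time_s).
    Returns {class_label: (slope, intercept)}.
    """
    grouped = {}
    for label, x, rt in records:
        xs, ys = grouped.setdefault(label, ([], []))
        xs.append(x)
        ys.append(rt)
    return {label: fit_line(xs, ys) for label, (xs, ys) in grouped.items()}


def predict(models, label, descriptor):
    """Predict a retention time using the model of the given class."""
    slope, intercept = models[label]
    return slope * descriptor + intercept


# Hypothetical toy data: (class, descriptor value, retention time in seconds).
TOY_RECORDS = [
    ("alcohol", 1.0, 120.0), ("alcohol", 2.0, 180.0), ("alcohol", 3.0, 240.0),
    ("amine", 1.0, 60.0), ("amine", 2.0, 90.0), ("amine", 3.0, 120.0),
]

models = train_per_class(TOY_RECORDS)
```

In this sketch a compound routed to the wrong class would be scored by the other class's line, which is why the model-selection step matters in a per-category scheme of this kind.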


Source journal: Research (Multidisciplinary)

CiteScore: 13.40

Self-citation rate: 3.60%

Articles published: not listed

Review time: 14 weeks

About the journal: Research serves as a global platform for academic exchange, collaboration, and technological advancements. The journal welcomes high-quality research contributions from any domain and from authors around the globe. Comprising fundamental research in the life and physical sciences, Research also highlights significant findings and issues in engineering and applied science. The journal features original research articles, reviews, perspectives, and editorials, fostering a diverse and dynamic scholarly environment.