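Purification of Pharmaceuticals via Retention Time Prediction: Leveraging Graph Isomorphism Networks, Limited Data, and Transfer Learning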

IF 2.8, CAS Zone 3 (Engineering Technology), JCR Q2, CHEMISTRY, ANALYTICAL

Journal of Separation Science, Pub Date: 2025-06-02, DOI: 10.1002/jssc.70178

Armen G. Beck, Rojan Shrestha, Jun Wang, Jonathan Fine, Erik L. Regalado, Kanaka Hettiarachchi, Katharine B. Williams, Edward C. Sherer, Pankaj Aggarwal


Citations: 0


Purification of Pharmaceuticals via Retention Time Prediction: Leveraging Graph Isomorphism Networks, Limited Data, and Transfer Learning

The design-make-test cycle for drug discovery is highly dependent on the purification of synthesized compounds. Prior to evaluation of suitability, ultrahigh-performance liquid chromatography is used for an initial standard analysis, in which retention times of analytes are measured with a shorter standard gradient method and used to select the appropriate gradients for a final purification method. To circumvent this preliminary screening experiment for small molecule libraries, retention time prediction has previously been achieved with commercial modeling methods. However, these retention time prediction models can have limited applicability when built from smaller datasets and are less effective when constructed from disparate data collected under differing chromatography conditions. Having thousands of measured retention times from high-throughput physicochemical screening, we sought to leverage these data to construct predictive models for a standard preliminary method, enabling high-throughput purification of macrocyclic peptide libraries. Using 4549 analytes and their retention times from high-throughput physicochemical screening, a structure-to-retention-time model was built using a graph isomorphism network, a form of artificial neural network architecture. Once fitted to the high-throughput screening data, the model was re-trained with standard gradient method data, a technique known as transfer learning. Through transfer learning, a training set of 80 analytes yielded a neural network model that, when evaluated against a test set of 24 analytes, displays high performance metrics, with a coefficient of determination (R²) of 0.82 and a mean average error of 0.088 min, or 1.26% of the gradient time. By comparison, the best commercial quantitative structure-retention relationship model performed poorly, with an R² of 0.11 and a mean average error of 0.202 min. This model has been deployed internally as a Dash app to help democratize the use of the developed models and is being used to select purification methods based on analyte structure.
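The abstract names a graph isomorphism network (GIN) but gives no architectural detail. In its standard published form, a GIN layer updates each atom's feature vector as h_v ← MLP((1 + ε)·h_v + Σ h_u) over the atom's neighbors u, and a sum over atoms gives a graph-level embedding. The following is a minimal pure-Python sketch of one such layer, not the authors' model: the three-atom graph, the one-hot atom features, and all weights are hypothetical values chosen for illustration.

```python
"""One GIN-style layer and a sum readout on a toy molecular graph (pure Python)."""
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def gin_layer(features: Dict[int, Vector], neighbors: Dict[int, List[int]],
              w1: Matrix, b1: Vector, w2: Matrix, b2: Vector,
              eps: float = 0.0) -> Dict[int, Vector]:
    """GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    updated = {}
    for node, h in features.items():
        agg = [(1.0 + eps) * v for v in h]
        for nb in neighbors[node]:
            agg = [a + f for a, f in zip(agg, features[nb])]
        hidden = relu(linear(agg, w1, b1))          # first MLP layer
        updated[node] = relu(linear(hidden, w2, b2))  # second MLP layer
    return updated


def readout(features: Dict[int, Vector]) -> Vector:
    """Sum pooling over atoms yields one embedding per molecule."""
    dim = len(next(iter(features.values())))
    return [sum(h[i] for h in features.values()) for i in range(dim)]


# Hypothetical toy graph: heavy atoms of ethanol, C(0)-C(1)-O(2).
# Features are one-hot [is_carbon, is_oxygen]; weights are arbitrary.
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W1, B1 = [[1.0, 0.5], [-1.0, 1.0]], [0.0, 0.5]
W2, B2 = [[1.0, 1.0], [0.5, -1.0]], [0.0, 0.0]

embedding = readout(gin_layer(features, neighbors, W1, B1, W2, B2))
print(embedding)  # [6.5, 2.5]
```

Because both the neighbor aggregation and the readout are sums, relabeling the atoms leaves the embedding unchanged. In a retention-time model, a regression head would map this embedding to minutes. Under transfer learning as the abstract describes it, weights fitted on the large screening dataset would serve as the starting point for re-training on the smaller standard-gradient dataset.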
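The reported metrics can be made concrete with a short calculation. The sketch below assumes that the abstract's "mean average error" means mean absolute error, which the abstract does not confirm. The four observed and predicted retention times are hypothetical and are not data from the paper. The last step uses only the two figures the abstract does quote (0.088 min and 1.26%) to back out the gradient length they imply.

```python
"""R^2, mean absolute error, and error as a share of gradient time (pure Python)."""
from typing import Sequence


def mean_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |observed - predicted|."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical retention times in minutes (NOT values from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
toy_mae = mean_absolute_error(observed, predicted)
toy_r2 = r_squared(observed, predicted)

# Figures quoted in the abstract: an error of 0.088 min equals 1.26% of the gradient.
reported_mae_min = 0.088
reported_fraction = 0.0126
implied_gradient_min = reported_mae_min / reported_fraction

print(f"toy MAE = {toy_mae:.3f} min, toy R^2 = {toy_r2:.3f}")
print(f"implied gradient time = {implied_gradient_min:.2f} min")
```

The two quoted figures imply a gradient of roughly 7 min. This is an inference from the arithmetic only, since the abstract does not state the gradient length.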


Source journal

Journal of Separation Science (Chemistry: Analytical Chemistry)

CiteScore: 6.30

Self-citation rate: 16.10%

Articles published: 408

Review time: 1.8 months

About the journal: The Journal of Separation Science (JSS) is the most comprehensive source in separation science, since it covers all areas of chromatographic and electrophoretic separation methods in theory and practice, both in the analytical and in the preparative mode, solid phase extraction, sample preparation, and related techniques. Manuscripts on methodological or instrumental developments, including detection aspects, in particular mass spectrometry, as well as on innovative applications will also be published. Manuscripts on hyphenation, automation, and miniaturization are particularly welcome. Pre- and post-separation facets of a total analysis may be covered as well as the underlying logic of the development or application of a method.