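Developing Quantitative Structure–Activity Relationship (QSAR) Models for Water Contaminants’ Activities/Properties by Fine-Tuning GPT-3 Models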

Impact factor: 8.8 | Zone 2, Environmental Science and Ecology | JCR Q1, Engineering, Environmental
Shifa Zhong and Xiaohong Guan*
Journal: Environmental Science & Technology Letters, vol. 10, no. 10, pp. 872–877
Published: 2023-09-20
DOI: 10.1021/acs.estlett.3c00599
URL: https://pubs.acs.org/doi/10.1021/acs.estlett.3c00599
Citations: 0

Abstract

In this study, we developed quantitative structure–activity relationship (QSAR) models for water contaminants’ activities/properties by fine-tuning GPT-3 models. We also proposed a novel masked atom importance (MAI) approach for model interpretation and an OpenAIEmbedding similarity-based method for determining the applicability domain. We utilized the Simplified Molecular-Input Line-Entry System (SMILES) of contaminants and their corresponding activities/properties from three data sets: pKd, Koc, and Solubility. These were used as input prompts and completions, respectively, to fine-tune four GPT-3 models (Davinci, Curie, Babbage, and Ada) obtained from OpenAI. The Babbage model demonstrated superior performance for the pKd data set, while the Davinci model excelled with the Koc and Solubility data sets, even outperforming molecular fingerprint (MF) CatBoost-based QSAR models. The MAI interpretation results were qualitatively consistent with the SHapley Additive exPlanations (SHAP) interpretation but exhibited less sensitivity in quantitative analysis. The OpenAIEmbedding similarity-based applicability domain determination approach showed efficacy comparable to that of the MF-based similarity approach but with added robustness. This study underscores the potential of large language models in developing QSAR models, paving the way for further advancements in QSAR modeling using state-of-the-art language models.
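
The two data-handling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' code, and it makes no API calls. It shows (1) how a SMILES string and its property value might be written as a prompt/completion record for fine-tuning, and (2) how an embedding-similarity check for the applicability domain might look. The separator " ->", the stop sequence, the number of decimals, the nearest-neighbor rule, and the 0.8 threshold are all assumptions chosen for illustration; the abstract does not state any of them. The embedding vectors are assumed to have been computed elsewhere.

```python
import json
import math

# Hypothetical formatting choices; the abstract does not specify them.
PROMPT_SEPARATOR = " ->"
COMPLETION_STOP = "\n"


def to_finetune_record(smiles: str, value: float, decimals: int = 2) -> dict:
    """Turn one (SMILES, property value) pair into a prompt/completion record."""
    return {
        "prompt": f"{smiles}{PROMPT_SEPARATOR}",
        "completion": f" {value:.{decimals}f}{COMPLETION_STOP}",
    }


def to_jsonl(pairs) -> str:
    """Serialize (SMILES, value) pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(to_finetune_record(s, v)) for s, v in pairs)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_applicability_domain(query_vec, train_vecs, threshold: float = 0.8) -> bool:
    """Assumed rule: a query is inside the domain when its most similar
    training embedding reaches the (hypothetical) similarity threshold."""
    return max(cosine_similarity(query_vec, t) for t in train_vecs) >= threshold


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured data.
    print(to_jsonl([("CCO", 1.5), ("c1ccccc1", -2.0)]))
    print(in_applicability_domain([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0]]))
```

Writing the target as fixed-precision text keeps every completion in the same format, which makes the model's output easy to parse back into a number. A nearest-neighbor rule is only one possible way to define the domain; an average over the k most similar training compounds would be an equally plausible variant.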

Source journal
Environmental Science & Technology Letters
Categories: Engineering, Environmental; Environmental Sciences
CiteScore: 17.90
Self-citation rate: 3.70%
Articles published: 163
Journal description: Environmental Science & Technology Letters serves as an international forum for brief communications on experimental or theoretical results of exceptional timeliness in all aspects of environmental science, both pure and applied. Published as soon as accepted, these communications are summarized in monthly issues. Additionally, the journal features short reviews on emerging topics in environmental science and technology.