Developing Quantitative Structure–Activity Relationship (QSAR) Models for Water Contaminants’ Activities/Properties by Fine-Tuning GPT-3 Models
Authors: Shifa Zhong and Xiaohong Guan*
Journal: Environmental Science & Technology Letters, vol. 10, no. 10, pp. 872–877 (Journal Article)
Published: 2023-09-20
DOI: 10.1021/acs.estlett.3c00599
URL: https://pubs.acs.org/doi/10.1021/acs.estlett.3c00599
Impact factor: 8.8; JCR: Q1 (Engineering, Environmental); CAS category: Environmental Science & Ecology, Region 2
Citations: 0
Abstract
In this study, we developed quantitative structure–activity relationship (QSAR) models for water contaminants’ activities/properties by fine-tuning GPT-3 models. We also proposed a novel masked atom importance (MAI) approach for model interpretation and an OpenAIEmbedding similarity-based method for determining the applicability domain. We used the Simplified Molecular-Input Line-Entry System (SMILES) strings of contaminants and their corresponding activities/properties from three data sets: pKd, Koc, and Solubility. The SMILES strings served as input prompts and the activities/properties as completions to fine-tune four GPT-3 models (Davinci, Curie, Babbage, and Ada) obtained from OpenAI. The Babbage model performed best on the pKd data set, while the Davinci model performed best on the Koc and Solubility data sets, even outperforming molecular fingerprint (MF) CatBoost-based QSAR models. The MAI interpretation results were qualitatively consistent with the SHapley Additive exPlanations (SHAP) interpretation but exhibited less sensitivity in quantitative analysis. The OpenAIEmbedding similarity-based applicability domain determination approach showed efficacy comparable to that of the MF-based similarity approach but with added robustness. This study underscores the potential of large language models in developing QSAR models, paving the way for further advancements in QSAR modeling using state-of-the-art language models.
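The two data-handling steps described in the abstract can be illustrated with a minimal Python sketch: pairing each SMILES string with its measured value as a prompt/completion record, and checking the applicability domain by embedding similarity. The abstract does not give the implementation, so the separator, stop token, number formatting, similarity threshold, and all function names below are assumptions made for illustration. The embeddings are taken as plain lists of floats that would be obtained elsewhere, and no OpenAI API call is made.

```python
import json
import math

# Assumed conventions; the abstract does not specify how prompts and
# completions were delimited or how values were formatted.
PROMPT_SEPARATOR = " ->"
STOP_TOKEN = "\n"


def to_finetune_record(smiles: str, value: float, decimals: int = 2) -> dict:
    """Pair one SMILES string (prompt) with its property value (completion)."""
    return {
        "prompt": f"{smiles}{PROMPT_SEPARATOR}",
        "completion": f" {value:.{decimals}f}{STOP_TOKEN}",
    }


def to_jsonl(pairs) -> str:
    """Serialize (smiles, value) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_finetune_record(s, v)) for s, v in pairs)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_applicability_domain(query, training, threshold: float = 0.8) -> bool:
    """Treat a query as in-domain if its nearest training embedding is
    at least `threshold` similar. The threshold value is hypothetical."""
    return max(cosine_similarity(query, t) for t in training) >= threshold


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the paper's data sets.
    print(to_jsonl([("CCO", 1.5), ("c1ccccc1", -2.0)]))
    print(in_applicability_domain([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]]))
```

A nearest-neighbour rule is only one way to turn similarities into an in-domain decision; averaging over the k most similar training compounds would be an equally plausible reading of "similarity-based".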
About the journal:
Environmental Science & Technology Letters serves as an international forum for brief communications on experimental or theoretical results of exceptional timeliness in all aspects of environmental science, both pure and applied. Published as soon as accepted, these communications are summarized in monthly issues. Additionally, the journal features short reviews on emerging topics in environmental science and technology.