Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset

IF 7.1 | Zone 2 (Chemistry) | JCR Q1 (CHEMISTRY, MULTIDISCIPLINARY)
Nadin Ulrich, Karsten Voigt, Anton Kudria, Alexander Böhme, Ralf-Uwe Ebert
Journal of Cheminformatics, volume 17, issue 1. Published 2025-04-21. DOI: 10.1186/s13321-025-01000-9
Full text: https://link.springer.com/article/10.1186/s13321-025-01000-9
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01000-9
Citations: 0

Abstract

Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset

Water solubility is a relevant physicochemical property in environmental chemistry, toxicology, and drug design. Alongside the octanol–water partition coefficient, melting point, and boiling point, it is one of the properties with a large amount of available experimental data, yet water solubility information is still lacking for many more compounds in the chemical universe. Thus, prediction tools with a broad application domain are needed to fill the corresponding data gaps. To this end, we developed a graph convolutional neural network model (GNN) to predict water solubility in the form of log Sw, based on a highly curated dataset of 9800 chemicals. We started model development with a curation workflow for the AqSolDB data, which ended with 7605 data points. We then added 2195 chemicals with experimental data found in the literature. In the final dataset, log Sw values range from −13.17 to 0.50; higher values were excluded by a cut-off introduced to eliminate fully miscible chemicals. We developed a consensus GNN from a fivefold split of the data into training set (70% of the data) and validation set (20%), and used the remaining 10% as an independent test set to evaluate the performance of the different splits and the consensus model. In this way, we achieved an r² of 0.901, a q² of 0.896, and an rmse of 0.657 on our independently selected test set, which is close to the experimental error of 0.5 to 0.6 log units. We further provide information on the application domain and compare our performance with other existing prediction tools.
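
The splitting scheme described in the abstract can be illustrated with a minimal sketch. It assumes that the 10% test set is held out once at random and that each of the five splits is a fresh random partition of the remaining data into training (70% of the total) and validation (20% of the total); the abstract does not state how the splits or the test set were actually drawn. The function names `make_splits` and `consensus_prediction` are hypothetical, and averaging is only one plausible way to form a consensus.

```python
import random
from statistics import mean


def make_splits(n_items, n_splits=5, seed=0):
    """Return (test_idx, [(train_idx, val_idx), ...]) for a 70/20/10 scheme.

    Assumption: random selection throughout; the paper's own procedure
    may differ (for example, a stratified or structure-based selection).
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    n_test = round(0.10 * n_items)
    n_val = round(0.20 * n_items)

    # Hold out the independent test set once; it is never used for training.
    test_idx = indices[:n_test]
    rest = indices[n_test:]

    splits = []
    for _ in range(n_splits):
        shuffled = rest[:]
        rng.shuffle(shuffled)
        val_idx = shuffled[:n_val]
        train_idx = shuffled[n_val:]
        splits.append((train_idx, val_idx))
    return test_idx, splits


def consensus_prediction(per_model_predictions):
    """Average the log Sw predictions of several models, compound by compound."""
    return [mean(values) for values in zip(*per_model_predictions)]


test_idx, splits = make_splits(9800)
```

With 9800 compounds this gives 980 test compounds and, in each of the five splits, 6860 training and 1960 validation compounds. One model would be trained per split, and `consensus_prediction` would then combine the five sets of test-set predictions.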

Scientific contribution: Based on a highly curated dataset, we developed a neural network to predict the water solubility of chemicals for a broad application domain. We curated the data in a stepwise procedure, in which we identified various errors in the experimental data. Using an independent test set, we compare our prediction results with those of the available prediction models.
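
Comparing models on an independent test set comes down to computing the same statistics for each set of predictions. The sketch below uses common textbook definitions, which are assumptions here because the abstract does not define them: rmse as the root mean squared error, r² as the squared Pearson correlation between observed and predicted values, and q² as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed test values.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error in log units."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def q_squared(observed, predicted):
    """Predictive squared correlation coefficient: 1 - SS_res / SS_tot.

    Assumption: SS_tot is taken around the mean of the observed test values;
    some definitions use the training-set mean instead.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up log Sw values for illustration only, not data from the paper.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]
scores = (rmse(observed, predicted),
          r_squared(observed, predicted),
          q_squared(observed, predicted))
```

Under these definitions q² penalizes systematic bias in the predictions, whereas r² does not, which is one reason both are often reported together.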

Source journal
Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)
CiteScore: 14.10
Self-citation rate: 7.00%
Number of articles published: 82
Review time: 3 months
About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.