Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics Pub Date : 2025-04-21 DOI:10.1186/s13321-025-01000-9

Nadin Ulrich, Karsten Voigt, Anton Kudria, Alexander Böhme, Ralf-Uwe Ebert

Citations: 0

Abstract

Water solubility is a relevant physicochemical property in environmental chemistry, toxicology, and drug design. Alongside the octanol–water partition coefficient, melting point, and boiling point, water solubility is one of the properties with a large amount of available experimental data, yet there are still many more compounds in the chemical universe for which this information is lacking. Thus, prediction tools with a broad application domain are needed to fill the corresponding data gaps. To this end, we developed a graph convolutional neural network model (GNN) to predict the water solubility in the form of log S_w based on a highly curated dataset of 9800 chemicals. We started our model development with a curation workflow of the AqSolDB data, ending with 7605 data points. We added 2195 chemicals with experimental data from the literature to our dataset. In the final dataset, log S_w values range from −13.17 to 0.50. Higher values were excluded by a cut-off introduced to eliminate fully miscible chemicals. We developed a consensus GNN from a fivefold split of the data into a training set (70% of the data) and a validation set (20%), and used the remaining 10% as an independent test set to evaluate the performance of the individual splits and the consensus model. By doing so, we achieved an r² of 0.901, a q² of 0.896, and an RMSE of 0.657 on our independently selected test set, which is close to the experimental error of 0.5 to 0.6 log units. We further provide information on the application domain and compare our performance to that of other existing prediction tools.
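The split sizes and the reported statistics can be illustrated with a short Python sketch. It is not the authors' code. The abstract does not define how the consensus is formed or how r² and q² are computed, so the sketch assumes that the consensus prediction is the mean over the five split models, that r² is the squared Pearson correlation between observed and predicted values, and that q² is one minus the ratio of the squared prediction error to the variance of the observed values. All function names and the toy numbers are hypothetical.

```python
import math
import random


def split_indices(n, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle 0..n-1 and cut it into train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * fractions[0])
    n_val = round(n * fractions[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def consensus(predictions_per_model):
    """Assumed consensus: average the models' predictions compound by compound."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def rmse(y_obs, y_pred):
    """Root mean squared error in log units."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))


def r_squared(y_obs, y_pred):
    """Assumed definition: squared Pearson correlation coefficient."""
    n = len(y_obs)
    mean_o, mean_p = sum(y_obs) / n, sum(y_pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    var_o = sum((o - mean_o) ** 2 for o in y_obs)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_o * var_p)


def q_squared(y_obs, y_pred):
    """Assumed definition: 1 - SSE/SST, with SST around the mean of y_obs."""
    mean_o = sum(y_obs) / len(y_obs)
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    sst = sum((o - mean_o) ** 2 for o in y_obs)
    return 1.0 - sse / sst


# 70/20/10 split of the 9800 chemicals named in the abstract.
train, val, test = split_indices(9800)
print(len(train), len(val), len(test))  # 6860 1960 980

# Toy log S_w values and predictions from five hypothetical models
# (illustrative numbers only, not data from the paper).
y_obs = [-1.2, -3.4, -5.0, -0.3]
preds = [
    [-1.0, -3.0, -5.5, -0.5],
    [-1.4, -3.6, -4.8, -0.1],
    [-1.1, -3.3, -5.2, -0.4],
    [-1.3, -3.9, -4.6, -0.2],
    [-0.9, -3.2, -5.1, -0.6],
]
y_cons = consensus(preds)
print(rmse(y_obs, y_cons), r_squared(y_obs, y_cons), q_squared(y_obs, y_cons))
```

Under these definitions r² measures only how well predictions and observations correlate, while q² also penalizes systematic offsets, which is one reason the two values can differ on the same test set.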

Scientific contribution

Based on a highly curated dataset, we developed a neural network to predict the water solubility of chemicals for a broad application domain. We curated the data in a step-wise procedure, in which we identified various errors in the experimental data. Based on an independent test set, we compare our prediction results to those of the available prediction models.
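The abstract names only one curation step concretely: the upper cut-off on log S_w that removes fully miscible chemicals. A minimal Python sketch of such a filter follows. It assumes the cut-off equals the reported upper end of the range (0.50) and is inclusive, which the abstract does not state explicitly. The record layout, the function name, and the compound names are hypothetical, and the other curation steps are not reproduced here.

```python
def apply_upper_cutoff(records, cutoff=0.50):
    """Keep records whose log S_w does not exceed the (assumed) cut-off."""
    return [r for r in records if r["log_sw"] <= cutoff]


# Hypothetical records; only the values -13.17 and 0.50 come from the
# abstract, where they are the reported ends of the log S_w range.
records = [
    {"name": "compound_a", "log_sw": -13.17},
    {"name": "compound_b", "log_sw": 0.50},
    {"name": "compound_c", "log_sw": 1.20},  # above the cut-off, dropped
]
kept = apply_upper_cutoff(records)
print([r["name"] for r in kept])  # ['compound_a', 'compound_b']

# Dataset composition reported in the abstract.
n_aqsoldb_curated = 7605
n_literature_added = 2195
assert n_aqsoldb_curated + n_literature_added == 9800
```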

Source journal

Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS

CiteScore

14.10

Self-citation rate

7.00%

Review time

3 months

Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.