Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set

IF 5.4 3区 材料科学 Q2 CHEMISTRY, PHYSICAL
Jiaxi Zhao, Eline Hermans, Kia Sepassi, Christophe Tistaert, Christel A. S. Bergström, Mazen Ahmad, Per Larsson
{"title":"Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set","authors":"Jiaxi Zhao, Eline Hermans, Kia Sepassi, Christophe Tistaert, Christel A. S. Bergström, Mazen Ahmad, Per Larsson","doi":"10.1021/acs.molpharmaceut.4c00685","DOIUrl":null,"url":null,"abstract":"Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for the estimation of solubility for novel chemical space is limited. To investigate possible reasons and remedies for this, the Johnson and Johnson in-house aqueous solubility data with over 40,000 compounds was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated by making use of three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log <i>D</i> to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET predictor, or Mordred, and the performances were evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that with the same data set size, high-quality data leads to better model performance; however, also, models trained with larger data sets containing analytical variability can give equally accurate estimations compared to models trained with small, clean, and diverse data sets. However, noise introduced by including the presence of amorphous solid postsolubility measurement in the training data set cannot be overcome by increasing data size, as they are introducing a biased systematic positive error in the data set, confirming the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and log <i>S</i> ± 0.5 of 46 and 48%, respectively. These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.","PeriodicalId":4,"journal":{"name":"ACS Applied Energy Materials","volume":null,"pages":null},"PeriodicalIF":5.4000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Energy Materials","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1021/acs.molpharmaceut.4c00685","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CHEMISTRY, PHYSICAL","Score":null,"Total":0}
引用次数: 0

Abstract

Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for the estimation of solubility for novel chemical space is limited. To investigate possible reasons and remedies for this, the Johnson and Johnson in-house aqueous solubility data with over 40,000 compounds was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated by making use of three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET predictor, or Mordred, and the performances were evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that with the same data set size, high-quality data leads to better model performance; however, also, models trained with larger data sets containing analytical variability can give equally accurate estimations compared to models trained with small, clean, and diverse data sets. However, noise introduced by including the presence of amorphous solid postsolubility measurement in the training data set cannot be overcome by increasing data size, as they are introducing a biased systematic positive error in the data set, confirming the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and log S ± 0.5 of 46 and 48%, respectively. These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.

Abstract Image

数据质量和数据数量对估算内在溶解度的影响:基于单一来源数据集的分析
水溶性是药物分子最重要的理化性质之一,也是口服药物吸收的主要驱动力。迄今为止,用于估算新型化学空间溶解度的硅学模型性能有限。为了研究造成这种情况的可能原因和补救措施,我们利用了强生公司内部超过 40,000 种化合物的水溶性数据。所有数据均通过相同的高通量试验生成,为探索数据质量、数量和模型估算之间的关系提供了独特的机会。利用三种不同的方法生成了六组具有不同大小和噪声水平的本征溶解度数据集:(i) 加入或排除无定形固体残留物;(ii) 用测量或实验对数 D 来确定本征溶解度;(iii) 在数据处理工作流程中采用或省略质量检查过程。使用从 RDKit、ADMET 预测器或 Mordred 计算出的三组不同描述符对数据集进行了随机森林回归训练,并通过嵌套交叉验证和 10 个精炼测试集对其性能进行了评估。这些模型证实,正如预期的那样,在数据集大小相同的情况下,高质量的数据能带来更好的模型性能;不过,与使用小型、干净和多样化数据集训练的模型相比,使用含有分析变异的较大数据集训练的模型也能提供同样准确的估计。然而,在训练数据集中加入非晶态固体溶解后测量所带来的噪声并不能通过增加数据量来克服,因为它们在数据集中引入了有偏差的系统性正误差,这证实了关键数据审查的重要性。最后,在第二次溶解度挑战赛的第一个测试集上测试了两个表现最佳的模型,其 RMSE 值分别为 0.74 和 0.72,log S ± 0.5 分别为 46% 和 48%。这些结果表明,与竞赛结果中报告的结果相比,这些模型的性能有所提高,突出表明单一来源的数据集可以提高内在溶解度的预测能力。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
ACS Applied Energy Materials
ACS Applied Energy Materials Materials Science-Materials Chemistry
CiteScore
10.30
自引率
6.20%
发文量
1368
期刊介绍: ACS Applied Energy Materials is an interdisciplinary journal publishing original research covering all aspects of materials, engineering, chemistry, physics and biology relevant to energy conversion and storage. The journal is devoted to reports of new and original experimental and theoretical research of an applied nature that integrate knowledge in the areas of materials, engineering, physics, bioscience, and chemistry into important energy applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信