Experimental Uncertainty in Training Data for Protein-Ligand Binding Affinity Prediction Models

Artificial intelligence in the life sciences. Pub Date: 2023-10-04. DOI: 10.1016/j.ailsci.2023.100087

Carlos A. Hernández-Garrido, Norberto Sánchez-Cruz

Volume 4, Article 100087. Full text: https://www.sciencedirect.com/science/article/pii/S2667318523000314

Abstract

The accuracy of machine learning models for protein-ligand binding affinity prediction depends on the quality of the experimental data they are trained on. Most of these models are trained and tested on different subsets of the PDBbind database, the main public-domain source of protein-ligand complexes with annotated binding affinities. However, estimating the experimental uncertainty of PDBbind is not straightforward, because only a few protein-ligand complexes have more than one associated measurement. In this work, we analyze bioactivity data from ChEMBL to estimate the experimental uncertainty associated with the three binding affinity measures included in PDBbind (K_i, K_d, and IC_50), as well as the effect of combining them. The experimental uncertainty of combining these three affinity measures was characterized by a mean absolute error of 0.78 logarithmic units, a root mean square error of 1.04 logarithmic units, and a Pearson correlation coefficient of 0.76. These estimates were contrasted with the performance obtained by state-of-the-art machine learning models for binding affinity prediction, showing that these models tend to be overoptimistic when evaluated on the PDBbind core set.
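
As a minimal sketch of how the three statistics quoted in the abstract (mean absolute error, root mean square error, and Pearson correlation coefficient) can be computed from pairs of repeated measurements expressed in logarithmic units, consider the Python example below. The function names and the example values are hypothetical and are not taken from the paper; the way the authors selected and paired ChEMBL records is not described in this abstract, so this illustrates only the arithmetic, not their pipeline.

```python
import math


def to_p_value(molar_concentration):
    """Convert an affinity in mol/L (K_i, K_d, or IC_50) to -log10 units."""
    return -math.log10(molar_concentration)


def paired_uncertainty(pairs):
    """Return (MAE, RMSE, Pearson r) for (first, second) measurement pairs.

    Each pair holds two independent measurements of the same
    protein-ligand system, already converted to logarithmic units.
    """
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    diffs = [x - y for x, y in pairs]

    # Mean absolute error and root mean square error of the differences.
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    # Pearson correlation coefficient between the two measurement series.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    pearson = cov / math.sqrt(var_x * var_y)

    return mae, rmse, pearson


# Hypothetical repeated measurements (illustrative values only).
example_pairs = [(6.0, 7.0), (7.0, 7.0), (8.0, 9.0), (9.0, 8.0)]
mae, rmse, pearson = paired_uncertainty(example_pairs)
print(f"MAE={mae:.2f} RMSE={rmse:.2f} Pearson r={pearson:.2f}")
```

Comparing a model's test-set error against statistics of this kind gives a rough sense of whether the reported performance is plausible given the noise in the labels themselves.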

Source journal

Artificial intelligence in the life sciences
Subject areas: Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)

CiteScore: 5.00
Self-citation rate: 0.00%
Review time: 15 days