Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction

IF 7.6 | Zone 1, Chemistry | JCR Q1 | Chemistry, Multidisciplinary
Alba Lucia Montoya Arias, Adam Hogendorf, Steven Tingey, Aadarsh Kuberan, Lik Hang Yuen, Herwig Schüler, Raphael Franzini
Journal: Chemical Science. Publication date: 2025-05-09. Publication type: Journal Article. DOI: 10.1039/d5sc00844a (https://doi.org/10.1039/d5sc00844a)
Citations: 0

Abstract

Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction
DNA-encoded chemical libraries (DECLs) have become integral to early-stage drug discovery, yielding active compounds and extensive labeled datasets for machine learning (ML)-based prediction of bioactive molecules. However, the information content of DECL selection data remains scarcely explored. This study systematically investigates for the first time the prevalence of false negatives and the influence of the linker in DECL data. Using a focused DECL targeting the poly-(ADP-ribose) polymerases PARP1/2 and TNKS1/2 as a model system, we found that our DECL selections frequently miss active compounds, with numerous false negatives for each identified hit. The presence of the DNA-conjugation linker emerged as a factor contributing to the underdetection of active molecules. This bias toward false negatives compromises the predictive power of DECL data for prioritizing hits, anticipating target selectivity, and training ML models, as determined by analyzing the effects of undersampling and oversampling techniques in learning the PARP2 data. Conversely, the linker’s presence in DECLs offers advantages, such as enabling the identification of target-selective protein engagers, even when the underlying molecules themselves may not be selective. These findings highlight the challenges and opportunities of DECL data, emphasizing the need for best practices in data handling and ML model development in drug discovery.
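The abstract mentions undersampling and oversampling when learning the PARP2 data but does not say which variants were used. As a rough illustration only, the sketch below shows plain random undersampling and random oversampling on a synthetic, imbalanced hit/non-hit label set. The function names, the compound identifiers, and the 5-in-100 hit rate are all assumptions made for this example and do not come from the paper.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Randomly drop items until every class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = min(len(items) for items in by_class.values())
    out = []
    for y, items in by_class.items():
        out.extend((x, y) for x in rng.sample(items, target))
    rng.shuffle(out)
    return out


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate items until every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = group_by_class(samples, labels)
    target = max(len(items) for items in by_class.values())
    out = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((x, y) for x in items + extra)
    rng.shuffle(out)
    return out


# Synthetic toy data (hypothetical): 5 labeled hits among 100 compounds.
compounds = [f"cpd_{i}" for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

under = random_undersample(compounds, labels)
over = random_oversample(compounds, labels)

print(Counter(y for _, y in under))  # class counts after undersampling
print(Counter(y for _, y in over))   # class counts after oversampling
```

Both functions only change the class ratio seen during training. Neither one relabels anything, so an active compound recorded as a non-hit (a false negative) stays mislabeled after resampling.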
Source journal: Chemical Science (Chemistry, Multidisciplinary)
CiteScore: 14.40
Self-citation rate: 4.80%
Articles published: 1352
Review time: 2.1 months
About the journal: Chemical Science covers disciplines across the chemical sciences. It publishes ground-breaking research with significant implications for its field that also appeals to a wider audience in related areas. To be considered for publication, articles must present innovative and original advances and be understandable to scientists from diverse backgrounds. The journal generally does not publish highly specialized research.