Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction

Impact factor: 7.6 | CAS Zone 1 (Chemistry) | JCR Q1 | Category: Chemistry, Multidisciplinary

Chemical Science | Pub Date: 2025-05-09 | DOI: 10.1039/d5sc00844a

Alba Lucia Montoya Arias, Adam Hogendorf, Steven Tingey, Aadarsh Kuberan, Lik Hang Yuen, Herwig Schüler, Raphael Franzini


Citations: 0

Abstract


Widespread False Negatives in DNA-Encoded Library Data: How Linker Effects Impair Machine Learning-Based Lead Prediction

DNA-encoded chemical libraries (DECLs) have become integral to early-stage drug discovery, yielding active compounds and extensive labeled datasets for machine learning (ML)-based prediction of bioactive molecules. However, the information content of DECL selection data remains scarcely explored. This study systematically investigates for the first time the prevalence of false negatives and the influence of the linker in DECL data. Using a focused DECL targeting the poly-(ADP-ribose) polymerases PARP1/2 and TNKS1/2 as a model system, we found that our DECL selections frequently miss active compounds, with numerous false negatives for each identified hit. The presence of the DNA-conjugation linker emerged as a factor contributing to the underdetection of active molecules. This bias toward false negatives compromises the predictive power of DECL data for prioritizing hits, anticipating target selectivity, and training ML models, as determined by analyzing the effects of undersampling and oversampling techniques in learning the PARP2 data. Conversely, the linker’s presence in DECLs offers advantages, such as enabling the identification of target-selective protein engagers, even when the underlying molecules themselves may not be selective. These findings highlight the challenges and opportunities of DECL data, emphasizing the need for best practices in data handling and ML model development in drug discovery.
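The abstract refers to undersampling and oversampling without naming the specific methods used. As a minimal sketch only, assuming plain random resampling (not necessarily what the authors applied), the two ideas can be illustrated on a hypothetical, imbalanced set of hit/non-hit labels. The compound IDs, labels, and function names below are illustrative assumptions, not data or code from the paper.

```python
import random
from collections import Counter


def _group_by_label(records, label_of):
    """Collect records into lists keyed by their class label."""
    groups = {}
    for rec in records:
        groups.setdefault(label_of(rec), []).append(rec)
    return groups


def random_undersample(records, label_of, seed=0):
    """Randomly drop records until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))  # sample without replacement
    rng.shuffle(balanced)
    return balanced


def random_oversample(records, label_of, seed=0):
    """Randomly duplicate records until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(records, label_of)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing records with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 3 selection hits (label 1) and 12 non-hits (label 0).
toy = [(f"cpd_{i}", 1) for i in range(3)] + [(f"cpd_{i}", 0) for i in range(3, 15)]


def label_of(record):
    """Return the class label stored in the second field of a record."""
    return record[1]


under = random_undersample(toy, label_of)
over = random_oversample(toy, label_of)
print(Counter(label_of(r) for r in under))
print(Counter(label_of(r) for r in over))
```

Resampling of this kind only changes how often each labeled example is seen during training. It does not change the labels themselves, so an active compound recorded as a non-hit stays a negative example in either scheme.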


Source journal

Chemical Science (Chemistry, Multidisciplinary)

CiteScore: 14.40
Self-citation rate: 4.80%
Articles published: 1352
Review time: 2.1 months

About the journal: Chemical Science is a journal that spans various disciplines within the chemical sciences. It publishes ground-breaking research that has significant implications for its own field and also appeals to a wider audience in related areas. To be considered for publication, articles must present innovative and original advances in their field of study in a manner that is understandable to scientists from diverse backgrounds. The journal generally does not publish highly specialized research.