InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem

Impact factor 7.1; Zone 2 (Chemistry); JCR Q1 (Chemistry, Multidisciplinary)
Seungchan An, Yeonjin Lee, Junpyo Gong, Seokyoung Hwang, In Guk Park, Jayhyun Cho, Min Ju Lee, Minkyu Kim, Yun Pyo Kang, Minsoo Noh
Journal of Cheminformatics, volume 17, issue 1. Published 2025-04-10. DOI: 10.1186/s13321-025-00999-1
Article: https://link.springer.com/article/10.1186/s13321-025-00999-1
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00999-1

Abstract

The development of robust artificial intelligence (AI)-driven predictive models relies on high-quality, diverse chemical datasets. However, the scarcity of negative data and a publication bias toward positive results often hinder accurate biological activity prediction. To address this challenge, we introduce InertDB, a comprehensive database comprising 3,205 curated inactive compounds (CICs) identified through rigorous review of over 4.6 million compound records in PubChem. CIC selection prioritized bioassay diversity, determined using natural language processing (NLP)-based clustering metrics, while ensuring minimal biological activity across all evaluated bioassays. Notably, 97.2% of CICs adhere to the Rule of Five, a proportion significantly higher than that of the overall PubChem dataset. To further expand the chemical space, InertDB also features 64,368 generated inactive compounds (GICs) produced using a deep generative AI model trained on the CIC dataset. Compared with conventional approaches such as random sampling or property-matched decoys, InertDB significantly improves predictive AI performance, particularly for phenotypic activity prediction, by providing reliable inactive compound sets.
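The Rule of Five figure quoted above can be made concrete with a small sketch. The abstract does not say which descriptor software or pass criterion the authors used, so everything here is an assumption: the `Descriptors` container and function names are hypothetical, the thresholds are the standard Lipinski limits (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors), and "adheres" is taken to mean at most one violation. Descriptor values are assumed to be precomputed elsewhere.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Descriptors:
    """Precomputed properties for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four standard Lipinski limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro5(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed criterion: a compound adheres if it has at most one violation."""
    return ro5_violations(d) <= max_violations


def ro5_fraction(compounds: Iterable[Descriptors], max_violations: int = 1) -> float:
    """Fraction of compounds in a set that adhere to the Rule of Five."""
    items = list(compounds)
    if not items:
        raise ValueError("compound set is empty")
    return sum(passes_ro5(d, max_violations) for d in items) / len(items)
```

Applied to descriptor records for a compound set, `ro5_fraction` returns the kind of proportion the abstract reports; a stricter reading of the rule can be tested by passing `max_violations=0`.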

Scientific contributions

InertDB addresses a critical gap in AI-driven drug discovery by providing a comprehensive repository of biologically inactive compounds, effectively resolving the scarcity of negative data that limits prediction accuracy and model reliability. By leveraging language model-based bioassay diversity metrics and generative AI, InertDB integrates rigorously curated inactive compounds with an expanded chemical space. InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy of phenotypic pharmacological activity prediction.
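As an illustration of using an inactive-compound set in place of randomly sampled negatives, the sketch below assembles a labelled training set from a list of known actives and a list of inert compounds. It is not the authors' pipeline: the function name, arguments, and the plain-string SMILES representation are all hypothetical, and duplicates are detected by exact string match, which assumes both lists were canonicalized the same way beforehand.

```python
import random
from typing import List, Optional, Sequence, Tuple


def build_training_set(
    actives: Sequence[str],
    inert: Sequence[str],
    n_negatives: Optional[int] = None,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Return (smiles, label) pairs: 1 for actives, 0 for inert compounds.

    Inert entries that also appear among the actives are dropped so that
    no structure carries both labels. Matching is by exact string, which
    assumes identically canonicalized SMILES (an assumption of this sketch).
    """
    active_set = set(actives)
    negatives = sorted(set(inert) - active_set)

    # Optionally subsample the negatives, reproducibly, to control class balance.
    if n_negatives is not None and n_negatives < len(negatives):
        negatives = random.Random(seed).sample(negatives, n_negatives)

    positives = [(smiles, 1) for smiles in sorted(active_set)]
    return positives + [(smiles, 0) for smiles in negatives]
```

Sorting before sampling keeps the result reproducible for a given seed regardless of input order; the resulting pairs can then be featurized and passed to whatever classifier is being trained.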

Source journal: Journal of Cheminformatics
Categories: Chemistry, Multidisciplinary; Computer Science, Information Systems
CiteScore: 14.10
Self-citation rate: 7.00%
Articles published: 82
Review time: 3 months
About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.