Journal of Cheminformatics: Latest Articles

Visualising lead optimisation series using reduced graphs
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-24 DOI: 10.1186/s13321-025-01002-7
Jessica Stacey, Baptiste Canault, Stephen D. Pickett, Valerie J. Gillet
Abstract: The typical way in which lead optimisation (LO) series are represented in the medicinal chemistry literature is as Markush structures and associated R-group tables. The Markush structure shows a central core or molecular scaffold that is common to the series, with R groups that indicate the points of variability that have been explored in the series. The associated R-group table shows the substituent combinations that exist in individual molecules in the series together with properties of those compounds. This format provides an intuitive way of visualising any structure–activity relationship (SAR) that is present. Automated approaches that attempt to reproduce this well-understood format, such as the SAR map, are based on maximum common substructure approaches and do not take account of small changes that may be made to the core structure itself or of the situation where more than one core exists in the data. Here we describe an automated approach to represent LO series that is based on reduced graph descriptions of molecules. A publicly available LO dataset from a drug discovery programme at GSK is analysed to show how the method can group together compounds from the same series even when there are small substructural differences within the core of the series, while also being able to identify different related compound series. The resulting visualisation is useful in identifying areas where series are underexplored and for mapping design ideas onto the current dataset. The code to generate the visualisations is released into the public domain to promote further research in this area.
Scientific contribution: We describe a software tool for analysing lead optimisation series using reduced graph representations of molecules. The representation allows compounds that have similar but not identical chemical scaffolds to be grouped together and is, therefore, an advance on methods that are based on the more traditional Markush structure and SAR tables. The software is a useful addition to the med chem toolbox as it can provide a holistic view of lead optimisation data by representing what might otherwise be seen as separate series as a single series of compounds.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01002-7
Citations: 0
Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-23 DOI: 10.1186/s13321-025-00986-6
Muhammad Arslan Masood, Samuel Kaski, Tianyu Cui
Abstract: In drug discovery, prioritizing compounds for experimental testing is a critical task that can be optimized through active learning by strategically selecting informative molecules. Active learning typically trains models on labeled examples alone, while unlabeled data is only used for acquisition. This fully supervised approach neglects valuable information present in unlabeled molecular data, impairing both predictive performance and the molecule selection process. We address this limitation by integrating a transformer-based BERT model, pretrained on 1.26 million compounds, into the active learning pipeline. This effectively disentangles representation learning and uncertainty estimation, leading to more reliable molecule selection. Experiments on Tox21 and ClinTox datasets demonstrate that our approach achieves equivalent toxic compound identification with 50% fewer iterations compared to conventional active learning. Analysis reveals that pretrained BERT representations generate a structured embedding space enabling reliable uncertainty estimation despite limited labeled data, confirmed through Expected Calibration Error measurements. This work establishes that combining pretrained molecular representations with active learning significantly improves both model performance and acquisition efficiency in drug discovery, providing a scalable framework for compound prioritization.
We demonstrate that high-quality molecular representations fundamentally determine active learning success in drug discovery, outweighing acquisition strategy selection. We provide a framework that integrates pretrained transformer models with Bayesian active learning to separate representation learning from uncertainty estimation, a critical distinction in low-data scenarios. This approach establishes a foundation for more efficient screening workflows across diverse pharmaceutical applications.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00986-6
Citations: 0
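The abstract above cites Expected Calibration Error (ECE) as evidence that the uncertainty estimates are reliable. As a rough illustration only, the common equal-width binned form of ECE can be sketched as follows. The function name, the bin count, and the convention that each prediction carries a single confidence value in [0, 1] are assumptions made for this sketch; the paper's own evaluation code may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted confidence for each sample, each in [0, 1]
    correct:     truthy if the corresponding prediction was right
    n_bins:      number of equal-width bins over [0, 1] (an assumed default)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign every sample to a bin; a confidence of exactly 1.0 goes to the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(hit)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A lower value means the stated confidence tracks the observed accuracy more closely; a model that reports 0.9 confidence and is right three times out of four scores 0.15 on those samples.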
High-throughput screening data generation, scoring and FAIRification: a case study on nanomaterials
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-23 DOI: 10.1186/s13321-025-01001-8
Gergana Tancheva, Vesa Hongisto, Konrad Patyra, Luchesar Iliev, Nikolay Kochev, Penny Nymark, Pekka Kohonen, Nina Jeliazkova, Roland Grafström
Abstract: In vitro-based high-throughput screening (HTS) technology is applicable to hazard-based ranking and grouping of diverse agents, including nanomaterials (NMs). We present a standardized HTS-derived human cell-based testing protocol which combines the analysis of five assays into a broad toxic mode-of-action-based hazard value, termed Tox5-score. The overall protocol includes automated data FAIRification, preprocessing and score calculation. A newly developed Python module, ToxFAIRy, can be used independently or within an Orange Data Mining workflow that has custom widgets for fine-tuning, included in the custom-developed Orange add-on Orange3-ToxFAIRy. The data-handling workflow facilitates conversion of the FAIR HTS data into the NeXus format, which can integrate all data and metadata into a single file and multidimensional matrix amenable to interactive visualizations and selection of data subsets. The resulting FAIR HTS data includes both raw and interpreted data (scores) in machine-readable formats distributable as a data archive, including to the eNanoMapper database and Nanosafety Data Interface. Overall, we present an HTS-driven FAIRified computational assessment tool for hazard analysis of multiple agents simultaneously, with broad potential applicability across diverse scientific communities.
Scientific contribution: Our study represents significant tool development for analyzing multiple materials hazards rapidly and simultaneously, aligning with regulatory recommendations and addressing industry needs. The innovative integration of in vitro-based toxicity scoring with automated data preprocessing within FAIRification workflows enhances the applicability of HTS-derived data in the materials development community. The protocols described increase the effectiveness of materials toxicity testing and mode-of-action research by offering an alternative to manual data processing, enriching HTS data with metadata, and refining testing methodologies (such as bioactivity-based grouping), and overall demonstrate the value of reusing existing data.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01001-8
Citations: 0
GESim: ultrafast graph-based molecular similarity calculation via von Neumann graph entropy
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-22 DOI: 10.1186/s13321-025-01003-6
Hiroaki Shiokawa, Shoichi Ishida, Kei Terayama
Abstract: Representing molecules as graphs is a natural approach for capturing their structural information, with atoms depicted as nodes and bonds as edges. Although graph-based similarity calculation approaches, such as the graph edit distance, have been proposed for calculating molecular similarity, these approaches are nondeterministic polynomial (NP)-hard and thus computationally infeasible for routine use, unlike fingerprint-based methods. To address this limitation, we developed GESim, an ultrafast graph-based method for calculating molecular similarity on the basis of von Neumann graph entropy. GESim enables molecular similarity calculations by considering entire molecular graphs, and evaluations using two benchmarks for molecular similarity suggest that GESim can differentiate between highly similar molecules, even in cases where other methods fail to effectively distinguish their similarity. GESim is provided as an open-source package on GitHub at https://github.com/LazyShion/GESim.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01003-6
Citations: 0
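For readers unfamiliar with the quantity named in the GESim abstract above, the textbook definition of von Neumann graph entropy is reproduced here as background. This is the standard Laplacian-based definition, not a description of how GESim computes or approximates it; the symbols (A, D, L, ρ, λ_i) are chosen for this note.

```latex
% Graph with adjacency matrix A and diagonal degree matrix D (n nodes)
L = D - A, \qquad \rho = \frac{L}{\operatorname{tr}(L)}

% Von Neumann entropy of the density matrix \rho, with eigenvalues \lambda_1, \dots, \lambda_n
S(\rho) = -\operatorname{tr}(\rho \ln \rho) = -\sum_{i=1}^{n} \lambda_i \ln \lambda_i,
\qquad 0 \ln 0 := 0

% Worked example: the three-node path graph has Laplacian eigenvalues 0, 1, 3 and tr(L) = 4,
% so \rho has eigenvalues 0, 1/4, 3/4:
S = \tfrac{1}{4}\ln 4 + \tfrac{3}{4}\ln\tfrac{4}{3} \approx 0.562
```

Computing the eigenvalues exactly costs cubic time in the number of atoms, which is one reason fast methods typically rely on approximations of this sum.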
Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-21 DOI: 10.1186/s13321-025-01000-9
Nadin Ulrich, Karsten Voigt, Anton Kudria, Alexander Böhme, Ralf-Uwe Ebert
Abstract: Water solubility is a relevant physico-chemical property in environmental chemistry, toxicology, and drug design. Although water solubility is, alongside the octanol–water partition coefficient, melting point, and boiling point, a property with a large amount of available experimental data, there are still many compounds in the chemical universe for which information on water solubility is lacking. Thus, prediction tools with a broad application domain are needed to fill the corresponding data gaps. To this end, we developed a graph convolutional neural network model (GNN) to predict the water solubility in the form of log S_w based on a highly curated dataset of 9800 chemicals. We started our model development with a curation workflow of the AqSolDB data, ending with 7605 data points. We added 2195 chemicals with experimental data, which we found in the literature, to our dataset. In the final dataset, log S_w values range from −13.17 to 0.50. Higher values were excluded by a cut-off introduced to eliminate fully miscible chemicals. We developed a consensus GNN by a fivefold split of the corresponding training set (70% of the data) and validation set (20%) and used 10% as an independent test set for the evaluation of the performance of the different splits and the consensus model. By doing so, we achieved an r² of 0.901, a q² of 0.896, and an RMSE of 0.657 on our independently selected test set, which is close to the experimental error of 0.5 to 0.6 log units. We further provide information on the application domain and compare our performance to other existing prediction tools.
Scientific contribution: Based on a highly curated dataset, we developed a neural network to predict the water solubility of chemicals for a broad application domain. Data curation was done by us in a step-wise procedure, where we identified various errors in the experimental data. Based on an independent test set, we compare our prediction results to those of the available prediction models.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01000-9
Citations: 0
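The solubility abstract above reports a test-set RMSE and r² in log units. As a small sketch of how such figures are usually obtained from paired experimental and predicted log S_w values, the two functions below compute the root-mean-square error and the coefficient of determination. Treating r² as 1 − SS_res/SS_tot is an assumption here (some authors report the squared Pearson correlation instead), and the function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# The dataset sizes quoted in the abstract are consistent with each other:
# 7605 curated AqSolDB points plus 2195 literature points give 9800 chemicals.
assert 7605 + 2195 == 9800
```

For instance, with observed values [-3, -2, -1, 0] and predictions [-2.5, -2, -1, -0.5], the residual sum of squares is 0.5 and the total sum of squares is 5, giving r² = 0.9.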
Learning motif features and topological structure of molecules for metabolic pathway prediction
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-21 DOI: 10.1186/s13321-025-00994-6
Jianguo Hu, Yiqing Zhang, Jinxin Xie, Zhen Yuan, Zhangxiang Yin, Shanshan Shi, Honglin Li, Shiliang Li
Abstract: Metabolites serve as crucial biomarkers for assessing disease progression and understanding underlying pathogenic mechanisms. However, when the metabolic pathway category of metabolites is unknown, researchers face challenges in conducting metabolomic analyses. Due to the complexity of wet laboratory experimentation for pathway identification, there is a growing demand for predictive methods. Various computational approaches, including machine learning and graph neural networks, have been proposed; however, interpretability remains a challenge. We have developed a neural network framework called MotifMol3D, which is designed for predicting molecular metabolic pathway categories. This framework introduces motif information to mine local features of small-sample molecules, combining it with a graph neural network and 3D information to complete the prediction task. Using a dataset of 5,698 molecules that participate in 11 metabolic pathway categories in the KEGG database, MotifMol3D outperformed state-of-the-art methods in precision, recall, and F1 score. In addition, an ablation study and motif analysis have demonstrated the effectiveness and usefulness of the model. Motif analysis, in particular, has shown that motif information can characterize the main features of specific pathway molecules to a certain extent and enhance the interpretability of the model. An external validation further corroborates this observation. MotifMol3D is an open-source tool that is available at https://github.com/Irena-Zhang/MotifMol3D.git.
Scientific contribution: MotifMol3D integrates motif information, graph neural networks, and 3D structural data to enhance feature extraction for small-sample molecules, improving the precision and interpretability of metabolic pathway predictions. The model outperforms state-of-the-art approaches in precision, recall, and F1 score. This work reveals how motif information characterizes pathway-specific molecules, offering novel insights into molecular properties within metabolic pathways.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00994-6
Citations: 0
Activity cliff-aware reinforcement learning for de novo drug design
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-21 DOI: 10.1186/s13321-025-01006-3
Xiuyuan Hu, Guoqing Liu, Yang Zhao, Hao Zhang
Abstract: The integration of artificial intelligence (AI) in drug discovery offers promising opportunities to streamline and enhance the traditional drug development process. One core challenge in de novo molecular design is modeling complex structure-activity relationships (SAR), such as activity cliffs, where minor molecular changes yield significant shifts in biological activity. In response to the limitations of current models in capturing these critical discontinuities, we propose the Activity Cliff-Aware Reinforcement Learning (ACARL) framework. ACARL leverages a novel activity cliff index to identify and amplify activity cliff compounds, uniquely incorporating them into the reinforcement learning (RL) process through a tailored contrastive loss. This RL framework is designed to focus model optimization on high-impact regions within the SAR landscape, improving the generation of molecules with targeted properties. Experimental evaluations across multiple protein targets demonstrate ACARL’s superior performance in generating high-affinity molecules compared to existing state-of-the-art algorithms. These findings indicate that ACARL effectively integrates SAR principles into the RL-based drug design pipeline, offering a robust approach for de novo molecular design.
Scientific contribution: Our work introduces a machine learning-based drug design framework that explicitly models activity cliffs, a first in AI-driven molecular design. ACARL’s primary technical contributions include the formulation of an activity cliff index to detect these critical points, and a contrastive RL loss function that dynamically enhances the generation of activity cliff compounds, optimizing the model for high-impact SAR regions. This approach demonstrates the efficacy of combining domain knowledge with machine learning advances, significantly expanding the scope and reliability of AI in drug discovery.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01006-3
Citations: 0
The pucke.rs toolkit to facilitate sampling the conformational space of biomolecular monomers
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-17 DOI: 10.1186/s13321-025-00977-7
Jérôme Rihon, Sten Reynders, Vitor Bernardes Pinheiro, Eveline Lescrinier
Abstract: Understanding the structural and dynamic behaviour of molecules is a major objective in molecular modeling research. Sampling through the torsional space is an efficient way to map their behaviour. However, generating a landscape of possible conformations relies on multiple formalisms whose mathematics are often difficult to convert to code. Here we present a command line tool and a scripting module that provide the means to generate such landscapes with different axes according to the various formalisms used for conformational sampling. In addition to this toolkit, we report a benchmarking study in which a DNA nucleoside is subjected to a diverse set of quantum mechanical levels of theory for geometry optimisations and energy potential calculations. The potential of the tool is demonstrated on examples including amino acids and synthetic nucleosides having five-membered or six-membered sugar moieties.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00977-7
Citations: 0
Integrating QSAR modelling with reinforcement learning for Syk inhibitor discovery
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-15 DOI: 10.1186/s13321-025-00998-2
Maria Zavadskaya, Anastasia Orlova, Andrei Dmitrenko, Vladimir Vinogradov
Abstract: Spleen tyrosine kinase (Syk) is a crucial mediator of inflammatory processes and a promising therapeutic target for the management of autoimmune disorders, such as immune thrombocytopenia. While several Syk inhibitors are known to date, their efficacy and safety profiles remain suboptimal, necessitating the exploration of novel compounds. The study introduces a novel deep reinforcement learning strategy for drug discovery, specifically designed to identify new Syk inhibitors. The approach integrates quantitative structure–activity relationship (QSAR) predictions with generative modelling, employing a stacking-ensemble model that achieves a correlation coefficient of 0.78. From over 78,000 molecules generated by this methodology, we identified 139 promising candidates with high predicted potency, binding affinity and optimal drug-likeness properties, demonstrating structural novelty while maintaining essential Syk inhibitor characteristics. Our approach establishes a versatile framework for accelerated drug discovery, which is particularly valuable for the development of rare disease therapeutics.
Scientific contribution: The study presents the first application of QSAR-guided reinforcement learning for Syk inhibitor discovery, yielding structurally novel candidates with predicted high potency. The presented methodology can be adapted for other therapeutic targets, potentially accelerating the drug development process.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00998-2
Citations: 0
InertDB as a generative AI-expanded resource of biologically inactive small molecules from PubChem
IF 7.1 Zone 2 Chemistry
Journal of Cheminformatics Pub Date: 2025-04-10 DOI: 10.1186/s13321-025-00999-1
Seungchan An, Yeonjin Lee, Junpyo Gong, Seokyoung Hwang, In Guk Park, Jayhyun Cho, Min Ju Lee, Minkyu Kim, Yun Pyo Kang, Minsoo Noh
Abstract: The development of robust artificial intelligence (AI)-driven predictive models relies on high-quality, diverse chemical datasets. However, the scarcity of negative data and a publication bias toward positive results often hinder accurate biological activity prediction. To address this challenge, we introduce InertDB, a comprehensive database comprising 3,205 curated inactive compounds (CICs) identified through rigorous review of over 4.6 million compound records in PubChem. CIC selection prioritized bioassay diversity, determined using natural language processing (NLP)-based clustering metrics, while ensuring minimal biological activity across all evaluated bioassays. Notably, 97.2% of CICs adhere to the Rule of Five, a proportion significantly higher than that of the overall PubChem dataset. To further expand the chemical space, InertDB also features 64,368 generated inactive compounds (GICs) produced using a deep generative AI model trained on the CIC dataset. Compared to conventional approaches such as random sampling or property-matched decoys, InertDB significantly improves predictive AI performance, particularly for phenotypic activity prediction, by providing reliable inactive compound sets.
Scientific contributions: InertDB addresses a critical gap in AI-driven drug discovery by providing a comprehensive repository of biologically inactive compounds, effectively resolving the scarcity of negative data that limits prediction accuracy and model reliability. By leveraging language model-based bioassay diversity metrics and generative AI, InertDB integrates rigorously curated inactive compounds with an expanded chemical space. InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy of phenotypic pharmacological activity prediction.
Volume 17, issue 1. PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00999-1
Citations: 0
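The InertDB abstract above reports the share of compounds that adhere to the Rule of Five. As a hedged sketch of what such a check involves, the snippet below counts violations of the four classic Lipinski thresholds (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors) from precomputed descriptor values. The `Descriptors` class, the function names, and the default tolerance of one violation are assumptions for this sketch; the abstract does not say which tolerance or descriptor software the authors used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container for this sketch)."""
    mol_weight: float        # molecular weight in g/mol
    log_p: float             # octanol-water partition coefficient (log units)
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """True if the compound exceeds at most `max_violations` thresholds.

    The tolerance is a convention that varies between studies; pass 0 for a strict check.
    """
    return rule_of_five_violations(d) <= max_violations


def fraction_passing(compounds, max_violations: int = 1) -> float:
    """Share of compounds in an iterable that pass the check."""
    compounds = list(compounds)
    if not compounds:
        raise ValueError("at least one compound is required")
    return sum(passes_rule_of_five(c, max_violations) for c in compounds) / len(compounds)
```

The descriptor values themselves would normally come from a cheminformatics toolkit; they are taken as inputs here so the sketch stays dependency-free.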