Seeun Kim, Simaek Oh, Hyeonuk Woo, Jiho Sim, Chaok Seok, Hahnbeom Park
{"title":"Deep learning molecular interaction motifs from receptor structures alone","authors":"Seeun Kim, Simaek Oh, Hyeonuk Woo, Jiho Sim, Chaok Seok, Hahnbeom Park","doi":"10.1186/s13321-025-01055-8","DOIUrl":"https://doi.org/10.1186/s13321-025-01055-8","url":null,"abstract":"Interactions of proteins with other molecules are often mediated by a set of critical binding motifs on their surfaces. Most traditional binder designs relied on motifs borrowed from known binder molecules, which highly restricted their applicability to novel targets or new binding sites. This work presents a deep learning network MotifGen that predicts potential binder motifs directly from receptor structures without further supporting information. MotifGen generates motif profiles at the receptor surface for 14 types of functional groups or 6 chemical interaction classes. These profiles are highly human-interpretable and can be further utilized as pre-trained embedding inputs for versatile few-shot binder design applications. We demonstrate MotifGen's effectiveness through its applications to peptide binder design and small molecule binding site prediction, where it either surpassed existing methods or added significant value when integrated. Our motif-centric approach can offer a new design strategy for novel binder discovery for challenging receptor targets. We introduce a new deep-learning based computational strategy for identifying potential binder motifs given a receptor structure. These predicted binder motifs can be directly applied to the design of various drug types, including peptides and small molecules. 
To demonstrate its utility, we show its applications in peptide binder sequence discrimination and binding site prediction tasks, both of which are crucial tasks in structure-based drug design.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"27 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144747676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sean Current,Ziqi Chen,Daniel Adu-Ampratwum,Xia Ning,Srinivasan Parthasarathy
{"title":"DiffER : categorical diffusion ensembles for single-step chemical retrosynthesis.","authors":"Sean Current,Ziqi Chen,Daniel Adu-Ampratwum,Xia Ning,Srinivasan Parthasarathy","doi":"10.1186/s13321-025-01056-7","DOIUrl":"https://doi.org/10.1186/s13321-025-01056-7","url":null,"abstract":"Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose DiffER, an alternative template-free method for single-step retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We prove that DiffER is a strong baseline for a new class of template-free model and is capable of learning a variety of synthetic techniques used in laboratory settings.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"25 1","pages":"112"},"PeriodicalIF":8.6,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144737377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexis Delabrière, Coline Gianfrotta, Sylvain Dechaumet, Annelaure Damont, Thaïs Hautbergue, Pierrick Roger, Emilien L. Jamin, Olivier Puel, Christophe Junot, François Fenaille, Etienne A. Thévenot
{"title":"mineMS2: annotation of spectral libraries with exact fragmentation patterns","authors":"Alexis Delabrière, Coline Gianfrotta, Sylvain Dechaumet, Annelaure Damont, Thaïs Hautbergue, Pierrick Roger, Emilien L. Jamin, Olivier Puel, Christophe Junot, François Fenaille, Etienne A. Thévenot","doi":"10.1186/s13321-025-01051-y","DOIUrl":"https://doi.org/10.1186/s13321-025-01051-y","url":null,"abstract":"Identification is a major challenge in metabolomics due to the large structural diversity of metabolites. Tandem mass spectrometry is a reference technology for studying the fragmentation of molecules and characterizing their structure. Recent instruments can fragment large amounts of compounds in a single acquisition. The search for similarities within a collection of MS/MS spectra is a powerful approach to facilitate the identification of new metabolites. We propose an innovative de novo strategy for searching for exact fragmentation patterns within collections of MS/MS spectra. This approach is based on (i) a new representation of spectra as graphs of m/z differences, and (ii) an efficient frequent-subgraph mining algorithm. We demonstrate both on a spectral database from standards and on acquisitions in biological matrices that these new fragmentation patterns capture similarities that are not extracted by existing methods, and facilitate the structural interpretation of molecular network components and the elucidation of unknown spectra. The mineMS2 software is publicly available as an R package ( https://github.com/odisce/mineMS2 ). We present an innovative strategy for structural elucidation, which extracts exact fragmentation patterns of m/z differences within collections of MS/MS spectra. The algorithms are implemented in a software library enabling efficient mining of MS/MS data and coupling to molecular networks. 
We show on real datasets the specific value of the patterns as fragmentation graphs for structural interpretation and de novo identification, and their complementarity to existing approaches.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"285 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144694131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HERGAI: an artificial intelligence tool for structure-based prediction of hERG inhibitors","authors":"Viet-Khoa Tran-Nguyen, Ulrick Fineddie Randriharimanamizara, Olivier Taboureau","doi":"10.1186/s13321-025-01063-8","DOIUrl":"https://doi.org/10.1186/s13321-025-01063-8","url":null,"abstract":"The human Ether-à-go-go-Related Gene (hERG) potassium channel is crucial for repolarizing the cardiac action potential and regulating the heartbeat. Molecules that inhibit this protein can cause acquired long QT syndrome, increasing the risk of arrhythmias and sudden fatal cardiac arrests. Detecting compounds with potential hERG inhibitory activity is therefore essential to mitigate cardiotoxicity risks. In this article, we present a new hERG data set of unprecedented size, comprising nearly 300,000 molecules reported in PubChem and ChEMBL, approximately 2000 of which were confirmed hERG blockers identified through in vitro assays. Multiple structure-based artificial intelligence (AI) binary classifiers for predicting hERG inhibitors were developed, employing, as descriptors, protein–ligand extended connectivity (PLEC) fingerprints fed into random forest, extreme gradient boosting, and deep neural network (DNN) algorithms. Our best-performing model, a stacking ensemble classifier with a DNN meta-learner, achieved state-of-the-art classification performance, accurately identifying 86% of molecules having half-maximal inhibitory concentrations (IC50s) not exceeding 20 µM in our challenging test set, including 94% of hERG blockers whose IC50s were not greater than 1 µM. It also demonstrated superior screening power compared to virtual screening schemes that used existing scoring functions. This model, named “HERGAI,” along with relevant input/output data and user-friendly source code, is available in our GitHub repository ( https://github.com/vktrannguyen/HERGAI ) and can be used to predict drug-induced hERG blockade, even on large data sets. 
We present the largest and most complex hERG inhibition data set for AI research, integrating meticulously curated experimental data from PubChem and ChEMBL. This realistic and challenging data set enables the training and evaluation of advanced models for predicting hERG blockers. We also introduce “HERGAI,” a novel stacking ensemble classifier with strong classification and screening performance, leveraging state-of-the-art machine learning/deep learning techniques and incorporating PLEC fingerprints, for the first time, as descriptors of hERG-bound ligand conformations.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"38 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144694132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff
{"title":"The topology of molecular representations and its influence on machine learning performance","authors":"Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff","doi":"10.1186/s13321-025-01045-w","DOIUrl":"https://doi.org/10.1186/s13321-025-01045-w","url":null,"abstract":"Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations. Scientific contribution Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. 
In addition, we facilitate future research endeavors by providing open access to our developed model.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"10 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144678217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gintautas Kamuntavičius, Tanya Paquet, Orestis Bastas, Dainius Šalkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaišas, Roy Tal
{"title":"Benchmarking ML in ADMET predictions: the practical impact of feature representations in ligand-based models","authors":"Gintautas Kamuntavičius, Tanya Paquet, Orestis Bastas, Dainius Šalkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaišas, Roy Tal","doi":"10.1186/s13321-025-01041-0","DOIUrl":"https://doi.org/10.1186/s13321-025-01041-0","url":null,"abstract":"This study, focusing on predicting Absorption, Distribution, Metabolism, Excretion, and Toxicology (ADMET) properties, addresses the key challenges of ML models trained using ligand-based representations. We propose a structured approach to data feature selection, taking a step beyond the conventional practice of combining different representations without systematic reasoning. Additionally, we enhance model evaluation methods by integrating cross-validation with statistical hypothesis testing, adding a layer of reliability to the model assessments. Our final evaluations include a practical scenario, where models trained on one source of data are evaluated on a different one. This approach aims to bolster the reliability of ADMET predictions, providing more dependable and informative model evaluations. Scientific contribution This study provides a structured approach to feature selection. We improve model evaluation by combining cross-validation with statistical hypothesis testing, making results more reliable. The methodology used in our study can be generalized beyond feature selection, boosting the confidence in selected models, which is crucial in a noisy domain such as ADMET prediction. 
Additionally, we assess how well models trained on one dataset perform on another, offering practical insights for using external data in drug discovery.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"15 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144678216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jose Alberto Santiago-de-la-Cruz,Nadia Alejandra Rivero-Segura,Juan Carlos Gomez-Verjan
{"title":"Structure-based machine learning screening identifies natural product candidates as potential geroprotectors.","authors":"Jose Alberto Santiago-de-la-Cruz,Nadia Alejandra Rivero-Segura,Juan Carlos Gomez-Verjan","doi":"10.1186/s13321-025-01058-5","DOIUrl":"https://doi.org/10.1186/s13321-025-01058-5","url":null,"abstract":"Age-related diseases and syndromes result in poor quality of life and adverse outcomes, representing a challenge to healthcare systems worldwide. Several pharmacological interventions have been proposed to target the aging process to slow its adverse effects. The so-called geroprotectors have been proposed as novel molecules that could maintain the organism's homeostasis, targeting specific aspects linked to the hallmarks of aging and delaying the adverse outcomes associated with age. On the other hand, machine learning (ML) is revolutionising drug design by making the process faster, cheaper, and more efficient.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"10 1","pages":"106"},"PeriodicalIF":8.6,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144640267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaobo Lin, Zhaoqian Su, Yunchao Lance Liu, Jingxian Liu, Xiaohan Kuang, Peter T. Cummings, Jesse Spencer-Smith, Jens Meiler
{"title":"SuperMetal: a generative AI framework for rapid and precise metal ion location prediction in proteins","authors":"Xiaobo Lin, Zhaoqian Su, Yunchao Lance Liu, Jingxian Liu, Xiaohan Kuang, Peter T. Cummings, Jesse Spencer-Smith, Jens Meiler","doi":"10.1186/s13321-025-01038-9","DOIUrl":"https://doi.org/10.1186/s13321-025-01038-9","url":null,"abstract":"Metal ions, as abundant and vital cofactors in numerous proteins, are crucial for enzymatic activities and protein interactions. Given their pivotal role and catalytic efficiency, accurately and efficiently identifying metal-binding sites is fundamental to elucidating their biological functions and has significant implications for protein engineering and drug discovery. To address this challenge, we present SuperMetal, a generative AI framework that leverages a score-based diffusion model coupled with a confidence model to predict metal-binding sites in proteins with high precision and efficiency. Using zinc ions as an example, SuperMetal outperforms existing state-of-the-art models, achieving a precision of 94 % and coverage of 90 %, with zinc ion localization within 0.52 ± 0.55 Å of experimentally determined positions, thus marking a substantial advance in metal-binding site prediction. Furthermore, SuperMetal demonstrates rapid prediction capabilities (under 10 s for proteins with ~2000 residues) and remains minimally affected by increases in protein size. Notably, SuperMetal does not require prior knowledge of the number of metal ions—unlike AlphaFold 3, which depends on this information. Additionally, SuperMetal can be readily adapted to other metal ions or repurposed as a probe framework to identify other types of binding sites, such as protein-binding pockets. 
Scientific contribution SuperMetal introduces a diffusion-based, SE(3)-equivariant generative model that places metal ions in proteins with 94 % precision, 90 % coverage, and sub-ångström (0.52 Å) accuracy in under 10 s, surpassing current methods and accelerating metal-aware protein engineering and drug discovery.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"10 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144640396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Milestones in cheminformatics","authors":"Karina Martinez-Mayorga, José L. Medina-Franco","doi":"10.1186/s13321-025-01054-9","DOIUrl":"https://doi.org/10.1186/s13321-025-01054-9","url":null,"abstract":"<p>The field of cheminformatics has undergone significant transformation since its inception, evolving from a niche discipline to a cornerstone of modern medicinal chemistry, pharmaceutical research, and several other areas of chemistry [1,2,3]. To celebrate the 15th anniversary of the <i>Journal of Cheminformatics</i>, we present the special collection <i>Milestones in Cheminformatics</i>, https://www.biomedcentral.com/collections/MICHE, showcasing how cheminformatics has grown into a key discipline underpinning chemical research, innovation, and applications. The collection is intended to serve not only as a retrospective view but also as a platform to envision the future directions of cheminformatics.</p><p>This special issue brings together perspectives from renowned scholars and practitioners who highlight transformative developments across various domains of cheminformatics. Bajorath offers a global perspective on the trajectory of the field, setting the stage for future integration and growth. Willett revisits the early efforts in chemical database search. Reymond highlights the conceptual and practical implications of chemical space as a unifying theme, offering insights into its role in visualizing diversity and guiding discovery. Tropsha and colleagues present a critical analysis of current paradigms for assessing the accuracy of QSAR models, arguing for more nuanced and task-specific validation strategies. Steinbeck traces the journey from closed systems to collaborative innovation, while Williams and Richard propose three pillars for ensuring public access and data integrity in chemical databases. 
Bienstock discusses the impact and potential of AI/ML methods in designing new chemical entities, underscoring their growing role in predictive modeling and virtual screening. Rather than a closing section, Varnek et al. touch on an important aspect of the future by exploring achievements and challenges in higher education, emphasizing the need for structured cheminformatics curricula and interdisciplinary competencies.</p><p>Contributions in this collection illustrate the multidimensional character of cheminformatics—from its computational and theoretical foundations to its educational, ethical, and infrastructural components. The collection highlights both the progress achieved and the challenges that remain, such as harmonizing data standards, ensuring reproducibility, and fostering inclusive access to tools and knowledge [3], while also intersecting with disciplines like bioinformatics, materials science, and systems biology [4]. Transparency, collaboration, and interdisciplinary interactions are poised to become key drivers of future developments in the field. The rise of explainable artificial intelligence and sustainable data practices is likely to define the next era of cheminformatics.</p><p>The field of cheminformatics has been sculptured by many scientists—those who contributed to this special collection, those who were unable to ","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"7 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shih-Cheng Li, Pei-Hua Wang, Jheng-Wei Su, Wei-Yin Chiang, Tzu-Lan Yeh, Alex Zhavoronkov, Shih-Hsien Huang, Yen-Chu Lin, Chia-Ho Ou, Chih-Yu Chen
{"title":"Application of the digital annealer unit in optimizing chemical reaction conditions for enhanced production yields","authors":"Shih-Cheng Li, Pei-Hua Wang, Jheng-Wei Su, Wei-Yin Chiang, Tzu-Lan Yeh, Alex Zhavoronkov, Shih-Hsien Huang, Yen-Chu Lin, Chia-Ho Ou, Chih-Yu Chen","doi":"10.1186/s13321-025-01043-y","DOIUrl":"https://doi.org/10.1186/s13321-025-01043-y","url":null,"abstract":"Finding optimal reaction conditions is crucial for chemical synthesis in the pharmaceutical and chemical industries. However, due to the vast chemical space, conducting experiments for all the possible combinations is impractical. Thus, quantitative structure–activity relationship (QSAR) models have been widely used to predict product yields, but evaluating all combinations is still computationally intensive. In this work, we demonstrate that a Digital Annealer Unit (DAU) can tackle these large-scale optimization problems more efficiently. Two types of models are developed and tested on high-throughput experimentation (HTE) and Reaxys datasets. Our results suggest that the performance of the models is comparable to classical machine learning (ML) methods (i.e., Random Forest and Multilayer Perceptron (MLP)), while inference with our models takes only seconds on a DAU. In active learning and autonomous reaction condition design, our model shows improvement for reaction yield prediction by incorporating new data, meaning that it can potentially be used in iterative processes. Our method can also accelerate the screening of billions of reaction conditions, achieving speeds millions of times faster than traditional computing units in identifying superior conditions. This study demonstrates the application of DAUs to efficiently optimize chemical reaction conditions, leveraging quadratic unconstrained binary optimization (QUBO) models for accurate yield predictions. 
The QUBO-based approach exhibits comparable performance to classical machine learning methods while achieving inference times in seconds, significantly accelerating the screening of billions of reaction conditions. By integrating active learning and DAU technology, this research establishes a novel framework for reaction condition optimization, enabling innovative advancements in chemical synthesis.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"10 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}