{"title":"Neural SHAKE: geometric constraints in neural differential equations","authors":"Justin S. Diamond, Markus A. Lill","doi":"10.1186/s13321-025-01053-w","DOIUrl":"10.1186/s13321-025-01053-w","url":null,"abstract":"<p>Generating accurate molecular conformations hinges on sampling effectively from a high-dimensional space of atomic arrangements, which grows exponentially with system size. To ensure physically valid geometries and increase the likelihood of reaching low-energy conformations, it is useful to incorporate prior physics-based information by recasting it as geometric constraints that naturally arise as nonlinear constraint satisfaction problems. In this work, we propose an approach to embed these strict constraints into neural differential equations, leveraging the denoising diffusion framework. By projecting the stochastic generative dynamics onto a manifold defined by constraint sets, our method enforces exact feasibility at each step, unlike alternative approaches that merely impose soft constraints through probabilistic guidance. This technique generates lower-energy molecular conformations, enables more efficient subspace exploration, and formally subsumes classifier-guidance-type methods by treating geometric constraints as strict algebraic conditions within the diffusion process.</p><p>Neural SHAKE formulates exact manifold‑projected score‑based diffusion: each reverse-SDE increment is orthogonally projected, via a Lagrange-multiplier solve, onto the constraint surface σₐ(x) = 0 for a = 1, …, A, with A the number of independent constraints and thus the manifold’s codimension. This projection preserves global SE(3) symmetry and enforces constraints to solver tolerance. 
It induces a well-posed surface Fokker–Planck flow on the (3N − A)-dimensional manifold, while a coarea/Fixman Jacobian carries the ambient 3N-dimensional density to a normalized density on that manifold, preserving probability mass after the dimensionality reduction.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01053-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144777973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to crack a SMILES: automatic crosschecked chemical structure resolution across multiple services using MoleculeResolver","authors":"Simon Müller","doi":"10.1186/s13321-025-01064-7","DOIUrl":"10.1186/s13321-025-01064-7","url":null,"abstract":"<p>Accurate chemical structure resolution from textual identifiers such as names and CAS RN® is critical for computational modeling in chemistry and related fields. This paper introduces MoleculeResolver, an automated, robust Python-based tool designed to address inconsistencies and inaccuracies commonly encountered when converting chemical identifiers to canonical SMILES strings. MoleculeResolver systematically crosschecks structures retrieved from multiple reputable chemical databases, implements rigorous identifier plausibility checks, standardizes molecular structures, and intelligently selects the most accurate representation based on a unique resolution algorithm.</p><p> Benchmarks across diverse datasets confirm that MoleculeResolver significantly enhances precision, recall, and overall reliability compared to traditional single-source methods, proving its utility as a valuable resource for chemists, data scientists, and researchers engaged in high-quality molecular data analysis and predictive model development.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01064-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144777999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning molecular interaction motifs from receptor structures alone","authors":"Seeun Kim, Simaek Oh, Hyeonuk Woo, Jiho Sim, Chaok Seok, Hahnbeom Park","doi":"10.1186/s13321-025-01055-8","DOIUrl":"10.1186/s13321-025-01055-8","url":null,"abstract":"<div><p>Interactions of proteins with other molecules are often mediated by a set of critical binding motifs on their surfaces. Most traditional binder designs relied on motifs borrowed from known binder molecules, which highly restricted their applicability to novel targets or new binding sites. This work presents a deep learning network MotifGen that predicts potential binder motifs directly from receptor structures without further supporting information. MotifGen generates motif profiles at the receptor surface for 14 types of functional groups or 6 chemical interaction classes. These profiles are highly human-interpretable and can be further utilized as pre-trained embedding inputs for versatile few-shot binder design applications. We demonstrate MotifGen's effectiveness through its applications to peptide binder design and small molecule binding site prediction, where it either surpassed existing methods or added significant value when integrated. 
Our motif-centric approach can offer a new design strategy for novel binder discovery for challenging receptor targets.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01055-8","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144747676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiffER: categorical diffusion ensembles for single-step chemical retrosynthesis","authors":"Sean Current, Ziqi Chen, Daniel Adu-Ampratwum, Xia Ning, Srinivasan Parthasarathy","doi":"10.1186/s13321-025-01056-7","DOIUrl":"10.1186/s13321-025-01056-7","url":null,"abstract":"<div><p>Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose <span>DiffER</span>, an alternative template-free method for single-step retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We prove that <span>DiffER</span> is a strong baseline for a new class of template-free model and is capable of learning a variety of synthetic techniques used in laboratory settings.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01056-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144737377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"mineMS2: annotation of spectral libraries with exact fragmentation patterns","authors":"Alexis Delabrière, Coline Gianfrotta, Sylvain Dechaumet, Annelaure Damont, Thaïs Hautbergue, Pierrick Roger, Emilien L. Jamin, Olivier Puel, Christophe Junot, François Fenaille, Etienne A. Thévenot","doi":"10.1186/s13321-025-01051-y","DOIUrl":"10.1186/s13321-025-01051-y","url":null,"abstract":"<p>Identification is a major challenge in metabolomics due to the large structural diversity of metabolites. Tandem mass spectrometry is a reference technology for studying the fragmentation of molecules and characterizing their structure. Recent instruments can fragment large amounts of compounds in a single acquisition. The search for similarities within a collection of MS/MS spectra is a powerful approach to facilitate the identification of new metabolites. We propose an innovative <i>de novo</i> strategy for searching for exact fragmentation patterns within collections of MS/MS spectra. This approach is based on (i) a new representation of spectra as graphs of m/z differences, and (ii) an efficient frequent-subgraph mining algorithm. We demonstrate both on a spectral database from standards and on acquisitions in biological matrices that these new fragmentation patterns capture similarities that are not extracted by existing methods, and facilitate the structural interpretation of molecular network components and the elucidation of unknown spectra. The mineMS2 software is publicly available as an R package (https://github.com/odisce/mineMS2).</p><p> We present an innovative strategy for structural elucidation, which extracts exact fragmentation patterns of m/z differences within collections of MS/MS spectra. The algorithms are implemented in a software library enabling efficient mining of MS/MS data and coupling to molecular networks. 
We show on real datasets the specific value of the patterns as fragmentation graphs for structural interpretation and <i>de novo</i> identification, and their complementarity to existing approaches.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01051-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144694131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HERGAI: an artificial intelligence tool for structure-based prediction of hERG inhibitors","authors":"Viet-Khoa Tran-Nguyen, Ulrick Fineddie Randriharimanamizara, Olivier Taboureau","doi":"10.1186/s13321-025-01063-8","DOIUrl":"10.1186/s13321-025-01063-8","url":null,"abstract":"<div><p>The human Ether-à-go-go-Related Gene (hERG) potassium channel is crucial for repolarizing the cardiac action potential and regulating the heartbeat. Molecules that inhibit this protein can cause acquired long QT syndrome, increasing the risk of arrhythmias and sudden fatal cardiac arrests. Detecting compounds with potential hERG inhibitory activity is therefore essential to mitigate cardiotoxicity risks. In this article, we present a new hERG data set of unprecedented size, comprising nearly 300,000 molecules reported in PubChem and ChEMBL, approximately 2000 of which were confirmed hERG blockers identified through in vitro assays. Multiple structure-based artificial intelligence (AI) binary classifiers for predicting hERG inhibitors were developed, employing, as descriptors, protein–ligand extended connectivity (PLEC) fingerprints fed into random forest, extreme gradient boosting, and deep neural network (DNN) algorithms. Our best-performing model, a stacking ensemble classifier with a DNN meta-learner, achieved state-of-the-art classification performance, accurately identifying 86% of molecules having half-maximal inhibitory concentrations (IC<sub>50</sub>s) not exceeding 20 µM in our challenging test set, including 94% of hERG blockers whose IC<sub>50</sub>s were not greater than 1 µM. It also demonstrated superior screening power compared to virtual screening schemes that used existing scoring functions. 
This model, named “HERGAI,” along with relevant input/output data and user-friendly source code, is available in our GitHub repository (https://github.com/vktrannguyen/HERGAI) and can be used to predict drug-induced hERG blockade, even on large data sets.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01063-8","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144694132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The topology of molecular representations and its influence on machine learning performance","authors":"Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff","doi":"10.1186/s13321-025-01045-w","DOIUrl":"10.1186/s13321-025-01045-w","url":null,"abstract":"<div><p>Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations.</p><p><b>Scientific contribution</b> Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. 
In addition, we facilitate future research endeavors by providing open access to our developed model.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01045-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144678217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking ML in ADMET predictions: the practical impact of feature representations in ligand-based models","authors":"Gintautas Kamuntavičius, Tanya Paquet, Orestis Bastas, Dainius Šalkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaišas, Roy Tal","doi":"10.1186/s13321-025-01041-0","DOIUrl":"10.1186/s13321-025-01041-0","url":null,"abstract":"<div><p>This study, focusing on predicting Absorption, Distribution, Metabolism, Excretion, and Toxicology (ADMET) properties, addresses the key challenges of ML models trained using ligand-based representations. We propose a structured approach to data feature selection, taking a step beyond the conventional practice of combining different representations without systematic reasoning. Additionally, we enhance model evaluation methods by integrating cross-validation with statistical hypothesis testing, adding a layer of reliability to the model assessments. Our final evaluations include a practical scenario, where models trained on one source of data are evaluated on a different one. This approach aims to bolster the reliability of ADMET predictions, providing more dependable and informative model evaluations.</p><p><b>Scientific contribution</b></p><p>This study provides a structured approach to feature selection. We improve model evaluation by combining cross-validation with statistical hypothesis testing, making results more reliable. The methodology used in our study can be generalized beyond feature selection, boosting confidence in the selected models, which is crucial in a noisy domain such as ADMET prediction. 
Additionally, we assess how well models trained on one dataset perform on another, offering practical insights for using external data in drug discovery.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01041-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144678216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure-based machine learning screening identifies natural product candidates as potential geroprotectors","authors":"Jose Alberto Santiago-de-la-Cruz, Nadia Alejandra Rivero-Segura, Juan Carlos Gomez-Verjan","doi":"10.1186/s13321-025-01058-5","DOIUrl":"10.1186/s13321-025-01058-5","url":null,"abstract":"<div><p>Age-related diseases and syndromes result in poor quality of life and adverse outcomes, representing a challenge to healthcare systems worldwide. Several pharmacological interventions have been proposed to target the aging process to slow its adverse effects. The so-called <i>geroprotectors</i> have been proposed as novel molecules that could maintain the organism's homeostasis, targeting specific aspects linked to the hallmarks of aging and delaying the adverse outcomes associated with age. On the other hand, machine learning (ML) is revolutionising drug design by making the process faster, cheaper, and more efficient.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01058-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144640267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SuperMetal: a generative AI framework for rapid and precise metal ion location prediction in proteins","authors":"Xiaobo Lin, Zhaoqian Su, Yunchao Lance Liu, Jingxian Liu, Xiaohan Kuang, Peter T. Cummings, Jesse Spencer-Smith, Jens Meiler","doi":"10.1186/s13321-025-01038-9","DOIUrl":"10.1186/s13321-025-01038-9","url":null,"abstract":"<div><p>Metal ions, as abundant and vital cofactors in numerous proteins, are crucial for enzymatic activities and protein interactions. Given their pivotal role and catalytic efficiency, accurately and efficiently identifying metal-binding sites is fundamental to elucidating their biological functions and has significant implications for protein engineering and drug discovery. To address this challenge, we present SuperMetal, a generative AI framework that leverages a score-based diffusion model coupled with a confidence model to predict metal-binding sites in proteins with high precision and efficiency. Using zinc ions as an example, SuperMetal outperforms existing state-of-the-art models, achieving a precision of 94 % and coverage of 90 %, with zinc ion localization within 0.52 ± 0.55 Å of experimentally determined positions, thus marking a substantial advance in metal-binding site prediction. Furthermore, SuperMetal demonstrates rapid prediction capabilities (under 10 s for proteins with <span>~</span>2000 residues) and remains minimally affected by increases in protein size. Notably, SuperMetal does not require prior knowledge of the number of metal ions—unlike AlphaFold 3, which depends on this information. 
Additionally, SuperMetal can be readily adapted to other metal ions or repurposed as a probe framework to identify other types of binding sites, such as protein-binding pockets.</p><p><b>Scientific contribution</b></p><p>SuperMetal introduces a diffusion-based, SE(3)-equivariant generative model that places metal ions in proteins with 94 % precision, 90 % coverage, and sub-ångström (0.52 Å) accuracy in under 10 s, surpassing current methods and accelerating metal-aware protein engineering and drug discovery.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01038-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144640396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}