{"title":"Cycle-configuration descriptors: a novel graph-theoretic approach to enhancing molecular inference","authors":"Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu","doi":"10.1186/s13321-025-01042-z","DOIUrl":"10.1186/s13321-025-01042-z","url":null,"abstract":"<div><p>Inference of molecules with desired activities/properties is one of the key and challenging issues in cheminformatics and bioinformatics. For that purpose, our research group has recently developed a state-of-the-art framework <span>mol-infer</span> for molecular inference. This framework first constructs a prediction function for a fixed property using machine learning models, which is then simulated by mixed-integer linear programming to infer desired molecules. The accuracy of the framework heavily relies on the representation power of the descriptors. In this study, we highlight a typical class of non-isomorphic chemical graphs with reasonably different property values that cannot be distinguished by the standard “two-layered (2L) model” of <span>mol-infer</span>. To address this distinguishability problem of the 2L model, we propose a novel family of descriptors, named <i>cycle-configuration (CC)</i>, which captures the notion of ortho/meta/para patterns that appear in aromatic rings, something the framework could not previously express. Extensive computational experiments show that with the new descriptors, we can construct prediction functions with similar or better performance for all 44 tested chemical properties, including 27 regression datasets and 17 classification datasets, compared with our previous studies, confirming the effectiveness of the CC descriptors. For inference, we also formulate the CC descriptors as a system of linear constraints. We demonstrate that a chemical graph with up to 50 non-hydrogen vertices can be inferred within a practical time frame.
</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01042-z","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144861404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-fidelity graph neural networks for predicting toluene/water partition coefficients","authors":"Thomas Nevolianis, Jan G. Rittig, Alexander Mitsos, Kai Leonhard","doi":"10.1186/s13321-025-01057-6","DOIUrl":"10.1186/s13321-025-01057-6","url":null,"abstract":"<div><p>Accurate prediction of toluene/water partition coefficients of neutral species is crucial in drug discovery and separation processes; however, data-driven modeling of these coefficients remains challenging due to limited available experimental data. To address the limitation of available data, we apply multi-fidelity learning approaches leveraging a quantum chemical dataset (low fidelity) of approximately 9000 entries generated by COSMO-RS and an experimental dataset (high fidelity) of about 250 entries collected from the literature. We explore the <i>transfer learning</i>, <i>feature-augmented learning</i>, and <i>multi-target learning</i> approaches in combination with graph neural networks, validating them on two external datasets: one with molecules similar to training data (EXT-Zamora) and one with more challenging molecules (EXT-SAMPL9). Our results show that <i>multi-target learning</i> significantly improves predictive accuracy, achieving a root-mean-square error of 0.44 log <i>P</i> units for the EXT-Zamora dataset, compared to a root-mean-square error of 0.63 log <i>P</i> units for single-task models. For the EXT-SAMPL9 dataset, <i>multi-target learning</i> achieves a root-mean-square error of 1.02 log <i>P</i> units, indicating reasonable performance even for more complex molecular structures. These findings highlight the potential of multi-fidelity learning approaches that leverage quantum chemical data to improve toluene/water partition coefficient predictions and address challenges posed by limited experimental data.
We expect the applicability of the methods used beyond just toluene/water partition coefficients.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01057-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144797314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advanced machine learning for innovative drug discovery","authors":"Igor V. Tetko, Djork-Arné Clevert","doi":"10.1186/s13321-025-01061-w","DOIUrl":"10.1186/s13321-025-01061-w","url":null,"abstract":"<div><p>This editorial presents an analysis of the articles published in the <i>Journal of Cheminformatics</i> Special Issue “AI in Drug Discovery”. We review how novel machine learning developments are enhancing structural-based drug discovery; providing better forecasts of molecular properties while also improving various elements of chemical reaction prediction. Methodological developments focused on increasing the accuracy of models via pre-training, estimating the accuracy of predictions, tuning model hyperparameters while avoiding overfitting, in addition to a diverse range of other novel and interesting methodological aspects, including the incorporation of human expert knowledge to analysing the susceptibility of models to adversary attacks, were explored in this Special Issue. 
In summary, the Special Issue brought together an excellent collection of articles that collectively demonstrate how machine learning methods have become an essential asset in modern drug discovery, with the potential to advance autonomous chemistry labs in the near future.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01061-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144797318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nanodesigner: resolving the complex-CDR interdependency with iterative refinement","authors":"Melissa Maria Rios Zertuche, Şenay Kafkas, Dominik Renn, Magnus Rueping, Robert Hoehndorf","doi":"10.1186/s13321-025-01069-2","DOIUrl":"10.1186/s13321-025-01069-2","url":null,"abstract":"<div><p>Camelid heavy-chain only antibodies consist of two heavy chains and single variable domains (VHHs), which retain antigen-binding functionality even when isolated. The term “nanobody” is now more generally used for describing small, single-domain antibodies. Several antibody generative models have been developed for the sequence and structure co-design of the complementarity-determining regions (CDRs) based on the binding interface with a target antigen. However, these models are not tailored for nanobodies and are often constrained by their reliance on experimentally determined antigen–antibody structures, which are labor-intensive to obtain. Here, we introduce NanoDesigner, a tool for nanobody design and optimization based on generative AI methods. NanoDesigner integrates key stages—structure prediction, docking, CDR generation, and side-chain packing—into an iterative framework based on an expectation maximization (EM) algorithm. The algorithm effectively tackles an interdependency challenge where accurate docking presupposes <i>a priori</i> knowledge of the CDR conformation, while effective CDR generation relies on accurate docking outputs to guide its design. 
NanoDesigner approximately doubles the success rate of de novo nanobody designs through continuous refinement of docking and CDR generation.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01069-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144797315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"From molecules to data: the emerging impact of chemoinformatics in chemistry","authors":"Anup Basnet Chetry, Keisuke Ohto","doi":"10.1186/s13321-025-00978-6","DOIUrl":"10.1186/s13321-025-00978-6","url":null,"abstract":"<div><p>Chemoinformatics is a rapidly advancing field that integrates chemistry, computer science, and data analysis to enhance the study and application of chemical systems. This interdisciplinary approach leverages computational tools and large datasets to drive innovation in various chemical disciplines, including drug discovery, materials science, and environmental chemistry. Recent advancements in artificial intelligence (AI) and machine learning (ML) have significantly improved the ability to analyze complex datasets, predict molecular properties, and design new compounds. Additionally, the expansion of open-access databases and collaborative platforms has facilitated broader access to chemical data and fostered global research collaboration. Sophisticated molecular modeling techniques, such as multi-scale modeling and free energy calculations, have enhanced the accuracy of predictions, while big data analytics has enabled the extraction of valuable insights from vast datasets. Emerging technologies, including quantum computing, hold promise for further revolutionizing the field by offering new capabilities for simulating and optimizing chemical processes. Despite these advancements, chemoinformatics faces challenges related to data integrity, computational demands, and interdisciplinary collaboration. Addressing these challenges is crucial for the continued growth and effectiveness of chemoinformatics. 
Overall, the field is poised to play a pivotal role in advancing chemical research and developing innovative solutions to address global challenges.</p><p><b>Scientific contribution</b> This article highlights the growing impact of chemoinformatics in modern chemistry by integrating computational tools with molecular science to enhance data-driven discovery. It explores advancements in machine learning, artificial intelligence, and big data analytics, which improve molecular property predictions and accelerate chemical innovations. The study also discusses key applications in drug design and materials science, demonstrating how chemoinformatics drives efficiency and sustainability in research. Additionally, it outlines future challenges and opportunities, emphasizing the need for improved algorithms, data standardization, and interdisciplinary collaboration. This work contributes to the evolving role of chemoinformatics as a crucial pillar of modern chemical research.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00978-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144796717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VitroBert: modeling DILI by pretraining BERT on in vitro data","authors":"Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen","doi":"10.1186/s13321-025-01048-7","DOIUrl":"10.1186/s13321-025-01048-7","url":null,"abstract":"<div><p>Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach to learn molecular representations for downstream tasks, it often lacks insights into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions (BCE, weighted BCE, Focal loss, and weighted Focal loss) and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions in improving model performance on highly imbalanced DILI-related tasks.
</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01048-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144786529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction model for chemical explosion consequences via multimodal feature fusion","authors":"Yilin Wang, Beibei Wang, Yichen Zhang, Jiquan Zhang, Yijie Song, Shuang-Hua Yang","doi":"10.1186/s13321-025-01060-x","DOIUrl":"10.1186/s13321-025-01060-x","url":null,"abstract":"<p>Chemical explosion accidents represent a significant threat to both human safety and environmental integrity. The accurate prediction of such incidents plays a pivotal role in risk mitigation and safety enhancement within the chemical industry. This study proposes an innovative Bayes-Transformer-SVM model based on multimodal feature fusion, integrating Quantitative Structure–Property Relationship (QSPR) and Quantitative Property-Consequence Relationship (QPCR) principles. The model utilizes molecular descriptors derived from the Simplified Molecular Input Line Entry System (SMILES) and Gaussian16 software, combined with leakage condition parameters, as input features to investigate the quantitative relationship between these factors and explosion consequences. A comprehensive validation and evaluation of the constructed model were performed. Results demonstrate that the optimized Bayes-Transformer-SVM model achieves superior performance, with test set metrics reaching an R<sup>2</sup> of 0.9475 and RMSE of 0.1139, outperforming alternative prediction models. The developed model offers a novel and effective approach for assessing explosion risks associated with both existing and newly developed chemical substances. The model enables rapid explosion consequence assessment for chemical storage or transport scenarios, supporting safety-by-design frameworks.</p><p>This study constructed a Bayes-Transformer-SVM model for predicting the consequences of hazardous chemical explosions. The model utilized SMILES encoding and Gaussian16 quantum chemical descriptors, combined with leakage condition scenario parameters, achieving excellent performance. 
Its core lies in the establishment of a multimodal fusion theoretical framework, breaking through the limitations of traditional cross-modal correlation analysis; the development of an optimized architecture that combines Transformer feature extraction and SVM regression; highlighting the potential application of the model in chemoinformatics; and enabling the prospective assessment of the explosion risks of unknown chemicals, supporting a safety-oriented design concept.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01060-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144778291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generative artificial intelligence based models optimization towards molecule design enhancement","authors":"Tarek Khater, Sara Awni Alkhatib, Aamna AlShehhi, Charalampos Pitsalidis, Anna Maria Pappa, Son Tung Ngo, Vincent Chan, Vi Khanh Truong","doi":"10.1186/s13321-025-01059-4","DOIUrl":"10.1186/s13321-025-01059-4","url":null,"abstract":"<div><p>Generative artificial intelligence (GenAI) models have emerged as a transformative tool for addressing the complex challenges of drug discovery, enabling the design of structurally diverse, chemically valid, and functionally relevant molecules. Despite significant advancements, the rapid expansion of GenAI applications still faces challenges related to prediction accuracy, molecular validity, and optimization for drug-like properties. This review provides a comprehensive analysis of recent techniques and strategies aimed at enhancing the performance of GenAI models in molecular design. We explore key generative architectures, including variational autoencoders, generative adversarial networks, and transformer-based models, highlighting their unique contributions to drug discovery. Additionally, we discuss critical advancements such as reinforcement learning, multi-objective optimization, and the integration of domain-specific chemical knowledge, which collectively enhance molecular validity, novelty, and drug-likeness. Also, the review examines persistent challenges, including data quality limitations, model interpretability, and the need for improved objective functions, while offering insights into future research directions. 
By mapping the evolving landscape of GenAI-driven molecular design and providing strategic guidance for overcoming existing limitations, this review serves as an essential resource for researchers leveraging GenAI in drug discovery.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01059-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144778000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: A transformer based generative chemical language AI model for structural elucidation of organic compounds","authors":"Xiaofeng Tan","doi":"10.1186/s13321-025-01065-6","DOIUrl":"10.1186/s13321-025-01065-6","url":null,"abstract":"","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01065-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144778292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural SHAKE: geometric constraints in neural differential equations","authors":"Justin S. Diamond, Markus A. Lill","doi":"10.1186/s13321-025-01053-w","DOIUrl":"10.1186/s13321-025-01053-w","url":null,"abstract":"<p>Generating accurate molecular conformations hinges on sampling effectively from a high-dimensional space of atomic arrangements, which grows exponentially with system size. To ensure physically valid geometries and increase the likelihood of reaching low-energy conformations, it is useful to incorporate prior physics-based information by recasting it as geometric constraints that naturally arise as nonlinear constraint satisfaction problems. In this work, we propose an approach to embed these strict constraints into neural differential equations, leveraging the denoising diffusion framework. By projecting the stochastic generative dynamics onto a manifold defined by constraint sets, our method enforces exact feasibility at each step, unlike alternative approaches that merely impose soft constraints through probabilistic guidance. This technique generates lower-energy molecular conformations, enables more efficient subspace exploration, and formally subsumes classifier-guidance-type methods by treating geometric constraints as strict algebraic conditions within the diffusion process.</p><p>Neural SHAKE formulates exact manifold‑projected score‑based diffusion: each reverse-SDE increment is orthogonally projected, via a Lagrange-multiplier solve, onto the constraint surface σₐ(x) = 0 for a = 1,…, A, with A the number of independent constraints and thus the manifold’s codimension. This projection preserves global SE(3) symmetry and enforces constraints to solver tolerance.
It induces a well-posed surface Fokker–Planck flow on the (3N − A)-dimensional manifold, while a coarea/Fixman Jacobian carries the ambient 3N-dimensional density to a normalized density on that manifold, preserving probability mass after the dimensionality reduction.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01053-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144777973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}