Charlotte Neidiger, Tarek Saier, Kai Kühn, Victor Larignon, Michael Färber, Claudia Bizzarri, Helena Šimek Tosino, Laura Holzhauer, Michael Erdmann, An Nguyen, Dean Harvey, Pierre Tremouilhac, Claudia Kramer, Daniel Hansch, Fabian Schönle, Jana Alpin, Maximilian Hartmann, Jérome Wagner, Nicole Jung, Stefan Bräse
{"title":"Implementation of an open chemistry knowledge base with a Semantic Wiki","authors":"Charlotte Neidiger, Tarek Saier, Kai Kühn, Victor Larignon, Michael Färber, Claudia Bizzarri, Helena Šimek Tosino, Laura Holzhauer, Michael Erdmann, An Nguyen, Dean Harvey, Pierre Tremouilhac, Claudia Kramer, Daniel Hansch, Fabian Schönle, Jana Alpin, Maximilian Hartmann, Jérome Wagner, Nicole Jung, Stefan Bräse","doi":"10.1186/s13321-025-01037-w","DOIUrl":"https://doi.org/10.1186/s13321-025-01037-w","url":null,"abstract":"In this work, a concept for an open chemistry knowledge base was developed to integrate chemical research results into a collaboratively usable platform. To achieve this, we enhanced Semantic MediaWiki (SMW) to support the collection and structured summary of chemical data contained in publications. We implemented tools for capturing chemical structures in machine-readable formats and designed data forms along with a data model to ensure standardized input and organization of research results. These enhancements allow for effective data comparison and contextual analysis within an expandable Wiki environment. The use of the platform was specifically demonstrated by organizing and comparing research in the area of “CO2 reduction in homogeneous photocatalytic systems,” showcasing its potential to significantly enhance the collaborative collection of research outcomes. Scientific contribution This work shows ways to collaboratively collect and manage subject-specific knowledge in the domain of chemistry via an open database. By integrating cheminformatic tools into Semantic Mediawiki, an established technology for building knowledge databases is made systematically usable for the chemical community. The integration of chemistry-specific workflows and forms allows the mapping of data from current research with links to the original sources. 
This work is intended to show how gaps in the information system of scientists can be closed without having to use commercial systems.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"4 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144568753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Todor Kondić, Anjana Elapavalore, Jessy Krier, Adelene Lai, Hiba Mohammed Taha, Mira Narayanan, Emma L. Schymanski
{"title":"Shinyscreen: mass spectrometry data inspection and quality checking utility","authors":"Todor Kondić, Anjana Elapavalore, Jessy Krier, Adelene Lai, Hiba Mohammed Taha, Mira Narayanan, Emma L. Schymanski","doi":"10.1186/s13321-025-01044-x","DOIUrl":"https://doi.org/10.1186/s13321-025-01044-x","url":null,"abstract":"Shinyscreen is an R package and Shiny-based web application designed for the exploration, visualization, and quality assessment of raw data from high resolution mass spectrometry instruments. Its versatile list-based approach supports the curation of data starting from either known or “suspected” compounds (compound list-based screening) or detected masses (mass list-based screening), making it adaptable to diverse analytical needs (target, suspect or non-target screening). Shinyscreen can be operated in multiple modes, including as an R package, an interactive command-line tool, a self-documented web GUI, or a network-deployable service. Shinyscreen has been applied in environmental research, database enrichment, and educational initiatives, showcasing its broad utility. Shinyscreen is available in GitLab ( https://gitlab.com/uniluxembourg/lcsb/eci/shinyscreen ) under the Apache License 2.0. The repository contains detailed instructions for deployment and use. Additionally, a pre-configured Docker image, designed for seamless installation and operation is available, with instructions also provided in the main repository. Scientific Contribution: Shinyscreen is a fully open source prescreening application to assist analysts in the high throughput quality control of the thousands of peaks detected in high resolution mass spectrometry experiments. As a vendor-independent, cross operating system application it covers an important niche in open mass spectrometry workflows. 
Shinyscreen supports quality control of data for further identification or upload of spectra to public data resources, as well as teaching efforts to educate students on the importance of data quality control and rigorous identification methods.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"14 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nico Domschke, Bruno J. Schmidt, Thomas Gatter, Richard Golnik, Paul Eisenhuth, Fabian Liessmann, Jens Meiler, Peter F. Stadler
{"title":"Crossover operators for molecular graphs with an application to virtual drug screening","authors":"Nico Domschke, Bruno J. Schmidt, Thomas Gatter, Richard Golnik, Paul Eisenhuth, Fabian Liessmann, Jens Meiler, Peter F. Stadler","doi":"10.1186/s13321-025-00958-w","DOIUrl":"https://doi.org/10.1186/s13321-025-00958-w","url":null,"abstract":"Genetic algorithms are a powerful method to solve optimization problems with complex cost functions over vast search spaces that rely in particular on recombining parts of previous solutions. Crossover operators play a crucial role in this context. Here, we describe a large class of these operators designed for searching over spaces of graphs. These operators are based on introducing small cuts into graphs and rejoining the resulting induced subgraphs of two parents. This form of cut-and-join crossover can be restricted in a consistent way to preserve local properties such as vertex-degrees (valency), or bond-orders, as well as global properties such as graph-theoretic planarity. In contrast to crossover on strings, cut-and-join crossover on graphs is powerful enough to ergodically explore chemical space even in the absence of mutation operators. Extensive benchmarking shows that the offspring of molecular graphs are again plausible molecules with high probability, while at the same time crossover drastically increases the diversity compared to initial molecule libraries. Moreover, desirable properties such as favorable indices of synthesizability are preserved with sufficient frequency that candidate offspring can be filtered efficiently for such properties. As an application we utilized the cut-and-join crossover in REvoLd, a GA-based system for computer-aided drug design. In optimization runs searching for ligands binding to four different target proteins we consistently found candidate molecules with binding constants exceeding the best known binders as well as candidates found in make-on-demand libraries. 
Scientific contribution We define cut-and-join crossover operators on a variety of graph classes including molecular graphs. This constitutes a mathematically simple and well-characterized approach to recombination of molecules that performed very well in real-life CADD tasks.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"44 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144311943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancements in thermochemical predictions: a multi-output thermodynamics-informed neural network approach","authors":"Raheel Hammad, Sownyak Mondal","doi":"10.1186/s13321-025-01033-0","DOIUrl":"https://doi.org/10.1186/s13321-025-01033-0","url":null,"abstract":"The Gibbs free energy of an inorganic material represents its maximum reversible work potential under constant temperature and pressure. Its calculation is crucial for understanding material stability, phase transitions, and chemical reactions, thus guiding optimization for diverse applications like catalysis and energy storage. In this study, we have developed a Physics-Informed Neural Network model that leverages the Gibbs free energy equation. The overall loss function is adjusted to allow the model to simultaneously predict all three thermodynamic quantities, namely Gibbs free energy, total energy, and entropy, thus transforming it into a multi-output model. In recent literature, there is a growing emphasis on evaluating machine learning models under challenging conditions, such as small datasets and out-of-distribution predictions. Reflecting this trend, we have rigorously benchmarked our model across these scenarios, demonstrating its robustness and adaptability. Our model demonstrates a 43% improvement in the normal scenario, and an even larger improvement in the out-of-distribution regime, compared to the next-best model. Scientific Contribution This study introduces the application of a Physics-Informed Neural Network to simultaneously compute multiple thermodynamic properties, including Gibbs free energy, total energy, and entropy. 
By integrating the Gibbs free energy equation into the loss function, the model achieves superior accuracy in low data regimes and enhances robustness in the out-of-distribution scenarios.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"11 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144296245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Palistha Shrestha, Chandana S. Talwar, Jeevan Kandel, Kwang-Hyun Park, Kil To Chong, Eui-Jeon Woo, Hilal Tayara
{"title":"NanoBinder: a machine learning assisted nanobody binding prediction tool using Rosetta energy scores","authors":"Palistha Shrestha, Chandana S. Talwar, Jeevan Kandel, Kwang-Hyun Park, Kil To Chong, Eui-Jeon Woo, Hilal Tayara","doi":"10.1186/s13321-025-01040-1","DOIUrl":"https://doi.org/10.1186/s13321-025-01040-1","url":null,"abstract":"Nanobodies offer significant therapeutic potential due to their small size, stability, and versatility. Although advancements in computational protein design have made designing de novo nanobodies increasingly feasible, there are limited tools specifically tailored for this purpose. Rosetta, with its specialized protocols, is a prominent tool for nanobody design but is limited by a high false-negative rate, necessitating extensive high-throughput screening. This results in increased costs, time, and labor due to the need for large-scale experimentation and detailed structural analysis. To address current challenges in nanobody design, we introduce NanoBinder, an interpretable machine learning model that predicts nanobody-antigen binding using Rosetta energy scores. NanoBinder utilizes a Random Forest model trained on experimentally validated complexes and can be seamlessly integrated into the Rosetta software. It employs SHAP summary plots for interpretability, which helps identify key features influencing binding interactions. Experimentally validated on forty-nine diverse nanobodies, NanoBinder accurately predicts non-binders and shows reasonable performance in identifying binders. This approach significantly enhances predictive accuracy, reduces the need for extensive experimental assays, and accelerates nanobody development, thereby offering a powerful tool to mitigate the costs, time, and labor associated with high-throughput screening. Scientific contribution This study introduces NanoBinder, a machine learning framework for predicting nanobody-antigen binding using Rosetta-derived energy features. 
Through rigorous experimental validation across diverse nanobody sets, NanoBinder enhances nanobody screening workflows by reducing false positives and minimizing reliance on extensive wet-lab assays. The approach bridges the gap between physics-based modeling and data-driven prediction in nanobody design.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"227 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144296244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J. Ballester
{"title":"UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines*","authors":"Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J. Ballester","doi":"10.1186/s13321-025-01039-8","DOIUrl":"https://doi.org/10.1186/s13321-025-01039-8","url":null,"abstract":"Virtual Screening (VS) of large compound libraries using Artificial Intelligence (AI) models is a highly effective approach for early drug discovery. Data splitting is crucial for benchmarking the performance of such AI models. Traditional random data splits often result in structurally similar molecules in both training and test sets, which conflicts with the reality of VS libraries that typically contain structurally diverse compounds. To tackle this challenge, scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by chemotypes, have long been used. However, we show that these methods still introduce high similarities between training and test sets, leading to overestimated model performance. Our study examined four representative AI models across 60 NCI-60 datasets, each comprising approximately 33,000–54,000 molecules tested on different cancer cell lines. Each dataset was split in four ways: random, scaffold, Butina clustering and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Using Linear Regression, Random Forest, Transformer-CNN, and GEM, we trained a total of 8400 models and evaluated them under the four splitting methods. These comprehensive results show that UMAP split provides the most challenging and realistic benchmark for model evaluation, followed by Butina splits, then scaffold splits, closely followed by random splits. Consequently, we recommend using UMAP splits instead of overly optimistic Butina splits and especially scaffold splits for molecular property prediction, including VS. Lastly, we illustrate how misaligned ROC AUC is with VS goals, despite its common use. 
The code and datasets for reproducibility are available at https://github.com/Rong830/UMAP_split_for_VS and archived in https://zenodo.org/records/14736486 . Scientific contribution This work advances the field by introducing UMAP clustering as a robust splitting method for molecular datasets, improving over traditional methods like Butina clustering and especially scaffold splits. It offers a new evaluation framework to benchmark AI models under more realistic conditions, fostering progress in molecular property prediction. The findings also show how inappropriate the use of ROC AUC for virtual screening (VS) continues to be, despite its popularity, emphasizing the need for context-specific evaluation metrics.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"218 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144260195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongna Yuan, Xiaohang Pan, Xiaohong Li, Ruisheng Zhang, Wei Su
{"title":"A 3D generation framework using diffusion model and reinforcement learning to generate multi-target compounds with desired properties","authors":"Yongna Yuan, Xiaohang Pan, Xiaohong Li, Ruisheng Zhang, Wei Su","doi":"10.1186/s13321-025-01035-y","DOIUrl":"https://doi.org/10.1186/s13321-025-01035-y","url":null,"abstract":"Deep generative models provide a powerful solution for the de novo design of molecules. However, the majority of existing methods only generate molecules for a single target. Generating molecules with biological activities against multiple specific targets and desired properties remains an extremely difficult challenge. In this study, we propose a novel 3D molecule generation framework based on reinforcement learning and diffusion model to generate molecules with predefined properties for given multiple targets. The proposed framework, MDRL, uses a diffusion model to understand the 3D chemical structure of molecules and employs Kolmogorov-Arnold Networks instead of Multilayer Perceptron to enhance model performance. Through reinforcement learning, the framework is able to generate molecules that simultaneously target two targets and further optimizes multiple molecular properties. Experimental results show that our model exhibits comparable performance to various state-of-the-art molecular generation models, and MDRL can effectively navigate chemical space to design polypharmacological compounds and control multiple molecular properties. In multiple case studies, we verify that the generated molecules can simultaneously target two targets through molecular docking and assess the model’s ability to control multiple molecular properties. The results in this study highlight the advantages and practicalities of our model in generating polypharmacological compounds with desired properties. 
This study introduces MDRL, a 3D molecular generation framework integrating diffusion models and reinforcement learning for joint optimization of multi-target binding and molecular properties. MDRL shows improvements over existing methods in controlling drug-relevant properties and enhancing multi-target affinity. Experimental results demonstrate that MDRL efficiently generates drug-like compounds with robust polypharmacological profiles, offering a novel strategy for multi-target drug design.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144211377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RLSuccSite: succinylation sites prediction based on reinforcement learning dynamic with balanced reward mechanism and three-peaks enhanced method for physicochemical property scores","authors":"Lun Zhu, Qingchao Zhang, Sen Yang","doi":"10.1186/s13321-025-01034-z","DOIUrl":"https://doi.org/10.1186/s13321-025-01034-z","url":null,"abstract":"Recent progress in computational biology has driven the development of machine learning models for predicting protein post-translational modification sites. However, challenges such as data imbalance and limited sequence-context representation continue to hinder prediction accuracy, particularly for less frequent modifications like succinylation. In this study, we propose RLSuccSite, a reinforcement learning-based framework specifically designed to predict succinylation sites by addressing the class imbalance issue via a dynamic with balanced reward mechanism. To enhance sequence feature representation, this study also introduces Three-Peaks Enhanced Method for Physicochemical Property Scores (TPEM-PPS), a physicochemical property-driven feature extraction method that incorporates position-aware scoring to reflect amino acid contributions more effectively. The code and data of RLSuccSite can be obtained from the website: https://github.com/Zhangqingchao-Ch/RLSuccSite.git . Scientific contribution This study applies reinforcement learning to protein succinylation sites prediction, introducing a dynamic with balanced reward mechanism that effectively addresses dataset imbalance. Additionally, this study proposes a novel Three-Peaks Enhanced Method for Physicochemical Scoring, which captures residue contributions with higher precision than traditional feature extraction techniques. 
","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"9 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144193336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eduardo Illueca Fernández, Antonio Jesús Jara Valera, Jesualdo Tomás Fernández Breis
{"title":"Representation of chemistry transport models simulations using knowledge graphs","authors":"Eduardo Illueca Fernández, Antonio Jesús Jara Valera, Jesualdo Tomás Fernández Breis","doi":"10.1186/s13321-025-01025-0","DOIUrl":"https://doi.org/10.1186/s13321-025-01025-0","url":null,"abstract":"Persistent air pollution poses a serious threat to human health, and is one of the action points that policy makers should monitor according to Directive 2008/50/EC. While deploying a massive network of hyperlocal sensors could provide extensive monitoring, this approach cannot generate geospatially continuous data and presents several logistical challenges. Thus, developing accurate and trustworthy expert systems based on chemistry transport models is a key strategy for environmental protection. However, chemistry transport models suffer from a significant lack of standardization, and their formats are not interoperable between different systems, which limits their use by different stakeholders. In this context, semantic technologies provide methods and standards for scientific data and make information readable for expert systems. Therefore, this paper proposes a novel methodology for an ontology-driven transformation of simulations from CHIMERE, a chemistry transport model, which allows knowledge graphs representing air quality information to be generated. It enables the transformation of netCDF files into RDF triples for short-term air quality forecasting. Concretely, we utilize the Semantic Web Integration Tool (SWIT) framework for mapping individuals using an ontology as a template. A new ontology for CHIMERE has been defined in this work, reusing concepts from other state-of-the-art standards. Our approach demonstrates that RDF files can be created from netCDF in linear computational time, which allows expert systems to scale. 
In addition, the ontology complies with the OQuaRE quality metrics and can be extended in the future to cover other chemistry transport models. Development of the first ontology for a chemistry transport model. FAIRification of physical models thanks to the generation of knowledge graphs from netCDF files. The ontology proposed is published in PURL ( https://purl.org/chimere-ontology ) and the knowledge graph generated for a 72-h simulation can be accessed in the following repository: https://doi.org/10.5281/zenodo.13981544 .","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"3 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144188999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Higher education in chemoinformatics: achievements and challenges","authors":"Alexandre Varnek, Gilles Marcou, Dragos Horvath","doi":"10.1186/s13321-025-01036-x","DOIUrl":"https://doi.org/10.1186/s13321-025-01036-x","url":null,"abstract":"While chemoinformatics is a well-established scientific field, its integration into university curricula is rarely discussed. In this work, we share our experience in developing a chemoinformatics curriculum at the University of Strasbourg and highlight the main challenges in higher education for this discipline.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"28 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144188912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}