Alicia Olivares-Gil, José A. Barbero-Aparicio, Juan J. Rodríguez, José F. Díez-Pastor, César García-Osorio, Mehdi D. Davari
{"title":"Semi-supervised prediction of protein fitness for data-driven protein engineering","authors":"Alicia Olivares-Gil, José A. Barbero-Aparicio, Juan J. Rodríguez, José F. Díez-Pastor, César García-Osorio, Mehdi D. Davari","doi":"10.1186/s13321-025-01029-w","DOIUrl":"https://doi.org/10.1186/s13321-025-01029-w","url":null,"abstract":"Protein fitness prediction plays a crucial role in the advancement of protein engineering endeavours. However, the combinatorial complexity of the protein sequence space and the limited availability of assay-labelled data hinder the efficient optimization of protein properties. Data-driven strategies utilizing machine learning methods have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, in this work, we explore various ways of introducing the latent information present in evolutionarily related sequences (homologous sequences) into the training process. To do so, we establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison using 19 datasets containing protein-fitness pairs. Our findings reveal that using the information present in the homologous sequences can improve the performance of the models, especially when the number of available labelled sequences is considerably low. Specifically, the combination of a sequence encoding method based on Direct Coupling Analysis (DCA), with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor, outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial leap towards more robust and reliable predictive models for protein engineering tasks. 
This advancement holds the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics. We explore several semi-supervised learning strategies capable of including the unlabelled sequences homologous to the protein of interest in the training process. Among them, we present two new methods to exploit the information in the homologous sequences: i) a new generalised version of MERGE capable of employing any regressor as a base estimator; ii) the Tri-Training Regressor method, an adaptation of the Tri-Training method for regression problems. We find that the information inherent in the homologous sequences can improve the predictive capacity of models when labelled sequences are scarce, especially when using the DCA encoding together with MERGE and an SVM regressor.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"3 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144188911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing atom mapping with multitask learning and symmetry-aware deep graph matching","authors":"Maryam Astero, Juho Rousu","doi":"10.1186/s13321-025-01030-3","DOIUrl":"https://doi.org/10.1186/s13321-025-01030-3","url":null,"abstract":"Atom mapping involves identifying the correspondence between individual atoms in reactant molecules and their counterparts in product molecules. This process is crucial for gaining deeper insight into reaction mechanisms, such as defining reaction templates and determining which chemical bonds are formed or broken during a reaction. However, reliable atom mapping data are often limited or incomplete within chemical databases, rendering manual annotation impractical for large-scale datasets. To address this limitation, we propose the Symmetry-Aware Multitask Atom Mapping Network (SAMMNet), a model designed to automatically infer atom correspondences by incorporating an auxiliary self-supervised task during training. SAMMNet employs molecular graph representations and leverages graph neural networks to capture both general and task-specific features, enabling enhanced predictive performance. Our experimental results demonstrate that the multitask learning framework, coupled with symmetry-aware atom mapping, improves accuracy and robustness in atom mapping predictions. This makes our method a promising advancement for computational chemistry and related fields. This study introduces SAMMNet, a novel Symmetry-Aware Multitask Atom Mapping Network, advancing atom mapping methodologies by integrating multitask learning and post-prediction symmetry refinement. 
Unlike prior approaches, SAMMNet leverages auxiliary self-supervised tasks to enhance molecular graph representations, improving mapping accuracy while addressing imbalanced reactions through graph padding techniques.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"68 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144176534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pilleriin Peets, Aristeidis Litos, Kai Dührkop, Daniel R. Garza, Justin J. J. van der Hooft, Sebastian Böcker, Bas E. Dutilh
{"title":"Chemical characteristics vectors map the chemical space of natural biomes from untargeted mass spectrometry data","authors":"Pilleriin Peets, Aristeidis Litos, Kai Dührkop, Daniel R. Garza, Justin J. J. van der Hooft, Sebastian Böcker, Bas E. Dutilh","doi":"10.1186/s13321-025-01031-2","DOIUrl":"10.1186/s13321-025-01031-2","url":null,"abstract":"<div><p>Untargeted metabolomics can comprehensively map the chemical space of a biome, but is limited by low annotation rates (< 10%). We used chemical characteristics vectors, consisting of molecular fingerprints or chemical compound classes, predicted from mass spectrometry data, to characterize compounds and samples. These chemical characteristics vectors (CCVs) estimate the fraction of compounds with specific chemical properties in a sample. Unlike the aligned MS1 data with intensity information, CCVs incorporate the chemical properties of compounds, allowing chemical annotation to be used for sample comparison. Thus, we identified compound classes differentiating biomes, such as ethers, which are enriched in environmental biomes, and steroids, which are enriched in animal host-related biomes. In biomes with greater variability, CCVs revealed key clustering compound classes, such as organonitrogen compounds in animal distal gut and lipids in animal secretions. 
CCVs thus enhance the interpretation of untargeted metabolomic data, providing a quantifiable and generalizable understanding of the chemical space of natural biomes.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01031-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alejandro Martínez León, Benjamin Ries, Jochen S. Hub, Aniket Magarkar
{"title":"Moldrug algorithm for an automated ligand binding site exploration by 3D aware molecular enumerations","authors":"Alejandro Martínez León, Benjamin Ries, Jochen S. Hub, Aniket Magarkar","doi":"10.1186/s13321-025-01022-3","DOIUrl":"10.1186/s13321-025-01022-3","url":null,"abstract":"<div><p>We present Moldrug, a computational tool for accelerating the hit-to-lead phase in structure-based drug design. Moldrug explores the chemical space using structural modifications suggested by the CReM library and by optimizing an adaptable fitness function with a genetic algorithm. Moldrug is complemented by Moldrug-Dashboard, a cross-platform and user-friendly graphical interface tailored for the analysis of Moldrug simulations. To illustrate Moldrug, we designed new potential inhibitors targeting the main protease (M<sup>Pro</sup>) of SARS-CoV-2 by optimizing a consensus fitness function that balances binding affinity, drug-likeness, and synthetic accessibility. The designed molecules exhibited high chemical diversity. A subset of the designed molecules was ranked using MM/GBSA and alchemical binding free energy calculations, revealing predicted affinities as low as −10 kcal mol<sup>−1</sup>. Moldrug is distributed as a Python package under the Apache 2.0 license. It offers pre-configured multi-parameter fitness functions for molecular design, while being highly adaptable for integrating functionalities from external software. 
Documentation and tutorials are available at https://moldrug.rtfd.io.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01022-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing molecular property prediction with quantized GNN models","authors":"Areen Rasool, Jamshaid Ul Rahman, Rongin Uwitije","doi":"10.1186/s13321-025-00989-3","DOIUrl":"10.1186/s13321-025-00989-3","url":null,"abstract":"<div><p>Efficient and reliable prediction of molecular properties, such as water solubility, hydration free energy, lipophilicity, and quantum mechanical properties, is essential for rational compound design in the chemical and pharmaceutical industries. While Graph Neural Networks (GNNs) have significantly advanced molecular property prediction tasks, their high memory footprint, computational demands, and inference latency are often overlooked. These challenges hinder the deployment of property prediction models on resource-constrained devices such as smartphones and IoT devices. Therefore, optimizing storage, reducing resource consumption, and improving inference speed are crucial. This paper presents a systematic approach to molecular networks by integrating GNN models with the DoReFa-Net quantization algorithm. The proposed method aims to enhance computational efficiency while maintaining predictive performance, enabling lightweight yet effective models suitable for molecular tasks. The study investigates the impact of different bitwidth quantization levels on model performance, using metrics such as RMSE and MAE. Results show that, for physical chemistry datasets, the effectiveness of quantization is highly dependent on the model architecture. Notably, the quantum mechanical dipole moment task maintains strong performance up to 8-bit precision, achieving similar or slightly better results. 
However, extreme quantization, particularly at 2-bit precision, severely degrades performance, highlighting the limitations of aggressive compression.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-00989-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Starman, Fabian Kirchner, Martin Held, Catriona Eschke, Sayed-Ahmad Sahim, Regine Willumeit-Römer, Nicole Jung, Stefan Bräse
{"title":"ELNdataBridge: facilitating data exchange and collaboration by linking Electronic Lab Notebooks via API","authors":"Martin Starman, Fabian Kirchner, Martin Held, Catriona Eschke, Sayed-Ahmad Sahim, Regine Willumeit-Römer, Nicole Jung, Stefan Bräse","doi":"10.1186/s13321-025-01024-1","DOIUrl":"10.1186/s13321-025-01024-1","url":null,"abstract":"<div><p>Electronic Lab Notebooks (ELNs) have become indispensable tools for modern research laboratories, facilitating data management, collaboration, and documentation of scientific experiments. However, the proliferation of diverse ELN platforms poses challenges for researchers who need to seamlessly exchange data between different systems. In this paper, we present ELNdataBridge, a novel server-based solution designed to address this challenge by providing a flexible adapter for interfacing and synchronising data between disparate ELN platforms. ELNdataBridge leverages Python APIs to interact with the underlying data structures of various ELN systems, enabling smooth transfer of information between them. The system offers a user-friendly interface that allows researchers to map and configure the transfer of single values and entry types between different ELNs, thereby facilitating interoperability and data exchange. The suitability and efficiency of the developed software were shown by a first demonstrator, which enables the exchange of data between Chemotion ELN and Herbie and thereby connects information with a focus on chemistry and materials sciences.</p><p><b>Scientific contribution:</b> To the best of our knowledge, a method enabling the interoperable exchange of information between different ELNs, as described here, has not yet been reported. 
Given the increasing number of scientists using ELNs and their reliance on discipline-specific platforms, this work proposes a solution to overcome the current limitations related to ELN interoperability.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01024-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-dependent similarity searching for small molecular fragments","authors":"Atsushi Yoshimori, Jürgen Bajorath","doi":"10.1186/s13321-025-01032-1","DOIUrl":"10.1186/s13321-025-01032-1","url":null,"abstract":"<div><p>Similarity searching is a mainstay in cheminformatics that is generally used to identify compounds with desired properties. For small molecular fragments, similarity calculations based on standard descriptors often have limited utility for establishing meaningful similarity relationships due to feature sparseness. As an alternative, we have adapted the concept of context-dependent word pair similarity from natural language processing to evaluate similarity relationships between substituents (R-groups) taking latent characteristics into account. Context-dependent similarity assessment is based on vector embeddings as fragment representations generated using neural networks. With active analogue series as a model system to establish a global structure–activity context, we demonstrate that this approach is applicable to systematic similarity searching for substituents and increases the performance of standard descriptor representations. Context-dependent similarity searching is capable of detecting remote and functionally relevant similarity relationships between substituents. Alternative search queries are introduced focusing on individual substituents within a global substituent context or individual sequences of substituents establishing a local context. 
For similarity searching, different structural or structure–property contexts can be established, providing opportunities for various applications.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01032-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Surfactant representation using COSMO screened charge density for adsorption isotherm prediction using Physics-Informed Neural Network (PINN)","authors":"Achmad Anggawirya Alimin, Kattariya Srasamran, Wanutchaya Yuenyong, Ampira Charoensaeng, Bor-Jier Shiau, Uthaiporn Suriyapraphadilok","doi":"10.1186/s13321-025-01027-y","DOIUrl":"10.1186/s13321-025-01027-y","url":null,"abstract":"<div><p>Predicting surfactant adsorption using the currently available isotherm model is limited to one or two independent variables: equilibrium concentration and temperature. This study aims to develop an adsorption model that includes molecular features, testing conditions, and solid properties in the model. A Physics-Informed Neural Network (PINN) was structured by integrating adsorption isotherm into artificial neural networks (ANN). The model was trained using a dataset containing 56 adsorption isotherms and 20 types of anionic and nonionic surfactants under various conditions with sand and silica oxide as their solids. The surfactants were quantified using sets of descriptors generated from molecular counting, charge distribution, and Conductor-like Screening Model (COSMO) screened charge density. The COSMO-screened charge density descriptors provide the highest accuracy in representing the surfactant molecule. The interpretation of molecular structure effect and surfactant-solid interaction described using COSMO-screened charge density showed that adsorption between the surfactant and solid media involves hydrogen bonding and hydrophobic interaction. The PINN model achieves 93% training accuracy and 85% validation accuracy under fivefold cross-validation. Later, the model was evaluated and used to generate an adsorption isotherm and predict unseen surfactant adsorption. Adsorption prediction with unseen surfactants showed high accuracy for surfactants with familiar structures (RMSE 0.07 mg/g) and a promising profile for entirely new structures (RMSE 2.95 mg/g). 
<b>Scientific contribution</b> This study advances the field by integrating COSMO-screened charge density descriptors into a physics-informed deep learning model to predict surfactant adsorption isotherms, accounting for molecular features, testing conditions, and solid properties. The incorporation of COSMO-screened charge density offers a novel approach to accurately represent surfactant molecules, enabling accurate prediction of their adsorption behavior. This approach extends conventional models, which are often limited to empirical parameters or fewer variables. This physics-informed framework significantly enhances the understanding of surfactant-solid interactions and offers a robust predictive tool for optimizing surfactant formulations, aiming to minimize adsorption losses in chemical enhanced oil recovery and environmental remediation.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01027-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144136984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cecile Valsecchi, Jose A. Arjona-Medina, Natalia Dyubankova, Ramil Nugmanov
{"title":"Benchmarking molecular conformer augmentation with context-enriched training: graph-based transformer versus GNN models","authors":"Cecile Valsecchi, Jose A. Arjona-Medina, Natalia Dyubankova, Ramil Nugmanov","doi":"10.1186/s13321-025-01004-5","DOIUrl":"10.1186/s13321-025-01004-5","url":null,"abstract":"<div><p>The field of molecular representation has witnessed a shift towards models trained on molecular structures represented by strings or graphs, with chemical information encoded in nodes and bonds. Graph-based representations offer a more realistic depiction and support 3D geometry and conformer-based augmentation. Graph Neural Networks (GNNs) and Graph-based Transformer models (GTs) represent two paradigms in this field, with GT models emerging as a flexible alternative. In this study, we compare the performance of GT models against GNN models on three datasets. We explore the impact of training procedures, including context-enriched training through pretraining on quantum mechanical atomic-level properties and auxiliary task training. Our analysis focuses on sterimol parameter estimation, binding energy estimation, and generalization performance for transition metal complexes. We find that GT models with context-enriched training provide results on par with GNN models, with the added advantages of speed and flexibility. 
Our findings highlight the potential of GT models as a valid alternative for molecular representation learning tasks.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01004-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144117787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qingliang Li, Sunghwan Kim, Leonid Zaslavsky, Tiejun Cheng, Bo Yu, Evan E. Bolton
{"title":"A resource description framework (RDF) model of named entity co-occurrences in biomedical literature and its integration with PubChemRDF","authors":"Qingliang Li, Sunghwan Kim, Leonid Zaslavsky, Tiejun Cheng, Bo Yu, Evan E. Bolton","doi":"10.1186/s13321-025-01017-0","DOIUrl":"10.1186/s13321-025-01017-0","url":null,"abstract":"<div><p>Named entities, such as chemicals/drugs, genes/proteins, and diseases, and their associations are not only important components of biomedical literature, but also the foundation of creating biomedical knowledgebases and knowledge graphs. This work addresses the challenges of expressing co-occurrence associations between named entities extracted from a biomedical literature corpus in a machine-readable format. We developed a Resource Description Framework (RDF) data model and integrated it into the PubChemRDF resource, which is freely accessible and publicly available. The developed co-occurrence data model was populated into a triplestore with named entities and their associations derived from text mining of millions of biomedical references found in PubMed. The utility of the data model was demonstrated through multiple use cases. 
Together with meta-data modeling of the references, including information about the author, journal, grant, and funding agency, this data model allows researchers to address pertinent biomedical questions through SPARQL queries and helps to exploit biomedical knowledge from various user perspectives and in various use cases.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01017-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}