O. Vavra, J. Tyzack, F. Haddadi, J. Stourac, J. Damborsky, S. Mazurenko, J. M. Thornton, D. Bednar
{"title":"Large-scale annotation of biochemically relevant pockets and tunnels in cognate enzyme–ligand complexes","authors":"O. Vavra, J. Tyzack, F. Haddadi, J. Stourac, J. Damborsky, S. Mazurenko, J. M. Thornton, D. Bednar","doi":"10.1186/s13321-024-00907-z","DOIUrl":"10.1186/s13321-024-00907-z","url":null,"abstract":"<div><p>Tunnels in enzymes with buried active sites are key structural features allowing the entry of substrates and the release of products, thus contributing to the catalytic efficiency. Targeting the bottlenecks of protein tunnels is also a powerful protein engineering strategy. However, the identification of functional tunnels in multiple protein structures is a non-trivial task that can only be addressed computationally. We present a pipeline integrating automated structural analysis with an <i>in-house</i> machine-learning predictor for the annotation of protein pockets, followed by the calculation of the energetics of ligand transport via biochemically relevant tunnels. A thorough validation using eight distinct molecular systems revealed that CaverDock analysis of ligand un/binding is on par with time-consuming molecular dynamics simulations, but much faster. The optimized and validated pipeline was applied to annotate more than 17,000 cognate enzyme–ligand complexes. Analysis of ligand un/binding energetics indicates that the top priority tunnel has the most favourable energies in 75% of cases. Moreover, energy profiles of cognate ligands revealed that a simple geometry analysis can correctly identify tunnel bottlenecks only in 50% of cases. Our study provides essential information for the interpretation of results from tunnel calculation and energy profiling in mechanistic enzymology and protein engineering. 
We formulated several simple rules allowing identification of biochemically relevant tunnels based on the binding pockets, tunnel geometry, and ligand transport energy profiles.</p><p><b>Scientific contributions</b></p><p>The pipeline introduced in this work allows for the detailed analysis of a large set of protein–ligand complexes, focusing on transport pathways. We are introducing a novel predictor for determining the relevance of binding pockets for tunnel calculation. For the first time in the field, we present a high-throughput energetic analysis of ligand binding and unbinding, showing that approximate methods for these simulations can identify additional mutagenesis hotspots in enzymes compared to purely geometrical methods. The predictor is included in the supplementary material and can also be accessed at https://github.com/Faranehhad/Large-Scale-Pocket-Tunnel-Annotation.git. The tunnel data calculated in this study has been made publicly available as part of the ChannelsDB 2.0 database, accessible at https://channelsdb2.biodata.ceitec.cz/.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00907-z","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prashant Srivastava, Alexandra Steuer, Francesco Ferri, Alessandro Nicoli, Kristian Schultz, Saptarshi Bej, Antonella Di Pizio, Olaf Wolkenhauer
{"title":"Bitter peptide prediction using graph neural networks","authors":"Prashant Srivastava, Alexandra Steuer, Francesco Ferri, Alessandro Nicoli, Kristian Schultz, Saptarshi Bej, Antonella Di Pizio, Olaf Wolkenhauer","doi":"10.1186/s13321-024-00909-x","DOIUrl":"10.1186/s13321-024-00909-x","url":null,"abstract":"<div><p>Bitter taste is an unpleasant taste modality that affects food consumption. Bitter peptides are generated during enzymatic processes that produce functional, bioactive protein hydrolysates or during the aging process of fermented products such as cheese, soybean protein, and wine. Understanding the underlying peptide sequences responsible for bitter taste can pave the way for more efficient identification of these peptides. This paper presents BitterPep-GCN, a feature-agnostic graph convolution network for bitter peptide prediction. The graph-based model learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. BitterPep-GCN was benchmarked using BTP640, a publicly available bitter peptide dataset. The latent peptide embeddings generated by the trained model were used to analyze the activity of sequence motifs responsible for the bitter taste of the peptides. Particularly, we calculated the activity for individual amino acids and dipeptide, tripeptide, and tetrapeptide sequence motifs present in the peptides. Our analyses pinpoint specific amino acids, such as F, G, P, and R, as well as sequence motifs, notably tripeptide and tetrapeptide motifs containing FF, as key bitter signatures in peptides. This work not only provides a new predictor of bitter taste for a more efficient identification of bitter peptides in various food products but also gives a hint into the molecular basis of bitterness.</p><p><b>Scientific Contribution</b></p><p>Our work provides the first application of Graph Neural Networks for the prediction of peptide bitter taste. 
The best-developed model, BitterPep-GCN, learns the embedding of amino acids in the bitter peptide sequences and uses mixed pooling for bitter classification. The embeddings were used to analyze the sequence motifs responsible for the bitter taste.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00909-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142384320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sejal Sharma, Liping Feng, Nicha Boonpattrawong, Arvinder Kapur, Lisa Barroilhet, Manish S. Patankar, Spencer S. Ericksen
{"title":"Data mining of PubChem bioassay records reveals diverse OXPHOS inhibitory chemotypes as potential therapeutic agents against ovarian cancer","authors":"Sejal Sharma, Liping Feng, Nicha Boonpattrawong, Arvinder Kapur, Lisa Barroilhet, Manish S. Patankar, Spencer S. Ericksen","doi":"10.1186/s13321-024-00906-0","DOIUrl":"10.1186/s13321-024-00906-0","url":null,"abstract":"<div><p>Focused screening on target-prioritized compound sets can be an efficient alternative to high throughput screening (HTS). For most biomolecular targets, compound prioritization models depend on prior screening data or a target structure. For phenotypic or multi-protein pathway targets, it may not be clear which public assay records provide relevant data. The question also arises as to whether data collected from disparate assays might be usefully consolidated. Here, we report on the development and application of a data mining pipeline to examine these issues. To illustrate, we focus on identifying inhibitors of oxidative phosphorylation, a druggable metabolic process in epithelial ovarian tumors. The pipeline compiled 8415 available OXPHOS-related bioassays in the PubChem data repository involving 312,093 unique compound records. Application of PubChem assay activity annotations, PAINS (Pan Assay Interference Compounds), and Lipinski-like bioavailability filters yields 1852 putative OXPHOS-active compounds that fall into 464 clusters. These chemotypes are diverse, with relatively high hydrophobicity and molecular weight but lower complexity and drug-likeness. They show a high abundance of bicyclic ring systems and oxygen-containing functional groups including ketones, allylic oxides (alpha/beta unsaturated carbonyls), hydroxyl groups, and ethers. In contrast, amide and primary amine functional groups have a notably lower than random prevalence. 
UMAP representation of the chemical space shows strong divergence in the regions occupied by OXPHOS-inactive and -active compounds. Of the six compounds selected for biological testing, four showed statistically significant inhibition of electron transport in bioenergetics assays. Two of these four compounds, lacidipine and esbiothrin, increased intracellular oxygen radicals (a major hallmark of most OXPHOS inhibitors) and decreased the viability of two ovarian cancer cell lines, ID8 and OVCAR5. Finally, data from the pipeline were used to train random forest and support vector classifiers that effectively prioritized OXPHOS inhibitory compounds within a held-out test set (ROCAUC 0.962 and 0.927, respectively) and on another set containing 44 documented OXPHOS inhibitors outside of the training set (ROCAUC 0.900 and 0.823). This prototype pipeline is extensible and could be adapted for focused screening on other phenotypic targets for which sufficient public data are available.</p><p><b>Scientific contribution</b></p><p>Here, we describe and apply an assay data mining pipeline to compile, process, filter, and mine public bioassay data. We believe the procedure may be more broadly applied to guide compound selection in early-stage hit finding on novel multi-protein mechanistic or phenotypic targets. 
To demonstrate the utility of our approach, we apply a data mining strategy on a large set of public assay data to find drug-like molecules that inhibit oxidative phosphorylation (OXPHOS) a","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00906-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142384319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuting Liu, Akiyasu C. Yoshizawa, Yiwei Ling, Shujiro Okuda
{"title":"Insights into predicting small molecule retention times in liquid chromatography using deep learning","authors":"Yuting Liu, Akiyasu C. Yoshizawa, Yiwei Ling, Shujiro Okuda","doi":"10.1186/s13321-024-00905-1","DOIUrl":"10.1186/s13321-024-00905-1","url":null,"abstract":"<p>In untargeted metabolomics, structures of small molecules are annotated using liquid chromatography-mass spectrometry by leveraging information from the molecular retention time (RT) in the chromatogram and <i>m/z</i> (formerly called ''mass-to-charge ratio'') in the mass spectrum. However, correct identification of metabolites is challenging due to the vast array of small molecules. Therefore, various in silico tools for mass spectrometry peak alignment and compound prediction have been developed; however, the list of candidate compounds remains extensive. Accurate RT prediction is important to exclude false candidates and facilitate metabolite annotation. Recent advancements in artificial intelligence (AI) have led to significant breakthroughs in the use of deep learning models in various fields. Release of a large RT dataset has mitigated the bottlenecks limiting the application of deep learning models, thereby improving their application in RT prediction tasks. This review lists the databases that can be used to expand training datasets and addresses the issue of molecular representation inconsistencies in datasets. It also discusses the application of AI technology for RT prediction, particularly in the 5 years following the release of the METLIN small molecule RT dataset. 
This review provides a comprehensive overview of the AI applications used for RT prediction, highlighting the progress and remaining challenges.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00905-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142384274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis H. M. Torres, Joel P. Arrais, Bernardete Ribeiro
{"title":"Combining graph neural networks and transformers for few-shot nuclear receptor binding activity prediction","authors":"Luis H. M. Torres, Joel P. Arrais, Bernardete Ribeiro","doi":"10.1186/s13321-024-00902-4","DOIUrl":"10.1186/s13321-024-00902-4","url":null,"abstract":"<div><p>Nuclear receptors (NRs) play a crucial role as biological targets in drug discovery. However, determining which compounds can act as endocrine disruptors and modulate the function of NRs with a reduced amount of candidate drugs is a challenging task. Moreover, the computational methods for NR-binding activity prediction mostly focus on a single receptor at a time, which may limit their effectiveness. Hence, the transfer of learned knowledge among multiple NRs can improve the performance of molecular predictors and lead to the development of more effective drugs. In this research, we integrate graph neural networks (GNNs) and Transformers to introduce a few-shot GNN-Transformer, Meta-GTNRP, to predict the binding activity of compounds using the combined information of different NRs and identify potential NR-modulators with limited data. The Meta-GTNRP model captures the local information in graph-structured data and preserves the global-semantic structure of molecular graph embeddings for NR-binding activity prediction. Furthermore, a few-shot meta-learning approach is proposed to optimize model parameters for different NR-binding tasks and leverage the complementarity among multiple NR-specific tasks to predict binding activity of compounds for each NR with just a few labeled molecules. Experiments with a compound database containing annotations on the binding activity for 11 NRs show that Meta-GTNRP outperforms other graph-based approaches. 
The data and code are available at: https://github.com/ltorres97/Meta-GTNRP.</p><p><b>Scientific contribution</b></p><p>The proposed few-shot GNN-Transformer model, Meta-GTNRP captures the local structure of molecular graphs and preserves the global-semantic information of graph embeddings to predict the NR-binding activity of compounds with limited available data; A few-shot meta-learning framework adapts model parameters across NR-specific tasks for different NRs in a joint learning procedure to predict the binding activity of compounds for each NR with just a few labeled molecules in highly imbalanced data scenarios; Meta-GTNRP is a data-efficient approach that combines the strengths of GNNs and Transformers to predict the NR-binding properties of compounds through an optimized meta-learning procedure and deliver robust results valuable to identify potential NR-based drug candidates.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00902-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Samar Monem, Aboul Ella Hassanien, Alaa H. Abdel-Hamid
{"title":"A multi-view feature representation for predicting drugs combination synergy based on ensemble and multi-task attention models","authors":"Samar Monem, Aboul Ella Hassanien, Alaa H. Abdel-Hamid","doi":"10.1186/s13321-024-00903-3","DOIUrl":"10.1186/s13321-024-00903-3","url":null,"abstract":"<div><p>This paper proposes a novel multi-view ensemble predictor model that is designed to address the challenge of determining synergistic drug combinations by predicting both the synergy score values and synergy class labels of drug combinations with cancer cell lines. The proposed methodology involves representing drug features through four distinct views: Simplified Molecular-Input Line-Entry System (SMILES) features, molecular graph features, fingerprint features, and drug-target features. On the other hand, cell line features are captured through four views: gene expression features, copy number features, mutation features, and proteomics features. To prevent overfitting of the model, two techniques are employed. First, each view feature of a drug is paired with each corresponding cell line view and input into a multi-task attention deep learning model. This multi-task model is trained to simultaneously predict both the synergy score value and synergy class label. This process results in sixteen input view features being fed into the multi-task model, producing sixteen prediction values. Subsequently, these prediction values are utilized as inputs for an ensemble model, which outputs the final prediction value. The ‘MVME’ model is assessed using the O’Neil dataset, which includes 38 distinct drugs combined across 39 distinct cancer cell lines to output 22,737 drug combination pairs. For the synergy score value, the proposed model scores a mean square error (MSE) of 206.57, a root mean square error (RMSE) of 14.30, and a Pearson score of 0.76. 
For the synergy class label, the model scores 0.90 for accuracy, 0.96 for precision, 0.57 for kappa, 0.96 for the area under the ROC curve (ROC-AUC), and 0.88 for the area under the precision-recall curve (PR-AUC).</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00903-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sven Marcel Stefan, Katja Stefan, Vigneshwaran Namasivayam
{"title":"Computer-aided pattern scoring (C@PS): a novel cheminformatic workflow to predict ligands with rare modes-of-action","authors":"Sven Marcel Stefan, Katja Stefan, Vigneshwaran Namasivayam","doi":"10.1186/s13321-024-00901-5","DOIUrl":"10.1186/s13321-024-00901-5","url":null,"abstract":"<div><p>The identification, establishment, and exploration of potential pharmacological drug targets are major steps of the drug development pipeline. Target validation requires diverse chemical tools that come with a spectrum of functionality, <i>e.g.</i>, inhibitors, activators, and other modulators. In particular, tools with rare modes-of-action allow for a proper kinetic and functional characterization of the targets-of-interest (<i>e.g.</i>, channels, enzymes, receptors, or transporters). Moreover, functional innovation is a prime criterion for patentability and commercial exploitation, which may lead to therapeutic benefit. Unfortunately, data on new, and thus, undruggable or barely druggable targets are scarce and mostly available for mainstream modes-of-action only (<i>e.g.</i>, inhibition). Here we present a novel cheminformatic workflow, computer-aided pattern scoring (C@PS), which was specifically designed to project its prediction capabilities into an uncharted domain of applicability.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00901-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142276918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EC-Conf: A ultra-fast diffusion model for molecular conformation generation with equivariant consistency","authors":"Zhiguang Fan, Yuedong Yang, Mingyuan Xu, Hongming Chen","doi":"10.1186/s13321-024-00893-2","DOIUrl":"10.1186/s13321-024-00893-2","url":null,"abstract":"<p>Despite recent advances in 3D molecular conformation generation driven by diffusion models, the high computational cost of the iterative diffusion/denoising process limits their application. Here, an equivariant consistency model (EC-Conf) is proposed as a fast diffusion method for low-energy conformation generation. In EC-Conf, a modified SE(3)-equivariant transformer model is used directly to encode the Cartesian molecular conformations, and a highly efficient consistency diffusion process is carried out to generate molecular conformations. With only one sampling step, the model already achieves quality comparable to that of other diffusion-based models running thousands of denoising steps. Its performance can be further improved with a few more sampling iterations. The performance of EC-Conf is evaluated on both the GEOM-QM9 and GEOM-Drugs sets. 
Our results demonstrate that the efficiency of EC-Conf for learning the distribution of low-energy molecular conformations is at least two orders of magnitude higher than that of current SOTA diffusion models, and that EC-Conf could potentially become a useful tool for conformation generation and sampling.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00893-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142124484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Barbara R. Terlouw, Friederike Biermann, Sophie P. J. M. Vromans, Elham Zamani, Eric J. N. Helfrich, Marnix H. Medema
{"title":"RAIChU: automating the visualisation of natural product biosynthesis","authors":"Barbara R. Terlouw, Friederike Biermann, Sophie P. J. M. Vromans, Elham Zamani, Eric J. N. Helfrich, Marnix H. Medema","doi":"10.1186/s13321-024-00898-x","DOIUrl":"10.1186/s13321-024-00898-x","url":null,"abstract":"<div><p>Natural products are molecules that fulfil a range of important ecological functions. Many natural products have been exploited for pharmaceutical and agricultural applications. In contrast to many other specialised metabolites, the products of modular nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) systems can often (partially) be predicted from the DNA sequence of the biosynthetic gene clusters. This is because the biosynthetic pathways of NRPS and PKS systems adhere to consistent rulesets. These universal biosynthetic rules can be leveraged to generate biosynthetic models of biosynthetic pathways. While these principles have been largely deciphered, software that leverages these rules to automatically generate visualisations of biosynthetic models has not yet been developed. To enable high-quality automated visualisations of natural product biosynthetic pathways, we developed RAIChU (Reaction Analysis through Illustrating Chemical Units), which produces depictions of biosynthetic transformations of PKS, NRPS, and hybrid PKS/NRPS systems from predicted or experimentally verified module architectures and domain substrate specificities. RAIChU also boasts a library of functions to perform and visualise reactions and pathways whose specifics (e.g., regioselectivity, stereoselectivity) are still difficult to predict, including terpenes, ribosomally synthesised and posttranslationally modified peptides and alkaloids. Additionally, RAIChU includes 34 prevalent tailoring reactions to enable the visualisation of biosynthetic pathways of fully maturated natural products. 
RAIChU can be integrated into Python pipelines, allowing users to upload and edit results from antiSMASH, a widely used BGC detection and annotation tool, or to build biosynthetic PKS/NRPS systems from scratch. RAIChU’s cluster drawing correctness (100%) and drawing readability (97.66%) were validated on 5000 randomly generated PKS/NRPS systems, and on the MIBiG database. The automated visualisation of these pathways accelerates the generation of biosynthetic models, facilitates the analysis of large (meta-) genomic datasets and reduces human error. RAIChU is available at https://github.com/BTheDragonMaster/RAIChU and https://pypi.org/project/raichu.</p><p><b>Scientific contribution</b></p><p>RAIChU is the first software package capable of automating high-quality visualisations of natural product biosynthetic pathways. By leveraging universal biosynthetic rules, RAIChU enables the depiction of complex biosynthetic transformations for PKS, NRPS, ribosomally synthesised and posttranslationally modified peptide (RiPP), terpene and alkaloid systems, enhancing predictive and analytical capabilities. 
This innovation not only streamlines the creation of biosynthetic models, making the analysis of large genomic datasets more efficient and accurate, but also bridges a crucial gap in predicting and visualising the complexities of natural product biosynthesis.</p></div","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00898-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142123087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández
{"title":"Evaluating the generalizability of graph neural networks for predicting collision cross section","authors":"Chloe Engler Hart, António José Preto, Shaurya Chanana, David Healey, Tobias Kind, Daniel Domingo-Fernández","doi":"10.1186/s13321-024-00899-w","DOIUrl":"10.1186/s13321-024-00899-w","url":null,"abstract":"<div><p>Ion Mobility coupled with Mass Spectrometry (IM-MS) is a promising analytical technique that enhances molecular characterization by measuring collision cross-section (CCS) values, which are indicative of the molecular size and shape. However, the effective application of CCS values in structural analysis is still constrained by the limited availability of experimental data, necessitating the development of accurate machine learning (ML) models for in silico predictions. In this study, we evaluated state-of-the-art Graph Neural Networks (GNNs), trained to predict CCS values using the largest publicly available dataset to date. Although our results confirm the high accuracy of these models within chemical spaces similar to their training environments, their performance significantly declines when applied to structurally novel regions. This discrepancy raises concerns about the reliability of in silico CCS predictions and underscores the need for releasing further publicly available CCS datasets. To mitigate this, we introduce Mol2CCS, which demonstrates how generalization can be partially improved by extending models to account for additional features such as molecular fingerprints, descriptors, and the molecule types. Lastly, we also show how confidence models can help by enhancing the reliability of the CCS estimates.</p><p><b>Scientific contribution</b></p><p>We have benchmarked state-of-the-art graph neural networks for predicting collision cross section. Our work highlights the accuracy of these models when trained and evaluated in similar chemical spaces, but also how their accuracy drops when evaluated in structurally novel regions. 
Lastly, we conclude by presenting potential approaches to mitigate this issue.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00899-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142089943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}