Journal of Cheminformatics: latest articles

Predicting enzyme-compound associations for enzyme-catalysed reactions.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-22 DOI: 10.1186/s13321-026-01190-w
Liam Brydon-Brown, Gillian Dobbie, Katerina Taškova, Jörg Simon Wicker
Abstract: Enzyme-catalysed reactions are common in many areas, including pharmaceutical metabolism and agricultural chemical biodegradation. Analysing and predicting how these reactions occur is increasingly important for identifying toxic by-products and achieving regulatory approval. Incorporating enzyme information into these predictions has been shown to improve prediction capabilities. However, existing methods require knowledge of the enzyme to perform prediction, and in many situations, especially biodegradation, the complexities of the reaction environment mean the exact enzymes are not known. In this paper, we alleviate this issue by proposing a framework to train and evaluate a hierarchical multi-label classifier to predict the association between enzyme commission numbers and chemical compounds. Our method achieves a hierarchical F1-score of up to 93.2%, outperforming existing methodologies. Additionally, we examine how including true and predicted enzyme information impacts product prediction performance compared to not using enzyme information. In our case study utilising biodegradation reaction data, we find that including enzyme commission numbers improves product prediction performance by approximately two percentage points.
Scientific contribution: We contribute a novel method for predicting enzyme-compound associations using a hierarchical multi-label classifier framework. Our method is self-tuning, finding the best hyperparameters for a given dataset, and achieves higher F1 scores than existing methods. We also contribute an investigation into incorporating enzyme information into product prediction algorithms, showing that including this information can improve product prediction performance.
Citations: 0
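The enzyme-compound entry above reports a hierarchical F1-score over enzyme commission (EC) numbers. A minimal sketch of one common way to compute such a score follows: each label set is expanded with all of its ancestor classes, and precision and recall are then micro-averaged over the expanded sets. Whether the paper uses exactly this definition is an assumption, and the function names are illustrative only.

```python
def ec_ancestors(ec):
    """Return an EC number together with all of its ancestor classes.

    "1.1.1.1" -> {"1", "1.1", "1.1.1", "1.1.1.1"}
    """
    parts = ec.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def expand(labels):
    """Expand a set of EC numbers with every ancestor in the hierarchy."""
    expanded = set()
    for ec in labels:
        expanded |= ec_ancestors(ec)
    return expanded


def hierarchical_f1(true_sets, pred_sets):
    """Micro-averaged hierarchical F1 over ancestor-expanded label sets.

    Assumed definition (not confirmed by the abstract):
    hP = sum|T ∩ P| / sum|P|, hR = sum|T ∩ P| / sum|T|.
    """
    overlap = n_pred = n_true = 0
    for true_labels, pred_labels in zip(true_sets, pred_sets):
        t, p = expand(true_labels), expand(pred_labels)
        overlap += len(t & p)
        n_pred += len(p)
        n_true += len(t)
    if overlap == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_true
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction that misses only the last EC digit still earns partial credit: predicting 1.1.1.2 for a compound annotated with 1.1.1.1 shares three of four hierarchy levels and scores 0.75.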
ProQSAR: A modular and reproducible framework for small-data QSAR modeling with fit-and-use models
IF 5.7 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-22 Epub Date: 2026-04-28 DOI: 10.1186/s13321-026-01175-9
Tuyet-Minh Phan, Tieu-Long Phan, Phuoc-Chung Van-Nguyen, Lai Hoang Son Le, Van-Thinh To, Tuyen Ngoc Truong, Daniel Merkle, Peter F. Stadler
Abstract:
Background: Quantitative structure-activity relationship (QSAR) models are central to computer-aided drug discovery and predictive toxicology, but practical adoption is often impeded by ad-hoc tooling, inconsistent validation protocols, and poor reproducibility.
Objective: We introduce ProQSAR, a modular, reproducible workbench that formalizes end-to-end QSAR development while permitting independent use of each component.
Methods: ProQSAR composes interchangeable modules for standardization, feature generation, splitting (including scaffold- and cluster-aware splits), preprocessing, outlier handling, scaling, feature selection, model training and tuning, statistical comparison, conformal calibration, and applicability-domain assessment. The pipeline can run end-to-end to produce versioned artifact bundles (serialized models) and analyst-oriented reports suitable for deployment and audit.
Results: On representative MoleculeNet benchmarks evaluated under Bemis–Murcko scaffold split, ProQSAR attains state-of-the-art descriptor-based performance: the lowest mean RMSE across the regression suite (ESOL, FreeSolv, Lipophilicity; mean RMSE 0.658 ± 0.11), including a substantial improvement on FreeSolv (RMSE 0.494 vs. 0.731 for a leading graph method). On quantum mechanical benchmarks, ProQSAR demonstrated superior performance on the single-task dataset QM7 and maintained competitive results on the multi-task QM8 dataset. For classification, ProQSAR achieves the top ROC–AUC on ClinTox (91.4%) while remaining competitive across other benchmarks (overall classification average 70.4 ± 11.6). Crucially, all predictions are accompanied by cross-conformal prediction and explicit applicability-domain flags that identify out-of-distribution entries, enabling calibrated decision support.
Availability: ProQSAR is released on PyPI, Conda, and Docker Hub; all releases embed full provenance (parameters, package versions, checksums) to ensure reproducibility.
Scientific contribution: ProQSAR (i) enforces best-practice, group-aware validation together with formal statistical comparisons across models, (ii) integrates calibrated uncertainty quantification (cross-conformal prediction) and applicability-domain diagnostics for interpretable, risk-aware predictions, and (iii) exposes both a composable developer API and a one-click pipeline that generates deployment-ready artifacts and human-readable reports, demonstrated on representative benchmarks.
PDF: https://link.springer.com/content/pdf/10.1186/s13321-026-01175-9.pdf
Citations: 0
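The ProQSAR entry above mentions conformal calibration of predictions. The sketch below shows the simpler split-conformal variant for regression, not the cross-conformal procedure the abstract names and not ProQSAR's API: absolute residuals on a held-out calibration set give a quantile that is added to and subtracted from each new prediction. All function names here are hypothetical.

```python
import math


def conformal_quantile(calibration_residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of absolute residuals.

    With n calibration points, the k-th smallest residual is used, where
    k = ceil((n + 1) * (1 - alpha)). If k exceeds n, no finite interval
    can guarantee the requested coverage, so infinity is returned.
    """
    n = len(calibration_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_residuals)[k - 1]


def predict_interval(y_hat, q):
    """Symmetric prediction interval around a point prediction."""
    return (y_hat - q, y_hat + q)


# Absolute errors |y - y_hat| of some fitted model on a calibration set
# (illustrative numbers, not taken from the paper).
residuals = [0.3, 0.1, 0.5, 0.2, 0.4, 0.9, 0.7, 0.6, 0.8]
q = conformal_quantile(residuals, alpha=0.5)
lower, upper = predict_interval(2.0, q)
```

Cross-conformal prediction repeats this calibration over cross-validation folds so that no data has to be set aside permanently, which matters in the small-data setting the paper targets.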
Generalizable deep-learning-based mRNA-protein interaction prediction strongly depends on protein diversity.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-21 DOI: 10.1186/s13321-026-01197-3
Yu-Huai Yu, Han-Ting Hong, Tzu-Hsien Yang
Abstract:
Background: Proteins regulate diverse biological processes through interactions with other molecules, including RNAs. RNA-binding proteins (RBPs) are essential regulators of gene expression, forming specific mRNA-protein interactions (mRPIs) that influence mRNA processing, translation, and stability. Recently, deep-learning models have been proposed to predict mRPIs using only sequence information, with some reporting near-perfect accuracy. However, such performance appears inconsistent with the biological complexity of RNA recognition by proteins, which is often influenced by protein tertiary structures that are computationally challenging to predict. In related fields such as protein-protein interaction prediction, data leakage, particularly caused by overlapping proteins between training and test sets, has been shown to substantially inflate performance metrics. Nevertheless, whether similar issues affect mRPI prediction has not yet been systematically investigated.
Results: We constructed an mRPI benchmark dataset from CLIP experiments and implemented two data partitioning schemes: a random interaction-level split and an RBP-aware split in which pairs of all test RBPs were excluded from training. Three RBP sequence encoding strategies were evaluated within an attention-based deep-learning framework under both partitioning settings: sequence-based one-hot encoding, language model-derived encoding, and structure-aware encoding. Across all models, performance remained high only when test RBPs were also present in the training data. When predicting interactions for unseen RBPs, performance dropped substantially, indicating limited generalization. Even replacing RBPs with their most similar counterparts from the training set did not meaningfully improve generalization. These results suggest that additional protein features beyond sequence information are required to achieve robust mRPI prediction. Overall, our study demonstrated that existing mRPI prediction models are largely overfitted to their original training RBPs and fail to generalize to unseen proteins.
Conclusions: Overall, we provided a curated benchmark dataset, a rigorous evaluation framework, and an attention-based model that achieves the best generalization performance among currently available methods, with an approximately 8.5% auROC improvement over existing tools. These resources will facilitate the development of more reliable and broadly applicable mRPI prediction tools.
Scientific contribution: This work presented the first systematic investigation of data leakage and generalization in mRNA-protein interaction prediction, demonstrating that most reported near-perfect performance is largely driven by RBP overlap between training and test sets. By introducing an RBP-aware evaluation framework and a benchmark dataset, we revealed that most sequence-based models fail to generalize to unseen RBPs, even when enhanced with protein language model-derived and structure-aware encodings. Our study …
Citations: 0
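The mRNA-protein interaction entry above contrasts a random interaction-level split with an RBP-aware split. The difference can be sketched as follows, assuming each interaction is a hypothetical `(mrna_id, rbp_id)` tuple; this illustrates the general group-aware idea and is not the authors' code.

```python
import random


def random_split(pairs, test_fraction, seed=0):
    """Interaction-level split: the same RBP may land in train and test."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def rbp_aware_split(pairs, test_fraction, seed=0):
    """Group-aware split: every pair of a test RBP is kept out of training."""
    rng = random.Random(seed)
    rbps = sorted({rbp for _, rbp in pairs})
    rng.shuffle(rbps)
    n_test = max(1, round(len(rbps) * test_fraction))
    test_rbps = set(rbps[:n_test])
    train = [p for p in pairs if p[1] not in test_rbps]
    test = [p for p in pairs if p[1] in test_rbps]
    return train, test


def shared_rbps(train, test):
    """RBPs that occur on both sides of a split (the leakage source)."""
    return {rbp for _, rbp in train} & {rbp for _, rbp in test}


# Toy data: 20 interactions spread evenly over 4 RBPs.
pairs = [(f"mRNA{i}", f"RBP{i % 4}") for i in range(20)]
train, test = rbp_aware_split(pairs, test_fraction=0.25)
```

With the RBP-aware split, `shared_rbps(train, test)` is always empty, so test performance measures generalization to unseen proteins rather than recall of proteins already seen in training.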
Predictive machine learning models for rational permeability design in de novo macrocycle engineering: a review.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-21 DOI: 10.1186/s13321-026-01189-3
M Taleb Albrijawi, Reda Alhajj
Abstract: Small molecule drug discovery has been highly successful across many therapeutic areas over decades of progress; however, many disease-relevant proteins remain difficult to target. In particular, intracellular proteins with large, shallow, or flexible interaction surfaces are poorly addressed by classical drug-like compounds. For these reasons, drug discovery efforts have shifted toward alternative molecular classes. This has led to growing interest in macrocyclic compounds in recent years, which have emerged as an important class of therapeutic molecules, particularly for targets that are out of reach for conventional small molecules. These compounds operate within the chemical space beyond Lipinski's Rule of Five (bRo5) and offer new opportunities for modulating difficult intracellular targets. At the same time, their size, flexibility, and structural complexity introduce significant challenges, among which the accurate prediction of membrane permeability remains one of the persistent limitations in their rational design, particularly for orally bioavailable candidates.
Citations: 0
ChemScreener: an active learning enabled hit discovery workflow with WDR5 inhibitor case study.
IF 5.7 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-20 DOI: 10.1186/s13321-026-01204-7
Lingling Shen, Jian Fang, Lulu Liu, Rena Wang, Jeremy L Jenkins, He Wang
Abstract: Active deep learning offers a promising approach for hit discovery starting from limited data by iteratively updating and improving models during screening by applying new data and adapting decisions. Key open questions include how best to explore chemical space, how it compares to non-iterative methods, and how to use it under data scarcity. We present ChemScreener, a multi-task active learning workflow for early drug discovery across large, diverse libraries or chemical spaces. Its Balanced-Ranking acquisition strategy leverages ensemble uncertainty to explore novel chemistry while maintaining hit rate enrichment by prioritizing predicted activity. In five iterative single-dose HTRF screens on WDR5 protein, ChemScreener increased hit rates from 0.49% (primary HTS screen) to 3-10% (average 5.91%; 104 hits from 1760 compounds). Hits were consolidated, retested together with close analogs in a 269-compound set, and clustered; 44 hit compounds from 81 clusters of the 269-compound set advanced to dose-response testing and were filtered by counter HTRF assays. Over 50% of those with IC50 < 45 μM were validated as WDR5 binders by DSF. We identified de novo three scaffold series and three singleton scaffolds as hits. Overall, we demonstrated that ChemScreener can accelerate early hit discovery and yield more diverse chemotypes.
Scientific contribution: Hit identification is a costly, time-intensive stage in drug discovery. We developed ChemScreener, a scalable active learning workflow for early hit discovery that improves hit rate enrichment through iterative screening of a small number of compounds and expands chemical diversity through the de novo hit scaffolds identified. ChemScreener offers a generalizable, target-specific, ligand-based virtual screening framework that accelerates early discovery and enhances effectiveness across large, diverse chemical libraries.
Citations: 0
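The hit-rate figures quoted in the ChemScreener entry above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract (104 hits from 1760 compounds, 0.49% primary HTS hit rate); the fold-enrichment value is derived here and is not quoted by the authors.

```python
def hit_rate(hits, screened):
    """Fraction of screened compounds that were confirmed as hits."""
    return hits / screened


# Numbers taken from the abstract.
active_learning_rate = hit_rate(104, 1760)   # five iterative screens combined
primary_hts_rate = 0.0049                    # 0.49% primary HTS hit rate

# Derived quantity: how many times higher the iterative hit rate is.
fold_enrichment = active_learning_rate / primary_hts_rate

print(f"average hit rate: {active_learning_rate:.2%}")   # 5.91%
print(f"fold enrichment:  {fold_enrichment:.1f}x")       # 12.1x
```

The average of 5.91% is a pooled rate over all 1760 compounds, so it sits inside the per-iteration range of 3-10% without having to equal the mean of the five per-iteration rates.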
CheMLT-F: multitask learning in biochemistry through transformer fusion.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-16 DOI: 10.1186/s13321-026-01199-1
Vladislav Mun, Siamac Fazli
Abstract: Drug discovery remains a slow and costly process, in part because efficacy, toxicity, and physicochemical liabilities must be screened across a vast chemical space. Stand-alone, single-task predictors can help, but they lead to fragmented workflows and make it hard to reuse learned representations, data processing, and infrastructure across endpoints (i.e., prediction tasks). Here we present CheMLT-F, a compact multitask transformer that fuses encoders for molecular and protein sequences to learn a unified representation spanning 680+ endpoints, including diverse toxicities, physicochemical properties, and drug-target interactions. Across 13 public benchmarks, CheMLT-F matches state-of-the-art toxicity classifiers and sets new performance marks for physicochemical property prediction, while remaining competitive for drug-target affinity (KIBA and Davis). Moreover, CheMLT-F demonstrates competitive performance on an external protein-family benchmark spanning seven target superfamilies, indicating broad generalizability in bioactivity prediction. Multitask parameter sharing keeps the model lightweight and inference-efficient, and its modular heads make extensions to new endpoints straightforward. By replacing many individual models with a single, extensible backbone, CheMLT-F streamlines in silico screening and lowers the barrier to broad, data-driven decision-making in early drug discovery.
Scientific contribution: We introduce a unified transformer architecture that jointly models molecular and protein sequences across hundreds of pharmacologically relevant endpoints spanning toxicity, physicochemical properties, and drug-target interactions. A tailored training strategy that combines partial encoder freezing, global-local loss balancing, and weighted task sampling reduces trainable parameters and deployment complexity while preserving strong cross-domain generalization. Comprehensive evaluation across 13 public datasets, including scaffold-aware and random data splits, demonstrates competitive accuracy with substantially lower operational overhead than maintaining numerous single-task models, establishing a scalable foundation for extensible and holistic predictive modeling in computational drug discovery.
Citations: 0
CReM-pharm: de novo 3D pharmacophore-based design with synthetic accessibility awareness.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-15 DOI: 10.1186/s13321-026-01195-5
Alina Denzler, Dinesh Kumar Sriramulu, Jozef Pecha, Pavel Polishchuk
Abstract: De novo design methodologies have the potential to significantly enhance the exploration of chemical space in the search for promising ligands featuring novel chemotypes. This exploration can be directed through various computational strategies. 3D pharmacophore models, which represent the interaction patterns critical for protein-ligand recognition, can serve as valuable tools for the design of novel compounds. A common limitation of many generative approaches is the low synthetic feasibility of the generated molecular structures. In the present study, we developed a method capable of controllably generating compounds with a relatively high degree of synthetic accessibility by leveraging the CReM framework, while explicitly conforming to a specified 3D pharmacophore model. Evaluation of this approach across a diverse set of protein targets and pharmacophore models of varying complexity demonstrated its effectiveness and highlighted its advantages over the PGMG method, which employs a deep neural network architecture to generate ligands that may exhibit desired 3D geometries upon embedding. The proposed method has been implemented as an open-source tool, CReM-pharm, available at https://github.com/ci-lab-cz/crem-pharm.
Citations: 0
ANNalog: generation of MedChem-similar molecules
IF 5.7 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-15 Epub Date: 2026-04-28 DOI: 10.1186/s13321-026-01186-6
Wei Dai, Jonathan D. Tyzack, Arianna Fornili, Chris de Graaf, Noel M. O’Boyle
Abstract: Generative deep learning models have demonstrated significant potential in designing drug-like molecules. However, medicinal chemistry typically requires generating analogues that combine structural similarity with scaffold hopping, which is the replacement of molecular scaffolds while retaining biological relevance. To address this, we introduce ANNalog, a transformer-based sequence-to-sequence generative model trained on pairs of molecules extracted from the same bioactivity assay in a paper as recorded in ChEMBL33. The dataset was constructed based on the idea that molecules tested within the same assay can be considered analogues in medicinal chemistry space. Paired molecules were encoded as Simplified Molecular Input Line Entry System (SMILES) strings, and Levenshtein distance-guided alignment was applied to maximise intrapair string similarity; this preprocessing step was found to markedly enhance model performance. ANNalog can produce structurally similar analogues involving minor modifications, such as substituent replacements, and can also perform scaffold hopping, generating structurally distinct yet chemically relevant analogues. Scaffold-hopping capability was validated using manually curated molecule pairs and further confirmed through a case study involving orexin-2 receptor antagonists from patent literature. When the generation process was constrained using ANNalog’s prefix control feature, approximately 25% of the known scaffolds from the patent set were successfully recovered by the model, illustrating enhanced performance under user-guided conditions.
Scientific contribution: This study introduces ANNalog, a generative model trained using pairs of molecules synthesised and tested together within the same medicinal chemistry project. Unlike previous models trained on pairs of molecules selected according to similarity measures, ANNalog successfully generates not only structurally similar molecules but also diverse scaffold-hopping transformations that have precedent in the medicinal chemistry literature.
PDF: https://link.springer.com/content/pdf/10.1186/s13321-026-01186-6.pdf
Citations: 0
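The ANNalog entry above describes Levenshtein distance-guided alignment of paired SMILES strings. One plausible reading, sketched below, is that among several equivalent SMILES spellings of a molecule the one with the smallest edit distance to its partner is kept. This is an assumption about the preprocessing, not the authors' implementation; the candidate spellings are supplied by the caller (in practice they would come from a cheminformatics toolkit), and `closest_variant` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))           # distances from "" to b[:j]
    for i, char_a in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def closest_variant(reference, variants):
    """Pick the spelling with the smallest edit distance to the reference."""
    return min(variants, key=lambda s: levenshtein(reference, s))


# "CCO" and "OCC" are two SMILES spellings of ethanol; the partner
# molecule is ethylamine, written "CCN".
best = closest_variant("CCN", ["OCC", "CCO"])
```

Here `best` is `"CCO"`, which differs from `"CCN"` by a single substitution, whereas `"OCC"` needs two edits. Pairs aligned this way expose the actual chemical change as a small, local string edit.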
Torsion angular bin strings: algorithmic update and additional validation.
IF 5.7 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-13 DOI: 10.1186/s13321-026-01194-6
Jessica Braun, Djahan Lamei, Philippe H Hünenberger, Gregory A Landrum, Sereina Riniker
Abstract: In our previous work, we introduced the concept of torsion angular bin strings (TABS), which is a discrete vector representation of a conformer's torsional angles. Through this discretization, conformational states can be counted, yielding an estimate of the upper limit of the expected conformational ensemble size (nTABS). Besides nTABS being used as a quantitative measure of molecular flexibility, TABS itself is a way of grouping the conformers of a molecule without picking thresholds. This feature of TABS is especially valuable, as selecting suitable thresholds for metrics such as heavy-atom root-mean-square deviation (RMSD) or shape Tanimoto is highly system-dependent and can thus be challenging when working with large sets of molecules. Here, we describe the update to the nTABS algorithm of the TABS package since the last release. In addition, we present a classification study of conformer ensembles by TABS and compare it to classifications by a shape Tanimoto metric.
Scientific contribution: In contrast to our previous implementation, which handled molecular topological symmetry by enumerating all possible combinations that were simply permutations of one another, the new implementation treats TABS as mathematical objects governed by group theory, specifically Burnside's Lemma. This approach requires substantially less code and delivers a notable improvement in computational speed. The study also builds upon our previously developed framework for categorization comparisons between TABS and heavy-atom RMSD. Here, we show the results of a similar comparison with a shape Tanimoto metric, which further support the hypothesis that TABS encode the shape of conformers in a meaningful way.
Citations: 0
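The TABS entry above replaces explicit enumeration of symmetry-equivalent strings with Burnside's Lemma. The sketch below shows the lemma on a simplified model, assuming every torsion has the same number of bins and that symmetry acts by permuting torsion positions; the real nTABS algorithm handles more cases, and none of these function names come from the TABS package.

```python
from itertools import product


def cycle_count(perm):
    """Number of cycles of a permutation given as a tuple of images."""
    seen = set()
    cycles = 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        j = start
        while j not in seen:
            seen.add(j)
            j = perm[j]
    return cycles


def burnside_count(group, n_bins):
    """Distinct bin strings up to symmetry: (1/|G|) * sum of n_bins**cycles(g).

    A string is fixed by g exactly when it is constant on each cycle of g,
    so g fixes n_bins ** cycle_count(g) strings.
    """
    return sum(n_bins ** cycle_count(g) for g in group) // len(group)


def brute_force_count(group, n_bins):
    """Same count by enumerating every string and canonicalising its orbit."""
    n = len(group[0])
    canonical = set()
    for s in product(range(n_bins), repeat=n):
        canonical.add(min(tuple(s[g[i]] for i in range(n)) for g in group))
    return len(canonical)


# Two symmetry-equivalent torsions with three bins each: identity and swap.
swap_group = [(0, 1), (1, 0)]
n_states = burnside_count(swap_group, n_bins=3)
```

For this example `n_states` is 6 rather than the naive 3 × 3 = 9, because strings such as (0, 1) and (1, 0) describe the same conformational state. The brute-force function visits all `n_bins ** n` strings, while the Burnside sum only loops over the group elements, which is consistent with the speed-up the abstract reports.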
Benchmarking the performance of uncertainty quantification methods for neural network-based interatomic potentials.
IF 8.6 | Zone 2, Chemistry
Journal of Cheminformatics Pub Date : 2026-04-13 DOI: 10.1186/s13321-026-01193-7
Nicholas T Wimer, Juliane Mueller, Sebastien Hamel, Vincenzo Lordi
Abstract: Machine-learned interatomic potentials (ML-IAPs) continue to gain popularity as accurate, computationally efficient replacements for traditional, physics-based interatomic potentials and expensive ab initio methods. Uncertainty quantification (UQ) of ML-IAPs is a growing area of research as UQ is critical in many applications of IAPs, such as developing curated datasets, active learning-based data augmentation, self-improving models, and estimating the uncertainty of molecular dynamics simulations. In this paper, we construct and benchmark a series of different neural network potentials (NNPs) with varying network architectures to determine the performance of these models with respect to both the mean and uncertainty calibration error. Each NNP method is specifically designed to predict either epistemic or aleatoric uncertainty, with particular focus on the differences in behavior between the epistemic and aleatoric uncertainty estimates. We benchmark these methods using multiple datasets common in the ML-IAP literature. The results show that the aleatoric uncertainty from single-shot model architectures is a competitive alternative to ensemble-based epistemic uncertainty predictions in regions of sufficient data density. However, in regions where the representative data is sparse, aleatoric uncertainty models tend to overpredict and epistemic methods tend to underpredict the actual model error. We conclude that the type of UQ is crucial when discussing performance of probabilistic model results, as different methods have different performance characteristics depending on the regime in which they are evaluated. Therefore, the type of UQ method should be carefully evaluated against both the data characteristics and requirements for the intended application.
Citations: 0
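The uncertainty quantification entry above compares ensemble-based epistemic estimates with the actual model error. A minimal sketch of that comparison follows, assuming the epistemic uncertainty is taken as the spread of ensemble member predictions and calibration is checked by one-sigma coverage. The paper's own calibration metric is not given in the abstract, so this is an illustrative stand-in with hypothetical function names.

```python
import statistics


def ensemble_mean_std(member_predictions):
    """Ensemble prediction and its spread (population standard deviation)."""
    mean = statistics.fmean(member_predictions)
    std = statistics.pstdev(member_predictions)
    return mean, std


def coverage_within_one_sigma(y_true, y_mean, y_std):
    """Fraction of targets whose error is at most one predicted std."""
    inside = sum(abs(t - m) <= s for t, m, s in zip(y_true, y_mean, y_std))
    return inside / len(y_true)


# Three ensemble members predicting the same quantity (illustrative values).
mean, std = ensemble_mean_std([1.0, 2.0, 3.0])

# Toy check of three predictions against reference values.
coverage = coverage_within_one_sigma(
    y_true=[0.0, 1.0, 5.0],
    y_mean=[0.5, 1.0, 2.0],
    y_std=[1.0, 0.1, 1.0],
)
```

For Gaussian errors a well-calibrated uncertainty gives a one-sigma coverage near 68%. Coverage well above that means the uncertainty is overpredicted, and coverage well below means it is underpredicted, which are the two failure modes the abstract attributes to aleatoric and epistemic methods in sparse-data regions.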