Composite machine learning strategy for natural products taxonomical classification and structural insights†

Impact factor: 6.2 | JCR: Q1 (Chemistry, Multidisciplinary)
Qisong Xu, Alan K. X. Tan, Liangfeng Guo, Yee Hwee Lim, Dillon W. P. Tay and Shi Jun Ang
Journal: Digital Discovery, 11, pp. 2192-2200 (Journal Article)
Published: 2024-09-23
DOI: 10.1039/D4DD00155A
Article page: https://pubs.rsc.org/en/content/articlelanding/2024/dd/d4dd00155a
PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00155a?page=search

Abstract

Taxonomical classification of natural products (NPs) can assist in genomic and phylogenetic analysis of source organisms and help streamline bioprospecting efforts. Here, a composite machine learning strategy combining graph convolutional neural networks (GCNNs) and eXtreme Gradient Boosting (XGB) is proposed and validated for taxonomical classification of NPs in five kingdoms (Animalia, Bacteria, Chromista, Fungi, and Plantae). Our composite model, trained on 133 092 NPs from the LOTUS database, achieved a five-fold cross-validated classification accuracy of 97.4%. When employed to classify out-of-sample NPs from the NP Atlas database, accuracies of 82.8% for bacteria and 86.6% for fungi were obtained. Dimensionality-reduced representations of the molecular embeddings from our composite model revealed distinct clusters of NPs that suggest a basis for enhanced classification performance. The top critical substructures from the NPs of each kingdom were also identified and compared to provide insights into structure–taxonomy relationships. Overall, this study showcases the potential of composite machine learning models for robust taxonomical classification of NPs, which can streamline discovery of NPs.
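The evaluation protocol mentioned in the abstract (five-fold cross-validation, plus per-kingdom accuracy on held-out data) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the data are synthetic 2-D points standing in for molecular embeddings, and the nearest-centroid classifier is a simple stand-in, not the GCNN and XGB composite model described by the authors. All function names are hypothetical.

```python
import random
from collections import defaultdict

KINGDOMS = ["Animalia", "Bacteria", "Chromista", "Fungi", "Plantae"]


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fit_centroids(vectors, labels):
    """Stand-in classifier (NOT the paper's model): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def predict_centroid(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vec))
    return min(centroids, key=dist)


def cross_validated_accuracy(vectors, labels, k=5, seed=0):
    """Mean held-out accuracy over k stratified folds."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit_centroids([vectors[i] for i in train],
                              [labels[i] for i in train])
        hits = sum(predict_centroid(model, vectors[i]) == labels[i]
                   for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of correct predictions for each true class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        hits[truth] += int(truth == guess)
    return {label: hits[label] / totals[label] for label in totals}


def make_toy_data(per_class=10, seed=1):
    """Synthetic 2-D 'embeddings': one well-separated cluster per kingdom."""
    rng = random.Random(seed)
    vectors, labels = [], []
    for i, kingdom in enumerate(KINGDOMS):
        for _ in range(per_class):
            vectors.append([10.0 * i + rng.uniform(-1, 1), rng.uniform(-1, 1)])
            labels.append(kingdom)
    return vectors, labels


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"5-fold CV accuracy: {cross_validated_accuracy(X, y):.3f}")
```

Stratifying the folds keeps each kingdom represented in every held-out split, which matters when class sizes differ. The per-class helper mirrors how the out-of-sample results are reported separately for bacteria and fungi.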

