Machine learning-based prediction of antimicrobial resistance and identification of AMR-related SNPs in Mycobacterium tuberculosis.

IF 1.9 Q3 GENETICS & HEREDITY
Yi Xu, Ying Mao, Xiaoting Hua, Yan Jiang, Yi Zou, Zhichao Wang, Zubi Liu, Hongrui Zhang, Lingling Lu, Yunsong Yu
Journal: BMC Genomic Data, 26(1):48. Published 2025-07-12. Publication type: Journal Article.
DOI: https://doi.org/10.1186/s12863-025-01338-x
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12255030/pdf/
Citations: 0

Abstract


Background: Mycobacterium tuberculosis (MTB) is a pathogen that primarily infects humans, causing tuberculosis (TB). Antimicrobial resistance (AMR) in MTB poses a formidable challenge to global health. Applying machine learning to whole-genome sequencing (WGS) data has significant potential for uncovering the genomic mechanisms underlying drug resistance in MTB.

Methods: We used 18 binary matrices, each consisting of genotypes and antimicrobial susceptibility testing phenotypes from a specific MTB-antimicrobial dataset. We constructed training and test datasets from all SNPs, intersected SNPs, and randomly generated SNPs, and developed a machine learning (ML) framework using twelve different algorithms. We then compared the performance of the ML models and used the SHapley Additive exPlanations (SHAP) framework to interpret why and how the best-performing algorithm makes its decisions. Finally, we applied the models to predict resistance phenotypes for rifampicin (RIF) and isoniazid (INH) in additional independent MTB isolate datasets from India and Israel.
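As a rough illustration of the kind of workflow described in the Methods, the sketch below fits a simplified gradient-boosting classifier (depth-1 trees on log-loss) to a binary SNP matrix and reports the share of correctly identified test isolates. It is not the authors' pipeline: the data are synthetic, the isolate and SNP counts, the `CAUSAL_SNP` column, the 150/50 split, and all function names are assumptions made for illustration, and a real analysis would more likely rely on an established library implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (column, value_if_ref, value_if_alt) for the best single-SNP split."""
    best = None
    for j in range(len(X[0])):
        sums, counts = [0.0, 0.0], [0, 0]
        for row, r in zip(X, residuals):
            sums[row[j]] += r
            counts[row[j]] += 1
        means = [sums[a] / counts[a] if counts[a] else 0.0 for a in (0, 1)]
        # Squared error of predicting each residual by its group mean.
        sse = sum((r - means[row[j]]) ** 2 for row, r in zip(X, residuals))
        if best is None or sse < best[0]:
            best = (sse, j, means[0], means[1])
    return best[1], best[2], best[3]


def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Simplified gradient boosting on log-loss using depth-1 trees."""
    p = sum(y) / len(y)
    base = math.log(p / (1.0 - p))  # initial log-odds of resistance
    scores = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss: label minus predicted probability.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, v0, v1 = fit_stump(X, residuals)
        stumps.append((j, v0, v1))
        scores = [s + learning_rate * (v1 if row[j] else v0)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def predict(model, row):
    """Predict 1 (resistant) when the boosted log-odds score is positive."""
    base, lr, stumps = model
    score = base + sum(lr * (v1 if row[j] else v0) for j, v0, v1 in stumps)
    return 1 if score > 0 else 0


# Synthetic binary matrix: rows are isolates, columns are SNPs
# (0 = reference allele, 1 = alternative allele). All values are made up.
rng = random.Random(0)
N_ISOLATES, N_SNPS, CAUSAL_SNP = 200, 10, 3  # hypothetical sizes and column
X = [[1 if rng.random() < 0.3 else 0 for _ in range(N_SNPS)]
     for _ in range(N_ISOLATES)]
# Toy phenotype: resistant exactly when the hypothetical causal SNP is present.
y = [row[CAUSAL_SNP] for row in X]

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

model = fit_boosted_stumps(X_train, y_train)
correct = sum(predict(model, row) == label for row, label in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"correctly identified: {accuracy:.2%}")
```

Because the toy phenotype is fully determined by one column, the first stump selects that column and the held-out isolates are all classified correctly; real AST phenotypes are noisier and would not behave this cleanly.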

Results: The Gradient Boosting Classifier (GBC) model performed best in terms of the percentage correctly identified (97.28%, 96.06%, 94.19%, and 92.81% for the four first-line drugs RIF, INH, pyrazinamide, and ethambutol, respectively). Using SHAP values to estimate the contributions of AMR-related SNPs, we found that positions 761,155 (rpoB_p.Ser450) and 2,155,168 (katG_p.Ser315) ranked highest for RIF and INH, respectively; higher values at these positions (1 for the alternative allele) tend to predict resistance to these two drugs. In addition, the GBC model generalized well when predicting RIF and INH resistance phenotypes in the external independent MTB isolate datasets from India and Israel.
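To make the idea of a per-SNP contribution concrete, the following sketch computes exact Shapley values by enumerating SNP subsets for a tiny hypothetical model. It is not the SHAP library and not the authors' analysis: `toy_model`, the three-SNP layout, and the background isolates are assumptions chosen so that the alternative allele (value 1) at the first SNP pushes the prediction toward resistance, mirroring the pattern reported for rpoB_p.Ser450 and katG_p.Ser315.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, background):
    """Exact Shapley values for one isolate `x` relative to `background` isolates."""
    n = len(x)

    def value(fixed):
        # Mean prediction when SNPs in `fixed` take the isolate's alleles
        # and all other SNPs are taken from each background isolate in turn.
        total = 0.0
        for b in background:
            total += predict([x[i] if i in fixed else b[i] for i in range(n)])
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(snps):
    # Hypothetical resistance score: SNP 0 dominates, SNP 1 adds a small
    # interaction with SNP 0, and SNP 2 is ignored by the model.
    return 0.05 + 0.85 * snps[0] + 0.05 * snps[0] * snps[1]


# Made-up background isolates (0 = reference allele, 1 = alternative allele).
background = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 1]]
isolate = [1, 0, 0]  # carries the alternative allele at SNP 0 only

phi = shapley_values(toy_model, isolate, background)
baseline = sum(toy_model(b) for b in background) / len(background)

for idx, contribution in enumerate(phi):
    print(f"SNP {idx}: contribution {contribution:+.4f}")
print(f"baseline {baseline:.4f}, prediction {toy_model(isolate):.4f}")
```

The contributions sum to the difference between the isolate's prediction and the background average, which is the property that lets each SNP's value be read as its share of the model's decision. Exact enumeration scales exponentially with the number of SNPs, so it is only practical for toy cases like this one.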

Conclusions: This study integrates ML methods into antimicrobial resistance research, develops a framework for predicting resistance phenotypes, and explores AMR-related SNPs in MTB. Quantifying the contribution of important SNPs to model decisions makes the ML algorithmic process more transparent and interpretable, supporting its use in clinical practice.
