Feature selection for effective prediction of SARS-COV-2 using machine learning

IF 16.4 · Zone 1 (Chemistry) · Q1 CHEMISTRY, MULTIDISCIPLINARY
Accounts of Chemical Research Pub Date : 2024-03-01 Epub Date: 2023-11-20 DOI:10.1007/s13258-023-01467-6
Gagan Punacha, Rama Adiga
Citations: 0

Abstract


Background: With the rise in SARS-CoV-2 variants, emerging variants need to be classified for early detection, thereby reducing human transmission. Genomic and proteomic information has been used less frequently as classification input in machine learning (ML) approaches for detecting SARS-CoV-2.

Objective: With this aim, we used nucleoprotein and viral proteomic evolutionary information of SARS-CoV-2, along with the charge and basicity distribution of amino acids from various strains of SARS-CoV-2, to build an ML-based disease severity model.

Methods: All sequence and clinical data were obtained from GISAID. Proteomic-level calculations were added to complete the dataset. The training set was used for feature selection. The Select K-Best feature selection method was employed, cross-validated with the testing set, and its performance evaluated. DeLong's test was also performed. We also employed BIRCH clustering to cluster the SARS-CoV-2 strains.
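The Select K-Best step can be illustrated with a minimal sketch. The abstract does not state which scoring function was used, so this assumes the one-way ANOVA F-statistic (the default scorer of scikit-learn's `SelectKBest`). The helper names `anova_f` and `select_k_best` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_k_best(X, y, k):
    """Return (indices of the k highest-scoring features, all scores)."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical toy training set: 3 features, 3 severity classes.
X_train = [
    [0.1, 5.0, 1.0],
    [0.2, 5.1, 3.0],
    [1.1, 5.0, 2.0],
    [1.0, 4.9, 1.5],
    [2.1, 5.2, 2.5],
    [2.0, 5.0, 1.2],
]
y_train = ["mild", "mild", "moderate", "moderate", "severe", "severe"]

selected, scores = select_k_best(X_train, y_train, k=1)
print("selected feature indices:", selected)
```

Scores are computed on the training set only, matching the abstract's statement that the training set was used for feature selection; the chosen column indices would then be applied unchanged to the testing set.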

Results: Of six ML models, four were trained and tested successfully. The Extra Trees algorithm produced a micro-averaged F1-score of 74.2% and a weighted-average area under the receiver operating characteristic curve (AUC-ROC) of 73.7% in the multi-class setting. Setting feature selection to 5 features raised the ROC AUC from 73.7% to 76.4%. The selected model achieved an accuracy of 86.9%.
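For readers unfamiliar with the two multi-class metrics reported above, the following sketch shows one common way to compute them: micro-averaging pools true positives, false positives, and false negatives over all classes, and the weighted AUC averages one-vs-rest AUCs by class prevalence. The abstract does not say whether a one-vs-rest or one-vs-one scheme was used, so one-vs-rest is an assumption here. All function names, labels, and probabilities are hypothetical toy values, not the paper's data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    # Ties count as half a correctly ranked pair.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, weighted by the class's share of samples."""
    n = len(y_true)
    total = 0.0
    for j, c in enumerate(classes):
        is_pos = [t == c for t in y_true]
        total += binary_auc([row[j] for row in proba], is_pos) * sum(is_pos) / n
    return total


classes = ["mild", "moderate", "severe"]
y_true = ["mild", "mild", "moderate", "moderate", "severe", "severe"]
# Hypothetical predicted class probabilities, one row per sample.
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
# Predicted label = class with the highest probability.
y_pred = [classes[max(range(len(row)), key=row.__getitem__)] for row in proba]

f1 = micro_f1(y_true, y_pred)
auc = weighted_ovr_auc(y_true, proba, classes)
print(f"micro-F1 = {f1:.3f}, weighted OvR AUC = {auc:.3f}")
```

The pairwise AUC is quadratic in the number of samples, which is fine for illustration; a rank-based implementation would be preferable on a real dataset.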

Conclusion: The unique features identified by the ML approach were able to classify disease severity into classes and showed potential for predicting risk in newer variants.
