Feature selection for effective prediction of SARS-COV-2 using machine learning.

IF 1.6 · CAS Zone 4 (Biology) · JCR Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY
Genes & genomics, pp. 341-354. Pub Date: 2024-03-01. Epub Date: 2023-11-20. DOI: 10.1007/s13258-023-01467-6
Gagan Punacha, Rama Adiga

Abstract

Background: With the rise in variants of SARS-CoV-2, it is necessary to classify emerging SARS-CoV-2 variants for early detection and thereby reduce human transmission. Genomic and proteomic information has less frequently been used for classification in machine learning (ML) approaches to detecting SARS-CoV-2.

Objective: With this aim, we used nucleoprotein and viral proteomic evolutionary information of SARS-CoV-2, along with the charge and basicity distribution of amino acids from various strains of SARS-CoV-2, to generate an ML-based disease severity model.
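
As a rough illustration of what a charge and basicity descriptor could look like, the Python sketch below computes the fraction of basic and acidic residues and a crude net charge for a protein sequence. The residue groupings (K, R, H as basic; D, E as acidic), the function name `charge_basicity_features`, and the example peptide are assumptions made for illustration; the abstract does not specify the exact descriptors the authors calculated.

```python
from collections import Counter

# Assumed residue groupings (not taken from the paper).
BASIC = "KRH"   # lysine, arginine, histidine
ACIDIC = "DE"   # aspartate, glutamate


def charge_basicity_features(seq: str) -> dict:
    """Return simple charge/basicity descriptors for one amino-acid sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n_basic = sum(counts[a] for a in BASIC)
    n_acidic = sum(counts[a] for a in ACIDIC)
    # Crude net charge: K and R count +1, D and E count -1; His is treated as neutral.
    net_charge = counts["K"] + counts["R"] - counts["D"] - counts["E"]
    return {
        "frac_basic": n_basic / n,
        "frac_acidic": n_acidic / n,
        "net_charge": net_charge,
    }


# Hypothetical 8-residue peptide, used only to exercise the function.
features = charge_basicity_features("MKRDEHAG")
print(features)
```

Descriptors of this kind could be computed per protein and per strain and then appended as columns to the sequence-derived dataset.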

Methods: All sequence and clinical data were obtained from GISAID. Proteomic-level calculations were added to complete the dataset. The training set was used for feature selection. The SelectKBest feature selection method was employed, validated against the testing set, and its performance evaluated. DeLong's test was also performed. We also employed BIRCH clustering to cluster the SARS-CoV-2 strains.
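
The workflow described in the Methods can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the authors' code: the data are synthetic (generated with `make_classification` in place of the GISAID-derived table), the scoring function `f_classif`, the three severity classes, and all hyperparameters are illustrative choices, and DeLong's test is not included.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real feature table (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection is fitted on the training set only,
# then the same 5 columns are used when scoring the test set.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("clf", ExtraTreesClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)

selected = model.named_steps["select"].get_support(indices=True)
test_accuracy = model.score(X_test, y_test)
print("selected feature indices:", selected)
print("held-out accuracy:", round(test_accuracy, 3))

# BIRCH clustering of the same table; n_clusters=3 is an arbitrary choice.
cluster_labels = Birch(n_clusters=3).fit_predict(X)
print("cluster sizes:", np.bincount(cluster_labels))
```

Wrapping the selector and classifier in one `Pipeline` keeps the test set out of the feature-ranking step, which matches the statement that the training set was used for feature selection.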

Results: Of six ML models, four were successfully trained and tested. The Extra Trees algorithm produced a micro-averaged F1-score of 74.2% and a weighted-average area under the receiver operating characteristic curve (AUC-ROC) of 73.7% with the multi-class option. Setting the number of selected features to 5 raised the ROC AUC from 73.7% to 76.4%. The selected model achieved an accuracy of 86.9%.
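
For readers unfamiliar with the reported metrics, the sketch below shows how a micro-averaged F1-score and a weighted one-vs-rest ROC AUC can be computed for a multi-class problem with scikit-learn. The labels and probabilities are made-up toy values, and the one-vs-rest setting (`multi_class="ovr"`) is an assumption, since the abstract only says "multi-class option".

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy ground truth for three hypothetical severity classes.
y_true = np.array([0, 0, 1, 1, 2, 2])

# Toy predicted class probabilities; each row sums to 1.
y_proba = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
])
y_pred = y_proba.argmax(axis=1)

# Micro-averaging pools true/false positives over all classes.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# One-vs-rest AUC per class, averaged with weights equal to class support.
weighted_auc = roc_auc_score(
    y_true, y_proba, multi_class="ovr", average="weighted"
)

accuracy = accuracy_score(y_true, y_pred)
print("micro F1:", micro_f1)
print("weighted OvR ROC AUC:", weighted_auc)
print("accuracy:", accuracy)
```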

Conclusion: The unique features identified by the ML approach were able to classify disease severity into classes and showed potential for predicting risk in newer variants.

Source journal
Genes & genomics (Biology: Biochemistry & Molecular Biology)
CiteScore: 3.70
Self-citation rate: 4.80%
Articles published: 131
Review time: 6-12 weeks
Journal description: Genes & Genomics is an official journal of the Korean Genetics Society (http://kgenetics.or.kr/). Although it is an official publication of the Genetics Society of Korea, membership of the Society is not required for contributors. It is a peer-reviewed international journal published in print (ISSN 1976-9571) and online (E-ISSN 2092-9293). It covers all disciplines of genetics and genomics, from prokaryotes to eukaryotes and from fundamental heredity to molecular aspects. Articles may be reviews, research articles, or short communications.