Feature selection for effective prediction of SARS-COV-2 using machine learning.

IF 1.7 · Zone 4 (Biology) · Q4 BIOCHEMISTRY & MOLECULAR BIOLOGY

Genes & genomics Pub Date : 2024-03-01 Epub Date: 2023-11-20 DOI:10.1007/s13258-023-01467-6

Gagan Punacha, Rama Adiga

Pages: 341-354 · Publication type: Journal Article

Citations: 0

Abstract

Background: With the rise in variants of SARS-CoV-2, it is necessary to classify emerging SARS-CoV-2 variants for early detection and thereby reduce human transmission. Genomic and proteomic information has been used less frequently for classification in machine learning (ML) approaches to SARS-CoV-2 detection.

Objective: With this aim, we used nucleoprotein and viral proteomic evolutionary information of SARS-CoV-2, along with the charge and basicity distribution of amino acids from various SARS-CoV-2 strains, to generate an ML-based disease severity model.
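To make the "charge and basicity distribution" idea concrete, the following is a minimal sketch of how such per-sequence features might be computed. The residue groupings (K, R, H as basic; D, E as acidic), the function name `charge_features`, and the toy peptide are illustrative assumptions and are not taken from the paper, which does not specify its exact calculations here.

```python
from collections import Counter

# Assumed residue groupings (standard biochemistry, not taken from the paper).
BASIC = set("KRH")   # lysine, arginine, histidine
ACIDIC = set("DE")   # aspartate, glutamate


def charge_features(sequence: str) -> dict:
    """Return simple charge/basicity composition features for one protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq)
    n = len(seq)
    basic = sum(counts[a] for a in BASIC)
    acidic = sum(counts[a] for a in ACIDIC)
    return {
        "length": n,
        "basic_fraction": basic / n,
        "acidic_fraction": acidic / n,
        # Crude net charge: K and R count as +1, D and E as -1. Histidine is
        # left out because it is mostly neutral near physiological pH.
        "net_charge": counts["K"] + counts["R"] - counts["D"] - counts["E"],
    }


# Hypothetical toy peptide, not a real SARS-CoV-2 sequence.
features = charge_features("MKRDEHKA")
print(features)
```

Features like these could be appended as extra columns next to the sequence-derived and clinical variables before model training.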

Methods: All sequence and clinical data were obtained from GISAID. Proteomic-level calculations were added to complete the dataset. The training set was used for feature selection. The SelectKBest feature selection method was employed, cross-validated against the testing set, and its performance evaluated. DeLong's test was also performed. We also applied BIRCH clustering to cluster the SARS-CoV-2 strains.
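The feature selection step can be sketched as a univariate ranking: score each feature on the training set only, then keep the k highest-scoring columns. The sketch below uses a one-way ANOVA F-statistic as the score, which is an assumption (the abstract does not name the scoring function); scikit-learn's `SelectKBest` with `f_classif` is the library counterpart of this idea. The helper names and the toy data are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_k_best(X, y, k):
    """Return (sorted indices of the k best columns of X, all column scores)."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical toy training data: 6 samples, 3 features, 2 severity classes.
X_train = [
    [1.0, 5.0, 0.2],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.5],
    [3.0, 5.5, 0.4],
    [3.2, 4.5, 0.8],
    [2.9, 5.0, 0.3],
]
y_train = [0, 0, 0, 1, 1, 1]

# Fit the selection on the training set only, then reuse the same column
# indices for any held-out data so no test information leaks into selection.
selected, scores = select_k_best(X_train, y_train, k=1)
X_train_selected = [[row[j] for j in selected] for row in X_train]
print(selected, scores)
```

Keeping the selected indices separate from the data makes it straightforward to apply the identical column subset to the testing set.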

Results: Of six ML models, four were successful in training and testing. The Extra Trees algorithm achieved a micro-averaged F1-score of 74.2% and a weighted-average area under the receiver operating characteristic curve (AUC-ROC) of 73.7% with the multi-class option. With feature selection set to 5, the ROC AUC improved from 73.7% to 76.4%. The selected model achieved an accuracy of 86.9%.
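For readers unfamiliar with the two reported metrics, here is a small sketch of how a micro-averaged F1-score and a weighted multi-class AUC-ROC can be computed from predictions. It assumes a one-vs-rest scheme for the multi-class AUC, which the abstract does not confirm; the function names and the toy labels and probabilities are hypothetical and unrelated to the paper's data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then compute F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    return 2 * tp / (2 * tp + fp + fn)


def binary_auc(is_positive, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, f in zip(scores, is_positive) if f]
    neg = [s for s, f in zip(scores, is_positive) if not f]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_ovr_auc(y_true, proba, classes):
    """One-vs-rest AUC per class, averaged with class-frequency weights."""
    total = 0.0
    for j, c in enumerate(classes):
        flags = [t == c for t in y_true]
        total += sum(flags) * binary_auc(flags, [row[j] for row in proba])
    return total / len(y_true)


# Hypothetical toy example: 3 severity classes, 6 samples.
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]
# Predicted class = column with the highest probability.
y_pred = [max(classes, key=lambda c: row[c]) for row in proba]

f1 = micro_f1(y_true, y_pred)
auc = weighted_ovr_auc(y_true, proba, classes)
print(round(f1, 4), round(auc, 4))
```

The F1-score depends only on the hard class predictions, while the AUC uses the predicted probabilities, so the two can move independently when features are added or removed.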

Conclusion: The unique features identified by the ML approach were able to classify disease severity into classes and showed potential for predicting risk in newer variants.

Source journal

Genes & genomics (Biology: Biochemistry & Molecular Biology)

CiteScore: 3.70

Self-citation rate: 4.80%

Articles published: 131

Review time: 6-12 weeks

Journal description: Genes & Genomics is an official journal of the Korean Genetics Society (http://kgenetics.or.kr/). Although it is an official publication of the Society, membership is not required for contributors. It is a peer-reviewed international journal published in print (ISSN 1976-9571) and online (E-ISSN 2092-9293). It covers all disciplines of genetics and genomics, from prokaryotes to eukaryotes and from fundamental heredity to molecular aspects. Articles may be reviews, research articles, or short communications.