Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach

Impact factor: 1.2, JCR Q2 (Mathematics, Applied)
Siti Aminah, Gianinna Ardaneswari, Mufarrido Husnah, Ghani Deori, Handi Bagus Prasetyo
Journal of Applied Mathematics, published 2023-09-28. DOI: 10.1155/2023/9991095 (https://doi.org/10.1155/2023/9991095)
Citation count: 0

Abstract

The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 led to the COVID-19 pandemic and created a need for rapid, accurate detection of pathogens from protein sequence data. This study aims to develop an efficient classification model for coronavirus protein sequences, using machine learning algorithms and feature selection techniques, to aid the early detection and prediction of novel viruses. We used a dataset of 2000 protein sequences: 1000 SARS-CoV-2 sequences and 1000 non-SARS-CoV-2 sequences. Feature extraction with the Discere package yielded 27 essential features representing the primary structural data. To optimize performance, we employed machine learning classification algorithms such as K-nearest neighbor (KNN), XGBoost, and Naïve Bayes, along with feature selection techniques such as genetic algorithm (GA), LASSO, and support vector machine recursive feature elimination (SVM-RFE). The SVM-RFE+KNN model achieved a classification accuracy of 99.30%, a specificity of 99.52%, and a sensitivity of 99.55%. These results demonstrate the model’s efficacy in classifying coronavirus protein sequences. The resulting model is capable of early detection and prediction of protein sequences in SARS-CoV-2 and other coronaviruses, and may help in developing targeted treatments and preventive strategies against future viral outbreaks.
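To make the workflow in the abstract more concrete, the following is a minimal sketch of the classification and evaluation steps, not the authors' implementation. It rests on several assumptions: amino-acid composition (20 features) stands in for the 27 features extracted with the Discere package, which the abstract does not list; the feature selection step (SVM-RFE, GA, or LASSO) is omitted; and the function names `composition`, `knn_predict`, and `evaluate`, as well as the toy sequences, are hypothetical and invented for illustration.

```python
from collections import Counter
from math import dist

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Fraction of each standard amino acid in seq (assumed stand-in features)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # guard against empty input
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy data, invented for illustration only (not real viral sequences).
# Label 1 plays the role of "SARS-CoV-2", label 0 of "non-SARS-CoV-2".
train_seqs = ["MKKLLAAG", "MKKLAAGG", "WWYFFPPH", "WYYFFPHH"]
train_labels = [1, 1, 0, 0]
train_X = [composition(s) for s in train_seqs]

test_seqs = ["MKLLAAGG", "WWYFPPHH"]
test_labels = [1, 0]
predictions = [knn_predict(train_X, train_labels, composition(s)) for s in test_seqs]
metrics = evaluate(test_labels, predictions)
print(metrics)
```

Sensitivity and specificity are reported separately from accuracy because they measure the error rate on each class: sensitivity is the share of positive (here, SARS-CoV-2) sequences correctly identified, and specificity is the share of negative sequences correctly rejected. In a fuller pipeline, a feature selection step such as SVM-RFE would sit between `composition` and `knn_predict` and pass only the retained feature columns to the classifier.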
Source journal
Journal of Applied Mathematics (MATHEMATICS, APPLIED)
CiteScore: 2.70
Self-citation rate: 0.00%
Articles published: 58
Review time: 3.2 months
About the journal: Journal of Applied Mathematics is a refereed journal devoted to the publication of original research papers and review articles in all areas of applied, computational, and industrial mathematics.