Detection of COVID-19 Using Protein Sequence Data via Machine Learning Classification Approach

Impact Factor: 1.3 | JCR Q2 | Mathematics, Applied

Journal of Applied Mathematics | Pub Date: 2023-09-28 | DOI: 10.1155/2023/9991095

Siti Aminah, Gianinna Ardaneswari, Mufarrido Husnah, Ghani Deori, Handi Bagus Prasetyo


Citations: 0

Abstract



The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in late 2019 led to the COVID-19 pandemic and created a need for rapid, accurate detection of pathogens from protein sequence data. This study aims to develop an efficient classification model for coronavirus protein sequences, using machine learning algorithms and feature selection techniques, to aid the early detection and prediction of novel viruses. We used a dataset of 2000 protein sequences: 1000 SARS-CoV-2 sequences and 1000 non-SARS-CoV-2 sequences. Feature extraction with the Discere package yielded 27 essential features representing the primary structural data. To optimize performance, we employed machine learning classification algorithms such as K-nearest neighbor (KNN), XGBoost, and Naïve Bayes, along with feature selection techniques such as genetic algorithm (GA), LASSO, and support vector machine recursive feature elimination (SVM-RFE). The SVM-RFE+KNN model achieved a classification accuracy of 99.30%, a specificity of 99.52%, and a sensitivity of 99.55%, which demonstrates its efficacy in classifying coronavirus protein sequences. The resulting model is capable of early detection and prediction of protein sequences from SARS-CoV-2 and other coronaviruses, and it holds promise for supporting the development of targeted treatments and preventive strategies against future viral outbreaks.
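To make the reported metrics concrete, the following is a minimal sketch of a KNN classifier and of how accuracy, sensitivity, and specificity are computed from binary predictions. It is not the authors' implementation: the function names `knn_predict` and `binary_metrics`, the choice of k = 3, and the two-dimensional toy vectors are all hypothetical, and the SVM-RFE feature selection step is assumed to have already reduced the feature set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], x))
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical toy data: 1 = SARS-CoV-2, 0 = non-SARS-CoV-2.
# Each tuple stands in for a (much longer) vector of selected sequence features.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.75)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.12, 0.18), (0.88, 0.82), (0.30, 0.30), (0.70, 0.70)]
test_y = [0, 1, 0, 1]

predictions = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
print(binary_metrics(test_y, predictions))
```

Sensitivity is the fraction of SARS-CoV-2 sequences correctly flagged, and specificity is the fraction of non-SARS-CoV-2 sequences correctly rejected, so the two can differ from overall accuracy on the same test set.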


Source journal

Journal of Applied Mathematics (Mathematics, Applied)

CiteScore: 2.70

Self-citation rate: 0.00%

Articles published:

Review time: 3.2 months

About the journal: Journal of Applied Mathematics is a refereed journal devoted to the publication of original research papers and review articles in all areas of applied, computational, and industrial mathematics.