SARS-CoV-2 variants classification and characterization

EPiC series in computing Pub Date : 2022-01-01 DOI:10.29007/5qpk

Sofia Borgato, Marco Bottino, Marta Lovino, E. Ficarra



Abstract

Since late 2019, the SARS-CoV-2 virus has spread globally, giving rise to several variants over time. These variants differ from the original sequence identified in Wuhan and may therefore compromise the efficacy of the vaccines developed. Some software has been released to recognize currently known and newly spreading variants. However, some of these tools are not entirely automatic, while others do not return a detailed characterization of all the mutations in the samples. Such a characterization can help biologists understand the variability between samples. This paper presents a Machine Learning (ML) approach that identifies existing and new variants completely automatically. In addition, a detailed table showing all the alterations and mutations found in the samples is provided as output to the user. SARS-CoV-2 sequences are obtained from the GISAID database, and a list of custom-designed features (e.g., the number of mutations in each gene of the virus) is used to train the algorithm. Existing variants are recognized by a Random Forest classifier, while newly spreading variants are identified by the DBSCAN algorithm. Both the Random Forest and DBSCAN techniques demonstrated high precision on a new variant that arose during the drafting of this paper (used only in the testing phase of the algorithm). Researchers can therefore benefit from the proposed algorithm and from the detailed output listing the main alterations of the samples.
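The abstract names two ingredients that lend themselves to a small illustration: per-gene mutation counts as features, and DBSCAN for spotting samples that fit no known group. The following is a minimal sketch under stated assumptions, not the authors' implementation. The gene map `TOY_GENES`, the function names, the toy sequences, and the `eps` and `min_samples` values are all hypothetical. The paper's actual feature list and parameters are not given in the abstract, and the Random Forest step is not shown here.

```python
import math

# Hypothetical toy gene map (1-based, inclusive coordinates).
# Real coordinates would come from a reference genome annotation.
TOY_GENES = {"geneA": (1, 10), "geneB": (11, 20)}


def mutations_per_gene(reference, sample, genes):
    """Count substitutions per gene between two aligned, equal-length sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    counts = {name: 0 for name in genes}
    for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1):
        # Skip matching bases and ambiguous calls (assumption: 'N' means unknown).
        if ref_base == alt_base or alt_base == "N":
            continue
        for name, (start, end) in genes.items():
            if start <= pos <= end:
                counts[name] += 1
    return [counts[name] for name in genes]


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


if __name__ == "__main__":
    reference = "A" * 20
    sample = "A" * 4 + "G" + "A" * 10 + "T" + "A" * 4
    print(mutations_per_gene(reference, sample, TOY_GENES))

    # Toy feature vectors: two dense groups and one isolated sample.
    features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(features, eps=1.5, min_samples=3))
```

In this toy setting, a sample labeled `-1`, or a cluster containing no sample with a known label, would be a candidate for a new variant. A production pipeline would more likely rely on an established library implementation than on this hand-written version.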

