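A novel approach to analyzing the evolution of SARS-CoV-2 based on visualization and clustering of large genetic data compactly represented in operative memory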

Impact factor: 1.0 | JCR quartile: Q3 | Category: Agriculture, Multidisciplinary

Vavilovskii Zhurnal Genetiki i Selektsii | Pub Date: 2024-12-01 | DOI: 10.18699/vjgb-24-92

A Yu Palyanov, N V Palyanova

Volume 28, issue 8, pages 843-853. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11811502/pdf/

Citations: 0

Abstract

SARS-CoV-2 is a virus for which an exceptionally large number of genome variants have been collected, sequenced and stored from sources all around the world. The raw data in FASTA format comprise 16.8 million genomes, each ≈29,900 nt (nucleotides) long, with a total size of ≈500 × 10⁹ nt, or 465 Gb. We propose an approach to data representation and organization with which all of this can be stored losslessly in the operative memory (RAM) of an ordinary PC; moreover, just ≈330 Mb is enough. Aligning all genomes against the initial Wuhan-Hu-1 reference sequence allows each genome to be represented as a data structure containing lists of point mutations, deletions and insertions. Our implementation of this representation achieves a 1:1500 compression ratio (for comparison, compressing the same data with the popular WinRAR archiver gives only 1:62), together with fast access to genomes (and their metadata) and fast comparison between different genome variants. With this approach implemented as a C++ program, we analyzed various properties of the set of SARS-CoV-2 genomes available in NCBI GenBank (collected from 24.12.2019 to 24.06.2024). We calculated the distribution of the number of genomes containing undetermined nucleotides ('N's) versus the number of such nucleotides in them, the number of unique genomes and of clusters of identical genomes, and the distribution of clusters by size (the number of identical genomes) and duration (the time interval between each cluster's first and last genome). Finally, the evolution of the distributions of the number of changes (the edit distance between each genome and the reference sequence) caused by substitutions, deletions and insertions was visualized as 3D surfaces, which clearly show the process of viral evolution over 4.5 years with a time step of 1 week. This picture corresponds well with phylogenetic trees (usually based on 3-4 thousand representative genome variants), but it is built over millions of genomes, shows more detail and is independent of the type of lineage/clade classification.
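
The representation and clustering described in the abstract can be illustrated with a minimal C++ sketch. It is not the authors' implementation: the type and function names (`GenomeDiff`, `reconstruct`, `changeCount`, `diffKey`, `clusterIdentical`), the 0-based positions, the `day` metadata field and the string-based cluster key are all assumptions made for illustration. The sketch stores each genome as lists of differences from a reference, rebuilds the full sequence from them, counts the changed nucleotides, and groups identical genomes into clusters with a size and a first/last day.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical record types; the paper's actual memory layout is not given in the abstract.
struct Substitution { std::uint32_t pos; char base; };          // 0-based reference position, new base
struct Deletion     { std::uint32_t pos; std::uint32_t len; };  // reference bases [pos, pos+len) removed
struct Insertion    { std::uint32_t pos; std::string bases; };  // bases inserted before reference position pos

struct GenomeDiff {
    std::vector<Substitution> subs;
    std::vector<Deletion> dels;
    std::vector<Insertion> ins;   // assumed sorted by pos
    int day = 0;                  // assumed metadata: collection date as a day count
};

// Rebuild the full sequence from the reference and the stored differences.
std::string reconstruct(const std::string& ref, const GenomeDiff& d) {
    std::string edited = ref;
    for (const auto& s : d.subs) {
        assert(s.pos < ref.size());
        edited[s.pos] = s.base;
    }
    std::vector<bool> removed(ref.size(), false);
    for (const auto& del : d.dels) {
        assert(del.pos + del.len <= ref.size());
        for (std::uint32_t i = 0; i < del.len; ++i) removed[del.pos + i] = true;
    }
    std::string out;
    std::size_t k = 0;  // index of the next insertion to emit
    for (std::size_t i = 0; i <= ref.size(); ++i) {
        while (k < d.ins.size() && d.ins[k].pos == i) out += d.ins[k++].bases;
        if (i < ref.size() && !removed[i]) out += edited[i];
    }
    return out;
}

// Number of changed nucleotides relative to the reference
// (one plausible reading of the "number of changes" in the abstract).
std::size_t changeCount(const GenomeDiff& d) {
    std::size_t n = d.subs.size();
    for (const auto& del : d.dels) n += del.len;
    for (const auto& in : d.ins) n += in.bases.size();
    return n;
}

// Canonical text key: two genomes with the same sorted difference lists get the same key.
std::string diffKey(const GenomeDiff& d) {
    std::string key;
    for (const auto& s : d.subs) key += "S" + std::to_string(s.pos) + s.base + ";";
    for (const auto& del : d.dels)
        key += "D" + std::to_string(del.pos) + ":" + std::to_string(del.len) + ";";
    for (const auto& in : d.ins) key += "I" + std::to_string(in.pos) + in.bases + ";";
    return key;
}

struct Cluster { std::size_t size = 0; int firstDay = 0; int lastDay = 0; };

// Group identical genomes; cluster duration is lastDay - firstDay.
std::map<std::string, Cluster> clusterIdentical(const std::vector<GenomeDiff>& genomes) {
    std::map<std::string, Cluster> clusters;
    for (const auto& g : genomes) {
        Cluster& c = clusters[diffKey(g)];
        if (c.size == 0) {
            c.firstDay = c.lastDay = g.day;
        } else {
            c.firstDay = std::min(c.firstDay, g.day);
            c.lastDay = std::max(c.lastDay, g.day);
        }
        ++c.size;
    }
    return clusters;
}
```

A genome identical to the reference has empty lists and an empty key, so it costs almost nothing to store. Binning `changeCount` by collection week would give the kind of per-week distribution that the abstract visualizes as 3D surfaces. A real implementation would presumably use a far more compact key and encoding than the strings used here.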


Source journal

Vavilovskii Zhurnal Genetiki i Selektsii (Agriculture, Multidisciplinary)

CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 119
Review time: 8 weeks

About the journal: The "Vavilov Journal of Genetics and Breeding" publishes original research and review articles in all key areas of modern plant, animal and human genetics, genomics, bioinformatics and biotechnology. One of the main objectives of the journal is the integration of theoretical and applied research in the field of genetics. Special attention is paid to the most topical areas of modern genetics that deal with global concerns such as food security and human health.