SARS-CoV-2 CoCoPUTs: analyzing GISAID and NCBI data to obtain codon statistics, mutations, and free energy over a multiyear period

IF 5.5 · CAS Zone 2 (Medicine) · JCR Q1 (Virology)

Virus Evolution · Pub Date: 2025-01-17 · eCollection Date: 2025-01-01 · DOI: 10.1093/ve/veae115

Nigam H Padhiar, Tigran Ghazanchyan, Sarah E Fumagalli, Michael DiCuccio, Guy Cohen, Alexander Ginzburg, Brian Rikshpun, Almog Klein, Luis Santana-Quintero, Sean Smith, Anton A Komar, Chava Kimchi-Sarfaty

Virus Evolution, volume 11, issue 1, article veae115. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11776705/pdf/

Citations: 0

Abstract

A consistent area of interest since the beginning of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been the sequence composition of the virus and how it has changed over time. Many resources have been developed for the storage and analysis of SARS-CoV-2 data, such as GISAID (Global Initiative on Sharing All Influenza Data), NCBI, Nextstrain, and outbreak.info. However, relatively little has been done to compile codon usage data, codon-level mutation data, and secondary structure data into a single database. Here, we assemble these data and many additional virus attributes in a new database entitled SARS-CoV-2 CoCoPUTs. We begin with an overview of the composition and overlap of two of the largest sources of SARS-CoV-2 sequence data: GISAID and NCBI Virus (GenBank). We then evaluate different sequence curation strategies to reduce the dataset of millions of sequences to one sequence per Pango lineage variant. We then analyze the coding sequences (CDSs), calculating codon usage, codon pair usage, dinucleotides, junction dinucleotides, mutations, GC content, effective number of codons (ENCs), and effective number of codon pairs (ENCPs). We also perform whole-genome RNA secondary structure predictions for each variant, using the LinearPartition software and modified selective 2'-hydroxyl acylation analyzed by primer extension (SHAPE) data that are available online. Finally, we compile all the data into our resource, SARS-CoV-2 CoCoPUTs, and pair many of the resulting statistics with variant proportion data over time to derive trends in viral evolution.
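As an illustration of the CDS-level statistics listed above, the following Python sketch computes codon counts, codon pair counts, junction dinucleotide counts, overall GC%, and third-position GC% for one coding sequence. It is not the authors' pipeline: the function name `codon_stats`, the returned keys, and the choices to count stop codons and to ignore ambiguous bases when computing GC are assumptions made here for illustration.

```python
from collections import Counter


def codon_stats(cds: str) -> dict:
    """Basic codon-level statistics for a single in-frame coding sequence.

    Hypothetical helper for illustration; the abstract does not specify
    how stop codons or ambiguous bases are treated.
    """
    seq = cds.upper().replace("U", "T")
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS must be non-empty and a multiple of 3 in length")

    # Split the sequence into in-frame codons.
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]

    # Codon usage: how often each codon occurs.
    codon_counts = Counter(codons)

    # Codon pair usage: each codon together with its downstream neighbour.
    codon_pair_counts = Counter(zip(codons, codons[1:]))

    # Junction dinucleotides: last base of one codon + first base of the next.
    junctions = Counter(a[2] + b[0] for a, b in zip(codons, codons[1:]))

    # Overall GC% across all positions, and GC% at third codon positions only.
    gc_percent = sum(seq.count(b) for b in "GC") / len(seq) * 100
    gc3_percent = sum(c[2] in "GC" for c in codons) / len(codons) * 100

    return {
        "codon_counts": codon_counts,
        "codon_pair_counts": codon_pair_counts,
        "junction_dinucleotides": junctions,
        "gc_percent": gc_percent,
        "gc3_percent": gc3_percent,
    }
```

Comparing `gc_percent` with `gc3_percent` for each variant, weighted by variant proportion over time, is one way a trend such as the one reported for third-position GC% could be tracked; the weighting scheme actually used is not described in the abstract.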
Although the overall codon usage of SARS-CoV-2 did not change drastically, in line with the previous literature on this subject, we did observe that while overall GC% content decreased, GC% of the third position in the codon was more positive relative to overall GC% content between February 2021 and July 2023. Over the same interval, we noted that both synonymous and nonsynonymous mutations increased in number, with nonsynonymous mutations outpacing synonymous mutations at a rate of 3:1. We noted that the predicted whole-genome secondary structures nearly all contained the previously described virus-activated inhibitor of translation (VAIT) stem loops, validating for the first time their existence in a whole-genome secondary structure prediction for many SARS-CoV-2 variants (as opposed to previous local secondary structure predictions). We also separately produced a synonymous mutation-deprived set of SARS-CoV-2 variant sequences and repeated the secondary structure calculations on this set. This revealed an interesting trend of reduced ensemble free energy compared to the unaltered variant structures, indicating that synonymous mutations play a role in increasing the free energy of viral RNA molecules. These data both validate previous studies describing increases in viral free energy in human viruses over time and indicate a possible role for synonymous mutations in viral biology.
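The distinction between synonymous and nonsynonymous changes, and the idea of a synonymous mutation-deprived sequence set, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' method: it assumes the reference and variant CDSs are already aligned and of equal length (substitutions only, no indels), it counts changed codons rather than individual nucleotide changes, it skips codons with ambiguous bases, and the names `classify_changes` and `revert_synonymous` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def _codons(seq: str) -> list:
    """Split an in-frame sequence into uppercase DNA codons."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def _same_amino_acid(ref_codon: str, var_codon: str) -> bool:
    """True if both codons are unambiguous and encode the same residue."""
    return (
        ref_codon in CODON_TABLE
        and var_codon in CODON_TABLE
        and CODON_TABLE[ref_codon] == CODON_TABLE[var_codon]
    )


def classify_changes(ref_cds: str, var_cds: str) -> tuple:
    """Return (synonymous, nonsynonymous) counts of changed codons."""
    if len(ref_cds) != len(var_cds):
        raise ValueError("sequences must be aligned and of equal length")
    syn = nonsyn = 0
    for r, v in zip(_codons(ref_cds), _codons(var_cds)):
        if r == v or r not in CODON_TABLE or v not in CODON_TABLE:
            continue  # unchanged codon, or ambiguous bases: not counted
        if CODON_TABLE[r] == CODON_TABLE[v]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


def revert_synonymous(ref_cds: str, var_cds: str) -> str:
    """Variant CDS with synonymous codon changes reset to the reference codon."""
    if len(ref_cds) != len(var_cds):
        raise ValueError("sequences must be aligned and of equal length")
    return "".join(
        r if _same_amino_acid(r, v) else v
        for r, v in zip(_codons(ref_cds), _codons(var_cds))
    )
```

A sequence produced this way keeps every amino acid change of the variant while removing its synonymous changes, so a structure prediction run on it versus the unaltered variant would isolate the contribution of synonymous mutations to the predicted free energy. How the authors built their mutation-deprived set is not stated in the abstract.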

Source journal

Virus Evolution (Immunology and Microbiology: Microbiology)

CiteScore: 10.50
Self-citation rate: 5.70%
Articles published: 108
Review time: 14 weeks

About the journal: Virus Evolution is a new Open Access journal focusing on the long-term evolution of viruses, viruses as a model system for studying evolutionary processes, viral molecular epidemiology and environmental virology. The aim of the journal is to provide a forum for original research papers, reviews, commentaries and a venue for in-depth discussion on the topics relevant to virus evolution.