自然界中核酸和肽短序列稀有性的决定因素。

IF 4 Q1 GENETICS & HEREDITY

NAR Genomics and Bioinformatics Pub Date : 2024-04-04 eCollection Date: 2024-06-01 DOI:10.1093/nargab/lqae029

Nikol Chantzi, Manvita Mareboina, Maxwell A Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares

{"title":"自然界中核酸和肽短序列稀有性的决定因素。","authors":"Nikol Chantzi, Manvita Mareboina, Maxwell A Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares","doi":"10.1093/nargab/lqae029","DOIUrl":null,"url":null,"abstract":"The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.","PeriodicalId":33994,"journal":{"name":"NAR Genomics and Bioinformatics","volume":"6 2","pages":"lqae029"},"PeriodicalIF":4.0000,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10993293/pdf/","citationCount":"0","resultStr":"{\"title\":\"The determinants of the rarity of nucleic and peptide short sequences in nature.\",\"authors\":\"Nikol Chantzi, Manvita Mareboina, Maxwell A Konnaris, Austin Montgomery, Michail Patsakis, Ioannis Mouratidis, Ilias Georgakopoulos-Soares\",\"doi\":\"10.1093/nargab/lqae029\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.\",\"PeriodicalId\":33994,\"journal\":{\"name\":\"NAR Genomics and Bioinformatics\",\"volume\":\"6 2\",\"pages\":\"lqae029\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2024-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10993293/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"NAR Genomics and Bioinformatics\",\"FirstCategoryId\":\"88\",\"ListUrlMain\":\"https://doi.org/10.1093/nargab/lqae029\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/6/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q1\",\"JCRName\":\"GENETICS & HEREDITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"NAR Genomics and Bioinformatics","FirstCategoryId":"88","ListUrlMain":"https://doi.org/10.1093/nargab/lqae029","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/6/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}

引用次数: 0

摘要

核酸和肽短序列在生物基因组和蛋白质组中的普遍性尚未得到深入研究。我们研究了 45 785 个参考基因组和 21 871 个参考蛋白质组，涵盖古生菌、细菌、真核生物和病毒，以计算其中短序列的稀有程度。为此，我们开发了一种衡量自然界中每种序列稀有程度的指标--稀有度指数。我们发现，在罕见的寡肽序列中，某些二肽的出现频率比预期的低数百倍，而任何二核苷酸的出现频率都不会如此。我们还生成了预测回归模型，分别推断整个自然界或生命和病毒各领域内的核序列和蛋白质组序列的稀有程度。在分别研究生命和病毒的三个领域时，从单肽和二肽中预测 5 聚体肽稀有性的模型的 R² 值在 0.814 和 0.932 之间。另一个从单核苷酸和双核苷酸中预测 10 聚体寡核苷酸稀有性的模型的 R² 值在 0.408 和 0.606 之间。我们的研究结果表明，核序列的单核苷酸和二核苷酸组成以及肽序列的单肽和二肽组成可以解释它们在自然界中出现频率差异的很大一部分原因。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

The determinants of the rarity of nucleic and peptide short sequences in nature.

The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

NAR Genomics and Bioinformatics Multiple-

CiteScore

8.00

自引率

2.20%

发文量

审稿时长

15 weeks