Author: Radka Symonová
Journal: Journal of Applied Ichthyology
Published: 2022-11-16
DOI: 10.1111/jai.14365
URL: https://onlinelibrary.wiley.com/doi/10.1111/jai.14365
Citations: 0
Abstract
How (not) to read fish genomics data – The importance of cytogenomics knowledge in the current flood of sequenced genomes
Biologists have faced a tremendous data explosion in recent years, most visibly in the still-growing number of sequenced genomes. However, these data may be publicly available in an incomplete form, and understanding and using them properly is not always straightforward. The current flood of fish genomes is one such case. In reaction to this situation, a recent study by Randhawa and Pawar (2021) tried to exploit these data, but in an improper way. That study is not the only one with serious problems in handling genomic data and contextualizing them in organismal evolution. However, Randhawa and Pawar (2021) accumulated several serious issues in a single paper and then further analysed their incorrect findings. This short communication aims to clarify the unclear points in that study and to provide some simple hints on how to avoid similar issues. This is particularly relevant for fish genomics, where immense biodiversity results in a so far unexplored diversity of genome traits.
The study by Randhawa and Pawar (2021) is based on the NCBI repository of genomic data (Genome, 2022). Even excellent and indispensable tools such as NCBI/Genome are not flawless, because they rely on data submitted by other scientists. Hence, it is crucial to always check all downloaded data manually. The best way is to sort the dataset by, for example, genome size (GS) and by GC content (GC%). The incorrect values then become immediately apparent at both ends of the sorted dataset. Namely, incomplete genome assemblies result in genome sizes that are too small (e.g. Squalius pyrenaicus with an entirely nonsensical genome size of 48 Mb reported by Randhawa & Pawar, 2021). As a result, such incomplete assemblies can yield fully aberrant GC% values, either extremely low (e.g. Chionodraco hamatus with GC = 25.4%) or extremely high (e.g. S. pyrenaicus with GC = 51.1%). Such values must, of course, be discarded. A similar mistake was identified in the paper by Lu & Luo (2020), who reported GC = 31.5% for the channel catfish and claimed this value to be the lowest in their dataset. In this particular case, the value is clearly incorrect and too low for a vertebrate genome. GC% is crucial for several reasons, particularly with regard to genome completeness, since GC-rich regions were underrepresented for technical reasons in earlier versions of genome assemblies (Rhie et al., 2021).
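The sorting check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout and the `extremes` helper are hypothetical, and the two "placeholder" rows are invented so that the sort has something to order. Only the values marked as coming from the text are quoted in the article.

```python
# Rows marked "from the text" use values quoted in the article; the two
# placeholder rows are invented solely to give the sort something to order.
records = [
    {"species": "Squalius pyrenaicus", "size_mb": 48.0, "gc": 51.1},   # from the text
    {"species": "Chionodraco hamatus", "size_mb": None, "gc": 25.4},   # GC from the text
    {"species": "placeholder A", "size_mb": 900.0, "gc": 40.0},        # invented
    {"species": "placeholder B", "size_mb": 1400.0, "gc": 42.0},       # invented
]


def extremes(rows, field, n=1):
    """Return (n lowest, n highest) species names for `field`, skipping missing values."""
    known = sorted((r for r in rows if r[field] is not None), key=lambda r: r[field])
    names = [r["species"] for r in known]
    return names[:n], names[-n:]


# Inspect both ends of the dataset for each trait, as the text recommends.
for field in ("size_mb", "gc"):
    low, high = extremes(records, field)
    print(f"{field}: lowest={low} highest={high}")
```

Sorting only surfaces candidates at the ends of the distribution. Whether a value is actually wrong still has to be judged against the literature.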
An entirely different but equally important issue is where to draw the line between incomplete assemblies and assemblies that are of low(er) quality but still usable. This is crucial for deciding which values should be discarded and which retained. The issue cannot be solved easily, because there is usually no gap between the genome size values; they form a continuous series. Moreover, some genome assemblies are incomplete for technical reasons although they are presented and considered as “complete” (Rhie et al., 2021), whereas other genome assemblies are incomplete because of the research goal, which is not always to provide assemblies that are as complete as possible (different types of reduced representation sequencing, e.g. Luca et al., 2011; or partial sequencing projects). In fish genomes, the combination of these reasons results in a gradient of values from clearly (sometimes intentionally) incomplete genomes towards less incomplete ones, without any obvious gap. Moreover, fish genomes span a wide range of genome sizes: a genome size of 1000 Mb, for example, might be correct for one species or lineage but incorrect (incomplete) for another. Such data therefore require active intervention, deeper insight and analysis.
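One way to illustrate why a single global cut-off cannot work is to judge each assembly against an independent reference size for its own lineage. The sketch below is hypothetical throughout: the function names, the reference sizes of 1100 Mb and 3000 Mb, and the 0.8 ratio are all assumptions chosen for illustration, not values from the article.

```python
def completeness_ratio(assembly_mb, reference_mb):
    """Assembly size as a fraction of an independent reference size for the lineage."""
    if reference_mb <= 0:
        raise ValueError("reference size must be positive")
    return assembly_mb / reference_mb


def looks_incomplete(assembly_mb, reference_mb, min_ratio=0.8):
    """True if the assembly covers less than `min_ratio` of the reference size.

    `min_ratio` is an arbitrary illustrative cut-off, not a published threshold.
    """
    return completeness_ratio(assembly_mb, reference_mb) < min_ratio


# The same 1000 Mb assembly judged against two hypothetical lineage references.
print(looks_incomplete(1000.0, 1100.0))  # lineage with a reference size near 1100 Mb
print(looks_incomplete(1000.0, 3000.0))  # lineage with a reference size near 3000 Mb
```

Because the values form a continuum, any such ratio only ranks candidates for manual inspection. It does not replace the curation the text calls for.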
Thinking about genome size should lead us to ask whether a cypriniform fish (S. pyrenaicus, with a genome size of 48 Mb in Randhawa & Pawar, 2021) could have a genome almost ten times smaller than that of tetraodontid fishes (i.e. smooth pufferfishes, e.g. Tetraodon or Takifugu), which possess the smallest known vertebrate genomes (Neafsey & Palumbi, 2003). Interestingly, even the latest assembly versions, which use technologies such as PacBio sequencing, produce genomes still smaller than those determined by cytological methods (the C-value) for the same species. This can be demonstrated on the smallest vertebrate genomes: the recently sequenced Takifugu obscurus yielded 373–381 Mb (Kang et al., 2020), whereas the C-value in the genus Takifugu is 0.4–0.42 pg (Gregory, 2022).
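To compare an assembly size in Mb with a cytological C-value in pg, the two have to be brought to the same unit. The sketch below assumes the widely used conversion factor of 1 pg ≈ 978 Mb, which is not stated in the text; the function names are hypothetical, and the input values are the Takifugu figures quoted above.

```python
# Minimal sketch: express a C-value (pg) in Mb and relate an assembly to it.
# Assumption: 1 pg of DNA corresponds to about 978 Mb (a commonly used factor).
MB_PER_PG = 978.0


def pg_to_mb(c_value_pg):
    """Convert a C-value in picograms to megabases."""
    return c_value_pg * MB_PER_PG


def assembly_fraction(assembly_mb, c_value_pg):
    """Fraction of the cytologically estimated genome covered by the assembly."""
    return assembly_mb / pg_to_mb(c_value_pg)


# Takifugu values quoted in the text.
c_low_pg, c_high_pg = 0.40, 0.42
assembly_low_mb, assembly_high_mb = 373.0, 381.0

print("C-value range in Mb:", pg_to_mb(c_low_pg), pg_to_mb(c_high_pg))
print("most pessimistic fraction:", assembly_fraction(assembly_low_mb, c_high_pg))
print("most optimistic fraction:", assembly_fraction(assembly_high_mb, c_low_pg))
```

Under this assumption the C-value range corresponds to roughly 391 to 411 Mb, so the 373–381 Mb assemblies cover about 91 to 97% of the cytological estimate.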
The size of a single chromosome in teleost fishes that have undergone neither an additional whole genome duplication (WGD) nor an extreme expansion of repeats usually ranges from 20 Mb to about 60 Mb. Even in species with highly reduced genomes, chromosome size reaches about 20 Mb (Borůvková et al., 2021; Genome, 2022). Similarly, GC content is a crucial genome trait of far-reaching importance, although it is not yet fully understood (e.g. Matoulek et al., 2020). As such, GC content can take only a certain range of values among eukaryotes and particularly among vertebrates (Borůvková et al., 2021). Hence, there are some clues on how to handle the currently available data; however, extra attention has to be paid and the data need some manual curation. A careful comparison with the literature is crucial. This means first checking the original paper reporting the relevant genome assembly, i.e. Bargelloni et al. (2019) for the Chionodraco hamatus value reported in Randhawa & Pawar, 2021. There, one can find that the proportion of unresolved bases, the “Ns”, is 38.01%, and that another Chionodraco species under study, Chionodraco myersi, has GC = ca. 42%. These are very clear and straightforward clues that the value GC = 25.4% for Chionodraco hamatus is wrong.
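Both plausibility checks from this paragraph can be sketched in Python. The helper names are hypothetical. Two assumptions go beyond the text: a haploid chromosome number of n = 25 for S. pyrenaicus (2n = 50 is common in cyprinids, but the value is not given here), and the idea that the GC = 25.4% figure was computed over the total assembly length including Ns, which the source does not confirm.

```python
# Minimal sketch of two sanity checks; see the stated assumptions above.

def mean_chromosome_size_mb(genome_size_mb, haploid_chromosomes):
    """Average chromosome size implied by a genome size and a haploid count."""
    return genome_size_mb / haploid_chromosomes


def gc_over_resolved_bases(gc_percent_of_total, n_percent):
    """Rescale a GC% computed over all positions to resolved (non-N) bases only."""
    return gc_percent_of_total / (1.0 - n_percent / 100.0)


def gc_percent(sequence):
    """GC% of a sequence, counting only resolved A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    resolved = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / resolved if resolved else float("nan")


# Assumed n = 25 for S. pyrenaicus: 48 Mb would imply about 1.9 Mb per chromosome,
# far below the 20 to 60 Mb range given in the text.
print(mean_chromosome_size_mb(48.0, 25))

# If 25.4% had been computed over a length containing 38.01% Ns, the GC% of the
# resolved bases would be about 41%, close to the ca. 42% of the congener.
print(gc_over_resolved_bases(25.4, 38.01))

# Toy sequence: 2 GC among 4 resolved bases gives 50%, not 2/8 = 25%.
print(gc_percent("GCNNNNAT"))
```

The rescaling is shown only to illustrate how a large proportion of Ns could depress a reported GC%; the original assembly paper remains the place to verify how the value arose.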
It is necessary to bear in mind that the field of fish genomics is highly specific owing to the immense diversity of fishes, which reflects their long evolutionary history (Nelson et al., 2016). This results in a broad range of genome sizes and transposon proportions (Sotero-Caio et al., 2017) in fish genomes, far exceeding the ranges usual in mammalian and avian genomes (Borůvková et al., 2021). Transposons, i.e. mobile genetic elements also known as jumping genes, are one of the major drivers of genome evolution, particularly in fish and with regard to genome size (Borůvková et al., 2021; Matoulek et al., 2020; Sotero-Caio et al., 2017). Transposons cause immense differences in the genome size-to-GC% ratio even within teleosts (e.g. in salmonids, Gaffaroglu et al., 2020). Since transposons can occupy more than half of a genome, their own GC% influences the GC% of their “host genomes” (Boissinot, 2022; Symonová & Suh, 2018). Transposons and their own GC content are also candidate causes of the AT/GC compositional homogeneity known in fish genomes (e.g. Majtánová et al., 2017), with the single exception of gars, a basal ray-finned lineage with mammalian-like AT/GC heterogeneity (Symonová et al., 2016). The basal “fish” lineages differ heavily from teleosts in numerous genome traits. In fact, the deeper we move on the vertebrate phylogenetic tree, the riskier and less relevant it is to compare those groups (lampreys, hagfishes, chondrosteans, lungfishes, bichirs, coelacanths) with teleosts (e.g. Borůvková et al., 2021). Another highly specific aspect of genome evolution in fish lineages is their tolerance of whole genome duplications (WGD; Glasauer & Neuhauss, 2014), with far-reaching implications for their genomes, including the high variability in genome size (Gregory, 2022) linked to variability in chromosome numbers (e.g. in sturgeons and paddlefishes, Symonová et al., 2013, 2017).
Finally, a WGD event can result in transposon reactivation leading to further large-scale genome and chromosome rearrangements. Similarly, hybridization events, which are not infrequent in fish, can also result in transposon reactivation (e.g. Dion-Côté et al., 2014). On the other hand, a genome expansion resembling a duplication can be caused by an extreme amplification of transposons without any link to WGD (e.g. in mudminnows, Lehmann et al., 2021).
Not all currently available fish genomes have been assembled to the chromosome level: as Randhawa & Pawar, 2021 state, only 16.5% (98 species) of the genomes they presented and analysed are available at the chromosome level. This would not be an issue had the authors not used the chromosome numbers for their statistical analyses and downstream comparisons between chondrichthyans and bony fishes. Chondrichthyans do have significantly higher chromosome counts than bony fishes (Gregory, 2022; Uno et al., 2020), although Randhawa & Pawar, 2021 state that these two groups do not differ at the order level. The higher chromosome number may be one of the reasons why, for technical reasons, only chondrichthyans with lower chromosome counts have been sequenced and assembled to the chromosome level. Hence, this technical bias became the source of misunderstanding and misinterpretation by Randhawa & Pawar, 2021. Here, it is necessary to stress that there are further online, publicly available resources of cytogenomic data on chromosome numbers and cytological genome size (C-value; both e.g. Gregory, 2022), fundamental numbers (numbers of chromosome arms, e.g. Arai, 2011) and GC content (Vinogradov, 1998). Another potential issue that can be compensated for by chromosome count data from elsewhere is the exclusion of orders with fewer than three records. Chromosome numbers are not affected by such technical factors; hence they can be combined from multiple sources. On the other hand, chromosome counts and genome size were, for some time, factors influencing the availability of genome assemblies. Only recently, with the increasing availability of genome projects and particularly with the decreasing prices of diverse sequencing methods, have larger genomes been sequenced more frequently.
In spite of the huge amounts of data they hold and their importance, genomic databases and datasets are far from complete, and hence they are not a fully sufficient stand-alone source of information. They need to be regarded and used as such.
Moreover, genome assemblies themselves are still far from complete and perfect (Rhie et al., 2021). Currently, the human genome is the only one that has been completely sequenced and assembled, which was accomplished only this year (Nurk et al., 2022). This process of filling the gaps in the human genome built on the 38th version of the human genome, known as GRCh38.p13 (Schneider et al., 2017). Genomes of other crucial mammalian model species (house mouse, rat, etc.) are not yet complete. Accordingly, the genomes of such a large number of fish species are still full of gaps and/or unresolved bases (“N”, IUPAC, n.d.).
About the journal:
The Journal of Applied Ichthyology publishes articles of international repute on ichthyology, aquaculture, and marine fisheries; ichthyopathology and ichthyoimmunology; environmental toxicology using fishes as test organisms; basic research on fishery management; and aspects of integrated coastal zone management in relation to fisheries and aquaculture. Emphasis is placed on the application of scientific research findings, while special consideration is given to ichthyological problems occurring in developing countries. Article formats include original articles, review articles, short communications and technical reports.