How (not) to read fish genomics data – The importance of cytogenomics knowledge in the current flood of sequenced genomes
Radka Symonová
Journal of Applied Ichthyology, published 2022-11-16
DOI: 10.1111/jai.14365
https://onlinelibrary.wiley.com/doi/10.1111/jai.14365
Citations: 0
Abstract
Biologists have faced a tremendous data explosion in recent years. This is particularly apparent in the still increasing number of sequenced genomes. However, it is not always straightforward to understand and properly use these data, which may be publicly available in an incomplete form. This is the case for the current flood of fish genomes, among others. In reaction to this situation, a recent study by Randhawa and Pawar (2021) tried to exploit these data, but in an improper way. That study is not the only one suffering from serious problems with handling genomic data and contextualizing them in organismal evolution. However, Randhawa and Pawar (2021) accumulated several serious issues in a single paper and then further analysed their incorrect findings. This short communication aims to clarify the problems in that study and to provide some simple hints on how to avoid similar issues. This is particularly relevant for fish genomics, where immense biodiversity results in a so far unexplored diversity of genome traits.
The study by Randhawa and Pawar (2021) is based on the NCBI repository of genomic data (Genome, 2022). Even excellent and indispensable tools such as NCBI/Genome are not flawless, because they rely on data submitted by other scientists. Hence, it is crucial always to check all downloaded data manually. The best way is to sort the dataset by, for example, genome size (GS) and by GC content (GC%). After sorting, the incorrect values are immediately apparent at both ends of the dataset. Namely, incomplete genome assemblies result in genome sizes that are too small (e.g. Squalius pyrenaicus with a nonsensical genome size of 48 Mb reported by Randhawa & Pawar, 2021). Such incomplete assemblies can in turn yield fully aberrant values of GC%, either extremely low (e.g. Chionodraco hamatus with GC = 25.4%) or extremely high (e.g. S. pyrenaicus with GC = 51.1%). Such values have to be discarded. A similar mistake was identified in the paper by Lu & Luo (2020), who reported GC = 31.5% for the channel catfish and claimed this value to be the lowest in their dataset. In this particular case, it is apparent that the value is incorrect and too low for a vertebrate genome. The value of GC% is crucial for several reasons, particularly with regard to genome completeness, since GC-rich regions were underrepresented for technical reasons in earlier versions of genome assemblies (Rhie et al., 2021).
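The sorting check described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the function names and the plausibility bounds (`min_gs_mb`, `gc_range`) are assumptions made for the example, not thresholds given in the paper. The values are those quoted in the text; values not given there are left as `None` rather than guessed.

```python
# Minimal sketch: sort downloaded assembly records and flag implausible values.
# The bounds below are illustrative assumptions, not published thresholds.

records = [
    {"species": "Squalius pyrenaicus", "gs_mb": 48.0, "gc_pct": 51.1},
    {"species": "Chionodraco hamatus", "gs_mb": None, "gc_pct": 25.4},
    {"species": "Takifugu obscurus", "gs_mb": 373.0, "gc_pct": None},
]


def sort_by(records, key):
    """Return the records with a known value for `key`, in ascending order."""
    known = [r for r in records if r[key] is not None]
    return sorted(known, key=lambda r: r[key])


def flag_suspicious(records, min_gs_mb=300.0, gc_range=(35.0, 50.0)):
    """List (species, reasons) for records outside the assumed bounds."""
    flagged = []
    for r in records:
        reasons = []
        if r["gs_mb"] is not None and r["gs_mb"] < min_gs_mb:
            reasons.append("genome size too small")
        if r["gc_pct"] is not None and not gc_range[0] <= r["gc_pct"] <= gc_range[1]:
            reasons.append("GC% out of range")
        if reasons:
            flagged.append((r["species"], reasons))
    return flagged


if __name__ == "__main__":
    for r in sort_by(records, "gs_mb"):
        print(r["species"], r["gs_mb"])
    for species, reasons in flag_suspicious(records):
        print(species, "->", ", ".join(reasons))
```

Fixed bounds like these are only a first filter. Because plausible genome sizes differ between lineages, flagged records still need to be checked against the literature before they are discarded.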
A completely different but equally important issue is where to draw the line between incomplete assemblies and those that are of low(er) quality but still usable. This is crucial for deciding which values are to be discarded and which retained. The issue cannot be solved easily because there is usually no gap between the genome size values; they form a continuous series. Moreover, some genome assemblies are incomplete for technical reasons although they are presented and regarded as “complete” (Rhie et al., 2021), whereas other genome assemblies are incomplete because of the research goal, which is not always to provide assemblies that are as complete as possible (different types of reduced representation sequencing, e.g. Luca et al., 2011; or partial sequencing projects). In fish genomes, the combination of these reasons results in a gradient of values from clearly (sometimes intentionally) incomplete genomes towards less incomplete ones without any obvious gap. Moreover, fish genomes span a wide range of genome sizes. For example, a genome size of 1000 Mb might be correct for one species or lineage but incorrect (incomplete) for another. Such data therefore require active intervention, deeper insight and analysis.
Thinking about genome size should lead us to ask whether a cypriniform fish (S. pyrenaicus, with a genome size of 48 Mb in Randhawa & Pawar, 2021) could have a genome almost ten times smaller than those of tetraodontid fishes (i.e. smooth pufferfish, e.g. Tetraodon or Takifugu), which possess the smallest known vertebrate genomes (Neafsey & Palumbi, 2003). Interestingly, even the latest assembly versions using technologies such as PacBio sequencing produce genomes that are still smaller than those determined by cytological methods (the C-value) for the same species. This can be demonstrated on the smallest vertebrate genomes: the recently sequenced Takifugu obscurus yielded 373–381 Mb (Kang et al., 2020), whereas the C-value in the genus Takifugu is 0.4–0.42 pg (Gregory, 2022).
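To compare an assembly size in Mb with a C-value in pg, the two must be put on the same scale. The sketch below uses the widely used conversion of about 978 Mb per picogram of DNA. That factor is not stated in the text above and is introduced here as an assumption, as are the function names. Note also that the C-value range is quoted for the genus, so the ratios are only indicative for T. obscurus.

```python
# Minimal sketch: convert a C-value (pg) to Mb and compare with assembly size.
# PG_TO_MB is the commonly used conversion factor (1 pg of DNA ~ 978 Mb);
# it is an assumption of this example, not a figure from the text.

PG_TO_MB = 978.0


def c_value_to_mb(c_value_pg: float) -> float:
    """Convert a haploid C-value in picograms to megabases."""
    return c_value_pg * PG_TO_MB


def assembly_to_c_value_ratio(assembly_mb: float, c_value_pg: float) -> float:
    """Fraction of the cytologically estimated genome covered by the assembly."""
    return assembly_mb / c_value_to_mb(c_value_pg)


if __name__ == "__main__":
    # Genus-level C-value range quoted in the text: 0.4-0.42 pg
    print(c_value_to_mb(0.40), c_value_to_mb(0.42))
    # Assembly range quoted in the text: 373-381 Mb
    print(round(assembly_to_c_value_ratio(373.0, 0.42), 2))  # lowest ratio
    print(round(assembly_to_c_value_ratio(381.0, 0.40), 2))  # highest ratio
```

Under this conversion, 0.4–0.42 pg corresponds to roughly 391–411 Mb, so the 373–381 Mb assemblies cover about 91–97% of the cytological estimate. This is consistent with the statement that assemblies remain somewhat smaller than the C-value.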
The size of a single chromosome in teleost fishes that have undergone neither an additional whole genome duplication (WGD) nor any extreme repeat expansion usually ranges from 20 Mb to about 60 Mb. Even in species with highly reduced genomes, chromosome size reaches about 20 Mb (Borůvková et al., 2021; Genome, 2022). Similarly, GC content is a crucial genome trait of far-reaching importance, although it is not yet fully understood (e.g. Matoulek et al., 2020). As such, GC content can reach only certain values among eukaryotes and particularly among vertebrates (Borůvková et al., 2021). Hence, there are some clues on how to handle the currently available data; however, extra attention has to be paid and the data need some manual curation. A careful comparison with the literature is crucial. This means first checking the original paper reporting the relevant genome assembly, i.e. Bargelloni et al. (2019) for the Chionodraco hamatus assembly reported in Randhawa & Pawar (2021). There, one can find that the proportion of unresolved bases, the “Ns”, is 38.01%, and that another Chionodraco species under study, Chionodraco myersi, has GC = ca. 42%. These are very clear and straightforward clues that the value GC = 25.4% in Chionodraco hamatus is wrong.
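Two quick plausibility checks follow from the figures above; the sketch below works them through. Both rest on assumptions that are not stated in the text. The first assumes that the aberrant GC% was computed over the full assembly length with the Ns counted in the denominator, which is only one possible explanation. The second assumes a haploid chromosome number of 25 for S. pyrenaicus, a placeholder that should be replaced by a value from a cytogenetic source.

```python
# Minimal sketch of two sanity checks on reported assembly statistics.
# Both checks rely on assumptions labelled in the comments below.


def gc_over_resolved_bases(gc_pct_total: float, n_pct: float) -> float:
    """Rescale a GC% computed over total length (Ns included) to resolved bases.

    Assumption: Ns were counted in the denominator of the reported GC%.
    """
    if not 0.0 <= n_pct < 100.0:
        raise ValueError("n_pct must be in [0, 100)")
    return gc_pct_total / (1.0 - n_pct / 100.0)


def mean_chromosome_size_mb(genome_size_mb: float, haploid_n: int) -> float:
    """Average chromosome size implied by a genome size and haploid number."""
    return genome_size_mb / haploid_n


if __name__ == "__main__":
    # Chionodraco hamatus: reported GC = 25.4%, unresolved bases = 38.01%
    print(f"{gc_over_resolved_bases(25.4, 38.01):.1f}")  # 41.0

    # Squalius pyrenaicus: reported 48 Mb; haploid_n=25 is an ASSUMED placeholder
    print(mean_chromosome_size_mb(48.0, 25))  # far below the ~20 Mb lower bound
```

Under the first assumption the rescaled GC is about 41%, close to the ca. 42% reported for the congener. Under the second, the implied mean chromosome size is under 2 Mb, an order of magnitude below the roughly 20 Mb seen even in highly reduced teleost genomes.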
It is necessary to bear in mind that the field of fish genomics is highly specific owing to the immense diversity of fishes, which reflects their long evolutionary history (Nelson et al., 2016). This results in a broad range of genome sizes and transposon proportions (Sotero-Caio et al., 2017) in fish genomes, far exceeding the ranges usual in mammalian and avian genomes (Borůvková et al., 2021). Transposons, i.e. mobile genetic elements also known as jumping genes, are one of the major drivers of genome evolution, particularly in fish and with regard to genome size (Borůvková et al., 2021; Matoulek et al., 2020; Sotero-Caio et al., 2017). Transposons cause immense differences in the genome size-to-GC% ratio even within teleosts (e.g. in salmonids, Gaffaroglu et al., 2020). Since transposons can occupy more than half of a genome, their own GC% influences the GC% of their “host genomes” (Boissinot, 2022; Symonová & Suh, 2018). Transposons and their own GC content are also candidate explanations for the AT/GC compositional homogeneity known in fish genomes (e.g. Majtánová et al., 2017), with the single exception of gars, a basal ray-finned lineage with mammalian-like AT/GC heterogeneity (Symonová et al., 2016). The basal “fish” lineages differ heavily from teleosts in numerous genome traits. In fact, the deeper we move in the vertebrate phylogenetic tree, the riskier and less relevant it is to compare those groups (lampreys, hagfishes, chondrosteans, lungfishes, bichirs, coelacanths) with teleosts (e.g. Borůvková et al., 2021). Another highly specific aspect of genome evolution in fish lineages is their tolerance of whole genome duplications (WGD; Glasauer & Neuhauss, 2014), with far-reaching implications for their genomes, including the high variability in genome size (Gregory, 2022) linked to variability in chromosome numbers (e.g. in sturgeon and paddlefish, Symonová et al., 2013, 2017). 
Finally, a WGD event can result in transposon reactivation, leading to further large-scale genome and chromosome rearrangements. Similarly, hybridization events, which are not infrequent in fish, can also result in transposon reactivation (e.g. Dion-Côté et al., 2014). On the other hand, a genome expansion resembling a duplication can be caused by an extreme amplification of transposons without any link to WGD (e.g. in mudminnows, Lehmann et al., 2021).
Not all currently available fish genomes have been assembled to the chromosome level: as Randhawa & Pawar (2021) state, only 16.5% (98 species) of the genomes they presented and analysed are available at the chromosome level. This would not be an issue had the authors not used the chromosome numbers for their statistical analyses and downstream comparisons between chondrichthyans and bony fishes. Chondrichthyans do have significantly higher chromosome counts than bony fishes (Gregory, 2022; Uno et al., 2020), although Randhawa & Pawar (2021) state that these two groups do not differ at the order level. The higher chromosome number may be one of the reasons why, for technical reasons, only chondrichthyans with lower chromosome counts have been sequenced and assembled to the chromosome level. Hence, this technical bias became the source of misunderstanding and misinterpretation by Randhawa & Pawar (2021). Here, it is necessary to stress that there are further online, publicly available resources of cytogenomic data on chromosome numbers and cytological genome size (C-value; both e.g. Gregory, 2022), fundamental numbers (numbers of chromosome arms, e.g. Arai, 2011) and GC content (Vinogradov, 1998). Another potential issue that can be compensated for with chromosome count data from elsewhere is the exclusion of orders with fewer than three records. Chromosome numbers are not influenced by any technical factors; hence they can be combined from multiple sources. On the other hand, chromosome counts and genome size were for a certain time factors influencing the availability of genome assemblies. Only recently, with the increasing availability of genome projects and particularly with the decreasing prices of diverse sequencing methods, have larger genomes been sequenced more frequently. 
In spite of their huge amounts of data and their importance, the genomic databases and datasets are far from complete and hence do not represent a fully sufficient stand-alone source of information. They need to be considered and used accordingly.
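The point about combining chromosome counts from several sources can be illustrated with a short sketch. Everything in it is hypothetical: the species and order labels, the diploid numbers and the function names are placeholders chosen for the example, and the precedence rule (first source wins) is a simplification. It shows only how adding cytogenetic records can lift an order over a minimum-records threshold such as the "fewer than three records" rule mentioned above.

```python
# Minimal sketch: merge chromosome-count records from several sources and
# count records per order. All data below are hypothetical placeholders.
from collections import Counter


def merge_records(*sources):
    """Merge {species: (order, diploid_number)} dicts; earlier sources win."""
    merged = {}
    for source in sources:
        for species, record in source.items():
            merged.setdefault(species, record)
    return merged


def orders_with_enough_records(merged, min_records=3):
    """Return {order: record_count} for orders with at least min_records."""
    counts = Counter(order for order, _ in merged.values())
    return {order: n for order, n in counts.items() if n >= min_records}


# Hypothetical records from chromosome-level assemblies
assembly_records = {
    "species_a": ("Order X", 48),
    "species_b": ("Order X", 50),
}
# Hypothetical records from a cytogenetic database
cytogenetic_records = {
    "species_b": ("Order X", 50),
    "species_c": ("Order X", 52),
}

if __name__ == "__main__":
    print(orders_with_enough_records(assembly_records))
    print(orders_with_enough_records(
        merge_records(assembly_records, cytogenetic_records)))
```

With assembly-derived records alone, the hypothetical Order X has two records and would be excluded; after merging it has three and is retained. In practice, conflicting counts for the same species would need manual resolution rather than a fixed precedence rule.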
Moreover, genome assemblies themselves are still far from complete and perfect (Rhie et al., 2021). Currently, the human genome is the only one that has been completely sequenced and assembled, which was accomplished only this year (Nurk et al., 2022). This process of filling gaps in the human genome built on the 38th version of the human genome, known as GRCh38.p13 (Schneider et al., 2017). Genomes of other crucial mammalian model species (house mouse, rat, etc.) are not yet complete. Accordingly, the genomes of such a large number of fish species are still full of gaps and/or unresolved bases (“N”, IUPAC, n.d.).
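How much of an assembly consists of unresolved bases can be checked directly on the sequence. The following is a minimal sketch using a toy sequence; the function name and the returned fields are assumptions of the example. A real assembly would be read from a FASTA file record by record rather than held in one string.

```python
# Minimal sketch: summarise unresolved bases ("N") in a nucleotide sequence.
import re


def unresolved_summary(sequence: str) -> dict:
    """Count Ns, their fraction of the sequence, and the runs (gaps) they form."""
    seq = sequence.upper()
    n_runs = re.findall(r"N+", seq)  # each maximal stretch of Ns is one gap
    n_count = sum(len(run) for run in n_runs)
    return {
        "length": len(seq),
        "n_count": n_count,
        "n_fraction": n_count / len(seq) if seq else 0.0,
        "n_runs": len(n_runs),
        "longest_run": max((len(run) for run in n_runs), default=0),
    }


if __name__ == "__main__":
    toy = "ACGTNNNNGCnnAT"  # toy sequence, not real data
    print(unresolved_summary(toy))
```

A high `n_fraction` is a warning that summary statistics such as GC% and total assembly length should be interpreted with care for that assembly.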
About the journal:
The Journal of Applied Ichthyology publishes articles of international repute on ichthyology, aquaculture, and marine fisheries; ichthyopathology and ichthyoimmunology; environmental toxicology using fishes as test organisms; basic research on fishery management; and aspects of integrated coastal zone management in relation to fisheries and aquaculture. Emphasis is placed on the application of scientific research findings, while special consideration is given to ichthyological problems occurring in developing countries. Article formats include original articles, review articles, short communications and technical reports.