Seeing the Forest Despite the Trees in Repeat-Rich Genomic Regions
Authors: Amanda M. Larracuente, John S. Sproul
Journal: Molecular Ecology Resources 25(7), published 2025-07-03
DOI: 10.1111/1755-0998.70008
URL: https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.70008
Citation count: 0
Abstract
Technological advances are producing genome assemblies of increasing quality at steadily decreasing costs. These assemblies enable the extraction of rich biological information from previously inaccessible genomic regions (e.g., repeat-rich regions) and from diverse organisms underrepresented in genomic research. Gaining functional insights from new assemblies often requires generating additional data sets, experimental approaches and complex analysis. Novel analytical methods that substantially shorten the path to biological insights are valuable, particularly if they draw conclusions from the direct analysis of assemblies. In this issue of Molecular Ecology Resources, Elphinstone et al. (2025) present RepeatOBserver—a tool to visualise repeat organisation through direct analysis of chromosome-scale assemblies. This tool facilitates the summary and visualisation of large- and fine-scale patterns of repetitive DNA sequence structure across assemblies. Their approach borrows metrics from information theory, which have found uses in ecology (i.e., the Shannon Diversity Index), to help infer functional regions within repetitive sequences including putative centromeres. Importantly, RepeatOBserver does not require annotations, repeat libraries or functional genomic data—just a high-quality assembly. This type of tool addresses ongoing challenges in mapping the structure and functions of repeat-rich chromosomal regions, which remain the least well-understood components of genomes.
The availability of chromosome-scale genome assemblies is growing rapidly, as advances in long-read sequencing technology make assembly-based approaches accessible to more taxa. These genome assemblies can reveal important insights into genome biology, biomedicine and biodiversity. Our ability to extract these insights from assemblies is built on decades of hard-won work in early genomic model organisms. For example, early work on gene structure, regulation and evolution provided a knowledge base for ab initio gene prediction from nothing more than a DNA sequence. While annotation tools for non-coding sequences like the abundant repetitive DNAs found in most eukaryotic genomes are now accessible, the methods to extract insights from these regions are less mature. Repetitive DNAs evolve rapidly: their composition, organisation and abundance vary across species (Yunis and Yasmineh 1971), making predictions based on sequence conservation difficult. Many insights require functional genomic data (e.g., ChIP-seq, methylation and ATAC-seq), which may be challenging to access in non-model systems. Despite recent progress in resolving repeats in chromosome-scale assemblies, their assembly and annotation remain non-trivial problems (Lower et al. 2018).
Some genome regions with critical functions are enriched in, or entirely composed of, repeated DNA sequences. Centromeres, the essential structures that guide chromosome segregation during cell division, are typically embedded in repetitive regions and remain perhaps the most challenging genome regions to predict. Centromere prediction is especially challenging because centromeres (1) are generally defined by the presence of a centromere-specific histone variant (CENP-A) rather than by specific DNA sequences; and (2) tend to occur in repeat-dense chromosomal regions enriched in satellite DNA and/or transposable elements, which can be arranged as higher-order repeats (reviewed in Allshire and Karpen 2008). Centromeres vary widely across species in their size and organisation (reviewed in Hartley and O'Neill 2019). Genomic studies have only recently begun to reveal the detailed organisation of centromeres of some fungi, plants and animals. Some interesting patterns are emerging from these studies: the functional centromere core can correlate with regions of low repeat diversity and a regional dip in DNA methylation (e.g., Altemose et al. 2022). However, inferring that a repetitive DNA is part of one of these functional domains remains a challenge because this requires both high-quality assemblies and additional supporting data sets (e.g., DNA-protein interaction and methylation data).
RepeatOBserver (Elphinstone et al. 2025) is a tool for visualising and analysing genomic repeat patterns that does not require a priori repeat annotations or additional genomic data (https://github.com/celphin/RepeatOBserverV1). Visualising the distribution and diversity of repeats across a genome can flag features consistent with functional regions like centromeres and help identify structural patterns in other repeat-rich regions. This tool can help leverage the increasing number of chromosome-scale assemblies to provide important insights into genome biology.
RepeatOBserver's visualisations are built around Fourier transforms of DNA walks through chromosome-scale assemblies (Figure 1). The Fourier transform is a mathematical approach that breaks down complex signals into their component frequencies. When applied to genomic sequence data, it can help identify biological patterns including gene structure, phylogenetic relationships and repeat motifs. The ability to detect hidden periodicity in DNA sequence data makes Fourier transforms a sound choice for repeat identification (e.g., Sharma et al. 2004). RepeatOBserver uses DNA walks that slide across chromosomes, converting nucleotides to numerical values based on sequence composition. The tool then applies a Fast Fourier Transform (FFT) and plots the resulting spectra as heat maps (Figure 1). These plots reveal repeat organisation across each chromosome, highlighting the location, length and abundance of repeats.
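To make the idea of a Fourier transform over a DNA walk concrete, here is a minimal sketch rather than RepeatOBserver's actual implementation. The purine/pyrimidine step mapping (`STEP`) and the helper names `dna_walk`, `power_spectrum` and `dominant_period` are assumptions chosen for illustration, and a naive discrete Fourier transform stands in for the FFT so that the example needs only the standard library.

```python
import cmath

# Hypothetical mapping (an assumption, not taken from the tool):
# purines step up, pyrimidines step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}

def dna_walk(seq):
    """Cumulative walk over a sequence; unknown bases such as N add a zero step."""
    walk, position = [], 0
    for base in seq.upper():
        position += STEP.get(base, 0)
        walk.append(position)
    return walk

def power_spectrum(signal):
    """Naive O(n^2) DFT power for frequency indices 1..n//2 (the DC term is skipped)."""
    n = len(signal)
    powers = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        powers.append(abs(coeff) ** 2)
    return powers

def dominant_period(seq):
    """Repeat length (in bases) of the strongest spectral peak in one window."""
    powers = power_spectrum(dna_walk(seq))
    k = powers.index(max(powers)) + 1  # +1 because index 0 holds frequency 1
    return len(seq) / k

# A toy tandem repeat with a 4 bp unit.
print(dominant_period("AACC" * 8))
```

In this toy window the walk is a pure oscillation with a period of four bases, so the strongest peak corresponds to the 4 bp repeat unit. Real sequence windows with a net compositional drift would need detrending before the transform, and an FFT would replace the quadratic loop for windows of realistic size.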
Contrasting patterns of repeat diversity and abundance can reveal structure in genomic regions that are otherwise difficult to analyse. RepeatOBserver summarises patterns of repeat diversity by borrowing an information theory approach (Shannon 1948) frequently used in ecology. The Shannon Diversity Index (SDI) typically describes species diversity but is finding uses beyond ecological studies. Here, rather than describing species diversity within a geographic area, the genomic SDI summarises diversity of different repeats within a sequence window based on the Fourier spectra. Plotting the SDIs and repeat abundance can identify trends in repeat diversity across chromosomes.
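The Shannon index itself is simple to compute: H = -Σ p_i ln p_i, where p_i is the proportion contributed by category i. The sketch below applies it to a list of non-negative abundances. Treating the spectral power at each repeat length as an "abundance" is an assumption made here for illustration; the text does not spell out the exact normalisation RepeatOBserver uses, and the function name `shannon_diversity` is hypothetical.

```python
import math

def shannon_diversity(abundances):
    """Shannon index H = -sum(p * ln p) over categories with non-zero abundance."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    h = 0.0
    for a in abundances:
        if a > 0:
            p = a / total
            h -= p * math.log(p)
    return h

# One dominant repeat length gives zero diversity, whereas an even
# spread over four repeat lengths gives the maximum, ln(4).
print(shannon_diversity([10, 0, 0, 0]))
print(shannon_diversity([1, 1, 1, 1]))
```

Under this reading, a window dominated by a single homogenised repeat yields a low index, and a window containing many different repeat lengths yields a high one.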
An exciting application of SDI values is in predicting functional chromosome regions like centromeres based only on DNA sequence data (Figure 1). On monocentric chromosomes rich in tandem satellite DNAs, like those found in humans, centromeres tend to form on homogenised, younger repeat regions within larger repeat arrays (Altemose et al. 2022; reviewed in Naish and Henderson 2024). These centromere regions should thus have low repeat diversity and appear as local minima in SDI plots. Alternatively, centromeres of monocentric chromosomes enriched in transposons, like those of wheat, should appear as regions of high repeat abundance. The many dispersed centromeres of holocentric chromosomes can appear as local SDI minima on dispersed repeat clusters across the chromosome. Thus, RepeatOBserver's summaries of repeat diversity and abundance can provide a starting point for putative centromere identification for a range of known centromere types. The authors validated many centromere predictions in 12 plant and animal species (159 chromosomes) representing varying centromere sizes and organisation, each with experimental data supporting putative centromere location. The overall patterns of repeat distributions reported by RepeatOBserver largely match expectations for well-curated plant and animal genomes, such as the locations of pericentromeric satellites and the boundaries between centromere, heterochromatin and euchromatin domains.
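The "local minima in SDI plots" idea can be sketched as a scan over a per-window SDI profile. This is an illustrative outline only: the function names, the strict-neighbour definition of a minimum and the 5000 bp window size are all assumptions, and any window it flags is a candidate to be checked against other evidence, not a centromere call.

```python
def local_minima(profile):
    """Indices of interior windows whose value is strictly below both neighbours."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]]

def candidate_regions(profile, window_bp):
    """Rank local minima from the lowest value upward as (start_bp, end_bp, value)."""
    hits = sorted(local_minima(profile), key=lambda i: profile[i])
    return [(i * window_bp, (i + 1) * window_bp, profile[i]) for i in hits]

# Made-up SDI values for seven consecutive windows (not real data).
sdi_profile = [2.1, 2.0, 1.2, 1.9, 2.2, 2.0, 2.1]
print(candidate_regions(sdi_profile, window_bp=5000))
```

Ranking by depth puts the most homogenised window first. In practice the profile would probably need smoothing, since single-window fluctuations also satisfy the strict-neighbour test.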
The RepeatOBserver output can help identify other structural patterns, including blocks of pericentromeric and subtelomeric repeats, multicopy gene families, some structural rearrangements and higher-order repeats. This tool works even for imperfect higher-order repeats, which can be challenging to identify with standard annotation tools. There are some limitations to this approach, as functional predictions like centromere identification still need to be validated with other tools. However, RepeatOBserver can provide preliminary insights that can serve as a springboard to questions about genome structure, function and evolution in novel genomes. Genome visualisation tools like RepeatOBserver are timely and can accelerate discoveries in difficult genome regions at a time in which repeat-resolved assemblies are increasingly accessible across taxa.
The cross-pollination of theory and methods among scientific fields can drive rapid advances in biology. Genomics as a field represents a convergence of methods assembled from multiple disciplines. Principles of ecology show excellent promise in understanding complex interactions that comprise genomes, their regulators and their products (e.g., Brookfield 2005). The continued integration of methods, tools and thinking across disciplines may help us leverage the vast diversity of available genomes toward a more holistic understanding of their stunning complexity.
About the journal:
Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines.
In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.