Seeing the Forest Despite the Trees in Repeat-Rich Genomic Regions

IF 5.5 | Zone 1, Biology | Q1 BIOCHEMISTRY & MOLECULAR BIOLOGY

Molecular Ecology Resources Pub Date : 2025-07-03 DOI:10.1111/1755-0998.70008

Amanda M. Larracuente, John S. Sproul


Abstract

Technological advances are producing genome assemblies of increasing quality at steadily decreasing costs. These assemblies enable the extraction of rich biological information from previously inaccessible genomic regions (e.g., repeat-rich regions) and from diverse organisms underrepresented in genomic research. Gaining functional insights from new assemblies often requires generating additional data sets, experimental approaches and complex analysis. Novel analytical methods that substantially shorten the path to biological insights are valuable, particularly if they draw conclusions from the direct analysis of assemblies. In this issue of Molecular Ecology Resources, Elphinstone et al. (2025) present RepeatOBserver—a tool to visualise repeat organisation through direct analysis of chromosome-scale assemblies. This tool facilitates the summary and visualisation of large- and fine-scale patterns of repetitive DNA sequence structure across assemblies. Their approach borrows metrics from information theory, which have found uses in ecology (i.e., the Shannon Diversity Index), to help infer functional regions within repetitive sequences including putative centromeres. Importantly, RepeatOBserver does not require annotations, repeat libraries or functional genomic data—just a high-quality assembly. This type of tool addresses ongoing challenges in mapping the structure and functions of repeat-rich chromosomal regions, which remain the least well-understood components of genomes.

The availability of chromosome-scale genome assemblies is growing rapidly, as advances in long-read sequencing technology make assembly-based approaches accessible to more taxa. These genome assemblies can reveal important insights into genome biology, biomedicine and biodiversity. Our ability to extract these insights from assemblies is built on decades of hard-won work in early genomic model organisms. For example, early work on gene structure, regulation and evolution provided a knowledge base for ab initio gene prediction from nothing more than a DNA sequence. While annotation tools for non-coding sequences, like the abundant repetitive DNAs found in most eukaryotic genomes, are now accessible, the methods to extract insights from these regions are less mature. Repetitive DNAs evolve rapidly: their composition, organisation and abundance vary across species (Yunis and Yasmineh 1971), making predictions based on sequence conservation difficult. Many insights require functional genomic data (e.g., ChIP-seq, methylation and ATAC-seq), which may be challenging to obtain in non-model systems. Despite recent progress in resolving repeats in chromosome-scale assemblies, their assembly and annotation remain non-trivial problems (Lower et al. 2018).

Some genome regions with critical functions are enriched in, or entirely composed of, repeated DNA sequences. Centromeres, the essential structures that guide chromosome segregation during cell division, are typically embedded in repetitive regions and remain perhaps the most challenging genome regions to predict. Predicting centromeres is especially challenging because: (1) centromeres are generally defined by the presence of a centromere-specific histone variant (CENP-A) rather than by specific DNA sequences; and (2) they tend to occur in repeat-dense chromosomal regions enriched in satellite DNA and/or transposable elements, which can be arranged as higher-order repeats (reviewed in Allshire and Karpen 2008). Centromeres vary widely across species in their size and organisation (reviewed in Hartley and O'Neill 2019). Genomic studies have only recently begun to reveal the detailed organisation of centromeres of some fungi, plants and animals. Some interesting patterns are emerging from these studies: the functional centromere core can correlate with regions of low repeat diversity and a regional dip in DNA methylation (e.g., Altemose et al. 2022). However, inferring that a repetitive DNA is part of one of these functional domains remains a challenge because this requires both high-quality assemblies and additional supporting data sets (e.g., DNA-protein interaction and methylation data).

RepeatOBserver (Elphinstone et al. 2025) is a tool for visualising and analysing genomic repeat patterns that does not require a priori repeat annotations or additional genomic data (https://github.com/celphin/RepeatOBserverV1). Visualising the distribution and diversity of repeats across a genome can flag features consistent with functional regions like centromeres and help identify structural patterns in other repeat-rich regions. This tool can help leverage the increasing number of chromosome-scale assemblies to provide important insights into genome biology.

RepeatOBserver's visualisations are built around Fourier transforms of DNA walks through chromosome-scale assemblies (Figure 1). The Fourier transform is a mathematical approach that breaks down complex signals into their component frequencies. When applied to genomic sequence data, it can help identify biological patterns including gene structure, phylogenetic relationships and repeat motifs. The ability to detect hidden periodicity in DNA sequence data makes Fourier transforms a sound choice for repeat identification (e.g., Sharma et al. 2004). RepeatOBserver uses DNA walks that slide across chromosomes, converting nucleotides to numerical values based on sequence composition. The tool then applies a Fast Fourier Transform (FFT) and plots the resulting spectra as heat maps (Figure 1). These plots reveal repeat organisation across each chromosome, highlighting the location and length of repeats and their abundances.
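To make the idea of a Fourier transform of a DNA walk concrete, the following is a minimal sketch and not RepeatOBserver's actual implementation. The purine/pyrimidine step encoding, the function names and the toy sequence are all assumptions chosen for illustration, and a naive discrete Fourier transform is used in place of an FFT so that the example needs only the Python standard library (an FFT would give the same spectrum faster).

```python
import cmath

# Hypothetical encoding (an assumption, not the tool's scheme):
# purines step up, pyrimidines step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_walk(seq):
    """Cumulative sum of per-base steps; unknown bases (e.g. N) add 0."""
    walk, position = [], 0
    for base in seq.upper():
        position += STEP.get(base, 0)
        walk.append(position)
    return walk


def magnitude_spectrum(signal):
    """Naive O(n^2) DFT magnitudes for k = 1 .. n//2, after removing the mean."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        spectrum.append(abs(coeff))
    return spectrum


def dominant_period(seq):
    """Repeat length (in bases) at the strongest frequency of the walk."""
    spectrum = magnitude_spectrum(dna_walk(seq))
    best_k = max(range(len(spectrum)), key=spectrum.__getitem__) + 1
    return len(seq) / best_k


# Ten tandem copies of a 6-bp motif: the strongest peak sits at period 6.
print(dominant_period("AAGCTT" * 10))
```

Computing such a spectrum for successive windows along a chromosome and stacking the results would give the kind of position-by-repeat-length heat map described above; the window size and encoding are choices a real analysis would need to make.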

Contrasting patterns of repeat diversity and abundance can reveal structure in genomic regions that are otherwise difficult to analyse. RepeatOBserver summarises patterns of repeat diversity by borrowing an information theory approach (Shannon 1948) frequently used in ecology. The Shannon Diversity Index (SDI) typically describes species diversity but is finding uses beyond ecological studies. Here, rather than describing species diversity within a geographic area, the genomic SDI summarises diversity of different repeats within a sequence window based on the Fourier spectra. Plotting the SDIs and repeat abundance can identify trends in repeat diversity across chromosomes.
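As a rough illustration of how a Shannon Diversity Index can be computed from a Fourier spectrum, the sketch below treats each repeat length in a window's spectrum as a "species" and its spectral magnitude as that species' abundance. This normalisation and the function name are assumptions made for the example; the exact definition used by RepeatOBserver may differ.

```python
import math


def shannon_diversity(spectrum):
    """Shannon index H = -sum(p_i * ln p_i) over normalised spectral magnitudes.

    Each entry of `spectrum` is a non-negative magnitude for one repeat length.
    Zero entries are skipped, since p * ln(p) tends to 0 as p tends to 0.
    """
    total = sum(spectrum)
    if total == 0:
        return 0.0
    proportions = [value / total for value in spectrum if value > 0]
    return -sum(p * math.log(p) for p in proportions)


# A window dominated by a single repeat length has low diversity ...
print(shannon_diversity([40.0, 0.0, 0.0, 0.0]))
# ... while signal spread evenly over four repeat lengths gives ln(4).
print(shannon_diversity([1.0, 1.0, 1.0, 1.0]))
```

Under this sketch, a homogenised satellite array would concentrate signal in a few repeat lengths and score low, whereas a window containing many different repeats would score high.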

An exciting application of SDI values is in predicting functional chromosome regions like centromeres based only on DNA sequence data (Figure 1). In monocentric chromosomes rich in tandem satellite DNAs, like those found in humans, centromeres tend to form on homogenised younger repeat regions within larger repeat arrays (Altemose et al. 2022; reviewed in Naish and Henderson 2024). The centromere regions should thus have low repeat diversity and appear as local minima in SDI plots. Alternatively, in species with monocentric chromosomes enriched in transposons, like those of wheat, centromeres should appear as regions of high repeat abundance. In holocentric chromosomes, the many dispersed centromeres can appear as local SDI minima on dispersed repeat clusters across the chromosome. Thus, RepeatOBserver's summaries of repeat diversity and abundance can provide a starting point for putative centromere identification for a range of known centromere types. The authors validated many centromere predictions in 12 plant and animal species (159 chromosomes) representing varying centromere sizes and organisation, each with experimental data supporting putative centromere location. The overall patterns of repeat distributions reported by RepeatOBserver largely match expectations for well-curated plant and animal genomes, such as the locations of pericentromeric satellites and the boundaries between centromere, heterochromatin and euchromatin domains.
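The idea of reading candidate centromeres off an SDI profile as local minima can be sketched as follows. This is an illustrative assumption about how one might scan a per-window SDI series, not the procedure used by the authors; the function name, the `flank` parameter and the example values are hypothetical.

```python
def local_minima(values, flank=2):
    """Indices whose value is strictly lower than every neighbour within `flank` windows."""
    minima = []
    for i in range(flank, len(values) - flank):
        neighbours = values[i - flank:i] + values[i + 1:i + flank + 1]
        if all(values[i] < v for v in neighbours):
            minima.append(i)
    return minima


# Hypothetical per-window SDI values along one chromosome.
sdi_profile = [3.1, 3.0, 2.9, 1.2, 2.8, 3.0, 3.2, 3.1]
print(local_minima(sdi_profile))  # → [3]
```

A single pronounced dip would be consistent with a satellite-rich monocentric centromere, while several dispersed dips would be more in line with the holocentric pattern described above. For transposon-rich centromeres the analogous scan would look for maxima in repeat abundance instead, and in every case the flagged windows are only candidates that still need independent validation.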

The RepeatOBserver output can help identify other structural patterns, including blocks of pericentromeric and subtelomeric repeats, multicopy gene families, some structural rearrangements and higher-order repeats. This tool works even for imperfect higher-order repeats, which can be challenging to identify with standard annotation tools. There are some limitations to this approach, as functional predictions like centromere identification still need to be validated with other tools. However, RepeatOBserver can provide preliminary insights that can serve as a springboard to questions about genome structure, function and evolution in novel genomes. Genome visualisation tools like RepeatOBserver are timely and can accelerate discoveries in difficult genome regions at a time in which repeat-resolved assemblies are increasingly accessible across taxa.

The cross-pollination of theory and methods among scientific fields can drive rapid advances in biology. Genomics as a field represents a convergence of methods assembled from multiple disciplines. Principles of ecology show excellent promise in understanding complex interactions that comprise genomes, their regulators and their products (e.g., Brookfield 2005). The continued integration of methods, tools and thinking across disciplines may help us leverage the vast diversity of available genomes toward a more holistic understanding of their stunning complexity.

The authors declare no conflicts of interest.


Source journal: Molecular Ecology Resources (Biology, Evolutionary Biology)

CiteScore: 15.60 | Self-citation rate: 5.20% | Articles published: 170 | Review time: 3 months

Journal description: Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines. In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.