MEM-based pangenome indexing for k-mer queries.

IF 1.7; Zone 4, Biology; Q4, Biochemical Research Methods

Algorithms for Molecular Biology Pub Date : 2025-03-01 DOI:10.1186/s13015-025-00272-y

Stephen Hwang, Nathaniel K Brown, Omar Y Ahmed, Katharine M Jenike, Sam Kovaka, Michael C Schatz, Ben Langmead



Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports two kinds of queries: membership queries, which test k-mer presence/absence, and conservation queries, which count the number of genomes containing k-mers in a window. MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
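The link between MEMs and k-mer queries can be illustrated with a toy sketch. It is not MEMO's implementation or index layout, which the abstract does not describe. It assumes that, for each non-reference genome, we already hold the full set of MEMs against a reference as half-open intervals in reference coordinates. Under that assumption, a k-mer starting at a reference position occurs in a genome exactly when it lies inside one of that genome's MEM intervals, for any k. The names `kmer_present`, `conservation`, and the sample haplotypes are hypothetical.

```python
from typing import Dict, List, Tuple

# Half-open interval [start, end) in reference coordinates (assumed layout).
Interval = Tuple[int, int]


def kmer_present(mems: List[Interval], pos: int, k: int) -> bool:
    """Membership query: is the reference k-mer at `pos` inside one MEM?"""
    return any(start <= pos and pos + k <= end for start, end in mems)


def conservation(
    mems_by_genome: Dict[str, List[Interval]], window: Interval, k: int
) -> List[int]:
    """Conservation query: for each k-mer start in `window`, count the
    genomes (excluding the reference itself) that contain that k-mer."""
    lo, hi = window
    return [
        sum(kmer_present(mems, pos, k) for mems in mems_by_genome.values())
        for pos in range(lo, hi - k + 1)
    ]


# Toy data: hapA matches the whole 10-base window; hapB has a mismatch
# at reference position 4, which splits its match into two MEMs.
mems_by_genome = {
    "hapA": [(0, 10)],
    "hapB": [(0, 4), (5, 10)],
}

# The same intervals answer queries for different k without re-indexing.
counts_k3 = conservation(mems_by_genome, (0, 10), 3)
counts_k5 = conservation(mems_by_genome, (0, 10), 5)
print(counts_k3)
print(counts_k5)
```

The linear scan over intervals keeps the sketch short; a practical index would need a sorted or compressed interval representation to answer queries efficiently at pangenome scale.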




Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Review time: >12 weeks

Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.