Advancements in colored k-mer sets: essentials for the curious
Camille Marchet
arXiv - QuanBio - Genomics, 2024-09-08
DOI: arxiv-2409.05214 (https://doi.org/arxiv-2409.05214)
Abstract
This paper provides a comprehensive review of recent advancements in
k-mer-based data structures representing collections of several samples
(sometimes called colored de Bruijn graphs) and their applications in
large-scale sequence indexing and pangenomics. The review explores the
evolution of k-mer set representations, highlighting the trade-offs between
exact and inexact methods, as well as the integration of compression strategies
and modular implementations. I discuss the impact of these structures on
practical applications and describe recent uses of these methods for
analysis. By surveying state-of-the-art techniques and identifying emerging
trends, this work aims to guide researchers in selecting and developing methods
for large-scale, reference-free genomic data. For a broader overview of
k-mer set representations and foundational data structures, see the
accompanying article on practical k-mer sets.