Advancements in practical k-mer sets: essentials for the curious
Camille Marchet
arXiv - QuanBio - Genomics, 2024-09-08
DOI: arxiv-2409.05210 (https://doi.org/arxiv-2409.05210)
Citations: 0
Abstract
This paper provides a comprehensive survey of data structures for
representing k-mer sets, which are fundamental in high-throughput sequencing
analysis. It categorizes the methods into two main strategies: those using
fingerprinting and hashing for compact storage, and those leveraging
lexicographic properties for efficient representation. The paper reviews key
operations supported by these structures, such as membership queries and
dynamic updates, and highlights recent advancements in memory efficiency and
query speed. A companion paper explores colored k-mer sets, which extend these
concepts to integrate multiple datasets or genomes.
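
The two strategies named in the abstract can be illustrated with a toy sketch. It is not taken from the paper and does not reproduce any specific method it reviews; the class names `FingerprintSet` and `SortedKmerSet`, the BLAKE2b hash, and the 16-bit fingerprint width are all assumptions chosen for illustration. The first class stores only short fingerprints (compact, but membership queries may return false positives); the second keeps the distinct k-mers in lexicographic order and answers queries exactly by binary search.

```python
import bisect
import hashlib


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def fingerprint(kmer, bits=16):
    """Map a k-mer to a short integer fingerprint (hash choice is illustrative)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class FingerprintSet:
    """Hashing strategy: store fingerprints only, so false positives are possible."""

    def __init__(self, seq, k, bits=16):
        self.k, self.bits = k, bits
        self.fps = {fingerprint(km, bits) for km in kmers(seq, k)}

    def __contains__(self, kmer):
        return fingerprint(kmer, self.bits) in self.fps

    def add(self, kmer):
        # Dynamic update: inserting a k-mer only needs its fingerprint.
        self.fps.add(fingerprint(kmer, self.bits))


class SortedKmerSet:
    """Lexicographic strategy: keep distinct k-mers sorted, query by binary search."""

    def __init__(self, seq, k):
        self.k = k
        self.sorted_kmers = sorted(set(kmers(seq, k)))

    def __contains__(self, kmer):
        i = bisect.bisect_left(self.sorted_kmers, kmer)
        return i < len(self.sorted_kmers) and self.sorted_kmers[i] == kmer


if __name__ == "__main__":
    seq, k = "ACGTACGTGA", 4
    approx, exact = FingerprintSet(seq, k), SortedKmerSet(seq, k)
    print("ACGT" in approx, "ACGT" in exact)
    print(exact.sorted_kmers)
```

In this sketch the fingerprint set supports cheap insertions, while the sorted list would have to be re-sorted or rebuilt to stay ordered after an update; that is one simple way to see why the supported operations differ between the two families.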