Swiftly identifying strongly unique k-mers.

IF 1.7 · Zone 4 (Biology) · JCR Q4 (Biochemical Research Methods)

Algorithms for Molecular Biology Pub Date : 2025-07-13 DOI:10.1186/s13015-025-00286-6

Jens Zentgraf, Sven Rahmann

Volume 20, issue 1, page 13. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257829/pdf/

Citations: 0

Abstract

Motivation: Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, or in a single species from a larger set of species) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: a single base-pair change may turn a unique k-mer into a different k-mer that is in fact present at one or more other locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in any of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those without one.
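To make the neighbor definition concrete, here is a minimal brute-force sketch in Python. It is not the authors' method, only a naive baseline that enumerates all 3k single-base substitutions of each k-mer and looks them up in the set. The 2-bit encoding (A=0, C=1, G=2, T=3), the function names, and the choice of the numerically smaller of a k-mer and its reverse complement as the canonical form are assumptions made for illustration. So is the decision not to count a k-mer as its own neighbor when a substitution yields its own reverse complement.

```python
# Assumed 2-bit encoding; with it, the complement of a base code b is 3 - b.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code


def revcomp(code: int, k: int) -> int:
    """Reverse complement of an encoded k-mer."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (code & 3))  # complement the last base, append it
        code >>= 2
    return rc


def canonical(code: int, k: int) -> int:
    """Canonical form (assumption: the smaller of the two encodings)."""
    return min(code, revcomp(code, k))


def has_neighbor(x: int, kmers: set[int], k: int) -> bool:
    """True if some Hamming-distance-1 neighbor of x is in the canonical set."""
    for pos in range(k):
        for delta in (1, 2, 3):  # XOR with 1, 2, 3 gives the three other bases
            neighbor = canonical(x ^ (delta << (2 * pos)), k)
            # Assumption: a neighbor that canonicalizes to x itself is ignored.
            if neighbor != x and neighbor in kmers:
                return True
    return False


def weak_kmers(kmers: set[int], k: int) -> set[int]:
    """All canonical k-mers in the set that have a distance-1 neighbor in it."""
    return {x for x in kmers if has_neighbor(x, kmers, k)}
```

For example, with k = 3 and the input strings ACG, AGT and GGG, the canonical set contains ACG, ACT (the reverse complement of AGT) and CCC. ACG and ACT differ in one base, so both are marked, while CCC is not. This baseline performs 3k set lookups per k-mer, which is the kind of cost that motivates the engineered algorithms described in the Results.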

Results: We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billion k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.
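The abstract does not spell out the bit-parallel test or the bucketing scheme, so the following Python sketch is only one plausible reading, not the published implementation. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3), uses hypothetical function names, buckets with a dictionary instead of scanning a sorted array, and ignores reverse complements for brevity. The bucketing relies on a pigeonhole argument: two k-mers at Hamming distance 1 differ in exactly one base, so they agree either on the upper half or on the lower half of their bits and meet in a bucket in one of two passes.

```python
from collections import defaultdict

# Assumed 2-bit encoding, for illustration only.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENC[base]
    return code


def is_hd1(x: int, y: int, k: int) -> bool:
    """Bit-parallel test: do two encoded k-mers differ in exactly one base?"""
    low_bits = int("01" * k, 2)                   # lower bit of every 2-bit base
    diff = x ^ y                                  # nonzero pair = differing base
    per_base = (diff | (diff >> 1)) & low_bits    # one flag bit per differing base
    # Exactly one flag set: nonzero, and clearing the lowest set bit gives zero.
    return per_base != 0 and (per_base & (per_base - 1)) == 0


def mark_weak_bucketed(kmers: list[int], k: int) -> set[int]:
    """Mark k-mers that have a distance-1 neighbor, comparing only within buckets."""
    low_width = 2 * (k // 2)                      # number of bits in the lower half
    low_mask = (1 << low_width) - 1
    keys = (
        lambda x: x >> low_width,                 # pass 1: bucket by upper half
        lambda x: x & low_mask,                   # pass 2: bucket by lower half
    )
    weak = set()
    for key in keys:
        buckets = defaultdict(list)
        for x in kmers:
            buckets[key(x)].append(x)
        for bucket in buckets.values():
            for i, x in enumerate(bucket):
                for y in bucket[i + 1:]:
                    if is_hd1(x, y, k):
                        weak.update((x, y))
    return weak
```

With k = 4 and the k-mers ACGT, ACGA, TCGT and GGGG, the first pass finds the pair ACGT/ACGA (shared prefix AC) and the second pass finds ACGT/TCGT (shared suffix GT), while GGGG stays unmarked. The pairwise loop is quadratic in bucket size, which is why such a test is only attractive when buckets are small.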

Availability: An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers .


Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Articles published: not listed

Review time: >12 weeks

Journal description: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.