KmerEstimate: A Streaming Algorithm for Estimating k-mer Counts with Optimal Space Usage

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2018-08-15. DOI: 10.1145/3233547.3233587

S. Behera, Sutanu Gayen, J. Deogun, N. V. Vinodchandran


Citations: 9

Abstract

The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Examples include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive-sampling streaming algorithm due to Bar-Yossef et al. for approximating the number of distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard, as its sample size is almost 85% smaller than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
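
The adaptive-sampling idea named in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation (see their repository for that). It is a minimal illustration under stated assumptions: the function names, the BLAKE2b-based hash, and the `max_sample` threshold are hypothetical choices made here, and canonical k-mers (reverse complements) are ignored. The sketch keeps only k-mers whose hash has at least `level` trailing zero bits, raises `level` whenever the sample grows too large, and scales the sampled counts by 2^level at the end.

```python
import hashlib
from collections import Counter


def trailing_zeros(x: int, width: int = 64) -> int:
    """Number of trailing zero bits in x (returns width when x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def hash64(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (an illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_kmer_histogram(reads, k, max_sample=1024):
    """Estimate f_i, the number of distinct k-mers occurring exactly i times.

    Returns (f, f0): f maps a frequency i to the estimated f_i, and f0 is
    the estimated number of distinct k-mers. max_sample is an assumed knob.
    """
    level = 0    # a k-mer is sampled with probability 2**-level
    sample = {}  # sampled k-mer -> exact occurrence count so far

    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in sample:
                sample[kmer] += 1
                continue
            # Not in the sample: admit it only if its hash passes the current
            # level. Levels never decrease, so a k-mer that passes now was
            # never evicted earlier, and this is its first occurrence.
            if trailing_zeros(hash64(kmer)) < level:
                continue
            sample[kmer] = 1
            # Sample too large: raise the level and evict k-mers that fail it.
            while len(sample) > max_sample:
                level += 1
                sample = {
                    km: c for km, c in sample.items()
                    if trailing_zeros(hash64(km)) >= level
                }

    scale = 2 ** level
    histogram = Counter(sample.values())
    f = {freq: n * scale for freq, n in histogram.items()}
    return f, len(sample) * scale
```

When the input has fewer distinct k-mers than `max_sample`, the level stays at 0 and the result is exact. For example, `estimate_kmer_histogram(["ACGTACGT"], k=4)` sees ACGT twice and CGTA, GTAC, TACG once each, so it reports three k-mers with frequency 1, one k-mer with frequency 2, and four distinct k-mers. On larger inputs the memory used by the sample is bounded by `max_sample` entries, at the cost of an approximate answer.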



