A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2018-08-15. DOI: 10.1145/3233547.3233564

Daniel R. Allen, Sharma V. Thankachan, Bojian Xu


Citations: 4

Abstract

This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented for the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution. The result is a new algorithm with expected time complexity $O(n \log^k n)$ and a practical space complexity of $O(kn)$, where n is the string length. When $k>0$, which is the hard case, the new proposal significantly improves on the any-case $O(n^2)$ time complexity of the prior best method for k-mismatch shortest unique substring finding. An experimental study shows that the new algorithm is practical to implement and, when k is small relative to n, is significantly faster than the implementation of the prior best solution. For example, our method processes a 200KB sample DNA sequence with $k=1$ in just 0.18 seconds, compared to 174.37 seconds with the prior best solution. Further, significant portions of the adapted technique can be executed in parallel using two different simple concurrency models, which yields further significant practical performance improvement. For example, when using 8 cores to process a 10MB sample DNA sequence with $k=2$, both parallel implementations achieved processing times less than $1/4$ that of the serial implementation. In an age where instances with thousands of gigabytes of RAM are readily available through cloud infrastructure providers, the trade-off of additional memory usage for significantly improved processing times is likely to be desirable and needed by many users. For example, the best prior solution may take years to process a 200MB DNA sample for any $k>0$, while the new proposal, using 24 cores, can process a sample of this size with $k=1$ in $206.376$ seconds with a peak memory usage of 46GB, which is both easily available and affordable on the cloud for many users. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology.
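The abstract does not restate the problem definition, so the following is only a reading aid, not the paper's algorithm. It assumes the definition commonly used in the literature: a substring is k-mismatch unique if no other substring of the same length, starting at a different position, lies within Hamming distance k of it, and a query at position p asks for the shortest such substring covering p. The function names, the half-open `(i, j)` return convention, and the leftmost tie-breaking rule are illustrative choices. The sketch is a naive brute force with cubic running time, far slower than either the $O(n^2)$ prior method or the new algorithm, and is useful only for checking small cases by hand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_mismatch_unique(s, i, j, k):
    """True if s[i:j] has no other same-length substring of s
    (starting at a position other than i) within Hamming distance k.
    This uniqueness definition is an assumption, not quoted from the paper."""
    m = j - i
    target = s[i:j]
    for start in range(len(s) - m + 1):
        if start != i and hamming(target, s[start:start + m]) <= k:
            return False
    return True


def brute_force_k_sus(s, p, k):
    """Shortest k-mismatch unique substring covering position p (0-based).
    Returns a half-open interval (i, j); ties go to the leftmost start.
    Naive enumeration by increasing length, for illustration only."""
    n = len(s)
    for m in range(1, n + 1):
        # Starts i such that i <= p < i + m and i + m <= n.
        for i in range(max(0, p - m + 1), min(p, n - m) + 1):
            if is_k_mismatch_unique(s, i, i + m, k):
                return (i, i + m)
    return None  # unreachable for 0 <= p < n: the whole string is unique


if __name__ == "__main__":
    print(brute_force_k_sus("aab", 0, 0))
    print(brute_force_k_sus("aab", 0, 1))
```

On the toy string `"aab"` at position 0, allowing one mismatch forces a longer answer than the exact case, because `"aa"` and `"ab"` differ in only one position. This is the sense in which $k>0$ is the harder setting.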
