Merging Sorted Lists of Similar Strings

Annual Symposium on Combinatorial Pattern Matching. Pub Date: 2022-08-19. DOI: 10.48550/arXiv.2208.09351

E. Myers

Citations: 0

Abstract

Merging $T$ sorted, non-redundant lists containing $M$ elements into a single sorted, non-redundant result of size $N \ge M/T$ is a classic problem, typically solved in practice in $O(M \log T)$ time with a priority-queue data structure, the most basic of which is the simple *heap*. We revisit this problem in the situation where the list elements are *strings* and the lists contain many *identical or nearly identical elements*. By keeping simple auxiliary information with each heap node, we devise an $O(M \log T+S)$ worst-case method that performs no more character comparisons than $S$, the sum of the lengths of all the strings, and another $O(M \log (T/ \bar e)+S)$ method that becomes progressively more efficient as a function of the fraction of equal elements $\bar e = M/N$ between input lists, reaching linear time when the lists are all identical. The methods perform favorably in practice compared with an alternative formulation based on a trie.
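For orientation, the classic heap-based baseline that the abstract starts from can be sketched as follows. This is only an illustrative sketch of the standard $O(M \log T)$ merge with duplicate removal, written with Python's `heapq`; it is not the paper's improved method, and the function name `merge_sorted_unique` and the sample lists are hypothetical. Every heap comparison here may re-scan whole strings from the first character, which is the cost the auxiliary per-node information described above is meant to avoid.

```python
import heapq
from typing import Iterator, List


def merge_sorted_unique(lists: List[List[str]]) -> Iterator[str]:
    """Baseline T-way merge of sorted, non-redundant string lists.

    Performs O(M log T) heap operations, where M is the total number of
    input elements and T is the number of lists. Each comparison looks at
    full strings; no prefix information is cached (an assumption-free
    baseline, not the paper's method).
    """
    # Each heap entry is (string, list index, position in that list).
    # At most one entry per list is in the heap, so the list index
    # breaks ties between equal strings deterministically.
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)

    last = None  # most recently emitted string, used to drop duplicates
    while heap:
        value, idx, pos = heapq.heappop(heap)
        if value != last:
            yield value
            last = value
        nxt = pos + 1
        if nxt < len(lists[idx]):
            heapq.heappush(heap, (lists[idx][nxt], idx, nxt))


if __name__ == "__main__":
    # Hypothetical input: T = 3 lists, M = 9 elements, N = 4 distinct results,
    # so the average multiplicity M/N is 2.25.
    a = ["acga", "acgt", "ttag"]
    b = ["acgt", "cgta", "ttag"]
    c = ["acga", "acgt", "ttag"]
    print(list(merge_sorted_unique([a, b, c])))
```

In this sample most strings occur in several lists and share long prefixes, which is the regime the abstract targets: the baseline still pays a full $\log T$ factor and repeated character comparisons for every one of the $M$ elements.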

