Efficient Parallel Output-Sensitive Edit Distance
Xiangyun Ding, Xiaojun Dong, Yan Gu, Youzhe Liu, Yihan Sun
DOI: 10.4230/LIPIcs.ESA.2023.40 (https://doi.org/10.4230/LIPIcs.ESA.2023.40) · Published 2023-06-30
Abstract
Given two strings $A[1..n]$ and $B[1..m]$, and a set of operations allowed to edit the strings, the edit distance between $A$ and $B$ is the minimum number of operations required to transform $A$ into $B$. Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with $\Theta(nm)$ cost. In many real-world applications, the strings to be compared are similar and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., algorithms that achieve asymptotically better cost bounds than the standard $\Theta(nm)$ algorithm when the edit distance is small. We study four algorithms in this paper: three based on Breadth-First Search (BFS) and one based on Divide-and-Conquer (DaC). Our BFS-based solution builds on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using a parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaptation to make it output-sensitive. All our algorithms have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic and real-world data. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.
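For context on the $\Theta(nm)$ baseline the abstract refers to, the following is a minimal sequential sketch of the standard dynamic program with two rolling rows. It is illustrative code under our own naming (edit_distance_dp and the example strings are not from the paper), not the authors' implementation.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Standard O(nm) dynamic program for unit-cost edit distance.
// prev/cur hold two consecutive rows of the DP table:
// row i, column j = edit distance between A[0..i) and B[0..j).
size_t edit_distance_dp(const std::string& A, const std::string& B) {
  const size_t n = A.size(), m = B.size();
  std::vector<size_t> prev(m + 1), cur(m + 1);
  for (size_t j = 0; j <= m; j++) prev[j] = j;  // row 0: insert B[0..j) into an empty string
  for (size_t i = 1; i <= n; i++) {
    cur[0] = i;                                 // column 0: delete A[0..i)
    for (size_t j = 1; j <= m; j++) {
      if (A[i - 1] == B[j - 1])
        cur[j] = prev[j - 1];                   // characters match: no edit needed
      else
        cur[j] = 1 + std::min({prev[j - 1],     // substitute A[i-1] with B[j-1]
                               prev[j],         // delete A[i-1]
                               cur[j - 1]});    // insert B[j-1]
    }
    std::swap(prev, cur);
  }
  return prev[m];
}

int main() {
  std::cout << edit_distance_dp("kitten", "sitting") << "\n";  // expected output: 3
}

Every cell depends only on its left, top, and top-left neighbors, so the table is easy to fill but costs $\Theta(nm)$ work even when the two strings are nearly identical, which is exactly the case output-sensitive algorithms exploit.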
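The BFS-based algorithms start from the Landau-Vishkin approach: try edit distances k = 0, 1, 2, ... in rounds, and for each diagonal d = i - j of the DP table keep only the furthest point reachable with at most k edits, jumping over matching characters with longest common extension (LCE, i.e., LCP of two suffixes) queries. Below is a sequential sketch of that idea under our own naming; it is not the authors' parallel code. In particular, the LCE here is a direct character-by-character scan, giving O(nk) work, whereas the paper answers these LCP queries in (near-)constant time with a parallel suffix array or with hashing, and parallelizes each round over the diagonals.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Sequential sketch of a Landau-Vishkin-style output-sensitive edit distance.
// prev[d + m] / next[d + m] store, for diagonal d = i - j, the furthest i such
// that A[0..i) can be matched with B[0..i-d) using at most k edits.
size_t edit_distance_lv(const std::string& A, const std::string& B) {
  const long long n = A.size(), m = B.size();
  // lce(i, j): length of the longest common prefix of A[i..n) and B[j..m).
  auto lce = [&](long long i, long long j) {
    long long len = 0;
    while (i + len < n && j + len < m && A[i + len] == B[j + len]) len++;
    return len;
  };
  const long long UNSET = -1;
  std::vector<long long> prev(n + m + 1, UNSET), next(n + m + 1, UNSET);
  prev[0 + m] = lce(0, 0);                       // k = 0: match a common prefix on diagonal 0
  if (prev[0 + m] == n && n == m) return 0;
  for (long long k = 1; ; k++) {
    std::fill(next.begin(), next.end(), UNSET);
    for (long long d = -std::min(k, m); d <= std::min(k, n); d++) {
      long long cand = UNSET;
      if (prev[d + m] != UNSET) {                // state on the same diagonal
        long long i = prev[d + m], j = i - d;
        cand = std::max(cand, i);                // keep the old reach
        if (i < n && j < m) cand = std::max(cand, i + 1);   // substitution
      }
      if (d - 1 >= -m && prev[d - 1 + m] != UNSET) {         // state on diagonal d-1
        long long i = prev[d - 1 + m];
        if (i < n) cand = std::max(cand, i + 1); // deletion from A
      }
      if (d + 1 <= n && prev[d + 1 + m] != UNSET) {          // state on diagonal d+1
        long long i = prev[d + 1 + m], j = i - (d + 1);
        if (j < m) cand = std::max(cand, i);     // insertion into A
      }
      if (cand == UNSET) continue;
      cand += lce(cand, cand - d);               // slide along free matches
      next[d + m] = cand;
      if (cand == n && cand - d == m) return (size_t)k;      // both strings fully consumed
    }
    prev.swap(next);
  }
}

int main() {
  std::cout << edit_distance_lv("kitten", "sitting") << "\n";  // expected output: 3
}

Each round touches O(k) diagonals, so with constant-time LCP queries the sequential version runs in O(n + k^2) time for edit distance k, which is the flavor of output-sensitive bound the paper's parallel algorithms aim for.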