An efficient uniform-cost normalized edit distance algorithm

6th International Symposium on String Processing and Information Retrieval. 5th International Workshop on Groupware (Cat. No.PR00268) Pub Date: 1999-04-24 DOI: 10.1109/SPIRE.1999.796572

Abdullah N. Arslan, Ö. Eğecioğlu

Citations: 21

Abstract

A common model for computing the similarity of two strings X and Y of lengths m and n respectively, with m ≥ n, is to transform X into Y through a sequence of three types of edit operations: insertion, deletion, and substitution. The model assumes a given cost function which assigns a non-negative real weight to each edit operation. The amortized weight for a given edit sequence is the ratio of its weight to its length, and the minimum of this ratio over all edit sequences is the normalized edit distance. Existing algorithms for normalized edit distance computation with proven complexity bounds require O(mn^2) time in the worst case. We give an O(mn log n)-time algorithm for the problem when the cost function is uniform, i.e., the weight of each edit operation is constant within the same type, except that substitutions can have different weights depending on whether they are matching or non-matching.
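
To make the definition concrete, the following is a minimal Python sketch that computes the normalized edit distance directly from the definition under a uniform cost function. It is not the O(mn log n) algorithm announced in the abstract; it is a straightforward dynamic program that tracks the minimum weight for every possible edit-sequence length and therefore runs in cubic time, in line with the O(mn^2) baseline the abstract mentions. The function name, the parameter names, and the default weights (`w_ins`, `w_del`, `w_match`, `w_mismatch`) are assumptions chosen for illustration. The sketch also assumes that a matching substitution counts as one operation toward the sequence length, and that the distance between two empty strings is 0.

```python
from math import inf


def normalized_edit_distance(x, y, w_ins=1.0, w_del=1.0,
                             w_match=0.0, w_mismatch=1.0):
    """Minimum over all edit sequences of (total weight / number of operations).

    Uniform costs: one weight per operation type, with substitutions split
    into matching and non-matching. All names and defaults are illustrative.
    """
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return 0.0  # convention assumed here for two empty strings

    max_len = m + n  # longest useful sequence: delete all of x, insert all of y
    # best[i][j][k] = minimum weight of a sequence of exactly k operations
    # that transforms x[:i] into y[:j]; inf if no such sequence exists.
    best = [[[inf] * (max_len + 1) for _ in range(n + 1)]
            for _ in range(m + 1)]
    best[0][0][0] = 0.0

    for i in range(m + 1):
        for j in range(n + 1):
            for k in range(1, i + j + 1):
                cost = inf
                if i > 0:  # delete x[i-1]
                    cost = min(cost, best[i - 1][j][k - 1] + w_del)
                if j > 0:  # insert y[j-1]
                    cost = min(cost, best[i][j - 1][k - 1] + w_ins)
                if i > 0 and j > 0:  # substitute (matching or non-matching)
                    w = w_match if x[i - 1] == y[j - 1] else w_mismatch
                    cost = min(cost, best[i - 1][j - 1][k - 1] + w)
                best[i][j][k] = cost

    # A complete edit sequence has between max(m, n) and m + n operations.
    return min(best[m][n][k] / k
               for k in range(max(m, n), m + n + 1)
               if best[m][n][k] < inf)
```

With the default weights, `normalized_edit_distance("ab", "a")` considers, among others, the two-operation sequence "match a, delete b" with weight 1, giving the ratio 1/2. Raising `w_mismatch` above `w_ins + w_del` shows why the ratio is not simply the ordinary edit distance divided by a fixed length: a longer sequence of cheaper operations can then yield a smaller amortized weight.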



