Comparison of fuzzy search algorithms based on Damerau-Levenshtein automata on large data

Technology audit and production reserves. Pub Date: 2023-08-28. DOI: 10.15587/2706-5448.2023.286382

Kyrylo Kleshch, Volodymyr Shablii



Abstract

The object of research is fuzzy search algorithms based on Levenshtein automata and Damerau-Levenshtein automata. The paper examines and compares finite-state-machine solutions for quickly finding words and lines within a given edit distance of a search term in large text data. Fuzzy search algorithms find significantly more relevant results than exact-match search algorithms, but they usually have higher asymptotic complexity and therefore run considerably longer. Fuzzy text search using the Damerau-Levenshtein distance accounts for common errors that a user may make in a search term: character substitution, an extra character, a missing character, and transposition of adjacent characters. Using a finite automaton involves two phases: first the automaton is constructed for a specific input word and edit distance, and then the search is performed by running candidate words through the automaton and discarding those it does not accept. Because constructing the automaton can take a long time, both phases should be taken into account when choosing an algorithm. To speed up one of the automata, SIMD instructions were used, which gave a speedup of 1-10% depending on the number of search words, the length of the search word, and the edit distance. The results can be useful in applications that need fast and efficient fuzzy search over large volumes of data, such as search engines or automatic error correction.
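The build-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it does not precompute an automaton and uses no SIMD. Instead it simulates the automaton lazily, treating each row of the dynamic-programming table as a state, and it assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance. All names (`start`, `step`, `accepts`, `fuzzy_search`, the `transpositions` flag) are hypothetical and chosen only for this example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical state layout: (current DP row, previous DP row, previous character).
State = Tuple[Tuple[int, ...], Optional[Tuple[int, ...]], Optional[str]]


def start(pattern: str) -> State:
    """Initial state: distance from the empty candidate to every pattern prefix."""
    return (tuple(range(len(pattern) + 1)), None, None)


def step(pattern: str, state: State, ch: str, transpositions: bool = True) -> State:
    """Consume one candidate character and return the next state."""
    row, prev_row, prev_ch = state
    new = [row[0] + 1]
    for j in range(1, len(pattern) + 1):
        cost = 0 if pattern[j - 1] == ch else 1
        best = min(
            row[j] + 1,         # extra character in the candidate
            new[j - 1] + 1,     # character missing from the candidate
            row[j - 1] + cost,  # match or substitution
        )
        if (transpositions and prev_row is not None and j > 1
                and ch == pattern[j - 2] and prev_ch == pattern[j - 1]):
            best = min(best, prev_row[j - 2] + 1)  # two adjacent characters swapped
        new.append(best)
    return (tuple(new), row, ch)


def accepts(pattern: str, candidate: str, k: int, transpositions: bool = True) -> bool:
    """Return True if candidate is within edit distance k of pattern."""
    state = start(pattern)
    for ch in candidate:
        state = step(pattern, state, ch, transpositions)
        if min(state[0]) > k:  # dead state: no continuation can be accepted
            return False
    return state[0][-1] <= k


def fuzzy_search(pattern: str, words: Iterable[str], k: int,
                 transpositions: bool = True) -> List[str]:
    """Keep only the words that the simulated automaton accepts."""
    return [w for w in words if accepts(pattern, w, k, transpositions)]


if __name__ == "__main__":
    words = ["search", "saerch", "serch", "seearch", "seurch", "church", "sea"]
    print(fuzzy_search("search", words, 1))
    print(fuzzy_search("search", words, 1, transpositions=False))
```

With `transpositions=False` the same code behaves like a plain Levenshtein automaton, so a swapped pair such as "saerch" costs two edits instead of one. A precomputed automaton of the kind the paper compares would pay a construction cost up front in exchange for cheaper per-word steps, which is why both phases matter when choosing an algorithm.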

