Comparison of fuzzy search algorithms based on Damerau-Levenshtein automata on large data

Kyrylo Kleshch, Volodymyr Shablii
Journal: Technology audit and production reserves
Published: 2023-08-28
DOI: 10.15587/2706-5448.2023.286382 (https://doi.org/10.15587/2706-5448.2023.286382)

Abstract

The object of research is fuzzy search algorithms based on Levenshtein automata and Damerau-Levenshtein automata. The paper examines and compares finite-state-machine solutions for quickly finding words and strings within a given edit distance of a query in large text data. Fuzzy search algorithms find significantly more relevant results than standard exact-match search algorithms. However, they usually have a higher asymptotic complexity and therefore run considerably longer. Fuzzy text search with the Damerau-Levenshtein distance accounts for the common errors a user may have made in the search term: a substituted character, an extra character, a missing character, and transposed adjacent characters. To use a finite automaton, it must first be constructed for a specific input word and edit distance; the search is then run against that automaton, discarding every word it does not accept. Both phases should be taken into account when choosing an algorithm, because constructing the automaton can take a long time. To speed up one of the automata, SIMD instructions were used, which gave a speedup of 1-10% depending on the number of search words, the length of the search word, and the edit distance. The results can be useful in any field that requires fast and efficient fuzzy search over large volumes of data, for example in search engines or in automatic error correction.
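
As a rough illustration of the four error types named above, the sketch below computes the restricted Damerau-Levenshtein (optimal string alignment) distance with plain dynamic programming and uses it to filter a word list. This is a baseline for intuition only, not the automaton-based method the paper compares; the function names `osa_distance` and `fuzzy_filter` and the sample words are assumptions made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts substitutions, insertions, deletions, and transpositions of
    adjacent characters. Illustrative baseline, not an automaton.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # missing character (deletion)
                d[i][j - 1] + 1,         # extra character (insertion)
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_filter(query: str, words: list[str], max_distance: int) -> list[str]:
    """Keep only the words within max_distance of the query."""
    return [w for w in words if osa_distance(query, w) <= max_distance]


if __name__ == "__main__":
    candidates = ["saerch", "serch", "reach", "search", "banana"]
    print(fuzzy_filter("search", candidates, 1))
    # → ['saerch', 'serch', 'search']
```

Here every candidate is compared against the query from scratch, so the cost grows with the product of the word lengths for each word. An automaton built once per query and edit distance trades that repeated work for an up-front construction phase, which is the trade-off the abstract describes.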