Exact pattern matching. Current achievements and research

Visnik Kiivs'kij nacional'nij universitet imeni Tarasa Sevcenka Istoria. Pub Date: 2023-01-01. DOI: 10.17721/1812-5409.2023/1.11

A. Zuiev



Abstract

Exact pattern matching is a fundamental problem in programming. Algorithms that solve it are core components of search engines, version control systems, text editors, DNA analyzers, and many other tools. For brevity, articles usually denote the pattern as P or p and its length as M or m. Similarly, the text is usually denoted T or t and its length N or n. The alphabet is denoted Σ and its size |Σ|. With this notation, the problem can be stated as follows: find all positions i (or the number of such positions) such that P[0...m-1] = T[i...i+m-1]. In other words, find every position i in the text at which the substring of length m starting at i is equal to the pattern. The main parameters of the problem are the pattern length and the alphabet size. The length of the text usually does not matter because, for any sufficiently long text of a given structure, the running time of an algorithm per character is close to constant. In addition, the specifics of the input data and text may affect the performance of the algorithms. All of this makes the problem both nuanced and interesting to investigate. Many different solutions have been developed over the last five decades. The main part of the work gives short descriptions and analyses of a set of algorithms that are still relevant in the field. It also comments on their theoretical regions of efficiency and how these depend on the specifics of the input. The results of practical experiments on a variety of randomly generated test data are provided. The conclusion analyzes the obtained results and the efficiency of each class of algorithms depending on the input, and presents the results in the form of a table showing the most efficient algorithm for each pair of pattern length and alphabet size.
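The problem statement in the abstract can be illustrated with a minimal brute-force sketch. This is not one of the algorithms evaluated in the paper (the abstract does not name them). It is only a baseline that checks every candidate position directly. The function name `find_all`, the 0-based indexing, and the choice to return an empty list for an empty pattern are assumptions made for this illustration.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return every 0-based position i such that text[i:i+m] == pattern.

    Brute-force baseline: O(n * m) character comparisons in the worst case.
    Assumption for this sketch: an empty pattern yields no matches.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    positions = []
    # Only positions 0 .. n - m can hold a full-length occurrence.
    for i in range(n - m + 1):
        j = 0
        # Compare the pattern with the text window starting at i,
        # stopping at the first mismatch.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            positions.append(i)
    return positions
```

For example, `find_all("aba", "ababa")` returns `[0, 2]`, which shows that overlapping occurrences are reported. Counting the occurrences, the other variant mentioned in the abstract, is just the length of the returned list.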


