An Improved Hashing Approach for Biological Sequence to Solve Exact Pattern Matching Problems

IF 2.9 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Computational Intelligence and Soft Computing. Pub Date: 2023-11-20. DOI: 10.1155/2023/3278505

Prince Mahmud, Anisur Rahman, Kamrul Hasan Talukder



Abstract

Pattern matching algorithms have become important in computer science, primarily because they are used in domains such as computational biology, video retrieval, intrusion detection systems, and fraud detection. Pattern matching is the task of finding one or more patterns in a given text. Two key measures of how well an exact pattern matching algorithm performs are the total number of attempts and the number of character comparisons made during the matching process. The primary focus of our proposed method is to reduce both counts wherever possible. Although they are fast, hash-based pattern matching algorithms can suffer from hash collisions. This research improves the Efficient Hashing Method (EHM) algorithm. Despite its effectiveness, the EHM algorithm spends a lot of time in the preprocessing phase and generates some hash collisions. We propose a novel hashing method that reduces the preprocessing time and the hash collisions of the EHM algorithm. We devised the Hashing Approach for Pattern Matching (HAPM) algorithm by taking the best parts of the EHM and Quick Search (QS) algorithms and adding a way to avoid hash collisions. The preprocessing step of this algorithm combines the bad character table from the QS algorithm, the hashing strategy from the EHM algorithm, and the collision-reducing mechanism. To analyze the performance of our HAPM algorithm, we used three types of datasets: E. coli, DNA sequences, and protein sequences. We compared our proposed method against six algorithms discussed in the literature. The Hash-q with Unique FNG (HqUF) algorithm was compared only on the E. coli and DNA datasets because it creates unique bits for DNA sequences. Our proposed HAPM algorithm also overcomes the problems of the HqUF algorithm. The new method outperforms the older ones in average runtime, number of attempts, and character comparisons for both long and short patterns, although it performed worse on some short patterns.
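To make the two metrics concrete, the following is a minimal sketch of how a Quick Search bad character table can be paired with a hash check before any character comparison. It is not the authors' HAPM algorithm: the abstract does not specify the EHM hash function or the collision-reducing mechanism, so `window_hash`, the parameter `q`, and the function names are hypothetical stand-ins chosen for illustration. A collision here simply costs a few wasted comparisons, because every hash hit is verified character by character.

```python
def quick_search_shift_table(pattern):
    """Quick Search bad character table.

    Maps each pattern character to m - (index of its last occurrence).
    Characters absent from the pattern get the default shift m + 1 at lookup time.
    """
    m = len(pattern)
    return {ch: m - i for i, ch in enumerate(pattern)}


def window_hash(s, start, q):
    """Hypothetical stand-in hash over s[start:start + q] (not the EHM hash)."""
    h = 0
    for ch in s[start:start + q]:
        h = (h * 31 + ord(ch)) & 0xFFFF
    return h


def search(text, pattern, q=3):
    """Return (match positions, attempts, character comparisons)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0, 0
    q = min(q, m)
    shift = quick_search_shift_table(pattern)
    target = window_hash(pattern, m - q, q)  # hash of the pattern's last q characters

    matches, attempts, comparisons = [], 0, 0
    pos = 0
    while pos <= n - m:
        attempts += 1  # one attempt per window alignment
        # Compare characters only when the hash of the window's tail agrees.
        if window_hash(text, pos + m - q, q) == target:
            matched = True
            for i in range(m):
                comparisons += 1
                if text[pos + i] != pattern[i]:
                    matched = False  # hash collision or partial match
                    break
            if matched:
                matches.append(pos)
        if pos + m >= n:
            break  # no character after the window to drive the shift
        # Quick Search shifts on the character immediately after the window.
        pos += shift.get(text[pos + m], m + 1)
    return matches, attempts, comparisons
```

For example, `search("GATTACAGATTACA", "TTACA")` reports the matches at positions 2 and 9 while aligning the window far fewer times than a naive scan over all ten candidate positions would.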




Source journal

Applied Computational Intelligence and Soft Computing (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 6.10

Self-citation rate: 3.40%

Review time: 21 weeks

Journal description: Applied Computational Intelligence and Soft Computing focuses on the disciplines of computer science, engineering, and mathematics. Its scope includes developing applications across the natural and social sciences using the technologies of computational intelligence and soft computing. Although computational intelligence and soft computing are established fields, their new applications are still in development and can be regarded as an emerging field, which is the focus of this journal.