A novel pairwise sequence alignment algorithm for similarity search in massive datasets.

IF 7.7, Zone 2 (Biology), JCR Q1, Biochemical Research Methods

Briefings in Bioinformatics. Pub Date: 2025-08-31. DOI: 10.1093/bib/bbaf512

Yosef Masoudi-Sobhanzadeh, Yadollah Omidi

Volume 26, Issue 5. Publication type: Journal Article. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12476838/pdf/

Citations: 0

Abstract

A novel pairwise sequence alignment algorithm for similarity search in massive datasets.

Advances in sequencing technologies have resulted in the production of a huge volume of data. Since the pairwise sequence alignment plays an essential role in comparing sequencing data, various algorithms have been developed. Among the previously suggested algorithms, the basic local alignment search tool (BLAST) is currently employed in a wide range of biological applications, largely due to its low time and memory complexity. However, not only BLAST but also other improved sequence alignment algorithms may fail to produce accurate results, therefore, more efficient algorithms can be highly advantageous. In the present study, we introduce a novel algorithm for sequence alignment (NASA) consisting of preprocessing and aligning steps. In the preprocessing step, the positions of residues are determined within a provided nucleotide or peptide sequence, resulting in seeking only informative regions. In the aligning step, based on a constant number of comparisons, the sequence similarity score is calculated between two sequences in a linear time and memory orders. To evaluate NASA, a large volume of sequencing data was analyzed and the outcomes were compared with other algorithms. The results showed that NASA outperforms other basic algorithms in terms of the elapsed time, required memory, system resource utilization, and alignment score precision. Collectively, NASA might be a promising method for retrieving similar sequences from large datasets.
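
The abstract does not spell out how NASA's two steps are implemented, so the following is not the authors' algorithm. It is a minimal Python sketch, under assumptions of our own, of the general shape the abstract describes: a preprocessing pass that records where each residue occurs, followed by a scoring pass that runs in time and memory linear in the sequence lengths. The names `index_positions`, `similarity`, and `max_shift`, as well as the scoring rule itself, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def index_positions(seq):
    """Preprocessing (illustrative): map each residue to its sorted positions."""
    positions = defaultdict(list)
    for i, residue in enumerate(seq):
        positions[residue].append(i)
    return positions


def similarity(a, b, max_shift=2):
    """Toy similarity score in [0, 1]; NOT the published NASA score.

    For each residue, the two sorted position lists are merged with two
    pointers, pairing occurrences whose positions differ by at most
    max_shift. Every loop iteration advances at least one pointer, so the
    total work is O(len(a) + len(b)).
    """
    if not a or not b:
        return 0.0
    idx_a, idx_b = index_positions(a), index_positions(b)
    matched = 0
    for residue, pos_a in idx_a.items():
        pos_b = idx_b.get(residue, [])
        i = j = 0
        while i < len(pos_a) and j < len(pos_b):
            diff = pos_a[i] - pos_b[j]
            if abs(diff) <= max_shift:
                matched += 1      # pair these two occurrences
                i += 1
                j += 1
            elif diff < 0:
                i += 1            # occurrence in a is too far left; skip it
            else:
                j += 1            # occurrence in b is too far left; skip it
    return matched / max(len(a), len(b))


if __name__ == "__main__":
    print(similarity("ACGT", "ACGT"))
    print(similarity("ACGT", "AACGT", max_shift=1))
```

Unlike dynamic-programming aligners, this sketch never builds a matrix over both sequences, which is what keeps its memory use linear; the trade-off is that it only tolerates small positional shifts and yields a score rather than an explicit alignment.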

Source journal

Briefings in Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.