SeedHit: A GPU Friendly Pre-Align Filtering Algorithm

IF 3.4; Zone 3, Biology; JCR Q2, Biochemical Research Methods

IEEE/ACM Transactions on Computational Biology and Bioinformatics Pub Date : 2024-06-21 DOI:10.1109/TCBB.2024.3417517

Zhen Ju;Jingjing Zhang;Xuelei Li;Jintao Meng;Yanjie Wei

Volume 21, Issue 6, pp. 1794-1802. Full text: https://ieeexplore.ieee.org/document/10568393/


Abstract

The volume of genetic data generated by next-generation sequencing (NGS) technologies is growing faster than Moore's law, which calls for efficient NGS data processing and analysis algorithms. Placing a filter before the computationally costly analysis step can significantly reduce the run time of NGS data analysis. As GPUs can be orders of magnitude faster than CPUs on highly parallel workloads, this paper proposes a GPU-friendly pre-align filtering algorithm named SeedHit for the fast processing of NGS data. Inspired by BLAST, SeedHit counts seed hits between two sequences to determine their similarity. In SeedHit, each nucleotide in a gene sequence is represented in binary format. By packing the data and generating a lookup table that fits into the L1 cache, SeedHit is GPU-friendly and achieves high throughput. Using three 16S rRNA datasets from Greengenes as input, SeedHit rejects 84%–89% of dissimilar sequence pairs on average when the similarity threshold is 0.9–0.99. SeedHit reaches a throughput of 1 T/s (tera bases per second) on an RTX 3080 Ti. Compared with two other GPU-based filtering algorithms, GateKeeper and SneakySnake, SeedHit has the highest rejection rate and throughput. Incorporating SeedHit into our in-house clustering algorithm nGIA yields a 1.6–2.1 times speedup over the original version.
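The abstract gives only the outline of the filter: binary-encoded nucleotides, a compact lookup table, and a count of seed hits. The following is a minimal CPU-side sketch of that general idea, not the authors' GPU implementation. The function names, the seed length `k`, the one-bit-per-seed table layout, and the `min_hits` threshold are all assumptions chosen for illustration, and the sketch handles only the characters A, C, G, and T.

```python
# Illustrative sketch only: seed length, table layout and threshold are
# assumptions, not the parameters or data structures used by SeedHit.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit code per nucleotide


def seeds(seq, k):
    """Yield every k-mer of seq packed into an integer, 2 bits per base."""
    mask = (1 << (2 * k)) - 1
    packed = 0
    for i, base in enumerate(seq):
        # Shift in the next base and drop the base that left the window.
        packed = ((packed << 2) | CODE[base]) & mask
        if i >= k - 1:
            yield packed


def build_table(seq, k):
    """Build a bit table with one bit per possible seed (4**k bits)."""
    table = bytearray((4 ** k + 7) // 8)
    for s in seeds(seq, k):
        table[s >> 3] |= 1 << (s & 7)
    return table


def count_hits(table, seq, k):
    """Count the seeds of seq whose bit is set in the table."""
    return sum((table[s >> 3] >> (s & 7)) & 1 for s in seeds(seq, k))


def passes_filter(ref, query, k=8, min_hits=1):
    """Keep the pair for alignment only if enough seeds are shared."""
    return count_hits(build_table(ref, k), query, k) >= min_hits
```

With the hypothetical seed length `k = 8` there are 4^8 = 65,536 possible seeds, so a one-bit-per-seed table occupies 8 KiB, which illustrates how such a table can stay small enough for a fast on-chip cache. A pair that fails `passes_filter` would be rejected before the costly alignment step.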


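Source journal

IEEE/ACM Transactions on Computational Biology and Bioinformatics (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 7.50

Self-citation rate: 6.70%

Articles published: 479

Review time: 3 months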

Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results that are obtained from the use of these methods, programs and databases; and the emerging field of systems biology, where many forms of data are used to create a computer-based model of a complex biological system.