Optimization of BLAST Seed Indexing in the Alignment of DNA Sequences with GPU using CUDA

Latin American Computing Conference / Conferencia Latinoamericana En Informatica Pub Date : 2018-10-01 DOI:10.1109/CLEI.2018.00069

Franklin Luis Antonio Cruz Gamero, J. C. Gutiérrez-Cáceres


Citations: 1

Abstract

Biological sequences such as DNA, RNA and proteins are aligned with a variety of algorithms, chief among them the Basic Local Alignment Search Tool (BLAST). BLAST has two phases: a heuristic seed-indexing phase and an extension phase that compares sequences using the Smith-Waterman (SW) algorithm. Together they align a short "query" sequence against a long reference sequence ("db") very quickly relative to other alignment algorithms. This work proposes storing the resulting seed index in a two-dimensional matrix, used as a hash table, instead of a sparse matrix, and using the GPU of a graphics card to optimize seeding. This reduces the processing time of the BLAST seed-indexing phase by 11.24%, and the GPU implementation with CUDA achieves better processing times than both the sequential implementation and a multi-CPU implementation using OpenMP threads. The algorithm retrieves the seeds identical to the pattern key in O(1) time, and performance improves as the length of the hash key increases. The evaluation used a laptop with a Core i7 processor, 16 GB of RAM and a 384-core graphics card, with the implementation written in C++ and CUDA. Alignment tests used real DNA sequences in FASTA format obtained from the National Center for Biotechnology Information (NCBI) and ENSEMBL, with reference sequences of up to 1.3 Gb, such as the complete genome of the hen (Gallus gallus), which has 1 230 258 557 base pairs (bp). With a query sequence of 140 bp, this genome was indexed with a 5 bp key in 1074 milliseconds on the GPU.
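The idea of a two-dimensional matrix used as a hash table with O(1) seed lookup can be illustrated with a minimal sequential C++ sketch. It is not the authors' CUDA implementation, and the abstract does not give the actual layout. The sketch assumes a 2-bit encoding per base (A=0, C=1, G=2, T=3), so that a k-mer maps directly to a row number in a table of 4^k rows, and each row lists the reference positions where that k-mer occurs. The names `baseCode`, `encodeKmer`, `SeedIndex`, `buildSeedIndex` and `lookupSeed` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a base to a 2-bit code (assumed encoding); -1 for anything else, e.g. 'N'.
inline int baseCode(char b) {
    switch (b) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Encode the k-mer starting at seq[pos] as an integer in [0, 4^k).
// Returns -1 if the k-mer contains a non-ACGT character.
inline std::int64_t encodeKmer(const std::string& seq, std::size_t pos, int k) {
    std::int64_t key = 0;
    for (int i = 0; i < k; ++i) {
        int c = baseCode(seq[pos + static_cast<std::size_t>(i)]);
        if (c < 0) return -1;
        key = key * 4 + c;
    }
    return key;
}

// Dense seed table: row = encoded k-mer, columns = positions in the reference.
struct SeedIndex {
    int k;
    std::vector<std::vector<std::uint32_t>> rows;  // 4^k rows
};

// Sequential construction; a GPU version would distribute positions over threads.
SeedIndex buildSeedIndex(const std::string& db, int k) {
    assert(k >= 1 && k <= 12);  // keeps the 4^k rows small enough for a sketch
    const std::size_t kk = static_cast<std::size_t>(k);
    SeedIndex index{k, std::vector<std::vector<std::uint32_t>>(std::size_t{1} << (2 * kk))};
    for (std::size_t pos = 0; pos + kk <= db.size(); ++pos) {
        std::int64_t key = encodeKmer(db, pos, k);
        if (key >= 0) {
            index.rows[static_cast<std::size_t>(key)].push_back(
                static_cast<std::uint32_t>(pos));
        }
    }
    return index;
}

// Row selection is a direct array access for a fixed k: the key is the row number,
// so there is no probing or searching. Returns nullptr for an invalid k-mer.
const std::vector<std::uint32_t>* lookupSeed(const SeedIndex& index,
                                             const std::string& kmer) {
    if (kmer.size() != static_cast<std::size_t>(index.k)) return nullptr;
    std::int64_t key = encodeKmer(kmer, 0, index.k);
    if (key < 0) return nullptr;
    return &index.rows[static_cast<std::size_t>(key)];
}
```

Under this assumed encoding, the 5 bp key mentioned in the abstract would give a table of 4^5 = 1024 rows. For example, `buildSeedIndex("ACGTACGTAC", 3)` followed by `lookupSeed(index, "ACG")` returns the row holding positions 0 and 4.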


