Optimization of BLAST Seed Indexing in the Alignment of DNA Sequences with GPU using CUDA

Franklin Luis Antonio Cruz Gamero, J. C. Gutiérrez-Cáceres
Latin American Computing Conference / Conferencia Latinoamericana En Informatica, published 2018-10-01
DOI: 10.1109/CLEI.2018.00069 (https://doi.org/10.1109/CLEI.2018.00069)
Citations: 1

Abstract

Several algorithms are used to align biological sequences such as DNA, RNA, and proteins, chiefly the Basic Local Alignment Search Tool (BLAST). BLAST has two phases: a heuristic seed-indexing phase and an extension phase that compares sequences using the Smith-Waterman (SW) algorithm. Together they align a short "query" sequence against a long reference sequence ("db") much faster than other alignment algorithms. This work proposes storing the seed index in a two-dimensional matrix, instead of a sparse matrix, used as a hash table, and using the GPU of a graphics card to accelerate seeding. The approach reduces the processing time of the BLAST seed-indexing phase by 11.24%, and the GPU implementation with CUDA achieves better processing times than both the sequential implementation and a multi-CPU implementation using OpenMP threads. The algorithm obtains the seeds identical to a pattern key in O(1) time, and the performance gain grows as the length of the hash key increases. The evaluation used a laptop with a Core i7 processor, 16 GB of RAM, and a graphics card with 384 cores, with the implementation written in C++ and CUDA. Alignment tests used real DNA sequences in FASTA format obtained from the National Center for Biotechnology Information (NCBI) and ENSEMBL, with reference sequences of up to 1.3 Gb, such as the complete genome of the chicken (Gallus gallus), which has 1 230 258 557 base pairs (bp). In that test, with a query sequence of 140 bp, indexing with a 5 bp key took 1074 milliseconds on the GPU.
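
The abstract does not give the layout of the two-dimensional matrix, so the following C++ code is only a minimal sketch of one way a dense matrix can serve as a directly addressed hash table with O(1) row lookup. It is not the authors' implementation. All names (baseCode, encodeKmer, SeedIndex, buildSeedIndex) are hypothetical, the 2-bit encoding of A, C, G, T and the fixed row width (the size of the largest bucket) are assumptions, and the code runs on the CPU only; the CUDA parallelization described in the paper is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a nucleotide to a 2-bit code; returns -1 for any other symbol (e.g. 'N').
inline int baseCode(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Encode the k-mer starting at text[pos] as an integer in [0, 4^k).
// Returns -1 if the k-mer contains a symbol other than A, C, G, T.
inline std::int64_t encodeKmer(const std::string& text, std::size_t pos, int k) {
    std::int64_t key = 0;
    for (int i = 0; i < k; ++i) {
        const int code = baseCode(text[pos + i]);
        if (code < 0) return -1;
        key = key * 4 + code;
    }
    return key;
}

// Hypothetical dense seed index: one row per possible k-mer key,
// each row holding the reference positions where that k-mer occurs.
struct SeedIndex {
    int k = 0;
    std::size_t width = 0;             // columns per row (assumed: largest bucket size)
    std::vector<std::uint32_t> count;  // number of occupied cells in each row
    std::vector<std::uint32_t> cells;  // row-major matrix of size 4^k x width

    // The key is the row number, so locating a bucket is a single multiplication.
    const std::uint32_t* row(std::int64_t key) const {
        return cells.data() + static_cast<std::size_t>(key) * width;
    }
};

// Build the index in two passes: count occurrences per key, then fill the rows.
SeedIndex buildSeedIndex(const std::string& ref, int k) {
    assert(k >= 1 && k <= 12);  // keeps 4^k rows within a modest memory budget
    SeedIndex idx;
    idx.k = k;
    const std::size_t rows = std::size_t{1} << (2 * k);  // 4^k possible keys
    idx.count.assign(rows, 0);
    if (ref.size() < static_cast<std::size_t>(k)) return idx;
    const std::size_t last = ref.size() - static_cast<std::size_t>(k);

    // Pass 1: count how many times each key occurs to size the rows.
    for (std::size_t p = 0; p <= last; ++p) {
        const std::int64_t key = encodeKmer(ref, p, k);
        if (key >= 0) ++idx.count[static_cast<std::size_t>(key)];
    }
    for (std::uint32_t c : idx.count) {
        if (c > idx.width) idx.width = c;
    }
    idx.cells.assign(rows * idx.width, 0);

    // Pass 2: write each position into the next free cell of its row.
    std::vector<std::uint32_t> filled(rows, 0);
    for (std::size_t p = 0; p <= last; ++p) {
        const std::int64_t key = encodeKmer(ref, p, k);
        if (key < 0) continue;
        const std::size_t r = static_cast<std::size_t>(key);
        idx.cells[r * idx.width + filled[r]++] = static_cast<std::uint32_t>(p);
    }
    return idx;
}
```

To find the seeds for a query k-mer, a caller would encode it with encodeKmer and read the first count[key] entries of row(key). Under this fixed-width assumption, rows for rare k-mers waste space when bucket sizes are uneven, which is the usual price of direct addressing over a sparse structure.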