SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment

Guoguang Zhao, Cheng Ling, Donghong Sun
Published in: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 845-852
Publication date: 2015-05-04
DOI: 10.1109/CCGrid.2015.55 (https://doi.org/10.1109/CCGrid.2015.55)
Citations: 36

Abstract

The Smith-Waterman (SW) algorithm is widely used for database searches owing to its high sensitivity. Its impact is reflected in the more than 8000 citations it has received over the past decades. However, the algorithm's time and space complexity is prohibitively high, which poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale, data-intensive applications on commercial hardware. This paper presents SparkSW, the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, using a couple of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated using a realistic ultra-large database from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load adaptively across parallel workloads and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework offers an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
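
To make the cost mentioned in the abstract concrete, the following is a minimal Python sketch of Smith-Waterman local alignment scoring. It is not the SparkSW implementation: the function names, the linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen for illustration only. Every pair of residues is visited once, so scoring one query against one database sequence takes time proportional to the product of their lengths.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty).

    Illustrative only: the scoring parameters are assumed values,
    not those used by SparkSW.
    """
    # Only the previous row is kept, which is enough when just the
    # score is needed (no traceback of the alignment itself).
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            # Local alignment: a cell never drops below zero.
            cell = max(0, diag, up, left)
            curr.append(cell)
            if cell > best:
                best = cell
        prev = curr
    return best


def search_database(query, database):
    """Score the query against every sequence; best hits first.

    `database` is a hypothetical dict mapping sequence names to
    sequences. Each entry is scored independently of the others.
    """
    scored = [
        (name, smith_waterman_score(query, seq))
        for name, seq in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Because each database sequence is scored independently, the loop in `search_database` is the kind of step that a distributed framework could in principle spread across workers; how SparkSW actually partitions and balances this work is not described in the abstract.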