SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 845-852. Pub Date: 2015-05-04. DOI: 10.1109/CCGrid.2015.55

Guoguang Zhao, Cheng Ling, Donghong Sun


Citations: 36

Abstract

The Smith-Waterman (SW) algorithm is widely used for database searches owing to its high sensitivity. Its impact is reflected in the more than 8000 citations it has received over the past decades. However, the algorithm's time and space complexity are prohibitively high, which poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents SparkSW, the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, using a couple of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated using a realistic ultra-large database, the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load adaptively across parallel workloads and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework provides an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
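To make the cost the abstract refers to concrete, the following is a minimal sketch of Smith-Waterman scoring and a database search built on it. It is not the SparkSW implementation: the function names `sw_score` and `search`, the linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are all assumptions chosen for illustration. Each query-subject comparison fills a dynamic-programming table with one cell per pair of residues, so the time grows with the product of the two sequence lengths. Each subject sequence is scored independently of the others, which is the property that lets a database search be split across workers.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, linear gap penalty (illustrative parameters)."""
    # Only the previous row is kept, so memory is linear in len(subject).
    # Recovering the alignment itself would need the full table.
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database, top=3):
    """Score the query against every subject and return the best hits.

    `database` is a hypothetical dict of sequence name -> sequence string.
    The per-subject scoring is an independent map step.
    """
    scored = [(name, sw_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]
```

In a distributed setting the list comprehension in `search` is the part that would be spread over the cluster, with the final sort acting as the reduction step. How SparkSW partitions the database and balances load is not described in the abstract.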
