SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Hamid Mushtaq, Frank Liu, Carlos H. A. Costa, Gang Liu, H. P. Hofstee, Z. Al-Ars
{"title":"SparkGA: A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale","authors":"Hamid Mushtaq, Frank Liu, Carlos H. A. Costa, Gang Liu, H. P. Hofstee, Z. Al-Ars","doi":"10.1145/3107411.3107438","DOIUrl":null,"url":null,"abstract":"In recent years, the cost of NGS (Next Generation Sequencing) technology has dramatically reduced, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually in the order of hundreds of gigabytes per experiment, have to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. Many components of the GATK pipeline are not very parallelizable though. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the big data Apache Spark framework. This implementation is highly scalable and capable of parallelizing computation by utilizing data-level parallelism as well as load balancing techniques. In order to reduce the analysis cost, the framework can run on nodes with as little memory as 16GB. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3107411.3107438","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 31

Abstract

In recent years, the cost of NGS (Next Generation Sequencing) technology has dramatically reduced, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually in the order of hundreds of gigabytes per experiment, have to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. Many components of the GATK pipeline are not very parallelizable though. In this paper, we present a parallel implementation of a DNA analysis pipeline based on the big data Apache Spark framework. This implementation is highly scalable and capable of parallelizing computation by utilizing data-level parallelism as well as load balancing techniques. In order to reduce the analysis cost, the framework can run on nodes with as little memory as 16GB. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Our solution is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of the software described in this paper is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
SparkGA:一个具有成本效益,快速和准确的DNA分析的Spark框架
近年来,下一代测序(NGS)技术的成本大幅降低,使其成为诊断遗传疾病的可行方法。NGS技术产生的大量数据,每次实验通常达到数百千兆字节,必须快速分析才能产生有意义的变体结果。来自Broad研究所的GATK最佳实践管道是DNA分析中最流行的计算管道之一。然而,GATK管道的许多组件并不是很容易并行化。在本文中,我们提出了一个基于大数据Apache Spark框架的DNA分析管道的并行实现。这种实现具有高度可伸缩性,并且能够通过利用数据级并行性和负载平衡技术来并行化计算。为了降低分析成本,框架可以在内存低至16GB的节点上运行。对于全基因组测序实验,我们证明在20个节点的集群上,运行时间可以减少到1.5小时左右,准确率高达99.9981%。我们的解决方案比其他最先进的解决方案快71%,同时也更准确。本文中描述的软件的源代码可以在https://github.com/HamidMushtaq/SparkGA1.git上公开获得。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信