Streaming Distributed DNA Sequence Alignment Using Apache Spark

Hamid Mushtaq, Nauman Ahmed, Z. Al-Ars
Published in: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE)
Publication date: 2017-10-01
DOI: 10.1109/BIBE.2017.00-57 (https://doi.org/10.1109/BIBE.2017.00-57)
Citation count: 12

Abstract

The large amount of data generated by Next-Generation Sequencing (NGS) technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The first step in analyzing such data is to map the sequenced reads to their corresponding positions in the human genome. One of the most popular tools for such sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program, though, is that it cannot be run on a cluster. In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion while the input data is still being streamed into the cluster. It can process the input data directly from a compressed file located either on the local file system or at a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks while these tasks are still being executed on the cluster. Empirical evaluation shows that this streaming distributed approach is approximately 2x faster than the non-streaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at https://github.com/HamidMushtaq/StreamBWA.
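The streaming idea in the abstract can be illustrated with a minimal single-machine sketch. This is not StreamBWA's implementation, which runs on Apache Spark; it only shows the general pattern of reading a compressed FASTQ stream in chunks, handing each chunk to a worker as soon as it is read, and combining finished results in input order while later chunks are still being read or processed. The names fastq_chunks, align_chunk and stream_align are hypothetical, and align_chunk is a placeholder that does not call BWA mem.

```python
import gzip
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import BinaryIO, Iterator, List


def fastq_chunks(stream: BinaryIO, reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of FASTQ lines (4 lines per read) from a gzip-compressed
    binary stream, decompressing incrementally instead of loading it all."""
    with gzip.open(stream, mode="rt") as text:
        while True:
            lines = list(islice(text, 4 * reads_per_chunk))
            if not lines:
                return
            yield lines


def align_chunk(lines: List[str]) -> List[str]:
    """Hypothetical stand-in for one alignment task (NOT a real BWA mem call).
    Returns one placeholder record per read, keyed by the read name."""
    return [lines[i][1:].strip() + "\taligned" for i in range(0, len(lines), 4)]


def stream_align(stream: BinaryIO, reads_per_chunk: int = 2,
                 workers: int = 2) -> Iterator[str]:
    """Submit chunks as they are read and emit results in input order,
    starting before the whole input has been consumed."""
    pending = deque()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in fastq_chunks(stream, reads_per_chunk):
            pending.append(pool.submit(align_chunk, chunk))
            # Combine whatever has already finished at the head of the queue.
            while pending and pending[0].done():
                yield from pending.popleft().result()
        # Input exhausted: wait for the remaining tasks, still in order.
        while pending:
            yield from pending.popleft().result()


if __name__ == "__main__":
    # Build a tiny gzip-compressed FASTQ in memory for demonstration.
    raw = b"".join(b"@r%d\nACGT\n+\nIIII\n" % i for i in range(1, 6))
    for record in stream_align(io.BytesIO(gzip.compress(raw))):
        print(record)
```

Because fastq_chunks accepts any binary file object, the same function could in principle be fed a local file opened in binary mode or an HTTP response object; that is an assumption of this sketch rather than a description of how StreamBWA reads from a URL.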