Investigating Genome Analysis Pipeline Performance on GATK with Cloud Object Storage

Tatsuhiro Chiba, Takeshi Yoshimura
Published in: 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2020-11-17
DOI: 10.1109/MASCOTS50786.2020.9285945 (https://doi.org/10.1109/MASCOTS50786.2020.9285945)

Abstract

Fast, scalable, and cost-effective genome analytics is essential to opening new frontiers in biomedical and life sciences. The Genome Analysis Toolkit (GATK), an industry-standard genome analysis tool, improves its scalability and performance by leveraging Spark and HDFS. Spark with HDFS has been a leading analytics platform for the past few years; however, it cannot take full advantage of the elasticity of modern clouds. In this paper, we investigate the performance characteristics of GATK using Spark with HDFS and identify scalability issues. Based on a quantitative analysis, we introduce a new approach that uses Cloud Object Storage (COS) in GATK instead of HDFS, which helps decouple compute and storage. We demonstrate how this approach improves the performance of the entire pipeline and reduces cost. GATK with IBM COS runs up to 28% faster than GATK with HDFS. The approach also achieves up to 67% cost savings in total, accounting for the time spent on both data loading and the whole pipeline analysis.