Optimization of data-intensive next generation sequencing in high performance computing

N. Kathiresan, Rashid J. Al-Ali, P. Jithesh, Tariq AbuZaid, Ramzi Temanni, A. Ptitsyn
2015 IEEE 15th International Conference on Bioinformatics and Bioengineering (BIBE), published 2015-11-02
DOI: 10.1109/BIBE.2015.7367654 (https://doi.org/10.1109/BIBE.2015.7367654)
Citations: 4

Abstract

Advances in Next Generation Sequencing (NGS) technology are associated with an ever-increasing volume of genomic data every year. These genomic data are processed efficiently through empirical parallelism on High Performance Computing (HPC) systems. The processed data can be used for genome-wide association studies, genetics, personalized medicine and many other areas. Different kinds of algorithms and implementations are used in the different phases of genome processing. In this paper, we used BWAKIT- and GATK-based software to process large volumes of genomic data, referred to as the "NGS workflow at SIDRA". In this workflow we used BWAKIT for genome alignment, which is computationally demanding, and GATK for variant discovery, which requires a large amount of memory. We observed that CPU utilization does not exceed 45% during variant discovery; hence, it is necessary to understand the optimal selection of resources (in terms of the number of threads or cores) when automating the NGS workflow. We analyzed performance bottlenecks and application optimization in terms of "scalability" (using the maximum available CPUs and memory) and "multiple instances of the NGS workflow with different genome data within a node" (processing a larger volume of genome data concurrently with a limited set of CPUs and memory). We observed performance improvements of 40%, 65%, 71% and 76% while processing 2, 4, 8 and 16 samples concurrently, respectively, using our own scheduling heuristics. As a result, our proposed NGS workflow automation improves performance by up to 76% compared with workflows based on application scalability.
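The "multiple instances within a node" idea can be illustrated with a small sketch. It is not the authors' scheduling heuristic, which the abstract does not describe; it only shows one simple way to split a node's cores and memory evenly across samples that run concurrently. The names `InstancePlan` and `plan_instances`, the even-split policy, and the node sizes in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstancePlan:
    """Resources reserved for one workflow instance (hypothetical structure)."""
    sample: str
    threads: int
    mem_gb: int


def plan_instances(samples, node_cores, node_mem_gb):
    """Split one node's cores and memory evenly across concurrent samples.

    Assumed policy: every sample gets the same share; leftover cores go to
    the first samples in the list. This is an illustration, not the paper's
    heuristic.
    """
    if not samples:
        return []
    n = len(samples)
    if n > node_cores:
        raise ValueError("more concurrent samples than cores on the node")
    base, extra = divmod(node_cores, n)
    mem_each = node_mem_gb // n
    plans = []
    for i, sample in enumerate(samples):
        # The first `extra` samples absorb the cores left over by the division.
        threads = base + (1 if i < extra else 0)
        plans.append(InstancePlan(sample, threads, mem_each))
    return plans


if __name__ == "__main__":
    # Hypothetical node: 32 cores, 256 GB of memory, four samples at once.
    for plan in plan_instances(["S1", "S2", "S3", "S4"], node_cores=32, node_mem_gb=256):
        print(plan)
```

Each plan's `threads` and `mem_gb` would then be passed to the aligner and variant caller for that sample, so that a stage which cannot keep all cores busy shares the node with other samples instead of leaving cores idle.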