A Study on Optimizing MarkDuplicate in Genome Sequencing Pipeline

Qi Zhao
{"title":"基因组测序流水线中MarkDuplicate优化研究","authors":"Qi Zhao","doi":"10.1145/3309129.3309134","DOIUrl":null,"url":null,"abstract":"MarkDuplicate is typically one of the most time-consuming operations in the whole genome sequencing pipeline. Picard tool, which is widely used by biologists to sort reads in genome data and mark duplicate reads in sorted genome data, has relatively low performance on MarkDuplicate due to its single-thread sequential Java implementation, which has caused serious impact on nowadays bioinformatic researches. To accelerate MarkDuplicate in Picard, we present our two-stage optimization solution as a preliminary study on next generation bioinformatic software tools to better serve bioinformatic researches. In the first stage, we improve the original algorithm of tracking optical duplicate reads by eliminating large redundant operations. As a consequence, we achieve up to 50X speedup for the second step only and 9.57X overall process speedup. At the next stage, we redesign the I/O processing mechanism of MarkDuplicate as transforming between on-disk genome file and in-memory genome data by using ADAM format instead of previous SAM format, and implement cloud-scale MarkDuplicate application by Scala. Our evaluation is performed on top of Spark cluster with 25 worker nodes and Hadoop distributed file system. According to the evaluation results, our cloudscale MarkDuplicate can provide not only the same output but also better performance compared with the original Picard tool and other existing similar tools. Specifically, among the 13 sets of real whole genome data we used for evaluation at both stages, the best improvement we gain is reducing runtime by 92 hours in total. Average improvement reaches 48.69 decreasing hours.","PeriodicalId":326530,"journal":{"name":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Study on Optimizing MarkDuplicate in Genome Sequencing Pipeline\",\"authors\":\"Qi Zhao\",\"doi\":\"10.1145/3309129.3309134\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"MarkDuplicate is typically one of the most time-consuming operations in the whole genome sequencing pipeline. Picard tool, which is widely used by biologists to sort reads in genome data and mark duplicate reads in sorted genome data, has relatively low performance on MarkDuplicate due to its single-thread sequential Java implementation, which has caused serious impact on nowadays bioinformatic researches. To accelerate MarkDuplicate in Picard, we present our two-stage optimization solution as a preliminary study on next generation bioinformatic software tools to better serve bioinformatic researches. In the first stage, we improve the original algorithm of tracking optical duplicate reads by eliminating large redundant operations. As a consequence, we achieve up to 50X speedup for the second step only and 9.57X overall process speedup. At the next stage, we redesign the I/O processing mechanism of MarkDuplicate as transforming between on-disk genome file and in-memory genome data by using ADAM format instead of previous SAM format, and implement cloud-scale MarkDuplicate application by Scala. Our evaluation is performed on top of Spark cluster with 25 worker nodes and Hadoop distributed file system. 
According to the evaluation results, our cloudscale MarkDuplicate can provide not only the same output but also better performance compared with the original Picard tool and other existing similar tools. Specifically, among the 13 sets of real whole genome data we used for evaluation at both stages, the best improvement we gain is reducing runtime by 92 hours in total. Average improvement reaches 48.69 decreasing hours.\",\"PeriodicalId\":326530,\"journal\":{\"name\":\"Proceedings of the 5th International Conference on Bioinformatics Research and Applications\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-12-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 5th International Conference on Bioinformatics Research and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3309129.3309134\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Conference on Bioinformatics Research and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3309129.3309134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

MarkDuplicate is typically one of the most time-consuming operations in the whole-genome sequencing pipeline. The Picard tool, widely used by biologists to sort reads in genome data and mark duplicate reads in the sorted output, performs relatively poorly on MarkDuplicate because of its single-threaded, sequential Java implementation, which has become a serious bottleneck for current bioinformatics research. To accelerate MarkDuplicate in Picard, we present a two-stage optimization as a preliminary study toward next-generation bioinformatics software tools that better serve bioinformatics research. In the first stage, we improve the original algorithm for tracking optical duplicate reads by eliminating large amounts of redundant work, achieving up to a 50X speedup for that step alone and a 9.57X speedup for the overall process. In the second stage, we redesign the I/O mechanism of MarkDuplicate, converting between the on-disk genome file and in-memory genome data with the ADAM format instead of the previous SAM format, and implement a cloud-scale MarkDuplicate application in Scala. Our evaluation runs on a Spark cluster with 25 worker nodes and the Hadoop distributed file system. The results show that our cloud-scale MarkDuplicate produces the same output as the original Picard tool while outperforming it and other existing similar tools. Across the 13 sets of real whole-genome data used to evaluate both stages, the best case reduces total runtime by 92 hours, and the average reduction is 48.69 hours.
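The second stage describes a cloud-scale MarkDuplicate written in Scala on Spark over ADAM-formatted data. The sketch below is only a minimal illustration of the underlying duplicate-marking idea (reads sharing the same contig, 5' position, and strand orientation are grouped, and all but one representative are flagged); the Read case class, its fields, and the quality-based tie-breaking are hypothetical simplifications, not the paper's actual implementation or the real ADAM schema.

```scala
// Illustrative sketch only: a simplified duplicate-marking pass on Spark.
// The Read record and its scoring are hypothetical stand-ins; a real
// pipeline would load ADAM/Parquet alignment records from HDFS.
import org.apache.spark.sql.SparkSession

// Minimal stand-in for an aligned read record.
case class Read(name: String, contig: String, fivePrimePos: Long,
                negativeStrand: Boolean, baseQualSum: Int,
                duplicate: Boolean = false)

object MarkDuplicatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mark-duplicates-sketch")
      .master("local[*]") // replace with a cluster master in practice
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy input data for demonstration purposes.
    val reads = sc.parallelize(Seq(
      Read("r1", "chr1", 1000L, negativeStrand = false, baseQualSum = 90),
      Read("r2", "chr1", 1000L, negativeStrand = false, baseQualSum = 75),
      Read("r3", "chr1", 2048L, negativeStrand = true,  baseQualSum = 80)
    ))

    // Reads sharing contig, 5' position, and strand are duplicate candidates.
    val marked = reads
      .groupBy(r => (r.contig, r.fivePrimePos, r.negativeStrand))
      .flatMap { case (_, group) =>
        // Keep the highest-quality read as the representative;
        // mark the rest of the group as duplicates.
        val best = group.maxBy(_.baseQualSum)
        group.map(r => if (r eq best) r else r.copy(duplicate = true))
      }

    marked.collect().foreach(println)
    spark.stop()
  }
}
```

Because grouping is keyed by alignment position rather than by file order, this style of duplicate marking parallelizes naturally across Spark partitions, which is the property the cloud-scale design relies on.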