{"title":"索引基因组序列上的IBM蓝色基因","authors":"A. Ghoting, K. Makarychev","doi":"10.1145/1654059.1654122","DOIUrl":null,"url":null,"abstract":"With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire Human genome can be indexed on 1024 processors in under 15 minutes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":"{\"title\":\"Indexing genomic sequences on the IBM Blue Gene\",\"authors\":\"A. Ghoting, K. Makarychev\",\"doi\":\"10.1145/1654059.1654122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire Human genome can be indexed on 1024 processors in under 15 minutes.\",\"PeriodicalId\":371415,\"journal\":{\"name\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"23\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1654059.1654122\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1654059.1654122","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23
摘要
随着测序技术的进步和积极的测序工作,DNA序列数据集一直在快速增长。为了从这些进步中获益,重要的是为生命科学研究人员提供处理和查询大型序列数据集的能力。在过去的三十年中,后缀树一直是处理顺序数据集的基本数据结构。然而,在大型数据集上构建树的时间太长了。虽然并行后缀树结构是减少执行时间的明显解决方案,但引用的局部性差限制了并行性能。在本文中,我们展示了通过仔细的并行算法设计,可以消除这种限制,允许树形结构扩展到像IBM Blue Gene这样的大规模并行系统。我们证明,整个人类基因组可以在1024个处理器上在15分钟内被索引。
With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire Human genome can be indexed on 1024 processors in under 15 minutes.