{"title":"串行和并行的i/o高效后缀树构建方法","authors":"A. Ghoting, K. Makarychev","doi":"10.1145/1559845.1559931","DOIUrl":null,"url":null,"abstract":"Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm. To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.","PeriodicalId":344093,"journal":{"name":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"35","resultStr":"{\"title\":\"Serial and parallel methods for i/o efficient suffix tree construction\",\"authors\":\"A. Ghoting, K. Makarychev\",\"doi\":\"10.1145/1559845.1559931\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm. To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.\",\"PeriodicalId\":344093,\"journal\":{\"name\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"volume\":\"27 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"35\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1559845.1559931\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2009 ACM SIGMOD International Conference on Management of data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1559845.1559931","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 35
摘要
在过去的三十年中,后缀树一直是字符串处理中的基本数据结构。然而,由于后缀树的构建不能很好地随输入字符串的大小进行扩展,因此它的广泛适用性受到了阻碍。随着数据收集和存储技术的进步,大字符串已经变得无处不在,特别是在涉及文本、时间序列和生物序列数据的新兴应用程序中。为了从这些进步中受益,我们必须实现一个可扩展的后缀树构建算法。为了应对上述挑战,过去几年出现了几种基于磁盘的后缀树构建算法。然而,构建时间仍然令人望而生畏——例如,在具有2g物理内存的系统上索引整个人类基因组仍然需要30多个小时。在本文中,首先,我们通过经验证明并论证了所有现有的后缀树构建算法都有一个严重的限制——为了收集合理的磁盘I/O效率,被索引的输入字符串必须适合主内存。这种限制归因于现有后缀树构建算法的局部性差,并且抑制了顺序和并行可伸缩性。为了解决这个限制,其次,我们展示了通过仔细的算法设计,可以重新构建最简单的后缀树构建算法之一,以平铺方式构建后缀树,从而允许实现在索引任何大小的字符串时保持恒定的工作集大小和固定的内存占用。第三,我们展示了改进的引用局部性与有效的集体通信如何促进大规模并行系统(如IBM Blue Gene/L)的高效并行化。最后,我们的经验表明,当索引大字符串时,所提出的方法提供了几个数量级的改进。此外,我们证明了所提出的并行化是可扩展的,并且允许在15分钟内在1024处理器系统上索引整个人类基因组。
Serial and parallel methods for i/o efficient suffix tree construction
Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we realize a scalable suffix tree construction algorithm. To deal with the aforementioned challenge, the past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting -- for e.g., indexing the entire Human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this paper, first, we empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation -- to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality properties of existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, second, we show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be re-architected to build a suffix tree in a tiled fashion, allowing the implementation to maintain a constant working set size and fixed memory footprint when indexing strings of any size. Third, we show how improved locality of reference coupled with effective collective communication facilitates an efficient parallelization on massively parallel systems like the IBM Blue Gene/L. Finally, we empirically show that the proposed approach affords improvements of several orders of magnitude when indexing large strings. Furthermore, we demonstrate that the proposed parallelization is scalable and allows one to index the entire Human genome on a 1024 processor system in under 15 minutes.