Full-Text (Substring) Indexes in External Memory

Synthesis Lectures on Data Management. Pub Date: 2011-12-20. DOI: 10.2200/S00396ED1V01Y201111DTM022

Marina Barsky, U. Stege, Alex Thomo


Citations: 3

Abstract




Textual databases are among the most rapidly growing collections of data. Some of these collections contain a new type of data that differs from classical numerical or textual data: long sequences of symbols that are not divided into well-separated small tokens (words). The most prominent such collections are databases of biological sequences, which are growing today at an unprecedented rate. The "1000 Genomes Project", launched in 2008, has the ultimate goal of collecting the sequences of an additional 1,500 human genomes, 500 each of European, African, and East Asian origin. This will produce an extensive catalog of human genetic variation. The raw sequences alone in this catalog would occupy about 5 terabytes.

Querying strings without well-separated tokens poses a different set of challenges, typically addressed by building full-text indexes, which provide effective structures for indexing all the substrings of the given strings. Since full-text indexes occupy more space than the raw data, it is often necessary to use disk space for their construction. Until recently, however, constructing full-text indexes in secondary storage was considered impractical because of excessive I/O costs. Algorithms developed in the last decade have nevertheless demonstrated that efficient external construction of full-text indexes is indeed possible.

This book is about large-scale construction and usage of full-text indexes. We focus mainly on suffix trees, and show efficient algorithms that can convert suffix trees to other kinds of full-text indexes and vice versa. The book has four parts, which mix string searching theory with the reality of external memory constraints. The first part introduces general concepts of full-text indexes and shows the relationships between them. The second part presents the first series of external-memory construction algorithms, which can handle the construction of full-text indexes for moderately large strings on the order of a few gigabytes. The third part presents algorithms that scale to very large strings. The final part examines queries that can be facilitated by disk-resident full-text indexes.

Table of Contents: Structures for Indexing Substrings / External Construction of Suffix Trees / Scaling Up: When the Input Exceeds the Main Memory / Queries for Disk-based Indexes / Conclusions and Open Problems
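
As a rough illustration of what "indexing all the substrings" of a string means, the sketch below builds a suffix array, a full-text index closely related to the suffix tree, and uses it to locate every occurrence of a pattern. This is not taken from the book: it is a naive in-memory construction that sorts suffixes by direct comparison, suitable only for toy inputs and not one of the external-memory algorithms the abstract refers to. The function names `build_suffix_array` and `find_occurrences` are hypothetical.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive construction (assumption for illustration): each comparison may scan
    whole suffixes, so this only suits small in-memory strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of `pattern` in `text`.

    Every occurrence of a substring is a prefix of some suffix, and suffixes
    sharing a prefix are adjacent in the suffix array, so two binary searches
    delimit the matching range.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

A query such as `find_occurrences(text, build_suffix_array(text), "TTA")` needs no word boundaries in `text`, which is the property that makes this family of indexes a fit for token-free data such as DNA sequences. The index stores one integer per character of the input, which gives a sense of why such structures outgrow the raw data.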

