Localized suffix array and its application to genome mapping problems for paired-end short reads.

Genome informatics. International Conference on Genome Informatics. Pub Date: 2009-10-01

Kouichi Kimura, Asako Koike

Volume 23, issue 1, pages 60-71.


Abstract




We introduce a new data structure, the localized suffix array, in which occurrence information for text search is dynamically represented as a combination of global positional information and local lexicographic order information. When searching for a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in terms of local lexicographic orders, as in the conventional suffix array, and they can be examined simultaneously for violation of the distance constraint at the coarse-grained resolution. The trade-off between positional and lexicographic information is progressively shifted towards finer positional resolution, and the distance constraint is re-examined accordingly. Thus the paired search can be performed efficiently even if each word has a large number of occurrences. The localized suffix array itself is in fact a reordering of bits inside the conventional suffix array, so their memory requirements are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, speedups of more than 10 times over the naïve method were observed in almost half of the cases where the sum of the redundancies (numbers of individual occurrences) of the paired reads was greater than 2,000.
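The contrast between the naïve paired search and a coarse-grained check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' localized suffix array: it uses a conventional suffix array, and a single level of fixed-size position blocks stands in for the progressive refinement described in the abstract. All function names and the parameters `max_dist` and `block_bits` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Toy construction by sorting suffixes; adequate only for short texts."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """All start positions of `word`, found by binary search on the suffix array."""
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]  # positions in lexicographic, not positional, order


def naive_pairs(pos_a, pos_b, max_dist):
    """Naive baseline: sort all coordinates, then compare them exactly."""
    a, b = sorted(pos_a), sorted(pos_b)
    pairs = []
    for p in a:
        lo = bisect_left(b, p - max_dist)
        hi = bisect_right(b, p + max_dist)
        pairs.extend((p, q) for q in b[lo:hi])
    return pairs


def group_by_block(positions, block_bits):
    """Group positions by their coarse-grained coordinate (the high bits)."""
    blocks = {}
    for p in positions:
        blocks.setdefault(p >> block_bits, []).append(p)
    return blocks


def coarse_then_fine_pairs(pos_a, pos_b, max_dist, block_bits):
    """Assumed simplification: test the distance constraint per block first,
    and compare exact coordinates only for block pairs that survive."""
    blocks_a = group_by_block(pos_a, block_bits)
    blocks_b = group_by_block(pos_b, block_bits)
    # Blocks further apart than this cannot contain a pair within max_dist.
    reach = (max_dist >> block_bits) + 1
    pairs = []
    for ba, members_a in blocks_a.items():
        for bb in range(ba - reach, ba + reach + 1):
            members_b = blocks_b.get(bb)
            if members_b is None:
                continue  # no candidates of the second word in this block
            pairs.extend(
                (p, q)
                for p in members_a
                for q in members_b
                if abs(p - q) <= max_dist
            )
    return sorted(pairs)


text = "ACGTACGTAC"
sa = build_suffix_array(text)
first = occurrences(text, sa, "ACG")
second = occurrences(text, sa, "GT")
print(naive_pairs(first, second, max_dist=2))
print(coarse_then_fine_pairs(first, second, max_dist=2, block_bits=1))
```

Both functions return the same pairs. The difference is that the block-based version never touches the exact coordinates of candidates whose blocks are already too far apart, which is the effect the abstract attributes to checking the constraint at coarse resolution.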
