An efficient overlap graph coarsening approach for modeling short reads

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops. Pub Date: 2012-10-04. DOI: 10.1109/BIBMW.2012.6470223

Julia D. Warnke-Sommer, H. Ali

Pages: 704-711

Citations: 5

Abstract

Next-generation sequencing has quickly emerged as the most exciting yet challenging computational problem in bioinformatics. Current sequencing technologies can produce several hundred thousand to several million short sequence reads in a single run. However, current methods for managing, storing, and processing these reads remain, for the most part, simple and lack the sophistication needed to model the reads efficiently and assemble them correctly. The reads are produced at high coverage of the original target sequence, so many of them overlap. These overlap relationships are used to align and merge reads into contiguous sequences called contigs. In this paper, we present an overlap graph coarsening scheme for modeling reads and their overlap relationships. Our approach differs from previous read analysis and assembly methods, which use a single graph to model read overlap relationships. Instead, we use a series of graphs at different granularities of information to represent the complex read overlap relationships. We present a new graph coarsening algorithm for clustering a simulated metagenomics dataset at various levels of granularity. We also use the proposed graph coarsening scheme, together with graph traversal algorithms, to find a labeling of the overlap graph that allows for the efficient organization of nodes within the graph data structure.
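
The abstract does not spell out the coarsening algorithm itself, so the following is only a minimal sketch of the general idea, under stated assumptions: reads are plain strings, an edge weight is the longest exact suffix-prefix overlap between two reads, and each coarsening level merges node pairs by greedy heavy-edge matching. Heavy-edge matching is a standard multilevel graph technique swapped in here for illustration and is not claimed to be the authors' method. All function names and the `min_len` parameter are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Undirected weighted overlap graph (an illustrative assumption).

    edges[(i, j)] with i < j holds the best overlap length between reads
    i and j in either orientation of the pair.
    """
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        k = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if k:
            key = (min(i, j), max(i, j))
            edges[key] = max(edges.get(key, 0), k)
    return edges


def coarsen(num_nodes, edges):
    """One coarsening level using greedy heavy-edge matching.

    Returns (mapping, num_coarse_nodes, coarse_edges), where `mapping`
    sends each fine node to the coarse node that absorbed it.
    """
    mapping = {}
    next_id = 0
    # Visit edges from heaviest to lightest; ties broken by node ids.
    for (u, v), _w in sorted(edges.items(), key=lambda e: (-e[1], e[0])):
        if u not in mapping and v not in mapping:
            mapping[u] = mapping[v] = next_id
            next_id += 1
    # Nodes left unmatched become singleton coarse nodes.
    for u in range(num_nodes):
        if u not in mapping:
            mapping[u] = next_id
            next_id += 1
    # Project the remaining edges onto the coarse nodes, summing weights.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return mapping, next_id, coarse_edges


def coarsening_hierarchy(num_nodes, edges):
    """Series of graphs at decreasing granularity, finest level first."""
    levels = [(num_nodes, edges)]
    while edges:
        _, num_nodes, edges = coarsen(num_nodes, edges)
        levels.append((num_nodes, edges))
    return levels
```

Calling `build_overlap_graph` on a handful of toy reads and passing the result to `coarsening_hierarchy` yields a list of progressively smaller graphs, which is one simple way to picture the "series of graphs with different granularities" described above. A real pipeline would need inexact overlaps, reverse complements, and an overlap detection step far faster than this all-pairs loop.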

