Accelerating gene context analysis using bitmaps

Proceedings of the 25th International Conference on Scientific and Statistical Database Management. Pub Date: 2013-07-29. DOI: 10.1145/2484838.2484856

A. Romosan, A. Shoshani, Kesheng Wu, V. Markowitz, K. Mavrommatis


Citations: 14

Abstract

Gene context analysis determines the function of genes by examining the conservation of chromosomal gene clusters and co-occurrence functional profiles across genomes. This is based on the observation that functionally related genes are often collocated on chromosomes as part of so-called "gene cassettes", and relies on the identification of such cassettes across a statistically significant and phylogenetically diverse collection of genomes. Gene context analysis is an important part of a genomic data management system such as the Integrated Microbial Genomes (IMG) system, which has one of the largest public genome collections. As of January 2013, IMG contains 3.3 million gene cassettes across 8,000 genomes. A gene context analysis in IMG performs many millions of comparisons among the cassettes and their functions. Using a traditional relational database management system, these cassettes and their functional characteristics are represented by a correlation table of more than 2 billion rows along with a dozen auxiliary tables. This correlation table requires 16.5 hours to build and a typical query requires 5 to 10 minutes to answer. We developed an alternative approach that encodes the cassettes and their functions using bitmaps. Reading the input data now takes about 1.5 hours and constructing the bitmap representations takes only 8 minutes. Together, this amounts to less than one tenth of the time needed to build the correlation table. Furthermore, fairly complex queries can now be answered in seconds. In this work, we considered three basic forms of queries required to support gene context analysis and devised two different bitmap representations to answer such queries. These queries can be answered in less than a second. A more complex query, which we refer to as a "killer query", requires the examination of multi-way cross-products of all cassettes. We developed a progressive pruning strategy that effectively reduces the number of possible combinations examined.
Tests have shown that we can now answer "killer queries" in seconds. Even with an extremely complex "killer query" involving 161 genomes (needing a 161-way cross-product), our algorithm took less than 10 seconds. A query involving this many genomes is expected to take so much time using a traditional DBMS that it has never been attempted before. Working with the IMG developers, we have verified our implementation and have integrated it into the production version of IMG.
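
The abstract does not spell out the bitmap layouts or the exact query definitions, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes two hypothetical encodings: one bitmap per function with a bit for each cassette (for basic lookups), and one bitmap per cassette with a bit for each function (for a multi-genome query). The "killer query" is interpreted here, as an assumption, as finding one cassette per genome such that the chosen cassettes share at least `min_shared` functions. All function names, the COG-style labels, and the threshold parameter are made up for illustration.

```python
from functools import reduce


def build_function_bitmaps(cassettes):
    """Hypothetical encoding 1: function -> int bitmap over cassette indices."""
    bitmaps = {}
    for i, functions in enumerate(cassettes):
        for f in functions:
            bitmaps[f] = bitmaps.get(f, 0) | (1 << i)
    return bitmaps


def cassettes_with_all(bitmaps, functions):
    """Indices of cassettes that contain every listed function (bitwise AND)."""
    if not functions:
        return []
    hits = reduce(lambda a, b: a & b, (bitmaps.get(f, 0) for f in functions))
    return [i for i in range(hits.bit_length()) if (hits >> i) & 1]


def encode_cassette(functions, function_index):
    """Hypothetical encoding 2: cassette -> int bitmap over function indices."""
    bits = 0
    for f in functions:
        bits |= 1 << function_index[f]
    return bits


def shared_function_combos(genomes, min_shared):
    """Pick one cassette per genome so that all picks share >= min_shared functions.

    genomes: list (one entry per genome) of lists of cassette bitmaps.
    Partial combinations are extended one genome at a time, and any partial
    combination whose running intersection is already too small is dropped,
    so the full cross-product is never materialized.
    """
    if not genomes:
        return []
    partial = [((), -1)]  # -1 has all bits set, so the first AND is a no-op
    for cassette_bitmaps in genomes:
        extended = []
        for combo, shared in partial:
            for idx, bits in enumerate(cassette_bitmaps):
                common = shared & bits
                if bin(common).count("1") >= min_shared:  # prune early
                    extended.append((combo + (idx,), common))
        partial = extended
        if not partial:
            break  # no combination can survive; stop scanning genomes
    return partial


# Toy data (invented for this sketch).
function_index = {"COG1": 0, "COG2": 1, "COG3": 2, "COG4": 3}
genome_cassettes = [
    [{"COG1", "COG2", "COG3"}, {"COG4"}],
    [{"COG1", "COG2"}, {"COG3", "COG4"}],
    [{"COG1", "COG2", "COG4"}],
]
flat = [c for g in genome_cassettes for c in g]
by_function = build_function_bitmaps(flat)
both = cassettes_with_all(by_function, ["COG1", "COG2"])

encoded = [[encode_cassette(c, function_index) for c in g] for g in genome_cassettes]
combos = shared_function_combos(encoded, min_shared=2)
```

In this toy run, the second cassette of the first genome is discarded immediately, so none of its extensions into later genomes are ever examined. This is the sense in which pruning shrinks a multi-way cross-product in this sketch; the paper's actual pruning criteria may differ.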
