Analysis of 16S Genomic Data using Graphical Databases

Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. Pub Date: 2017-08-20. DOI: 10.1145/3107411.3108208

O. Ahern, Rebecca J. Stevick, Li Yuan, Noah M. Daniels

Abstract

Since the Human Genome Project was completed in 2003, many data scientists have developed algorithms to store and query high volumes of genomic data. The most common data storage techniques employed in these algorithms are flat files and relational databases. While sophisticated indexing techniques can accelerate queries, an alternative is to store biological sequence data directly in a way that supports efficient queries. Here we introduce a new algorithm that aims to compress redundant information and improve query speed with the help of graphical databases, which have been commercially available since the mid-to-late 2000s. A graphical database stores information using nodes and relationships (edges). Our approach is to identify subsequences that are common among many sequences and to store these as "common nodes" in the graphical database. For sequencing data, this is accomplished as follows: split the whole sequence into k-mers; if a given k-mer is common to enough sequences, it is labeled as a common segment; if a k-mer is unique (or common to too few sequences), it is labeled as a single segment. Thus, common nodes and single nodes are formed from common segments and single segments, respectively. These two kinds of nodes are connected by edges in the graphical database, allowing each original sequence to be reconstructed by following edges in the graph. This graphical database model allows for fast taxonomic queries of 16S rDNA. When queried, the database can first attempt to find common nodes that match the query sequence, and subsequently follow edges to single nodes to refine the search. This approach is analogous to that of "compressive genomics", except that the compression is implicit in the graphical database storage model. Beyond simple sequence queries, this graphical database representation also supports variability analysis, which identifies highly variable vs. conserved regions of 16S sequence. Regions of low variability correspond to common nodes, while regions of high variability correspond to a variety of paths through single nodes. The figure illustrates common and single nodes, and a corresponding plot of variability. Benchmarking of sequence search indicates that queries in graphical databases are significantly faster than in flat files or relational databases. Implementation of graphical databases in genomic data analysis will allow for accelerated search, and may lend itself to other forms of efficient analysis, such as tetramer frequency analysis, which is useful in metagenomic binning.
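The common-node and single-node scheme described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation. It uses plain Python dictionaries in place of a graph database, and it rests on several assumptions the abstract does not state: k-mers are taken as consecutive, non-overlapping chunks; the sequences share a common reading frame; and a hypothetical `min_support` threshold decides when a k-mer is "common to enough sequences". All function names and the toy sequences are illustrative.

```python
from collections import defaultdict


def split_kmers(seq, k):
    """Cut seq into consecutive, non-overlapping chunks of length k (assumption)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def build_graph(seqs, k, min_support):
    """Build (nodes, paths) from a dict {seq_id: sequence}.

    nodes maps node id -> k-mer text. paths maps seq_id -> ordered node ids;
    each consecutive pair of nodes in a path stands for one edge.
    """
    support = defaultdict(set)  # k-mer -> ids of the sequences that contain it
    for sid, seq in seqs.items():
        for kmer in split_kmers(seq, k):
            support[kmer].add(sid)

    nodes, paths = {}, {}
    for sid, seq in seqs.items():
        path = []
        for pos, kmer in enumerate(split_kmers(seq, k)):
            if len(support[kmer]) >= min_support:
                node = ("common", kmer)      # one node shared by every carrier
            else:
                node = ("single", sid, pos)  # node private to this sequence
            nodes[node] = kmer
            path.append(node)
        paths[sid] = path
    return nodes, paths


def reconstruct(nodes, paths, sid):
    """Recover an original sequence by walking its path through the graph."""
    return "".join(nodes[n] for n in paths[sid])


def query(nodes, paths, q, k):
    """Rank stored sequences: common-node matches first, single nodes refine."""
    q_kmers = set(split_kmers(q, k))
    ranked = []
    for sid, path in paths.items():
        common = sum(1 for n in path if n[0] == "common" and nodes[n] in q_kmers)
        single = sum(1 for n in path if n[0] == "single" and nodes[n] in q_kmers)
        if common:  # only sequences reachable through a matching common node
            ranked.append((-common, -single, sid))
    return [sid for _, _, sid in sorted(ranked)]


def variability(nodes, paths):
    """Number of distinct k-mers at each chunk position (assumes a shared frame)."""
    seen = defaultdict(set)
    for path in paths.values():
        for pos, node in enumerate(path):
            seen[pos].add(nodes[node])
    return [len(seen[pos]) for pos in sorted(seen)]


# Toy data, not real 16S sequences.
seqs = {
    "a": "ACGTACGTTTGA",
    "b": "ACGTACGTCCGA",
    "c": "ACGTGGGGTTGA",
}
nodes, paths = build_graph(seqs, k=4, min_support=2)
```

In this toy example the nine chunks across the three sequences collapse into four stored nodes, because `ACGT` and `TTGA` are each shared by at least two sequences. Calling `variability(nodes, paths)` gives a per-position count in which low values mark conserved positions and higher values mark positions where paths diverge.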
