Compression of nucleotide databases for fast searching.

Computer Applications in the Biosciences: CABIOS. Pub Date: 1997-10-01. DOI: 10.1093/bioinformatics/13.5.549

H Williams, J Zobel

Citations: 21

Abstract

Motivation: International sequencing efforts are creating huge nucleotide databases, which are used in searching applications to locate sequences homologous to a query sequence. In such applications, it is desirable that databases are stored compactly, that sequences can be accessed independently of the order in which they were stored, and that data can be rapidly retrieved from secondary storage, since disk costs are often the bottleneck in searching.

Results: We present a purpose-built direct coding scheme for fast retrieval and compression of genomic nucleotide data. The scheme is lossless, readily integrated with sequence search tools, and does not require a model. Direct coding gives good compression and allows faster retrieval than with either uncompressed data or data compressed by other methods, thus yielding significant improvements in search times for high-speed homology search tools.
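The abstract does not spell out the coding itself. As a rough illustration of what a model-free, lossless direct code for nucleotides can look like, the sketch below assumes 2 bits per base for A, C, G and T, with any other symbol (for example the wildcard N) recorded in a separate exception list. The names `encode` and `decode`, the bit order, and the exception-list layout are hypothetical choices made for this example and are not taken from the paper.

```python
# Illustrative sketch only: 2-bit packing with a side list for non-ACGT symbols.
# Function names and data layout are assumptions, not the paper's on-disk format.

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Pack a nucleotide string into bytes, four bases per byte.

    Returns (length, packed_bytes, exceptions), where exceptions is a list of
    (position, symbol) pairs for characters outside A, C, G, T.
    """
    packed = bytearray((len(seq) + 3) // 4)
    exceptions = []
    for i, ch in enumerate(seq):
        code = CODES.get(ch)
        if code is None:
            # Remember the real symbol; store placeholder bits in the packed stream.
            exceptions.append((i, ch))
            code = 0
        packed[i // 4] |= code << (2 * (i % 4))
    return len(seq), bytes(packed), exceptions


def decode(length, packed, exceptions):
    """Invert encode(): unpack 2-bit codes, then restore the exception symbols."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)
```

In this sketch each sequence is encoded on its own, so records can be read back in any order, and decoding is a shift and a mask per base with no statistical model to rebuild. That is one plausible reading of why such a scheme could be retrieved faster than uncompressed text: roughly a quarter of the bytes have to come off disk.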
